Internet Archive content in “discovery” systems

This quick posting describes how Internet Archive content, specifically content from the Open Content Alliance, can be quickly and easily incorporated into local library “discovery” systems. VuFind is used here as the particular example:

  1. Get keys – The first step is to get a set of keys describing the content you desire. This can be acquired through the Internet Archive’s advanced search interface.
  2. Convert keys – The next step is to convert the keys into sets of URLs pointing to the content you want to download. Fortunately, all the URLs have a similar shape: http://www.archive.org/download/KEY/KEY.pdf, http://www.archive.org/download/KEY/KEY_meta.mrc, or http://www.archive.org/download/KEY/KEY_djvu.txt. (A conversion sketch follows this list.)
  3. Download – Feed the resulting URLs to your favorite spidering/mirroring application. I use wget.
  4. Update – Enhance the downloaded MARC records with 856$u values denoting the location of your local PDF copy as well as the original (canonical) version. (A sketch of this step appears after the example below.)
  5. Index – Add the resulting MARC records to your “discovery” system.
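
The conversion described in step #2 is almost trivial. Below is a minimal sketch of my own, not the script from the distribution described below; it assumes a plain text file of keys, one key per line, and prints the three URL shapes for each:

  #!/usr/bin/perl
  # keys2urls (sketch) -- read Internet Archive keys, one per line,
  # and print the URLs of the PDF, MARC, and plain text versions
  use strict;
  use warnings;

  while ( my $key = <> ) {

      chomp $key;
      next unless $key;

      print "http://www.archive.org/download/$key/$key.pdf\n";
      print "http://www.archive.org/download/$key/${key}_meta.mrc\n";
      print "http://www.archive.org/download/$key/${key}_djvu.txt\n";

  }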

Linked here is a small distribution of shell and Perl scripts that do this work for me and incorporate the content into VuFind. Here is how they can be used:

  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' \
  -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart
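
For the curious, the heart of the update step (#4, above) looks something like the following sketch. It uses the MARC::Batch and MARC::Field Perl modules; the file names, the 001-based key, and the base URLs are assumptions of mine for the sake of illustration, not necessarily what the distributed updatemarc.pl does:

  #!/usr/bin/perl
  # updatemarc (sketch) -- add 856$u fields denoting the local copy
  # and the original (canonical) copy of each downloaded document
  use strict;
  use warnings;
  use MARC::Batch;
  use MARC::Field;

  # assumed input and output files
  my $batch = MARC::Batch->new( 'USMARC', 'archive.marc' );
  open my $out, '>', 'archive-updated.marc' or die $!;

  while ( my $record = $batch->next ) {

      # assume the Internet Archive key lives in the 001 control field
      my $key = $record->field( '001' )->data;

      # first the local copy, then the canonical copy
      $record->append_fields(
          MARC::Field->new( '856', '4', '0', u => "http://example.edu/etexts/$key/$key.pdf" ),
          MARC::Field->new( '856', '4', '1', u => "http://www.archive.org/download/$key/$key.pdf" )
      );

      print $out $record->as_usmarc;

  }

  close $out;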

Cool next steps would be to use text mining techniques against the downloaded plain text versions of the documents to create summaries, extract named entities, and identify possible subjects. These items could then be inserted into the MARC records to enhance retrieval. Ideally the full text would be indexed, but alas, MARC does not accommodate that. “MARC must die.”
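
As a toy illustration of the “identify possible subjects” idea, the following sketch (again, an assumption of mine, not code from the distribution) counts word frequencies in a downloaded plain text file and appends the most common long words to the corresponding MARC record as uncontrolled index terms (653):

  #!/usr/bin/perl
  # subjects (sketch) -- suggest possible "subjects" through naive word
  # frequencies and add them to a MARC record as 653 (uncontrolled) terms
  use strict;
  use warnings;
  use MARC::Batch;
  use MARC::Field;

  my ( $text_file, $marc_file ) = @ARGV;    # e.g. KEY_djvu.txt KEY_meta.mrc

  # count the words, ignoring short ones
  my %count;
  open my $text, '<', $text_file or die $!;
  while ( <$text> ) { $count{ lc $_ }++ for grep { length > 5 } split /\W+/ }
  close $text;

  # choose the five most frequent words as candidate subjects
  my @subjects = ( sort { $count{ $b } <=> $count{ $a } } keys %count )[ 0 .. 4 ];

  # append them to the record and print it
  my $batch  = MARC::Batch->new( 'USMARC', $marc_file );
  my $record = $batch->next;
  $record->append_fields( MARC::Field->new( '653', ' ', ' ', a => $_ ) )
      for grep { defined } @subjects;
  print $record->as_usmarc;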

VUFind at PALINET

I attended a VUFind meeting at PALINET in Philadelphia today, November 6, and this posting summarizes my experiences there.

As you may or may not know, VUFind is a “discovery layer” intended to be applied against a traditional library catalog. Originally written by Andrew Nagy of Villanova University, it has been adopted by a handful of libraries across the globe and is being investigated by quite a few more. Technically speaking, VUFind is an open source project based on Solr/Lucene. Extract MARC records from a library catalog. Feed them to Solr/Lucene. Provide access to the index as well as services against the search results.

The meeting was attended by about thirty people. The three people from Tasmania won the prize for coming the farthest, but there were also people from Stanford, Texas A&M, and a number of more regional libraries. The meeting had a barcamp-like agenda. Introduce ourselves. Brainstorm topics for discussion. Discuss. Summarize. Go to the bar afterwards. Alas, I didn’t get to go to the bar, but I was there for the balance. The following bullet points summarize each discussion topic:

  • Jangle – A desire was expressed to implement some sort of API (application programmer interface) to VUFind in order to ensure a greater degree of interoperability. The DLF-DI was mentioned quite a number of times, but Jangle was the focus of the discussion. Unfortunately, not a whole lot of people around the room knew about Jangle, the ATOM Publishing Protocol, or REST-ful computing techniques in general. Because creating an API was desired, there was some knowledge of the XC (eXtensible Catalog) project around the room, and there was curiosity/frustration as to why more collaboration could not be done with XC. Apparently the XC process and their software are not as open and transparent as I had thought. (Note to self: ping the folks at XC and bring this issue to their attention.) In the end, implementing something like Jangle was endorsed.
  • Non-MARC content – It was acknowledged that non-MARC content ought to be included in any sort of “discovery layer”. A number of people had experimented with including content from their local institutional repositories, digital libraries, and/or collections of theses & dissertations. The process is straightforward. Get a set of metadata. Map it to VUFind/Solr fields. Feed it to the indexer. Done. (A sketch of this process appears after this list.) Other types of data people expressed an interest in incorporating included: EAD, TEI, images, various types of data sets, and mathematical models. From here the discussion quickly evolved into the next topic…
  • Solrmarc – Through the use of a Java class called MARC4J, a Solr plug-in has been created by the folks at the University of Virginia. This plug-in — Solrmarc — makes it easier to read MARC data and feed it to Solr. There was a lot of discussion whether or not this plug-in should be extended to include other data types, such as the ones outlined above, or to distribute Solrmarc as-is, more akin to a GNU “do one thing and one thing well” type of tool. From my perspective, no specific direction was articulated.
  • Authority control – We all knew the advantage of incorporating authority lists (names, authors, titles) into VUFind. The general idea was to acquire authority lists. Incorporate this data into the underlying index. Implement “find more like this one” types of services against search results based on the related records linked through authorities. There was then much discussion on how to initially acquire the necessary authority data. We were a bit stymied. After lunch a slightly different tack was taken. Acquire some authority data, say about 1,000 records. Incorporate it into an implementation of VUFind. Demonstrate the functionality to wider audiences. Tackle the problem of getting more complete and updated authority data later.
  • De-duplication/FRBR – This was probably the shortest discussion point, and it really centered on FRBR. We ended up asking ourselves, “To what degree do we want to incorporate Web Services such as xISBN into VUFind to implement FRBR-like functionality, or to what degree should ‘real’ FRBRization take place?” Compared to other things, de-duplication/FRBR seemed to be taking a lower priority.
  • Serials holdings – This discussion was around indexing and/or displaying serials holdings information. There was much talk about the ways various integrated library systems allow libraries to export holdings information, whether or not it was merged with bibliographic information, and how consistent it was from system to system. In general it was agreed that this holdings information ought to be indexed to enable searches such as “Time Magazine 2004”, but displaying the results was seen as problematic. “Why not use your link resolver to address this problem?” was asked. This whole issue too was given a lower priority since serials holdings are increasingly electronic in nature.
  • Federated search – It was agreed that federated search s?cks, but it is a necessary evil. Techniques for incorporating it into VUFind ranged from: 1) side-stepping the problem by licensing bibliographic data from vendors, 2) side-stepping the problem by acquiring binary Lucene indexes of bibliographic data from vendors, 3) creating some sort of “smart” interface that looks at VUFind search results to automatically select and search federated search targets whose results are hidden behind a tab until selected by the user, or 4) allowing the user to assume some sort of predefined persona (Thomas Jefferson, Isaac Newton, Kurt Gödel, etc.) to point toward the selection of search targets. LibraryFind was mentioned as a store for federated search targets. Pazpar2 was mentioned as a tool to do the actual searching.
  • Development process – The final discussion topic regarded the on-going development process. To what degree should the whole thing be more formalized? Should VUFind be hosted by a third party? Code4Lib? PALINET? A newly created corporation? Is it a good idea to partner with similar initiatives such as OLE (Open Library Environment), XC, DLF-DI, or BlackLight? On one hand, such formalization would give the process more credibility and open more possibilities for financial support, but on the other hand the process would also become more administratively heavy. Personally, I liked the idea of allowing PALINET to host the system. It seems to be an excellent opportunity for such a library-support organization.
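
To make the “map it and feed it” idea from the non-MARC discussion above a bit more concrete, here is a minimal sketch of posting a single hand-mapped document to a Solr update handler with LWP. The Solr URL and the field names are illustrative assumptions of mine, not VUFind’s actual schema:

  #!/usr/bin/perl
  # index-nonmarc (sketch) -- map a metadata record to Solr fields by hand
  # and post it to Solr's update handler, followed by a commit
  use strict;
  use warnings;
  use LWP::UserAgent;

  # assumed location of the Solr instance behind VUFind
  my $solr = 'http://localhost:8080/solr/update';

  # a hand-mapped record; the field names are illustrative only
  my $doc = '<add><doc>
    <field name="id">etd-0001</field>
    <field name="title">A Sample Thesis</field>
    <field name="author">Doe, Jane</field>
    <field name="format">Thesis</field>
  </doc></add>';

  my $ua = LWP::UserAgent->new;
  my $response = $ua->post( $solr, 'Content-Type' => 'text/xml', Content => $doc );
  die $response->status_line unless $response->is_success;

  # make the new document visible to searchers
  $ua->post( $solr, 'Content-Type' => 'text/xml', Content => '<commit/>' );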

The day was wrapped up by garnering volunteers to see after each of the discussion points in the hopes of developing them further.

I appreciated the opportunity to attend the meeting, especially since it is quite likely I will be incorporating VUFind into a portal project called the Catholic Research Resources Alliance. I find it amusing the way many “next generation” library catalog systems — “discovery layers” — are gravitating toward indexing techniques and specifically Lucene. Currently, these systems include VUFind, XC, BlackLight, and Primo. All of them provide a means to feed data to an indexer and then provide user access to the index.

Of all the discussions, I enjoyed the one on federated search the most because it toyed with the idea of making the interfaces to our indexes smarter. While this smacks of artificial intelligence, I sincerely think this is an opportunity to incorporate library expertise into search applications.