Interent Archive content in “discovery” systems

This quick posting describes how Internet Archive content, specifically, content from the Open Content Alliance can be quickly and easily incorporated into local library “discovery” systems. VuFind is used here as the particular example:

  1. Get keys – The first step is to get a set of keys describing the content you desire. This can be acquired through the Internet Archive’s advanced search interface.
  2. Convert keys – The next step is to convert the keys into sets of URLs pointing to the content you want to download. Fortunately, all the URLs have a similar shape: http://www.archive.org/download/KEY/KEY.pdf, http://www.archive.org/download/KEY/KEY_meta.mrc, or http://www.archive.org/download/KEY/KEY__djvu.txt.
  3. Download – Feed the resulting URLs to your favorite spidering/mirroring application. I use wget.
  4. Update – Enhance the downloaded MARC records with 856$u valued denoting the location of your local PDF copy as well as the original (cononical) version.
  5. Index – Add the resulting MARC records to your “discovery” system.

Linked here is a small distribution of shell and Perl scripts that do this work for me and incorporate the content into VuFind. Here is how they can be used:

  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' /
  -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart

Cool next steps would be use text mining techniques against the downloaded plain text versions of the documents to create summaries, extract named entities, and identify possible subjects. These items could then be inserted into the MARC records to enhance retrieval. Ideally the full text would be indexed, but alas, MARC does not accomodate that. “MARC must die.”

Published by

Eric Lease Morgan

Artist- and Librarian-At-Large

One thought on “Interent Archive content in “discovery” systems”

Comments are closed.