I’ve been having some fun with Internet Archive content.
The process
More specifically, I have created a tiny system for copying scanned materials locally, enhancing it with a word cloud, indexing it, and providing access to whole thing. There is how it works:
- Identify materials of interest from the Archive and copy their URLs to a text file.
- Feed the text file to a wget (wget.sh) which copies the plain text, PDF, XML metadata, and GIF cover art locally.
- Create a rudumentary word cloud (cloud.pl) against each full text version of a document in an effort to suppliment the MARC metadata.
- Index each item using the MARC metadata and full text (index.pl). Each index entry also includes the links to the word cloud, GIF image, PDF file, and MARC data.
- Provide a simple one-box, one-button interface to the index (search.pl & search.cgi). Search results appear much like the Internet Archive’s but also include the word cloud.
- Go to Step #1; rinse, shampoo, and repeat.
The demonstration
Attached are all the scripts I’ve written for the as-of-yet-unamed process, and you can try the demonstration at http://dewey.library.nd.edu/hacks/ia/search.cgi, but remember, there are only about two dozen items presently in the index.
The possibilities
There are many ways the system can be improved, and they can be divided into two types: 1) servcies against the index, and 2) services against the items. Services against the index include things like paging search results, making the interface “smarter”, adding things like faceted browse, implementing an advaced search, etc.
Services against the items interest me more. Given the full text it might be possible to do things like: compare & contrast documents, cite documents, convert documents into many formats, trace idea forward & backward, do morphology against words, add or subtract from “my” collection, search “my” collection, share, annotate, rank & review, summarize, create relationships between documents, etc. These sort of features I believe to be a future direction for the library profession. It is more than just get the document; it is also about doing things with them once they are acquired. The creation of the word clouds is a step in that direction. It assists in the compare & contrast of documents.
The Internet Archive makes many of these things possible because they freely distribute their content — including the full text.
InternetArchive++