I have been having a great deal of success extracting keywords and two-word phrases from documents and assigning them as “subject headings” to electronic texts, a form of automatic metadata generation. In many cases, though not all, the set of assigned keywords I have created is just as good as, if not better than, the controlled vocabulary terms assigned by librarians.
The Alex Catalogue is a collection of roughly 14,000 electronic texts. The vast majority come from Project Gutenberg. Some come from the Internet Archive. The smallest number come from a now-defunct etext collection at Virginia Tech. All of the documents center on the themes of American and English literature and Western philosophy.
With the exception of the non-fiction works from the Internet Archive, none of the electronic texts were associated with subject-related metadata. Apart from author names (which are yet to be “well-controlled”), it has been difficult to learn the “aboutness” of each of the documents. Such a thing is desirable for two reasons: 1) to enable the reader to evaluate the relevance of a document, and 2) to provide a browsable interface to the collection. Without some sort of tags, subject headings, or application of clustering techniques, browsability is all but impossible. My goal was to solve this problem in an automated manner.
A couple of years ago I used tools such as Lingua::EN::Summarize and Open Text Summarizer to extract keywords and summaries from the etexts and assign them as subject terms. The process worked, but not extraordinarily well. I then learned about Term Frequency/Inverse Document Frequency (TFIDF) for calculating “relevance”, and about the T-Score for calculating the probability of two words appearing side by side, that is, bi-grams or two-word phrases. By applying these techniques to the etexts of the Alex Catalogue I have been able to create and add meaningful subject “tags” to each of my documents, which in turn paves the way to browsability. Here is the algorithm I used to implement the solution:
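For the curious, the T-Score mentioned above can be sketched as follows. This is a minimal illustration of the standard collocation statistic, not my production code; the token stream and the variance approximation (observed proportion divided by sample size) are illustrative assumptions.

```python
import math
from collections import Counter

def t_score(bigram, unigrams, bigrams, n_tokens):
    """Approximate t-score for a two-word phrase.

    Compares the observed frequency of the bigram with the frequency
    expected if its two words co-occurred only by chance; the larger
    the score, the more likely the pair is a genuine collocation.
    """
    w1, w2 = bigram
    observed = bigrams[bigram] / n_tokens
    expected = (unigrams[w1] / n_tokens) * (unigrams[w2] / n_tokens)
    # The variance of the sample mean is approximated by observed / N.
    return (observed - expected) / math.sqrt(observed / n_tokens)

# Toy usage on a tiny token stream
tokens = "tom sawyer and huck went to see tom sawyer again".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
score = t_score(("tom", "sawyer"), unigrams, bigrams, len(tokens))
```

A phrase like “tom sawyer”, which recurs in the stream, scores well above an incidental pairing.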
- Collect documents – This was done through various harvesting techniques. Etexts are saved to the local file system and what metadata does exist gets saved to a database.
- Index the collection – Each of the documents is full-text indexed. Not only does this facilitate Steps #3 and #4, below, it makes the collection searchable.
- Calculate a relevancy score (TFIDF) for each word – With the exception of parsing each etext into a set of “words”, counting the number of words in a document and the frequency of each word is easy. Determining the total number of documents in the collection is trivial. Searching the index for each word and getting back the number of documents in which it appears is the work of the indexer. With these four values (the number of words in a document, the frequency of a word in that document, the total number of documents, and the number of documents in which the word appears) TFIDF can be calculated for each word.
- Calculate a relevancy score for each bi-gram – Instead of extracting words from an etext, bi-grams (two-word phrases) were extracted and TFIDF is calculated for each of them, just like Step #3.
- Save – If the score for a word or bi-gram is greater than an arbitrarily chosen lower bound, and if the word or bi-gram is not a stop word, then assign it to the etext. This step was the most time-consuming. It required many dry runs of the algorithm to determine an optimal lower bound as well as a set of stop words. The lower the bound, the greater the number of words and phrases returned, but as the number of words and phrases increases their apparent usefulness decreases; the terms become too common to distinguish one document from another. At the other end of the scale, a stop word list needed to be created to remove meaningless words and phrases. The stop word problem was complicated in Project Gutenberg texts by the “fine print” and legalese in most of the documents, and by the OCRed (optical character recognized) text from the Internet Archive. Words like “thofe”, where the “f” was really an “s”, needed to be removed.
- Go to Step #3 for each document in the collection.
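Steps #3 through #5 can be sketched in Python. This is a toy illustration, not my actual implementation: the tokenizer, the stop list, the three-document “collection”, and the 0.1 lower bound are all assumptions, and a real run would pull document frequencies from the full-text index rather than counting them in memory.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; the real list also held OCR debris such as "thofe".
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "it"}

def tokenize(text):
    """Lower-case a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf(count, doc_len, n_docs, doc_freq):
    """Step #3: term frequency times inverse document frequency."""
    return (count / doc_len) * math.log(n_docs / doc_freq)

def assign_terms(tokens, doc_freqs, n_docs, lower_bound):
    """Steps #3-#5 for one document: score unigrams and bi-grams alike,
    then keep those above the lower bound that are not stop words."""
    grams = Counter(tokens)
    grams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    kept = {}
    for gram, count in grams.items():
        if any(word in STOP_WORDS for word in gram.split()):
            continue
        score = tfidf(count, len(tokens), n_docs, doc_freqs.get(gram, 1))
        if score > lower_bound:
            kept[gram] = round(score, 3)
    return kept

# A three-document toy "collection"; a real indexer would supply doc_freqs.
docs = ["Tom and Huck went down the river.",
        "The river rose in the night.",
        "Tom Sawyer painted the fence."]
doc_freqs = Counter()
for doc in docs:
    toks = tokenize(doc)
    doc_freqs.update(set(toks) | {" ".join(p) for p in zip(toks, toks[1:])})

terms = assign_terms(tokenize(docs[0]), doc_freqs, len(docs), lower_bound=0.1)
```

Words appearing in every document (like “river” or “tom” here) score low and fall below the bound, while words and phrases unique to one document survive, which is exactly the behavior that makes TFIDF useful for subject assignment.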
Through this process I discovered a number of things.
First, with regard to fictional works, the words or phrases returned are often proper nouns, usually the names of characters from the work. An excellent example is Mark Twain’s Adventures of Huckleberry Finn, whose currently assigned terms include: huck, tom, joe, injun joe, aunt polly, tom sawyer, muff potter, and injun joe’s.
Second, with regard to works of non-fiction, the words and phrases returned are also nouns, typically the concepts referred to most often in the etext. A good example is John Stuart Mill’s Auguste Comte and Positivism, where the assigned words are: comte, phaenomena, metaphysical, science, mankind, social, scientific, philosophy, and sciences.
Third, automatically generated keywords and phrases were many times just as useful as the librarian-assigned Library of Congress Subject headings. Many of the items harvested from the Internet Archive were complete with MARC records. Some of those records included subject headings. During Step #5 (above), I spent time observing the output and comparing it to previously assigned terms. Take for example a work called Universalism in America: A History by Richard Eddy. Its assigned headings included:
- Universalism -- United States -- History
- Unitarian Universalist churches -- United States
My automatically generated terms/phrases are:
- hosea ballou
- universalist church
- first universalist
- universalist quarterly
- universalist society
- restorationist controversy
- thomas whittemore
- abner kneeland
- sermon delivered
- universalist meeting
- universalist magazine
- universal salvation
- hosea ballon
- vers alism
- edward turner
- general convention
Granted, the generated list is not perfect. For example, Hosea Ballou appears twice, and the second occurrence was probably caused by an OCR error. On the other hand, how was a person to know that Hosea Ballou was even a part of the etext if it weren’t for this process? The same goes for the other people: Thomas Whittemore, Abner Kneeland, and Edward Turner. In defense of the controlled vocabulary, the terms “church”, “sermon”, “doctrine”, and “american” could all be inferred from the (rather) hierarchical nature of LCSH, but unless a person understands the nature of LCSH such a thing is not obvious.
As a librarian I understand the power of a controlled vocabulary, but since I am not limited to three to five subject headings per entry, and because controlled vocabularies are often very specific, I have retained the LCSH in each record whenever possible. The more the merrier.
Now that the collection has richer metadata, the next steps will be to exploit it. Some of those next steps include:
- Normalize the data – Each subject is currently saved in a single database field. The subjects need to be normalized across the database to enable database joins and make it easier to generate reports.
- Create a browsable interface – Write a set of static Web pages linking keywords and phrases to etexts. This will make it easier to see at a glance the type of content in the collection.
- Re-index – Trivial. Send all the data and metadata back to the indexer ultimately improving the precision/recall ratio.
- Enhance search experience – Extract the keywords and phrases from search results and display them to the user. Make them linkable to easily “find more like this one.” Extract the same keywords and phrases and use them to implement the increasingly popular browsable facets feature.
- Enhance linked data – Generate a report against the database to create (better) RDF files complete with more meaningful (subject) tags. Link these tags to external vocabularies such as WordNet through the use of linked data, thus contributing to the Semantic Web and enabling others to benefit from my labors. (Infomotions Man says, “Give back to the ’Net.”)
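To make the normalization step concrete, here is one way it might look, sketched with Python’s built-in sqlite3 module. The schema, the semicolon delimiter, and the sample records are all assumptions for illustration; the actual database layout may differ.

```python
import sqlite3

# Sketch: move subjects out of a single delimited field and into
# their own table, so that joins and reports become possible.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE etexts (id INTEGER PRIMARY KEY, title TEXT, subjects TEXT);
    CREATE TABLE subjects (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
    CREATE TABLE etext_subject (etext_id INTEGER, subject_id INTEGER);
""")
cur.execute("INSERT INTO etexts VALUES "
            "(1, 'Adventures of Huckleberry Finn', 'huck;tom sawyer;aunt polly')")
cur.execute("INSERT INTO etexts VALUES "
            "(2, 'The Adventures of Tom Sawyer', 'tom sawyer;injun joe')")

# Split each delimited field into rows, de-duplicating terms as we go.
for etext_id, blob in cur.execute("SELECT id, subjects FROM etexts").fetchall():
    for term in blob.split(";"):
        cur.execute("INSERT OR IGNORE INTO subjects (term) VALUES (?)", (term,))
        subject_id = cur.execute("SELECT id FROM subjects WHERE term = ?",
                                 (term,)).fetchone()[0]
        cur.execute("INSERT INTO etext_subject VALUES (?, ?)",
                    (etext_id, subject_id))

# A join now answers "which etexts share the subject 'tom sawyer'?"
rows = cur.execute("""
    SELECT e.title FROM etexts e
    JOIN etext_subject es ON es.etext_id = e.id
    JOIN subjects s ON s.id = es.subject_id
    WHERE s.term = 'tom sawyer' ORDER BY e.title
""").fetchall()
```

Once the terms live in their own table, the browsable interface and the reports become simple SELECT statements rather than string-parsing exercises.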
Fun! Combining traditional librarianship with computer applications; not automating existing workflows as much as exploiting the inherent functions of a computer. Using mathematics to solve large-scale problems. Making it easier to do learning and research. It is not the what of librarianship that needs to change so much as the how.