Next-generation library catalogs are really indexes, not catalogs, and increasingly the popular name for such things is “discovery system”. Examples include VuFind, Primo combined with Primo Central, Blacklight, Summon, and to a lesser extent Koha, Evergreen, OLE, and XC. While this may be a well-accepted summary of the situation, I really do not think it goes far enough. Indexers address the problem of find, but in my opinion, find is not the problem to be solved. Everybody can find. Most people believe Google has all but solved that problem. Instead, the problem to solve is use. Just as much as people want to find information, they want to use it, to put it into context, and to understand it. With the advent of so much full text content, the problem of find is much easier to solve than it used to be. What is needed is a “next-generation” library catalog including tools and interfaces designed to make the use and understanding of information easier. Both the “Catholic Portal” and the discovery systems of the Hesburgh Libraries at the University of Notre Dame are beginning to implement some of these ideas. When it comes to “next-generation” library catalogs we might ask the question, “Are we there yet?”. I think the answer is, “No, not yet.”
This text was originally written for a presentation to the Rare Books and Manuscripts Section of the American Library Association during a preconference meeting, June 23, 2011. It is available in a number of formats including this blog posting, a one-page PDF document intended as a handout, and an ePub file.
Numbers of choices
There are currently a number of discovery systems from which a library can choose, and it is very important to note that they have more things in common than differences. VuFind, Primo combined with Primo Central, Summon, and Blacklight are all essentially indexer/search engine combinations. Even more, they all use same “free” and open source software — Lucene — at their core. All of them take some sort of bibliographic data (MARC, EAD, metadata describing journal articles, etc.), stuff it into a data structure (made up authors, titles, key words, and control numbers), index it in the way the information retrieval community has been advocating for at least the past twenty years, and finally, provide a way to query the index with either one-box-one-button or fielded interfaces. Everything else — facets, cover art, reviews, favorites, etc. — is window dressing. When and if any sort of OCLC/EBSCOHost combination manifests itself, I’m sure the underlying technology will be very similar.
Koha, Evergreen, and OLE (Open Library Environment) are more traditional integrated library systems. They automate traditional library processes. Acquisitions. Cataloging. Serials Control. Circulation. Etc. They are database applications, not indexers, designed to manage an inventory. Search — the “OPAC” — is one of these processes. The primary difference between these applications and the integrated library systems of the recent past is their distribution mechanism. Koha and Evergreen are open source software, and therefore as “free as a free kitten”. OLE is still in development, but will be distributed as open source. Everything else is/was licensed for a fee.
When talking about “next-generation” library catalogs and “discovery systems”, many people allude to the Extensible Catalog (XC) which is not catalog nor an index. More accurately, it is system enabling and empowering the library community to manage and transform its bibliographic data on a massive scale. It offer ways for a library to harvest content from OAI-PMH data repositories (such as library catalogs), do extensive find/replace or enhancement operations against the harvested data, expose the result via OAI-PMH again, and finally, support the NCIP protocol so the circulation status of items found in an index can be determined. XC is middleware designed to provide functionality between an integrated library system and discovery system.
Find is not the problem
With the availability of wide-spread full text indexing, the need to organize content according to a classification system — to catalog items — has diminished. This need is not negated, but it is not as necessary as it used to be. In the past, without the availability of wide-spread full text indexing, classification systems provided two functions: 1) to organize the collection into a coherent whole with sub-parts, and 2) to surrogate physical items enumerated in a list. The aggregate of metadata elements — whether they be titles, authors, contributors, key words, subject terms, etc. — acted as “dummies” for the physical item containing the information. They are/were pointers to the book, the journal article, the piece of sheet music, etc. With the advent of wide-spread full text indexing, these two functions are not needed as much as they were in the past. Through the use of statistical analysis and direct access to the thing itself, indexers/search engines make the organization and discovery of information easier and less expenses. Note, I did not say “better”, just simpler and with greater efficiency.
Because wide-spread full text indexing abounds, the problem of find is not as acute as it used to be. In my opinion, it is time to move away from the problem of find and towards the problem of use. What does a person do with the information once they find and acquire it? Does it make sense? Is it valid? Does it have a relationship other things, and if so, then what is that relationship and how does it compare? If these relationships are explored, then what new knowledge might one uncover, or what existing problem might be solved? These are the questions of use. Find is a means to an end, not the end itself. Find is a library problem. Use the problem everybody else wants to solve.
True, classification systems provide a means to discover relationships between information objects, but the predominate classification systems and processes employed today are pre-coordinated and maintained by institutions. As such they posit realities that may or may not match the cognitive perception of today’s readers. Moreover, they are manually applied to information objects. This makes the process literally slow and laborious. Compared to post-coordinated and automated techniques, the manual process of applying classification to information objects is deemed expensive and of diminishing practical use. Put another way, the application of classification systems against information objects today is like icing on a cake, leather trim in a car, or a cherry on a ice cream sundae. They make their associated things richer, but they are not essencial their core purpose. They are extra.
Through the use of a process called text mining, it is possible to provide new services against individual items in a collection as well as to collections as a whole. Such services can make information more useful.
Broadly defined, text mining is an automated process for analyzing written works. Rooted in linguistics, it makes the assumption that language — specifically written language — adheres to sets of loosely defined norms, and these norms are manifested in combinations of words, phrases, sentences, lines of a poem, paragraphs, stanzas, chapters, works, corpora, etc. Additionally, linguistics (and therefore text mining) also assumes these manifestations embody human expressions, meanings, and truth. By systematically examining the manifestations of written language as if they were natural objects, the expressions, meanings, and truths of a work may be postulated. Such is the art and science of text mining.
The process of text mining begins with counting, specifically, counting the number of words (n) in a document. This results in a fact — a given document is n words long. By comparing n across a given corpus of documents, new facts can be derived, such as one document is longer than another, shorter than another, or close to an average length. Once words have been counted they can be tallied. The result is a list of words and their associated frequencies. Some words occur often. Others occur infrequently. The examination of such a list tells a reader something about the given document. The comparison of frequency lists between documents tells the reader even more. By comparing the lengths of documents, the frequency of words, and their existence in an entire corpus a reader can learn of the statistical significance of given words. Thus, the reader can begin to determine the “aboutness” of a given document. This rudimentary counting process forms the heart of most relevancy ranking algorithms of indexing applications and is called “term frequency inverse document frequency” or TFIDF.
Not only can words be tallied but they can be grouped into different parts-of-speech (POS): nouns, pronouns, verbs, adjectives, adverbs, prepositions, function (“stop”) words, etc. While it may be interesting to examine the proportional use of each POS, it may be more interesting to examine the individual words in each POS. Are the personal pronouns singular or plural? Are they feminine or masculine? Are the names of places centered around a particular geographic location? Do these places exist in the current time, a time in the past, or a time in future? Compared to other documents, is there a relatively higher or lower use of color words, action verbs, names of famous people, or sets of words surrounding a particular theme? Knowing the answers to these questions can be quite informative. Just as these processes can be applied to words they can be applied to phrases, sentences, paragraphs, etc. The results can be charted, graphed, and visualized. They can be used to quickly characterize single documents or collections of documents.
The results of text mining processes are not to be taken as representations of truth, any more than the application of Library of Congress Subject Headings completely denote the aboutness of text. Text mining builds on the inherent patterns of language, but language is fluid and ambiguous. Therefore the results of text mining lend themselves to interpretation. The results of text mining are intended to be indicators, guides, and points of reference, and all of these things are expected to be interpreted and then used to explain, describe, and predict. Nor is text mining intended to be a replacement for the more traditional process of close reading. The results of text mining are akin to a book’s table of contents and back-of-the-book index. They outline, enumerate, and summarize. Text mining does the same. It is a form of analysis and a way to deal with information overload.
Assuming the availability of increasing numbers of full text information objects, a library’s “discovery system” could easily incorporate text mining for the purposes of enhancing the traditional cataloging process as well as increasing the usefulness of found material. In my opinion, this is the essence of a true “next-generation” library catalog.
An organization called the Catholic Research Resources Alliance (CRRA) brings together rare, uncommon, and infrequently held materials into a thing colloquially called the “Catholic Portal”. The content for the Portal comes from a variety of metadata formats (MARC, EAD, and Dublin Core) harvested from participating member institutions. Besides supporting the Web 2.0 features we have all come to expect, it also provides item level indexing of finding aids, direct access to digitized materials, and concordancing services. The inclusion of concordance features makes the Portal more than the usual discovery system.
For example, the St. Michael’s College at the University of Toronto is a member of the CRRA. They have been working with the Internet Archive for a number years, and consequently measurable portions of their collection have been digitized. After being given hundreds of Internet Archive unique identifiers, a program was written which mirrored digital content and bibliographic descriptions (MARC records) locally. The MARC records were ingested into the Portal (an implementation of VuFind), and search results were enhanced to include links to both the locally mirrored content as well as the original digital surrogate. In this way, the Portal is pretty much just like any other discovery system. But the bibliographic displays go further because they contain links to text mining interfaces.
The “Catholic Portal”
Through these interfaces, the reader can learn many things. For example, in a book called Letters Of An Irish Catholic Layman the word “catholic” is one of the most frequently used. Using the concordance, the reader can see that “Protestants and Roman Catholics are as wide as the poles asunder”, and “good Catholics are not alarmed, as they should be, at the perverseness with which wicked men labor to inspire the minds of all, but especially of youth, with notions contrary to Catholic doctrine”. This is no big surprise, but instead a confirmation. (No puns intended.) On the other hand, some of the statistically most significant two-word phrases are geographic identities (“upper canada”, “new york”, “lake erie”, and “niagara falls”) . This is interesting because such things are not denoted in the bibliographic metadata. Moreover, a histogram plotting where in the document “niagra fals” occurs can be juxtaposed with a similar histogram for the word “catholic”. Why does the author talk about Catholics when they do not talk about upstate New York? Text mining makes it easier to bring these observations to light in a quick and easy-to-use manner.
Concordance highlighting geographic two-word phrases
Where the word “catholic” is located in the text
Where “niagra falls” is located in the text
Some work being done in the The Hesburgh Libraries at the University of Notre Dame is in the same vein. Specifically, the Libraries is scanning Catholic pamphlets, curating the resulting TIFF images, binding them together to make PDF documents, embedding the results of OCR (optical character recognition) into the PDFs, saving the PDFs on a Web server, linking to the PDFs from the catalog and discovery system, and finally, linking to text mining services from the catalog and discovery system. Consequently, once found, the reader will be able to download a digitized version of a pamphlet, print it, read it in the usual way, and analyze it for patterns and meanings in ways that may have been overlooked through the use of traditional analytic methods.
Are we there yet?
Are we there yet? Has the library profession solved the problem of “next-generation” library catalogs and discovery systems? In my opinion, the answer is, “No.” To date the profession continues to automate its existing processes without truly taking advantage of computer technology. The integrated library systems are more open than they used to be. Consequently control over the way they operate is being transfered from vendors to the library community. The OPACs of yesterday are being replaced with the discovery systems of today. They are easier to use and better meet readers’ desires. They are not perfect. They are not catalogs. But they do make the process of find more efficient.
On the other hand, our existing systems do not take advantage of the current environment. They do not exploit the wide array and inherent functionality of available full text literature. Think of the millions of books freely available from the Internet Archive, Google Books, the HathiTrust, and Project Gutenberg. Think of the thousands of open access journal titles. Think about all the government documents, technical reports, theses & dissertations, conference proceedings, blogs, wikis, mailing list archives, and even “tweets” freely available on the Web. Even without the content available through licensing, this content has the makings of a significant library of any type. The next step is to provide enhanced services against this content — services that go beyond discovery and access. Once done, the library profession moves away from being a warehouse to an online place where data and information can be put into context, used to address existing problems, and/or create new knowledge.
The problem of find as reached the point of diminishing returns. The problem of use is now the problem requiring a greater amount of the profession’s attention.