This posting describes how the extraction of ngrams and the implementation of concordances are integrated into the Alex Catalogue of Electronic Texts. Given the increasing availability of full-text content in libraries, the techniques described here could easily be incorporated into traditional library “discovery systems” and/or catalogs, if and only if the library profession were to shift its definition of what it means to practice librarianship.
During the past couple of weeks, in fits of creativity, one of the things I spent some of my time on was a Perl module named Lingua::EN::Bigram. At version 0.03, it now supports not only bigrams, trigrams, and quadgrams (two-, three-, and four-word phrases, respectively), but also ngrams — multi-word phrases of an arbitrary length.
Given this enhanced functionality, and through the use of a script called ngrams.pl, I learned that the 10 most frequently used 5-word phrases in Henry David Thoreau’s Walden, listed here with the number of times each occurs, seem to revolve around spatial references:
- a quarter of a mile (6)
- i have no doubt that (6)
- as if it were a (6)
- the other side of the (5)
- the surface of the earth (4)
- the greater part of the (4)
- in the midst of a (4)
- in the middle of the (4)
- in the course of the (3)
- two acres and a half (3)
The same process applied to Thoreau’s A Week on the Concord and Merrimack Rivers, on the other hand, returns mostly lengths and references to flowing water:
- a quarter of a mile (8)
- on the bank of the (7)
- the surface of the water (6)
- the middle of the stream (6)
- as if it were the (5)
- as if it were a (4)
- is for the most part (4)
- for the most part we (4)
- the mouth of this river (4)
- in the middle of the (4)
While not always as clear cut as the examples outlined above, the extraction and counting of ngrams usually supports the process of “distant reading” — a phrase coined by Franco Moretti in Graphs, Maps, Trees: Abstract Models for Literary History (2007) to denote the counting, graphing, and mapping of literary texts. With so much emphasis on reading in libraries, I ask myself, “Ought the extraction of ngrams be applied to library applications?”
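The extraction and counting of ngrams described above can be sketched in a few lines of Python. This is only an illustration of the general technique; it is not the code in Lingua::EN::Bigram or ngrams.pl. The function names (tokenize, count_ngrams) and the sample sentences are made up for the example, and the tokenizer is a deliberately crude assumption (lowercase letters and apostrophes only).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    # Assumption: a very simple tokenizer; real texts may need smarter rules.
    return re.findall(r"[a-z'’]+", text.lower())


def count_ngrams(text, n):
    """Count every run of n consecutive words in the text."""
    words = tokenize(text)
    # zip() over n staggered copies of the word list yields each n-word window.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)


# Made-up sample text, not a quotation from Thoreau.
sample = (
    "In the middle of the pond the water was still. "
    "We rowed to the middle of the stream."
)

top = count_ngrams(sample, 4).most_common(1)
for phrase, count in top:
    print(f"{phrase} ({count})")
```

Reading an entire book into the sample variable and asking for the ten most common 5-word phrases, for example with count_ngrams(text, 5).most_common(10), would produce lists shaped like the ones above.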
Concordances are literary tools used to evaluate texts. Dating back to as early as the 12th or 13th century, they were first used to study religious materials. Concordances take many forms, but they usually list all the words in a text, the number of times each occurs, and, most importantly, the places where each word occurs within the context of its surrounding text. The result is a key-word in context (KWIC) index. Done by hand, the creation of concordances is tedious and time consuming, and therefore very expensive. Computers make the work of creating a concordance almost trivial.
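To give a sense of how little work a computer needs to do, here is a minimal key-word in context sketch in Python. It is not the concordance software behind the Alex Catalogue; the function name kwic, the width parameter, and the sample sentences are all assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=4):
    """Return one line per occurrence of keyword, with `width` words on either side."""
    # Assumption: words are runs of letters and apostrophes; punctuation is dropped.
    words = re.findall(r"[A-Za-z'’]+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            # Right-justify the left context so the keywords line up in a column.
            lines.append(f"{left:>30} [{word}] {right}")
    return lines


# Made-up sample text for illustration only.
sample = (
    "The pond was deep and clear. Beyond the pond lay the woods, "
    "and beyond the woods another pond."
)

for line in kwic(sample, "pond", width=3):
    print(line)
```

Lining the keyword up in a single column is what makes a KWIC display easy to scan: the eye runs down the middle and picks out recurring neighbors on either side.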
Each of the full text items in the Alex Catalogue of Electronic Texts (close to 14,000 of them) is accompanied by a concordance. These concordances support the following functions:
- list all the words in the text starting with a given letter and the number of times each occurs
- list the most frequently used words in the text and the number of times each occurs
- list the most frequently used ngrams in a text and the number of times each occurs
- display individual items from the lists above in a KWIC format
- enable the student or scholar to search the text for arbitrary words or phrases (regular expressions) and have them displayed in a KWIC format
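Two of the functions listed above, the word list for a given letter and the regular expression search, might be sketched in Python as follows. This is a hedged illustration and not the Alex Catalogue's implementation; the function names (words_starting_with, search_kwic), the context parameter, and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def words_starting_with(text, letter):
    """Count each word that begins with the given letter, sorted alphabetically."""
    words = re.findall(r"[a-z'’]+", text.lower())
    counts = Counter(w for w in words if w.startswith(letter.lower()))
    return sorted(counts.items())


def search_kwic(text, pattern, context=20):
    """Show each regular-expression match with `context` characters on either side."""
    # Collapse line breaks and repeated spaces so each hit fits on one line.
    flat = " ".join(text.split())
    hits = []
    for match in re.finditer(pattern, flat, flags=re.IGNORECASE):
        start = max(0, match.start() - context)
        end = match.end() + context
        left = flat[start:match.start()]
        right = flat[match.end():end]
        hits.append(f"{left:>{context}}[{match.group(0)}]{right}")
    return hits


# Made-up sentences built around a phrase mentioned in this posting;
# they are not quotations from Call of the Wild.
sample = (
    "The man in the red sweater smiled. Buck watched the man "
    "in the red sweater, and the man watched Buck."
)

print(words_starting_with(sample, "s"))
for hit in search_kwic(sample, r"man in the \w+ sweater"):
    print(hit)
```

Searching by characters rather than whole words is a design choice made here so that arbitrary regular expressions, including ones that span several words, can be displayed in the same keyword-in-the-middle layout.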
Such functionality allows people to answer many questions quickly and easily, such as:
- Does Mark Twain’s Adventures of Huckleberry Finn contain many words beginning with the letter z, and if so, how many times and in what context?
- To what extent does Aristotle’s Metaphysics use the word “good”, and maybe just as importantly, how is the word “evil” used in the same context?
- In Jack London’s Call of the Wild the phrase “man in the red sweater” is one of the more frequently used. Who was this man and what role does he play in the story?
- Compared to Shakespeare, to what extent does Plato discuss love, and how do the authors’ expositions differ?
The counting of words, the enumeration of ngrams, and the use of concordances are not intended to short-circuit traditional literary studies. Instead, they are intended to supplement and enhance the process. Traditional literary investigations, while deep and nuanced, are not scalable. A person is not able to read, compare & contrast, and then comprehend the essence of all of Shakespeare, all of Plato, and all of Charles Dickens through “close reading”. An individual simply does not have enough time. In the words of Gregory Crane, “What do you do with a million books?” Distant reading, akin to the processes outlined above, makes it easier to compare & contrast large corpora, discover patterns, and illustrate trends. Moreover, such processes are reproducible, less prone to subjective interpretation, and not limited to any particular domain. The counting, graphing, and mapping of literary texts makes a lot of sense.
Library “discovery systems” and/or catalogs
The amount of full text content available to libraries has never been greater than it is today. Millions of books have been collectively digitized through Project Gutenberg, the Open Content Alliance, and the Google Books Project. There are thousands of open access journals with thousands upon thousands of freely available scholarly articles. There is an ever-growing number of repositories, both subject-based and institution-based. These too are rich with full text content. None of this even considers the myriad of grey literature sites like blogs and mailing list archives.
Library “discovery systems” and/or catalogs are designed to organize and provide access to the materials outlined above, but they need to do more. First of all, the majority of the profession’s acquisitions processes assume collections need to be paid for. With the increasing availability of truly free content on the Web, greater emphasis needs to be placed on harvesting content as opposed to purchasing or licensing it. Libraries are expected to build collections designed to stand the test of time. Brokering access to content through licensing agreements — one of the current trends in librarianship — will only last as long as the money lasts. Licensing content makes libraries look like cost centers and negates the definition of “collections”.
Second, library “discovery systems” and/or catalogs assume an environment of scarcity. They assume the amount of accessible, relevant data and information needed by students, teachers, and researchers is relatively small. Thus, a great deal of the profession’s efforts go into enabling people to find their particular needle in one particular haystack. In reality, current indexing technology makes the process of finding relevant materials trivial, almost intelligent. Implemented correctly, indexers return more content than most people need, and consequently people continue to drink from the proverbial fire hose.
Let’s turn these lemons into lemonade. Let’s redirect some of the time and money spent on purchasing licenses towards the creation of full text collections by systematic harvesting. Let’s figure out how to apply “distant reading” techniques to the resulting collections, thus making them, literally, more useful and more understandable. These redirections represent a subtle change in the current direction of librarianship. At the same time, they retain the core principles of the profession, namely: collection, organization, preservation, and dissemination. Such a shift will increase our expertise, give us the ability to better control our own destiny, and contribute to the overall advancement of our profession.
What can we do to make these things come to fruition?