The Next Next-Generation Library Catalog

With the advent of the Internet and wide-scale availability of full-text content, people are overwhelmed with the amount of accessible data and information. Library catalogs can only go so far when it comes to delimiting what is relevant and what is not. Even when the most exact searches return 100’s of hits what is a person to do? Services against texts — digital humanities computing techniques — represent a possible answer. Whether the content is represented by novels, works of literature, or scholarly journal articles the methods of the digital humanities can provide ways to compare & contrast, analyze, and make more useful any type of content. This essay elaborates on these ideas and describes how they can be integrated into the “next, next-generation library catalog”.

(Because this essay is the foundation for a presentation at the 2010 ALA Annual Meeting, this presentation is also available as a one-page handout designed for printing as well as bloated set of slides.)

Find is not the problem

Find is not the problem to be solved. At most, find is a means to an end and not the end itself. Instead, the problem to solve surrounds use. The profession needs to implement automated ways to make it easier users do things against content.

The library profession spends an inordinate amount of time and effort creating catalogs — essentially inventory lists of things a library owns (or licenses). The profession then puts a layer on top of this inventory list — complete with authority lists, controlled vocabularies, and ever-cryptic administrative data — to facilitate discovery. When poorly implemented, this discovery layer is seen by the library user as an impediment to their real goal. Read a book or article. Verify a fact. Learn a procedure. Compare & contrast one idea with another idea. Etc.

In just the past few years the library profession has learned that indexers (as opposed to databases) are the tools to facilitate find. This is true for two reasons. First, indexers reduce the need for users to know how the underlying data is structured. Second, indexers employ statistical analysis to rank it’s output by relevance. Databases are great for creating and maintaining content. Indexers are great for search. Both are needed in equal measures in order to implement the sort of information retrieval systems people have come to expect. For example, many of the profession’s current crop of “discovery” systems (VUFind, Blacklight, Summon, Primo, etc.) all use an open source indexer called Lucene to drive search.

This being the case, we can more or less call the problem of find solved. True, software is never done, and things can always be improved, but improvements in the realm of search will only be incremental.

Instead of focusing on find, the profession needs to focus on the next steps in the process. After a person does a search and gets back a list of results, what do they want to do? First, they will want to peruse the items in the list. After identifying items of interest, they will want to acquire them. Once the selected items are in hand users may want to print, but at the very least they will want to read. During the course of this reading the user may be doing any number of things. Ranking. Reviewing. Annotating. Summarizing. Evaluating. Looking for a specific fact. Extracting the essence of the author’s message. Comparing & contrasting the text to other texts. Looking for sets of themes. Tracing ideas both inside and outside the texts. In other words, find and acquire are just a means to greater ends. Find and acquire are library goals, not the goals of users.

People want to perform actions against the content they acquire. They want to use the content. They want to do stuff with it. By expanding our definition of “information literacy” to include things beyond metadata and bibliography, and by combining it with the power of computers, librarianship can further “save the time of the reader” and thus remain relevant in the current information environment. Focusing on the use and evaluation of information represents a growth opportunity for librarianship.

It starts with counting

The availability of full text content in the form of plain text files combined with the power of computing empowers one to do statistical analysis against corpora. Put another way, computers are great at counting words, and once sets of words are counted there are many things one can do with the results, such as but not limited to:

  • measuring length
  • measuring readability, “greatness”, or any other index
  • measuring frequency of unigrams, n-grams, parts-of-speech, etc.
  • charting & graphing analysis (word clouds, scatter plots, histograms, etc.)
  • analyzing measurements and looking for patterns
  • drawing conclusions and making hypotheses

For example, suppose you did the perfect search and identified all of the works of Plato, Aristotle, and Shakespeare. Then, if you had the full text, you could compute a simple table such as Table 1.

Author Works Words Average Grade Flesch
Plato 25 1,162,46 46,499 12-15 54
Aristotle 19 950,078 50,004 13-17 50
Shakespeare 36 856,594 23,794 7-10 72

The table lists who wrote how many works. It lists the number of words in each set of works and the average number of words per work. Finally, based on things like sentence length, it estimates grade and reading levels for the works. Given such information, a library “catalog” could help the patron could answer questions such as:

  • Which author has the most works?
  • Which author has the shortest works?
  • Which author is the most verbose?
  • Is the author of most works also the author who is the most verbose?
  • In general, which set of works requires the higher grade level?
  • Does the estimated grade/reading level of each authors’ work coincide with one’s expectations?
  • Are there any authors whose works are more or less similar in reading level?

Given the full text, a trivial program can then be written to count the number of words existing in a corpus as well as the number of times each word occurs, as shown in Table 2.

Plato Aristotle Shakespeare
will one thou
one will will
socrates must thy
may also shall
good things lord
said man thee
man may sir
say animals king
true thing good
shall two now
like time come
can can well
must another enter
another part love
men first let
now either hath
also like man
things good like
first case one
let nature upon
nature motion know
many since say
state others make
knowledge now may
two way yet

Table 2, sans a set of stop words, lists the most frequently used words in the complete works of Plato, Aristotle, and Shakespeare. The patron can then ask and answer questions like:

  • Are there words in one column that appear frequently in all columns?
  • Are there words that appear in only one column?
  • Are the rankings of the words similar between columns?
  • To what degree are the words in each column a part of larger groups such as: nouns, verbs, adjectives, etc.?
  • Are there many synonyms or antonyms shared inside or between the columns?

Notice how the words “one”, “good” and “man” appear in all three columns. Does that represent some sort of shared quality between the works?

If one word contains some meaning, then do two words contain twice as much meaning? Here is a list of the most common two-word phrases (bigrams) in each author corpus, Table 3.

Plato Aristotle Shakespeare
let us one another king henry
one another something else thou art
young socrates let uses thou hast
just now takes place king richard
first place one thing mark antony
every one without qualification prince henry
like manner middle term let us
every man first figure king lear
quite true b belongs thou shalt
two kinds take place duke vincentio
human life essential nature dost thou
one thing every one sir toby
will make practical wisdom art thou
human nature will belong henry v
human mind general rule richard iii
quite right anything else toby belch
modern times one might scene ii
young men first principle act iv
can hardly good man iv scene
will never two things exeunt king
will tell two kinds don pedro
dare say first place mistress quickly
will say like manner act iii
false opinion one kind thou dost
one else scientific knowledge sir john

Notice how the names of people appear frequently in Shakespeare’s works, but very few names appear in the lists of Plato and Aristotle. Notice how the word “thou” appears a lot in Shakespeare’s works. Ask yourself the meaning of the word “thou”, and decide whether or not to update the stop word list. Notice how the common phrases of Plato and Aristotle are akin to ideas, not tangible things. Examples include: human nature, practical wisdom, first principle, false opinion, etc. Is there a pattern here?

If “a picture is worth a thousand words”, then there are about six thousand words represented by Figures 1 through 6.

Words used by Plato
words used by Plato
Phrases used by Plato
phrases used by Plato
Words used by Aristotle
words used by Aristotle
Phrases used by Aristotle
phrases used by Aristotle
Words used by Shakespeare
words used by Shakespeare
Phrases used by Shakespeare
phrases used by Shakespeare

Word clouds — “tag clouds” — are an increasingly popular way to illustrate the frequency of words or phrases in a corpus. Because a few of the phrases in a couple of the corpuses were considered outliers, phrases such as “let us”, “one another”, and “something else” are not depicted.

Even without the use of statistics, it appears the use of the phrase “good man” by each author might be interestingly compared & contrasted. A concordance is an excellent tool for such a purpose, and below are a few of the more meaty uses of “good man” by each author.

List 1 – “good man” as used by Plato
  ngth or mere cleverness. To the good man, education is of all things the most pr
   Nothing evil can happen to the good man either in life or death, and his own de
  but one reply: 'The rule of one good man is better than the rule of all the rest
   SOCRATES: A just and pious and good man is the friend of the gods; is he not? P
  ry wise man who happens to be a good man is more than human (daimonion) both in 
List 2 – “good man” as used by Aristotle
  ons that shame is felt, and the good man will never voluntarily do bad actions. 
  reatest of goods. Therefore the good man should be a lover of self (for he will 
  hat is best for itself, and the good man obeys his reason. It is true of the goo
  theme If, as I said before, the good man has a right to rule because he is bette
  d prove that in some states the good man and the good citizen are the same, and 
List 3 – “good man” as used by Shakespeare
  r to that. SHYLOCK Antonio is a good man. BASSANIO Have you heard any imputation
  p out, the rest I'll whistle. A good man's fortune may grow out at heels: Give y
  t it, Thou canst not hit it, my good man. BOYET An I cannot, cannot, cannot, An 
  hy, look where he comes; and my good man too: he's as far from jealousy as I am 
   mean, that married her, alack, good man! And therefore banish'd -- is a creatur

What sorts of judgements might the patron be able to make based on the snippets listed above? Are Plato, Aristotle, and Shakespeare all defining the meaning of a “good man”? If so, then what are some of the definitions? Are there qualitative similarities and/or differences between the definitions?

Sometimes being as blunt as asking a direct question, like “What is a man?”, can be useful. Lists 4 through 6 try to answer it.

List 4 – “man is” as used by Plato
  stice, he is met by the fact that man is a social being, and he tries to harmoni
  ption of Not-being to difference. Man is a rational animal, and is not -- as man
  ss them. Or, as others have said: Man is man because he has the gift of speech; 
  wise man who happens to be a good man is more than human (daimonion) both in lif
  ied with the Protagorean saying, 'Man is the measure of all things;' and of this
List 5 – “man is” as used by Aristotle
  ronounced by the judgement 'every man is unjust', the same must needs hold good 
  ts are formed from a residue that man is the most naked in body of all animals a
  ated piece at draughts. Now, that man is more of a political animal than bees or
  hese vices later. The magnificent man is like an artist; for he can see what is 
  lement in the essential nature of man is knowledge; the apprehension of animal a
List 6 – “man is” as used by Shakespeare
   what I have said against it; for man is a giddy thing, and this is my conclusio
   of man to say what dream it was: man is but an ass, if he go about to expound t
  e a raven for a dove? The will of man is by his reason sway'd; And reason says y
  n you: let me ask you a question. Man is enemy to virginity; how may we barricad
  er, let us dine and never fret: A man is master of his liberty: Time is their ma

In the 1950s Mortimer Adler and a set of colleagues created a set of works they called The Great Books of the Western World. This 80-volume set included all the works of Plato, Aristotle, and Shakespeare as well as some of the works of Augustine, Aquinas, Milton, Kepler, Galileo, Newton, Melville, Kant, James, and Frued. Prior to the set’s creation, Adler and colleagues enumerated 102 “greatest ideas” including concepts such as: angel, art, beauty, honor, justice, science, truth, wisdom, war, etc. Each book in the series was selected for inclusion by the committee because of the way the books elaborated on the meaning of the “great ideas”.

Given the full text of each of the Great Books as well as a set of keywords (the “great ideas”), it is relatively simple to calculate a relevancy ranking score for each item in a corpus. Love is one of the “great ideas”, and it just so happens it is used most significantly by Shakespeare compared to the use of the other authors in the set. If Shakespeare has the highest “love quotient”, then what does Shakespeare have to say about love? List 7 is a brute force answer to such a question.

List 7 – “love is” as used by Shakespeare
  y attempted? Love is a familiar; Love is a devil: there is no evil angel but Lov
  er. VALENTINE Why? SPEED Because Love is blind. O, that you had mine eyes; or yo
   that. DUKE This very night; for Love is like a child, That longs for every thin
  n can express how much. ROSALIND Love is merely a madness, and, I tell you, dese
  of true minds Admit impediments. Love is not love Which alters when it alteratio

Do these definitions coincide with expectations? Maybe further reading is necessary.

Digital humanities, library science, and “catalogs”

The previous section is just about the most gentle introduction to digital humanities computing possible, but can also be an introduction to a new breed of library science and library catalogs.

It began by assuming the existence of full text content in plain text form — an increasingly reasonable assumption. After denoting a subset of content, it compared & contrasted the sizes and reading levels of the content. By counting individual words and phrases, patterns were discovered in the texts and a particular idea was loosely followed — specifically, the definition of a good man. Finally, the works of a particular author were compared to the works of a larger whole to learn how the author defined a particular “great idea”.

The fundamental tools used in this analysis were a set of rudimentary Perl modules: Lingua::EN::Fathom for calculating the total number of words in a document as well as a document’s reading level, Lingua::EN::Bigram for listing the most frequently occurring words and phrases, and Lingua::Concordance for listing sentence snippets. The Perl programs built on top of these modules are relatively short and include: fathom.pl, words.pl, bigrams.pl and concordance.pl. (If you really wanted to download the full text versions of Plato, Aristotle, and Shakespeare‘s works used in this analysis.) While the programs themselves are really toys, the potential they represent are not. It would not be too difficult to integrate their functionality into a library “catalog”. Assume the existence of significant amount of full text content in a library collection. Do a search against the collection. Create a subset of content. Click a few buttons to implement statistical analysis against the result. Enable the user to “browse” the content and follow a line of thought.

The process outlined in the previous section is not intended to replace rigorous reading, but rather to supplement it. It enables a person to identify trends quickly and easily. It enables a person to read at “Web scale”. Again, find is not the problem to be solved. People can find more information than they require. Instead, people need to use and analyze the content they find. This content can be anything from novels to textbooks, scholarly journal articles to blog postings, data sets to collections of images, etc. The process outlined above is an example of services against texts, a way to “Save the time of the reader” and empower them to make better and more informed decisions. The fundamental processes of librarianship (collection, preservation, organization, and dissemination) need to be expanded to fit the current digital environment. The services described above are examples of how processes can be expanded.

The next “next generation library catalog” is not about find, instead it is about use. Integrating digital humanities computing techniques into library collections and services is just one example of how this can be done.

Tags: , ,

5 Responses to “The Next Next-Generation Library Catalog”

  1. Thank you for this. Entertaining, diverting and pointing to the future that libraries need to apprehend…all with a spoon full of sugar.

    Has me thinking about other disciplines and what kinds of tools that exist to mine their data. It’s been my problem with OpenCalias – it has such a business focus that it is great for an analysis of data where the user wants to use the results in business.. but not so great for other disciplines. I think librarians need to look toward using textual analysis tools from the disciplines they serve and incorporate it into their “beyond discovery” layers.

    It’s a pity that often subject liaison librarians in universities do this job because they are not so interested in “techie things”….How to get them to understand that this is their role will be interesting.

    This kind of discovery also depends obviously on full text. It creates a population bias that would need to be somehow overcome. For example if fewer books by female authors are digitised (as fewer books by female authors were added to canons of “great works” in the past) then an attempt to discover what “love is” may produce a very skewed view …

  2. Loren says:

    Thank you for saying so charmingly what I’ve been thinking, feeling and experiencing for the past few years. The subtext I see hear, with all the different apps you’ve used in saying it speaks of process and a world of libraries where we are only limited by our own lack of imagination.

  3. [...] The Next Next-Generation Library Catalog [...]

  4. [...] Eric Morgan talked for a few moments about pulling quantitative data from bibliographic and full text information to enable post-discovery analysis of resources. He has a great overview of his experiments at Eric’s blog. [...]

  5. [...] their locations across the entire text. In a previous blog posting I used Lingua::Concordance to compare & contrast the use of the phrase “good man” in the works of Aristotle, Plato, and Shakespeare. Lingua::Concordance too is available from [...]