Archive for August, 2010

Ngrams, concordances, and librarianship

Monday, August 30th, 2010

This posting describes how the extraction of ngrams and the implementation of concordances are integrated into the Alex Catalogue of Electronic Texts. Given the increasing availability of full-text content in libraries, the techniques described here could easily be incorporated into traditional library “discovery systems” and/or catalogs, if and only if the library profession were to shift its definition of what it means to practice librarianship.

Lingua::EN::Bigram

During the past couple of weeks, in fits of creativity, one of the things I spent some of my time on was a Perl module named Lingua::EN::Bigram. At version 0.03, it now supports not only bigrams, trigrams, and quadgrams (two-, three-, and four-word phrases, respectively), but also ngrams — multi-word phrases of an arbitrary length.

Given this enhanced functionality, and through the use of a script called ngrams.pl, I learned that the 10 most frequently used 5-word phrases in Henry David Thoreau’s Walden (listed with the number of times each occurs) seem to revolve around spatial references:

  • a quarter of a mile (6)
  • i have no doubt that (6)
  • as if it were a (6)
  • the other side of the (5)
  • the surface of the earth (4)
  • the greater part of the (4)
  • in the midst of a (4)
  • in the middle of the (4)
  • in the course of the (3)
  • two acres and a half (3)

Whereas the same process applied to Thoreau’s A Week on the Concord and Merrimack Rivers returns lengths and references to flowing water, mostly:

  • a quarter of a mile (8)
  • on the bank of the (7)
  • the surface of the water (6)
  • the middle of the stream (6)
  • as if it were the (5)
  • as if it were a (4)
  • is for the most part (4)
  • for the most part we (4)
  • the mouth of this river (4)
  • in the middle of the (4)

While not always as clear cut as the examples outlined above, the extraction and counting of ngrams usually supports the process of “distant reading” — a phrase coined by Franco Moretti in Graphs, Maps, Trees: Abstract Models for Literary History (2007) to denote the counting, graphing, and mapping of literary texts. With so much emphasis on reading in libraries, I ask myself, “Ought the extraction of ngrams be applied to library applications?”

Concordances

Concordances are literary tools used to evaluate texts. Dating back as early as the 12th or 13th century, they were first used to study religious materials. Concordances take many forms, but they usually list all the words in a text, the number of times each occurs, and, most importantly, the places where each word occurs within the context of its surrounding text: a key-word in context (KWIC) index. Done by hand, the creation of concordances is tedious and time-consuming, and therefore very expensive. Computers make the work of creating a concordance almost trivial.
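
To illustrate just how trivial, here is a rough sketch of a KWIC display in Python. It is not the code behind the Alex Catalogue's concordances; the function name, the bracket markers around the keyword, the default context width, and the sample text are assumptions.

```python
import re


def kwic(text, pattern, width=30):
    """Return one line per match of the regular expression, with up to
    `width` characters of context on either side of the keyword."""
    flat = " ".join(text.split())  # collapse line breaks and runs of whitespace
    lines = []
    for match in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # right-justify the left context so the keywords line up in a column
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


# a made-up sample text
sample = "I went to the woods because I wished to live deliberately. The woods were quiet."

for line in kwic(sample, r"\bwoods\b", width=20):
    print(line)
```

Because the search term is a regular expression, the same function handles single words, phrases, and patterns alike.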

Each of the full text items in the Alex Catalogue of Electronic Texts (close to 14,000 of them) is accompanied by a concordance. The concordances support the following functions:

  • list all the words in the text starting with a given letter and the number of times each occurs
  • list the most frequently used words in the text and the number of times each occurs
  • list the most frequently used ngrams in a text and the number of times each occurs
  • display individual items from the lists above in a KWIC format
  • enable the student or scholar to search the text for arbitrary words or phrases (regular expressions) and have them displayed in a KWIC format

Such functionality allows people to answer many questions quickly and easily, such as:

  • Does Mark Twain’s Adventures of Huckleberry Finn contain many words beginning with the letter z, and if so, how many times and in what context?
  • To what extent does Aristotle’s Metaphysics use the word “good”, and maybe just as importantly, how is the word “evil” used in the same context?
  • In Jack London’s Call of the Wild the phrase “man in the red sweater” is one of the more frequently used. Who was this man and what role does he play in the story?
  • Compared to Shakespeare, to what extent does Plato discuss love, and how do the authors’ expositions differ?

The counting of words, the enumeration of ngrams, and the use of concordances are not intended to short-circuit traditional literary studies. Instead, they are intended to supplement and enhance the process. Traditional literary investigations, while deep and nuanced, are not scalable. A person is not able to read, compare & contrast, and then comprehend the essence of all of Shakespeare, all of Plato, and all of Charles Dickens through “close reading”. An individual simply does not have enough time. In the words of Gregory Crane, “What do you do with a million books?” Distant reading, akin to the processes outlined above, makes it easier to compare & contrast large corpora, discover patterns, and illustrate trends. Moreover, such processes are reproducible, less prone to subjective interpretation, and not limited to any particular domain. The counting, graphing, and mapping of literary texts makes a lot of sense.

The home page for the concordances is complete with a number of sample texts. Alternatively, you can search the Alex Catalogue and find an item on your own.

Library “discovery systems” and/or catalogs

The amount of full text content available to libraries has never been greater than it is today. Millions of books have been collectively digitized through Project Gutenberg, the Open Content Alliance, and the Google Books Project. There are thousands of open access journals with thousands upon thousands of freely available scholarly articles. There is an ever-growing number of repositories, both subject-based and institution-based. These too are rich with full text content. None of this even considers the myriad of grey literature sites like blogs and mailing list archives.

Library “discovery systems” and/or catalogs are designed to organize and provide access to the materials outlined above, but they need to do more. First of all, the majority of the profession’s acquisitions processes assume collections need to be paid for. With the increasing availability of truly free content on the Web, greater emphasis needs to be placed on harvesting content as opposed to purchasing or licensing it. Libraries are expected to build collections designed to stand the test of time. Brokering access to content through licensing agreements — one of the current trends in librarianship — will only last as long as the money lasts. Licensing content makes libraries look like cost centers and negates the definition of “collections”.

Second, library “discovery systems” and/or catalogs assume an environment of scarcity. They assume the amount of accessible, relevant data and information needed by students, teachers, and researchers is relatively small. Thus, a great deal of the profession’s effort goes into enabling people to find their particular needle in one particular haystack. In reality, current indexing technology makes the process of finding relevant materials trivial, almost intelligent. Implemented correctly, indexers return more content than most people need, and consequently people continue to drink from the proverbial fire hose.

Let’s turn these lemons into lemonade. Let’s redirect some of the time and money spent on purchasing licenses towards the creation of full text collections by systematic harvesting. Let’s figure out how to apply “distant reading” techniques to the resulting collections, thus making them, literally, more useful and more understandable. These redirections represent a subtle change in the current direction of librarianship. At the same time, they retain the core principles of the profession, namely: collection, organization, preservation, and dissemination. Such a shift will result in increased expertise on our part, the ability to better control our own destiny, and a contribution to the overall advancement of our profession.

What can we do to make these things come to fruition?

Lingua::EN::Bigram (version 0.03)

Monday, August 23rd, 2010

I uploaded version 0.03 of Lingua::EN::Bigram to CPAN today, and it now supports not just bigrams, trigrams, and quadgrams, but ngrams of an arbitrary length.

In order to test it out, I quickly gathered together some of my more recent essays, concatenated them together, and applied Lingua::EN::Bigram against the result. Below is a list of the top 10 most common bigrams, trigrams, and quadgrams:

  bigrams                 trigrams                  quadgrams
  52  great ideas         36  the number of         25  the number of times
  43  open source         36  open source software  13  the total number of
  38  source software     32  as well as            10  at the same time
  29  great books         28  number of times       10  number of words in
  24  digital humanities  27  the use of            10  when it comes to
  23  good man            25  the great books       10  total number of documents
  22  full text           23  a set of              10  open source software is
  22  search results      20  eric lease morgan      9  number of times a
  20  lease morgan        20  a number of            9  as well as the
  20  eric lease          19  total number of        9  through the use of

Not surprising since I have been writing about the Great Books, digital humanities, indexing, and open source software. Re-affirming.

Lingua::EN::Bigram is available locally as well as from CPAN.

Lingua::EN::Bigram (version 0.02)

Sunday, August 22nd, 2010

I have written and uploaded to CPAN version 0.02 of my Perl module Lingua::EN::Bigram. From the README file:

This module is designed to: 1) pull out all of the two-, three-, and four-word phrases in a given text, and 2) list these phrases according to their frequency. Using this module it is possible to create lists of the most common phrases in a text as well as order them by their probable occurrence, thus implying significance. This process is useful for the purposes of textual analysis and “distant reading”.

Using this module I wrote a script called n-grams.pl. Feed it a plain text file, and it will return the top 10 most significant bigrams (as calculated by T-Score) as well as the top 10 most common trigrams and quadgrams. For example, here is the output of n-grams.pl when Henry David Thoreau’s Walden is input:

  Bi-grams (T-Score, count, bigram)
  4.54348783312048  22  one day  
  4.35133234596553  19  new england  
  3.705427371426    14  walden pond  
  3.66575742655033  14  one another  
  3.57857056272537  13  many years  
  3.55592136768501  13  every day  
  3.46339791276118  12  fair haven  
  3.46101939872834  12  years ago  
  3.38519781332654  12  every man  
  3.29818626191729  11  let us  
  
  Tri-grams (count, trigram)
  41  in the woods
  40  i did not
  28  i do not
  28  of the pond
  27  as well as
  27  it is a
  26  part of the
  25  that it was
  25  as if it
  25  out of the
  
  Quad-grams (count, quadgram)
  20  for the most part
  16  from time to time
  15  as if it were
  14  in the midst of
  11  at the same time
   9  the surface of the
   9  i think that i
   8  in the middle of
   8  worth the while to
   7  as if they were

The whole thing gets more interesting when you compare that output to another of Thoreau’s works — A Week on the Concord and Merrimack Rivers:

  Bi-grams (T-Score, count, bi-gram)
  4.62683939320543  22  one another  
  4.57637831535376  21  new england  
  4.08356124174142  17  let us  
  3.86858364314677  15  new hampshire  
  3.43311180449584  12  one hundred  
  3.31196701774012  11  common sense  
  3.25007069543896  11  can never  
  3.15955504269006  10  years ago  
  3.14821552996352  10  human life  
  3.13793008615654  10  told us  
  
  Tri-grams (count, tri-gram)
  41  as well as
  38  of the river
  34  it is a
  30  there is a
  30  one of the
  28  it is the
  27  as if it
  26  it is not
  26  if it were
  24  it was a
  
  Quad-grams (count, quad-gram)
  21  for the most part
  20  as if it were
  17  from time to time
   9  on the bank of
   8  the bank of the
   8  in the midst of
   8  a quarter of a
   8  the middle of the
   8  quarter of a mile
   7  at the same time

Ask yourself, “Are there similarities between the outputs? How about differences? Do you notice any patterns or anomalies? What sorts of new discoveries might be made if n-grams.pl were applied to the entire corpus of Thoreau’s works? How might the output be different if a second author’s works were introduced?” Such questions are the core of digital humanities research. With the increasing availability of full text content in library collections, such are the questions the library profession can help answer if the profession were to expand its definition of “service”.

Search and retrieve are not the pressing problems to be solved. People can find more data and information than they know what to do with. Instead, the pressing problems surround use and understanding. Lingua::EN::Bigram is an example of how these newer and more pressing problems can be addressed. The module is available for downloading (locally as well as from CPAN). Also for your perusal is n-grams.pl.
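
For readers curious about the T-Score column in the bigram listings above, here is a minimal sketch in Python of the t-score as it is commonly defined for collocations: the observed count of a bigram minus its expected count, divided by the square root of the observed count. Whether Lingua::EN::Bigram computes exactly this, and whether it filters out stop words first, is an assumption on my part; the function name and sample text are hypothetical.

```python
import math
import re
from collections import Counter


def bigram_tscores(text):
    """Score each bigram with t = (observed - expected) / sqrt(observed),
    where expected = count(first) * count(second) / total number of words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    scores = {}
    for (first, second), observed in bigram_counts.items():
        expected = word_counts[first] * word_counts[second] / total
        scores[f"{first} {second}"] = (observed - expected) / math.sqrt(observed)
    return scores


# a made-up sample text
sample = "new england is old and new england is cold but england is not new"

scores = bigram_tscores(sample)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for bigram, score in ranked[:3]:
    print(f"{score:.4f}  {bigram}")
```

The measure rewards pairs of words that occur together more often than their individual frequencies would predict, which is why the ranked bigrams can differ from a simple list ordered by raw count.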

Cool URIs

Sunday, August 22nd, 2010

I have started implementing “cool” URIs against the Alex Catalogue of Electronic Texts.

As outlined in Cool URIs for the Semantic Web, “The best resource identifiers… are designed with simplicity, stability and manageability in mind…” To that end I have taken to creating generic URIs redirecting user-agents to URLs based on content negotiation — 303 URI forwarding. These URIs also provide a means to request specific types of pages. The shapes of these URIs follow, where “key” is a foreign key in my underlying (MyLibrary) database:

  • http://infomotions.com/etexts/id/key – generic; redirection based on content negotiation
  • http://infomotions.com/etexts/page/key – HTML; the text itself
  • http://infomotions.com/etexts/data/key – RDF; data about the text
  • http://infomotions.com/etexts/concordance/key – concordance; a means for textual analysis

For example, the following URIs return different versions/interfaces of Henry David Thoreau’s Walden:

This whole thing makes my life easier. No need to remember complicated URLs. All I have to remember is the shape of my URI and the foreign key. Through the process this also makes the URLs easier to type, shorten, distribute, and display.

The downside of this implementation is the need for an always-on intermediary application doing the actual work. The application, implemented as a mod_perl module, is called Apache2::Alex::Dereference and is available for your perusal. Another downside is the need for better, more robust RDF, but that’s for later.
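
The redirection logic described above can be sketched in a few lines of Python. This is not the code of Apache2::Alex::Dereference; it only illustrates 303 forwarding based on the Accept header. The function names, the choice of application/rdf+xml as the RDF media type, the default to HTML, and the key "example-key" are all assumptions, and wildcards such as */* are simply ignored.

```python
BASE = "http://infomotions.com/etexts"
RDF_TYPE = "application/rdf+xml"  # assumed RDF serialization
HTML_TYPE = "text/html"


def parse_accept(header):
    """Map each media type in an Accept header to its q-value (default 1.0)."""
    preferences = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        if not parts[0]:
            continue
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not wanted"
        preferences[parts[0].lower()] = quality
    return preferences


def dereference(key, accept=""):
    """Return (status, location) for a generic /id/ URI: RDF when the client
    prefers it over HTML, otherwise the HTML page."""
    preferences = parse_accept(accept)
    wants_rdf = preferences.get(RDF_TYPE, 0.0)
    wants_html = preferences.get(HTML_TYPE, 0.0)
    kind = "data" if wants_rdf > wants_html else "page"
    return 303, f"{BASE}/{kind}/{key}"


# "example-key" is a hypothetical key, not a real identifier in the catalogue
print(dereference("example-key", "application/rdf+xml"))
print(dereference("example-key", "text/html,application/xhtml+xml;q=0.9"))
```

A browser sending a typical Accept header would be forwarded to the page URI, while a Semantic Web client asking for RDF would be forwarded to the data URI.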

rsync, a really cool utility

Wednesday, August 18th, 2010

Without direct physical access to my co-located host, backing up and preserving Infomotions’ 150 GB website is challenging, but through the use of rsync things are a whole lot easier. rsync is a really cool utility, and thanks go to Francis Kayiwa who recommended it to me in the first place. “Thank you!”

Here is my rather brain-dead back-up utility:

# rsync.sh - brain-dead backup of wilson

# change directories to the local store; stop if it is missing so that
# find and rsync never run against the wrong directory
cd /Users/eric/wilson || exit 1

# get rid of any weird Mac OS X filenames
find ./ -name '.DS_Store' -exec rm -rf {} \;

# do the work for one remote file system...
rsync --exclude-from=/Users/eric/etc/rsync-exclude.txt \
    -avz wilson:/disk01/ \
    ./disk01/

# ...and then another
rsync --exclude-from=/Users/eric/etc/rsync-exclude.txt \
    -avz wilson:/home/eric/ \
    ./home/eric/

After I run this code my local Apple Macintosh Time Capsule automatically copies my content to yet a third spinning disk. I feel much better about my data now that I have started using rsync.

WiLSWorld, 2010

Friday, August 6th, 2010

I had the recent honor, privilege, and pleasure of attending WiLSWorld (July 21-22, 2010 in Madison, Wisconsin), and this posting outlines my experiences there. In a sentence, I was pleased to see the increasing understanding of “discovery” interfaces defined as indexes as opposed to databases, and it is now my hope that we, as a profession, can move beyond search & find towards use & understand.

Wednesday, July 21

With an audience of about 150 librarians of all types from across Wisconsin, the conference began with a keynote speech by Tim Spalding (LibraryThing) entitled “Social cataloging and the future”. The heart of his presentation was a thing he called the Ladder of Social Cataloging which has six “rungs”: 1) personal cataloging, 2) sharing, 3) implicit social cataloging, 4) social networking, 5) explicitly social cataloging, and 6) collaboration. Much of what followed were demonstrations of how each of these things is manifested in LibraryThing. There were a number of meaty quotes sprinkled throughout the talk:

…We [LibraryThing] are probably not the biggest book club anymore… Reviews are less about buying books and more about sharing minds… Tagging is not about something for everybody else, but rather about something for yourself… LibraryThing was about my attempt to discuss the things I wanted to discuss in graduate school… We have “flash mobs” cataloging peoples’ books such as the collections of Thomas Jefferson, John Adams, Ernest Hemingway, etc… Traditional subject headings are not manifested in degrees; all LCSH are equally valid… Library data can be combined but separate from patron data.

I was duly impressed with this presentation. It really brought home the power of crowd sourcing and how it can be harnessed in a library setting. Very nice.

Peter Gilbert (Lawrence University) then gave a presentation called “Resource discovery: I know it when I see it”. In his words, “The current problem to solve is to remove all of the silos: books, articles, digitized content, guides to subjects, etc.” The solution, in his opinion, is to implement “discovery systems” similar to Blacklight, eXtensible Catalog, Primo & Primo Central, Summon, VUFind, etc. I couldn’t have said it better myself. He gave a brief overview of each system.

Ken Varnum (University of Michigan Library) described a website redesign process in “Opening what’s closed: Using open source tools to tear down vendor silos”. As he said, “The problem we tried to solve in our website redesign was the overwhelming number of branch library websites. All different. Almost schizophrenic.” The solution grew out of a different premise for websites. “Information not location.” He went on to describe a rather typical redesign process complete with focus group interviews, usability studies, and advisory groups, but there were a couple of very interesting tidbits. First, inserting the names and faces of librarians in search results has proved popular with students. Second, I admired the “participatory design” process he employed. Print a design. Allow patrons to use pencils to add, remove, or comment on aspects of the layout. I also think the addition of a professional graphic designer helped their process.

I then attended Peter Gorman’s (University of Wisconsin-Madison) “Migration of digital content to Fedora”. Gorman had the desire to amalgamate institutional content, books, multimedia and finding aids (EAD files) into a single application… yet another “discovery system” description. His solution was to store content in Fedora, index the content, and provide services against the index. Again, a presenter after my own heart. Better than anyone had done previously, Gorman described Fedora’s content model complete with identifiers (keys), a set of properties (relationships, audit trails, etc.), and data streams (JPEG, XML, TIFF, etc.). His description was clear and very easy to digest. The highlight was a description of Fedora “behaviors”. These are things people are intended to do with data streams. Examples include enlarging a thumbnail image or transforming an online finding aid into something designed for printing. These “behaviors” are very much akin to, if not exactly like, the “services against texts” I have been advocating for a few years.

Thursday, July 22

The next day I gave a presentation called “Electronic texts and the evolving definition of librarianship”. This was an extended version of my presentation at ALA given a few weeks ago. To paraphrase, “As we move from databases towards indexes to facilitate search, the problems surrounding find are not as acute. Given the increasing availability of digitized full text content, library systems have the opportunity to employ ‘digital humanities computing techniques’ against collections and enable people to do ‘distant reading’.” I then demonstrated how the simple counting of words and phrases, the use of concordances, and the application of TFIDF can facilitate rudimentary comparing & contrasting of corpora. Giving this presentation was an enjoyable experience because it provided me the chance to verbalize and demonstrate much of my current “great books” research.

Later in the morning I helped facilitate a discussion on the process a library could go through to implement the ideas outlined in my presentation, but the vast majority of people attended the presentation by Keith Mountin (Apple Computer, Inc.) called “The iPad and its application in libraries”.

Conclusion

Madison was just as nice as I remember. Youthful. Liberal. Progressive. Thanks go to Deb Shapiro and Mark Beatty. They invited me to sit with them on the capitol lawn and listen to the local orchestra play Beatles music. The whole thing was very refreshing.

The trip back from the conference was a hellacious experience in air travel, but it did give me the chance to have an extended chat with Tim Spalding in the airport. We discussed statistics and statistical measures that can be applied to content we are generating. Many of the things he is doing with metadata I may be able to do with full text. The converse is true as well. Moreover, by combining our datasets we may find that the sum is greater than the parts — all puns intended. Both Tim and I agreed this is something we should both work towards. Afterwards I ate macaroni & cheese with a soft pretzel and a beer. It seemed apropos for Wisconsin.

This was my second or third time attending WiLSWorld. Like the previous meetings, the good folks at WiLS — specifically Tom Zilner, Mark Beatty, and Shirley Schenning — put together a conference providing librarians from across Wisconsin with a set of relatively inexpensive professional development opportunities. Timely presentations. Plenty of time for informal discussions. All in a setting conducive to getting away and thinking a bit outside the box. “Thank you.”