This posting describes how I am assigning quantitative characteristics to texts in an effort to answer the question, “How ‘great’ are the Great Books?” In the end I make a plea for library science.
With the advent of copious amounts of freely available plain text on the ‘Net comes the ability to “read” entire corpora with a computer and apply statistical processes against the result. In an effort to explore the feasibility of this idea, I am spending time answering the question, “How ‘great’ are the Great Books?”
More specifically, I want to assign quantitative characteristics to each of the “books” in the Great Books set, look for patterns in the result, and see whether or not I can draw any conclusions about the corpus. If such processes prove effective, then the same processes may be applicable to other corpora such as collections of scholarly journal articles, blog postings, mailing list archives, etc. If I get this far, then I hope to integrate these processes into traditional library collections and services in an effort to support their continued relevancy.
On my mark. Get set. Go.
Assigning quantitative characteristics to texts
The Great Books set posits 102 “great ideas”: basic, foundational themes running through the heart of Western civilization. Each of the books in the set was selected for inclusion by the way it expressed the essence of these great ideas. The ideas are grand and ambiguous. They include words such as angel, art, beauty, courage, desire, eternity, god, government, honor, idea, physics, religion, science, space, time, wisdom, etc. (See Appendix B of “How ‘great’ are the Great Books?” for the complete list.)
In a previous posting, “Great Ideas Coefficient”, I outlined the measure I propose to use to determine the books’ “greatness”: essentially a sum of all TFIDF (term frequency / inverse document frequency) scores as calculated against the list of great ideas. TFIDF is defined as:
( c / t ) * log( d / f )
- c = number of times a given word appears in a document
- t = total number of words in a document
- d = total number of documents in a corpus
- f = total number of documents containing a given word
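To make the arithmetic concrete, here is a minimal sketch of the formula in Python (used here purely for illustration). The function name, the example counts, the choice of the natural logarithm, and the decision to return zero when a word appears in no document are all assumptions of the sketch, not something specified above.

```python
import math


def tfidf(c: int, t: int, d: int, f: int) -> float:
    """Return ( c / t ) * log( d / f ) for one word in one document.

    c = times the word appears in the document
    t = total words in the document
    d = total documents in the corpus
    f = documents containing the word
    """
    if t == 0 or f == 0:
        # Assumption: an empty document, or a word found in no
        # document at all, simply scores zero instead of raising.
        return 0.0
    # Assumption: natural logarithm; the base only rescales the scores.
    return (c / t) * math.log(d / f)


# Hypothetical counts: a word used 25 times in a 100,000-word book,
# found in 150 of the 223 documents in the corpus.
score = tfidf(c=25, t=100_000, d=223, f=150)
print(score)
```

Note that a word appearing in every document (f equal to d) scores zero no matter how often it is used, because log( d / d ) is zero.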
Thus, the problem boils down to 1) determining the values for c, t, d, and f for a given great idea, 2) summing the resulting TFIDF scores, 3) saving the results, and 4) repeating the process for each book in the corpus. Here, more exactly, is how I am initially doing such a thing:
1. Build corpus – In a previous posting, “Collecting the Great Books”, I described how I first collected 223 of the roughly 250 Great Books.
2. Index corpus – Calculating the TFIDF values of c and t is trivial because any number of computer programs do such a thing quickly and readily. In our case, the value of d is a constant: 223. On the other hand, trivial methods for determining the number of documents containing a given word (f) do not scale as the size of a corpus increases. Because an index is essentially a list of words combined with pointers to where the words can be found, an index proves to be a useful tool for determining the value of f. Index a corpus. Search the index for a word. Get back the number of hits and use it as the value for f. Lucene is currently the gold standard when it comes to open source indexers. Solr, an enhanced and Web Services-based interface to Lucene, is the indexer used in this process. The structure of the local index is rudimentary: id, author, title, URL, and full text. Each of the metadata values is pulled out of a previously created index file, great-books.xml, while the full text is read from the file system. The whole lot is then stuffed into Solr. A program called index.pl does this work. Another program called search.pl was created simply for testing the validity of the index.
3. Count words and determine readability – A Perl module called Lingua::EN::Fathom does a nice job of counting the number of words in a file, thus providing me with a value for t. Along the way it also calculates a number of “readability” scores, values used to estimate the education level a person needs to understand a given text. While I had “opened the patient” I figured it would be a good idea to take note of this information. Given the length of a book as well as its readability scores, I can answer questions such as, “Are longer books more difficult to read?” Later on, given my Great Ideas Coefficient, I will be able to answer questions such as “Is the length of a book a determining factor in ‘greatness’?” or “Are ‘great’ books more difficult to read?”
4. Calculate TFIDF – This is the fuzziest and most difficult part of the measurement process. Using Lingua::EN::Fathom again I find all of the unique words in a document, stem them with Lingua::Stem::Snowball, and calculate the number of times each stem occurs. This gives me a value for c. I then loop through the great ideas, stem each one, and search the index for the stem, thus returning a value for f. For each idea I now have values for c, t, d, and f, enabling me to calculate TFIDF: ( c / t ) * log( d / f ).
5. Calculate the Great Ideas Coefficient – This is trivial. Keep a running sum of all the great idea TFIDF scores.
6. Go to Step #4 – Repeat this process for each of the 102 great ideas.
7. Save – After all the various scores (number of words, readability scores, TFIDF scores, and Great Ideas Coefficient) have been calculated I save each to my pseudo database file called great-ideas.xml. Each is stored as an attribute associated with a book’s unique identifier. Later I will use the contents of this file as the basis of my statistical analysis.
8. Go to Step #3 – Repeat this process for each book in the corpus, which in this case means 223 times.
Of course I didn’t do all of this by hand, and the program I wrote to do the work is called measure.pl.
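For readers who want to see the shape of the loop, the steps above can be sketched in a few lines of Python. This is not measure.pl. It is a toy under stated assumptions: a plain dictionary stands in for the Solr index, the three "books" and the short list of ideas are made up, the function names are hypothetical, stemming and readability scores are left out, and the natural logarithm is assumed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(corpus):
    """Map each word to the set of document ids containing it.

    A stand-in for the real indexer: the number of "hits" for a
    word is the size of its set, which gives the value of f.
    """
    index = {}
    for doc_id, text in corpus.items():
        for word in set(tokenize(text)):
            index.setdefault(word, set()).add(doc_id)
    return index


def great_ideas_coefficient(text, ideas, index, d):
    """Sum the TFIDF scores of every great idea for one document."""
    words = tokenize(text)
    t = len(words)            # total words in the document
    counts = Counter(words)   # c for every word in the document
    total = 0.0
    for idea in ideas:
        c = counts[idea]
        f = len(index.get(idea, ()))
        if c and f:
            total += (c / t) * math.log(d / f)
    return total


# Made-up corpus and idea list, for illustration only.
corpus = {
    "book-1": "Wisdom and courage are the beauty of the soul.",
    "book-2": "The art of government requires courage.",
    "book-3": "Time and space are the subject of physics.",
}
ideas = ["wisdom", "courage", "beauty", "art", "time"]

index = build_index(corpus)
d = len(corpus)
scores = {
    doc_id: great_ideas_coefficient(text, ideas, index, d)
    for doc_id, text in corpus.items()
}
for doc_id, score in sorted(scores.items()):
    print(doc_id, round(score, 4))
```

The dictionary-of-sets index answers the "how many documents contain this word?" question with a single lookup, which is the same reason a real indexer is useful once the corpus grows.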
The result is my pseudo database file, great-books.xml. This is my data set. It keeps track of all my information in a human-readable, application- and operating system-independent manner. Very nice. If there is only one file you download from this blog posting, then it should be this file. Using it you will be able to create your own corpus and do your own analysis.
The process outlined above is far from perfect. First, there are a few false negatives. For example, the great idea “universe” returned a TFIDF value of zero (0) for every document. Obviously this is incorrect, and I think the error has something to do with the stemming and/or indexing subprocesses. Second, the word “being”, as calculated by TFIDF, is far and away the “greatest” idea. I believe this is because the word “being” is… being counted as both a noun and a verb. This points to a different problem: the ambiguity of the English language. While all of these issues will admittedly skew the final results, I do not think they negate the possibility of meaningful statistical investigation. At the same time it will be necessary to refine the measurement process to reduce the number of “errors”.
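One simple way to hunt for false negatives like “universe” is to flag every idea whose score is zero in every document. The Python sketch below does only that; it does not diagnose the cause. The function name and the scores are hypothetical, invented for illustration, and are not taken from my data set.

```python
def ideas_never_scored(scores_by_book):
    """Return the ideas whose TFIDF score is zero in every book."""
    all_ideas = set()
    for scores in scores_by_book.values():
        all_ideas.update(scores)
    return sorted(
        idea
        for idea in all_ideas
        if all(
            scores.get(idea, 0.0) == 0.0
            for scores in scores_by_book.values()
        )
    )


# Made-up per-idea scores for two books, for illustration only.
scores_by_book = {
    "book-1": {"being": 0.0042, "universe": 0.0, "wisdom": 0.0007},
    "book-2": {"being": 0.0031, "universe": 0.0, "wisdom": 0.0},
}

suspects = ideas_never_scored(scores_by_book)
print(suspects)
```

Any idea on the resulting list is worth checking by hand against the index before trusting the coefficient.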
Measurement, the humanities, and library science
Measurement is one of the fundamental qualities of science. The work of Archimedes is the prototypical example. Kepler and Galileo took the process to another level. Newton brought it to full flower. Since Newton, the use of measurement (the assignment of mathematical values) applied against observations of the natural world and human interactions has given rise to the physical and social sciences. Unlike studies in the humanities, science is repeatable and independently verifiable. It is objective. Such is not a value judgment, merely a statement of fact. While the sciences seem cold, hard, and dry, the humanities are subjective, appeal to our spirit, give us a sense of purpose, and tend to synthesize our experiences into a meaningful whole. Both the scientific and humanistic thinking processes are necessary for us to make sense of the world around us. I call these combined processes “arscience”.
The library profession could benefit from the greater application of measurement. In my opinion, too many of the profession’s day-to-day as well as strategic decisions are based on anecdotal evidence and gut feelings. Instead of basing our actions on data, we base them on tradition. “This is the way we have always done it.” This is medieval, and consequently, change comes very slowly. I sincerely believe libraries are not going away any time soon, but I do think the profession will remain relevant longer if librarians do two things: 1) truly exploit the use of computers, and 2) base a greater number of their decisions on data, that is, on measurement, as opposed to opinion. Let’s call this library science.