Archive for November, 2010

Crowd sourcing the Great Books

Saturday, November 6th, 2010

This posting describes how crowd sourcing techniques are being used to determine the “greatness” of the Great Books.

The Great Books of the Western World is a set of books authored by “dead white men” — Homer to Dostoevsky, Plato to Hegel, and Ptolemy to Darwin. [1] In 1952 each item in the set was selected because the set’s editors thought the selections significantly discussed any number of their 102 Great Ideas (art, cause, fate, government, judgement, law, medicine, physics, religion, slavery, truth, wisdom, etc.). By reading the books, comparing them with one another, and discussing them with fellow readers, a person was expected to foster their on-going liberal arts education. Think of it as “life long learning” for the 1950s.

I have devised and implemented a mathematical model for denoting the “greatness” of any book. The model is based on term frequency inverse document frequency (TFIDF). It is far from complete, nor has it been verified. In an effort to address the later, I have created the Great Books Survey. Specifically, I am asking people to vote on which books they consider greater. If the end result is similar to the output of my model, then the model may be said to represent reality.

charts The survey itself is an implementation of the Condorcet method. (“Thanks Andreas.”) First, I randomly select one of the Great Ideas. I then randomly select two of the Great Books. Finally, I ask the poll-taker to choose the “greater” of the two books based on the given Great Idea. For example, the randomly selected Great Idea may be war, and the randomly selected Great Books may be Shakespeare’s Hamlet and Plato’s Republic. I then ask, “Which is book is ‘greater’ in terms of war?” The answer is recorded and an additional question is generated. The survey is never-ending. After 100’s of thousands of votes are garnered I hope too learn which books are the greatest because they got the greatest number of votes.

Because the survey results are saved in an underlying database, it is trivial to produce immediate feedback. For example, I can instantly return which books have been voted greatest for the given idea, how the two given books compare to the given idea, a list of “your” greatest books, and a list of all books ordered by greatness. For a good time, I am also geo-locating voters’ IP addresses and placing them on a world map. (“C’mon Antartica. You’re not trying!”)

map The survey was originally announced on Tuesday, November 2 on the Code4Lib mailing list, Twitter, and Facebook. To date it has been answered 1,247 times by 125 people. Not nearly enough. So far, the top five books are:

  1. Augustine’s City Of God And Christian Doctrine
  2. Cervantes’s Don Quixote
  3. Shakespeare’s Midsummer Nights Dream
  4. Chaucers’s Canterbury Tales And Other Poems
  5. Goethe’s Faust

There are a number of challenging aspects regarding the validity of the survey. For example, many people feel unqualified to answer some of the randomly generated questions because they have not read the books. My suggestion is, “Answer the question anyway,” because given enough votes randomly answered questions will cancel themselves out. Second, the definition of “greatness” is ambiguous. It is not intended to be equated with popularity but rather the “imaginative or intellectual content” the book exemplifies. [2] Put in terms of a liberal arts education, greatness is the degree a book discusses, defines, describes, or alludes to the given idea more than the other. Third, people have suggested I keep track of how many times people answer with “I don’t know and/or neither”. This is a good idea, but I haven’t implemented it yet.

Please answer the survey 10 or more times. It will take you less than 60 seconds if you don’t think about it too hard and go with your gut reactions. There are no such things as wrong answers. Answer the survey about 100 times, and you will may get an idea of what types of “great books” interest you most.

Vote early. Vote often.

[1] Hutchins, Robert Maynard. 1952. Great books of the Western World. Chicago: Encyclopedia Britannica.

[2] Ibid. Volume 3, page 1220.

Great Books data set

Saturday, November 6th, 2010

screenshot This posting makes the Great Books data set freely available.

As described previously, I want to answer the question, “How ‘great’ are the Great Books?” In this case I am essentially equating “greatness” with statistical relevance. Specifically, I am using the Great Books of the Western World’s list of “great ideas” as search terms and using them to query the Great Books to compute a numeric value for each idea based on term frequency inverse document frequency (TFIDF). I then sum each of the great idea values for a given book to come up with a total score — the “Great Ideas Coefficient”. The book with the largest Coefficient is then considered the “greatest” book. Along the way and just for fun, I have also kept track of the length of each book (in words) as well as two scores denoting each book’s reading level, and one score denoting each book’s readability.

The result is a canonical XML file named great-books.xml. This file, primarily intended for computer-to-computer transfer contains all the data outlined above. Since most data analysis applications (like databases, spreadsheets, or statistical packages) do not deal directly with XML, the data was transformed into a comma-separated value (CSV) file — great-books.csv. But even this file, a matrix of 220 rows and 104 columns, can be a bit unwieldily for the uninitiated. Consequently, the CSV file has been combined with a Javascript library (called DataTables) and embedded into an HTML for file general purpose use — great-books.htm.

The HTML file enables you to sort the matrix by column values. Shift click on columns to do sub-sorts. Limit the set by entering queries into the search box. For example:

  • sort by the last column (coefficient) and notice how Kant has written the “greatest” book
  • sort by the column labeled “love” and notice that Shakespeare has written seven (7) of the top ten (10) “greatest books” about love
  • sort by the column labeled “war” and notice that something authored by the United States is ranked #2 but also has very poor readability scores
  • sort by things like “angel” or “god”, then ask yourself, “Am I surprised at what I find?”

Even more interesting questions may be asked of the data set. For example, is their a correlation between greatness and readability? If a work has a high love score, then it is likely it will have a high (or low) score from one or more of the other columns? What is the greatness of the “typical” Great Book? Is this best represented as the average of the Great Ideas Coefficient or would it be better stated as the value of the mean of all the Great Ideas? In the case of the later, which books are greater than most, which books are typical, an which books are below typical? This sort of analysis, as well as the “kewl” Web-based implementation, is left up the the gentle reader.

Now ask yourself, “Can all of these sorts of techniques be applied to the principles and practices of librarianship, and if so, then how?”