How “great” is this article?

During Digital Humanities 2010 I participated in the THATCamp London Developers’ Challenge and tried to answer the question, “How ‘great’ is this article?” This posting outlines the functionality of my submission, links to a screen capture demonstrating it, and provides access to the source code.

screen captureGiven any text file — say an article from the English Women’s Journal — my submission tries to answer the question, “How ‘great’ is this article?” It does this by:

  1. returning the most common words in a text
  2. returning the most common bigrams in a text
  3. calculating a few readability scores
  4. comparing the texts to a standardized set of “great ideas”
  5. supporting a concordance for browsing

Functions #1, #2, #3, and #5 are relatively straight-forward and well-understood. Function #4 needs some explanation.

In the 1960’s a set of books was published called the Great Books. The set is based on a set of 102 “great ideas” (such as art, love, honor, truth, justice, wisdom, science, etc.). By summing the TFIDF scores of each of these ideas for each of the books, a “great ideas coefficient” can be computed. Through this process we find that Shakespeare wrote seven of the top ten books when it comes to love. Kant wrote the “greatest book”. The American State’s Articles of Confederation ranks the highest when it come to war. This “coefficient” can then be used as a standard — an index — for comparing other documents. This is exactly what this program does. (See the screen capture for a demonstration.)

The program can be improved a number of ways:

  1. it could be Web-based
  2. it could process non-text files
  3. it could graphically illustrate a text’s “greatness”
  4. it could hyperlink returned words directly to the concordance

Thanks to Gerhard Brey and the folks of the Nineteenth Century Serials Editions for providing the data. Very interesting.

Tags: , ,

3 Responses to “How “great” is this article?”

  1. Hi Eric,

    very interesting post and demonstration. One suggestion: maybe you should use a synonym dictionary (and a formal taxanomy) to transform those words in the text which has synomys for the words in the __DATA__ section. Eg. algebra is a child concept of mathematics, but if the text won’t mention math, just algebra, the text won’t match the idea.

    Regards,
    Péter

  2. Péter, yes, you are exactly correct, and in the one of the next iterations of my informal Great Books/Ideas Project I plan to implement exactly the sort of thing you suggest. Thank you.


    ELM