Digital Humanities 2010: A Travelogue

I was fortunate enough to attend Digital Humanities 2010 (London, England) between July 4th and 10th. This posting documents my experiences and take-aways. In a sentence, the conference provided much-needed intellectual stimulation and challenges, and it validated the soundness of my current research surrounding the Great Books.

Pre-conference activities

All day Monday, July 5, I participated in a workshop called Text mining in the digital humanities facilitated by Marco Büchler, et al. of the University of Leipzig. A definition of “e-humanities” was given: “The application of computer science to do qualitative evaluation of texts without the use of things like TEI.” I learned that graphing texts illustrates concepts quickly: “A picture is worth a thousand words.” I also learned I should consider creating co-occurrence graphs, pictures illustrating which words co-occur with a given word. Finally, according to the Law of Least Effort, the strongest content words in a text are usually neither the ones occurring most frequently nor the ones occurring least frequently, but rather the words occurring somewhere in between. One useful quote was, “Text mining allows one to search even without knowing any search terms.” Much of this workshop’s content came from the eAQUA Project.
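
The data behind a co-occurrence graph is just a table of counts. Here is a minimal Python sketch of how such counts might be gathered. It is not the workshop's or eAQUA's method; the function name, the five-word window, and the simple regular-expression tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter

def cooccurrences(text, target, window=5):
    """Count the words found within `window` tokens of each occurrence of `target`.

    Hypothetical helper: the tokenizer and the window size are illustrative choices.
    """
    target = target.lower()
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts
```

Each (target, neighbor, count) triple could then be drawn as a weighted edge by whatever graphing tool is at hand.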

On Tuesday I attended the first half of a THATCamp led by Dan Cohen (George Mason University) where I learned THATCamps are expected to be: 1) fun, 2) productive, and 3) collegial. The whole thing came off as a “bar camp” for scholarly conferences. As a part of the ‘Camp I elected to participate in the Developer’s Challenge and submitted an entry called “How ‘great’ is this article?”. My hack compared texts from the English Women’s Journal to the Great Books Coefficient in order to determine “greatness”. My entry did not win. Instead the prize went to Patrick Juola with honorable mentions going to Loretta Auvil, Marco Büchler, and Thomas Eckart.

Wednesday morning I learned more about text mining in a workshop called Introduction to text analysis using JiTR and Voyeur led by Stéfan Sinclair (McMaster University) and Geoffrey Rockwell (University of Alberta). The purpose of the workshop was “to learn how to integrate text analysis into a scholar’s/researcher’s workflow.” More specifically, we learned how to use a tool called Voyeur, an evolution of TAPoR. The “kewlest” thing I learned was the definition of word density, (U / W) × 1000, where U is the total number of unique words in a text and W is the total number of words in a text. The closer the result is to 1000, the richer and more dense a text is. In general, denser documents are more difficult to read. (For a good time, I wrote a program to compute density given an arbitrary plain text file.)
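
The density formula is simple enough to sketch in a few lines. What follows is a minimal Python illustration, not the program mentioned above; the function name and the regular-expression tokenizer (lowercased runs of letters and apostrophes) are assumptions, and a different definition of "word" will give different numbers.

```python
import re

def word_density(text):
    """Return (U / W) * 1000, where U = unique words and W = total words.

    The tokenizer is an illustrative assumption, not Voyeur's.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    return 1000 * len(set(words)) / len(words)

# "the cat and the dog" has 4 unique words out of 5 total
print(word_density("the cat and the dog"))  # 800.0
```

Note that short texts score artificially high because they have little room for repetition, so comparisons probably only make sense between texts of similar length.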

In keeping with the broad definition of humanities, I was “seduced” in the afternoon by listening to recordings from a website called CHARM (Center for History and Analysis of Recorded Music). The presentation described and presented digitized classical music from the very beginnings of recorded music. All apropos since the BBC was located just across the street from King’s College where the conference took place. When this was over we retired to the deck for tea and cake. There I learned of the significant differences in recording time between 10″ and 12″ 78 rpm records. As with many media, recording artists needed to make accommodations accordingly.

Plenty of presentations

The conference officially began Wednesday evening and ended Saturday afternoon. According to my notes, I attended as many as eighteen sessions. (Wow!?) Listed below are summaries of most of the ones I attended:

  • Charles Henry (Council on Library and Information Resources) and Hold up a mirror – In this keynote presentation Henry compared & contrasted manifestations (oral, written, and digital) of Homer, Beowulf, and a 9-volume set of religious ceremonies compiled in the 18th century. He then asked the question, “How can machines be used to capture the interior of the working mind?” Or, in my own words, “How can computers be used to explore the human condition?” The digital versions of the items listed above were used as example answers, and a purpose of the conference was to address this question in other ways. He said, “There are many types of performance, preservation, and interpretation.”
  • Patrick Juola (Duquesne University) and Distant reading and mapping genre space via conjecture-based distance measures – Juola began by answering the question, “What do you do with a million books?”, and enumerated a number of things: 1) search, 2) summarize, 3) sample, and 4) visualize. These sorts of processes against texts are increasingly called “distant reading” and are contrasted with the more traditional “close reading”. He then went on to describe his “Conjecturator”, a system where assertions are randomly generated and then evaluated. He demonstrated this technique against a set of Victorian novels. His presentation was not dissimilar to the presentation he gave at a digital humanities conference in Chicago the previous year.
  • Jan Rybicki (Pedagogical University) and Deeper delta across genres and language: Do we really need the most frequent words? – In short Rybicki said, “Doing simple frequency counts [to do authorship analysis] does not work very well for all languages, and we are evaluating ‘deeper deltas’”, an allusion to the work of J.F. Burrows and D.L. Hoover. Specifically, using a “moving window” of stop words he looked for similarities in authorship between a number of texts and believes his technique has proved to be more or less successful.
  • David Holmes (College of New Jersey) and The Diary of a public man: A Case study in traditional and non-traditional author attribution – Soon after the Civil War a book called The Diary Of A Public Man was written by an anonymous author. Using stylometric techniques, Holmes asserts the work really was written as a diary and was authored by William Hurlbert.
  • David Hoover (New York University) and Teasing out authorship and style with t-tests and zeta – Hoover used t-tests and Zeta tests to validate whether or not a particular author finished a particular novel from the 1800s. Using these techniques he was able to illustrate writing styles and how they changed dramatically between one chapter in the book and another. He asserted that such analysis would have been extremely difficult through rudimentary casual reading.
  • Martin Holmes (University of Victoria) and Using the universal similarity metric to map correspondences between witnesses – Holmes described how he was comparing the similarity between texts through the use of a compression algorithm. Compress texts. Compare their resulting lengths. The closer the lengths, the greater the similarity. The process works for a variety of file types and languages, even when there is no syntactical knowledge.
  • Dirk Roorda (Data Archiving and Networked Services) and The Ecology of longevity: The Relevance of evolutionary theory for digital preservation – Roorda drew parallels between biology and preservation. For example, biological systems use and retain biological characteristics. Preservation systems re-use and thus preserve content. Biological systems make copies and evolve. Preservation can be about migrating formats forward thus creating different forms. Biological systems employ sexual selection. “Look how attractive I am.” Repositories or digital items displaying “seals of approval” function similarly. Finally, he went on to describe how these principles could be integrated in a preservation system where fees are charged for storing content and providing access to it. He emphasized such systems would not necessarily be designed to handle intellectual property rights.
  • Lewis Ulman (Ohio State University) & Melanie Schlosser (Ohio State University) and The Specimen case and the garden: Preserving complex digital objects, sustaining digital projects – Ulman and Schlosser described a dichotomy manifesting itself in digital libraries. On one hand there is a practical need for digital library systems to be similar to each other because “boutique” systems are very expensive to curate and maintain. At the same time specialized digital library applications are needed because they represent the frontiers of research. How to accommodate both, that was their question. “No one group (librarians, information technologist, faculty) will be able to do preservation alone. They need to work together. Specifically, they need to connect, support, and curate.”
  • George Buchanan (City University) and Digital libraries of scholarly editions – Similar to Ulman/Schlosser above, Buchanan said, “It is difficult to provide library services against scholarly editions because each edition is just too much different from the next to create a [single] system.” He advocated the Greenstone digital library system.

  • Joe Raben (Queens College of the City University of New York) and Humanities computing in an age of social change – In this presentation, given after being honored with the community’s Busa Award, Raben first outlined the history of the digital humanities. It included the work done by Father Busa who collaborated with IBM in the 1960s to create a concordance against some of Thomas Aquinas’s work. It included a description of a few seminal meetings and the founding of the Computers and the Humanities journal. He alluded to “machine readable texts”, a term which is no longer in vogue but reminded me of “machine readable cataloging” (MARC) and how the library profession has not moved on. He advocated for a humanities wiki where ideas and objects could be shared. It sounded a lot like the website. He discussed the good work of a Dante project hosted at Princeton University, and I was dismayed because Notre Dame’s significant collection of Dante materials has not played a role in this particular digital library. A humanist through and through, he said, “Computers are increasingly controlling our lives and the humanities have not affected how we live in the same way.” To this I say, computers represent close trends compared to the more engrained values of the human condition. The former are quick to change, the latter change oh so very slowly yet they are more pervasive. Compared to computer technology, I believe the humanists have had more long-lasting effects on the human condition.
  • Lynne Siemens (University of Victoria) and A Tale of two cities: Implications of the similarities in collaborative approaches within the digital libraries and digital humanities communities – Siemens reported on the results of a survey in an effort to determine how and why digital librarians and digital humanists collaborate. “There are cultural differences between librarians and academics, but teams [including both] are necessary. The solution is to assume the differences rather than the similarities. Everybody brings something to the team.”
  • Fenella France (Library of Congress) and Challenges of linking digital heritage scientific data with scholarly research: From navigation to politics – France described some of the digital scanning processes of the Library of Congress, and some of the consequences. For example, their technique allowed archivists to discover how Thomas Jefferson wrote, crossed out, and then replaced the word “subjects” with “citizens” in a draft of the Declaration of Independence. A couple of interesting quotes included, “We get into the optical archeology of the documents”, and “Digitization is access, not preservation.”
  • Joshua Sternfeld (National Endowment for the Humanities) and Thinking archivally: Search and metadata as building blocks for a new digital historiography – Sternfeld advocated for different sets of digital library evaluation. “There is a need for more types of reviews against digital resource materials. We need a method for doing: selection, search, and reliability… The idea of provenance — the order of document creation — needs to be implemented in the digital realm.”
  • Wendell Piez (Mulberry Technologies, Inc.) and Towards hermeneutic markup: An Architectural outline – Hermeneutic markup consists of annotations against a text that are purely about interpretation. “We don’t really have the ability to do hermeneutic markup… Existing schemas are fine, but every once in a while exceptions need to be made and such things break the standard.” Numerous times Piez alluded to the “overlap problem”, the inability to demarcate something crossing the essentially strict hierarchical nature of XML elements. Textual highlighting is a good example. Piez gave a few examples of how the overlap problem might be resolved and how hermeneutic markup may be achieved.
  • Jane Hunter (University of Queensland) and The Open Annotation collaboration: A Data model to support sharing and interoperability of scholarly annotations – Working with a number of other researchers, Hunter said, “The problem is that there is an extraordinarily wide variety of tools, lack of consistency, no standards, and no sharable interoperability when it comes to Web-based annotation.” Their goal is to create a data model to enable such functionality. While the model is not complete, it is being based on RDF, SANE, and OATS. See
  • Susan Brown (University of Alberta and University of Guelph) and How do you visualize a million links? – Brown described a number of ways she is exploring visualization techniques. Examples included link graphs, tag clouds, bread board searches, cityscapes, and something based on “six degrees of separation”.
  • Lewis Lancaster (University of California, Berkeley) and From text to image to analysis: Visualization of Chinese Buddhist canon – Lancaster has been doing research against a (huge) set of Korean glyphs for quite a number of years. Just like other writing techniques, the glyphs change over time. Through the use of digital humanities computing techniques, he has been able to discover much more quickly patterns and bigrams that he was not able to discover previously. “We must present our ideas as images because language is too complex and takes too much time to ingest.”
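
The compression idea from Martin Holmes's session above lends itself to a short sketch. The Python below uses the normalized compression distance, a common practical stand-in for the universal similarity metric. It is not necessarily what Holmes implemented; the choice of zlib and the function names are assumptions made for illustration.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest similar texts; values near 1 suggest unrelated ones.
    zlib is an illustrative choice of compressor.
    """
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # shared content compresses away here
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because the compressor knows nothing about grammar or vocabulary, the same function could be pointed at texts in any language. zlib only looks back over a 32 KB window, though, so longer witnesses would probably call for a different compressor.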

In the spirit of British fast food, I have a number of take-aways. First and foremost, I learned that my current digital humanities research into the Great Books is right on target. It asks questions of the human condition and tries to answer them through the use of computing techniques. This alone was worth the total cost of my attendance.

Second, as a relative outsider to the community, I perceived a pervasive us-versus-them mentality being described. Us digital humanists and those traditional humanists. Us digital humanists and those computer programmers and systems administrators. Us digital humanists and those librarians and archivists. Us digital humanists and those academic bureaucrats. If you consider yourself a digital humanist, then please don’t take this observation the wrong way. I believe communities inherently do this as a matter of fact. It is a process used to define one’s self. The heart of much of this particular differentiation seems to be yet another example of C.P. Snow’s The Two Cultures. As a humanist myself, I identify with the perception. I think the processes of art and science complement each other; they neither contradict nor conflict. A balance of both is needed in order to adequately create a cosmos out of the apparent chaos of our existence, a concept I call arscience.

Third, I had ample opportunities to enjoy myself as a tourist. The day I arrived I played frisbee disc golf with a few “cool dudes” at Lloyd Park in Croydon. On the Monday I went to the National Theater and saw Welcome to Thebes, a depressing tragedy where everybody dies. On the Tuesday I took in Windsor Castle. Another day I carried my Culver Citizen newspaper to have its photograph taken in front of Big Ben. Throughout my time there I experienced interesting food, a myriad of languages & cultures, and the almost overwhelming size of London. Embarrassingly, I had forgotten how large the city really is.

Finally, I actually enjoyed reading the formally published conference abstracts — all three pounds and 400 pages of it. It was thorough, complete, and even included an author index. More importantly, I discovered more than a few quotes supporting an idea for library systems that I have been calling “services against texts”:

The challenge is to provide the researcher with a means to perceiving or specifying subsets of data, extracting the relevant information, building the nodes and edges, and then providing the means to navigate the vast number of nodes and edges. (Susan Brown in “How do you visualize a million links” on page 106)

However, current DL [digital library] systems lack critical features: they have too simple a model of documents, and lack scholarly apparatus. (George Buchanan in “Digital libraries of scholarly editions” on page 108.)

This approach takes us to the what F. Moretti (2005) has termed ‘distant reading,’ a method that stresses summarizing large bodies of text rather than focusing on a few texts in detail. (Ian Gregory in “GIS, texts and images: New approaches to landscape appreciation in the Lake District” on page 159).

And the best quote is:

In smart digital libraries, a text should not only be an object but a service: not a static entity but an interactive method. The text should be computationally exploitable so that it can be sampled and used, not simply reproduced in its entirety… the reformulation of the dictionary not as an object, but a service. (Toma Tasovac in “Reimagining the dictionary, or why lexicography needs digital humanities” on page 254)

In conclusion, I feel blessed to have been able to attend the conference. I learned a lot, and I will recommend it to any librarian or humanist.

How “great” is this article?

During Digital Humanities 2010 I participated in the THATCamp London Developers’ Challenge and tried to answer the question, “How ‘great’ is this article?” This posting outlines the functionality of my submission, links to a screen capture demonstrating it, and provides access to the source code.

Given any text file, say an article from the English Women’s Journal, my submission tries to answer the question, “How ‘great’ is this article?” It does this by:

  1. returning the most common words in a text
  2. returning the most common bigrams in a text
  3. calculating a few readability scores
  4. comparing the texts to a standardized set of “great ideas”
  5. supporting a concordance for browsing

Functions #1, #2, #3, and #5 are relatively straightforward and well understood. Function #4 needs some explanation.

In the 1960s a set of books was published called the Great Books. The set is based on a set of 102 “great ideas” (such as art, love, honor, truth, justice, wisdom, science, etc.). By summing the TFIDF scores of each of these ideas for each of the books, a “great ideas coefficient” can be computed. Through this process we find that Shakespeare wrote seven of the top ten books when it comes to love. Kant wrote the “greatest book”. The American State’s Articles of Confederation ranks the highest when it comes to war. This “coefficient” can then be used as a standard, an index, for comparing other documents. This is exactly what this program does. (See the screen capture for a demonstration.)
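
For the curious, the summing step can be sketched as follows. This is a minimal Python illustration and not the submission's source code. The particular TFIDF weighting (relative term frequency times the natural log of inverse document frequency), the tokenizer, and every name below are assumptions.

```python
import math
import re

def tokenize(text):
    """Lowercase a text and split it into words (an illustrative tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def great_ideas_coefficient(corpus, ideas):
    """Return {title: sum of TFIDF scores over `ideas`} for every document.

    `corpus` maps titles to plain text; `ideas` is a list of single words.
    """
    tokens = {title: tokenize(text) for title, text in corpus.items()}
    n_docs = len(tokens)
    # document frequency: the number of documents containing each idea
    df = {idea: sum(1 for words in tokens.values() if idea in words)
          for idea in ideas}
    scores = {}
    for title, words in tokens.items():
        total = 0.0
        for idea in ideas:
            if df[idea] == 0 or not words:
                continue  # idea never appears, or the document is empty
            tf = words.count(idea) / len(words)
            idf = math.log(n_docs / df[idea])
            total += tf * idf
        scores[title] = total
    return scores
```

With this weighting an idea appearing in every document contributes nothing, since log(1) is zero. The coefficient therefore rewards documents that dwell on ideas the rest of the corpus touches only lightly.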

The program can be improved in a number of ways:

  1. it could be Web-based
  2. it could process non-text files
  3. it could graphically illustrate a text’s “greatness”
  4. it could hyperlink returned words directly to the concordance

Thanks to Gerhard Brey and the folks of the Nineteenth Century Serials Editions for providing the data. Very interesting.