Archive for December, 2010

MIT’s SIMILE timeline widget

Monday, December 20th, 2010

For a good time, I took a stab at learning how to implement a MIT SIMILE timeline widget. This posting describes what I learned.

Background

The MIT SIMILE Widgets are a set of cool Javascript tools. There are tools for implementing “exhibits”, time plots, “cover flow” displays a la iTunes, a couple of other things, and interactive timelines. I have always had a fondness for timelines since college when I created one to help me study for my comprehensive examinations. Combine this interest with the rise of digital humanities and my belief that library data is too textual in nature, I decided to learn how to use the timeline widget. Maybe this tool can be used in Library Land?

timeline
Screen shot of local timeline implementation

Implementation

The family of SIMILE Widgets Web pages includes a number of sample timelines. By playing with the examples you can see the potencial of the tool. Going through the Getting Started guide was completely necessary since the Widget documentation has been written, re-written, and moved to other platforms numerous times. Needless to say, I found the instructions difficult to use. In a nutshell, using the Timeline Widget requires the developer to:

  1. load the libraries
  2. create and modify a timeline object
  3. create a data file
  4. load the data file
  5. render the timeline

Taking hints from “timelines in the wild“, I decided to plot my writings — dating from 1989 to the present. Luckily, just about all of them are available via RSS (Really Simple Syndication), and they include:

Consequently, after writing my implementation’s framework, the bulk of the work was spent converting RSS files into an XML file the widget could understand. In the end I:

  • created an HTML file complete with the widget framework
  • downloaded the totality of RSS entries from all my my RSS feeds
  • wrote a slightly different XSL file for each RSS feed
  • wrote a rudimentary shell script to loop through each XSL/RSS combination and create a data file
  • put the whole thing on the Web

You can see the fruits of these labors on a page called Eric Lease Morgan’s Writings Timeline, and you can download the source code — timeline-2010-12-20.tar.gz. From there a person can scroll backwards and forwards in time, click on events, read an abstract of the writing, and hyperlink to the full text. The items from the Water Collection work in the same way but also include a thumbnail image of the water. Fun!?

Take-aways

I have a number of take-aways. First, my implementation is far from perfect. For example, the dates from the Water Collection are not correctly formatted in the data file. Consequently, different Javascript interpreters render the dates differently. Specifically, the Water Collection links to not show up in Safari, but they do show up in Firefox. Second, the timeline is quite cluttered in some places. There has got to be a way to address this. Third, timelines are a great way to visualize events. From the implementation you can readily see what how often I was writing and on what topics. The presentation makes so much more sense compared to a simple list sorted by date, title, or subject terms.

Library “discovery systems” could benefit from the implementation of timelines. Do a search. Get back a list of results. Plot them on a timeline. Allow the learner, teacher, or scholar to visualize — literally see — how the results of their query compare to one another. The ability to visualize information hinges on the ability to quantify information characteristics. In this case, the quantification is a set of dates. Alas, dates in our information systems are poorly recorded. It seems as if we — the library profession — have made it difficult for ourselves to participate in the current information environment.

Illustrating IDCC 2010

Wednesday, December 8th, 2010

This posting illustrates the “tweets” assigned to the hash tag #idcc10.

I more or less just got back from the 6th International Data Curation Conference that took place in Chicago (Illinois). Somewhere along the line I got the idea of applying digital humanities computing techniques against the conference’s Twitter feed — hash tag #idcc10. After installing a Perl module implementing the Twitter API (Net::Twitter::Lite), I wrote a quick hack, fed the results to Wordle, and got the following word cloud:

idcc10

What sorts of conclusions can you make based on the content of the graphic?

The output static and rudimentary. What I’d really like to do is illustrate the tweets over time. Get the oldest tweets. Illustrate the result. Get the newer tweets. Update the illustration. Repeat for all the tweets. Done. In the end I see some sort of moving graphic where significant words represent bubbles. The size of the bubbles grow in size depending on number of times they are used. Each bubble is attached to other bubbles with a line representing associations. The color of the bubbles might represent parts of speech. Using this technique a person could watch the ebb and flow of the virtual conversation.

For a good time time, you can also download the Perl script used to create the textual output. Called twitter.pl, it is only forty-three lines long and many of those lines are comments.

Ruler & Compass by Andrew Sutton

Sunday, December 5th, 2010

I most thoroughly enjoyed reading and recently learning from a book called Ruler & Compass by Andrew Sutton.

The other day, while perusing the bookstore for a basic statistics book, I came across Ruler & Compass by Andrew Sutton. Having always been intrigued by geometry and the use of only a straight edge and compass to describe a Platonic cosmos, I purchased this very short book, a ruler, and a compass with little hesitation. I then rushed home to draw points, lines, and circles for the purposes of constructing angles, perpendiculars, bisected angles, tangents, all sorts of regular polygons, and combinations of all the above to create beautiful geometric patterns. I was doing mathematics, but not a single number was to be seen. Yes, I did create ratios but not with integers, and instead with the inherent lengths of lines. Facinating!

triangle
square pentagon
hexagon elipse “golden” ratio

Geometry is not a lot unlike both music and computer programming. All three supply the craftsman with a set of basic tools. Points. Lines. Circles. Tones. Durations. Keys. If-then statements. Variables. Outputs. Given these “things” a person is empowered to combine, compound, synthesize, analyze, create, express, and describe. They are mediums for both the artist and scientists. Using them effectively requires thinking as well as “thinquing“. All three are arscient processes.

Anybody could benefit by reading Sutton’s book and spending a few lovely hours practicing the geometric constructions contained therein. I especially recommend this activity to my fellow librarians. The process is not only intellectually stimulating but invigorating. Librarianship is not all about service or collections. It is also about combining and reconstituting core principles — collection, organization, preservation, and dissemination. There is an analogy to be waiting to be seen here. Reading and doing the exercises in Ruler & Compass will make this plainly visible.

Text mining Charles Dickens

Saturday, December 4th, 2010

This posting outlines how a person can do a bit of text mining against three works by Charles Dickens using a set of two Perl modules — Lingua::EN::Ngram and Lingua::Concordance.

Lingua::EN::Ngram

I recently wrote a Perl module called Lingua::EN::Ngram. Its primary purpose is to count all the ngrams (two-word phrases, three-word phrases, n-word phrases, etc.) in a given text. For two-word phrases (bigrams) it will order the output according to a statistical probability (t-score). Given a number of texts, it will count the ngrams common across the corpus. As of version 0.02 it supports non-ASCII characters making it possible to correctly read and parse a greater number of Romantic languages — meaning it correctly interprets characters with diacritics. Lingua::EN::Ngram is available from CPAN.

Lingua::Concordance

Concordances are just about the oldest of textual analysis tools. Originally developed in the Late Middle Ages to analyze the Bible, they are essentially KWIC (keyword in context) indexes used to search and display ngrams within the greater context of a work. Given a text (such as a book or journal article) and a query (regular expression), Lingua::Concordance can display the occurrences of the query in the text as well as map their locations across the entire text. In a previous blog posting I used Lingua::Concordance to compare & contrast the use of the phrase “good man” in the works of Aristotle, Plato, and Shakespeare. Lingua::Concordance too is available from CPAN.

Charles Dickens

In keeping with the season, I wondered about Charles Dickens’s A Christmas Carol. How often is the word “Christmas” used in the work and where? In terms of size, how does A Christmas Carol compare to some of other Dickens’s works? Are there sets of commonly used words or phrases between those texts?

Answering the first question was relatively easy. The word “Christmas” is occurs eighty-six (86) times, and twenty-two (22) of those occurrences are in the the first ten percent (10%) of the story. The following bar chart illustrates these facts:

bar chart

The length of books (or just about any text) measured in pages in ambiguous, at best. A much more meaningful measure is number of words. The following table lists the sizes, in words, of three Dickens stories:

story size in words
A Christmas Carol 28,207
Oliver Twist 156,955
David Copperfield 355,203

For some reason I thought A Christmas Carol was much longer.

A long time ago I calculated the average size (in words) of the books in my Alex Catalogue. Once I figured this out, I discovered I could describe items in the collection based on relative sizes. The following “dial” charts bring the point home. Each one of the books is significantly different in size:

christmas carol
A Christmas Carol
oliver twist
Oliver Twist
david copperfield
David Copperfield

If a person were pressed for time, then which story would you be able to read?

After looking for common ngrams between texts, I discovered that “taken with a violent fit of” appears both David Copperfield and A Christmas Carol. Interesting!? Moreover, the phrase “violent fit” appears on all three works. Specifically, characters in these three Dickens stories have violent fits of laughter, crying, trembling, and coughing. By concatenating the stories together and applying concordancing methods I see there are quite a number of violent things in the three stories:

  n such breathless haste and violent agitation, as seemed to betoken so
  ood-night, good-night!' The violent agitation of the girl, and the app
  sberne) entered the room in violent agitation. 'The man will be taken,
  o understand that, from the violent and sanguinary onset of Oliver Twi
  one and all, to entertain a violent and deeply-rooted antipathy to goi
  eep a little register of my violent attachments, with the date, durati
  cal laugh, which threatened violent consequences. 'But, my dear,' said
  in general, into a state of violent consternation. I came into the roo
  artly to keep pace with the violent current of her own thoughts: soon 
  ts and wiles have brought a violent death upon the head of one worth m
   There were twenty score of violent deaths in one long minute of that 
  id the woman, making a more violent effort than before; 'the mother, w
   as it were, by making some violent effort to save himself from fallin
  behind. This was rather too violent exercise to last long. When they w
   getting my chin by dint of violent exertion above the rusty nails on 
  en who seem to have taken a violent fancy to him, whether he will or n
  peared, he was taken with a violent fit of trembling. Five minutes, te
  , when she was taken with a violent fit of laughter; and after two or 
  he immediate precursor of a violent fit of crying. Under this impressi
  and immediately fell into a violent fit of coughing: which delighted T
  of such repose, fell into a violent flurry, tossing their wild arms ab
   and accompanying them with violent gesticulation, the boy actually th
  ght I really must have laid violent hands upon myself, when Miss Mills
   arm tied up, these men lay violent hands upon him -- by doing which, 
   every aggravation that her violent hate -- I love her for it now -- c
   work himself into the most violent heats, and deliver the most wither
  terics were usually of that violent kind which the patient fights and 
   me against the donkey in a violent manner, as if there were any affin
   to keep down by force some violent outbreak. 'Let me go, will you,--t
  hands with me - which was a violent proceeding for him, his usual cour
  en.' 'Well, sir, there were violent quarrels at first, I assure you,' 
  revent the escape of such a violent roar, that the abused Mr. Chitling
  t gradually resolved into a violent run. After completely exhausting h
  , on which he ever showed a violent temper or swore an oath, was this 
  ullen, rebellious spirit; a violent temper; and an untoward, intractab
  fe of Oliver Twist had this violent termination or no. CHAPTER III REL
  in, and seemed to presage a violent thunder-storm, when Mr. and Mrs. B
  f the theatre, are blind to violent transitions and abrupt impulses of
  ming into my house, in this violent way? Do you want to rob me, or to

These observations simply beg other questions. Is violence a common theme in Dickens works? What other adjectives are used to a greater or lesser degree in Dickens works? How does the use of these adjectives differ from other authors of the same time period or within the canon of English literature?

Summary

The combination of the Internet, copious amounts of freely available full text, and ubiquitous as well as powerful desktop computing, it is now possible to analyze texts in ways that was not feasible twenty years ago. While the application of computing techniques against texts dates back to at least Father Busa’s concordance work in the 1960s, it has only been in the last decade that digital humanities has come into its own. The application of digital humanities to library work offers great opportunities for the profession. Their goals are similar and their tools are complementary. From my point of view, their combination is a marriage made in heaven.

A .zip file of the texts and scripts used to do the analysis is available for you to download and experiment with yourself. Enjoy.

AngelFund4Code4Lib

Thursday, December 2nd, 2010

The second annual AngelFund4Code4Lib — a $1,500 stipend to attend Code4Lib 2011 — is now accepting applications.

These are difficult financial times, but we don’t want this to dissuade people from attending Code4Lib. [1] Consequently a few of us have gotten together, pooled our resources, and made AngelFund4Code4Lib available. Applying for the stipend is easy. In 500 words or less, write what you hope to learn at the conference and email it to angelfund4code4lib@infomotions.com. We will then evaluate the submissions and select the awardee. In exchange for the financial resources, and in keeping with the idea of giving back to the community, the awardee will be expected to write a travelogue describing their take-aways and post it to the Code4Lib mailing list.

The deadline for submission is 5 o’clock (Pacific Time), Thursday, December 17. The awardee will be announced no later than Friday, January 7.

Submit your application. We look forward to helping you out.

If you would like to become an “angel” too, then drop us a line. We’re open to possibilities.

P.S. Check out the additional Code4Lib scholarships. [2]

[1] Code4Lib 2011 – http://code4lib.org/conference/2011/
[2] addtional scholarships – http://bit.ly/dLGnnx

Eric Lease Morgan,
Michael J. Giarlo, and
Eric Hellman