This posting outlines how a person can do a bit of text mining against three works by Charles Dickens using a set of two Perl modules — Lingua::EN::Ngram and Lingua::Concordance.
I recently wrote a Perl module called Lingua::EN::Ngram. Its primary purpose is to count all the ngrams (two-word phrases, three-word phrases, n-word phrases, etc.) in a given text. For two-word phrases (bigrams) it will order the output according to a statistical measure (t-score). Given a number of texts, it will count the ngrams common across the corpus. As of version 0.02 it supports non-ASCII characters, meaning it correctly interprets characters with diacritics and can therefore read and parse a greater number of Romance languages. Lingua::EN::Ngram is available from CPAN.
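For readers who want to see the idea rather than the module, here is a minimal sketch of ngram counting and bigram t-scores. It is written in Python, not Perl, and it is not Lingua::EN::Ngram's code. The function names (`tokenize`, `count_ngrams`, `bigram_tscores`) and the tokenizing rule are assumptions made for illustration, and the t-score uses the common textbook approximation (observed minus expected, divided by the square root of observed), which may differ in detail from what the module computes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words.

    A word is a run of Unicode letters or digits, optionally with
    internal straight apostrophes, so accented characters are kept.
    """
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_tscores(tokens):
    """Score each bigram with t = (observed - expected) / sqrt(observed).

    `expected` is how often the pair would occur if the two words were
    independent: count(first) * count(second) / total tokens.
    """
    total = len(tokens)
    words = Counter(tokens)
    scores = {}
    for (first, second), observed in count_ngrams(tokens, 2).items():
        expected = words[first] * words[second] / total
        scores[(first, second)] = (observed - expected) / math.sqrt(observed)
    return scores
```

Sorting the dictionary returned by `bigram_tscores` by value, highest first, gives a list of bigrams ordered by how much more often they occur together than chance would suggest.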
Concordances are just about the oldest of textual analysis tools. Originally developed in the Late Middle Ages to analyze the Bible, they are essentially KWIC (keyword in context) indexes used to search and display ngrams within the greater context of a work. Given a text (such as a book or journal article) and a query (regular expression), Lingua::Concordance can display the occurrences of the query in the text as well as map their locations across the entire text. In a previous blog posting I used Lingua::Concordance to compare & contrast the use of the phrase “good man” in the works of Aristotle, Plato, and Shakespeare. Lingua::Concordance too is available from CPAN.
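The two things a concordance does, displaying matches in context and mapping where they fall in the text, can be sketched in a few lines. This is an illustrative Python sketch, not Lingua::Concordance's code; the function names `kwic` and `location_map`, the default context width, and the number of bins are all assumptions.

```python
import re


def kwic(text, pattern, width=28):
    """Return one line per match: left context, the match, right context.

    The left context is right-justified to `width` characters so that
    the matched text starts in the same column on every line.
    """
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    lines = []
    for match in re.finditer(pattern, flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines


def location_map(text, pattern, bins=10):
    """Count the matches that start in each of `bins` equal slices of the text."""
    counts = [0] * bins
    length = max(1, len(text))  # avoid dividing by zero on empty input
    for match in re.finditer(pattern, text):
        counts[min(bins - 1, match.start() * bins // length)] += 1
    return counts
```

With ten bins, the first number returned by `location_map` is the count of matches in the first ten percent of the text, the second is the count in the next ten percent, and so on, which is the raw material for a bar chart of where a word is used.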
In keeping with the season, I wondered about Charles Dickens’s A Christmas Carol. How often is the word “Christmas” used in the work and where? In terms of size, how does A Christmas Carol compare to some of Dickens’s other works? Are there sets of commonly used words or phrases between those texts?
Answering the first question was relatively easy. The word “Christmas” occurs eighty-six (86) times, and twenty-two (22) of those occurrences are in the first ten percent (10%) of the story. The following bar chart illustrates these facts:
The length of books (or just about any text) measured in pages is ambiguous, at best. A much more meaningful measure is number of words. The following table lists the sizes, in words, of three Dickens stories:
|story|size in words|
|A Christmas Carol|28,207|
For some reason I thought A Christmas Carol was much longer.
A long time ago I calculated the average size (in words) of the books in my Alex Catalogue. Once I figured this out, I discovered I could describe items in the collection based on relative sizes. The following “dial” charts bring the point home. Each one of the books is significantly different in size:
A Christmas Carol
If you were pressed for time, which story would you be able to read?
After looking for common ngrams between texts, I discovered that “taken with a violent fit of” appears in both David Copperfield and A Christmas Carol. Interesting!? Moreover, the phrase “violent fit” appears in all three works. Specifically, characters in these three Dickens stories have violent fits of laughter, crying, trembling, and coughing. By concatenating the stories together and applying concordancing methods I see there are quite a number of violent things in the three stories:
n such breathless haste and violent agitation, as seemed to betoken so
ood-night, good-night!' The violent agitation of the girl, and the app
sberne) entered the room in violent agitation. 'The man will be taken,
o understand that, from the violent and sanguinary onset of Oliver Twi
one and all, to entertain a violent and deeply-rooted antipathy to goi
eep a little register of my violent attachments, with the date, durati
cal laugh, which threatened violent consequences. 'But, my dear,' said
in general, into a state of violent consternation. I came into the roo
artly to keep pace with the violent current of her own thoughts: soon
ts and wiles have brought a violent death upon the head of one worth m
 There were twenty score of violent deaths in one long minute of that
id the woman, making a more violent effort than before; 'the mother, w
 as it were, by making some violent effort to save himself from fallin
behind. This was rather too violent exercise to last long. When they w
 getting my chin by dint of violent exertion above the rusty nails on
en who seem to have taken a violent fancy to him, whether he will or n
peared, he was taken with a violent fit of trembling. Five minutes, te
, when she was taken with a violent fit of laughter; and after two or
he immediate precursor of a violent fit of crying. Under this impressi
and immediately fell into a violent fit of coughing: which delighted T
of such repose, fell into a violent flurry, tossing their wild arms ab
 and accompanying them with violent gesticulation, the boy actually th
ght I really must have laid violent hands upon myself, when Miss Mills
 arm tied up, these men lay violent hands upon him -- by doing which,
 every aggravation that her violent hate -- I love her for it now -- c
 work himself into the most violent heats, and deliver the most wither
terics were usually of that violent kind which the patient fights and
 me against the donkey in a violent manner, as if there were any affin
 to keep down by force some violent outbreak. 'Let me go, will you,--t
hands with me - which was a violent proceeding for him, his usual cour
en.' 'Well, sir, there were violent quarrels at first, I assure you,'
revent the escape of such a violent roar, that the abused Mr. Chitling
t gradually resolved into a violent run. After completely exhausting h
, on which he ever showed a violent temper or swore an oath, was this
ullen, rebellious spirit; a violent temper; and an untoward, intractab
fe of Oliver Twist had this violent termination or no. CHAPTER III REL
in, and seemed to presage a violent thunder-storm, when Mr. and Mrs. B
f the theatre, are blind to violent transitions and abrupt impulses of
ming into my house, in this violent way? Do you want to rob me, or to
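The search for ngrams shared between texts, which is how the "violent fit" phrases surfaced, amounts to counting the ngrams of each text and keeping only those found in every one. The following is a rough Python sketch of that idea under stated assumptions, not the code of Lingua::EN::Ngram; the names `ngram_counts` and `common_ngrams` and the tokenizing rule are hypothetical.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count the n-word phrases in one text, ignoring case and punctuation."""
    tokens = re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def common_ngrams(texts, n):
    """Return the n-word phrases found in every text.

    The result maps each shared phrase to its total number of
    occurrences across all of the texts.
    """
    per_text = [ngram_counts(text, n) for text in texts]
    if not per_text:
        return {}
    shared = set(per_text[0]).intersection(*per_text[1:])
    return {gram: sum(counts[gram] for counts in per_text) for gram in shared}
```

Calling `common_ngrams` with the full text of each story and, say, `n=6` would list the six-word phrases the stories have in common; lowering `n` to 2 would list the shared bigrams.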
These observations simply beg other questions. Is violence a common theme in Dickens’s works? What other adjectives are used to a greater or lesser degree in Dickens’s works? How does the use of these adjectives differ from other authors of the same time period or within the canon of English literature?
With the combination of the Internet, copious amounts of freely available full text, and ubiquitous as well as powerful desktop computing, it is now possible to analyze texts in ways that were not feasible twenty years ago. While the application of computing techniques against texts dates back to at least Father Busa’s concordance work in the 1960s, it has only been in the last decade that digital humanities has come into its own. The application of digital humanities to library work offers great opportunities for the profession. Their goals are similar and their tools are complementary. From my point of view, their combination is a marriage made in heaven.
A .zip file of the texts and scripts used to do the analysis is available for you to download and experiment with yourself. Enjoy.