Lingua::EN::Bigram (version 0.03)

I uploaded version 0.03 of Lingua::EN::Bigram to CPAN today, and it now supports not just bigrams, trigrams, quadgrams, but ngrams — an arbitrary phrase length.

In order to test it out, I quickly gathered together some of my more recent essays, concatonated them together, and applied Lingua::EN::Bigram against the result. Below is a list of the top 10 most common bigrams, trigrams, and quadgrams:

  bigrams                 trigrams                  quadgrams
  52  great ideas         36  the number of         25  the number of times
  43  open source         36  open source software  13  the total number of
  38  source software     32  as well as            10  at the same time
  29  great books         28  number of times       10  number of words in
  24  digital humanities  27  the use of            10  when it comes to
  23  good man            25  the great books       10  total number of documents
  22  full text           23  a set of              10  open source software is
  22  search results      20  eric lease morgan      9  number of times a
  20  lease morgan        20  a number of            9  as well as the
  20  eric lease          19  total number of        9  through the use of

Not surprising since I have been writing about the Great Books, digital humanities, indexing, and open source software. Re-affirming.

Lingu::EN::Bigram is available locally as well as from CPAN.

Lingua::EN::Bigram (version 0.02)

I have written and uploaded to CPAN version 0.02 of my Perl module Lingua::EN::Bigram. From the README file:

This module is designed to: 1) pull out all of the two-, three-, and four-word phrases in a given text, and 2) list these phrases according to their frequency. Using this module is it possible to create lists of the most common phrases in a text as well as order them by their probable occurrence, thus implying significance. This process is useful for the purposes of textual analysis and “distant reading”.

Using this module I wrote a script called n-grams.pl. Feed it a plain text file, and it will return the top 10 most significant bigrams (as calculated by T-Score) as well as the top 10 most common trigrams and quadgrams. For example, here is the output of n-grams.pl when Henry David Thoreau’s Walden is input:

  Bi-grams (T-Score, count, bigram)
  4.54348783312048  22  one day  
  4.35133234596553  19  new england  
  3.705427371426    14  walden pond  
  3.66575742655033  14  one another  
  3.57857056272537  13  many years  
  3.55592136768501  13  every day  
  3.46339791276118  12  fair haven  
  3.46101939872834  12  years ago  
  3.38519781332654  12  every man  
  3.29818626191729  11  let us  
  
  Tri-grams (count, trigram)
  41  in the woods
  40  i did not
  28  i do not
  28  of the pond
  27  as well as
  27  it is a
  26  part of the
  25  that it was
  25  as if it
  25  out of the
  
  Quad-grams (count, quadgram)
  20  for the most part
  16  from time to time
  15  as if it were
  14  in the midst of
  11  at the same time
   9  the surface of the
   9  i think that i
   8  in the middle of
   8  worth the while to
   7  as if they were

The whole thing gets more interesting when you compare that output to another of Thoreau’s works — A Week on the Concord and Merrimack Rivers:

  Bi-grams (T-Score, count, bi-gram)
  4.62683939320543  22  one another  
  4.57637831535376  21  new england  
  4.08356124174142  17  let us  
  3.86858364314677  15  new hampshire  
  3.43311180449584  12  one hundred  
  3.31196701774012  11  common sense  
  3.25007069543896  11  can never  
  3.15955504269006  10  years ago  
  3.14821552996352  10  human life  
  3.13793008615654  10  told us  
  
  Tri-grams (count, tri-gram)
  41  as well as
  38  of the river
  34  it is a
  30  there is a
  30  one of the
  28  it is the
  27  as if it
  26  it is not
  26  if it were
  24  it was a
  
  Quad-grams (count, quad-gram)
  21  for the most part
  20  as if it were
  17  from time to time
   9  on the bank of
   8  the bank of the
   8  in the midst of
   8  a quarter of a
   8  the middle of the
   8  quarter of a mile
   7  at the same time

Ask yourself, “Are their similarities between the outputs? How about differences? Do you notice any patterns or anomalies? What sorts of new discoveries might be made if n-grams.pl where applied to the entire corpus of Thoreau’s works? How might the output be different if a second author’s works were introduced?” Such questions are the core of digital humanities research. With the increasing availability of full text content in library collections, such are the questions the library profession can help answer if the profession were to expand it’s definition of “service”.

Search and retrieve are not the pressing problems to solved. People can find more data and information than they know what to do with. Instead, the pressing problems surround use and understanding. Lingua::EN::Bigram is an example of how these newer and more pressing problems can be addressed. The module is available for downloading (locally as well as from CPAN). Also for your perusal is n-grams.pl.