Archive for the ‘Uncategorized’ Category

DPLA Beta Sprint Submission

Monday, June 20th, 2011

I decided to give it a whirl and particpate in the DPLA Beta Sprint, and below is my submission:

DPLA Beta Sprint Submission

My DPLA Beta Sprint submission will describe and demonstrate how the digitized versions of library collections can be made more useful through the application of text mining and various other digital humanities computing techniques.

Full text content abounds, and full text indexing techniques have matured. While the problem of discovery will never be completely solved, it is much less acute than it was even a decade ago. Whether the library profession or academia believes it or not, most people do not feel as if they have a problem finding data, information, and knowledge. To them it is as easy as entering a few words or phrases into a search box and clicking Go.

It is now time to move beyond the problem of find and spend increased efforts trying to solve the problem of use. What does one do with all the information they find and acquire? How can it be put into the context of the reader? What actions can the reader apply against the content they find? How can it be compared & contrasted? What makes one piece of information — such as a book, an article, a chapter, or even a paragraph — more significant than another? How might the information at hand be used to solve problems or create new insights?

There is no single answer to these questions, but this submission will describe and demonstrate one set of possibilities. It will assume the existence of full text content of just about any type — such as books the Internet Archive, open access journals, or blog postings. It will outline how these texts can be analyzed to find patterns, extract themes, and identify anomalies. It will describe how entire corpora or search results can be post-processed to not only refine the discovery process but also make sense of the results and enable the reader to quickly grasp the essence of textual documents. Since actions speak louder than words, this submission will also present a number of loosely joined applications demonstrating how this analysis can be implemented through Web browsers and/or portable computing devices such as tablet computers.

By exploiting the current environment — full text content coupled with ubiquitous computing horsepower — the DPLA can demonstrate to the wider community how libraries can remain relevant in the current century. This submission will describe and demonstrate a facet of that vision.

Lingua::EN::Bigram (version 0.03)

Monday, August 23rd, 2010

I uploaded version 0.03 of Lingua::EN::Bigram to CPAN today, and it now supports not just bigrams, trigrams, quadgrams, but ngrams — an arbitrary phrase length.

In order to test it out, I quickly gathered together some of my more recent essays, concatonated them together, and applied Lingua::EN::Bigram against the result. Below is a list of the top 10 most common bigrams, trigrams, and quadgrams:

  bigrams                 trigrams                  quadgrams
  52  great ideas         36  the number of         25  the number of times
  43  open source         36  open source software  13  the total number of
  38  source software     32  as well as            10  at the same time
  29  great books         28  number of times       10  number of words in
  24  digital humanities  27  the use of            10  when it comes to
  23  good man            25  the great books       10  total number of documents
  22  full text           23  a set of              10  open source software is
  22  search results      20  eric lease morgan      9  number of times a
  20  lease morgan        20  a number of            9  as well as the
  20  eric lease          19  total number of        9  through the use of

Not surprising since I have been writing about the Great Books, digital humanities, indexing, and open source software. Re-affirming.

Lingu::EN::Bigram is available locally as well as from CPAN.