Archive for the ‘Hacks’ Category

TriLUG, open source software, and satisfaction

Friday, December 9th, 2011

This is a posting about TriLUG, open source software, and the satisfaction of a job well done.

A long time ago, in a galaxy far far away, I lived in Raleigh (North Carolina), and a fledgling community was growing called the Triangle Linux User’s Group (TriLUG). I participated in a few of their meetings. While I was interested in open source software, I was not so interested in Linux. My interests were more along the lines of the application stack, not necessarily systems administration nor Internet networking.

I gave a presentation to the User’s Group on the combined use of PHP and MySQL — “Smart HTML pages with PHP”. Because of this I was recruited to write a Web-based membership application. Since flattery will get you everywhere with me, I was happy to do it. After a couple of weeks, the application was put into place and seemed to function correctly. That was a bit more than ten years ago, probably during the Spring of 2001.

The other day I got an automated email message from the User’s Group. The author of the message wanted to know if I wanted to continue my membership. I replied that it was not necessary since I had long since moved away to northern Indiana.

I then got to wondering whether or not the message I received had been sent by my application. It was a long shot, but I enquired anyway. Sure enough, I got a response from Jeff Schornick, a TriLUG board member, who told me “Yes, your application was the tool that had been used.” How satisfying! How wonderful to know that something I wrote more than ten years ago was still working.

Just as importantly, Jeff wanted to know about open source licensing. I had not explicitly licensed the software, something that I only learned was necessary from Dan Chudnov later. After a bit of back and forth, the original source code was supplemented with the GNU General Public License, packaged up, and distributed from a Git repository. Over the years the User’s Group had modified it to overcome a few usability issues, and they wanted to distribute the source code using the most legitimate means possible.

This experience was extremely enriching. I originally offered my skills, and they returned benefits to the community greater than the expense of my time. The community then came back to me because they wanted to express their appreciation and give credit where credit was due.

Open source software is not necessarily about computer technology. It is just as much, if not more, about people and the communities they form.

Fun with RSS and the RSS aggregator called Planet

Wednesday, May 25th, 2011

This posting outlines how I refined a number of my RSS feeds and then aggregated them into a coherent whole using Planet.

Many different RSS feeds

I have, more or less, been creating RSS (Really Simple Syndication) feeds since 2002. My first foray was not really with RSS but rather with RDF. At that time the functions of RSS and RDF were blurred. In any event, I used RDF as a way of syndicating randomly selected items from my water collection. I never really pushed the RDF, and nothing really became of it. See “Collecting water and putting it on the Web” for details.

In December of 2004 I started marking up my articles, presentations, and travelogues in TEI and saving the result in a database. The webified version of these efforts was something called Musings on Information and Librarianship. I described the database supporting the process in a specific entry called “My personal TEI publishing system”. A program — make-rss.pl — was used to make the feed.

Since then blogs have become popular, and almost by definition, blogs support RSS in a really big way. My RSS was functional, but by comparison, everybody else’s was exceptional. For many reasons I started drifting away from my personal publishing system in 2008 and started moving towards WordPress. This manifested itself in this blog — Mini-Musings.

To make things more complicated, I started blogging on other sites for specific purposes. About a year ago I started blogging for the “Catholic Portal”, and more recently I’ve been blogging about research data management/curation — Days in the Life of a Librarian — at the University of Notre Dame.

In September of 2009 I started implementing a reading list application. Print an article. Read it. Draw and scribble on it. (Read, “Annotate it.”) Scan it. Convert it into a PDF document. Do OCR against it. Save the result to a Web-accessible file system. Do data entry against a database to describe it. Index the metadata and extracted OCR. And finally, provide a searchable/browsable interface to the whole lot. The result is a fledgling system I call “What’s Eric Reading?” Since I wanted to share my wealth (after all, I am a librarian) I created an RSS feed against this system too.

I was on a roll. I went back to my water collection and created a full-fledged RSS feed against it as well. See the simple Perl script — water2rss.pl — to see how easy it is.

Ack! I now have six different active RSS feeds, not counting the feeds I can get from Flickr and YouTube:

  1. Catholic Portal
  2. Life of a Librarian
  3. Mini-musings
  4. Musings
  5. What’s Eric Reading?
  6. Water collection

That’s too many, even for an ego surfer like myself. What to do? How can I consolidate these things? How can I present my writings in a single interface? How can I make it easy to syndicate all of this content in a standards-compliant way?

Planet

The answer to my questions is/was Planet — “an awesome ‘river of news’ feed reader. It downloads news feeds published by web sites and aggregates their content together into a single combined feed, latest news first.”

A couple of years ago the Code4Lib community created an RSS “planet” called Planet Code4Lib — “Blogs and feeds of interest to the Code4Lib community, aggregated.” I think it is maintained by Jonathan Rochkind, but I’m not sure. It is pretty nice since it brings together the RSS feeds from quite a number of library “hackers”. Similarly, there is another planet called Planet Cataloging which does the same thing for library cataloging feeds. This one is maintained by Jennifer W. Baxmeyer and Kevin S. Clarke. The combined planets work very well together, except when individual blogs are in both aggregations. When this happens I end up reading the same blog postings twice. Not a big deal. You get what you pay for.

After a tiny bit of investigation, I decided to use Planet to aggregate and serve my RSS feeds. Installation and configuration were trivial. Download and unpack the distribution. Select an HTML template. Edit a configuration file denoting the location of RSS feeds and where the output will be saved. Run the program. Tweak the template. Repeat until satisfied. Run the program on a regular basis, preferably via cron. Done. My result is called Planet Eric Lease Morgan.
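The "river of news" idea is easy to sketch in Python. What follows is not Planet's code, just a minimal illustration under stated assumptions: the two tiny RSS 2.0 documents, their titles, and their example.org links are made up, and the function names are my own. It reads the items from each feed, merges them, and sorts them latest first.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny RSS 2.0 documents standing in for real feeds (hypothetical content).
FEED_A = """<rss version="2.0"><channel><title>Mini-musings</title>
<item><title>Fun with RSS</title><link>http://example.org/a1</link>
<pubDate>Wed, 25 May 2011 10:00:00 +0000</pubDate></item>
<item><title>Alex Lite</title><link>http://example.org/a2</link>
<pubDate>Mon, 11 Apr 2011 09:00:00 +0000</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Water collection</title>
<item><title>Lake Michigan</title><link>http://example.org/b1</link>
<pubDate>Sun, 15 May 2011 12:00:00 +0000</pubDate></item>
</channel></rss>"""


def read_items(rss_text):
    """Return a list of dicts, one per <item>, tagged with the feed title."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title")
    return [
        {
            "source": source,
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": parsedate_to_datetime(item.findtext("pubDate")),
        }
        for item in channel.findall("item")
    ]


def river_of_news(feeds):
    """Merge the items of every feed and sort them latest first."""
    items = [item for feed in feeds for item in read_items(feed)]
    return sorted(items, key=lambda item: item["date"], reverse=True)


river = river_of_news([FEED_A, FEED_B])
for item in river:
    print(item["date"].date(), item["source"], "-", item["title"])
```

A real aggregator also has to fetch the feeds over the network, cache them, cope with malformed XML, and write the combined result back out as HTML and RSS. Planet does all of that, which is why I used it instead of rolling my own.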

Planet Eric Lease Morgan

The graphic design may not be extraordinarily beautiful, but the content is not necessarily intended to be read via an HTML page. Instead the content is intended to be read from inside one’s favorite RSS reader. Planet not only aggregates content but syndicates it too. Very, very nice.

What I learned

I learned a number of things from this process. First I learned that standards evolve. “Duh!”

Second, my understanding of open source software and its benefits was reinforced. I would not have been able to do nearly as much if it weren’t for open source software.

Third, the process provided me with a means to reflect on the processes of librarianship. My particular processes for syndicating content needed to evolve in order to remain relevant. I had to go back and modify a number of my programs in order for everything to work correctly and validate. The library profession seemingly hates to do this. We have a mindset of “Mark it and park it.” We have a mindset of “I only want to touch a book or record once.” In the current environment, this is not healthy. Change is more the norm than not. The profession needs to embrace change, but then again, all institutions, almost by definition, abhor change. What’s a person to do?

Fourth, the process enabled me to come up with a new quip. The written word read transcends both space and time. Fun!?

Finally, here’s an idea for the progressive librarians in the crowd. Use the Planet software to aggregate RSS fitting your library’s collection development policy. Programmatically loop through the resulting links to copy/mirror the remote content locally. Curate the resulting collection. Index it. Integrate the subcollection and index into your wider collection of books, journals, etc. Repeat.

Book reviews for Web app development

Sunday, May 15th, 2011

This is a set of tiny book reviews covering the topic of Web app development for the iPhone, iPad, and iPod Touch.

Unless you’ve been living under a rock for the past three or four years, you know about the increasing popularity of personal mobile computing devices. This has manifested itself through “smart phones” like the iPhone and “tablet computers” like the iPad and to some extent the iPod Touch. These devices, as well as other smart phones and tablet computers, get their network connections from the ether, their screens are smaller than the monitors of desktop computers, and they employ touch screens for input instead of keyboards and mice. All of these things significantly change the user’s experience and thus their expectations.

As a librarian I am interested in providing information services to my clientele. In this increasingly competitive environment where the provision of information services includes players like Google, Amazon, and Facebook, it behooves me to adapt to the wider environment of my clientele as opposed to the other way around. This means I need to learn how to provide information services through mobile computing devices. Google does it. I have to do it too.

Applications for mobile computing devices fall into two categories: 1) native applications, and 2) “Web apps”. The former are binary programs written in compiled languages like Objective-C (or quite possibly Java). These types of applications are operating system-specific, but they are also able to take full advantage of the underlying hardware. This means applications for things like iPhone or iPad can interoperate with the devices’ microphone, camera, speakers, geo-location functions, network connection, local storage, etc. Unfortunately, I don’t know any compiled languages to any great degree, and actually I have little desire to do so. After all, I’m a lazy Perl programmer, and I’ve been that way for almost twenty years.

The second class of applications are Web apps. In reality, these things are simply sets of HTML pages specifically designed for mobiles. These “applications” have the advantage of being operating system independent but are dead in the water without the existence of a robust network connection. These applications, in order to be interactive and meet user expectations, also need to take full advantage of CSS and Javascript, and when it comes to Javascript it becomes imperative to learn and understand how to do AJAX and AJAX-like data acquisition. If I want to provide information services through mobile devices, then the creation of Web apps seems much more feasible. I know how to create well-formed and valid HTML. I can employ the classic LAMP stack to do any hard-core computing. There are a growing number of CSS frameworks making it easy to implement the mobile interface. All I have to do is learn Javascript, and this is not nearly as difficult as it used to be with the emergence of Javascript debuggers and numerous Javascript libraries. For me, Web apps seem to be the way to go.

Over the past couple of years I went out and purchased the following books to help me learn how to create Web apps. Each of them is briefly described below, but first, here’s a word about WebKit. There are at least three HTML rendering engines driving the majority of Web browsers these days: Gecko, which is the heart of Firefox; WebKit, which is the heart of Safari and Chrome; and whatever Microsoft uses as the heart of Internet Explorer. Since I do not own any devices that run the Android or the Windows operating systems, all of my development is limited to Gecko or WebKit based browsers. Luckily, WebKit seems to be increasing in popularity, and this makes it easier for me to rationalize my development in iPhone, iPad, and iPod Touch. The books reviewed below also lean in this direction.

  • Beginning iPhone And iPad Web Apps (2010, 488 pgs.) by Chris Apers and Daniel Paterson – This is one of my more recent purchases and I think I like this book the best. First and foremost, it is the most agnostic of all the books, even though some of the examples use WebKit. True to its title, it describes the use of HTML5, CSS, and Javascript to implement mobile interfaces. This includes whole chapters on the use of vector graphics and fonts, audio and video content, special effects with (WebKit-specific) CSS, touch and gesture events with Javascript, location-aware programming, and client-side data storage. Moreover, this book is the best of the bunch when it comes to describing how mobile interfaces are different from browser-based interfaces. Mobile interfaces are not just smaller versions of their older siblings! If you are going to buy one book, then buy this one. I think it will serve you for the longest period of time.
  • Building iPhone Apps With HTML, CSS, and Javascript (2010, 166 pgs.) by Jonathan Stark – Being shorter than the previous book, this one is not as thorough but still covers all the bases. On the other hand, unlike the previous title, it does describe how to use a Javascript library for mobile (JQTouch), and how to use PhoneGap to convert a Web app into a native application with many of the native application benefits. This book is a quick read and a good introduction.
  • Dashcode For Dummies (2011, 436 pgs.) by Jesse Feiler – Dashcode is a development environment originally designed to facilitate the creation of Macintosh OS X dashboard widgets. As you may or may not know, these widgets are self-contained HTML/Javascript/CSS files intended to support simple utility functions. Tell the time. Display the weather. Convert currencies. Render XML files. Etc. Dashcode evolved and now enables the developer to create Web apps for the Macintosh family of i-devices. I bought this book because I own these devices, and I thought the book might help me exploit their particular characteristics. It does not. Dashcode includes no internal links to the underlying hardware. This book describes how to use Dashcode very well, but Dashcode applications are not really the kind I want to create. I suppose I could use Dashcode to create the skin of my application but the overhead may be excessive and the result may be too device dependent.
  • Developing Hybrid Applications For The iPhone (2009, 195 pgs.) by Lee S. Barney – By introducing the idea of a “hybrid” application, this book picks up where the Dashcode book left off. It does this by describing two Javascript packages (QuickConnectiPhone and PhoneGap) allowing the developer to interact with the underlying hardware. I’ve read this book a couple of times, I’ve looked over it a few more, and in the end I am still challenged. I’m excited about accessing things like the hardware’s camera, GPS functionality, and file system, but after reading this book I’m still confused about how to actually do it. The content of this book is an advanced topic to be tackled after the basics have been mastered.
  • Safari And WebKit Development For iPhone OS 3.0 (2010, 383 pgs.) by Richard Wagner – This book is practical, and the one I relied upon the most, but only before I bought Beginning iPhone And iPad Web Apps. It gives an overview of WebKit, Javascript, and CSS. It advocates Web app frameworks like iUI, iWebKit, and UIUIKit. It describes how to design interfaces for the small screen of iPhone and iPod Touch. It has a chapter on the specific Javascript events supported by iPhone and iPod Touch. Like a couple of the other books, it describes how to use the HTML5 canvas to render graphics. I was excited to learn how to interact with the phone, maps, and SMS functions of the devices, but learned that this is done simply through specialized URLs. When the book talks about “offline applications” it is really talking about local database storage — another feature of HTML5. A couple of things I should have explored but haven’t yet include bookmarklets and data URLs. The book describes how to take advantage of these concepts. This book is really a second edition of a similar book with a different title but written by the same author in 2008. Its content is not as current as it could be, but the fundamentals are there.

Based on the things I’ve learned from these books, I’ve created several mobile interfaces. Each of them deserves its own blog posting, so I will only outline them here:

  1. iMobile – A rough mobile interface to much of the Infomotions domain. Written a little more than a year ago, it combines backend Perl scripts with the iUI Javascript framework to render content. Now that I look back on it, the hacks there are pretty impressive, if I do say so myself. Of particular interest is the image gallery which gets its content from OAI-PMH data stored on the server, and my water collection which reads an XML file of my own design and plots where the water was collected on a Google map. iMobile was created from the knowledge I gained from Safari And WebKit Development For iPhone OS 3.0.
  2. DH@ND – The home page for a fledgling initiative called Digital Humanities at the University of Notre Dame. The purpose of the site is to support sets of tools enabling students and scholars to simultaneously do “close reading” and “distant reading”. It was built using the principles gleaned from the books above combined with a newer Javascript framework called JQueryMobile. There are only two things presently of note there. The first is Alex Lite for Mobile, a mobile interface to a tiny catalogue of classic novels. Browse the collection by author or title. Download and read selected books in ePub, PDF, or HTML formats. The second is Geo-location. After doing named-entity extraction against a limited number of classic novels, this interface displays a word cloud of place names. The user can then click on place names and have them plotted on a Google Map.

Remember, the sites listed above are designed for mobile, primarily driven by the WebKit engine. If you don’t use a mobile device to view the sites, then your mileage will vary.

Screenshots: Image Gallery, Water Collection, Alex Lite, and Geo-location

Web app development is beyond a trend. It has all but become an expectation. Web app implementation requires an evolution in thinking about Web design as well as an additional skill set which includes advanced HTML, CSS, and Javascript. These are not your father’s websites. There are a number of books out there that can help you learn about these topics. Listed above are just a few of them.

Alex Lite (version 2.0)

Monday, April 11th, 2011

This posting describes Alex Lite (version 2.0) — a freely available, standards-compliant distribution of electronic texts and ebooks.

Alex Lite in a browser
Alex Lite on a mobile

A few years ago I created the first version of Alex Lite. Its primary purpose was to: 1) explore and demonstrate how to transform a particular flavor of XML (TEI) into a number of ebook formats, and 2) distribute the result on a CD-ROM. The process was successful. I learned a lot about XSLT — the primary tool for doing this sort of work.

Since then two new developments have occurred. First, a “standard” ebook format has emerged — ePub. Based on XHTML, this standard specifies packaging up numerous XML files into a specialized ZIP archive. Software is intended to uncompress the file and display the result. Second, mobile devices have become more prevalent. Think “smart phones” and iPads. These two things have been combined to generate an emerging ebook market. Consequently, I decided to see how easy it would be to transform my TEI files into ePub files, make them available on the Web as well as a CD-ROM, and finally implement a “Webapp” for using the whole thing.
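To make the "specialized ZIP archive" idea concrete, here is a minimal Python sketch. It is not the process I used to build Alex Lite (that was done with XSLT), and the file names and placeholder contents are assumptions for illustration only. It shows the two packaging conventions I am sure of: the archive begins with an uncompressed entry named mimetype, and a file called META-INF/container.xml points at the package document.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def package_epub(files):
    """Pack a dict of {archive path: text} into ePub-style ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for path, text in files.items():
            archive.writestr(path, text, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


# Placeholder content; a real book needs a complete content.opf and real XHTML.
data = package_epub({"content.opf": "<package/>", "chapter1.xhtml": "<html/>"})
names = zipfile.ZipFile(io.BytesIO(data)).namelist()
print(names)
```

The result of this sketch is only the container. A reader application would reject it until content.opf actually lists the book's files and reading order.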

Alex Lite (version 2.0) is the result. There you will find a rudimentary Web browser-based “catalogue” of electronic texts. Browsable by authors and titles (no search), a person can read as many as eighty classic writings in the forms of HTML, PDF, and ePub files. Using just about any mobile device, a person should be able to use a different interface to the collection with all of the functionality of the original. The only difference is the form factor, and thus the graphic design.

The entire Alex Lite distribution is designed to be given away and used as a stand-alone “library”. Download the .zip file. Uncompress it (about 116 MB). Optionally save the result on your Web server. Open the distribution’s index.html file with your browser or mobile. Done. Everything is included. Supporting files. HTML files. ePub files. PDF’s. Since all the files have been run through validators, a CD of Alex Lite should be readable for quite some time. Give away copies to your friends and relatives. Alex Lite makes a great gift.

Computers and their networks are extremely fragile. If they were to break, then much of the world’s current information would suddenly become inaccessible. Creating copies of content, like Alex Lite, is a sort of insurance against this catastrophe. Marking up content in forms like TEI makes it relatively easy to migrate ideas forward. TEI is just the information, not display nor container. Using XSLT it is possible to create different containers and different displays. Having copies of content locally enables a person to control their own destiny. Linking to content only creates maintenance nightmares.

Alex Lite is a fun little hack. Share it with your friends, and use it to evolve your definition of a library.

Where in the world is the mail going?

Wednesday, March 23rd, 2011

For a good time, I geo-located the subscribers from a number of mailing lists, and then plotted them on a Google map. In other words, I asked the question, “Where in the world is the mail going?” The answer was sort of surprising.

I moderate/manage three library-specific mailing lists: Usability4Lib, Code4Lib, and NGC4Lib. This means I constantly get email messages from the LISTSERV application alerting me to new subscriptions, unsubscriptions, bounced mail, etc. For the most part the whole thing is pretty hands-off, and all I have to do is manually unsubscribe people because their address changed. No big deal.

It is sort of fun to watch the subscription requests. They are usually from places within the United States but not always. I then got to wondering, “Exactly where are these people located?” Plotting the answer on a world map would make such things apparent. This process is called geo-location. For me it is easily done by combining a Perl module called Geo::IP with the Google Maps API. The process was not too difficult and is implemented in a program called domains2map.pl:

  1. get a list of all the subscribers to a given mailing list
  2. remove all information but the domain of the email addresses
  3. get the latitude and longitude for a given domain — geo-locate the domain
  4. increment the number of times this domain occurs in the list
  5. go to Step #3 for each item in the list
  6. build a set of Javascript objects describing each domain
  7. insert the objects into an HTML template
  8. output the finished HTML
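The steps above can be sketched in Python. This is not domains2map.pl. The lookup table stands in for Geo::IP, the subscriber addresses are invented, and the template and its ##DOMAINS## placeholder are hypothetical names of my own choosing. The sketch counts domains first and then locates each one once, which is a small reordering of Steps #3 and #4.

```python
import json
from collections import Counter

# Hypothetical lookup table standing in for a real geo-location service;
# the original program used the Perl module Geo::IP instead.
KNOWN_LOCATIONS = {
    "gmail.com": (37.386, -122.084),   # Mountain View, California (approximate)
    "nd.edu": (41.70, -86.24),         # Notre Dame, Indiana (approximate)
}

HTML_TEMPLATE = """<html><body><div id="map"></div>
<script>var domains = ##DOMAINS##;</script>
</body></html>"""


def domain_of(address):
    """Step 2: keep only the domain of an email address."""
    return address.rsplit("@", 1)[1].strip().lower()


def build_page(subscribers):
    # Steps 3-5: count each domain and geo-locate the ones we know about.
    counts = Counter(domain_of(address) for address in subscribers)
    objects = []
    for domain, total in sorted(counts.items()):
        if domain not in KNOWN_LOCATIONS:
            continue  # skip domains that cannot be located
        latitude, longitude = KNOWN_LOCATIONS[domain]
        # Step 6: one Javascript-ready object per domain.
        objects.append({"domain": domain, "lat": latitude,
                        "lng": longitude, "count": total})
    # Steps 7-8: insert the objects into the template and return the HTML.
    return objects, HTML_TEMPLATE.replace("##DOMAINS##", json.dumps(objects))


subscribers = ["alice@gmail.com", "bob@nd.edu", "carol@GMAIL.com", "dan@example.org"]
objects, html = build_page(subscribers)
print(html)
```

Drawing the markers is left to whatever Javascript reads the domains variable, and in my case that was the Google Maps API.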

The results are illustrated below.

  • Usability4Lib – 600 subscribers (interactive map, pie chart)
  • Code4Lib – 1,700 subscribers (interactive map, pie chart)
  • NGC4Lib – 2,100 subscribers (interactive map, pie chart)

It is interesting to note how many of the subscribers seem to be located in Mountain View (California). This is because many people use Gmail for their mailing list subscriptions. The mailing lists I moderate/manage are heavily based in the United States, western Europe, and Australia — for the most part, English-speaking countries. There is a large contingent of Usability4Lib subscribers located in Rochester (New York). Gee, I wonder why. Even though the number of subscribers to Code4Lib and NGC4Lib is similar, the Code4Libbers use Gmail more. NGC4Lib seems to have the most international subscription base.

In the interest of providing “access to the data behind the chart”, you can download the data sets: code4lib.txt, ngc4lib.txt, and usability4lib.txt. Fun with Perl, Google Maps, and mailing list subscriptions.

For something similar, take a gander at my water collection where I geo-located waters of the world.

Constant chatter at Code4Lib

Sunday, March 20th, 2011

As illustrated by the chart, it seems as if the chatter was constant during the most recent Code4Lib conference.

For a good time and in the vein of text mining, I made an effort to collect as many tweets with the hash tag #c4l11 as possible, as well as the backchannel log files. (“Thanks, lbjay!”). I then parsed the collection into fields (keys, author identifiers, date stamps, and chats/tweets), and stuffed them into a database. I then created a rudimentary tab-delimited text file consisting of a key (representing a conference event), a start time, and an end time. Looping through this file I queried my database returning the number of chats and tweets associated with each time interval. Lastly, I graphed the result.
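The counting step can be sketched in Python. This is only an illustration: plain in-memory lists replace my database, and the schedule, the time stamps, and the function name are all hypothetical. For each line of the tab-delimited schedule it counts the messages whose time stamps fall inside that event's interval.

```python
from datetime import datetime

# Hypothetical schedule: event key, start time, end time (tab-delimited).
SCHEDULE = """keynote\t2011-02-08 09:00\t2011-02-08 09:45
talk-1\t2011-02-08 09:45\t2011-02-08 10:05
break\t2011-02-08 10:05\t2011-02-08 10:30"""

# Hypothetical time stamps of individual chats and tweets.
MESSAGES = ["2011-02-08 09:01", "2011-02-08 09:30", "2011-02-08 09:44",
            "2011-02-08 09:45", "2011-02-08 10:10"]

FORMAT = "%Y-%m-%d %H:%M"


def count_by_event(schedule_text, stamps):
    """Return (event, count) pairs; an interval includes its start but not its end."""
    times = [datetime.strptime(stamp, FORMAT) for stamp in stamps]
    results = []
    for line in schedule_text.splitlines():
        key, start, end = line.split("\t")
        begin = datetime.strptime(start, FORMAT)
        finish = datetime.strptime(end, FORMAT)
        total = sum(1 for moment in times if begin <= moment < finish)
        results.append((key, total))
    return results


counts = count_by_event(SCHEDULE, MESSAGES)
for key, total in counts:
    print(key, total)
```

Treating each interval as half-open is a design choice of the sketch. It keeps a message sent exactly when one event ends and the next begins from being counted twice.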

Constant chatter at Code4Lib, 2011

As you can see there are a number of spikes, most notably associated with keynote presentations and Lightning Talks. Do not be fooled, because each of these events is longer than the balance of the events in the conference. The chatter was rather constant throughout Code4Lib 2011.

When talking about the backchannel, many people say, “It is too distracting; there is too much stuff there.” I then ask myself, “How much is too much?” Using the graph as evidence, I can see there are about 300 chats per event. Each event is about 20-30 minutes long. That averages out to 10ish chats per minute or 1 item every 6 seconds. I now have a yardstick. When the chat volume is equal to or greater than 1 item every 6 seconds, then there is too much stuff for many people to follow.
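The arithmetic behind the yardstick can be checked with a few lines of Python. The function name is my own, and the inputs are simply the round numbers quoted above (300 chats per event, events of 20 to 30 minutes).

```python
def seconds_per_item(chats_per_event, minutes_per_event):
    """Average number of seconds between items during one event."""
    return (minutes_per_event * 60) / chats_per_event


for minutes in (20, 30):
    per_minute = 300 / minutes
    print(minutes, "minutes:", per_minute, "chats per minute,",
          seconds_per_item(300, minutes), "seconds per item")
```

The "1 item every 6 seconds" figure corresponds to the 30-minute end of the range. For a 20-minute event the same 300 chats work out to one item every 4 seconds.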

The next step will be to write a program allowing people to select time ranges from the chat/tweet collection, extract the associated data, and apply analysis tools against them. This includes things like concordances, lists of frequently used words and phrases, word clouds, etc.

Finally, just like traditional books, articles, microforms, and audio-visual materials, things like backchannel log files, tweets, blogs, and mailing list archives are forms of human expression. To what degree do these things fall into the purview of library collections? Why (or why not) should libraries actively collect and archive them? If it is within our purview, then what do libraries need to do differently in order to build such collections and take advantage of their fulltext nature?

Forays into parts-of-speech

Saturday, February 5th, 2011

This posting is the first of my text mining essays focusing on parts-of-speech. Based on the most rudimentary investigations, outlined below, it seems as if there is not much utility in the classification and description of texts in terms of their percentage use of parts-of-speech.

Background

For the past year or so I have spent a lot of my time counting words. Many of my friends and colleagues look at me strangely when I say this. I have to admit, it does sound sort of weird. On the other hand, the process has enabled me to easily compare & contrast entire canons in terms of length and readability, locate statistically significant words & phrases in individual works, and visualize both with charts & graphs. Through the process I have developed two Perl modules (Lingua::EN::Ngram and Lingua::Concordance), and I have integrated them into my Alex Catalogue of Electronic Texts. Many people are still skeptical about the utility of these endeavors, and my implementations do not seem to be compelling enough to sway their opinions. Oh well, such is life.

My ultimate goal is to figure out ways to exploit the current environment and provide better library service. The current environment is rich with full text. It abounds. I ask myself, “How can I take advantage of this full text to make the work of students, teachers, and scholars both easier and more productive?” My current answer surrounds the creation of tools that take advantage of the full text — making it easier for people to “read” larger quantities of information, find patterns in it, and through the process create new knowledge.

Much of my work has been based on rudimentary statistics with little regard to linguistics. Through the use of computers I strive to easily find patterns of meaning across works — an aspect of linguistics. I think such a thing is possible because the use of language assumes systems and patterns. If it didn’t then communication between ourselves would be impossible. Computers are all about systems and patterns. They are very good at counting and recording data. By using computers to count and record characteristics of texts, I think it is possible to find patterns that humans overlook or don’t figure as significant. I would really like to take advantage of core reference works which are full of meaning — dictionaries, thesauri, almanacs, biographies, bibliographies, gazetteers, encyclopedias, etc. — but the ambiguous nature of written language makes the automatic application of such tools challenging. By classifying individual words as parts-of-speech (POS), some of this ambiguity can be reduced. This posting is my first foray into this line of reasoning, and only time will tell if it is fruitful.

Comparing parts-of-speech across texts

My first experiment compares & contrasts POS usage across texts. “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?”, I asked myself. “Do some works contain a greater number of nouns, verbs, and adjectives than others?” If so, then maybe this would be one way to differentiate works, and make it easier for the student to both select a work for reading as well as understand its content.

POS tagging

To answer these questions, I need to first identify the POS in a document. In the English language there are eight generally accepted POS: 1) nouns, 2) pronouns, 3) verbs, 4) adverbs, 5) adjectives, 6) prepositions, 7) conjunctions, and 8) interjections. Since I am a “lazy Perl programmer”, I sought a POS tagger and in the end settled on one called Lingua::TreeTagger — a full-featured wrapper around a command line driven application called Tree Tagger. Using a process called the Hidden Markov Model, TreeTagger systematically goes through a document and guesses the POS for a given word. According to the research, it can do this with 96% accuracy because it has accurately modeled the systems and patterns of the English language alluded to above. For example, it knows that sentences begin with capital letters and end with punctuation marks. It knows that capitalized words in the middle of sentences are the names of things and the names of things are nouns. It knows that most adverbs end in “ly”. It knows that adjectives often precede nouns. Similarly, it knows the word “the” also precedes nouns. In short, it has done its best to model the syntactical nature of a number of languages and it uses these models to denote the POS in a document.

For example, below is the first sentence from Abraham Lincoln’s Gettysburg Address:

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Using Lingua::TreeTagger it is trivial to convert the sentence into the following XML snippet where each element contains two attributes (a lemma of the word in question and its POS) and the word itself:

<pos><w lemma="Four" type="CD">Four</w> <w lemma="score" type="NN">score</w> <w lemma="and" type="CC">and</w> <w lemma="seven" type="CD">seven</w> <w lemma="year" type="NNS">years</w> <w lemma="ago" type="RB">ago</w> <w lemma="our" type="PP$">our</w> <w lemma="father" type="NNS">fathers</w> <w lemma="bring" type="VVD">brought</w> <w lemma="forth" type="RB">forth</w> <w lemma="on" type="IN">on</w> <w lemma="this" type="DT">this</w> <w lemma="continent" type="NN">continent</w> <w lemma="," type=",">,</w> <w lemma="a" type="DT">a</w> <w lemma="new" type="JJ">new</w> <w lemma="nation" type="NN">nation</w> <w lemma="," type=",">,</w> <w lemma="conceive" type="VVN">conceived</w> <w lemma="in" type="IN">in</w> <w lemma="Liberty" type="NP">Liberty</w> <w lemma="," type=",">,</w> <w lemma="and" type="CC">and</w> <w lemma="dedicate" type="VVN">dedicated</w> <w lemma="to" type="TO">to</w> <w lemma="the" type="DT">the</w> <w lemma="proposition" type="NN">proposition</w> <w lemma="that" type="IN/that">that</w> <w lemma="all" type="DT">all</w> <w lemma="man" type="NNS">men</w> <w lemma="be" type="VBP">are</w> <w lemma="create" type="VVN">created</w> <w lemma="equal" type="JJ">equal</w> <w lemma="." type="SENT">.</w></pos>

Each POS is represented by a different code. TreeTagger uses as many as 58 codes. Some of the less obscure are: CD for cardinal number, CC for conjunction, NN for noun, NNS for plural noun, JJ for adjective, VBP for a present-tense form of the verb to be other than the third-person singular (“am” or “are”), etc.
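The XML form is easy to read back into a program. What follows is a minimal sketch in Python, not the Perl used for this posting; the function name read_tagged is hypothetical, and the snippet is a shortened copy of the tagged sentence above.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the tagged sentence shown above.
snippet = (
    '<pos><w lemma="Four" type="CD">Four</w> '
    '<w lemma="score" type="NN">score</w> '
    '<w lemma="and" type="CC">and</w> '
    '<w lemma="seven" type="CD">seven</w> '
    '<w lemma="year" type="NNS">years</w></pos>'
)

def read_tagged(xml_text):
    """Return a list of (word, lemma, pos) tuples from a <pos> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("type")) for w in root.iter("w")]

records = read_tagged(snippet)
for word, lemma, pos in records:
    # One record per line: word, lemma, POS code.
    print(word, lemma, pos, sep="\t")
```

Each tuple carries the same three values as one record of the delimited stream: the word, its lemma, and its POS code.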

Using a slightly different version of the same trivial code, Lingua::TreeTagger can output a delimited stream where each line represents a record and the delimited values are words, lemmas, and POS. The first ten records from the sentence above are displayed below:

Word Lemma POS
Four Four CD
score score NN
and and CC
seven seven CD
years year NNS
ago ago RB
our our PP$
fathers father NNS
brought bring VVD
forth forth RB

In the end I wrote a simple program – tag.pl — taking a file name as input and streaming to standard output the tagged text in delimited form. Executing the code and saving the output to a file is simple:

$ bin/tag.pl corpus/walden.txt > pos/walden.pos

Consequently, I now have a way to quickly and easily denote the POS for each word in a given plain text file.

Counting and summarizing

Now that the POS of a given document are identified, the next step is to count and summarize them. Counting is something at which computers excel, and I wrote another program — summarize.pl — to do the work. The program’s input takes the following form:

summarize.pl <all|simple|other|pronouns|nouns|verbs|adverbs|adjectives> <t|l> <filename>

The first command line argument denotes what POS will be output. “All” denotes the POS defined by TreeTagger. “Simple” denotes TreeTagger POS mapped to the eight generally accepted POS of the English language. The use of “nouns”, “pronouns”, “verbs”, “adverbs”, and “adjectives” tells the script to output the tokens (words) or lemmas in each of these classes.

The second command line argument tells the script whether to tally tokens (words) or lemmas when counting specific items.

The last argument is the file to read, and it is expected to be in the form of tag.pl’s output.

Using summarize.pl to count the simple POS in Lincoln’s Address, the following output is generated:

$ summarize.pl simple t address.pos
noun 41
pronoun 29
adjective 21
verb 51
adverb 31
determiner 35
preposition 39
conjunction 11
interjection 0
symbol 2
punctuation 39
other 11

In other words, of the 272 words found in the Gettysburg Address 41 are nouns, 29 are pronouns, 21 are adjectives, etc.

Using a different form of the script, a list of all the pronouns in the Address, sorted by the number of occurrences, can be generated:

$ summarize.pl pronouns t address.pos
we 10
it 5
they 3
who 3
us 3
our 2
what 2
their 1

In other words, the word “we” — a particular pronoun — was used 10 times in the Address.

Consequently, I now have a tool enabling me to count the POS in a document.
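The counting itself can be sketched in a few lines. The following is a Python illustration and not the source of summarize.pl. The mapping from TreeTagger codes to simple classes is hypothetical because the posting does not show the one actually used, and the records are assumed to be tab-delimited.

```python
from collections import Counter

PUNCTUATION = {"SENT", ",", ":", "(", ")", "``", "''"}

def simplify(tag):
    """Map a TreeTagger code to a simple class (hypothetical mapping)."""
    if tag in PUNCTUATION:
        return "punctuation"
    if tag in {"PP", "PP$", "WP", "WP$"}:
        return "pronoun"
    if tag in {"DT", "PDT", "WDT"}:
        return "determiner"
    if tag in {"IN", "IN/that", "TO"}:
        return "preposition"
    if tag in {"RB", "RBR", "RBS", "WRB"}:
        return "adverb"
    if tag == "CC":
        return "conjunction"
    if tag == "UH":
        return "interjection"
    if tag in {"SYM", "$", "#"}:
        return "symbol"
    if tag.startswith("JJ"):
        return "adjective"
    if tag.startswith("N"):
        return "noun"
    if tag.startswith("V") or tag == "MD":
        return "verb"
    return "other"

def read_records(lines):
    """Parse 'word<TAB>lemma<TAB>pos' lines, skipping blank lines."""
    return [tuple(line.rstrip("\n").split("\t")) for line in lines if line.strip()]

def count_simple(records):
    """Count how many records fall into each simple class."""
    return Counter(simplify(pos) for _, _, pos in records)

def tally(records, simple_class, use_lemma=False):
    """Count the tokens (or lemmas) belonging to one simple class."""
    return Counter(
        lemma if use_lemma else word.lower()
        for word, lemma, pos in records
        if simplify(pos) == simple_class
    )

# The first ten records of the Gettysburg Address, as listed earlier.
sample = """Four\tFour\tCD
score\tscore\tNN
and\tand\tCC
seven\tseven\tCD
years\tyear\tNNS
ago\tago\tRB
our\tour\tPP$
fathers\tfather\tNNS
brought\tbring\tVVD
forth\tforth\tRB
"""
records = read_records(sample.splitlines())
counts = count_simple(records)
print(counts.most_common())
print(tally(records, "noun", use_lemma=True).most_common())
```

Under this mapping cardinal numbers such as “Four” fall into “other”; a different mapping would shift the counts, which is why the choice of mapping matters when comparing texts.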

Preliminary analysis

I now have the tools necessary to answer one of my initial questions, “Do some works contain a greater number of nouns, verbs, and adjectives than others?” To answer this I collected nine sets of documents for analysis:

  1. Henry David Thoreau’s Excursions (73,734 words; Flesch readability score: 57)
  2. Henry David Thoreau’s Walden (106,486 words; Flesch readability score: 55)
  3. Henry David Thoreau’s A Week on the Concord and Merrimack Rivers (117,670 words; Flesch readability score: 56)
  4. Jane Austen’s Sense and Sensibility (119,625 words; Flesch readability score: 54)
  5. Jane Austen’s Northanger Abbey (76,497 words; Flesch readability score: 58)
  6. Jane Austen’s Emma (156,509 words; Flesch readability score: 60)
  7. all of the works of Plato (1,162,460 words; Flesch readability score: 54)
  8. all of the works of Aristotle (950,078 words; Flesch readability score: 50)
  9. all of the works of Shakespeare (856,594 words; Flesch readability score: 72)

Using tag.pl I created POS files for each set of documents. I then used summarize.pl to output counts of the simple POS from each POS file. For example, after creating a POS file for Walden, I summarized the results and learned that it contains 23,272 nouns, 10,068 pronouns, 8,118 adjectives, etc.:

$ summarize.pl simple t walden.pos
noun 23272
pronoun 10068
adjective 8118
verb 17695
adverb 8289
determiner 13494
preposition 16557
conjunction 5921
interjection 37
symbol 997
punctuation 14377
other 2632

I then copied this information into a spreadsheet and calculated the relative percentage of each POS discovering that 19% of the words in Walden are nouns, 8% are pronouns, 7% are adjectives, etc. See the table below:

POS %
noun 19
pronoun 8
adjective 7
verb 15
adverb 7
determiner 11
preposition 14
conjunction 5
interjection 0
symbol 1
punctuation 12
other 2
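The spreadsheet step can also be expressed as a short calculation. The sketch below is Python rather than a spreadsheet formula, and it uses the Walden counts listed above. The published percentages are reproduced when the denominator is the total of all twelve rows, punctuation and symbols included.

```python
# Counts of simple POS in Walden, copied from the summarize.pl output above.
counts = {
    "noun": 23272,
    "pronoun": 10068,
    "adjective": 8118,
    "verb": 17695,
    "adverb": 8289,
    "determiner": 13494,
    "preposition": 16557,
    "conjunction": 5921,
    "interjection": 37,
    "symbol": 997,
    "punctuation": 14377,
    "other": 2632,
}

# The denominator is the sum of every row, not just the "word" classes.
total = sum(counts.values())

# Relative percentage of each POS, rounded to the nearest whole number.
percentages = {pos: round(100 * n / total) for pos, n in counts.items()}

for pos, pct in percentages.items():
    print(f"{pos}\t{pct}")
```

Because each value is rounded independently, the percentages need not sum to exactly 100.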

I repeated this process for each of the nine sets of documents and tabulated them here:

POS Excursions Rivers Walden Sense Northanger Emma Aristotle Shakespeare Plato Average
noun 20 20 19 17 17 17 19 25 18 19
verb 14 14 15 16 16 17 15 14 15 15
punctuation 13 13 12 15 15 15 11 16 13 14
preposition 13 13 14 13 13 12 15 9 14 13
determiner 12 12 11 7 8 7 13 6 11 10
pronoun 7 7 8 12 11 11 5 11 7 9
adverb 6 6 7 8 8 8 6 6 6 7
adjective 7 7 7 5 6 6 7 5 6 6
conjunction 5 5 5 3 3 3 5 3 6 4
other 2 2 2 3 3 3 3 3 3 3
symbol 1 1 1 1 1 0 1 2 1 1
interjection 0 0 0 0 0 0 0 0 0 0
Percentage and average of parts-of-speech usage in 9 works or corpora

The result was very surprising to me. Despite the wide range of document sizes, and despite the wide range of genres, the relative percentages of POS are very similar across all of the documents. The last column in the table represents the average percentage of each POS use. Notice how each individual POS value differs very little from the average.

This analysis can be illustrated in a couple of ways. First, below are nine pie charts. Each slice of each pie represents a different POS. Notice how all the dark blue slices (nouns) are very similar in size. Notice how all the red slices (verbs), again, are very similar. The only noticeable exception is in Shakespeare where there is a greater number of nouns and pronouns (dark green).


Thoreau’s Excursions

Thoreau’s Walden

Thoreau’s Rivers

Austen’s Sense

Austen’s Northanger

Austen’s Emma

all of Plato

all of Aristotle

all of Shakespeare

The similarity across all the documents can be further illustrated with a line graph:

Across the X axis is each POS. Up and down the Y axis is the percentage of usage. Notice how the values for each POS in each document are closely clustered. Each set of documents uses relatively the same number of nouns, pronouns, verbs, adjectives, adverbs, etc.

Maybe such a relationship between POS is one of the patterns of well-written documents? Maybe it is representative of works standing the test of time? I don’t know, but I doubt I am the first person to make such an observation.

Conclusion

My initial questions were, “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?” and “Do some works contain a greater number of nouns, verbs, and adjectives than others?” Based on this foray and rudimentary analysis the answers are, “No, there are not significant differences, and no, works do not contain different proportions of nouns, verbs, adjectives, etc.”

Of course, such a conclusion is faulty without further calculations. I will quite likely commit an error of induction if I base my conclusions on a sample of only nine items. While it would require a greater amount of effort on my part, it is not beyond possibility for me to calculate the average POS usage for every item in my Alex Catalogue. I know there will be some differences — especially considering the items having gone through optical character recognition — but I do not know the degree of difference. Such an investigation is left for a later time.

Instead, I plan to pursue a different line of investigation. The current work examined how texts were constructed, but in actuality I am more interested in the meanings works express. I am interested in what they say more than how they say it. Such meanings may be gleaned not so much from gross POS measurements but rather the words used to denote each POS. For example, the following table lists the 10 most frequently used pronouns and the number of times they occur in four works. Notice the differences:

Walden Rivers Northanger Sense
I (1,809) it (1,314) her (1,554) her (2,500)
it (1,507) we (1,101) I (1,240) I (1,917)
my (725) his (834) she (1,089) it (1,711)
he (698) I (756) it (1,081) she (1,553)
his (666) our (677) you (906) you (1,158)
they (614) he (649) he (539) he (1,068)
their (452) their (632) his (524) his (1,007)
we (447) they (632) they (379) him (628)
its (351) its (487) my (342) my (598)
who (340) who (352) him (278) they (509)

While the lists are similar, they are characteristic of the works from which they came. The first — Walden — is about an individual who lives on a lake. Notice the prominence of the words “I” and “my”. The second — Rivers — is written by the same author as the first but is about brothers who canoe down a river. Notice the higher occurrence of the words “we” and “our”. The latter two works, both written by Jane Austen, have females as central characters. Notice how the words “her” and “she” appear in these lists but not in the former two. (Compare these lists of pronouns with the list from Lincoln’s Address and even more interesting things appear.) It looks as if there are patterns or trends to be measured here.

More later.

Visualizing co-occurrences with Protovis

Sunday, January 9th, 2011

This posting describes how I am beginning to visualize co-occurrences with a Javascript library called Protovis. Alternatively, I am trying to answer the question, “What did Henry David Thoreau say in the same breath when he used the word ‘walden’?”

“In the same breath”

Network diagrams are great ways to illustrate relationships. In such diagrams nodes represent some sort of entity, and lines connecting nodes represent some sort of relationship. Nodes clustered together and sharing many lines denote some kind of similarity. Conversely, nodes whose lines are long and not interconnected represent entities outside the norm or at a distance. Network diagrams are a way of visualizing complex relationships.

Are you familiar with the phrase “in the same breath”? It is usually used to denote the relationship between two or more ideas. “He mentioned both ‘love’ and ‘war’ in the same breath.” This is exactly one of the things I want to do with texts. Concordances provide this sort of functionality. Given a word or phrase, a concordance will find the query in a corpus and display the words on either side of it. As KWIC (key word in context) indexes, concordances make it easier to read how words or phrases are used in relationship with their surrounding words. The use of network diagrams seems like a good way to see — visualize — how words or phrases are used within the context of surrounding words.

Protovis is a Javascript charting library developed by the Stanford Visualization Group. Using Protovis a developer can create all sorts of traditional graphs (histograms, box plots, line charts, pie charts, scatter plots) through a relatively easy-to-learn API (application programmer interface). One of the graphs Protovis supports is an interactive simulation of network diagrams called “force-directed layouts”. After experiencing some of the work done by a few of my colleagues (“Thank you, Michael Clark and Ed Summers”), I wondered whether or not network diagrams could be used to visualize co-occurrences in texts. After discovering Protovis, I decided to try to implement something along these lines.

Implementation

The implementation of the visualization requires the recursive creation of a term matrix. Given a word (or regular expression), find the query in a text (or corpus). Identify and count the d most frequently used words within b number of characters. Repeat this process d times with each co-occurrence. For example, suppose the text is Walden by Henry David Thoreau, the query is “spring”, d is 5, and b is 50. The implementation finds all the occurrences of the word “spring”, gets the text 50 characters on either side of it, finds the 5 most commonly used words in those characters, and repeats the process for each of those words. The result is the following matrix:

spring day morning first winter
day days night every today
morning spring say day early
first spring last yet though
winter summer pond like snow

Thus, the most common co-occurrences for the word “spring” are “day”, “morning”, “first”, and “winter”. Each of these co-occurrences is recursively used to find more co-occurrences. In this example, the word “spring” co-occurs with times of day and seasons. These words then co-occur with more times of day and more seasons. Similarities and patterns begin to emerge. Depending on the complexity of a writer’s sentence structure, the value of b (“breath”) may need to be increased or decreased. As the value of d (“detail”) is increased or decreased, so is the number of co-occurrences returned.
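The process can be sketched as follows. This is a Python illustration under several assumptions, not the Perl behind the CGI scripts: the query is treated as a literal word instead of a regular expression, the stop word list is a tiny stand-in, and each row holds the term followed by its d - 1 most frequent neighbors so that the shape matches the matrix above.

```python
import re
from collections import Counter

# A tiny illustrative stop list; the list used by the original scripts is unknown.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is", "was",
             "it", "that", "with", "as", "at", "by", "for"}

def cooccurrences(text, query, d=5, b=50):
    """Return the d - 1 words found most often within b characters of query."""
    counts = Counter()
    pattern = r"\b" + re.escape(query) + r"\b"
    for match in re.finditer(pattern, text, re.IGNORECASE):
        # Take b characters on either side; words cut off at the edges
        # of the window are counted as the fragments that remain.
        start = max(0, match.start() - b)
        window = text[start:match.end() + b]
        for word in re.findall(r"[a-z]+", window.lower()):
            if word != query.lower() and word not in STOPWORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(d - 1)]

def term_matrix(text, query, d=5, b=50):
    """Build one row for the query and one row for each of its co-occurrences."""
    first = cooccurrences(text, query, d, b)
    rows = [[query] + first]
    for term in first:
        rows.append([term] + cooccurrences(text, term, d, b))
    return rows

text = "spring morning spring morning spring day"
rows = term_matrix(text, "spring", d=3, b=8)
for row in rows:
    print(" ".join(row))
```

Measuring the window in characters rather than words keeps the code simple, at the cost of occasionally counting a truncated word at the edge of a window.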

Once this matrix is constructed, Protovis requires it to be converted into a simple JSON (Javascript Object Notation) data structure. In this example, “spring” points to “day”, “morning”, “first”, and “winter”. “Day” points to “days”, “night”, “every”, and “today”. Etc. As terms point to multiples of other terms, a network diagram is manifested, and the magic of Protovis is put to work. See the following illustration:
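A minimal sketch of the conversion, in Python rather than the original Perl, might look like this. The key names nodeName, source, target, and value follow my understanding of the Protovis force-directed examples and should be treated as assumptions; the function name matrix_to_graph is hypothetical.

```python
import json

def matrix_to_graph(rows):
    """Convert rows of [term, co-occurrence, ...] into nodes and links."""
    index = {}   # term -> position in the nodes list
    nodes = []
    links = []

    def node_id(term):
        # Register each term once and return its position.
        if term not in index:
            index[term] = len(nodes)
            nodes.append({"nodeName": term})
        return index[term]

    for term, *others in rows:
        source = node_id(term)
        for other in others:
            links.append({"source": source, "target": node_id(other), "value": 1})
    return {"nodes": nodes, "links": links}

# The matrix for "spring" in Walden, copied from above.
rows = [
    ["spring", "day", "morning", "first", "winter"],
    ["day", "days", "night", "every", "today"],
    ["morning", "spring", "say", "day", "early"],
    ["first", "spring", "last", "yet", "though"],
    ["winter", "summer", "pond", "like", "snow"],
]
graph = matrix_to_graph(rows)
print(json.dumps(graph, indent=2))
```

Because a term such as “day” appears in more than one row, it becomes a single node with several links, and that sharing is what pulls related words together in the diagram.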

“spring” in Walden

It is interesting enough to see the co-occurrences of any given word in a text, but it is even more interesting to compare the co-occurrences between texts. Below are a number of visualizations from Thoreau’s Walden. Notice how the word “walden” frequently co-occurs with the words “pond”, “water”, and “woods”. This makes a lot of sense because Walden Pond is a pond located in the woods. Notice how the word “fish” is associated with “pond”, “fish”, and “fishing”. Pretty smart, huh?

“walden” in Walden
“fish” in Walden
“woodchuck” in Walden
“woods” in Walden

Compare these same words with the co-occurrences in a different work by Thoreau, A Week on the Concord and Merrimack Rivers. Given the same inputs the outputs are significantly different. For example, notice the difference in co-occurrences given the word “woodchuck”.

“walden” in Rivers
“fish” in Rivers
“woodchuck” in Rivers
“woods” in Rivers

Give it a try

Give it a try for yourself. I have written three CGI scripts implementing the things outlined above:

In each implementation you are given the opportunity to input your own queries, define the “size of the breath”, and the “level of detail”. The result is an interactive network diagram visualizing the most frequent co-occurrences of a given term.

The root of the Perl source code is located at http://infomotions.com/sandbox/network-diagrams/.

Implications for librarianship

The visualization of co-occurrences obviously has implications for text mining and the digital humanities, but it also has implications for the field of librarianship.

Given the current environment where data and information abound in digital form, libraries have found themselves in an increasingly competitive environment. What are libraries to do? Lest they become marginalized, librarians cannot rest on their “public good” laurels. Merely providing access to information is not good enough. Everybody feels as if they have plenty of access to information. What is needed are methods and tools for making better use of the data and information they acquire. Implementing text mining and visualization interfaces is one way to accomplish that goal within the context of online library services. Do a search in the “online catalog”. Create a subset of interesting content. Click a button to read the content from a distance. Provide ways to analyze and summarize the content thus saving the time of the reader.

We librarians have to do something differently. Think like an entrepreneur. Take account of your resources. Examine the environment. Innovate and repeat.

MIT’s SIMILE timeline widget

Monday, December 20th, 2010

For a good time, I took a stab at learning how to implement an MIT SIMILE timeline widget. This posting describes what I learned.

Background

The MIT SIMILE Widgets are a set of cool Javascript tools. There are tools for implementing “exhibits”, time plots, “cover flow” displays a la iTunes, a couple of other things, and interactive timelines. I have always had a fondness for timelines since college when I created one to help me study for my comprehensive examinations. Combining this interest with the rise of digital humanities and my belief that library data is too textual in nature, I decided to learn how to use the timeline widget. Maybe this tool can be used in Library Land?

Screen shot of local timeline implementation

Implementation

The family of SIMILE Widgets Web pages includes a number of sample timelines. By playing with the examples you can see the potential of the tool. Going through the Getting Started guide was completely necessary since the Widget documentation has been written, re-written, and moved to other platforms numerous times. Needless to say, I found the instructions difficult to use. In a nutshell, using the Timeline Widget requires the developer to:

  1. load the libraries
  2. create and modify a timeline object
  3. create a data file
  4. load the data file
  5. render the timeline

Taking hints from “timelines in the wild”, I decided to plot my writings — dating from 1989 to the present. Luckily, just about all of them are available via RSS (Really Simple Syndication), and they include:

Consequently, after writing my implementation’s framework, the bulk of the work was spent converting RSS files into an XML file the widget could understand. In the end I:

  • created an HTML file complete with the widget framework
  • downloaded the totality of RSS entries from all my RSS feeds
  • wrote a slightly different XSL file for each RSS feed
  • wrote a rudimentary shell script to loop through each XSL/RSS combination and create a data file
  • put the whole thing on the Web
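The conversion step can be sketched without XSL. The following Python is an illustration only, not the XSL and shell script described above. It assumes RSS 2.0 input with RFC 822 dates, and it assumes the widget accepts a data element containing event elements whose start attribute looks like “Dec 20 2010 12:00:00 GMT”; the function name rss_to_timeline is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

def rss_to_timeline(rss_text):
    """Turn RSS 2.0 items into a <data> document of <event> elements."""
    channel = ET.fromstring(rss_text).find("channel")
    data = ET.Element("data")
    for item in channel.findall("item"):
        stamp = item.findtext("pubDate")
        if not stamp:
            continue  # an undated item cannot be plotted
        # Normalize every date to UTC and write it in one consistent format.
        when = parsedate_to_datetime(stamp).astimezone(timezone.utc)
        event = ET.SubElement(data, "event", {
            "start": when.strftime("%b %d %Y %H:%M:%S GMT"),
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
        event.text = item.findtext("description", default="")
    return ET.tostring(data, encoding="unicode")

sample_rss = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Sample posting</title><link>http://example.org/sample</link>
<pubDate>Mon, 20 Dec 2010 12:00:00 +0000</pubDate>
<description>An abstract of the posting.</description></item>
</channel></rss>"""

timeline_xml = rss_to_timeline(sample_rss)
print(timeline_xml)
```

Writing every date in a single, explicit format is the point of the exercise, since inconsistently formatted dates are interpreted differently by different Javascript engines.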

You can see the fruits of these labors on a page called Eric Lease Morgan’s Writings Timeline, and you can download the source code — timeline-2010-12-20.tar.gz. From there a person can scroll backwards and forwards in time, click on events, read an abstract of the writing, and hyperlink to the full text. The items from the Water Collection work in the same way but also include a thumbnail image of the water. Fun!?

Take-aways

I have a number of take-aways. First, my implementation is far from perfect. For example, the dates from the Water Collection are not correctly formatted in the data file. Consequently, different Javascript interpreters render the dates differently. Specifically, the Water Collection links do not show up in Safari, but they do show up in Firefox. Second, the timeline is quite cluttered in some places. There has got to be a way to address this. Third, timelines are a great way to visualize events. From the implementation you can readily see how often I was writing and on what topics. The presentation makes so much more sense compared to a simple list sorted by date, title, or subject terms.

Library “discovery systems” could benefit from the implementation of timelines. Do a search. Get back a list of results. Plot them on a timeline. Allow the learner, teacher, or scholar to visualize — literally see — how the results of their query compare to one another. The ability to visualize information hinges on the ability to quantify information characteristics. In this case, the quantification is a set of dates. Alas, dates in our information systems are poorly recorded. It seems as if we — the library profession — have made it difficult for ourselves to participate in the current information environment.

Illustrating IDCC 2010

Wednesday, December 8th, 2010

This posting illustrates the “tweets” assigned to the hash tag #idcc10.

I more or less just got back from the 6th International Data Curation Conference that took place in Chicago (Illinois). Somewhere along the line I got the idea of applying digital humanities computing techniques against the conference’s Twitter feed — hash tag #idcc10. After installing a Perl module implementing the Twitter API (Net::Twitter::Lite), I wrote a quick hack, fed the results to Wordle, and got the following word cloud:

idcc10

What sorts of conclusions can you make based on the content of the graphic?

The output is static and rudimentary. What I’d really like to do is illustrate the tweets over time. Get the oldest tweets. Illustrate the result. Get the newer tweets. Update the illustration. Repeat for all the tweets. Done. In the end I see some sort of moving graphic where significant words are represented by bubbles. The bubbles grow in size depending on the number of times the words are used. Each bubble is attached to other bubbles with a line representing associations. The color of the bubbles might represent parts of speech. Using this technique a person could watch the ebb and flow of the virtual conversation.

For a good time, you can also download the Perl script used to create the textual output. Called twitter.pl, it is only forty-three lines long and many of those lines are comments.

Text mining Charles Dickens

Saturday, December 4th, 2010

This posting outlines how a person can do a bit of text mining against three works by Charles Dickens using a set of two Perl modules — Lingua::EN::Ngram and Lingua::Concordance.

Lingua::EN::Ngram

I recently wrote a Perl module called Lingua::EN::Ngram. Its primary purpose is to count all the ngrams (two-word phrases, three-word phrases, n-word phrases, etc.) in a given text. For two-word phrases (bigrams) it will order the output according to a statistical probability (t-score). Given a number of texts, it will count the ngrams common across the corpus. As of version 0.02 it supports non-ASCII characters making it possible to correctly read and parse a greater number of Romance languages — meaning it correctly interprets characters with diacritics. Lingua::EN::Ngram is available from CPAN.
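The idea of counting ngrams, and of finding those shared by several texts, can be sketched briefly. The code below is Python, not the module itself; the function names are hypothetical, and the t-score ordering is left out.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Count the n-word phrases in text, ignoring case and punctuation."""
    # \w matches letters with diacritics too, so accented words stay intact.
    words = re.findall(r"\w+(?:'\w+)?", text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def common_ngrams(texts, n=2):
    """Return the set of n-grams that occur in every one of the texts."""
    sets = [set(ngrams(text, n)) for text in texts]
    return set.intersection(*sets)

first = "a violent fit of laughter"
second = "a violent fit of crying"

bigrams = ngrams(first + ", " + second, 2)
shared = common_ngrams([first, second], 3)
print(bigrams.most_common(3))
print(sorted(shared))
```

A sliding window over the list of words is all that is required; the only real decisions are how to split the text into words and whether to fold case.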

Lingua::Concordance

Concordances are just about the oldest of textual analysis tools. Originally developed in the Late Middle Ages to analyze the Bible, they are essentially KWIC (keyword in context) indexes used to search and display ngrams within the greater context of a work. Given a text (such as a book or journal article) and a query (regular expression), Lingua::Concordance can display the occurrences of the query in the text as well as map their locations across the entire text. In a previous blog posting I used Lingua::Concordance to compare & contrast the use of the phrase “good man” in the works of Aristotle, Plato, and Shakespeare. Lingua::Concordance too is available from CPAN.
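A keyword-in-context display takes only a few lines to sketch. The following Python is an illustration of the technique and not the interface of Lingua::Concordance; the function name and the radius parameter are hypothetical.

```python
import re

def concordance(text, query, radius=30):
    """Return one line of context per match of the regular expression query."""
    # Collapse runs of whitespace so line breaks do not disturb the display.
    flat = re.sub(r"\s+", " ", text)
    lines = []
    for match in re.finditer(query, flat, re.IGNORECASE):
        start = max(0, match.start() - radius)
        left = flat[start:match.start()].rjust(radius)  # pad so matches align
        right = flat[match.start():match.end() + radius]
        lines.append(left + right)
    return lines

text = "He was taken with a violent fit of trembling."
lines = concordance(text, "violent", radius=10)
for line in lines:
    print(line)
```

Padding the left-hand context to a fixed width keeps the query in the same column on every line, which makes it easy to scan the surrounding words.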

Charles Dickens

In keeping with the season, I wondered about Charles Dickens’s A Christmas Carol. How often is the word “Christmas” used in the work and where? In terms of size, how does A Christmas Carol compare to some of Dickens’s other works? Are there sets of commonly used words or phrases between those texts?

Answering the first question was relatively easy. The word “Christmas” occurs eighty-six (86) times, and twenty-two (22) of those occurrences are in the first ten percent (10%) of the story. The following bar chart illustrates these facts:

bar chart
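The numbers behind such a chart can be computed by dividing the text into ten equal slices and counting the matches that start in each one. The sketch below is Python, with a hypothetical function name, and it measures position in characters rather than words.

```python
import re

def map_positions(text, query, bins=10):
    """Count the matches of query that start in each of `bins` equal slices."""
    counts = [0] * bins
    length = max(len(text), 1)
    for match in re.finditer(query, text, re.IGNORECASE):
        # Scale the character offset to a slice number from 0 to bins - 1.
        counts[min(match.start() * bins // length, bins - 1)] += 1
    return counts

# A toy text with the query at the very beginning and the very end.
text = "Christmas" + " filler" * 20 + " Christmas"
positions = map_positions(text, "christmas")
print(positions)
```

Each number in the resulting list corresponds to the height of one bar.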

The length of books (or just about any text) measured in pages is ambiguous, at best. A much more meaningful measure is number of words. The following table lists the sizes, in words, of three Dickens stories:

story size in words
A Christmas Carol 28,207
Oliver Twist 156,955
David Copperfield 355,203

For some reason I thought A Christmas Carol was much longer.

A long time ago I calculated the average size (in words) of the books in my Alex Catalogue. Once I figured this out, I discovered I could describe items in the collection based on relative sizes. The following “dial” charts bring the point home. Each one of the books is significantly different in size:

A Christmas Carol
Oliver Twist
David Copperfield

If you were pressed for time, then which story would you be able to read?

After looking for common ngrams between texts, I discovered that “taken with a violent fit of” appears in both David Copperfield and A Christmas Carol. Interesting!? Moreover, the phrase “violent fit” appears in all three works. Specifically, characters in these three Dickens stories have violent fits of laughter, crying, trembling, and coughing. By concatenating the stories together and applying concordancing methods I see there are quite a number of violent things in the three stories:

  n such breathless haste and violent agitation, as seemed to betoken so
  ood-night, good-night!' The violent agitation of the girl, and the app
  sberne) entered the room in violent agitation. 'The man will be taken,
  o understand that, from the violent and sanguinary onset of Oliver Twi
  one and all, to entertain a violent and deeply-rooted antipathy to goi
  eep a little register of my violent attachments, with the date, durati
  cal laugh, which threatened violent consequences. 'But, my dear,' said
  in general, into a state of violent consternation. I came into the roo
  artly to keep pace with the violent current of her own thoughts: soon 
  ts and wiles have brought a violent death upon the head of one worth m
   There were twenty score of violent deaths in one long minute of that 
  id the woman, making a more violent effort than before; 'the mother, w
   as it were, by making some violent effort to save himself from fallin
  behind. This was rather too violent exercise to last long. When they w
   getting my chin by dint of violent exertion above the rusty nails on 
  en who seem to have taken a violent fancy to him, whether he will or n
  peared, he was taken with a violent fit of trembling. Five minutes, te
  , when she was taken with a violent fit of laughter; and after two or 
  he immediate precursor of a violent fit of crying. Under this impressi
  and immediately fell into a violent fit of coughing: which delighted T
  of such repose, fell into a violent flurry, tossing their wild arms ab
   and accompanying them with violent gesticulation, the boy actually th
  ght I really must have laid violent hands upon myself, when Miss Mills
   arm tied up, these men lay violent hands upon him -- by doing which, 
   every aggravation that her violent hate -- I love her for it now -- c
   work himself into the most violent heats, and deliver the most wither
  terics were usually of that violent kind which the patient fights and 
   me against the donkey in a violent manner, as if there were any affin
   to keep down by force some violent outbreak. 'Let me go, will you,--t
  hands with me - which was a violent proceeding for him, his usual cour
  en.' 'Well, sir, there were violent quarrels at first, I assure you,' 
  revent the escape of such a violent roar, that the abused Mr. Chitling
  t gradually resolved into a violent run. After completely exhausting h
  , on which he ever showed a violent temper or swore an oath, was this 
  ullen, rebellious spirit; a violent temper; and an untoward, intractab
  fe of Oliver Twist had this violent termination or no. CHAPTER III REL
  in, and seemed to presage a violent thunder-storm, when Mr. and Mrs. B
  f the theatre, are blind to violent transitions and abrupt impulses of
  ming into my house, in this violent way? Do you want to rob me, or to

These observations simply beg other questions. Is violence a common theme in Dickens’s works? What other adjectives are used to a greater or lesser degree in Dickens’s works? How does the use of these adjectives differ from other authors of the same time period or within the canon of English literature?

Summary

Given the combination of the Internet, copious amounts of freely available full text, and ubiquitous as well as powerful desktop computing, it is now possible to analyze texts in ways that were not feasible twenty years ago. While the application of computing techniques against texts dates back to at least Father Busa’s concordance work in the 1960s, it has only been in the last decade that digital humanities has come into its own. The application of digital humanities to library work offers great opportunities for the profession. Their goals are similar and their tools are complementary. From my point of view, their combination is a marriage made in heaven.

A .zip file of the texts and scripts used to do the analysis is available for you to download and experiment with yourself. Enjoy.

Great Books data set

Saturday, November 6th, 2010

This posting makes the Great Books data set freely available.

As described previously, I want to answer the question, “How ‘great’ are the Great Books?” In this case I am essentially equating “greatness” with statistical relevance. Specifically, I am using the Great Books of the Western World’s list of “great ideas” as search terms and using them to query the Great Books to compute a numeric value for each idea based on term frequency inverse document frequency (TFIDF). I then sum each of the great idea values for a given book to come up with a total score — the “Great Ideas Coefficient”. The book with the largest Coefficient is then considered the “greatest” book. Along the way and just for fun, I have also kept track of the length of each book (in words) as well as two scores denoting each book’s reading level, and one score denoting each book’s readability.
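The calculation can be sketched as follows. This is a Python illustration, not the code that produced the data set. It assumes the common formulation of TFIDF, (count / document length) * log(documents / documents containing the term), uses the natural logarithm, and handles single-word ideas only; the function names and the three toy documents are hypothetical.

```python
import math
import re

def words(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(term, doc_words, corpus):
    """Score term in one document relative to the whole corpus."""
    containing = sum(1 for other in corpus if term in other)
    if not doc_words or containing == 0:
        return 0.0
    tf = doc_words.count(term) / len(doc_words)
    return tf * math.log(len(corpus) / containing)

def coefficient(doc_words, corpus, ideas):
    """Sum the TFIDF scores of every idea for one document."""
    return sum(tfidf(idea, doc_words, corpus) for idea in ideas)

# Three toy "books" standing in for the Great Books.
corpus = [words(t) for t in (
    "love and war and love",
    "war and peace",
    "the state of nature",
)]
ideas = ["love", "war"]

scores = [coefficient(doc, corpus, ideas) for doc in corpus]
for score in scores:
    print(round(score, 4))
```

A term found in every document scores zero because log(1) is zero, so the ideas that distinguish one book from the others contribute the most to its coefficient.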

The result is a canonical XML file named great-books.xml. This file, primarily intended for computer-to-computer transfer, contains all the data outlined above. Since most data analysis applications (like databases, spreadsheets, or statistical packages) do not deal directly with XML, the data was transformed into a comma-separated value (CSV) file — great-books.csv. But even this file, a matrix of 220 rows and 104 columns, can be a bit unwieldy for the uninitiated. Consequently, the CSV file has been combined with a Javascript library (called DataTables) and embedded into an HTML file for general purpose use — great-books.htm.

The HTML file enables you to sort the matrix by column values. Shift click on columns to do sub-sorts. Limit the set by entering queries into the search box. For example:

  • sort by the last column (coefficient) and notice how Kant has written the “greatest” book
  • sort by the column labeled “love” and notice that Shakespeare has written seven (7) of the top ten (10) “greatest books” about love
  • sort by the column labeled “war” and notice that something authored by the United States is ranked #2 but also has very poor readability scores
  • sort by things like “angel” or “god”, then ask yourself, “Am I surprised at what I find?”

Even more interesting questions may be asked of the data set. For example, is there a correlation between greatness and readability? If a work has a high love score, then is it likely it will have a high (or low) score from one or more of the other columns? What is the greatness of the “typical” Great Book? Is this best represented as the average of the Great Ideas Coefficient or would it be better stated as the value of the mean of all the Great Ideas? In the case of the latter, which books are greater than most, which books are typical, and which books are below typical? This sort of analysis, as well as the “kewl” Web-based implementation, is left up to the gentle reader.
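For the gentle reader who wants a head start, here is a minimal sketch in Python (not the Perl used elsewhere on this blog) of how the correlation and "typical" questions might be approached. The rows are toy values I made up for illustration, and the column names "coefficient" and "readability" are assumptions. Check the header row of great-books.csv for the real labels.

```python
import csv
import io
import math

# Toy rows only; the real great-books.csv has 220 rows and 104 columns.
# The column names "coefficient" and "readability" are assumptions.
SAMPLE = """id,coefficient,readability
book-a,10.0,60.0
book-b,20.0,50.0
book-c,30.0,40.0
book-d,40.0,30.0
"""

def column(rows, name):
    """Return one CSV column as a list of floats."""
    return [float(row[name]) for row in rows]

def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
greatness = column(rows, "coefficient")
readability = column(rows, "readability")

typical = mean(greatness)                      # the "typical" coefficient
above = [row["id"] for row in rows if float(row["coefficient"]) > typical]
r = pearson(greatness, readability)            # -1 to 1; 0 means no linear relation
print(typical, above, round(r, 3))
```

To run this against the real data, swap the io.StringIO(SAMPLE) call for an open file handle on great-books.csv and adjust the column names.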

Now ask yourself, “Can all of these sorts of techniques be applied to the principles and practices of librarianship, and if so, then how?”

Where in the world are windmills, my man Friday, and love?

Sunday, September 12th, 2010

This posting describes how a Perl module named Lingua::Concordance allows the developer to illustrate where in the continuum of a text words or phrases appear and how often.

Windmills, my man Friday, and love

When it comes to Western literature and windmills, we often think of Don Quixote. When it comes to “my man Friday” we think of Robinson Crusoe. And when it comes to love we may very well think of Romeo and Juliet. But I ask myself, “How often do these words and phrases appear in the texts, and where?” Using digital humanities computing techniques I can literally illustrate the answers to these questions.

Lingua::Concordance

Lingua::Concordance is a Perl module (available locally and via CPAN) implementing a simple key word in context (KWIC) index. Given a text and a query as input, a concordance will return a list of all the snippets containing the query along with a few words on either side. Such a tool enables a person to see how their query is used in a literary work.

Given the fact that a literary work can be measured in words, and given the fact that the number of times a particular word or phrase occurs in a text can be counted, it is possible to illustrate the locations of the words and phrases using a bar chart. One axis represents a percentage of the text, and the other axis represents the number of times the words or phrases occur in that percentage. Such graphing techniques are increasingly called visualization — a new spin on the old adage “A picture is worth a thousand words.”

In a script named concordance.pl I answered such questions. Specifically, I used it to figure out where in Don Quixote windmills are mentioned. As you can see below they are mentioned only 14 times in the entire novel, and the vast majority of the time they occur in the first 10% of the book.

  $ ./concordance.pl ./don.txt 'windmill'
  Snippets from ./don.txt containing windmill:
	* DREAMT-OF ADVENTURE OF THE WINDMILLS, WITH OTHER OCCURRENCES WORTHY TO
	* d over by the sails of the windmill, Sancho tossed in the blanket, the
	* thing is ignoble; the very windmills are the ugliest and shabbiest of 
	* liest and shabbiest of the windmill kind. To anyone who knew the count
	* ers say it was that of the windmills; but what I have ascertained on t
	* DREAMT-OF ADVENTURE OF THE WINDMILLS, WITH OTHER OCCURRENCES WORTHY TO
	* e in sight of thirty forty windmills that there are on plain, and as s
	* e there are not giants but windmills, and what seem to be their arms a
	* t most certainly they were windmills and not giants he was going to at
	*  about, for they were only windmills? and no one could have made any m
	* his will be worse than the windmills," said Sancho. "Look, senor; thos
	* ar by the adventure of the windmills that your worship took to be Bria
	*  was seen when he said the windmills were giants, and the monks' mules
	*  with which the one of the windmills, and the awful one of the fulling
  
  A graph illustrating in what percentage of ./don.txt windmill is located:
	 10 (11) #############################
	 20 ( 0) 
	 30 ( 0) 
	 40 ( 0) 
	 50 ( 0) 
	 60 ( 2) #####
	 70 ( 1) ##
	 80 ( 0) 
	 90 ( 0) 
	100 ( 0)

If windmills are mentioned so few times, then why do they play so prominently in people’s minds when they think of Don Quixote? To what degree have people read Don Quixote in its entirety? Are windmills as persistent a theme throughout the book as many people may think?
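The graph above can be approximated with very little code. What follows is a sketch in Python, not the contents of concordance.pl nor the internals of Lingua::Concordance. It assumes position is measured in characters rather than words, and the function names are my own invention.

```python
import re

def positions(text, query):
    """Return the relative position (0.0 to 1.0) of each match of query."""
    length = len(text)
    return [m.start() / length for m in re.finditer(query, text, re.IGNORECASE)]

def histogram(text, query, bins=10):
    """Count the matches of query in each of `bins` equal slices of the text."""
    counts = [0] * bins
    for p in positions(text, query):
        counts[min(int(p * bins), bins - 1)] += 1
    return counts

def chart(counts, width=30):
    """Render the counts as a plain-text bar chart, one row per slice."""
    peak = max(counts) or 1          # avoid dividing by zero when nothing matched
    step = 100 // len(counts)
    lines = []
    for i, n in enumerate(counts, start=1):
        bar = "#" * round(n / peak * width)
        lines.append(f"{i * step:3d} ({n:2d}) {bar}".rstrip())
    return "\n".join(lines)

# toy text: the query occurs twice near the start and once near the end
toy = "windmill windmill " + "filler " * 20 + "windmill"
counts = histogram(toy, "windmill")
print(chart(counts))
```

Read a real text into a string, pass it to histogram() along with a word or regular expression, and the same sort of chart falls out.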

What about “my man Friday”? Where does he occur in Robinson Crusoe? Using the concordance features of the Alex Catalogue of Electronic Texts we can see that a search for the word Friday returns 185 snippets. Mapping those snippets to percentages of the text results in the following bar chart:

[Bar chart: Friday in Robinson Crusoe]

Obviously the word Friday appears towards the end of the novel, and as anybody who has read the novel knows, it is a long time after Robinson Crusoe gets stranded on the island before he meets “my man Friday”. A concordance helps people understand this fact.

What about love in Romeo and Juliet? How often does the word occur and where? Again, a search for the word love returns quite a number of snippets (175 to be exact), and they are distributed throughout the text as illustrated below:

[Bar chart: love in Romeo and Juliet]

“Maybe love is a constant theme of this particular play,” I state sarcastically, and “Is there less love later in the play?”

Digital humanities and librarianship

Given the current environment, where full text literature abounds, digital humanities and librarianship are a match made in heaven. Our library “discovery systems” are essentially indexes. They enable people to find data and information in our collections. Yet find is not an end in itself. In fact, it is only an activity at the very beginning of the learning process. Once content is found it is then read in an attempt at understanding. Counting words and phrases, placing them in the context of an entire work or corpus, and illustrating the result is one way this understanding can be accomplished more quickly. Remember, “Save the time of the reader.”

Integrating digital humanities computing techniques, like concordances, into library “discovery systems” represents a growth opportunity for the library profession. If we don’t do this on our own, then somebody else will, and we will end up paying money for the service. Climb the learning curve now, or pay exorbitant fees later. The choice is ours.

Ngrams, concordances, and librarianship

Monday, August 30th, 2010

This posting describes how the extraction of ngrams and the implementation of concordances are integrated into the Alex Catalogue of Electronic Texts. Given the increasing availability of full-text content in libraries, the techniques described here could easily be incorporated into traditional library “discovery systems” and/or catalogs, if and only if the library profession were to shift its definition of what it means to practice librarianship.

Lingua::EN::Bigram

During the past couple of weeks, in fits of creativity, one of the things I spent some of my time on was a Perl module named Lingua::EN::Bigram. At version 0.03, it now supports not only bigrams, trigrams, and quadgrams (two-, three-, and four-word phrases, respectively), but also ngrams — multi-word phrases of an arbitrary length.

Given this enhanced functionality, and through the use of a script called ngrams.pl, I learned that the 10 most frequently used 5-word phrases and the number of times they occur in Henry David Thoreau’s Walden seem to surround spatial references:

  • a quarter of a mile (6)
  • i have no doubt that (6)
  • as if it were a (6)
  • the other side of the (5)
  • the surface of the earth (4)
  • the greater part of the (4)
  • in the midst of a (4)
  • in the middle of the (4)
  • in the course of the (3)
  • two acres and a half (3)

Whereas the same process applied to Thoreau’s A Week on the Concord and Merrimack Rivers returns lengths and references to flowing water, mostly:

  • a quarter of a mile (8)
  • on the bank of the (7)
  • the surface of the water (6)
  • the middle of the stream (6)
  • as if it were the (5)
  • as if it were a (4)
  • is for the most part (4)
  • for the most part we (4)
  • the mouth of this river (4)
  • in the middle of the (4)

While not always as clear cut as the examples outlined above, the extraction and counting of ngrams usually supports the process of “distant reading” — a phrase coined by Franco Moretti in Graphs, Maps, Trees: Abstract Models for Literary History (2007) to denote the counting, graphing, and mapping of literary texts. With so much emphasis on reading in libraries, I ask myself, “Ought the extraction of ngrams be applied to library applications?”
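The extraction and counting of ngrams described above is simple enough to sketch in a few lines. The following is written in Python and is not the code of Lingua::EN::Bigram or ngrams.pl. The tokenizing rule (lowercase letters and apostrophes) and the function names are assumptions made for the sake of illustration.

```python
import re
from collections import Counter

def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(text, n, limit=10):
    """Return the `limit` most frequent n-word phrases and their counts."""
    return Counter(ngrams(words(text), n)).most_common(limit)

# toy sentence, not a quotation from Thoreau
toy = "a quarter of a mile from here and a quarter of a mile from there"
for phrase, count in top_ngrams(toy, 5, limit=3):
    print(f"{phrase} ({count})")
```

Feed top_ngrams() the full text of a book and a value of 5 for n, and the result is a list shaped like the ones above.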

Concordances

Concordances are literary tools used to evaluate texts. Dating back to as early as the 12th or 13th centuries, they were first used to study religious materials. Concordances take many forms, but they usually list all the words in a text, the number of times each occurs, and most importantly, the places where each word occurs within the context of its surrounding text — a key-word in context (KWIC) index. Done by hand, the creation of concordances is tedious and time consuming, and therefore very expensive. Computers make the work of creating a concordance almost trivial.

Each of the full text items in the Alex Catalogue of Electronic Texts (close to 14,000 of them) is accompanied by a concordance. They support the following functions:

  • list all the words in the text starting with a given letter and the number of times each occurs
  • list the most frequently used words in the text and the number of times each occurs
  • list the most frequently used ngrams in a text and the number of times each occurs
  • display individual items from the lists above in a KWIC format
  • enable the student or scholar to search the text for arbitrary words or phrases (regular expressions) and have them displayed in a KWIC format

Such functionality allows people to answer many questions quickly and easily, such as:

  • Does Mark Twain’s Adventures of Huckleberry Finn contain many words beginning with the letter z, and if so, how many times and in what context?
  • To what extent does Aristotle’s Metaphysics use the word “good”, and maybe just as importantly, how is the word “evil” used in the same context?
  • In Jack London’s Call of the Wild the phrase “man in the red sweater” is one of the more frequently used. Who was this man and what role does he play in the story?
  • Compared to Shakespeare, to what extent does Plato discuss love, and how do the authors’ expositions differ?
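A key-word in context display of the kind described above can be sketched as follows. This is Python, not the code behind the Alex Catalogue, and it assumes context is measured as a fixed number of characters on either side of each match. The function name and the radius parameter are hypothetical.

```python
import re

def kwic(text, query, radius=30):
    """Return a snippet for each regex match, with `radius` characters of context."""
    flat = " ".join(text.split())    # collapse line breaks and runs of spaces
    snippets = []
    for m in re.finditer(query, flat, re.IGNORECASE):
        start = max(m.start() - radius, 0)
        end = min(m.end() + radius, len(flat))
        snippets.append(flat[start:end])
    return snippets

# two toy lines loosely modeled on the windmill snippets shown elsewhere
toy = """They were windmills and not giants.
"This will be worse than the windmills," said Sancho."""
for snippet in kwic(toy, r"windmills?", radius=12):
    print("*", snippet)
```

Because the query is a regular expression, a pattern like windmills? matches both the singular and the plural.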

The counting of words, the enumeration of ngrams, and the use of concordances are not intended to short-circuit traditional literary studies. Instead, they are intended to supplement and enhance the process. Traditional literary investigations, while deep and nuanced, are not scalable. A person is not able to read, compare & contrast, and then comprehend the essence of all of Shakespeare, all of Plato, and all of Charles Dickens through “close reading”. An individual simply does not have enough time. In the words of Gregory Crane, “What do you do with a million books?” Distant reading, akin to the processes outlined above, makes it easier to compare & contrast large corpora, discover patterns, and illustrate trends. Moreover, such processes are reproducible, less prone to subjective interpretation, and not limited to any particular domain. The counting, graphing, and mapping of literary texts makes a lot of sense.

The home page for the concordances is complete with a number of sample texts. Alternatively, you can search the Alex Catalogue and find an item on your own.

Library “discovery systems” and/or catalogs

The amount of full text content available to libraries has never been greater than it is today. Millions of books have been collectively digitized through Project Gutenberg, the Open Content Alliance, and the Google Books Project. There are thousands of open access journals with thousands upon thousands of freely available scholarly articles. There is an ever-growing number of institutional repositories, both subject-based as well as institutional-based. These too are rich with full text content. None of this even considers the myriad of grey literature sites like blogs and mailing list archives.

Library “discovery systems” and/or catalogs are designed to organize and provide access to the materials outlined above, but they need to do more. First of all, the majority of the profession’s acquisitions processes assume collections need to be paid for. With the increasing availability of truly free content on the Web, greater emphasis needs to be placed on harvesting content as opposed to purchasing or licensing it. Libraries are expected to build collections designed to stand the test of time. Brokering access to content through licensing agreements — one of the current trends in librarianship — will only last as long as the money lasts. Licensing content makes libraries look like cost centers and negates the definition of “collections”.

Second, library “discovery systems” and/or catalogs assume an environment of scarcity. They assume the amount of accessible, relevant data and information needed by students, teachers, and researchers is relatively small. Thus, a great deal of the profession’s efforts go into enabling people to find their particular needle in one particular haystack. In reality, current indexing technology makes the process of finding relevant materials trivial, almost intelligent. Implemented correctly, indexers return more content than most people need, and consequently they continue to drink from the proverbial fire hose.

Let’s turn these lemons into lemonade. Let’s redirect some of the time and money spent on purchasing licenses towards the creation of full text collections by systematic harvesting. Let’s figure out how to apply “distant reading” techniques to the resulting collections thus making them, literally, more useful and more understandable. These redirections represent a subtle change in the current direction of librarianship. At the same time, they retain the core principles of the profession, namely: collection, organization, preservation, and dissemination. Such a shift will result in increased expertise on our part, the ability to better control our own destiny, and contributions to the overall advancement of our profession.

What can we do to make these things come to fruition?

Lingua::EN::Bigram (version 0.02)

Sunday, August 22nd, 2010

I have written and uploaded to CPAN version 0.02 of my Perl module Lingua::EN::Bigram. From the README file:

This module is designed to: 1) pull out all of the two-, three-, and four-word phrases in a given text, and 2) list these phrases according to their frequency. Using this module it is possible to create lists of the most common phrases in a text as well as order them by their probable occurrence, thus implying significance. This process is useful for the purposes of textual analysis and “distant reading”.

Using this module I wrote a script called n-grams.pl. Feed it a plain text file, and it will return the top 10 most significant bigrams (as calculated by T-Score) as well as the top 10 most common trigrams and quadgrams. For example, here is the output of n-grams.pl when Henry David Thoreau’s Walden is input:

  Bi-grams (T-Score, count, bigram)
  4.54348783312048  22  one day  
  4.35133234596553  19  new england  
  3.705427371426    14  walden pond  
  3.66575742655033  14  one another  
  3.57857056272537  13  many years  
  3.55592136768501  13  every day  
  3.46339791276118  12  fair haven  
  3.46101939872834  12  years ago  
  3.38519781332654  12  every man  
  3.29818626191729  11  let us  
  
  Tri-grams (count, trigram)
  41  in the woods
  40  i did not
  28  i do not
  28  of the pond
  27  as well as
  27  it is a
  26  part of the
  25  that it was
  25  as if it
  25  out of the
  
  Quad-grams (count, quadgram)
  20  for the most part
  16  from time to time
  15  as if it were
  14  in the midst of
  11  at the same time
   9  the surface of the
   9  i think that i
   8  in the middle of
   8  worth the while to
   7  as if they were

The whole thing gets more interesting when you compare that output to another of Thoreau’s works — A Week on the Concord and Merrimack Rivers:

  Bi-grams (T-Score, count, bi-gram)
  4.62683939320543  22  one another  
  4.57637831535376  21  new england  
  4.08356124174142  17  let us  
  3.86858364314677  15  new hampshire  
  3.43311180449584  12  one hundred  
  3.31196701774012  11  common sense  
  3.25007069543896  11  can never  
  3.15955504269006  10  years ago  
  3.14821552996352  10  human life  
  3.13793008615654  10  told us  
  
  Tri-grams (count, tri-gram)
  41  as well as
  38  of the river
  34  it is a
  30  there is a
  30  one of the
  28  it is the
  27  as if it
  26  it is not
  26  if it were
  24  it was a
  
  Quad-grams (count, quad-gram)
  21  for the most part
  20  as if it were
  17  from time to time
   9  on the bank of
   8  the bank of the
   8  in the midst of
   8  a quarter of a
   8  the middle of the
   8  quarter of a mile
   7  at the same time
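For readers wondering where the T-Score column comes from, here is a sketch of one common way to compute a t-score for bigrams: the observed count minus the count expected by chance, divided by the square root of the observed count. It is written in Python, and I am assuming rather than asserting that it matches the arithmetic inside Lingua::EN::Bigram. It also makes no attempt to filter out stop words.

```python
import math
import re
from collections import Counter

def tscores(text):
    """Score each bigram as (observed - expected) / sqrt(observed)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    word_count = Counter(tokens)
    bigram_count = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (first, second), observed in bigram_count.items():
        # the count expected if the two words were scattered independently
        expected = word_count[first] * word_count[second] / total
        scores[f"{first} {second}"] = (observed - expected) / math.sqrt(observed)
    return scores

# toy sentence, not a quotation from Thoreau
toy = "new england is old but new england is not new york"
scores = tscores(toy)
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
for bigram, score in ranked[:3]:
    print(f"{score:.4f}  {bigram}")
```

The higher the score, the more often the two words appear side by side than their individual frequencies alone would suggest.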

Ask yourself, “Are there similarities between the outputs? How about differences? Do you notice any patterns or anomalies? What sorts of new discoveries might be made if n-grams.pl were applied to the entire corpus of Thoreau’s works? How might the output be different if a second author’s works were introduced?” Such questions are the core of digital humanities research. With the increasing availability of full text content in library collections, such are the questions the library profession can help answer if the profession were to expand its definition of “service”.

Search and retrieve are not the pressing problems to be solved. People can find more data and information than they know what to do with. Instead, the pressing problems surround use and understanding. Lingua::EN::Bigram is an example of how these newer and more pressing problems can be addressed. The module is available for downloading (locally as well as from CPAN). Also for your perusal is n-grams.pl.

Cool URIs

Sunday, August 22nd, 2010

I have started implementing “cool” URIs against the Alex Catalogue of Electronic Texts.

As outlined in Cool URIs for the Semantic Web, “The best resource identifiers… are designed with simplicity, stability and manageability in mind…” To that end I have taken to creating generic URIs redirecting user-agents to URLs based on content negotiation — 303 URI forwarding. These URIs also provide a means to request specific types of pages. The shapes of these URIs follow, where “key” is a foreign key in my underlying (MyLibrary) database:

  • http://infomotions.com/etexts/id/key – generic; redirection based on content negotiation
  • http://infomotions.com/etexts/page/key – HTML; the text itself
  • http://infomotions.com/etexts/data/key – RDF; data about the text
  • http://infomotions.com/etexts/concordance/key – concordance; a means for textual analysis

For example, the following URIs return different versions/interfaces of Henry David Thoreau’s Walden:

This whole thing makes my life easier. No need to remember complicated URLs. All I have to remember is the shape of my URI and the foreign key. In the process, this also makes the URLs easier to type, shorten, distribute, and display.

The downside of this implementation is the need for an always-on intermediary application doing the actual work. The application, implemented as a mod_perl module, is called Apache2::Alex::Dereference and is available for your perusal. Another downside is the need for better, more robust RDF, but that’s for later.
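The logic of such an intermediary is not complicated. The following Python sketch is not the code of Apache2::Alex::Dereference. It only illustrates the idea of reading the Accept header and answering a generic URI with a 303 redirect. The media types it recognizes, the fallback to HTML, and the key used in the usage note are all assumptions.

```python
BASE = "http://infomotions.com/etexts"

# Media types mapped to the specific URI shapes listed above. Which types
# the real dereferencer recognizes is an assumption.
TARGETS = {
    "text/html": "page",
    "application/rdf+xml": "data",
}

def preferred(accept):
    """Pick the known media type with the highest q-value in an Accept header."""
    best, best_q = "text/html", 0.0          # fall back to HTML
    for part in accept.split(","):
        fields = [field.strip() for field in part.split(";")]
        media, q = fields[0], 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0                  # ignore malformed q-values
        if media in TARGETS and q > best_q:
            best, best_q = media, q
    return best

def dereference(key, accept):
    """Return the status code and Location for a generic /id/key request."""
    return 303, f"{BASE}/{TARGETS[preferred(accept)]}/{key}"
```

Calling dereference("some-key", "application/rdf+xml") returns a 303 pointing at the data URI for the made-up key some-key, while a browser-style Accept header gets sent to the page URI.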

rsync, a really cool utility

Wednesday, August 18th, 2010

Without direct physical access to my co-located host, backing up and preserving Infomotions’ 150 GB of website content is challenging, but through the use of rsync things are a whole lot easier. rsync is a really cool utility, and thanks go to Francis Kayiwa who recommended it to me in the first place. “Thank you!”

Here is my rather brain-dead back-up utility:

#!/bin/bash
# rsync.sh - brain-dead backup of wilson

# change directories to the local store; stop if it is not there
cd /Users/eric/wilson || exit 1

# get rid of any weird Mac OS X filenames
find . -name '.DS_Store' -type f -exec rm -f {} \;

# do the work for one remote file system...
rsync --exclude-from=/Users/eric/etc/rsync-exclude.txt \
    -avz wilson:/disk01/ \
    ./disk01/

# ...and then another
rsync --exclude-from=/Users/eric/etc/rsync-exclude.txt \
    -avz wilson:/home/eric/ \
    ./home/eric/

After I run this code my local Apple Macintosh Time Capsule automatically copies my content to yet a third spinning disk. I feel much better about my data now that I have started using rsync.

How “great” is this article?

Friday, July 9th, 2010

During Digital Humanities 2010 I participated in the THATCamp London Developers’ Challenge and tried to answer the question, “How ‘great’ is this article?” This posting outlines the functionality of my submission, links to a screen capture demonstrating it, and provides access to the source code.

Given any text file — say an article from the English Women’s Journal — my submission tries to answer the question, “How ‘great’ is this article?” It does this by:

  1. returning the most common words in a text
  2. returning the most common bigrams in a text
  3. calculating a few readability scores
  4. comparing the texts to a standardized set of “great ideas”
  5. supporting a concordance for browsing

Functions #1, #2, #3, and #5 are relatively straight-forward and well-understood. Function #4 needs some explanation.

In the 1960s a set of books was published called the Great Books. The set is based on a set of 102 “great ideas” (such as art, love, honor, truth, justice, wisdom, science, etc.). By summing the TFIDF scores of each of these ideas for each of the books, a “great ideas coefficient” can be computed. Through this process we find that Shakespeare wrote seven of the top ten books when it comes to love. Kant wrote the “greatest book”. The American State’s Articles of Confederation ranks the highest when it comes to war. This “coefficient” can then be used as a standard — an index — for comparing other documents. This is exactly what this program does. (See the screen capture for a demonstration.)

The program can be improved a number of ways:

  1. it could be Web-based
  2. it could process non-text files
  3. it could graphically illustrate a text’s “greatness”
  4. it could hyperlink returned words directly to the concordance

Thanks to Gerhard Brey and the folks of the Nineteenth Century Serials Editions for providing the data. Very interesting.

Measuring the Great Books

Tuesday, June 15th, 2010

This posting describes how I am assigning quantitative characteristics to texts in an effort to answer the question, “How ‘great’ are the Great Books?” In the end I make a plea for library science.

Background

With the advent of copious amounts of freely available plain text on the ‘Net comes the ability to “read” entire corpora with a computer and apply statistical processes against the result. In an effort to explore the feasibility of this idea, I am spending time answering the question, “How ‘great’ are the Great Books?”

More specifically, I want to assign quantitative characteristics to each of the “books” in the Great Books set, look for patterns in the result, and see whether or not I can draw any conclusions about the corpus. If such processes are proven effective, then the same processes may be applicable to other corpora such as collections of scholarly journal articles, blog postings, mailing list archives, etc. If I get this far, then I hope to integrate these processes into traditional library collections and services in an effort to support their continued relevancy.

On my mark. Get set. Go.

Assigning quantitative characteristics to texts

The Great Books set posits 102 “great ideas” — basic, foundational themes running through the heart of Western civilization. Each of the books in the set was selected for inclusion by the way it expressed the essence of these great ideas. The ideas are grand and ambiguous. They include words such as angel, art, beauty, courage, desire, eternity, god, government, honor, idea, physics, religion, science, space, time, wisdom, etc. (See Appendix B of “How ‘great’ are the Great Books?” for the complete list.)

In a previous posting, “Great Ideas Coefficient“, I outlined the measure I propose to use to determine the books’ “greatness” — essentially a sum of all TFIDF (term frequency / inverse document frequency) scores as calculated against the list of great ideas. TFIDF is defined as:

( c / t ) * log( d / f )

where:

  • c = number of times a given word appears in a document
  • t = total number of words in a document
  • d = total number of documents in a corpus
  • f = total number of documents containing a given word
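As a worked example of the formula, here is a tiny Python function. The numbers fed to it are invented for illustration, and the use of the natural logarithm is an assumption, since the formula above does not name a base.

```python
import math

def tfidf(c, t, d, f):
    """Return (c / t) * log(d / f), or zero when there is nothing to count."""
    if t == 0 or f == 0:
        return 0.0
    return (c / t) * math.log(d / f)

# invented numbers: a word used 50 times in a 100,000-word book,
# and found in 100 of the 223 documents in the corpus
score = tfidf(c=50, t=100_000, d=223, f=100)
print(score)
```

Notice that a word found in every document (f equal to d) scores zero no matter how often it is used, because log(1) is zero.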

Thus, the problem boils down to 1) determining the values for c, t, d, and f for a given great idea, 2) summing the resulting TFIDF scores, 3) saving the results, and 4) repeating the process for each book in the corpus. Here, more exactly, is how I am initially doing such a thing:

  1. Build corpus – In a previous posting, “Collecting the Great Books“, I described how I first collected 223 of the roughly 250 Great Books.
  2. Index corpus – The processes used to calculate the TFIDF values of c and t are trivial because any number of computer programs do such a thing quickly and readily. In our case, the value of d is a constant — 223. On the other hand, trivial methods for determining the number of documents containing a given word (f) are not scalable as the size of a corpus increases. Because an index is essentially a list of words combined with the pointers to where the words can be found, an index proves to be a useful tool for determining the value of f. Index a corpus. Search the index for a word. Get back the number of hits and use it as the value for f. Lucene is currently the gold standard when it comes to open source indexers. Solr — an enhanced and Web Services-based interface to Lucene — is the indexer used in this process. The structure of the local index is rudimentary: id, author, title, URL, and full text. Each of the metadata values is pulled out of a previously created index file — great-books.xml — while the full text is read from the file system. The whole lot is then stuffed into Solr. A program called index.pl does this work. Another program called search.pl was created simply for testing the validity of the index.
  3. Count words and determine readability – A Perl module called Lingua::EN::Fathom does a nice job of counting the number of words in a file, thus providing me with a value for t. Along the way it also calculates a number of “readability” scores — values used to determine the necessary education level of a person needed to understand a given text. While I had “opened the patient” I figured it would be a good idea to take note of this information. Given the length of a book as well as its readability scores, I enable myself to answer questions such as, “Are longer books more difficult to read?” Later on, given my Great Ideas Coefficient, I will be able to answer questions such as “Is the length of a book a determining factor in ‘greatness’?” or “Are ‘great’ books more difficult to read?”
  4. Calculate TFIDF – This is the fuzziest and most difficult part of the measurement process. Using Lingua::EN::Fathom again I find all of the unique words in a document, stem them with Lingua::Stem::Snowball, and calculate the number of times each stem occurs. This gives me a value for c. I then loop through each great idea, stem it, and search the index for the stem, thus returning a value for f. For each idea I now have values for c, t, d, and f enabling me to calculate TFIDF — ( c / t ) * log( d / f ).
  5. Calculate the Great Ideas Coefficient – This is trivial. Keep a running sum of all the great idea TFIDF scores.
  6. Go to Step #4 – Repeat this process for each of the 102 great ideas.
  7. Save – After all the various scores (number of words, readability scores, TFIDF scores, and Great Ideas Coefficient) have been calculated I save each to my pseudo database file called great-books.xml. Each is stored as an attribute associated with a book’s unique identifier. Later I will use the contents of this file as the basis of my statistical analysis.
  8. Go to Step #3 – Repeat this process for each book in the corpus, and in this case 223 times.
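The steps above can be boiled down to a short sketch. It is written in Python and is emphatically not measure.pl. There is no Solr, no stemming, and no readability scoring here. The value of f is found by brute force over a toy corpus, the natural logarithm is assumed, and the texts, identifiers, and function names are all invented for illustration.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def measure(corpus, ideas):
    """Return word count, TFIDF scores, and coefficient for each book."""
    tokens = {key: tokenize(text) for key, text in corpus.items()}
    d = len(corpus)
    # f: number of documents containing each idea (the job the index does)
    f = {idea: sum(1 for words in tokens.values() if idea in words)
         for idea in ideas}
    results = {}
    for key, words in tokens.items():            # step 8: each book
        t = len(words)                            # step 3: count words
        scores = {}
        for idea in ideas:                        # step 6: each great idea
            c = words.count(idea)
            # step 4: TFIDF, skipping ideas the book never mentions
            scores[idea] = (c / t) * math.log(d / f[idea]) if c else 0.0
        results[key] = {                          # steps 5 and 7: sum and save
            "words": t,
            "tfidf": scores,
            "coefficient": sum(scores.values()),
        }
    return results

# a toy corpus and a toy list of "great ideas"
corpus = {
    "book-a": "love and war and love",
    "book-b": "war and peace",
    "book-c": "art for the sake of art",
}
results = measure(corpus, ["love", "war", "art"])
greatest = max(results, key=lambda key: results[key]["coefficient"])
print(greatest, results[greatest]["coefficient"])
```

The brute-force search for f is exactly the part that does not scale, which is why the real process hands that job to an index.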

Of course I didn’t do all of this by hand, and the program I wrote to do the work is called measure.pl.

The result is my pseudo database file — great-books.xml. This is my data set. It keeps track of all my information in a human-readable, application- and operating system-independent manner. Very nice. If there is only one file you download from this blog posting, then it should be this file. Using it you will be able to create your own corpus and do your own analysis.

The process outlined above is far from perfect. First, there are a few false negatives. For example, the great idea “universe” returned a TFIDF value of zero (0) for every document. Obviously this is incorrect, and I think the error has something to do with the stemming and/or indexing subprocesses. Second, the word “being”, as calculated by TFIDF, is far and away the “greatest” idea. I believe this is true because the word “being” is… being counted as both a noun as well as a verb. This points to a different problem — the ambiguity of the English language. While all of these issues will admittedly skew the final results, I do not think they negate the possibility of meaningful statistical investigation. At the same time it will be necessary to refine the measurement process to reduce the number of “errors”.

Measurement, the humanities, and library science

Measurement is one of the fundamental qualities of science. The work of Archimedes is the prototypical example. Kepler and Galileo took the process to another level. Newton brought it to full flower. Since Newton the use of measurement — the assignment of mathematical values — applied against observations of the natural world and human interactions has given rise to the physical and social sciences. Unlike studies in the humanities, science is repeatable and independently verifiable. It is objective. Such is not a value judgment, merely a statement of fact. While the sciences seem cold, hard, and dry, the humanities are subjective, appeal to our spirit, give us a sense of purpose, and tend to synthesize our experiences into a meaningful whole. Both the scientific and humanistic thinking processes are necessary for us to make sense of the world around us. I call these combined processes “arscience“.

The library profession could benefit from the greater application of measurement. In my opinion, too many of the profession’s day-to-day as well as strategic decisions are based on anecdotal evidence and gut feelings. Instead of basing our actions on data, actions are based on tradition. “This is the way we have always done it.” This is medieval, and consequently, change comes very slowly. I sincerely believe libraries are not going away any time soon, but I do think the profession would remain relevant longer if librarians were to do two things: 1) truly exploit the use of computers, and 2) base a greater number of their decisions on data — measurement — as opposed to opinion. Let’s call this library science.

Collecting the Great Books

Sunday, June 13th, 2010

In an effort to answer the question, “How ‘great’ are the Great Books?”, I need to mirror the full texts of the Great Books. This posting describes the initial process I am using to do such a thing, but the important thing to note is that this process is more about librarianship than it is about software.

Background

The Great Books is/was a 60-volume set of content intended to further a person’s liberal arts education. About 250 “books” in all, it consists of works by Homer, Aristotle, Augustine, Chaucer, Cervantes, Locke, Gibbon, Goethe, Marx, James, Freud, etc. There are a few places on the ‘Net where the complete list of authors/titles can be read. One such place is a previous blog posting of mine. My goal is to use digital humanities computing techniques to statistically describe the works and use these descriptions to supplement a person’s understanding of the texts. I then hope to apply these same techniques to other corpora. To accomplish this goal I first need to acquire full text versions of the Great Books. This posting describes how I am initially going about it.

Mirroring and caching the Great Books

All of the books of the Great Books were written by “old dead white men”. It is safe to assume the texts have been translated into a myriad of languages, including English, and it is safe to assume the majority exist in the public domain. Moreover, with the advent of the Web and various digitizing projects, it is safe to assume quality information gets copied forward and will be available for downloading. All of this has proven to be true. Through the use of Google and a relatively small number of repositories (Project Gutenberg, Alex Catalogue of Electronic Texts, Internet Classics Archive, Christian Classics Ethereal Library, Internet Archive, etc.), I have been able to locate and mirror 223 of the roughly 250 Great Books. Here’s how:

  1. Bookmark texts – Trawl the Web for the Great Books and use Delicious to bookmark links to plain text versions translated into English. Firefox combined with the Delicious extension has proven to be very helpful in this regard. My bookmarks should be located at http://delicious.com/ericmorgan/gb.
  2. Save and edit bookmarks file – Delicious gives you the option to save your bookmarks file locally. The result is a bogus HTML file intended to be imported into Web browsers. It contains the metadata used to describe your bookmarks such as title, notes, and URLs. After exporting my bookmarks to the local file system, I contorted the bogus HTML into rudimentary XML so I could systematically read it for subsequent processing.
  3. Extract URLs – Using a 7-line program called bookmarks2urls.pl, I loop through the edited bookmarks file and output all the URLs.
  4. Mirror content – Because I want/need to retain a pristine version of the original texts, I feed the URLs to wget and copy the texts to a local directory. This use of wget is combined with the output of Step #3 through a brain-dead shell script called mirror.sh.
  5. Create corpus – The mirrored files are poorly named; using just the mirror it is difficult to know what “great book” hides inside files named annals.mb.txt, pg2600.txt, or whatever. Moreover, no metadata is associated with the collection. Consequently I wrote a program — build-corpus.pl — that loops through my edited bookmarks file, extracts the necessary metadata (author, title, and URL), downloads the remote texts, saves them locally with a human-readable filename, creates a rudimentary XHTML page listing each title, and creates an XML file containing all of the metadata generated to date.
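
Steps #3 and #5 can be sketched in a few lines of Perl. This is a hedged illustration and not bookmarks2urls.pl or build-corpus.pl themselves: the shape of the edited bookmarks file (one bookmark element with href, author, and title attributes), the example.org addresses, and the subroutine names are all assumptions, and a regular expression stands in for a real XML parser to keep the sketch self-contained.

```perl
use strict;
use warnings;

# Hypothetical edited bookmarks file: one <bookmark> element per text.
my $xml = <<'XML';
<bookmarks>
  <bookmark href="http://example.org/texts/text-001.txt" author="Tolstoy" title="War and Peace" />
  <bookmark href="http://example.org/texts/text-002.txt" author="Tacitus" title="The Annals" />
</bookmarks>
XML

# Return a list of hash references (href, author, title) in document order.
sub bookmarks {
    my ($xml) = @_;
    my @records;
    while ( $xml =~ m{<bookmark\s+([^>]*?)/?>}g ) {
        my $attributes = $1;
        my %record;
        while ( $attributes =~ m{(\w+)="([^"]*)"}g ) { $record{$1} = $2 }
        push @records, \%record;
    }
    return @records;
}

# Build a human-readable file name from the author and title.
sub filename {
    my ($record) = @_;
    my $name = lc "$record->{author}-$record->{title}";
    $name =~ s/[^a-z0-9]+/-/g;    # collapse anything else into hyphens
    $name =~ s/^-|-$//g;          # trim leading and trailing hyphens
    return "$name.txt";
}

# Step #3: list the URLs; Step #5: pair each URL with its local name.
my @records = bookmarks($xml);
print "$_->{href}\n" for @records;
print filename($_), "\t", $_->{href}, "\n" for @records;
```

The first loop produces the sort of list that can be fed to wget, one URL per line. The second shows how the metadata, rather than the remote file name, decides what each text is called locally.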

The results of this 5-step process include:

The most important file, by far, is the metadata file. It is intended to be a sort of application- and operating system-independent database. Given this file, anybody ought to be able to duplicate the analysis I propose to do later. If there is only one file you download from this blog posting, it should be the metadata file — great-books.xml.

The collection process is not perfect. I was unable to find many of the works of Archimedes, Copernicus, Kepler, Newton, Galileo, or Freud. For all but Freud, I attribute this to the lack of translations, but I suppose I could stoop to the use of poorly OCR’ed texts from Google Books. I attribute the unavailability of Freud to copyright issues. There’s no getting around that one. A few times I located HTML versions of desired texts, but HTML will ultimately skew my analysis. Consequently I used a terminal-based program called lynx to convert and locally save the remote HTML to a plain text file. I then included that file into my corpus. Alas, there are always ways to refine collections. Like software, they are never done.
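
The HTML-to-text conversion can be scripted. The sketch below assumes lynx is installed and on the path; the exact options used for the collection are not recorded here, so -dump (write the rendered page to standard output) and -nolist (omit the trailing list of links) are one reasonable choice, and the subroutine names and the example.org address are hypothetical.

```perl
use strict;
use warnings;

# Build the lynx command that renders a remote HTML page as plain text.
sub lynx_command {
    my ($url) = @_;
    return ( 'lynx', '-dump', '-nolist', $url );
}

# Run lynx and save its output; dies if lynx cannot be started or fails.
sub html2text {
    my ( $url, $file ) = @_;
    open my $in,  '-|', lynx_command($url) or die "Can't run lynx: $!";
    open my $out, '>',  $file              or die "Can't write $file: $!";
    print {$out} $_ while <$in>;
    close $out or die "Can't close $file: $!";
    close $in  or die "lynx failed on $url";
}

# example (not run here):
# html2text( 'http://example.org/some-great-book.html', 'some-great-book.txt' );
```

Passing the command as a list rather than a single string keeps the URL away from the shell, which matters when addresses contain ampersands or question marks.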

Summary — Collection development, acquisitions, and cataloging

The process outlined above is really about librarianship and not software. Specifically, it is about collection development, acquisitions, and cataloging. I first needed to articulate a development policy. While this posting did not explicitly describe the policy, it did outline why I wanted to create the collection as well as a few of each item’s necessary qualities. The process above implemented a way to actually get the content (acquisitions). Finally, I described, or “cataloged”, my content, albeit in a very rudimentary form.

It is an understatement to say the Internet has changed the way data, information, and knowledge are collected, preserved, organized, and disseminated. By extension, librarianship needs to change in order to remain relevant with the times. Our profession spends much of its time trying to refine old processes. It is like trying to figure out how to improve the workings of a radio when people have moved on to the use of televisions instead. While traditional library processes are still important, they are not as important as they used to be.

The processes outlined above illustrate one possible way librarianship can change the how’s of its work while retaining its what’s.

Not really reading

Wednesday, June 9th, 2010

Using a number of rudimentary digital humanities computing techniques, I tried to practice what I preach and extract the essence from a set of journal articles. I feel like the process met with some success, but I was not really reading.

The problem

A set of twenty-one (21) essays on the future of academic librarianship was recently brought to my attention:

Leaders Look Toward the Future – This site compiled by Camila A. Alire and G. Edward Evans offers 21 essays on the future of academic librarianship written by individuals who represent a cross-section of the field from the largest institutions to specialized libraries.

Since I was too lazy to print and read all of the articles mentioned above, I used this as an opportunity to test out some of my “services against text” ideas.

The solution

Specifically, I used a few rudimentary digital humanities computing techniques to glean highlights from the corpus. Here’s how:

  1. First I converted all of the PDF files to plain text files using a program called pdftotext — a part of xpdf. I then concatenated the whole lot together, thus creating my corpus. This process is left up to you — the reader — as an exercise because I don’t have copyright hutzpah.
  2. Next, I used Wordle to create a word cloud. Not a whole lot of new news here, but look how big the word “information” is compared to the word “collections”.

  3. Using a program of my own design, I then created a textual version of the word cloud listing the top fifty most frequently used words and the number of times they appeared in the corpus. Again, not a whole lot of new news. The articles are obviously about academic libraries, but notice how the word “electronic” is listed and not the word “book”.
  4. Things got interesting when I created a list of the most significant two-word phrases (bi-grams). Most of the things are nouns, but I was struck by “will continue” and “libraries will” so I applied a concordance application to these phrases and got lists of snippets. Some of the more interesting ones include: libraries will be “under the gun” financially, libraries will be successful only if they adapt, libraries will continue to be strapped for staffing, libraries will continue to have a role to play, will continue their major role in helping, will continue to be important, will continue to shift toward digital information, will continue to seek new opportunities.
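
The counting behind Steps #3 and #4 can be sketched in plain Perl. This is an illustration under stated assumptions and not the programs used above: the sample text and stop word list are invented, the subroutine names are hypothetical, and bi-grams are ranked by raw frequency here, which is a simpler stand-in for whatever measure of “significance” the original list used.

```perl
use strict;
use warnings;

# Split text into lowercase words, keeping internal apostrophes.
sub tokenize {
    my ($text) = @_;
    return map { lc $_ } $text =~ /[A-Za-z]+(?:'[A-Za-z]+)*/g;
}

# Count how often each item occurs; returns a hash reference.
sub tally {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

# Adjacent word pairs, skipping any pair that contains a stop word.
# Sentence boundaries are ignored, so some pairs span two sentences.
sub bigrams {
    my ( $words, $stop ) = @_;
    my @pairs;
    for my $i ( 0 .. $#$words - 1 ) {
        my ( $first, $second ) = @$words[ $i, $i + 1 ];
        next if $stop->{$first} or $stop->{$second};
        push @pairs, "$first $second";
    }
    return @pairs;
}

# The N most frequent keys of a tally, ties broken alphabetically.
sub top {
    my ( $count, $n ) = @_;
    my @keys = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    $n = @keys if $n > @keys;
    return @keys[ 0 .. $n - 1 ];
}

# Keyword-in-context snippets: $width characters either side of each match.
sub concordance {
    my ( $text, $phrase, $width ) = @_;
    my @snippets;
    while ( $text =~ /\Q$phrase\E/gi ) {
        my $start = $-[0] - $width;
        $start = 0 if $start < 0;
        my $snippet = substr $text, $start, $+[0] - $start + $width;
        $snippet =~ s/\s+/ /g;
        push @snippets, $snippet;
    }
    return @snippets;
}

# hypothetical corpus and stop words
my $text = 'Libraries will continue to adapt. Libraries will continue to collect. Readers will adapt.';
my %stop = map { $_ => 1 } qw( to the a of and );

my @words    = tokenize($text);
my $unigrams = tally(@words);
my $bigrams  = tally( bigrams( \@words, \%stop ) );

print "$_\t$unigrams->{$_}\n" for top( $unigrams, 5 );
print "$_\t$bigrams->{$_}\n"  for top( $bigrams,  5 );
print "$_\n" for concordance( $text, 'will continue', 15 );
```

Run against a real corpus, the first list is the textual word cloud, the second is the list of two-word phrases, and the third is the set of snippets for any phrase that looks interesting.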

Yes, there may very well be some subtle facts I missed by not reading the full texts, but I think I got a sense of what the articles discussed. It would be interesting to sit a number of people down, have them read the articles, and then have them list out a few salient sentences. To what degree would their result be the same or different from mine?

I was able to write the programs from scratch, do the analysis, and write the post in about two hours, total. It would have taken me that long to read the articles. Just think what a number of librarians could do, and how much time could be saved if this system were expanded to support just about any plain text data.

About Infomotions Image Gallery: Flickr as cloud computing

Saturday, May 22nd, 2010

This posting describes the whys and wherefores behind the Infomotions Image Gallery.

Photography

I was introduced to photography during library school, specifically, when I took a multi-media class. We were given film and movie cameras, told to use the equipment, and through the process learn about the medium. I took many pictures of very tall smoke stacks and classical-looking buildings. I also made a stop-action movie where I step-by-step folded an origami octopus and underwater sea diver while a computer played the Beatles’ “Octopus’s Garden” in the background. I’d love to resurrect that 16mm film.

I was introduced to digital photography around 1995 when Steve Cisler (Apple Computer) gave me a QuickTake camera as a part of a payment for writing a book about Macintosh-based HTTP servers. That camera was pretty much fun. If I remember correctly, it took 8-bit images and could store about twenty-four of them at a time. The equipment worked perfectly until my wife accidentally dropped it into a pond. I still have the camera, somewhere, but it only works if it is plugged into an electrical socket. Since then I’ve owned a few other digital cameras and one or two digital movie cameras. They have all been more than simple point-and-shoot devices, but at the same time, they have always had more features than I’ve ever really exploited.

Over the years I mostly used the cameras to document the places I’ve visited. I continue to photograph buildings. I like to take macro shots of flowers. Venuses are always appealing. Pictures of food are interesting. In the self-portraits one is expected to notice the background, not necessarily the subject of the image. I believe I’m pretty good at composition. When it comes to color I’m only inspired when the sun is shining bright, and that makes some of my shots overexposed. I’ve never been very good at photographing people. I guess that is why I prefer to take pictures of statues. All things library and books are a good time. I wish I could take better advantage of focal lengths in order to blur the background but maintain a sharp focus in the foreground. The tool requires practice. I don’t like to doctor the photographs with effects. I don’t believe the result represents reality. Finally, I often ask myself an aesthetic question, “If I was looking through the camera to take the picture, then did I really see what was on the other side?” After all, my perception was filtered through an external piece of equipment. I guess I could ask the same question of all my perceptions since I always wear glasses.

The Infomotions Image Gallery is simply a collection of my photography, sans personal family photos. It is just another example of how I am trying to apply the principles of librarianship to the content I create. Photographs are taken. Individual items are selected, and the collection is curated. Given the available resources, metadata is applied to each item, and the whole is organized into sets. Every year the newly created images are archived to multiple mediums for preservation purposes. (I really ought to make an effort to print more of the images.) Finally, an interface is implemented allowing people to access the collection.

Enjoy.

Images: orange hot stained glass, Tilburg University sculpture, coastal home, beach sculpture, metal book, thistle, DSCN5242, Three Sisters

Flickr as cloud computing

This section describes how the Gallery is currently implemented.

About ten years ago I began to truly manage my photo collection using Apple’s iPhoto. At just about the same time I purchased an iPhoto add-on called BetterHTMLExport. Using a macro language, this add-on enabled me to export sets of images to index and detail pages complete with titles, dates, and basic numeric metadata such as exposure, f-stop, etc. The process worked but the software grew long in the tooth, was sold to another company, and was always a bit cumbersome. Moreover, maintaining the metadata was tedious, inhibiting my desire to keep it up to date. Too much editing here, exporting there, and uploading to the third place. To make matters worse, people expect to comment on the photos, put them into their own sets, and watch some sort of slide show. Enter Flickr and a jQuery plug-in called ColorBox.

After learning how to use iPhoto’s ability to publish content to Flickr, and after taking a closer look at Flickr’s application programmer interface (API), I decided to use Flickr to host my images. The idea was to: 1) maintain the content on my local file system, 2) upload the images and metadata to Flickr, and 3) programmatically create an interface to the content on my website. The result was a more streamlined process and a set of Perl scripts implementing a cleaner user interface. I was entering the realm of cloud computing. The workflow is described below:

  1. Take photographs – This process is outlined in the previous section.
  2. Import photographs – Import everything, but weed right away. I’m pretty brutal in this regard. I keep neither duplicate nor very similar shots. No (or very very few) out-of-focus or poorly composed shots are kept either.
  3. Add titles – Each photo gets some sort of title. Sometimes they are descriptive. Sometimes they are rather generic. After all, how many titles can different pictures of roses have? If I were really thorough I would give narrative descriptions to each photo.
  4. Make sets – Group the imported photos into a set and then give a title to the set. Again, I ought to add narrative descriptions, but I don’t. Too lazy.
  5. Add tags – Using iPhoto’s keywords functionality, I make an effort to “tag” each photograph. Tags are rather generic: flower, venus, church, me, food, etc.
  6. Publish to Flickr – I then use iPhoto’s sharing feature to upload each newly created set to Flickr. This works very well and saves me the time and hassle of converting images. This same functionality works in reverse. If I use Flickr’s online editing functions, changes are reflected on my local file system after a refresh process is done. Very nice.
  7. Re-publish to Infomotions – Using a system of Perl scripts I wrote called flickr2gallery I then create sets of browsable pages from the content saved on Flickr.

Using this process I can focus more on my content and less on my presentation. It makes it easier for me to focus on the images and their metadata and less on how the content will be displayed. Graphic design is not necessarily my forte.

Flickr2gallery is a suite of Perl scripts and plain text files:

  1. tags2gallery.pl – Used to create pages of images based on photos’ tags.
  2. sets2gallery.pl – Used to create pages of image sets as well as the image “database”.
  3. make-home.pl – Used to create the Image Gallery home page.
  4. flickr2gallery.sh – A shell script calling each of the three scripts above and thus (re-)building the entire Image Gallery subsite. Currently, the process takes about sixty seconds.
  5. images.db – A tab-delimited list of each photograph’s local home page, title, and Flickr thumbnail.
  6. Images.pm – A really-rudimentary Perl module containing a single subroutine used to return a list of HTML img elements filled with links to random images.
  7. random-images.pl – Designed to be used as a server-side include, calls Images.pm to display sets of random images from images.db.
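
The idea behind Images.pm and random-images.pl can be sketched as follows. This is an illustration and not the module itself: the column order follows the description of images.db above (local home page, title, thumbnail), while the sample rows, the example.org addresses, and the subroutine names are assumptions.

```perl
use strict;
use warnings;
use List::Util qw( shuffle );

# Parse tab-delimited lines of the form: page <TAB> title <TAB> thumbnail
sub read_images {
    my ($fh) = @_;
    my @images;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        my ( $page, $title, $thumb ) = split /\t/, $line;
        push @images, { page => $page, title => $title, thumb => $thumb };
    }
    return @images;
}

# Minimal escaping so values are safe inside HTML attributes.
sub escape {
    my ($string) = @_;
    $string =~ s/&/&amp;/g;
    $string =~ s/</&lt;/g;
    $string =~ s/>/&gt;/g;
    $string =~ s/"/&quot;/g;
    return $string;
}

# Return up to $n linked img elements chosen at random.
sub random_images {
    my ( $n, @images ) = @_;
    my @picked = shuffle @images;
    splice @picked, $n if @picked > $n;
    return map {
        sprintf '<a href="%s"><img src="%s" alt="%s" /></a>',
            escape( $_->{page} ), escape( $_->{thumb} ), escape( $_->{title} )
    } @picked;
}

# hypothetical database, read from memory instead of from images.db
my $db = join '', map { join( "\t", @$_ ) . "\n" } (
    [ '/gallery/roses.html',  'Roses',         'http://example.org/thumbs/roses_s.jpg' ],
    [ '/gallery/venus.html',  'Venus & Cupid', 'http://example.org/thumbs/venus_s.jpg' ],
    [ '/gallery/church.html', 'Church',        'http://example.org/thumbs/church_s.jpg' ],
);
open my $fh, '<', \$db or die "Can't read database: $!";
my @images = read_images($fh);
close $fh;

print "$_\n" for random_images( 2, @images );
```

Called from a server-side include, a script like this prints a different handful of thumbnails on every page view without touching Flickr at request time, since everything it needs is already in the local file.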

I know the Flickr API has been around for quite a while, and I know I’m a Johnny Come Lately when it comes to learning how to use it, but that does not mean it can’t be outlined here. The API provides a whole lot of functionality. Reading and writing of image content and metadata. Reading and writing information about users, groups, and places. Using the REST-like interface the programmer constructs a command in the form of a URL. The URL is sent to Flickr via HTTP. Responses are returned in easy-to-read XML.

A good example is the way I create my pages of images with a given tag. First I denote a constant which is the root of a Flickr tag search. Next, I define the location of the Infomotions pages on Flickr. Then, after getting a list of all of my tags, I search Flickr for images using each tag as a query. These results are then looped through, parsed, and built into a set of image links. Finally, the links are incorporated into a template and saved to a local file. The heart of the process is listed below:

  use CGI;
  use HTTP::Request;
  use LWP::UserAgent;
  use XML::XPath;
  
  # root of a tag search; concatenated so the URL contains no whitespace
  use constant S => 'http://api.flickr.com/services/rest/?' .
                    'method=flickr.photos.search&' .
                    'api_key=YOURKEY&user_id=YOURID&tags=';
  
  # location of the Infomotions pages on Flickr
  use constant F => 'http://www.flickr.com/photos/infomotions/';
  
  # get list of all tags here
  
  # find photos with this tag
  my $ua       = LWP::UserAgent->new;
  my $request  = HTTP::Request->new( GET => S . $tag );
  my $response = $ua->request( $request );
  
  # process each photo
  my $parser = XML::XPath->new( xml => $response->content );
  my $nodes  = $parser->find( '//photo' );
  my $cgi    = CGI->new;
  my $images = '';
  foreach my $node ( $nodes->get_nodelist ) {
  
    # parse
    my $id     = $node->getAttribute( 'id' );
    my $title  = $node->getAttribute( 'title' );
    my $farm   = $node->getAttribute( 'farm' );
    my $server = $node->getAttribute( 'server' );
    my $secret = $node->getAttribute( 'secret' );
  
    # build image links; the _s suffix requests the small square thumbnail
    my $thumb  = "http://farm$farm.static.flickr.com/$server/$id" . 
                 '_' . $secret . '_s.jpg';
    my $full   = "http://farm$farm.static.flickr.com/$server/$id" . 
                 '_' . $secret . '.jpg';
    my $flickr = F . "$id/";
  
    # build list of images
    $images .= $cgi->a({ href  => $full, 
                         rel   => 'slideshow',
                         title => "<a href='$flickr'>Details on Flickr</a>"
                       },
                       $cgi->img({ alt => $title, src => $thumb, 
                                   border => 0, hspace => 1, vspace => 1 }));
  
  }
  
  # save image links to file here

Notice the rel attribute (slideshow) in each of the images’ anchor elements. These attributes are used as selectors in a jQuery plug-in called ColorBox. In the head of each generated HTML file is a call to ColorBox:

  <script type="text/javascript">
    $(document).ready(function(){
      $("a[rel='slideshow']").colorbox({ slideshowAuto: false, 
                                         current: "{current} of {total}",
                                         slideshowStart: 'Slideshow',
                                         slideshowStop: 'Stop',
                                         slideshow: true,
                                         transition:"elastic" });
      });
  </script>

Using this plug-in I am able to implement a simple slideshow when the user clicks on any image. Each slideshow display consists of simple navigation and title. In my case the title is really a link back to Flickr where the user will be able to view more detail about the image, comment, etc.

Images: barn ceiling, kiln, Hesburgh Library, self-portrait, Giant Eraser, birds, Christian Scientist Church, Redwood Library

Summary and conclusion

I am an amateur photographer, and the fruits of this hobby are online here for sharing. If you use them, then please give credit where credit is due.

The use of Flickr as a “cloud” to host my images is very useful. It enables me to mirror my content in more than one location as well as provide access in multiple ways. When the Library of Congress announced they were going to put some of their image content on Flickr I was a bit taken aback, but after learning how the Flickr API can be exploited I think there are many opportunities for libraries and other organizations to do the same thing. Using the generic Flickr interface is one way to provide access, but enhanced and customized access can be implemented through the API. Lots of food for thought. Now to apply the same process to my movies by exploiting YouTube.

Counting words

Saturday, April 10th, 2010

When I talk about “services against text” I usually get blank stares from people. When I think about it more, many of the services I enumerate are based on the counting of words. Consequently, I spent some time doing just that — counting words.

I wanted to analyze the content of a couple of the mailing lists I own/moderate, specifically Code4Lib and NGC4Lib. Who are the most frequent posters? What words are used most often in the subject lines, and what words are used most often in the body of the messages? Using a hack I wrote (mine-mail.pl) I was able to generate simple tables of data:

I then fed these tables to Wordle to create cool looking images. I also fed these tables to a second hack (dat2cloud.pl) to create not-even-close-to-valid HTML files in the form of hyperlinked tag clouds. Below are the fruits of these efforts:


image of names

tag cloud of names

image of subjects

tag cloud of subjects

image of words

tag cloud of words

The next step is to plot the simple tables on a Cartesian plane. In other words, graph the data. Wish me luck.
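
The tabulation at the heart of this posting, counting posters and subject-line words, can be sketched as follows. This is not mine-mail.pl: the sketch assumes an mbox-style archive in which each message starts with a "From " separator line and its headers end at the first blank line, and the sample messages, addresses, and subroutine name are invented for illustration.

```perl
use strict;
use warnings;

# Tally senders and subject-line words from an mbox-style archive.
sub mine_mail {
    my ($fh) = @_;
    my ( %posters, %subjects );
    my $in_header = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^From / ) { $in_header = 1; next }    # message separator
        if ( $line eq '' )       { $in_header = 0; next }    # end of headers
        next unless $in_header;                              # skip message bodies
        if ( $line =~ /^From:\s*(.+)/i ) {
            $posters{ lc $1 }++;
        }
        elsif ( $line =~ /^Subject:\s*(.+)/i ) {
            my $subject = lc $1;
            1 while $subject =~ s/^(?:re|fwd?):\s*//;        # drop reply prefixes
            $subjects{$_}++ for $subject =~ /[a-z0-9]+/g;
        }
    }
    return ( \%posters, \%subjects );
}

# hypothetical archive, read from memory instead of from a file
my $mbox = <<'MBOX';
From alice@example.org Sat Apr 10 09:00:00 2010
From: Alice <alice@example.org>
Subject: Indexing MARC records

Body text. Subject: not a header

From bob@example.org Sat Apr 10 10:00:00 2010
From: Bob <bob@example.org>
Subject: Re: Indexing MARC records

From: this line is in a body and is not counted

From alice@example.org Sat Apr 10 11:00:00 2010
From: Alice <alice@example.org>
Subject: Solr and MARC

MBOX

open my $fh, '<', \$mbox or die "Can't read archive: $!";
my ( $posters, $subjects ) = mine_mail($fh);
close $fh;

# output simple two-column tables, most frequent first
for my $table ( $posters, $subjects ) {
    print "$_\t$table->{$_}\n"
        for sort { $table->{$b} <=> $table->{$a} or $a cmp $b } keys %$table;
    print "\n";
}
```

The two-column output, one item and its count per line, is the sort of simple table that can be handed to a word cloud generator or, eventually, a plotting program.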

My first ePub file

Sunday, March 21st, 2010

I made available my first ePub file today.

Screen shot

EPub is the current de facto standard file format for ebook readers. After a bit of reading, I found the format is not too difficult since all the files are plain-text XML files or images. The various metadata files are ePub-specific XML. The content is XHTML. The graphics can be in any number of formats. The whole lot is compressed into a single file using the zip “standard”, and suffixed with a .epub extension.

Since much of my content has been previously saved as TEI files, the process of converting my content into ePub is straightforward. Use XPath to extract metadata. Use XSLT to transform the TEI to XHTML. Zip up the whole thing and make it available on the Web. I have found the difficult part to be the images. It is hard to figure out where one’s images are saved and then incorporate them into the ePub file. I will have to be a bit more standard with my image locations in the future and/or I will need to do a bit of a retrospective conversion process. (I probably will go the second route. Crazy.)
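
The fixed parts of the package are small enough to sketch. Two files are the same in every ePub: a file named mimetype containing exactly the string application/epub+zip, and META-INF/container.xml, which points at the package (OPF) file. The sketch below is not the creation script mentioned in this posting; the OEBPS/content.opf path is a common convention assumed here, and the subroutine names are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw( dirname );
use File::Path qw( make_path );
use File::Temp qw( tempdir );

# Ordered list of [ path, content ] pairs for the fixed parts of an ePub.
# The mimetype entry comes first and has no trailing newline.
sub epub_skeleton {
    my ($opf_path) = @_;    # location of the package file inside the archive
    my $container = <<"XML";
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="$opf_path" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
XML
    return (
        [ 'mimetype',               'application/epub+zip' ],
        [ 'META-INF/container.xml', $container ],
    );
}

# Write each part beneath $root, creating directories as needed.
sub write_skeleton {
    my ( $root, @parts ) = @_;
    for my $part (@parts) {
        my ( $path, $content ) = @$part;
        my $file = "$root/$path";
        make_path( dirname($file) );
        open my $out, '>', $file or die "Can't write $file: $!";
        print {$out} $content;
        close $out or die "Can't close $file: $!";
    }
}

# build the skeleton in a temporary directory
my $root  = tempdir( CLEANUP => 1 );
my @parts = epub_skeleton('OEBPS/content.opf');
write_skeleton( $root, @parts );
print "$_->[0]\n" for @parts;
```

When the directory is zipped, the mimetype file needs to be the first entry in the archive and stored without compression; getting that order wrong is the sort of thing a validator such as epubcheck reports. The XHTML content, images, and package file produced by the XPath and XSLT steps are added after it.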

Loading my ePub into Firefox’s EPUBReader worked just fine. The whole thing rendered pretty well in Stanza too. More importantly, it validated against a Java-based tool called epubcheck. Whew!

While I cogitate how to convert my content, you can download my first ePub file as well as the beginnings of my ePub creation script.

Enjoy?

P.S. I think the Apple iPad is going to have a significant impact on digital reading in the very near future. I’m preparing.


Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09
URL: http://infomotions.com/blog/