Archive for the ‘Hacks’ Category

Collecting water and putting it on the Web (Part III of III)

Thursday, September 3rd, 2009

This is Part III of an essay about my water collection, specifically a summary, opportunities for future study, and links to the source code. Part I described the collection’s whys and hows. Part II described the process of putting it on the Web.

Summary, future possibilities, and source code

There is no doubt about it. My water collection is eccentric but through my life time I have encountered four other people who also collect water. At least I am not alone.

Putting the collection on the Web is a great study in current technology. It includes relational database design. Doing input/output against the database through a programming language. Exploiting the “extensible” in XML by creating my own mark-up language. Using XSLT to transform the XML for various purposes: display as well as simple transformation. Literally putting the water collection on the map. Undoubtably technology will change, but the technology of my water collection is a representative reflection of the current use of computers to make things available on the Web.

I have made all the software a part of this system available here:

  1. SQL file sans any data – good for study of simple relational database
  2. SQL file complete with data – see how image data is saved in the database
  3. PHP scripts – used to do input/output against the database
  4. waters.xml – a database dump, sans images, in the form of an XML file
  5. waters.xsl – the XSLT used to display the browser interface
  6. waters2markers.xsl – transform water.xml into Google Maps XML file
  7. map.pl – implementation of Google Maps API

My water also embodies characteristics of librarianship. Collection. Acquisition. Preservation. Organization. Dissemination. The only difference is that the content is not bibliographic in nature.

There are many ways access to the collection could be improved. It would be nice to sort by date. It would be nice to index the content and make the collection searchable. I have given thought to transforming the WaterML into FO (Formatting Objects) and feeding the FO to a PDF processor like FOP. This could give me a printed version of the collection complete with high resolution images. I could transform the WaterML into an XML file usable by Google Earth providing another way to view the collection. All of these things are “left up the reader for further study”. Software is never done, nor are library collections.

River Lune
River Lune
Roman Bath
Roman Bath
Ogle Lake
Ogle Lake

Finally, again, why do I do this? Why do I collect the water? Why have a spent so much time creating a system for providing access to the collection? Ironically, I am unable to answer succinctly. It has something to do with creativity. It has something to do with “arsience“. It has something to do with my passion for the library profession and my ability to manifest it through computers. It has something to do with the medium of my art. It has something to do with my desire to share and expand the sphere of knowledge. “Idea. To be an idea. To be an idea and an example to others… Idea”. I really don’t understand it through and through.

Read all the posts in this series:

  1. The whys and hows of the water collection
  2. How the collection is put on the Web
  3. This post

Visit the water collection.

Collecting water and putting it on the Web (Part II of III)

Thursday, September 3rd, 2009

This is Part II of an essay about my water collection, specifically the process of putting it on the Web. Part I describes the whys and hows of the collection. Part III is a summary, provides opportunities for future study, and links to the source code.

Making the water available on the Web

As a librarian, I am interested in providing access to my collection(s). As a librarian who has the ability to exploit the use of computers, I am especially interested in putting my collection(s) on the Web. Unfortunately, the process is not as easy as the actual collection process, and there have been a number of processes along the way. When I was really into HyperCard I created a “stack” complete with pictures of my water, short descriptions, and an automatic slide show feature that played the sound of running water in the background. (If somebody asks, I will dig up this dinosaur and make it available.) Later I created a Filemaker Pro database of the collection, but that wasn’t as cool as the HyperCard implementation.

Mississippi River
Mississippi River
Mississippi River
Mississippi River
Mississippi River
Mississippi River

The current implementation is more modern. It takes advantage of quite a number of technologies, including:

  • a relational database
  • a set of PHP scripts that do input/output against the database
  • an image processor to create thumbnail images
  • an XSL processor to generate a browsable Web presence
  • the Google Maps API to display content on a world map

The use of each of these technologies is described in the following sections.

Relational database

ER diagram
ER diagram

Since 2002 I have been adding and maintaining newly acquired waters in a relational, MySQL, database. (Someday I hope to get the waters out of those cardboard boxes and add them to the database too. Someday.) The database itself is rather simple. Four tables: one for the waters, one for the collectors, a join table denoting who collected what, and a metadata table consisting of a single record describing the collection as a whole. The entity-relationship diagram illustrates the structure of the database in greater detail.

Probably the most interesting technical characteristic of the database is the image field of type mediumblob in the waters table. When it comes to digital libraries and database design, one of the perennial choices to make is where to save your content. Saving it outside your database makes your database smaller and more complicated but forces you to maintain links to your file system or the Internet where the actual content resides. This can be an ongoing maintenance nightmare and can side-step the preservation issues. On the other hand inserting your content inside the database allows you to keep your content all in once place while “marrying” it to up in your database application. Putting the content in the database also allows you to do raw database dumps making the content more portable and easier to back-up. I’ve designed digital library systems both ways. Each has its own strengths and weaknesses. This is one of the rarer times I’ve put the content into the database itself. Never have I solely relied on maintaining links to off-site content. Too risky. Instead I’ve more often mirrored content locally and maintained two links in the database: one to the local cache and another to the canonical website.

PHP scripts for database input/output

Sets of PHP scripts are used to create, maintain, and report against the waters database. Creating and maintaining database records is tedious but not difficult as long as you keep in mind that there are really only four things you need to do with any database: 1) create records, 2) find records, 3) edit records, and 4) delete records. All that is required is to implement each of these processes against each of the fields in each of the tables. Since PHP was designed for the Web, each of these processes is implemented as a Web page only accessible to myself. The following screen shots illustrate the appearance and functionality of the database maintenance process.

ER diagram
Admin home
ER diagram
Admin waters
ER diagram
Edit water

High-level menus on the right. Sub-menus and data-entry forms in the middle. Simple. One of the nice things about writing applications for oneself is the fact that you don’t have to worry about usability, just functionality.

The really exciting stuff happens when the reports are written against the database. Both of them are XML files. The first is a essentially a database dump — water.xml — complete with the collection’s over-arching metadata record, each of the waters and their metadata, and a list of collectors. The heart of the report-writing process includes:

  1. finding all of the records in the database
  2. converting and saving each water’s image as a thumbnail
  3. initializing the water record
  4. finding all of the water’s collectors
  5. adding each collector to the record
  6. going to Step #5 for each collector
  7. finishing the record
  8. going to Step #2 for each water
  9. saving the resulting XML to the file system

There are two hard parts about this process. The first, “MOGRIFY”, is a shelled out hack to the operating system using an ImageMagik utility to convert the content of the image field into a thumbnail image. Without this utility saving the image from the database to the file system would be problematic. Second, the SELECT statement used to find all the collectors associated with a particular water is a bit tricky. Not really to difficult, just a typical SQL join process. Good for learning relational database design. Below is a code snippet illustrating the heart of this report-writing process:

  # process every found row
  while ($r = mysql_fetch_array($rows)) {

    # get, define, save, and convert the image -- needs error checking
    $image     = stripslashes($r['image']);
    $leafname  = explode (' ' ,$r['name']);
    $leafname  = $leafname[0] . '-' . $r['water_id'] . '.jpg';
    $original  = ORIGINALS  . '/' . $leafname;
    $thumbnail = THUMBNAILS . '/' . $leafname;
    writeReport($original, $image);
    copy($original, $thumbnail);
    system(MOGRIFY . $thumbnail);

    # initialize and build a water record
    $report .= '<water>';
    $report .= "<name water_id='$r[water_id]' lat='$r[lat]' lng='$r[lng]'>" .
               prepareString($r['name']) . '</name>';
    $report .= '<date_collected>';
    $report .= "<year>$r[year]</year>";
    $report .= "<month>$r[month]</month>";
    $report .= "<day>$r[day]</day>";
    $report .= '</date_collected>';

    # find all the COLLECTORS associated with this water, and...
    $sql = "SELECT c.*
            FROM waters AS w, collectors AS c, items_for_collectors AS i
            WHERE w.water_id   = i.water_id
            AND c.collector_id = i.collector_id
            AND w.water_id     = $r[water_id]
            ORDER BY c.last_name, c.first_name";
    $all_collectors = mysql_db_query ($gDatabase, $sql);
    checkResults();

    # ...process each one of them
    $report .= "<collectors>";
    while ($c = mysql_fetch_array($all_collectors)) {

      $report .= "<collector collector_id='$c[collector_id]'><first_name>
                 $c[first_name]</first_name>
                 <last_name>$c[last_name]</last_name></collector>";

    }
    $report .= '</collectors>';

    # finish the record
    $report .= '<description>' . stripslashes($r['description']) .
               '</description></water>';

  }

The result is the following “WaterML” XML content — a complete description of a water, in this case water from Copenhagen:

  <water>
    <name water_id='87' lat='55.6889' lng='12.5951'>Canal
      surrounding Kastellet, Copenhagen, Denmark
    </name>
    <date_collected>
      <year>2007</year>
      <month>8</month>
      <day>31</day>
    </date_collected>
    <collectors>
      <collector collector_id='5'>
        <first_name>Eric</first_name>
        <last_name>Morgan</last_name>
    </collector>
    </collectors>
    <description>I had the opportunity to participate in the
      Ticer Digital Library School in Tilburg, The Netherlands.
      While I was there I also had the opportunity to visit the
      folks at
      <a href="http://indexdata.com">Index Data</a>, a company
      that writes and supports open source software for libraries.
      After my visit I toured around Copenhagen very quickly. I
      made it to the castle (Kastellet), but my camera had run out
      of batteries. The entire Tilburg, Copenhagen, Amsterdam
      adventure was quite informative.
    </description>
  </water>

When I first created this version of the water collection RSS was just coming on line. Consequently I wrote an RSS feed for the water, but then I got realistic. How many people want to get an RSS feed of my water. Crazy?!

XSL processing

Now that the XML file has been created an the images are saved to the file system, the next step is to make a browser-based interface. This is done though an XSLT style sheet and XSL processor called Apache2::TomKit.

Apache2::TomKit is probably the most eclectic component of my online water collection application. Designed to be a replacement for another XSL processor called AxKit, Apache2::TomKit enables the developer to create CGI-like applications, complete with HTTP GET parameters, in the form of XML/XSLT combinations. Specify the location of your XML files. Denote what XSLT files to use. Configure what XSLT processor to use. (I use LibXSLT.) Define an optional cache location. Done. The result is on-the-fly XSL transformations that work just like CGI scripts. The hard part is writing the XSLT.

The logic of my XSLT style sheet — waters.xsl — goes like this:

  1. Get input – There are two: cmd and id. Cmd is used to denote the desired display function. Id is used to denote which water to display
  2. Initialize output – This is pretty standard stuff. Display XHTML head elements and start the body.
  3. Branch – Depending on the value of cmd, display the home page, a collectors page, all the images, all the waters, or a specific water.
  4. Display the content – This is done with the thorough use of XPath expressions.
  5. Done – Complete the XHTML with a standard footer.

Of all the XSLT style sheets I’ve written in my career, waters.xsl is definitely the most declarative in nature. This is probably because the waters.xml file is really data driven as opposed mixed content. The XSLT file is very elegant but challenging for the typical Perl or PHP hacker to quickly grasp.

Once the integration of the XML file, the XSLT style sheet, and Apache2::TomKit is complete, I was able to design URL’s such as the following:

Okay. So its not very REST-ful; the URLs are not very “cool”. Sue me. I originally designed this in 2002.

Waters and Google Maps

In 2006 I used my water collection to create my first mash-up. It combined latitudes and longitudes with the Google Maps API.

Inserting maps into your Web pages via the Google API is a three-step process: 1) create an XML file containing latitudes and longitudes, 2) insert a call to the Google Maps javascript into the head of your HTML, and 3) call the javascript from within the body of your HTML.

For me, all I had to do was: 1) create new fields in my database for latitudes and longitudes, 2) go through each record in the database doing latitude and longitude data-entry, 3) write a WaterML file, 4) write an XSLT file transforming the WaterML into an XML file expected of Google Maps, 5) write a CGI script that takes latitudes and longitudes as input, 6) display a map, and 7) create links from my browser-based interface to the maps.

It may sound like a lot of steps, but it is all very logical, and taken bit by bit is relatively easy. Consequently, I am able to display a world map complete with pointers to all of my water. Conversely, I am able to display a water record and link its location to a map. The following two screen dumps illustrate the idea, and I try to get as close to the actual collection point as possible:

ER diagram
World map
ER diagram
Single water

Read all the posts in this series:

  1. The whys and hows of the water collection
  2. This post
  3. A summary, future directions, and source code

Visit the water collection.

Automatic metadata generation

Thursday, July 30th, 2009

I have been having a great deal of success extracting keywords and two-word phrases from documents and assigning them as “subject headings” to electronic texts — automatic metadata generation. In many cases but not all, the set of assigned keywords I’ve created are just as good if not better as the controlled vocabulary terms assigned by librarians.

The problem

The Alex Catalogue is a collection of roughly 14,000 electronic texts. The vast majority come from Project Gutenberg. Some come from the Internet Archive. The smallest number come from a defunct etext collection of Virginia Tech. All of the documents are intended to surround the themes of American and English literature and Western philosophy.

With the exception of the non-fiction works from the Internet Archive, none of the electronic texts were associated with subject-related metadata. With the exception of author names (which are yet to be “well-controlled”), it has been difficult learn the “aboutness” of each of the documents. Such a thing is desirable for two reasons: 1) to enable the reader to evaluate the relevance of document, and 2) to provide a browsable interface to the collection. Without some sort of tags, subject headings, or application of clustering techniques, browsability is all but impossible. My goal was to solve this problem in an automated manner.

The solution

A couple of years ago I used tools such as Lingua::EN::Summarize and Open Text Summarizer to extract keywords and summaries from the etexts and assign them as subject terms. The process worked, but not extraordinarily well. I then learned about Term Frequency Inverse Document Frequency (TFIDF) to calculate “relevance”, and T-Score to calculate the probability of two words appearing side-by-side — bi-grams or two-word phrases. Applying these techniques to the etexts of the Alex Catalogue I have been able to create and add meaningful subject “tags” to each of my documents which then paves the way to browsability. Here is the algorithm I used to implement the solution:

  1. Collect documents – This was done through various harvesting techniques. Etexts are saved to the local file system and what metadata does exist gets saved to a database.
  2. Index the collection – Each of the documents is full-text indexed. Not only does this facilitate Steps #3 and #4, below, it makes the collection searchable.
  3. Calculate a relevancy score (TFIDF) for each word – With the exception of parsing each etext into a set of “words”, counting the number of words in a document and the frequency of each word is easy. Determining the total number of documents in the collection is trivial. By searching the index for each word and getting back the number of documents in which it appears is the work of the indexer. With these four values (number of words in a document, frequency of a word in a document, the number of total documents, and the number of documents where the word appears) TFIDF can be calculated for each word.
  4. Calculate a relevancy score for each bi-gram – Instead of extracting words from an etext, bi-grams (two-word phrases) were extracted and TFIDF is calculated for each of them, just like Step #3.
  5. Save – If the score for each word or bi-gram is greater than an arbitrarily denoted lower bounds, and if the word or bi-gram is not a stop word, then assign the word or bi-gram to the etext. This step was the most time-consuming. It required many dry runs of the algorithm to determine an optimal lower-bounds as well as set of stop words. The lower the bounds the greater number of words and phrases are returned, but as the number of words and phrases increases their apparent usefulness decreases. The words become too common among the controlled vocabulary. At the other end of the scale, a stop word list needed to be created to remove meaningless words and phrases. The stop word problem was complicated in Project Gutenberg texts because of the “fine print” and legalese in most of the documents, and by the OCRed (optical character recognized) text from the Internet Archive. Words like “thofe” where the “f” was really an “s” needed to be removed.
  6. Go to Step #3 for each document in the collection.
  7. Done.

The results

Through this process I discovered a number of things.

First, in regards to fictional works, the words or phrases returned are often pronouns, and these were usually the names of characters from the work. An excellent example is Mark Twain’s Adventures of Huckleberry Finn whose currently assigned terms include: huck, tom, joe, injun joe, aunt polly, tom sawyer, muff potter, and injun joe’s.

Second, in regards to works of non-fiction, the words and phrases returned are also nouns, and these are objects referred to often in the etext. A good example includes John Stuart Mill’s Auguste Comte and Positivism where the assigned words are: comte, phaenomena, metaphysical, science, mankind, social, scientific, philosophy, and sciences.

Third, automatically generated keywords and phrases were many times just as useful as the librarian-assigned Library of Congress Subject headings. Many of the items harvested from the Internet Archive were complete with MARC records. Some of those records included subject headings. During Step #5 (above), I spent time observing the output and comparing it to previously assigned terms. Take for example a work called Universalism in America: A History by Richard Eddy. Its assigned headings included:

  • Universalism United States History
  • Unitarian Universalist churches United States

My automatically generated terms/phrases are:

  • universalist
  • ballou
  • hosea ballou
  • boston
  • universalist church
  • sermon
  • convention
  • first universalist
  • universalist quarterly
  • doctrine
  • universalist society
  • restorationist controversy
  • thomas whittemore
  • delivered
  • abner kneeland
  • sermon delivered
  • church
  • universalist meeting
  • universalist magazine
  • universal salvation
  • america
  • hosea ballon
  • vers alism
  • edward turner
  • general convention
  • universalism

Granted, the generated list is not perfect. For example, Hosea Ballou is mentioned twice, and the second was probably caused by an OCR error. On the other hand, how was a person to know that Hosea Ballou was even a part of the etext if it weren’t for this process? The same goes for the other people: Thomas Whittemore, Abner Kneeland, and Edward Turner. In defense of controlled vocabulary, the terms “church”, “sermon”, “doctrine”, and “american” could all be assumed from the (rather) hierarchal nature of LCSH, but unless a person understands the nature of LCSH such a thing is not obvious.

As a librarian I understand the power of a controlled vocabulary, but since I am not limited to three to five subject headings per entry, and because controlled vocabularies are often very specific, I have retained the LCSH in each record whenever possible. The more the merrier.

Next steps

Now that the collection has richer metadata, the next steps will be to exploit it. Some of those nexts steps include:

  1. Normalize the data – Each of the subjects are currently saved in a single database field. They need to be normalized across the database to enable database joins and make it easier to generate reports.
  2. Create a browsable interface – Write a set of static Web pages linking keywords and phrases to etexts. This will make it easier to see at a glance the type of content in the collection.
  3. Re-index – Trivial. Send all the data and metadata back to the indexer ultimately improving the precision/recall ratio.
  4. Enhance search experience – Extract the keywords and phrases from search results and display them to the user. Make them linkable to easily “find more like this one.” Extract the same keywords and phrases and use them to implement the increasingly popular browsable facets feature.
  5. Enhance linked data – Generate a report against the database to create (better) RDF files complete with more meaningful (subject) tags. Link these tags to external vocabularies such as WordNet through the use of linked data thus contributing to the Semantic Web and enabling others to benefit from my labors. (Infomotions Man says, ‘Give back to the ‘Net”.)

Fun! Combining traditional librarianship with computer applications; not automating existing workflows as much as exploiting the inherent functions of a computer. Using mathematics to solve large-scale problems. Making it easier to do learning and research. It is the not what of librarianship that needs to change as much as the how.

Lingua::EN::Bigram (version 0.01)

Tuesday, June 23rd, 2009

Below is the POD (Plain O’ Documentation) file describing a Perl module I wrote called Lingua::EN::Bigram.

The purpose of the module is to: 1) extract all of the two-word phrases from a given text, and 2) rank each phrase according to its probability of occurance. Very nice for doing textual analysis. For example, by applying this module to Mark Twain’s Adventures of Tom Sawyer it becomes evident that the signifcant two-word phrases are names of characters in the story. On the other hand, Ralph Waldo Emerson’s Essays: First Series returns action statements — instructions. On the other hand Henry David Thoreau’s Walden returns “walden pond” and descriptions of pine trees. Interesting.

The code is available here or on CPAN.

NAME

Lingua::EN::Bigram – Calculate significant two-word phrases based on frequency and/or T-Score

SYNOPSIS

  use Lingua::EN::Bigram;
  $bigram = Lingua::EN::Bigram->new;
  $bigram->text( 'All men by nature desire to know. An indication of this...' );
  $tscore = $bigram->tscore;
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }

DESCRIPTION

This module is designed to: 1) pull out all of the two-word phrases (collocations or “bigrams”) in a given text, and 2) list these phrases according to thier frequency and/or T-Score. Using this module is it possible to create list of the most common two-word phrases in a text as well as order them by their probable occurance, thus implying significance.

METHODS

new

Create a new, empty bigram object:

  # initalize
  $bigram = Lingua::EN::Bigram->new;

text

Set or get the text to be analyzed:

  # set the attribute
  $bigram->text( 'All good things must come to an end...' );

  # get the attribute
  $text = $bigram->text;

words

Return a list of all the tokens in a text. Each token will be a word or puncutation mark:

  # get words
  @words = $bigram->words;

word_count

Return a reference to a hash whose keys are a token and whose values are the number of times the token occurs in the text:

  # get word count
  $word_count = $bigram->word_count;

  # list the words according to frequency
  foreach ( sort { $$word_count{ $b } <=> $$word_count{ $a } } keys %$word_count ) {

    print $$word_count{ $_ }, "\t$_\n";

  }

bigrams

Return a list of all bigrams in the text. Each item will be a pair of tokens and the tokens may consist of words or puncutation marks:

  # get bigrams
  @bigrams = $bigram->bigrams;

bigram_count

Return a reference to a hash whose keys are a bigram and whose values are the frequency of the bigram in the text:

  # get bigram count
  $bigram_count = $bigram->bigram_count;

  # list the bigrams according to frequency
  foreach ( sort { $$bigram_count{ $b } <=> $$bigram_count{ $a } } keys %$bigram_count ) {

    print $$bigram_count{ $_ }, "\t$_\n";

  }

tscore

Return a reference to a hash whose keys are a bigram and whose values are a T-Score — a probabalistic calculation determining the significance of bigram occuring in the text:

  # get t-score
  $tscore = $bigram->tscore;

  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }

DISCUSSION

Given the increasing availability of full text materials, this module is intended to help “digital humanists” apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the word_count method and allow the user to search for those words in a concordance. The bigram_count method simply returns the frequency of a given bigram, but the tscore method can order them in a more finely tuned manner.

Consider using T-Score-weighted bigrams as classification terms to supplement the “aboutness” of texts. Concatonate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.

Each bigram includes punctuation. This is intentional. Developers may need want to remove bigrams containing such values from the output. Similarly, no effort has been made to remove commonly used words — stop words — from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution came with a script (bin/bigrams.pl) demonstrating how to remove puncutation and stop words from the displayed output.

Finally, this is not the only module supporting bigram extraction. See also Text::NSP which supports n-gram extraction.

TODO

There are probably a number of ways the module can be improved:

  • the constructor method could take a scalar as input, thus reducing the need for the text method
  • the distribution’s license should probably be changed to the Perl Aristic License
  • the addition of alternative T-Score calculations would be nice
  • it would be nice to support n-grams
  • make sure the module works with character sets beyond ASCII

ACKNOWLEDGEMENTS

T-Score is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. Page 109.

AUTHOR

Eric Lease Morgan <eric_morgan@infomotions.com>

Lingua::Concordance (version 0.01)

Wednesday, June 10th, 2009

Below is a man page describing a Perl I module I recently wrote called Lingua::Concordance (version 0.01).

Given the increasing availability of full text books and journals, I think it behooves the library profession to aggressively explore the possibilities of providing services against text as a means of making the proverbial fire hose of information more useful. Providing concordance-like functions against texts is just one example.

The distribution is available from this blog as well as CPAN.

NAME

Lingua::Concordance – Keyword-in-context (KWIC) search interface

SYNOPSIS

  use Lingua::Concordance;
  $concordance = Lingua::Concordance->new;
  $concordance->text( 'A long time ago, in a galaxy far far away...' );
  $concordance->query( 'far' );
  foreach ( $concordance->lines ) { print "$_\n" }

DESCRIPTION

Given a scalar (such as the content of a plain text electronic book or journal article) and a regular expression, this module implements a simple keyword-in-context (KWIC) search interface — a concordance. Its purpose is to return lists of lines from a text containing the given expression. See the Discussion section, below, for more detail.

METHODS

new

Create a new, empty concordance object:

  $concordance = Lingua::Concordance->new;

text

Set or get the value of the concordance’s text attribute where the input is expected to be a scalar containing some large amount of content, like an electronic book or journal article:

  # set text attribute
  $concordance->text( 'Call me Ishmael. Some years ago- never mind how long...' );

  # get the text attribute
  $text = $concordance->text;

Note: The scalar passed to this method gets internally normalized, specifically, all carriage returns are changed to spaces, and multiple spaces are changed to single spaces.

query

Set or get the value of the concordance’s query attribute. The input is expected to be a regular expression but a simple word or phrase will work just fine:

  # set query attribute
  $concordance->query( 'Ishmael' );

  # get query attribute
  $query = $concordance->query;

See the Discussion section, below, for ways to make the most of this method through the use of powerful regular expressions. This is where the fun it.

radius

Set or get the length of each line returned from the lines method, below. Each line will be padded on the left and the right of the query with the number of characters necessary to equal the value of radius. This makes it easier to sort the lines:

  # set radius attribute
  $concordance->radius( $integer );

  # get radius attribute
  $integer = $concordance->query;

For terminal-based applications it is usually not reasonable to set this value to greater than 30. Web-based applications can use arbitrarily large numbers. The internally set default value is 20.

sort

Set or get the type of line sorting:

  # set sort attribute
  $concordance->sort( 'left' );

  # get sort attribute
  $sort = $concordance->sort;

Valid values include:

  • none – the default value; sorts lines in the order they appear in the text — no sorting
  • left – sorts lines by the (ordinal) word to the left of the query, as defined the ordinal method, below
  • right – sorts lines by the (ordinal) word to the right of the query, as defined the ordinal method, below
  • match – sorts lines by the value of the query (mostly)

This is good for looking for patterns in texts, such as collocations (phrases, bi-grams, and n-grams). Again, see the Discussion section for hints.

ordinal

Set or get the number of words to the left or right of the query to be used for sorting purposes. The internally set default value is 1:

  # set ordinal attribute
  $concordance->ordinal( 2 );

  # get ordinal attribute
  $integer = $concordance->ordinal;

Used in combination with the sort method, above, this is good for looking for textual patterns. See the Discussion section for more information.

lines

Return a list of lines from the text matching the query. Our reason de existance:

  @lines = $concordance->lines;

DISCUSSION

[Elaborate upon a number of things here such as but not limited to: 1) the history of concordances and concordance systems, 2) the usefulness of concordances in the study of linguistics, 3) how to exploit regular expressions to get the most out of a text and find interesting snippets, and 4) how the module might be implemented in scripts and programs.]

BUGS

The internal _by_match subroutine, the one used to sort results by the matching regular expression, does not work exactly as expected. Instead of sorting by the matching regular expression, it sorts by the string exactly to the right of the matched regular expression. Consequently, for queries such as ‘human’, it correctly matches and sorts on human, humanity, and humans, but matches such as Humanity do not necessarily come before humanity.

TODO

  • Write Discussion section.
  • Implement error checking.
  • Fix the _by_match bug.
  • Enable all of the configuration methods (text, query, radius, sort, and ordinal) to be specified in the constructor.
  • Require the text and query attributes to be specified as a part of the constructor, maybe.
  • Remove line-feed characters while normalizing text to accomdate Windows-based text streams, maybe.
  • Write an example CGI script, to accompany the distribution’s terminal-based script, demonstrating how the module can be implemented in a Web interface.
  • Write a full-featured terminal-based script enhancing the one found in the distribution.

ACKNOWLEDGEMENTS

The module implements, almost verbatim, the concordance programs and subroutines described in Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. pgs: 169-185. “Thanks Roger. I couldn’t have done it without your book!”

EAD2MARC

Friday, June 5th, 2009

This posting simply shares three hacks I’ve written to enable me to convert EAD files to MARC records, and ultimately add them to my “discovery” layer — VUFind — for the Catholic Portal:

  • ead2marcxml.sh – Using xsltproc and a modified version of Terry Reese’s XSL stylesheet, converts all the EAD/.xml files in the current directory into MARCXML files. “Thanks Terry!”
  • marcxml2marc.sh – Using yaz-marcdump, convert all .marcxml files in the current directory into “real” MARC records.
  • add-001.pl – A hack to add 001 fields to MARC records. Sometimes necessary since the EAD files do not always have unique identifiers.

The distribution is available in the archives, and distributed under the GNU Public License.

Now, off to go fishing.

Interent Archive content in “discovery” systems

Tuesday, June 2nd, 2009

This quick posting describes how Internet Archive content, specifically, content from the Open Content Alliance can be quickly and easily incorporated into local library “discovery” systems. VuFind is used here as the particular example:

  1. Get keys – The first step is to get a set of keys describing the content you desire. This can be acquired through the Internet Archive’s advanced search interface.
  2. Convert keys – The next step is to convert the keys into sets of URLs pointing to the content you want to download. Fortunately, all the URLs have a similar shape: http://www.archive.org/download/KEY/KEY.pdf, http://www.archive.org/download/KEY/KEY_meta.mrc, or http://www.archive.org/download/KEY/KEY__djvu.txt.
  3. Download – Feed the resulting URLs to your favorite spidering/mirroring application. I use wget.
  4. Update – Enhance the downloaded MARC records with 856$u valued denoting the location of your local PDF copy as well as the original (cononical) version.
  5. Index – Add the resulting MARC records to your “discovery” system.

Linked here is a small distribution of shell and Perl scripts that do this work for me and incorporate the content into VuFind. Here is how they can be used:

  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' /
  -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart

Cool next steps would be use text mining techniques against the downloaded plain text versions of the documents to create summaries, extract named entities, and identify possible subjects. These items could then be inserted into the MARC records to enhance retrieval. Ideally the full text would be indexed, but alas, MARC does not accomodate that. “MARC must die.”

TFIDF In Libraries: Part III of III (For thinkers)

Sunday, May 31st, 2009

This is the third of the three-part series on the topic of TFIDF in libraries. In Part I the why’s and wherefore’s of TFIDF were outlined. In Part II TFIDF subroutines and programs written in Perl were used to demonstrate how search results can be sorted by relevance and automatic classification can be done. In this last part a few more subroutines and a couple more programs are presented which: 1) weigh search results given an underlying set of themes, and 2) determine similarity between files in a corpus. A distribution including the library of subroutines, Perl scripts, and sample data are available online.

Big Names and Great Ideas

As an intellectual humanist, I have always been interested in “great” ideas. In fact, one of the reasons I became I librarian was because of the profundity of ideas physically located libraries. Manifested in books, libraries are chock full of ideas. Truth. Beauty. Love. Courage. Art. Science. Justice. Etc. As the same time, it is important to understand that books are not source of ideas, nor are they the true source of data, information, knowledge, or wisdom. Instead, people are the real sources of these things. Consequently, I have also always been interested in “big names” too. Plato. Aristotle. Shakespeare. Milton. Newton. Copernicus. And so on.

As a librarian and a liberal artist (all puns intended) I recognize many of these “big names” and “great ideas” are represented in a set of books called the Great Books of the Western World. I then ask myself, “Is there someway I can use my skills as a librarian to help support other people’s understanding and perception of the human condition?” The simple answer is to collection, organize, preserve, and disseminate the things — books — manifesting great ideas and big names. This is a lot what my Alex Catalogue of Electronic Texts is all about. On the other hand, a better answer to my question is to apply and exploit the tools and processes of librarianship to ultimately “save the time of the reader”. This is where the use of computers, computer technology, and TFIDF come into play.

Part II of this series demonstrated how to weigh search results based on the relevancy ranked score of a search term. But what if you were keenly interested in “big names” and “great ideas” as they related to a search term? What if you wanted to know about librarianship and how it related to some of these themes? What if you wanted to learn about the essence of sculpture and how it may (or may not) represent some of the core concepts of Western civilization? To answer such questions a person would have to search for terms like sculpture or three-dimensional works of art in addition to all the words representing the “big names” and “great ideas”. Such a process would be laborious to enter by hand, but trivial with the use of a computer.

Here’s a potential solution. Create a list of “big names” and “great ideas” by copying them from a place such as the Great Books of the Western World. Save the list much like you would save a stop word list. Allow a person to do a search. Calculate the relevancy ranking score for each search result. Loop through the list of names and ideas searching for each of them. Calculate their relevancey. Sum the weight of search terms with the weight of name/ideas terms. Return the weighted list. The result will be a relevancy ranked list reflecting not only the value of the search term but also the values of the names/ideas. This second set of values I call the Great Ideas Coefficient.

To implement this idea, the following subroutine, called great_ideas, was created. Given an index, a list of files, and a set of ideas, it loops through each file calculating the TFIDF score for each name/idea:

  sub great_ideas {

    my $index = shift;
    my $files = shift;
    my $ideas = shift;

    my %coefficients = ();

    # process each file
    foreach $file ( @$files ) {

      my $words = $$index{ $file };
      my $coefficient = 0;

      # process each big idea
      foreach my $idea ( keys %$ideas ) {

        # get n and t for tdidf
        my $n = $$words{ $idea };
        my $t = 0;
        foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

          # calculate; sum all tfidf scores for all ideas
          $coefficient = $coefficient + &tfidf( $n, $t, keys %$index, scalar @$files );

        }

      # assign the coefficient to the file
      $coefficients{ $file } = $coefficient;

    }

    return \%coefficients;

  }

A Perl script, ideas.pl, was then written taking advantage of the great_ideas subroutine. As described above, it applies the query to an index, calculates TFIDF for the search terms as well as the names/ideas, sums the results, and lists the results accordingly:

  # define
  use constant STOPWORDS => 'stopwords.inc';
  use constant IDEAS     => 'ideas.inc';

  # use/require
  use strict;
  require 'subroutines.pl';

  # get the input
  my $q = lc( $ARGV[ 0 ] );

  # index, sans stopwords
  my %index = ();
  foreach my $file ( &corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }

  # search
  my ( $hits, @files ) = &search( \%index, $q );
  print "Your search found $hits hit(s)\n";

  # rank
  my $ranks = &rank( \%index, [ @files ], $q );

  # calculate great idea coefficients
  my $coefficients = &great_ideas( \%index, [ @files ], &slurp_words( IDEAS ) );

  # combine ranks and coefficients
  my %scores = ();
  foreach ( keys %$ranks ) { $scores{ $_ } = $$ranks{ $_ } + $$coefficients{ $_ } }

  # sort by score and display
  foreach ( sort { $scores{ $b } <=> $scores{ $a } } keys %scores ) {

    print "\t", $scores{ $_ }, "\t", $_, "\n"

  }

Using the query tool described in Part II, a search for librarianship returns the following results:

  $ ./search.pl books
  Your search found 3 hit(s)
    0.00206045818083232   librarianship.txt
    0.000300606222548807  mississippi.txt
    5.91505974210339e-05  hegel.txt

Using the new program, ideas.pl, the same set of results are returned but in a different order, an order reflecting the existence of “big ideas” and “great ideas” in the texts:

  $ ./ideas.pl books
  Your search found 3 hit(s)
    0.101886904057731   hegel.txt
    0.0420767249559441  librarianship.txt
    0.0279062776599476  mississippi.txt

When it comes to books and “great” ideas, maybe I’d rather read hegel.txt as opposed to librarianship.txt. Hmmm…

Think of the great_ideas subroutine as embodying the opposite functionality as a stop word list. Instead of excluding the words in a given list from search results, use the words to skew search results in a particular direction.

The beauty of the the great_ideas subroutine is that anybody can create their own set of “big names” or “great ideas”. They could be from any topic. Biology. Mathematics. A particular subset of literature. Just as different sets of stop words are used in different domains, so can the application of a Great Ideas Coefficient.

Similarity between documents

TFIDF can be applied to the problem of finding more documents like this one.

The process of finding more documents like this is perennial. The problem is addressed in the field of traditional librarianship through the application of controlled vocabulary terms, author/title authority lists, the collocation of physical materials through the use of classification numbers, and bibliographic instruction as well as information literacy classes.

In the field of information retrieval, the problem is addressed through the application of mathematics. More specifically but simply stated, by plotting the TFIDF scores of two or more terms from a set of documents on a Cartesian plane it is possible to calculate the similarity between said documents by comparing the angle and length of the resulting vectors — a measure called “cosine similarity”. By extending the process to any number of documents and any number of dimensions it is relatively easy to find more documents like this one.

Suppose we have two documents: A and B. Suppose each document contains many words but those words were only science and art. Furthermore, suppose document A contains the word science 9 times and the word art 10 times. Given these values, we can plot the relationship between science and art on a graph, below. Document B can be plotted similarly supposing science occurs 6 times and the word art occurs 14 times. The resulting lines, beginning at the graph’s origin (O) to their end-points (A and B), are called “vectors” and they represent our documents on a Cartesian plane:

  s    |
  c  9 |         * A
  i    |        *
  e    |       *
  n  6 |      *      * B
  c    |     *     *
  e    |    *    *
       |   *   *
       |  *  *
       | * *
       O-----------------------
                10   14

                  art

  Documents A and B represented as vectors

If the lines OA and OB were on top of each other and had the same length, then the documents would be considered equal — exactly similar. In other words, the smaller the angle AOB is as well as the smaller the difference between the length lines OA and OB the more likely the given documents are the same. Conversely, the greater the angle of AOB and the greater the difference of the lengths of lines OA and OB the more unlike the two documents.

This comparison is literally expressed as the inner (dot) product of the vectors divided by the product of the Euclidian magnitudes of the vectors. Mathematically, it is stated in the following form and is called “cosine similarity”:

( ( A.B ) / ( ||A|| * ||B|| ) )

Cosine similarity will return a value between 0 and 1. The closer the result is to 1 the more similar the vectors (documents) compare.

Most cosine similarity applications apply the comparison to every word in a document. Consequently each vector has a large number of dimensions making calculations time consuming. For the purposes of this series, I am only interested in the “big names” and “great ideas”, and since The Great Books of the Western World includes about 150 of such terms, the application of cosine similarity is simplified.

To implement cosine similarity in Perl three additional subroutines needed to be written. One to calculate the inner (dot) product of two vectors. Another was needed to calculate the Euclidian length of a vector. These subroutines are listed below:

  sub dot {

    # dot product = (a1*b1 + a2*b2 ... ) where a and b are equally sized arrays (vectors)
    my $a = shift;
    my $b = shift;
    my $d = 0;
    for ( my $i = 0; $i <= $#$a; $i++ ) { $d = $d + ( $$a[ $i ] * $$b[ $i ] ) }
    return $d;

  }

  sub euclidian {

    # Euclidian length = sqrt( a1^2 + a2^2 ... ) where a is an array (vector)
    my $a = shift;
    my $e = 0;
    for ( my $i = 0; $i <= $#$a; $i++ ) { $e = $e + ( $$a[ $i ] * $$a[ $i ] ) }
    return sqrt( $e );

  }

The subroutine that does the actual comparison is listed below. Given a reference to an array of two books, stop words, and ideas, it indexes each book sans stop words, searches each book for a great idea, uses the resulting TFIDF score to build the vectors, and computes similarity:

  sub compare {

    my $books     = shift;
    my $stopwords = shift;
    my $ideas     = shift;

    my %index = ();
    my @a     = ();
    my @b     = ();

    # index
    foreach my $book ( @$books ) { $index{ $book } = &index( $book, $stopwords ) }

    # process each idea
    foreach my $idea ( sort( keys( %$ideas ))) {

      # search
      my ( $hits, @files ) = &search( \%index, $idea );

      # rank
      my $ranks = &rank( \%index, [ @files ], $idea );

      # build vectors, a & b
      my $index = 0;
      foreach my $file ( @$books ) {

        if    ( $index == 0 ) { push @a, $$ranks{ $file }}
        elsif ( $index == 1 ) { push @b, $$ranks{ $file }}
        $index++;

        }

      }

      # compare; scores closer to 1 approach similarity
      return ( cos( &dot( [ @a ], [ @b ] ) / ( &euclidian( [ @a ] ) * &euclidian( [ @b ] ))));

  }

Finally, a script, compare.pl, was written glueing the whole thing together. It’s heart is listed here:

  # compare each document...
  for ( my $a = 0; $a <= $#corpus; $a++ ) {

    print "\td", $a + 1;

    # ...to every other document
    for ( my $b = 0; $b <= $#corpus; $b++ ) {

      # avoid redundant comparisons
      if ( $b <= $a ) { print "\t - " }

      # process next two documents
      else {

        # (re-)initialize
        my @books = sort( $corpus[ $a ], $corpus[ $b ] );

        # do the work; scores closer to 1000 approach similarity
        print "\t", int(( &compare( [ @books ], $stopwords, $ideas )) * 1000 );

      }

    }

    # next line
    print "\n";

  }

In a nutshell, compare.pl loops through each document in a corpus and compares it to every other document in the corpus while skipping duplicate comparisons. Remember, only the dimensions representing “big names” and “great ideas” are calculated. Finally, it displays a similarity score for each pair of documents. Scores are multiplied by 1000 to make them easier to read. Given the sample data from the distribution, the following matrix is produced:

  $ ./compare.pl
    Comparison: scores closer to 1000 approach similarity

        d1   d2   d3   d4   d5   d6

    d1   -  922  896  858  857  948
    d2   -   -   887  969  944  971
    d3   -   -    -   951  954  964
    d4   -   -    -    -   768  905
    d5   -   -    -    -    -   933
    d6   -   -    -    -    -    - 

    d1 = aristotle.txt
    d2 = hegel.txt
    d3 = kant.txt
    d4 = librarianship.txt
    d5 = mississippi.txt
    d6 = plato.txt

From the matrix is it obvious that documents d2 (hegel.txt) and d6 (plato.txt) are the most similar since their score is the closest to 1000. This means the vectors representing these documents are closer to congruency than the other documents. Notice how all the documents are very close to 1000. This makes sense since all of the documents come from the Alex Catalogue and the Alex Catalogue documents are selected because of the “great idea-ness”. The documents should be similar. Notice which documents are the least similar: d4 (librarianship.txt) and d5 (mississippi.txt). The first is a history of librarianship. The second is a novel called Life on the Mississippi. Intuitively, we would expect this to be true; neither one of these documents are the topic of “great ideas”.

(Argg! Something is incorrect with my trigonometry. When I duplicate a document and run compare.pl the resulting cosine similarity value between the exact same documents is 540, not 1000. What am I doing wrong?)

Summary

This last part in the series demonstrated ways term frequency/inverse document frequency (TFIDF) can be applied to over-arching (or underlying) themes in a corpus of documents, specifically the “big names” and “great ideas” of Western civilization. It also demonstrated how TFIDF scores can be used to create vectors representing documents. These vectors can then be compared for similarity, and, by extension, the documents they represent can be compared for similarity.

The purpose of the entire series was to bring to light and take the magic out of a typical relevancy ranking algorithm. A distribution including all the source code and sample documents is available online. Use the distribution as a learning tool for your own explorations.

As alluded to previously, TFIDF is like any good folk song. It has many variations and applications. TFIDF is also like milled grain because it is a fundemental ingredient to many recipes. Some of these recipies are for bread, but some of them are for pies or just thickener. Librarians and libraries need to incorporate more mathematical methods into their processes. There needs to be a stronger marriage between the social characteristics of librarianship and the logic of mathematics. (Think arscience.) The application of TFIDF in libraries is just one example.

TFIDF In Libraries: Part II of III (For programmers)

Monday, April 20th, 2009

This is the second of a three-part series called TFIDF In Libraries, where relevancy ranking techniques are explored through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techiques will be explored to the end of filtering search results or addressing the perennial task of “finding more documents like this one.” In the end it is the hoped to demonstrate that relevancy ranking is not magic nor mysterious but rather the process of applying statistical techiques to textual objects.

TFIDF, again

As described in Part I, term frequency/inverse document frequency (TFIDF) is a process of counting words in a document as well as throughout a corpus of documents to the end of sorting documents in statistically relevent ways.

Term frequency (TF) is essencially a percentage denoting the number of times a word appears in a document. It is mathematically expressed as C / T, where C is the number of times a word appears in a document and T is the total number of words in the same document.

Inverse document frequency (IDF) takes into acount that many words occur many times in many documents. Stop words and the word “human” in the MEDLINE database are very good examples. IDF is mathematically expressed as D / DF, where D is the total number of documents in a corpus and DF is the number of document in which a particular word is found. As D / DF increases so does the significance of the given word.

Given these two factors, TFIDF is literally the product of TF and IDF:

TFIDF = ( C / T ) * ( D / DF )

This is the basic form that has been used to denote relevance ranking for more than forty years, and please take note that it requires no advanced mathematical knowledge — basic arithmatic.

Like any good recipe or folk song, TFIDF has many variations. Google, for example, adds additional factors into their weighting scheme based on the popularity of documents. Other possibilities could include factors denoting the characteristics of the person using the texts. In order to accomodate for the wide variety of document sizes, the natural log of IDF will be employed throughout the balance of this demonstration. Therefore, for the purposes used here, TFIDF will be defined thus:

TFIDF = ( C / T ) * log( D / DF )

Simple Perl subroutines

In order to put theory into practice, I wrote a number of Perl subroutines implementing various aspects of relevancy ranking techniques. I then wrote a number of scripts exploiting the subroutines, essencially wrapping them in a user interface.

Two of the routines are trivial and will not be explained in any greater detail than below:

  • corpus – Returns an array of all the .txt files in the current directory, and is used to denote the library of content to be analyzed.
  • slurp_words – Returns a reference to a hash of all the words in a file, specifically for the purposes of implementing a stop word list.

Two more of the routines are used to support indexing and searching the corpus. Again, since neither is the focus of this posting, each will only be outlined:

  • index – Given a file name and a list of stop words, this routine returns a reference to a hash containing all of the words in the file (san stop words) as well as the number of times each word occurs. Strictly speaking, this hash is not an index but it serves our given purpose adequately.
  • search – Given an “index” and a query, this routine returns the number of times the query was found in the index as well as an array of files listing where the term was found. Search is limited. It only supports single-term queries, and there are no fields for limiting.

The heart of the library of subroutines is used to calculate TFIDF, ranks search results, and classify documents. Of course the TFIDF calculation is absolutely necessary, but ironically, it is the most straight-forward routine in the collection. Given values for C, T, D, and DF it returns decimal between 0 and 1. Trivial:

  # calculate tfidf
  sub tfidf {

    my $n = shift;  # C
    my $t = shift;  # T
    my $d = shift;  # D
    my $h = shift;  # DF

    my $tfidf = 0;

    if ( $d == $h ) { $tfidf = ( $n / $t ) }
    else { $tfidf = ( $n / $t ) * log( $d / $h ) }

    return $tfidf;

  }

Many readers will probably be most interested in the rank routine. Given an index, a list of files, and a query, this code calculates TFIDF for each file and returns the results as a reference to a hash. It does this by repeatedly calculating the values for C, T, D, and DF for each of the files and calling tfidf:

  # assign a rank to a given file for a given query
  sub rank {

    my $index = shift;
    my $files = shift;
    my $query = shift;

    my %ranks = ();

    foreach my $file ( @$files ) {

      # calculate n
      my $words = $$index{ $file };
      my $n = $$words{ $query };

      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

      # assign tfidf to file
      $ranks{ $file } = &tfidf( $n, $t, keys %$index, scalar @$files );

    }

    return \%ranks;

  }

The classify routine is an added bonus. Given the index, a file, and the corpus of files, this function calculates TFIDF for each word in the file and returns a refernece to a hash containing each word and its TFIDF value. In other words, instead of calculating TFIDF for a given query in a subset of documents, it calculates TFIDF for each word in an entire corpus. This proves useful in regards to automatic classification. Like rank, it repeatedly determines values for C, T, D, and DF and calls tfidf:

  # rank each word in a given document compared to a corpus
  sub classify {

    my $index  = shift;
    my $file   = shift;
    my $corpus = shift;

    my %tags = ();

    foreach my $words ( $$index{ $file } ) {

      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

      foreach my $word ( keys %$words ) {

        # get n
        my $n = $$words{ $word };

        # calculate h
        my ( $h, @files ) = &search( $index, $word );

        # assign tfidf to word
        $tags{ $word } = &tfidf( $n, $t, scalar @$corpus, $h );

      }

    }

    return \%tags;

  }

Search.pl

Two simple Perl scripts are presented, below, taking advantage of the routines described, above. The first is search.pl. Given a single term as input this script indexes the .txt files in the current directory, searches them for the term, assigns TFIDF to each of the results, and displays the results in a relevancy ranked order. The essencial aspects of the script are listed here:

  # define
  use constant STOPWORDS => 'stopwords.inc';

  # include
  require 'subroutines.pl';

  # get the query
  my $q = lc( $ARGV[ 0 ] );

  # index
  my %index = ();
  foreach my $file ( &corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }

  # search
  my ( $hits, @files ) = &search( \%index, $q );
  print "Your search found $hits hit(s)\n";

  # rank
  my $ranks = &rank( \%index, [ @files ], $q );

  # sort by rank and display
  foreach my $file ( sort { $$ranks{ $b } <=> $$ranks{ $a } } keys %$ranks ) {

    print "\t", $$ranks{ $file }, "\t", $file, "\n"

  }

  # done
  print "\n";
  exit;

Output from the script looks something like this:

  $ ./search.pl knowledge
  Your search found 6 hit(s)
    0.0193061840120664    plato.txt
    0.00558586078987563   kant.txt
    0.00299602568022012   aristotle.txt
    0.0010031177985631    librarianship.txt
    0.00059150597421034   hegel.txt
    0.000150303111274403  mississippi.txt

From these results you can see that the document named plato.txt is the most relevent because it has the highest score, in fact, it is almost four times more relevant than the second hit, kant.txt. For extra credit, ask yourself, “At what point do the scores become useless, or when do the scores tell you there is nothing of significance here?”

Classify.pl

As alluded to in Part I of this series, TFIDF can be turned on its head to do automatic classification. Weigh each term in a corpus of documents, and list the most significant words for a given document. Classify.pl does this by denoting a lower bounds for TFIDF scores, indexing an entire corpus, weighing each term, and outputing all the terms whose scores are greater than the lower bounds. If no terms are greater than the lower bounds, then it lists the top N scores as defined by a configuration. The essencial aspects of classify.pl are listed below:

  # define
  use constant STOPWORDS    => 'stopwords.inc';
  use constant LOWERBOUNDS  => .02;
  use constant NUMBEROFTAGS => 5;

  # require
  require 'subroutines.pl';

  # initialize
  my @corpus = &corpus;

  # index
  my %index = ();
  foreach my $file (@corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }

  # classify each document
  foreach my $file ( @corpus ) {

    print $file, "\n";

    # list tags greater than a given score
    my $tags  = &classify( \%index, $file, [ @corpus ] );
    my $found = 0;
    foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {

      if ( $$tags{ $tag } > LOWERBOUNDS ) {

        print "\t", $$tags{ $tag }, "\t$tag\n";
        $found = 1;

      }

      else { last }

    }

    # accomodate tags with low scores
    if ( ! $found ) {

      my $n = 0;
      foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {

        print "\t", $$tags{ $tag }, "\t$tag\n";
        $n++;
        last if ( $n == NUMBEROFTAGS );

      }

    }

    print "\n";

  }

  # done
  exit;

For example, sample, yet truncated, output from classify.pl looks like this:

  aristotle.txt
    0.0180678691531642  being
    0.0112840859266579  substances
    0.0110363803118312  number
    0.0106083766432284  matter
    0.0098440843778661  sense

  mississippi.txt
    0.00499714142455761  mississippi
    0.00429324597184886  boat
    0.00418922035591656  orleans
    0.00374087743616293  day
    0.00333830388445574  river

Thus, assuming a lower TFIDF bounds of 0.02, the words being, substance, number, matter, and sense are the most significant in the document named aristotle.txt. But since none of the words in mississippi.txt have a score that high the top five words are returned instead. For more extra credit, think of ways classify.pl can be improved by answering, “How can the output be mapped to controlled vocabulary terms or expanded through the use of some other thesarus?”

Summary

The Perl subroutines and scripts described here implement TFIDF to do rudimentary ranking of search results and automatic classification. They are not designed to be production applications, just example tools for the purposes of learning. Turning the ideas implemented in these scripts into production applications have been the fodder for many people’s careers and entire branches of computer science.

You can download the scripts, subroutines, and sample data in order for you to learn more. You are encouraged to remove the .txt files from the distribution and replace them with your own data. I think your search results and automatic classification output will confirm in your mind that TFIDF is well-worth the time and effort of the library community. Given the amounts of full text books and journal articles freely available on the Internet, it behooves the library profession to learn to exploit these concepts because our traditional practices simply: 1) do not scale, or 2) do not meet with our user’s expectations. Furthermore, farming these sorts of solutions out to vendors is irresponsible.

YAAC: Yet Another Alex Catalogue

Monday, February 2nd, 2009

I have implemented another version of my Alex Catalogue of Electronic Texts, more specifically, I have dropped the use of one indexer and replaced it with Solr/Lucene. See http://infomotions.com/alex/ This particular implementation does not have all the features of the previous one. No spell check. No thesaurus. No query suggestions. On the other hand, it does support paging, and since it runs under mod_perl, it is quite responsive.

As always I am working on the next version, and you can see where I’m going at http://infomotions.com/sandbox/alex4/ Like the implementation above, this one runs under mod_perl and supports paging. Unlike the implementation above, it also supports query suggestions, a thesaurus, and faceted browsing. It also sports the means to view metadata details. Content-wise, it included images, journal titles, journal articles, and some content from the HathiTrust.

It would be great if I were to get some feedback regarding these implementations. Are they easy to use?