<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>Infomotions Mini-Musings</title>
	<atom:link href="http://infomotions.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://infomotions.com/blog</link>
	<description>Thoughts in libraries and librarianship</description>
	<pubDate>Wed, 01 Jul 2009 17:23:48 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Mass Digitization Mini-Symposium: A Reverse Travelogue</title>
		<link>http://infomotions.com/blog/2009/07/mass-digitization-mini-symposium-a-reverse-travelogue/</link>
		<comments>http://infomotions.com/blog/2009/07/mass-digitization-mini-symposium-a-reverse-travelogue/#comments</comments>
		<pubDate>Wed, 01 Jul 2009 17:23:48 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Travelogues]]></category>

		<category><![CDATA[mass digitization]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=310</guid>
		<description><![CDATA[

The Professional Development Committee of the Hesburgh Libraries at the University of Notre Dame sponsored a &#8220;mini-symposium&#8221; on the topic of mass digitization on Thursday, May 21, 2009. This text documents some of what the speakers had to say. Given the increasingly wide availability of free full text information provided through mass digitization, the forum offered [...]]]></description>
			<content:encoded><![CDATA[<div>
<p>
The Professional Development Committee of the Hesburgh Libraries at the University of Notre Dame sponsored a &#8220;mini-symposium&#8221; on the topic of mass digitization on Thursday, May 21, 2009. This text documents some of what the speakers had to say. Given the increasingly wide availability of free full text information provided through mass digitization, the forum offered an opportunity for participants to learn how such a thing might affect learning, teaching, and scholarship. *
</p>
<h3>Setting the Stage</h3>
<table align='right'>
<tr>
<td align='center'>
<img src="http://infomotions.com/blog/wp-content/uploads/2009/07/symposium-s.jpeg" alt="presenters and organizers" hspace="5" vspace="5" align="right"><br />
<span style='font-size: small'><a href="http://infomotions.com/blog/wp-content/uploads/2009/07/symposium.jpg">Presenters and organizers</a></span>
</td>
</tr>
</table>
<p>
After introductions by Leslie Morgan, I gave a talk called &#8220;<a href="http://infomotions.com/musings/mass-digitization-opportunities/">Mass digitization in 15 minutes</a>&#8221; where I described some of the types of library services and digital humanities processes that could be applied to digitized literature. &#8220;What might libraries be like if 51% or more of our collections were available in full text?&#8221;
</p>
<h3>Maura Marx</h3>
<p>
The Symposium really got underway with the remarks of Maura Marx (Executive Director of the Open Knowledge Commons) in a talk called &#8220;<a href="http://infomotions.com/blog/wp-content/uploads/2009/07/marx.pdf">Mass Digitization and Access to Books Online</a>.&#8221; She began by giving an overview of mass digitization (such as the efforts of the Google Books Project and the Internet Archive) and compared it with large-scale digitization efforts. &#8220;None of this is new,&#8221; she said, and gave examples including Project Gutenberg, the Library of Congress Digital Library, and the Million Books Project. Because the Open Knowledge Commons is an outgrowth of the Open Content Alliance, she was able to describe in detail the mechanical digitizing process of the Internet Archive with its costs approaching 10¢/page. Along the way she advocated the HathiTrust as a preservation and sharing method, and she described it as a type of &#8220;radical collaboration.&#8221; &#8220;Why is mass digitization so important?&#8221; She went on to list and elaborate upon six reasons: 1) search, 2) access, 3) enhanced scholarship, 4) new scholarship, 5) public good, and 6) the democratization of information.
</p>
<p>
The second half of Ms. Marx&#8217;s presentation outlined three key issues regarding the Google Books Settlement. Specifically, the settlement will give Google a sort of &#8220;most favored nation&#8221; status because it prevents Google from getting sued in the future, but it does not protect other possible digitizers the same way. Second, it circumvents, through contract law, the problem of orphan works; the settlement sidesteps many of the issues regarding copyright. Third, the settlement is akin to a class action suit, but in reality the majority of people affected by the suit are unknown since they fall into the class of orphan works holders. To paraphrase, &#8220;How can a group of unknown authors and publishers pull together a class action suit?&#8221;
</p>
<p>
She closed her presentation with a more thorough description of the Open Knowledge Commons agenda, which includes: 1) the production of digitized materials, 2) the preservation of said materials, and 3) the building of tools to make the materials increasingly useful. Throughout her presentation I was repeatedly struck by the idea of the public good the Open Knowledge Commons was trying to create. At the same time, her ideas were not so naive as to ignore the new business models that are coming into play and the necessity for libraries to consider new ways to provide library services. &#8220;We are a part of a cyber infrastructure where the key word is &#8216;shared.&#8217; We are not alone.&#8221;
</p>
<h3>Gary Charbonneau</h3>
<p>
Gary Charbonneau (Systems Librarian, Indiana University - Bloomington) was next and gave his presentation called &#8220;<a href="http://infomotions.com/blog/wp-content/uploads/2009/07/charbonneau.pdf">The Google Books Project at Indiana University</a>.&#8221;
</p>
<p>
Indiana University, in conjunction with a number of other CIC (Committee on Institutional Cooperation) libraries, has begun working with Google on the Google Books Project. Like many previous Google Book Partners, Charbonneau was not authorized to share many details regarding the Project; he was only authorized &#8220;to paint a picture&#8221; with the metaphoric &#8220;broad brush.&#8221; He described the digitization process as rather straightforward: 1) pull books from a candidate list, 2) charge them out to Google, 3) put the books on a truck, 4) wait for them to return in a few weeks or so, and 5) charge the books back into the library. In return for this work they get: 1) attribution, 2) access to snippets, and 3) sets of digital files which are in the public domain. About 95% of the works are still under copyright and none of the books come from their rare book library &#8212; the Lilly Library.
</p>
<p>
Charbonneau thought the real value of the Google Book search was the deep indexing, something mentioned by Marx as well.
</p>
<p>
Again, not 100% of the library&#8217;s collection is being digitized, but there are plans to get closer to that goal. For example, they are considering plans to digitize their &#8220;Collections of Distinction&#8221; as well as some of their government documents. Like Marx, he advocated the HathiTrust but he also suspected commercial content might make its way into its archives.
</p>
<p>
One of the more interesting things Charbonneau mentioned was in regard to URLs. Specifically, there are currently no plans to insert the URLs of digitized materials into the 856 $u field of MARC records denoting the location of items. Instead they plan to use an API (application programming interface) to display the location of files on the fly.
</p>
<p>
Indiana University hopes to complete its participation in the Google Books Project by 2013.
</p>
<h3>Sian Meikle</h3>
<p>
The final presentation of the day was given by Sian Meikle (Digital Services Librarian, University of Toronto Libraries) whose comments were quite simply entitled &#8220;<a href="http://infomotions.com/blog/wp-content/uploads/2009/07/meikle.pdf">Mass Digitization</a>.&#8221;
</p>
<p>
The massive (no pun intended) University of Toronto library system consisting of a whopping 18 million volumes spread out over 45 libraries on three campuses began working with the Internet Archive to digitize books in the Fall of 2004. With their machines (the &#8220;scribes&#8221;) they are able to scan about 500 pages/hour and, considering the average book is about 300 pages long, they are scanning at a rate of about 100,000 books/year. Like Indiana and the Google Books Project, not all books are being digitized. For example, they can&#8217;t be too large, too small, brittle, tightly bound, etc. Of all the public domain materials, only 9% or so do not get scanned. Unlike the output of the Google Book Project, the deliverables from their scanning process include images of the texts, a PDF file of the text, an OCRed version of the text, a &#8220;flip book&#8221; version of the text, and a number of XML files complete with various types of metadata.
</p>
<p>
Considering Meikle&#8217;s experience with mass digitized materials, she was able to make a number of observations and distinctions. For example, we &#8212; the library profession &#8212; need to understand the difference between &#8220;born digital&#8221; materials and digitized materials. Because of formatting, technology, errors in OCR, etc., the different manifestations have different strengths and weaknesses. Some things are more easily searched. Some things are displayed better on screens. Some things are designed for paper and binding. Another distinction is access. According to some of her calculations, materials that are in electronic form get &#8220;used&#8221; more than their printed form. In this case &#8220;used&#8221; means borrowed or downloaded. Sometimes the ratio is as high as 300-to-1, that is, three hundred downloads to one borrow. Furthermore, she has found that proportionately, English language items are not used as heavily as materials in other languages. One possible explanation is that material in other languages can be harder to locate in print. Yet another difference is the type of reading one format offers over another; compare and contrast &#8220;intentional reading&#8221; with &#8220;functional reading.&#8221; Books on computers make it easy to find facts and snippets. Books on paper tend to lend themselves better to the understanding of bigger ideas.
</p>
<p>
Lastly, Meikle alluded to ways the digitized content will be made available to users. Specifically, she imagines it will become a part of an initiative called the Scholar&#8217;s Portal &#8212; a single index of journal article literature, full text books, and bibliographic metadata. In my mind, such an idea is the heart of the &#8220;next generation&#8221; library catalog.
</p>
<h3>Summary and Conclusion</h3>
<p>
The symposium was attended by approximately 125 people. Most were from the Hesburgh Libraries of the University of Notre Dame. Some were from regional libraries. There were a few University faculty in attendance. The event was a success in that it raised the awareness of what mass digitization is all about, and it fostered communication during the breaks as well as after the event was over.
</p>
<p>
The opportunities for librarianship and scholarship in general are almost boundless considering the availability of full text content. The opportunities are even greater when the content is free of licensing restrictions. While the idea of complete collections totally free of restrictions is a fantasy, the idea of significant amounts of freely available full text content is easily within our grasp. During the final question and answer period, someone asked, &#8220;What skills and resources are necessary to do this work?&#8221; The answer was agreed upon by the speakers, &#8220;What is needed? An understanding that the perfect answer is not necessary prior to implementation.&#8221; There were general nods of agreement from the audience.
</p>
<p>
Now is a good time to consider the possibilities of mass digitization and to be prepared to deal with them before they become the norm as opposed to the exception. This symposium, generously sponsored by the Hesburgh Libraries Professional Development Committee, as well as library administration, provided the opportunity to consider these issues. &#8220;Thank you!&#8221;
</p>
<h3>Notes</h3>
<p>* This posting was originally &#8220;published&#8221; as a part of the Hesburgh Libraries of the University of Notre Dame website, and it is duplicated here because &#8220;Lots of copies keep stuff safe.&#8221;
</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/07/mass-digitization-mini-symposium-a-reverse-travelogue/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Lingua::EN::Bigram (version 0.01)</title>
		<link>http://infomotions.com/blog/2009/06/linguaenbigram-version-001/</link>
		<comments>http://infomotions.com/blog/2009/06/linguaenbigram-version-001/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 13:41:36 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Hacks]]></category>

		<category><![CDATA[bigrams]]></category>

		<category><![CDATA[Perl]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=307</guid>
		<description><![CDATA[
Below is the POD (Plain O&#8217; Documentation) file describing a Perl module I wrote called Lingua::EN::Bigram.


The purpose of the module is to: 1) extract all of the two-word phrases from a given text, and 2) rank each phrase according to its probability of occurrence. Very nice for doing textual analysis. For example, by applying this [...]]]></description>
			<content:encoded><![CDATA[<p>
Below is the POD (Plain O&#8217; Documentation) file describing a Perl module I wrote called Lingua::EN::Bigram.
</p>
<p>
The purpose of the module is to: 1) extract all of the two-word phrases from a given text, and 2) rank each phrase according to its probability of occurrence. Very nice for doing textual analysis. For example, by applying this module to Mark Twain&#8217;s <cite>Adventures of Tom Sawyer</cite> it becomes evident that the significant two-word phrases are names of characters in the story. On the other hand, Ralph Waldo Emerson&#8217;s <cite>Essays: First Series</cite> returns action statements &#8212; instructions. Henry David Thoreau&#8217;s <cite>Walden</cite>, in turn, returns &#8220;walden pond&#8221; and descriptions of pine trees. Interesting.
</p>
<p>
The code is available <a href="http://infomotions.com/blog/wp-content/uploads/2009/06/Lingua-EN-Bigram-0.01.tar.gz">here</a> or on <a href="http://search.cpan.org/~emorgan/Lingua-EN-Bigram-0.01/">CPAN</a>.
</p>
<div>
<h3>NAME</h3>
<p>Lingua::EN::Bigram - Calculate significant two-word phrases based on frequency and/or T-Score</p>
<h3>SYNOPSIS</h3>
<pre>
<code>  use Lingua::EN::Bigram;
  $bigram = Lingua::EN::Bigram-&gt;new;
  $bigram-&gt;text( 'All men by nature desire to know. An indication of this...' );
  $tscore = $bigram-&gt;tscore;
  foreach ( sort { $$tscore{ $b } &lt;=&gt; $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }</code>
</pre>
<h3>DESCRIPTION</h3>
<p>This module is designed to: 1) pull out all of the two-word phrases (collocations or &#8220;bigrams&#8221;) in a given text, and 2) list these phrases according to their frequency and/or T-Score. Using this module it is possible to create a list of the most common two-word phrases in a text as well as order them by their probable occurrence, thus implying significance.</p>
<h3>METHODS</h3>
<h4>new</h4>
<p>Create a new, empty bigram object:</p>
<pre>
<code>  # initialize
  $bigram = Lingua::EN::Bigram-&gt;new;</code>
</pre>
<h4>text</h4>
<p>Set or get the text to be analyzed:</p>
<pre>
<code>  # set the attribute
  $bigram-&gt;text( 'All good things must come to an end...' );

  # get the attribute
  $text = $bigram-&gt;text;</code>
</pre>
<h4>words</h4>
<p>Return a list of all the tokens in a text. Each token will be a word or punctuation mark:</p>
<pre>
<code>  # get words
  @words = $bigram-&gt;words;</code>
</pre>
<h4>word_count</h4>
<p>Return a reference to a hash whose keys are a token and whose values are the number of times the token occurs in the text:</p>
<pre>
<code>  # get word count
  $word_count = $bigram-&gt;word_count;

  # list the words according to frequency
  foreach ( sort { $$word_count{ $b } &lt;=&gt; $$word_count{ $a } } keys %$word_count ) {

    print $$word_count{ $_ }, "\t$_\n";

  }</code>
</pre>
<h4>bigrams</h4>
<p>Return a list of all bigrams in the text. Each item will be a pair of tokens and the tokens may consist of words or punctuation marks:</p>
<pre>
<code>  # get bigrams
  @bigrams = $bigram-&gt;bigrams;</code>
</pre>
<h4>bigram_count</h4>
<p>Return a reference to a hash whose keys are a bigram and whose values are the frequency of the bigram in the text:</p>
<pre>
<code>  # get bigram count
  $bigram_count = $bigram-&gt;bigram_count;

  # list the bigrams according to frequency
  foreach ( sort { $$bigram_count{ $b } &lt;=&gt; $$bigram_count{ $a } } keys %$bigram_count ) {

    print $$bigram_count{ $_ }, "\t$_\n";

  }</code>
</pre>
<h4>tscore</h4>
<p>Return a reference to a hash whose keys are a bigram and whose values are a T-Score &#8212; a probabilistic calculation determining the significance of a bigram occurring in the text:</p>
<pre>
<code>  # get t-score
  $tscore = $bigram-&gt;tscore;

  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } &lt;=&gt; $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }</code>
</pre>
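<p>As a rough illustration of what such a score measures, the following sketch computes a T-Score for each bigram with the commonly cited approximation t = ( count(w1 w2) - count(w1) * count(w2) / N ) / sqrt( count(w1 w2) ), where N is the number of tokens in the text. The tokenizer, the subroutine name (tscores), and the sample text are assumptions made for illustration; this is not necessarily how the module computes its values internally:</p>
<pre>
<code>  use strict;
  use warnings;

  # hypothetical helper; returns a reference to a hash of t-scores keyed by bigram
  sub tscores {
    my ( $text ) = @_;

    # crude tokenizer (an assumption): lowercase, then grab words and punctuation marks
    my @words = lc( $text ) =~ /(\w+|[[:punct:]])/g;
    my $n     = scalar @words;

    # count the tokens and the adjacent pairs of tokens
    my ( %word_count, %bigram_count );
    $word_count{ $_ }++ foreach @words;
    $bigram_count{ "$words[ $_ ] $words[ $_ + 1 ]" }++ foreach ( 0 .. $#words - 1 );

    # observed frequency minus expected frequency, scaled by the square root of the observed
    my %tscore;
    foreach my $bigram ( keys %bigram_count ) {
      my ( $first, $second ) = split / /, $bigram;
      my $observed = $bigram_count{ $bigram };
      my $expected = $word_count{ $first } * $word_count{ $second } / $n;
      $tscore{ $bigram } = ( $observed - $expected ) / sqrt( $observed );
    }

    return \%tscore;
  }

  # list the bigrams of a sample text according to t-score
  my $tscore = tscores( 'the cat sat on the mat and the cat slept' );
  foreach ( sort { $$tscore{ $b } &lt;=&gt; $$tscore{ $a } } keys %$tscore ) {
    printf "%.3f\t%s\n", $$tscore{ $_ }, $_;
  }</code>
</pre>
<p>A bigram scores highly when it occurs more often than the individual frequencies of its two words would suggest, which is why a raw frequency count and a T-Score can order the same bigrams differently.</p>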
<h3>DISCUSSION</h3>
<p>Given the increasing availability of full text materials, this module is intended to help &#8220;digital humanists&#8221; apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the word_count method and allow the user to search for those words in a concordance. The bigram_count method simply returns the frequency of a given bigram, but the tscore method can order them in a more finely tuned manner.</p>
<p>Consider using T-Score-weighted bigrams as classification terms to supplement the &#8220;aboutness&#8221; of texts. Concatenate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.</p>
<p>Each bigram includes punctuation. This is intentional. Developers may want to remove bigrams containing such values from the output. Similarly, no effort has been made to remove commonly used words &#8212; stop words &#8212; from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution comes with a script (bin/bigrams.pl) demonstrating how to remove punctuation and stop words from the displayed output.</p>
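<p>A minimal sketch of such filtering follows. It is not the bin/bigrams.pl script from the distribution; the stop word list, the subroutine name (keep_bigram), and the sample scores are all made up for illustration, and it assumes each bigram is a pair of tokens separated by a single space:</p>
<pre>
<code>  use strict;
  use warnings;

  # a tiny, hand-made stop word list (an assumption; real lists are much longer)
  my %stopwords = map { $_ =&gt; 1 } qw( a an and by of the this to );

  # hypothetical helper; true if neither token contains punctuation or is a stop word
  sub keep_bigram {
    my ( $bigram ) = @_;
    foreach my $token ( split / /, $bigram ) {
      return 0 if $token =~ /[[:punct:]]/;
      return 0 if $stopwords{ lc $token };
    }
    return 1;
  }

  # sample scores standing in for the hash returned by the tscore method
  my $tscore = { 'walden pond' =&gt; 2.1, 'of the' =&gt; 3.4, 'pond ,' =&gt; 1.2 };

  # list only the bigrams that survive the filter
  foreach ( sort { $$tscore{ $b } &lt;=&gt; $$tscore{ $a } } grep { keep_bigram( $_ ) } keys %$tscore ) {
    print "$$tscore{ $_ }\t$_\n";
  }</code>
</pre>
<p>Note that the punctuation test is deliberately blunt; it also drops tokens containing apostrophes or hyphens, which may or may not be desirable for a given text.</p>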
<p>Finally, this is not the only module supporting bigram extraction. See also Text::NSP which supports n-gram extraction.</p>
<h3>TODO</h3>
<p>There are probably a number of ways the module can be improved:</p>
<ul>
<li>the constructor method could take a scalar as input, thus reducing the need for the text method</li>
<li>the distribution&#8217;s license should probably be changed to the Perl Artistic License</li>
<li>the addition of alternative T-Score calculations would be nice</li>
<li>it would be nice to support n-grams</li>
<li>make sure the module works with character sets beyond ASCII</li>
</ul>
<h3>ACKNOWLEDGEMENTS</h3>
<p>T-Score is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. Page 109.</p>
<h3>AUTHOR</h3>
<p>Eric Lease Morgan &lt;eric_morgan@infomotions.com&gt;</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/06/linguaenbigram-version-001/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Lingua::Concordance (version 0.01)</title>
		<link>http://infomotions.com/blog/2009/06/linguaconcordance-version-001/</link>
		<comments>http://infomotions.com/blog/2009/06/linguaconcordance-version-001/#comments</comments>
		<pubDate>Wed, 10 Jun 2009 17:05:37 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Hacks]]></category>

		<category><![CDATA[concordance]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=304</guid>
		<description><![CDATA[
Below is a man page describing a Perl module I recently wrote called Lingua::Concordance (version 0.01).
Given the increasing availability of full text books and journals, I think it behooves the library profession to aggressively explore the possibilities of providing services against text as a means of making the proverbial fire hose of information more [...]]]></description>
			<content:encoded><![CDATA[<p>
Below is a man page describing a Perl module I recently wrote called Lingua::Concordance (version 0.01).</p>
<p>Given the increasing availability of full text books and journals, I think it behooves the library profession to aggressively explore the possibilities of providing services against text as a means of making the proverbial fire hose of information more useful. Providing concordance-like functions against texts is just one example.
</p>
<p>
The <a href="http://infomotions.com/blog/wp-content/uploads/2009/06/Lingua-Concordance-0.01.tar.gz">distribution is available from this blog</a> as well as <a href="http://search.cpan.org/~emorgan/Lingua-Concordance-0.01/">CPAN</a>.
</p>
<h3>NAME</h3>
<p>Lingua::Concordance - Keyword-in-context (KWIC) search interface</p>
<h3>SYNOPSIS</h3>
<pre>
<code>  use Lingua::Concordance;
  $concordance = Lingua::Concordance-&gt;new;
  $concordance-&gt;text( 'A long time ago, in a galaxy far far away...' );
  $concordance-&gt;query( 'far' );
  foreach ( $concordance-&gt;lines ) { print "$_\n" }</code>
</pre>
<h3>DESCRIPTION</h3>
<p>Given a scalar (such as the content of a plain text electronic book or journal article) and a regular expression, this module implements a simple keyword-in-context (KWIC) search interface &#8212; a concordance. Its purpose is to return lists of lines from a text containing the given expression. See the Discussion section, below, for more detail.</p>
<h3>METHODS</h3>
<h4>new</h4>
<p>Create a new, empty concordance object:</p>
<pre>
<code>  $concordance = Lingua::Concordance-&gt;new;</code>
</pre>
<h4>text</h4>
<p>Set or get the value of the concordance&#8217;s text attribute where the input is expected to be a scalar containing some large amount of content, like an electronic book or journal article:</p>
<pre>
<code>  # set text attribute
  $concordance-&gt;text( 'Call me Ishmael. Some years ago- never mind how long...' );

  # get the text attribute
  $text = $concordance-&gt;text;</code>
</pre>
<p>Note: The scalar passed to this method is normalized internally. Specifically, all carriage returns are changed to spaces, and multiple spaces are changed to single spaces.</p>
<h4>query</h4>
<p>Set or get the value of the concordance&#8217;s query attribute. The input is expected to be a regular expression but a simple word or phrase will work just fine:</p>
<pre>
<code>  # set query attribute
  $concordance-&gt;query( 'Ishmael' );

  # get query attribute
  $query = $concordance-&gt;query;</code>
</pre>
<p>See the Discussion section, below, for ways to make the most of this method through the use of powerful regular expressions. This is where the fun is.</p>
<h4>radius</h4>
<p>Set or get the length of each line returned from the lines method, below. Each line will be padded on the left and the right of the query with the number of characters necessary to equal the value of radius. This makes it easier to sort the lines:</p>
<pre>
<code>  # set radius attribute
  $concordance-&gt;radius( $integer );

  # get radius attribute
  $integer = $concordance-&gt;radius;</code>
</pre>
<p>For terminal-based applications it is usually not reasonable to set this value to greater than 30. Web-based applications can use arbitrarily large numbers. The internally set default value is 20.</p>
<h4>sort</h4>
<p>Set or get the type of line sorting:</p>
<pre>
<code>  # set sort attribute
  $concordance-&gt;sort( 'left' );

  # get sort attribute
  $sort = $concordance-&gt;sort;</code>
</pre>
<p>Valid values include:</p>
<ul>
<li>none - the default value; sorts lines in the order they appear in the text &#8212; no sorting</li>
<li>left - sorts lines by the (ordinal) word to the left of the query, as defined by the ordinal method, below</li>
<li>right - sorts lines by the (ordinal) word to the right of the query, as defined by the ordinal method, below</li>
<li>match - sorts lines by the value of the query (mostly)</li>
</ul>
<p>This is good for looking for patterns in texts, such as collocations (phrases, bi-grams, and n-grams). Again, see the Discussion section for hints.</p>
<h4>ordinal</h4>
<p>Set or get the number of words to the left or right of the query to be used for sorting purposes. The internally set default value is 1:</p>
<pre>
<code>  # set ordinal attribute
  $concordance-&gt;ordinal( 2 );

  # get ordinal attribute
  $integer = $concordance-&gt;ordinal;</code>
</pre>
<p>Used in combination with the sort method, above, this is good for looking for textual patterns. See the Discussion section for more information.</p>
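<p>To illustrate how sort and ordinal might work together, here is a small sketch that orders matches by the nth word to the left or right of the query. It is not the module&#8217;s code; the subroutine name (sort_key), the layout of each hit, and the sample data are assumptions made for illustration:</p>
<pre>
<code>  use strict;
  use warnings;

  # hypothetical helper; returns the nth word after (or, for 'left', before) the match
  sub sort_key {
    my ( $context, $side, $ordinal ) = @_;
    my @words = $context =~ /(\w+)/g;
    my $word  = $side eq 'left' ? $words[ -$ordinal ] : $words[ $ordinal - 1 ];
    return defined $word ? lc $word : '';
  }

  # each hit is [ text left of the match, the match, text right of the match ]
  my @hits = (
    [ 'the whale was a', 'white', 'whale indeed' ],
    [ 'he raised the',   'white', 'flag at once' ],
    [ 'a great',         'white', 'shark appeared' ],
  );

  # sort by the first word to the right of the match
  my @sorted = sort { sort_key( $$a[ 2 ], 'right', 1 ) cmp sort_key( $$b[ 2 ], 'right', 1 ) } @hits;
  print join( ' ', @$_ ), "\n" foreach @sorted;</code>
</pre>
<p>Sorting this way groups identical neighboring words together, so recurring phrases such as &#8220;white whale&#8221; end up on adjacent lines where they are easy to spot.</p>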
<h4>lines</h4>
<p>Return a list of lines from the text matching the query. Our raison d&#8217;&#234;tre:</p>
<pre>
<code>  @lines = $concordance-&gt;lines;</code>
</pre>
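<p>To make the idea concrete, here is a minimal keyword-in-context sketch written with core Perl only. It is not the module&#8217;s implementation; the subroutine name (kwic) and its arguments are hypothetical, and it assumes the radius is the number of characters of context to show on each side of a match:</p>
<pre>
<code>  use strict;
  use warnings;

  # hypothetical helper; returns one line of context per match of $query in $text
  sub kwic {
    my ( $text, $query, $radius ) = @_;

    # normalize whitespace so line breaks do not interrupt the context
    $text =~ s/\s+/ /g;

    my @lines;
    while ( $text =~ /($query)/gi ) {
      my $match = $1;
      my $start = pos( $text ) - length( $match );

      # take up to $radius characters from each side of the match
      my $from  = $start &gt; $radius ? $start - $radius : 0;
      my $left  = substr( $text, $from, $start - $from );
      my $right = substr( $text, pos( $text ), $radius );

      # pad both sides so every match begins in the same column
      push @lines, sprintf( "%${radius}s%s%-${radius}s", $left, $match, $right );
    }

    return @lines;
  }

  my $text = 'A long time ago, in a galaxy far far away...';
  print "$_\n" foreach kwic( $text, 'far', 10 );</code>
</pre>
<p>Because both sides are padded to the same width, the matched text starts in the same column on every line, which is what makes concordance output easy to scan.</p>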
<h3>DISCUSSION</h3>
<p>[Elaborate upon a number of things here such as but not limited to: 1) the history of concordances and concordance systems, 2) the usefulness of concordances in the study of linguistics, 3) how to exploit regular expressions to get the most out of a text and find interesting snippets, and 4) how the module might be implemented in scripts and programs.]</p>
<h3>BUGS</h3>
<p>The internal _by_match subroutine, the one used to sort results by the matching regular expression, does not work exactly as expected. Instead of sorting by the matching regular expression, it sorts by the string exactly to the right of the matched regular expression. Consequently, for queries such as &#8216;human&#8217;, it correctly matches and sorts on human, humanity, and humans, but matches such as Humanity do not necessarily come before humanity.</p>
<h3>TODO</h3>
<ul>
<li>Write Discussion section.</li>
<li>Implement error checking.</li>
<li>Fix the _by_match bug.</li>
<li>Enable all of the configuration methods (text, query, radius, sort, and ordinal) to be specified in the constructor.</li>
<li>Require the text and query attributes to be specified as a part of the constructor, maybe.</li>
<li>Remove line-feed characters while normalizing text to accommodate Windows-based text streams, maybe.</li>
<li>Write an example CGI script, to accompany the distribution&#8217;s terminal-based script, demonstrating how the module can be implemented in a Web interface.</li>
<li>Write a full-featured terminal-based script enhancing the one found in the distribution.</li>
</ul>
<h3>ACKNOWLEDGEMENTS</h3>
<p>The module implements, almost verbatim, the concordance programs and subroutines described in Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. pgs: 169-185. &#8220;Thanks Roger. I couldn&#8217;t have done it without your book!&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/06/linguaconcordance-version-001/feed/</wfw:commentRss>
		</item>
		<item>
		<title>EAD2MARC</title>
		<link>http://infomotions.com/blog/2009/06/ead2marc/</link>
		<comments>http://infomotions.com/blog/2009/06/ead2marc/#comments</comments>
		<pubDate>Fri, 05 Jun 2009 16:28:45 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Hacks]]></category>

		<category><![CDATA[Encoded Archival Description (EAD)]]></category>

		<category><![CDATA[MARC]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=300</guid>
		<description><![CDATA[
This posting simply shares three hacks I&#8217;ve written to enable me to convert EAD files to MARC records, and ultimately add them to my &#8220;discovery&#8221; layer &#8212; VUFind &#8212; for the Catholic Portal:


ead2marcxml.sh - Using xsltproc and a modified version of Terry Reese&#8217;s XSL stylesheet, converts all the EAD/.xml files in the current directory into [...]]]></description>
			<content:encoded><![CDATA[<p>
This posting simply shares three hacks I&#8217;ve written to enable me to convert EAD files to MARC records, and ultimately add them to my &#8220;discovery&#8221; layer &#8212; VUFind &#8212; for the Catholic Portal:
</p>
<ul>
<li>ead2marcxml.sh - Using xsltproc and a modified version of Terry Reese&#8217;s XSL stylesheet, converts all the EAD/.xml files in the current directory into MARCXML files. &#8220;Thanks Terry!&#8221;</li>
<li>marcxml2marc.sh - Using yaz-marcdump, converts all .marcxml files in the current directory into &#8220;real&#8221; MARC records.</li>
<li>add-001.pl - A hack to add 001 fields to MARC records. Sometimes necessary since the EAD files do not always have unique identifiers.</li>
</ul>
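<p>The first two steps amount to running one command per file. The sketch below only builds and prints those commands; it is not the code from the distribution. The stylesheet name (ead2marc.xsl), the sample file names, and the output naming scheme are assumptions, while the -o option of xsltproc and the -i and -o options of yaz-marcdump are the tools&#8217; own:</p>
<pre>
<code>  use strict;
  use warnings;

  # hypothetical helper; swap a file name's extension for another
  sub rename_to {
    my ( $file, $extension ) = @_;
    ( my $new = $file ) =~ s/\.[^.]+$//;
    return "$new.$extension";
  }

  # EAD to MARCXML, by way of an XSL stylesheet (the stylesheet name is an assumption)
  sub ead2marcxml {
    my ( $ead ) = @_;
    return ( 'xsltproc', '-o', rename_to( $ead, 'marcxml' ), 'ead2marc.xsl', $ead );
  }

  # MARCXML to binary MARC; yaz-marcdump writes its result to standard output
  sub marcxml2marc {
    my ( $marcxml ) = @_;
    return ( 'yaz-marcdump', '-i', 'marcxml', '-o', 'marc', $marcxml );
  }

  # print the commands instead of running them
  foreach my $ead ( 'collection-01.xml', 'collection-02.xml' ) {
    print join( ' ', ead2marcxml( $ead ) ), "\n";
    print join( ' ', marcxml2marc( rename_to( $ead, 'marcxml' ) ) ), "\n";
  }</code>
</pre>
<p>To actually run a conversion, each list could be handed to Perl&#8217;s system function, with the output of yaz-marcdump redirected to a file.</p>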
<p>
The distribution is <a href="http://infomotions.com/blog/wp-content/uploads/2009/06/ead2marc.tar.gz">available in the archives</a>, and distributed under the GNU General Public License.
</p>
<p>
Now, off to go fishing.</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/06/ead2marc/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Text mining: Books and Perl modules</title>
		<link>http://infomotions.com/blog/2009/06/text-mining-books-and-perl-modules/</link>
		<comments>http://infomotions.com/blog/2009/06/text-mining-books-and-perl-modules/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 02:14:55 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Article/book reviews]]></category>

		<category><![CDATA[Librarianship]]></category>

		<category><![CDATA[Perl]]></category>

		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=295</guid>
		<description><![CDATA[
This posting simply lists some of the books I&#8217;ve read and Perl modules I&#8217;ve explored in regards to the field of text mining.


Through my explorations of term frequency/inverse document frequency (TFIDF) I became aware of a relatively new field of study called text mining. In many ways, text mining is similar to data mining only [...]]]></description>
			<content:encoded><![CDATA[<p>
This posting simply lists some of the books I&#8217;ve read and Perl modules I&#8217;ve explored in regards to the field of text mining.
</p>
<p>
Through my explorations of term frequency/inverse document frequency (<a href="http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-i-for-librarians/">TFIDF</a>) I became aware of a relatively new field of study called text mining. In many ways, text mining is similar to data mining only applied to unstructured texts instead of database rows and columns. Think plain text books such as items from Project Gutenberg or the Open Content Alliance. Text mining is a process including automatic classification, clustering (similar to but distinct from classification), indexing and searching, entity extraction (names, places, organizations, dates, etc.), statistically significant keyword and phrase extraction, parts of speech tagging, and summarization.
</p>
<p>
As a librarian, I found the whole thing extremely fascinating; consequently, I read more.
</p>
<h3>Books</h3>
<p>
I have found the following four books helpful. They have enabled me to learn about the principles of text mining.
</p>
<ul>
<li>Bilisoly, R. (2008). <a href="http://www.worldcat.org/oclc/212020725">Practical text mining with Perl</a>. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. - Of all the books listed here, this one includes the most Perl programming examples, and it is not as scholarly as the balance of the list. Much of the book revolves around applying regular expressions to texts. Its strongest suit is the creation of terminal-based concordance scripts. Very nice. Lots of fun. The concordances return very interesting results. The book does describe clustering techniques too, but on the overall topic of automatic metadata generation the book is not very strong.</li>
<li>Konchady, M. (2006). <a href="http://www.worldcat.org/oclc/63245489">Text mining application programming</a>. Charles River Media programming series. Boston, Mass: Charles River Media. - This book is a readable survey of text mining covering parts of speech (POS) tagging, information extraction, search engines, clustering, classification, summarization, and question/answer processing. Many models for each aspect of text mining are described, compared, and contrasted. To put the author&#8217;s knowledge into practice, the book comes with a CD containing a Perl library for text mining, sample applications, and CGI scripts. This library is freely available on the Web.</li>
<li>Nugues, P. M. (2006). <a href="http://www.worldcat.org/oclc/68628882">An introduction to language processing with Perl and Prolog</a>: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. - Of the four books listed here, this one is probably the most dense. I found its Perl scripts used to parse text more useful than the ones in Bilisoly, but this one included no concordance applications. I also found the description of n-grams, the extraction of multi-word phrases, to be very interesting. I suspect the model it describes can be extended to any number of words. This book also discusses parts of speech (POS) processing, and it is the only one that describes how to really parse language. Think semantics, lexicons, discourse, and dialog. After the first couple of chapters the Perl examples disappear and give way exclusively to Prolog examples.</li>
<li>Weiss, S. M. (2005). <a href="http://www.worldcat.org/oclc/56192245">Text mining: Predictive methods for analyzing unstructured information</a>. New York: Springer. - The complexity of this book lies between Konchady and Nugues; it includes a greater number of mathematical models than Konchady, but it is easier to read than Nugues. Broad topics include textual documents as numeric vectors, using text for prediction, information retrieval, clustering &amp; classification, and looking for information in documents. Each chapter includes a section called &#8220;Historical and Bibliographical Remarks&#8221; which has proved to be very interesting reading.</li>
</ul>
<table align='center' padding='10'>
<tr>
<td>
<iframe src="http://rcm.amazon.com/e/cm?t=infomotions-20&#038;o=1&#038;p=6&#038;l=st1&#038;mode=books&#038;search=0470176431&#038;fc1=000000&#038;lt1=&#038;lc1=3366FF&#038;bg1=FFFFFF&#038;f=ifr" marginwidth="0" marginheight="0" width="120" height="150" border="0" frameborder="0" style="border:none;" scrolling="no"></iframe>
</td>
<td>
<iframe src="http://rcm.amazon.com/e/cm?t=infomotions-20&#038;o=1&#038;p=6&#038;l=st1&#038;mode=books&#038;search=1584504609&#038;fc1=000000&#038;lt1=&#038;lc1=3366FF&#038;bg1=FFFFFF&#038;f=ifr" marginwidth="0" marginheight="0" width="120" height="150" border="0" frameborder="0" style="border:none;" scrolling="no"></iframe>
</td>
<td>
<iframe src="http://rcm.amazon.com/e/cm?t=infomotions-20&#038;o=1&#038;p=6&#038;l=st1&#038;mode=books&#038;search=354025031X&#038;fc1=000000&#038;lt1=&#038;lc1=3366FF&#038;bg1=FFFFFF&#038;f=ifr" marginwidth="0" marginheight="0" width="120" height="150" border="0" frameborder="0" style="border:none;" scrolling="no"></iframe>
</td>
<td>
<iframe src="http://rcm.amazon.com/e/cm?t=infomotions-20&#038;o=1&#038;p=6&#038;l=st1&#038;mode=books&#038;search=0387954333&#038;fc1=000000&#038;lt1=&#038;lc1=3366FF&#038;bg1=FFFFFF&#038;f=ifr" marginwidth="0" marginheight="0" width="120" height="150" border="0" frameborder="0" style="border:none;" scrolling="no"></iframe>
</td>
</tr>
</table>
<p>
When it comes to the process of text mining, I found each of these books useful in its own right. Each provided me with ways of reading texts, parsing texts, counting words, counting phrases, and, through the application of statistical analysis, creating lists and readable summaries denoting the &#8220;aboutness&#8221; of given documents.
</p>
<h3>Perl modules</h3>
<p>
As a Perl hacker I am interested in writing scripts putting into practice some of the things I learn. Listed here are a number of modules that have gotten me further along in regard to text mining:
</p>
<ul>
<li><a href="http://search.cpan.org/dist/Lingua-EN-Fathom/">Lingua::EN::Fathom</a> - This library outputs interesting statistics regarding a given document: number of words and the number of times each occurs, number of sentences, complexity of words, number of paragraphs, etc. Of greatest interest are numbers (Fog, Flesch, and Flesch-Kincaid) denoting the readability of the text. Quick. Easy. Useful.</li>
<li><a href="http://search.cpan.org/~simon/Lingua-EN-Keywords-2.0/">Lingua::EN::Keywords</a> - Given a text, this library outputs a list of what it thinks are the most significant individual words in a document, sans stop words. Not fancy.</li>
<li><a href="http://search.cpan.org/dist/Lingua-EN-NamedEntity/">Lingua::EN::NamedEntity</a> - I believe this library comes pre-trained to extract names, places, and organizations from a given text. It returns a Perl data structure listing the probabilities of a word or phrase being any particular entity. It may need to be re-trained to work for your corpus.</li>
<li><a href="http://search.cpan.org/~igorm/Lingua-EN-Semtags-Engine-0.02/">Lingua::EN::Semtags::Engine</a> - Given a text, this module returns words and phrases in relevancy-ranked order. So far I have had some problems using this module because it seems to take a long time to return. On the other hand, it looks promising since it returns both individual words and phrases.</li>
<li><a href="http://search.cpan.org/~fimm/Lingua-EN-Summarize-0.2/">Lingua::EN::Summarize</a> - Given a text, this library returns sentences it thinks encapsulate the essence of the document. The result is readable &#8212; grammatically correct. The process it uses to accomplish its task is self-proclaimed as unscientific.</li>
<li><a href="http://search.cpan.org/~acoburn/Lingua-EN-Tagger/">Lingua::EN::Tagger</a> - This library marks up a document in pseudo XML with tags denoting parts of speech in a given document. To do this work it also can extract words, noun phrases, and sentences from a text. Zippy. Probability-based. Developers are expected to parse the tagged output and do analysis against it, such as counting the number of times particular parts of speech occur.</li>
<li><a href="http://search.cpan.org/~creamyg/Lingua-StopWords-0.09/">Lingua::StopWords</a> - Returns a simple list of stop words. Easy, but I can&#8217;t figure out how customizable it is. &#8220;One person&#8217;s stop word list is another person&#8217;s research topic.&#8221;</li>
<li><a href="http://search.cpan.org/~neilb/Net-Dict-2.07/">Net::Dict</a> - A network interface to DICT (dictionary) servers. While the DICT protocol is a bit long in the tooth, and not quite as cool as Web interfaces to things like Google or Wikipedia, this module does provide a handy way to look up definitions, a complementary functionality to WordNet.</li>
<li><a href="http://search.cpan.org/~hank/Text-Aspell-0.09/">Text::Aspell</a> - A Perl interface to GNU Aspell which is great for spell-checking applications.</li>
<li><a href="http://textmine.sourceforge.net/">TextMine</a> - This is a set of modules written by Manu Konchady, the author of Text Mining Application Programming. It includes submodules named Cluster, Entity, Index, Pos, Quanda (Q &amp; A), Summary, Tokens, and WordNet. While this set of modules is the most comprehensive I&#8217;ve seen, and while they are probably the most theoretically grounded, interfacing with things like WordNet to be thorough, my initial experience has been a bit frustrating since scripts written against the libraries do not return very quickly. Maybe I&#8217;m feeding them documents that are too large, and if so, then the libraries are not necessarily scalable.</li>
<li><a href="http://wordnet.princeton.edu/">WordNet</a> - There is a bevy of modules providing functionality against WordNet &#8212; a &#8220;lexical database of English&#8230; Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.&#8221; Any truly thorough text mining application of English will take advantage of WordNet.</li>
</ul>
<h3>Text mining and librarianship</h3>
<p>
Given the volume of &#8220;born digital&#8221; material being created, it is not possible to apply traditional library methods against it. The hand-crafted, heavy human touch process is not scalable. Given the amounts of mass digitized text being generated from the Google Books Project and the Open Content Alliance, new opportunities for literary analysis make themselves evident. Again, the traditional library processes cannot fill the bill in these regards.
</p>
<p>
Text mining techniques offer possible solutions to these problems. Count words. Count phrases. Compare these words, phrases, and counts to other texts. Determine their statistical significance. Assign them to documents in the form of subject headings, keywords, author names, and other added entries in our metadata formats. Given large numbers of books, articles, and other &#8220;wordy&#8221; documents, learn how to &#8220;save the time of the reader&#8221; by summarizing these documents and ranking them in some sort of order in addition to alphabetical or date. Compare and contrast full text works by learning what words and types of words are used in documents. Are the words religious in nature? Mathematical and scientific? Poetic? Such things will provide additional means for understanding and interpreting everything from scholarly journal articles to works of classic fiction and philosophy. These techniques are not intended to replace existing methods of understanding and organization, but rather to supplement and build upon them. This is an evolutionary process.
</p>
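<p>
As a rough illustration of the &#8220;count words, count phrases&#8221; steps, here is a minimal sketch written in core Perl. The sample text and the tiny stop word list are made up for the purpose of the example, and the phrase counting is deliberately naive because it pairs adjacent words even across sentence boundaries:
</p>
<pre><code>  use strict;
  use warnings;

  # a made-up sample text; a real script would read a file instead
  my $text = 'The quick brown fox. The quick red fox. The lazy dog.';

  # a deliberately small, hypothetical stop word list
  my %stopwords = ();
  foreach my $word ( qw( the a an of and ) ) { $stopwords{ $word } = 1 }

  # tokenize; lowercase the text and keep runs of letters
  my @words = lc( $text ) =~ /[a-z]+/g;

  # count individual words, sans stop words
  my %words = ();
  foreach my $word ( @words ) {

    next if $stopwords{ $word };
    $words{ $word }++;

  }

  # count two-word phrases made from adjacent words
  my %phrases = ();
  foreach my $i ( 1 .. $#words ) {

    my $phrase = join( ' ', $words[ $i - 1 ], $words[ $i ] );
    $phrases{ $phrase }++;

  }

  # display the counts
  foreach my $word   ( sort keys %words )   { print $words{ $word },     "\t", $word,   "\n" }
  foreach my $phrase ( sort keys %phrases ) { print $phrases{ $phrase }, "\t", $phrase, "\n" }</code></pre>
<p>
Counts like these are only raw material. Comparing them against counts from other texts is what begins to suggest statistical significance.
</p>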
<p>
If libraries and librarians desire to remain relevant in the evolving information environment, then they will need to do the good work they do differently. The problem to be solved nowadays is less about access and more about use. Text mining is one way of making the content of libraries more useful.</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/06/text-mining-books-and-perl-modules/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Internet Archive content in &#8220;discovery&#8221; systems</title>
		<link>http://infomotions.com/blog/2009/06/interent-archive-content-in-discovery-systems/</link>
		<comments>http://infomotions.com/blog/2009/06/interent-archive-content-in-discovery-systems/#comments</comments>
		<pubDate>Tue, 02 Jun 2009 12:59:08 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Hacks]]></category>

		<category><![CDATA[Librarianship]]></category>

		<category><![CDATA[discovery systems]]></category>

		<category><![CDATA[Internet Archive]]></category>

		<category><![CDATA[VUFind]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=292</guid>
		<description><![CDATA[
This quick posting describes how Internet Archive content, specifically content from the Open Content Alliance, can be quickly and easily incorporated into local library &#8220;discovery&#8221; systems. VuFind is used here as the particular example:


Get keys - The first step is to get a set of keys describing the content you desire. This can be acquired [...]]]></description>
			<content:encoded><![CDATA[<p>
This quick posting describes how Internet Archive content, specifically content from the Open Content Alliance, can be quickly and easily incorporated into local library &#8220;discovery&#8221; systems. VuFind is used here as the particular example:
</p>
<ol>
<li>Get keys - The first step is to get a set of keys describing the content you desire. These can be acquired through the Internet Archive&#8217;s advanced search interface.</li>
<li>Convert keys - The next step is to convert the keys into sets of URLs pointing to the content you want to download. Fortunately, all the URLs have a similar shape: http://www.archive.org/download/KEY/KEY.pdf, http://www.archive.org/download/KEY/KEY_meta.mrc, or http://www.archive.org/download/KEY/KEY_djvu.txt.</li>
<li>Download - Feed the resulting URLs to your favorite spidering/mirroring application. I use wget.</li>
<li>Update - Enhance the downloaded MARC records with 856$u values denoting the location of your local PDF copy as well as the original (canonical) version.</li>
<li>Index - Add the resulting MARC records to your &#8220;discovery&#8221; system.</li>
</ol>
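<p>
The &#8220;convert keys&#8221; step can be sketched in a few lines of Perl. This is only an illustration and not necessarily how the scripts in the distribution do it. The keys below are hypothetical placeholders, and only the PDF and MARC URL shapes are built:
</p>
<pre><code>  use strict;
  use warnings;

  # hypothetical keys; real ones come from the Internet Archive's search interface
  my @keys = ( 'examplekey01', 'examplekey02' );

  # the common root of every download URL
  my $root = 'http://www.archive.org/download';

  # build URLs pointing to the PDF file and the MARC record of each key
  my @urls = ();
  foreach my $key ( @keys ) {

    push @urls, "$root/$key/$key.pdf";
    push @urls, "$root/$key/${key}_meta.mrc";

  }

  # output one URL per line
  foreach my $url ( @urls ) { print $url, "\n" }</code></pre>
<p>
Saved to a file, a list like this can be handed to wget through its -i (input file) option.
</p>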
<p>
Linked here is a small <a href="http://infomotions.com/blog/wp-content/uploads/2009/06/ia-and-discovery.tar.gz">distribution of shell and Perl scripts</a> that do this work for me and incorporate the content into VuFind. Here is how they can be used:
</p>
<pre><code>  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' \
  -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart</code></pre>
<p>
Cool next steps would be to use text mining techniques against the downloaded plain text versions of the documents to create summaries, extract named entities, and identify possible subjects. These items could then be inserted into the MARC records to enhance retrieval. Ideally the full text would be indexed, but alas, MARC does not accommodate that. &#8220;MARC must die.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/06/interent-archive-content-in-discovery-systems/feed/</wfw:commentRss>
		</item>
		<item>
		<title>TFIDF In Libraries: Part III of III (For thinkers)</title>
		<link>http://infomotions.com/blog/2009/05/tfidf-in-libraries-part-iii-of-iii-for-thinkers/</link>
		<comments>http://infomotions.com/blog/2009/05/tfidf-in-libraries-part-iii-of-iii-for-thinkers/#comments</comments>
		<pubDate>Sun, 31 May 2009 20:30:39 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Hacks]]></category>

		<category><![CDATA[Librarianship]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=286</guid>
		<description><![CDATA[
This is the third of the three-part series on the topic of TFIDF in libraries. In Part I the whys and wherefores of TFIDF were outlined. In Part II TFIDF subroutines and programs written in Perl were used to demonstrate how search results can be sorted by relevance and automatic classification can be done. In [...]]]></description>
			<content:encoded><![CDATA[<p>
This is the third of the three-part series on the topic of TFIDF in libraries. In <a href="http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-i-for-librarians/">Part I</a> the whys and wherefores of TFIDF were outlined. In <a href="http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-ii-of-iii-for-programmers/">Part II</a> TFIDF subroutines and programs written in Perl were used to demonstrate how search results can be sorted by relevance and automatic classification can be done. In this last part a few more subroutines and a couple more programs are presented which: 1) weigh search results given an underlying set of themes, and 2) determine similarity between files in a corpus. A distribution including the <a href="http://infomotions.com/blog/wp-content/uploads/2009/04/tfidf.tar.gz">library of subroutines, Perl scripts, and sample data</a> is available online.
</p>
<h3>Big Names and Great Ideas</h3>
<p>
As an intellectual humanist, I have always been interested in &#8220;great&#8221; ideas. In fact, one of the reasons I became a librarian was because of the profundity of ideas physically located in libraries. Libraries are chock full of ideas manifested in books. Truth. Beauty. Love. Courage. Art. Science. Justice. Etc. At the same time, it is important to understand that books are not the source of ideas, nor are they the true source of data, information, knowledge, or wisdom. Instead, people are the real sources of these things. Consequently, I have always been interested in &#8220;big names&#8221; too. Plato. Aristotle. Shakespeare. Milton. Newton. Copernicus. And so on.
</p>
<table align='right' padding='10'>
<tr>
<td>
<iframe src="http://rcm.amazon.com/e/cm?t=infomotions-20&#038;o=1&#038;p=6&#038;l=st1&#038;mode=books&#038;search=0852295316&#038;fc1=000000&#038;lt1=&#038;lc1=3366FF&#038;bg1=FFFFFF&#038;f=ifr" marginwidth="0" marginheight="0" width="120" height="150" border="0" frameborder="0" style="border:none;" scrolling="no"></iframe></td>
</tr>
</table>
<p>
As a librarian and a liberal artist (all puns intended) I recognize many of these &#8220;big names&#8221; and &#8220;great ideas&#8221; are represented in a set of books called the Great Books of the Western World. I then ask myself, &#8220;Is there some way I can use my skills as a librarian to help support other people&#8217;s understanding and perception of the human condition?&#8221; The simple answer is to collect, organize, preserve, and disseminate the things &#8212; books &#8212; manifesting great ideas and big names. This is a lot of what my <a href="http://infomotions.com/alex/">Alex Catalogue of Electronic Texts</a> is all about. On the other hand, a better answer to my question is to apply and exploit the tools and processes of librarianship to ultimately &#8220;save the time of the reader&#8221;. This is where the use of computers, computer technology, and TFIDF come into play.
</p>
<p>
<a href="http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-ii-of-iii-for-programmers/">Part II</a> of this series demonstrated how to weigh search results based on the relevancy ranked score of a search term. But what if you were keenly interested in &#8220;big names&#8221; and &#8220;great ideas&#8221; as they related to a search term? What if you wanted to know about librarianship and how it related to some of these themes? What if you wanted to learn about the essence of sculpture and how it may (or may not) represent some of the core concepts of Western civilization? To answer such questions a person would have to search for terms like sculpture or three-dimensional works of art in addition to all the words representing the &#8220;big names&#8221; and &#8220;great ideas&#8221;. Such a process would be laborious to enter by hand, but trivial with the use of a computer.
</p>
<p>
Here&#8217;s a potential solution. Create a list of &#8220;big names&#8221; and &#8220;great ideas&#8221; by copying them from a place such as the Great Books of the Western World. Save the list much like you would save a stop word list. Allow a person to do a search. Calculate the relevancy ranking score for each search result. Loop through the list of names and ideas searching for each of them. Calculate their relevancy. Sum the weight of search terms with the weight of name/ideas terms. Return the weighted list. The result will be a relevancy ranked list reflecting not only the value of the search term but also the values of the names/ideas. This second set of values I call the Great Ideas Coefficient.
</p>
<p>
To implement this idea, the following subroutine, called great_ideas, was created. Given an index, a list of files, and a set of ideas, it loops through each file calculating the TFIDF score for each name/idea:
</p>
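<pre><code>  sub great_ideas {

    my $index = shift;  # reference to a hash of word counts, keyed on file name
    my $files = shift;  # reference to an array of file names (the search results)
    my $ideas = shift;  # reference to a hash whose keys are the names/ideas

    my %coefficients = ();

    # process each file
    foreach my $file ( @$files ) {

      my $words       = $$index{ $file };
      my $coefficient = 0;

      # get t for tfidf, the total number of words in the file
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

      # process each big idea
      foreach my $idea ( keys %$ideas ) {

        # get n for tfidf; an idea absent from the file counts as zero
        my $n = $$words{ $idea } || 0;

        # calculate; sum all tfidf scores for all ideas. scalar() forces a
        # count of the indexed files instead of a flattened list of their names
        $coefficient = $coefficient + &amp;tfidf( $n, $t, scalar( keys %$index ), scalar @$files );

      }

      # assign the coefficient to the file
      $coefficients{ $file } = $coefficient;

    }

    return \%coefficients;

  }</code></pre>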
<p>
A Perl script, ideas.pl, was then written taking advantage of the great_ideas subroutine. As described above, it applies the query to an index, calculates TFIDF for the search terms as well as the names/ideas, sums the results, and lists the results accordingly:
</p>
<pre><code>  # define
  use constant STOPWORDS =&gt; 'stopwords.inc';
  use constant IDEAS     =&gt; 'ideas.inc';

  # use/require
  use strict;
  require 'subroutines.pl';

  # get the input
  my $q = lc( $ARGV[ 0 ] );

  # index, sans stopwords
  my %index = ();
  foreach my $file ( &amp;corpus ) { $index{ $file } = &amp;index( $file, &amp;slurp_words( STOPWORDS ) ) }

  # search
  my ( $hits, @files ) = &amp;search( \%index, $q );
  print "Your search found $hits hit(s)\n";

  # rank
  my $ranks = &amp;rank( \%index, [ @files ], $q );

  # calculate great idea coefficients
  my $coefficients = &amp;great_ideas( \%index, [ @files ], &amp;slurp_words( IDEAS ) );

  # combine ranks and coefficients
  my %scores = ();
  foreach ( keys %$ranks ) { $scores{ $_ } = $$ranks{ $_ } + $$coefficients{ $_ } }

  # sort by score and display
  foreach ( sort { $scores{ $b } &lt;=&gt; $scores{ $a } } keys %scores ) {

    print "\t", $scores{ $_ }, "\t", $_, "\n"

  }</code></pre>
<p>
Using the query tool described in <a href="http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-ii-of-iii-for-programmers/">Part II</a>, a search for books returns the following results:
</p>
<pre>
  $ ./search.pl books
  Your search found 3 hit(s)
    0.00206045818083232   librarianship.txt
    0.000300606222548807  mississippi.txt
    5.91505974210339e-05  hegel.txt
</pre>
<p>
Using the new program, ideas.pl, the same set of results is returned but in a different order, an order reflecting the existence of &#8220;big names&#8221; and &#8220;great ideas&#8221; in the texts:
</p>
<pre>
  $ ./ideas.pl books
  Your search found 3 hit(s)
    0.101886904057731   hegel.txt
    0.0420767249559441  librarianship.txt
    0.0279062776599476  mississippi.txt
</pre>
<p>
When it comes to books and &#8220;great&#8221; ideas, maybe I&#8217;d rather read hegel.txt as opposed to librarianship.txt. Hmmm&#8230;
</p>
<p>
Think of the great_ideas subroutine as embodying the opposite functionality of a stop word list. Instead of excluding the words in a given list from search results, use the words to skew search results in a particular direction.
</p>
<p>
The beauty of the great_ideas subroutine is that anybody can create their own set of &#8220;big names&#8221; or &#8220;great ideas&#8221;. They could be from any topic. Biology. Mathematics. A particular subset of literature. Just as different sets of stop words are used in different domains, so can different Great Ideas Coefficients be applied.
</p>
<h3>Similarity between documents</h3>
<p>
TFIDF can be applied to the problem of finding more documents like this one.
</p>
<p>
The problem of finding more documents like this one is perennial. It is addressed in the field of traditional librarianship through the application of controlled vocabulary terms, author/title authority lists, the collocation of physical materials through the use of classification numbers, and bibliographic instruction as well as information literacy classes.
</p>
<p>
In the field of information retrieval, the problem is addressed through the application of mathematics. More specifically but simply stated, by plotting the TFIDF scores of two or more terms from a set of documents on a Cartesian plane, it is possible to calculate the similarity between said documents by comparing the angle between the resulting vectors, a measure called &#8220;cosine similarity&#8221;. By extending the process to any number of documents and any number of dimensions it is relatively easy to find more documents like this one.
</p>
<p>
Suppose we have two documents: A and B. Suppose each document contains many words, but the only distinct words are science and art. Furthermore, suppose document A contains the word science 9 times and the word art 10 times. Given these values, we can plot the relationship between science and art on a graph, below. Document B can be plotted similarly, supposing the word science occurs 6 times and the word art occurs 14 times. The resulting lines, running from the graph&#8217;s origin (O) to their end-points (A and B), are called &#8220;vectors&#8221; and they represent our documents on a Cartesian plane:
</p>
<pre>
  s    |
  c  9 |         * A
  i    |        *
  e    |       *
  n  6 |      *      * B
  c    |     *     *
  e    |    *    *
       |   *   *
       |  *  *
       | * *
       O-----------------------
                10   14

                  art

  Documents A and B represented as vectors
</pre>
<p>
If the lines OA and OB were on top of each other and had the same length, then the documents would be considered equal &#8212; exactly similar. In other words, the smaller the angle AOB and the smaller the difference between the lengths of lines OA and OB, the more likely the given documents are the same. Conversely, the greater the angle AOB and the greater the difference between the lengths of lines OA and OB, the more unlike the two documents are.
</p>
<p>
This comparison is expressed as the inner (dot) product of the vectors divided by the product of the Euclidian magnitudes of the vectors. Dividing by the magnitudes factors out the lengths of the vectors, leaving only the angle between them. Mathematically, it is stated in the following form and is called &#8220;cosine similarity&#8221;:
</p>
<blockquote><p>( ( A.B ) / ( ||A|| * ||B|| ) )</p></blockquote>
<p>
For vectors whose values are never negative, such as the ones used here, cosine similarity will return a value between 0 and 1. The closer the result is to 1, the more similar the vectors (documents) are.
</p>
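<p>
To make the arithmetic concrete, here is a small, self-contained sketch applying the formula to documents A and B from the graph. It uses the raw word counts from the example as vector values, and the subroutine name, similarity, is made up for this illustration. By hand, the dot product is ( 9 * 6 ) + ( 10 * 14 ) = 194, and the magnitudes are the square roots of 181 and 232:
</p>
<pre><code>  use strict;
  use warnings;

  # word counts for documents A and B, in the order ( science, art )
  my @A = ( 9, 10 );
  my @B = ( 6, 14 );

  # cosine similarity = ( A.B ) / ( ||A|| * ||B|| )
  sub similarity {

    my $v = shift;
    my $w = shift;
    my ( $dot, $vv, $ww ) = ( 0, 0, 0 );

    # accumulate the dot product and the sums of squares in one pass
    foreach my $i ( 0 .. $#$v ) {

      $dot = $dot + ( $$v[ $i ] * $$w[ $i ] );
      $vv  = $vv  + ( $$v[ $i ] * $$v[ $i ] );
      $ww  = $ww  + ( $$w[ $i ] * $$w[ $i ] );

    }

    # the ratio is itself the cosine of the angle between the vectors
    return $dot / ( sqrt( $vv ) * sqrt( $ww ) );

  }

  printf "A and B: %.3f\n", similarity( \@A, \@B );
  printf "A and A: %.3f\n", similarity( \@A, \@A );</code></pre>
<p>
Run as written, the script should report a similarity of about 0.947 for A and B, and 1.000 when A is compared with itself.
</p>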
<p>
Most cosine similarity applications apply the comparison to every word in a document. Consequently each vector has a large number of dimensions making calculations time consuming. For the purposes of this series, I am only interested in the &#8220;big names&#8221; and &#8220;great ideas&#8221;, and since The Great Books of the Western World includes about 150 of such terms, the application of cosine similarity is simplified.
</p>
<p>
To implement cosine similarity in Perl three additional subroutines needed to be written. Two of them, one to calculate the inner (dot) product of two vectors and another to calculate the Euclidian length of a vector, are listed below:
</p>
<pre><code>  sub dot {

    # dot product = (a1*b1 + a2*b2 ... ) where a and b are equally sized arrays (vectors)
    my $a = shift;
    my $b = shift;
    my $d = 0;
    for ( my $i = 0; $i &lt;= $#$a; $i++ ) { $d = $d + ( $$a[ $i ] * $$b[ $i ] ) }
    return $d;

  }

  sub euclidian {

    # Euclidian length = sqrt( a1^2 + a2^2 ... ) where a is an array (vector)
    my $a = shift;
    my $e = 0;
    for ( my $i = 0; $i &lt;= $#$a; $i++ ) { $e = $e + ( $$a[ $i ] * $$a[ $i ] ) }
    return sqrt( $e );

  }</code></pre>
<p>
The subroutine that does the actual comparison is listed below. Given a reference to an array of two books, stop words, and ideas, it indexes each book sans stop words, searches each book for a great idea, uses the resulting TFIDF score to build the vectors, and computes similarity:
</p>
<pre><code>  sub compare {

    my $books     = shift;
    my $stopwords = shift;
    my $ideas     = shift;

    my %index = ();
    my @a     = ();
    my @b     = ();

    # index
    foreach my $book ( @$books ) { $index{ $book } = &amp;index( $book, $stopwords ) }

    # process each idea
    foreach my $idea ( sort( keys( %$ideas ))) {

      # search
      my ( $hits, @files ) = &amp;search( \%index, $idea );

      # rank
      my $ranks = &amp;rank( \%index, [ @files ], $idea );

      # build vectors, a &amp; b
      my $index = 0;
      foreach my $file ( @$books ) {

        if    ( $index == 0 ) { push @a, $$ranks{ $file }}
        elsif ( $index == 1 ) { push @b, $$ranks{ $file }}
        $index++;

        }

      }

      # compare; scores closer to 1 approach similarity. the ratio is already
      # the cosine of the angle, so it must not be passed through cos() again
      return ( &amp;dot( [ @a ], [ @b ] ) / ( &amp;euclidian( [ @a ] ) * &amp;euclidian( [ @b ] )));

  }</code></pre>
<p>
Finally, a script, compare.pl, was written gluing the whole thing together. Its heart is listed here:
</p>
<pre><code>  # compare each document...
  for ( my $a = 0; $a &lt;= $#corpus; $a++ ) {

    print "\td", $a + 1;

    # ...to every other document
    for ( my $b = 0; $b &lt;= $#corpus; $b++ ) {

      # avoid redundant comparisons
      if ( $b &lt;= $a ) { print "\t - " }

      # process next two documents
      else {

        # (re-)initialize
        my @books = sort( $corpus[ $a ], $corpus[ $b ] );

        # do the work; scores closer to 1000 approach similarity
        print "\t", int(( &amp;compare( [ @books ], $stopwords, $ideas )) * 1000 );

      }

    }

    # next line
    print "\n";

  }</code></pre>
<p>
In a nutshell, compare.pl loops through each document in a corpus and compares it to every other document in the corpus while skipping duplicate comparisons. Remember, only the dimensions representing &#8220;big names&#8221; and &#8220;great ideas&#8221; are calculated. Finally, it displays a similarity score for each pair of documents. Scores are multiplied by 1000 to make them easier to read. Given the sample data from the distribution, the following matrix is produced:
</p>
<pre>
  $ ./compare.pl
    Comparison: scores closer to 1000 approach similarity

        d1   d2   d3   d4   d5   d6

    d1   -  922  896  858  857  948
    d2   -   -   887  969  944  971
    d3   -   -    -   951  954  964
    d4   -   -    -    -   768  905
    d5   -   -    -    -    -   933
    d6   -   -    -    -    -    - 

    d1 = aristotle.txt
    d2 = hegel.txt
    d3 = kant.txt
    d4 = librarianship.txt
    d5 = mississippi.txt
    d6 = plato.txt
</pre>
<p>
From the matrix it is obvious that documents d2 (hegel.txt) and d6 (plato.txt) are the most similar since their score is the closest to 1000. This means the vectors representing these documents are closer to congruency than the other documents. Notice how all the documents are very close to 1000. This makes sense since all of the documents come from the Alex Catalogue and the Alex Catalogue documents are selected because of their &#8220;great idea-ness&#8221;. The documents should be similar. Notice which documents are the least similar: d4 (librarianship.txt) and d5 (mississippi.txt). The first is a history of librarianship. The second is a novel called Life on the Mississippi. Intuitively, we would expect this to be true; neither of these documents is about &#8220;great ideas&#8221;.
</p>
<p>
(Argg! Something is incorrect with my trigonometry. When I duplicate a document and run compare.pl the resulting cosine similarity value between the exact same documents is 540, not 1000. What am I doing wrong?)
</p>
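<p>
As a sanity check on the trigonometry, cosine similarity is the dot product of two vectors divided by the product of their magnitudes, so a document compared with itself ought to score 1 (or 1000 after scaling). What follows is only a minimal sketch, not the compare routine from the distribution. The subroutine name cosine and the sample scores are made up for illustration, and the sketch assumes both vectors list their scores in the same dimension order:
</p>

```perl
use strict;
use warnings;

# cosine similarity between two equal-length vectors (array references)
sub cosine {

  my ( $x, $y ) = @_;
  my ( $dot, $xx, $yy ) = ( 0, 0, 0 );

  foreach my $i ( 0 .. $#$x ) {

    $dot += $$x[ $i ] * $$y[ $i ];  # dot product
    $xx  += $$x[ $i ] ** 2;         # squared magnitude of x
    $yy  += $$y[ $i ] ** 2;         # squared magnitude of y

  }

  # a vector of all zeros has no direction; call it dissimilar
  return 0 if $xx == 0 or $yy == 0;

  return $dot / ( sqrt( $xx ) * sqrt( $yy ) );

}

# hypothetical TFIDF scores along three shared dimensions
my @d1 = ( 0.019, 0.005, 0.003 );
my @d2 = ( 0.001, 0.006, 0.000 );

# scale to 1000 and round, as compare.pl scales its scores
printf "d1 vs d1: %d\n", int( cosine( \@d1, \@d1 ) * 1000 + 0.5 );
printf "d1 vs d2: %d\n", int( cosine( \@d1, \@d2 ) * 1000 + 0.5 );
```

<p>
If a routine like this reports something other than 1000 for a document compared with itself, likely suspects are vectors whose dimensions are not in the same order or a magnitude calculated over a different set of dimensions than the dot product.
</p>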
<h3>Summary</h3>
<p>
This last part in the series demonstrated ways term frequency/inverse document frequency (TFIDF) can be applied to over-arching (or underlying) themes in a corpus of documents, specifically the &#8220;big names&#8221; and &#8220;great ideas&#8221; of Western civilization. It also demonstrated how TFIDF scores can be used to create vectors representing documents. These vectors can then be compared for similarity, and, by extension, the documents they represent can be compared for similarity.
</p>
<p>
The purpose of the entire series was to bring to light and take the magic out of a typical relevancy ranking algorithm. A distribution including all the source code and sample documents is available online. Use the distribution as a learning tool for your own explorations.
</p>
<p>
As alluded to previously, TFIDF is like any good folk song. It has many variations and applications. TFIDF is also like milled grain because it is a fundamental ingredient to many recipes. Some of these recipes are for bread, but some of them are for pies or just thickener. Librarians and libraries need to incorporate more mathematical methods into their processes. There needs to be a stronger marriage between the social characteristics of librarianship and the logic of mathematics. (Think <a href="http://infomotions.com/blog/2008/07/origami-is-arscient-and-so-is-librarianship/">arscience</a>.) The application of TFIDF in libraries is just one example.</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/05/tfidf-in-libraries-part-iii-of-iii-for-thinkers/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The decline of books</title>
		<link>http://infomotions.com/blog/2009/05/the-decline-of-books/</link>
		<comments>http://infomotions.com/blog/2009/05/the-decline-of-books/#comments</comments>
		<pubDate>Fri, 08 May 2009 13:41:04 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Librarianship]]></category>

		<category><![CDATA[books]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=276</guid>
		<description><![CDATA[[This posting is in response to a tiny thread on the NGC4Lib mailing list about the decline of books. --ELM]
Yes, books are on the decline, but in order to keep this trend in perspective it is important to not confuse the medium with the message. The issue is not necessarily about books as much as [...]]]></description>
			<content:encoded><![CDATA[<p>[This posting is in response to a tiny thread on the NGC4Lib mailing list about the <a href="http://serials.infomotions.com/ngc4lib/sru/?operation=searchRetrieve&#038;version=1.1&#038;stylesheet=%2Fngc4lib%2Fsru%2Fstyle.xsl&#038;query=subject%3D%22decline+of+books%22">decline of books</a>. --ELM]</p>
<p>Yes, books are on the decline, but in order to keep this trend in perspective it is important to not confuse the medium with the message. The issue is not necessarily about books as much as it is about the stuff inside the books.</p>
<p>Books &#8212; codexes &#8212; are a particular type of technology. Print words and pictures on leaves of paper. Number the pages. Add an outline of the book&#8217;s contents &#8212; a table of contents. Make the book somewhat searchable by adding an index. Wrap the whole thing between a couple of boards. The result is a thing that is portable, durable, long-lasting, and relatively free-standing as well as independent of other technology. But all of this is really a transport medium, a container for the content.</p>
<p>Consider the content of books. Upon close examination it is a recorded manifestation of humanity. Books &#8212; just like the Web &#8212; are a reflection of humankind because just about anything you can think of can be manifested in printed form. Birth. Growth. Love. Marriage. Aging. Death. Poetry. Prose. Mathematics. Astronomy. Business. Instructions. Facts. Directories. Gardening. Theses and dissertations. News. White papers. Plans. History. Descriptions. Dreams. Weather. Stock quotes. The price of gold. Things for sale. Stories both real and fictional. Etc. Etc. Etc.</p>
<p>Consider the length of time humankind has been recording things in written form. Maybe five thousand years. What were the mediums used? Stone and clay tablets. Papyrus scrolls. Vellum. Paper. To what extent did people bemoan the death of clay tablets? To what extent did they bemoan the movement from scrolls to codexes? Probably the cultures who valued verbal traditions as opposed to written traditions (think of the American Indians) had more to complain about than the migration from one written form to another. The medium is not as important as the message.</p>
<p>Different types of content lend themselves to different mediums. Music can be communicated via the written score, but music is really intended to be experienced through hearing. Sculpture is, by definition, a three-dimensional medium, yet we take photographs of it, a two-dimensional medium. Poetry and prose lend themselves very well to the written word, but they can be seen as forms of storytelling, and while there are many advantages to stories being written down, there are disadvantages as well. No sound effects. Where to put the emphasis on phrases? Hand gestures to communicate subtle distinctions are lost. It is for all of these reasons that libraries (and museums and archives) also collect the mediums that better represent this content. Paintings. Sound recordings. Artifacts. CDs and DVDs.</p>
<p>The containers of information will continue to change, but I assert that the content will not. The content will continue to be a reflection of humankind. It will represent all of the things that it means to be men, women, and children. It will continue to be an exposition of our collective thoughts, feelings, beliefs, and experiences.</p>
<p>Libraries and other &#8220;cultural heritage institutions&#8221; do not have and never did have a monopoly on recorded content, but now, more than ever, and as we have moved away from an industrial-based economy to a more service-based economy whose communication channels are electronic and global, the delivery of recorded content, in whatever form, is more profitable. Consequently there is more competition. Libraries need to get a grip on what they are all about. If it is about the medium &#8212; books, CDs, articles &#8212; then the future is grim. If it is about content and making that content useful to their clientele, then the opportunities are wide open. Shifting a person&#8217;s focus from the how to the what is challenging. Looking at the forest from the trees is sometimes overwhelming. Anybody can get information these days. We are still drinking from the proverbial fire hose. The problem to be solved is less about discovery and more about use. It is about placing content in context. Providing a means to understanding it, manipulating it, and using it to solve the problems revolving around what it means to be human.</p>
<p>We are a set of educated people. If we put our collective minds to the problem, then I sincerely believe libraries can and will remain relevant. In fact, that is why I instituted this [the NGC4Lib] mailing list.</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/05/the-decline-of-books/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Code4Lib Software Award: Loose ends</title>
		<link>http://infomotions.com/blog/2009/04/code4lib-software-award-loose-ends/</link>
		<comments>http://infomotions.com/blog/2009/04/code4lib-software-award-loose-ends/#comments</comments>
		<pubDate>Mon, 27 Apr 2009 12:42:44 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Librarianship]]></category>

		<category><![CDATA[Code4Lib]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=273</guid>
		<description><![CDATA[Loose ends make me feel uncomfortable, and one of the loose ends in my professional life is the Code4Lib Software Award.
Code4Lib began as a mailing list in 2003 and has grown to about 1,200 subscribers from all over the world. New people subscribe to the list almost daily. Its Web presence started up in 2005. [...]]]></description>
			<content:encoded><![CDATA[<p>Loose ends make me feel uncomfortable, and one of the loose ends in my professional life is the <a href="http://infomotions.com/blog/2009/03/code4lib-open-source-software-award/">Code4Lib Software Award</a>.</p>
<p>Code4Lib began as a mailing list in 2003 and has grown to about 1,200 subscribers from all over the world. New people subscribe to the list almost daily. Its Web presence started up in 2005. Our conferences have been stimulating, informative, and productive for all three years of their existence. Our latest venture &#8212; the journal &#8212; records, documents, and shares the practical experience of our community. Underlying all of this is an IRC channel where library-related computer problems can be answered in real-time. Heck, there even exist three or four Code4Lib &#8220;franchises&#8221;. In sum, by exploiting both traditional and less traditional mediums the Code4Lib Community has grown and matured quickly over the past five years. In doing so it has provided valuable and long-lasting services to itself as well as the greater library profession.</p>
<p>It is for the reasons outlined above that I believe our community is ripe for an award. Good things happen in Code4Lib. These things begin with individuals, and I believe the good code written by these individuals ought to be formally recognized. Unfortunately, ever since I put forward the idea, I have heard more negative things than positive. To paraphrase, &#8220;It would be seen as an endorsement, and we don&#8217;t endorse&#8230; It would turn out to be just a popularity contest&#8230; There are so many characteristics of good software that any decision would seem arbitrary.&#8221;</p>
<p>Apparently the place for an award is not as obvious to others as it is to me. Apparently our community is not as ready for an award as I thought we were. That is why, for the time being, I am withdrawing my offer to sponsor one. Considering who I am, I simply don&#8217;t have the political wherewithal to make the award a reality, but I do predict there will be an award at some time, just not right now. The idea needs to ferment for a while longer.</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/04/code4lib-software-award-loose-ends/feed/</wfw:commentRss>
		</item>
		<item>
		<title>TFIDF In Libraries: Part II of III (For programmers)</title>
		<link>http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-ii-of-iii-for-programmers/</link>
		<comments>http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-ii-of-iii-for-programmers/#comments</comments>
		<pubDate>Tue, 21 Apr 2009 02:42:39 +0000</pubDate>
		<dc:creator>Eric Lease Morgan</dc:creator>
		
		<category><![CDATA[Hacks]]></category>

		<category><![CDATA[Librarianship]]></category>

		<category><![CDATA[term frequency/inverse document frequency (TFIDF)]]></category>

		<guid isPermaLink="false">http://infomotions.com/blog/?p=266</guid>
		<description><![CDATA[
This is the second of a three-part series called TFIDF In Libraries, where relevancy ranking techniques are explored through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techniques will be explored to the end of filtering search results or addressing the perennial [...]]]></description>
			<content:encoded><![CDATA[<p>
This is the second of a three-part series called TFIDF In Libraries, where relevancy ranking techniques are explored through a set of simple Perl programs. In <a href="http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-i-for-librarians/">Part I</a> relevancy ranking was introduced and explained. In <a href="http://infomotions.com/blog/2009/05/tfidf-in-libraries-part-iii-of-iii-for-thinkers/">Part III</a> additional word/document weighting techniques will be explored to the end of filtering search results or addressing the perennial task of &#8220;finding more documents like this one.&#8221; In the end it is hoped to demonstrate that relevancy ranking is neither magic nor mysterious but rather the process of applying statistical techniques to textual objects.
</p>
<h3>TFIDF, again</h3>
<p>
As described in Part I, term frequency/inverse document frequency (TFIDF) is a process of counting words in a document as well as throughout a corpus of documents to the end of sorting documents in statistically relevant ways.
</p>
<p>
Term frequency (TF) is essentially a percentage denoting the number of times a word appears in a document. It is mathematically expressed as C / T, where C is the number of times a word appears in a document and T is the total number of words in the same document.
</p>
<p>
Inverse document frequency (IDF) takes into account that many words occur many times in many documents. Stop words and the word &#8220;human&#8221; in the MEDLINE database are very good examples. IDF is mathematically expressed as D / DF, where D is the total number of documents in a corpus and DF is the number of documents in which a particular word is found. As D / DF increases so does the significance of the given word.
</p>
<p>
Given these two factors, TFIDF is literally the product of TF and IDF:
</p>
<blockquote><p>
TFIDF = ( C / T ) * ( D / DF )
</p></blockquote>
<p>
This is the basic form that has been used to denote relevance ranking for more than forty years, and please take note that it requires no advanced mathematical knowledge &#8212; just basic arithmetic.
</p>
<p>
Like any good recipe or folk song, TFIDF has many variations. Google, for example, adds additional factors into their weighting scheme based on the popularity of documents. Other possibilities could include factors denoting the characteristics of the person using the texts. In order to accommodate the wide variety of document sizes, the natural log of IDF will be employed throughout the balance of this demonstration. Therefore, for the purposes used here, TFIDF will be defined thus:
</p>
<blockquote><p>
TFIDF = ( C / T ) * log( D / DF )
</p></blockquote>
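<p>
To make the arithmetic concrete, here is a small worked example of the formula above. The counts are hypothetical and chosen only for illustration; they do not come from the sample data:
</p>

```perl
use strict;
use warnings;

# hypothetical counts, chosen only for illustration
my $c  = 30;     # C:  times the word appears in the document
my $t  = 10000;  # T:  total number of words in the document
my $d  = 6;      # D:  number of documents in the corpus
my $df = 2;      # DF: number of documents containing the word

my $tf    = $c / $t;          # term frequency
my $idf   = log( $d / $df );  # Perl's log() is the natural log
my $tfidf = $tf * $idf;

printf "TF = %.4f, IDF = %.4f, TFIDF = %.6f\n", $tf, $idf, $tfidf;
```

<p>
With these numbers TF is 0.003 and the natural log of 6 / 2 is about 1.0986, so the word scores roughly 0.0033. If the same word appeared in all six documents, the log of 6 / 6 would be 0 and the product would drop to 0 no matter how often the word occurred.
</p>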
<h3>Simple Perl subroutines</h3>
<p>
In order to put theory into practice, I wrote a number of Perl subroutines implementing various aspects of relevancy ranking techniques. I then wrote a number of scripts exploiting the subroutines, essentially wrapping them in a user interface.
</p>
<p>
Two of the routines are trivial and will not be explained in any greater detail than below:
</p>
<ul>
<li>corpus - Returns an array of all the .txt files in the current directory, and is used to denote the library of content to be analyzed.</li>
<li>slurp_words - Returns a reference to a hash of all the words in a file, specifically for the purposes of implementing a stop word list.</li>
</ul>
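<p>
For readers who want something to type in, the two routines just listed might look something like the following. This is a guess at their shape based on the descriptions above, not the code from the distribution; in particular, the assumption that the stop word file contains whitespace-delimited words is mine:
</p>

```perl
use strict;
use warnings;

# return a sorted list of the .txt files in the current directory
sub corpus {

  my @files = glob( '*.txt' );
  return sort { $a cmp $b } @files;

}

# return a reference to a hash whose keys are the words in a file;
# assumes the words are delimited by whitespace
sub slurp_words {

  my $file  = shift;
  my %words = ();

  open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
  while ( my $line = readline( $fh ) ) {

    foreach my $word ( split( /\s+/, lc( $line ) ) ) {

      # split can return a leading empty string; skip it
      $words{ $word } = 1 if length( $word );

    }

  }
  close( $fh );

  return \%words;

}
```

<p>
A hash is used instead of an array so that checking whether a word is a stop word is a single lookup rather than a scan of the whole list.
</p>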
<p>
Two more of the routines are used to support indexing and searching the corpus. Again, since neither is the focus of this posting, each will only be outlined:
</p>
<ul>
<li>index - Given a file name and a list of stop words, this routine returns a reference to a hash containing all of the words in the file (sans stop words) as well as the number of times each word occurs. Strictly speaking, this hash is not an index but it serves our given purpose adequately.</li>
<li>search - Given an &#8220;index&#8221; and a query, this routine returns the number of times the query was found in the index as well as an array of files listing where the term was found. Search is limited. It only supports single-term queries, and there are no fields for limiting.</li>
</ul>
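<p>
Again only as a sketch under assumptions, the indexing and searching routines could be approximated as shown below. The names index_file and search_index are hypothetical (chosen to avoid colliding with Perl&#8217;s built-in index function), and the tokenizing rule, anything other than the letters a through z separates words, is my own simplification:
</p>

```perl
use strict;
use warnings;

# count the non-stop words in a file; returns a reference to a hash
sub index_file {

  my ( $file, $stopwords ) = @_;
  my %words = ();

  open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
  while ( my $line = readline( $fh ) ) {

    # lower-case, then treat anything but letters as a delimiter
    foreach my $word ( split( /[^a-z]+/, lc( $line ) ) ) {

      next if $word eq '' or exists $$stopwords{ $word };
      $words{ $word }++;

    }

  }
  close( $fh );

  return \%words;

}

# find the files containing a single term; returns the number of
# matching files followed by the list of files
sub search_index {

  my ( $index, $query ) = @_;
  my @files = grep { exists $$index{ $_ }{ $query } } sort keys %$index;
  return ( scalar( @files ), @files );

}

# a tiny in-memory "index" standing in for real files
my %index = (
  'a.txt' => { knowledge => 3, virtue => 1 },
  'b.txt' => { river => 5 },
);

my ( $hits, @files ) = search_index( \%index, 'knowledge' );
print "Your search found $hits hit(s): @files\n";
```

<p>
Run as-is, the example reports one hit, a.txt. Notice the &#8220;index&#8221; is nothing more than a hash of hashes: file names point to words, and words point to counts.
</p>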
<p>
The heart of the library of subroutines is used to calculate TFIDF, rank search results, and classify documents. Of course the TFIDF calculation is absolutely necessary, but ironically, it is the most straight-forward routine in the collection. Given values for C, T, D, and DF it returns a decimal, usually between 0 and 1. Trivial:
</p>
<pre><code>  # calculate tfidf
  sub tfidf {

    my $n = shift;  # C
    my $t = shift;  # T
    my $d = shift;  # D
    my $h = shift;  # DF

    my $tfidf = 0;

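    # when every document contains the word, log( D / DF ) is 0,
    # so fall back to plain term frequency
    if ( $d == $h ) { $tfidf = ( $n / $t ) }
    else { $tfidf = ( $n / $t ) * log( $d / $h ) }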

    return $tfidf;

  }
</code></pre>
<p>
Many readers will probably be most interested in the rank routine. Given an index, a list of files, and a query, this code calculates TFIDF for each file and returns the results as a reference to a hash. It does this by repeatedly calculating the values for C, T, D, and DF for each of the files and calling tfidf:
</p>
<pre><code>  # assign a rank to a given file for a given query
  sub rank {

    my $index = shift;
    my $files = shift;
    my $query = shift;

    my %ranks = ();

    foreach my $file ( @$files ) {

      # calculate n
      my $words = $$index{ $file };
      my $n = $$words{ $query };

      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

      # assign tfidf to file
      # D is the number of indexed documents; DF is the number of files found
      $ranks{ $file } = &amp;tfidf( $n, $t, scalar( keys %$index ), scalar @$files );

    }

    return \%ranks;

  }
</code></pre>
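<p>
To see the ranking arithmetic without touching any files, the same calculation can be run against a tiny in-memory index. Everything here, the file names, the words, and the counts, is invented for illustration, and the sketch applies the log form of the formula directly rather than calling the tfidf subroutine:
</p>

```perl
use strict;
use warnings;

# hypothetical index: file name => { word => count }
my %index = (
  'a.txt' => { knowledge => 4, virtue => 6 },
  'b.txt' => { knowledge => 1, river  => 9 },
  'c.txt' => { boat => 10 },
);
my $query = 'knowledge';

# D is every document in the index; DF is every document with the term
my $d    = scalar( keys %index );
my @hits = grep { exists $index{ $_ }{ $query } } keys %index;
my $df   = scalar( @hits );

my %ranks = ();
foreach my $file ( @hits ) {

  my $words = $index{ $file };
  my $c = $$words{ $query };          # C: occurrences of the term
  my $t = 0;                          # T: total words in the file
  $t += $_ foreach values %$words;

  $ranks{ $file } = ( $c / $t ) * log( $d / $df );

}

# sort by rank, highest first, and display
foreach my $file ( sort { $ranks{ $b } <=> $ranks{ $a } } keys %ranks ) {

  printf "%.4f\t%s\n", $ranks{ $file }, $file;

}
```

<p>
Both a.txt and b.txt share the same IDF, so a.txt sorts first simply because the term makes up a larger share of its words. The file c.txt never appears because it does not contain the term at all.
</p>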
<p>
The classify routine is an added bonus. Given the index, a file, and the corpus of files, this function calculates TFIDF for each word in the file and returns a reference to a hash containing each word and its TFIDF value. In other words, instead of calculating TFIDF for a given query in a subset of documents, it calculates TFIDF for every word in a document, measured against the entire corpus. This proves useful in regards to automatic classification. Like rank, it repeatedly determines values for C, T, D, and DF and calls tfidf:
</p>
<pre><code>  # rank each word in a given document compared to a corpus
  sub classify {

    my $index  = shift;
    my $file   = shift;
    my $corpus = shift;

    my %tags = ();

    foreach my $words ( $$index{ $file } ) {

      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

      foreach my $word ( keys %$words ) {

        # get n
        my $n = $$words{ $word };

        # calculate h
        my ( $h, @files ) = &amp;search( $index, $word );

        # assign tfidf to word
        $tags{ $word } = &amp;tfidf( $n, $t, scalar @$corpus, $h );

      }

    }

    return \%tags;

  }
</code></pre>
<h3>Search.pl</h3>
<p>
Two simple Perl scripts are presented below, taking advantage of the routines described above. The first is search.pl. Given a single term as input this script indexes the .txt files in the current directory, searches them for the term, assigns TFIDF to each of the results, and displays the results in a relevancy ranked order. The essential aspects of the script are listed here:
</p>
<pre><code>  # define
  use constant STOPWORDS => 'stopwords.inc';

  # include
  require 'subroutines.pl';

  # get the query
  my $q = lc( $ARGV[ 0 ] );

  # index
  my %index = ();
  foreach my $file ( &amp;corpus ) { $index{ $file } = &amp;index( $file, &amp;slurp_words( STOPWORDS ) ) }

  # search
  my ( $hits, @files ) = &amp;search( \%index, $q );
  print "Your search found $hits hit(s)\n";

  # rank
  my $ranks = &amp;rank( \%index, [ @files ], $q );

  # sort by rank and display
  foreach my $file ( sort { $$ranks{ $b } &lt;=&gt; $$ranks{ $a } } keys %$ranks ) {

    print "\t", $$ranks{ $file }, "\t", $file, "\n"

  }

  # done
  print "\n";
  exit;
</code></pre>
<p>
Output from the script looks something like this:
</p>
<pre><code>  $ ./search.pl knowledge
  Your search found 6 hit(s)
    0.0193061840120664    plato.txt
    0.00558586078987563   kant.txt
    0.00299602568022012   aristotle.txt
    0.0010031177985631    librarianship.txt
    0.00059150597421034   hegel.txt
    0.000150303111274403  mississippi.txt
</code></pre>
<p>
From these results you can see that the document named plato.txt is the most relevant because it has the highest score. In fact, it is roughly three and a half times more relevant than the second hit, kant.txt. For extra credit, ask yourself, &#8220;At what point do the scores become useless, or when do the scores tell you there is nothing of significance here?&#8221;
</p>
<h3>Classify.pl</h3>
<p>
As alluded to in Part I of this series, TFIDF can be turned on its head to do automatic classification. Weigh each term in a corpus of documents, and list the most significant words for a given document. Classify.pl does this by denoting a lower bounds for TFIDF scores, indexing an entire corpus, weighing each term, and outputting all the terms whose scores are greater than the lower bounds. If no terms are greater than the lower bounds, then it lists the top N scores as defined by a configuration. The essential aspects of classify.pl are listed below:
</p>
<pre><code>  # define
  use constant STOPWORDS    => 'stopwords.inc';
  use constant LOWERBOUNDS  => .02;
  use constant NUMBEROFTAGS => 5;

  # require
  require 'subroutines.pl';

  # initialize
  my @corpus = &amp;corpus;

  # index
  my %index = ();
  foreach my $file (@corpus ) { $index{ $file } = &amp;index( $file, &amp;slurp_words( STOPWORDS ) ) }

  # classify each document
  foreach my $file ( @corpus ) {

    print $file, "\n";

    # list tags greater than a given score
    my $tags  = &amp;classify( \%index, $file, [ @corpus ] );
    my $found = 0;
    foreach my $tag ( sort { $$tags{ $b } &lt;=&gt; $$tags{ $a } } keys %$tags ) {

      if ( $$tags{ $tag } &gt; LOWERBOUNDS ) {

        print "\t", $$tags{ $tag }, "\t$tag\n";
        $found = 1;

      }

      else { last }

    }

    # accommodate tags with low scores
    if ( ! $found ) {

      my $n = 0;
      foreach my $tag ( sort { $$tags{ $b } &lt;=&gt; $$tags{ $a } } keys %$tags ) {

        print "\t", $$tags{ $tag }, "\t$tag\n";
        $n++;
        last if ( $n == NUMBEROFTAGS );

      }

    }

    print "\n";

  }

  # done
  exit;
</code></pre>
<p>
For example, sample, yet truncated, output from classify.pl looks like this:
</p>
<pre><code>  aristotle.txt
    0.0180678691531642  being
    0.0112840859266579  substances
    0.0110363803118312  number
    0.0106083766432284  matter
    0.0098440843778661  sense

  mississippi.txt
    0.00499714142455761  mississippi
    0.00429324597184886  boat
    0.00418922035591656  orleans
    0.00374087743616293  day
    0.00333830388445574  river
</code></pre>
<p>
Thus the words being, substances, number, matter, and sense are the most significant in the document named aristotle.txt. Assuming a lower TFIDF bounds of 0.02, none of the words in either sample have a score that high, so the top five words are returned instead. For more extra credit, think of ways classify.pl can be improved by answering, &#8220;How can the output be mapped to controlled vocabulary terms or expanded through the use of some other thesaurus?&#8221;
</p>
<h3>Summary</h3>
<p>
The Perl subroutines and scripts described here implement TFIDF to do rudimentary ranking of search results and automatic classification. They are not designed to be production applications, just example tools for the purposes of learning. Turning the ideas implemented in these scripts into production applications has been the fodder for many people&#8217;s careers and entire branches of computer science.
</p>
<p>
You can <a href="http://infomotions.com/blog/wp-content/uploads/2009/04/tfidf.tar.gz">download the scripts, subroutines, and sample data</a> in order for you to learn more. You are encouraged to remove the .txt files from the distribution and replace them with your own data. I think your search results and automatic classification output will confirm in your mind that TFIDF is well worth the time and effort of the library community. Given the amounts of full text books and journal articles freely available on the Internet, it behooves the library profession to learn to exploit these concepts because our traditional practices simply: 1) do not scale, or 2) do not meet with our users&#8217; expectations. Furthermore, farming these sorts of solutions out to vendors is irresponsible.</p>
]]></content:encoded>
			<wfw:commentRss>http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-ii-of-iii-for-programmers/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
