Archive for the ‘Librarianship’ Category

Collecting water and putting it on the Web (Part III of III)

Thursday, September 3rd, 2009

This is Part III of an essay about my water collection, specifically a summary, opportunities for future study, and links to the source code. Part I described the collection’s whys and hows. Part II described the process of putting it on the Web.

Summary, future possibilities, and source code

There is no doubt about it. My water collection is eccentric but through my life time I have encountered four other people who also collect water. At least I am not alone.

Putting the collection on the Web is a great study in current technology. It includes relational database design. Doing input/output against the database through a programming language. Exploiting the “extensible” in XML by creating my own mark-up language. Using XSLT to transform the XML for various purposes: display as well as simple transformation. Literally putting the water collection on the map. Undoubtably technology will change, but the technology of my water collection is a representative reflection of the current use of computers to make things available on the Web.

I have made all the software a part of this system available here:

  1. SQL file sans any data – good for study of simple relational database
  2. SQL file complete with data – see how image data is saved in the database
  3. PHP scripts – used to do input/output against the database
  4. waters.xml – a database dump, sans images, in the form of an XML file
  5. waters.xsl – the XSLT used to display the browser interface
  6. waters2markers.xsl – transform water.xml into Google Maps XML file
  7. map.pl – implementation of Google Maps API

My water also embodies characteristics of librarianship. Collection. Acquisition. Preservation. Organization. Dissemination. The only difference is that the content is not bibliographic in nature.

There are many ways access to the collection could be improved. It would be nice to sort by date. It would be nice to index the content and make the collection searchable. I have given thought to transforming the WaterML into FO (Formatting Objects) and feeding the FO to a PDF processor like FOP. This could give me a printed version of the collection complete with high resolution images. I could transform the WaterML into an XML file usable by Google Earth providing another way to view the collection. All of these things are “left up the reader for further study”. Software is never done, nor are library collections.

River Lune
River Lune
Roman Bath
Roman Bath
Ogle Lake
Ogle Lake

Finally, again, why do I do this? Why do I collect the water? Why have a spent so much time creating a system for providing access to the collection? Ironically, I am unable to answer succinctly. It has something to do with creativity. It has something to do with “arsience“. It has something to do with my passion for the library profession and my ability to manifest it through computers. It has something to do with the medium of my art. It has something to do with my desire to share and expand the sphere of knowledge. “Idea. To be an idea. To be an idea and an example to others… Idea”. I really don’t understand it through and through.

Read all the posts in this series:

  1. The whys and hows of the water collection
  2. How the collection is put on the Web
  3. This post

Visit the water collection.

Collecting water and putting it on the Web (Part I of III)

Thursday, September 3rd, 2009

This is Part I of an essay about my water collection, specifically the whys and hows of it. Part II describes the process of putting the collection on the Web. Part III is a summary, provides opportunities for future study, and links to the source code.

I collect water

It may sound strange, but I have been collecting water since 1978, and to date I believe I have around 200 bottles containing water from all over the world. Most of the water I’ve collected myself, but much of it has also been collected by friends and relatives.

The collection began the summer after I graduated from high school. One of my best friends, Marlin Miller, decided to take me to Ocean City (Maryland) since I had never seen the ocean. We arrived around 2:30 in the morning, and my first impression was the sound. I didn’t see the ocean. I just heard it, and it was loud. The next day I purchased a partially melted glass bottle for 59ยข and put some water, sand, and air inside. I was going keep some of the ocean so I could experience it anytime I desired. (Actually, I believe my first water is/was from the Pacific Ocean, collected by a girl named Cindy Bleacher. She visited there in the late Spring of ‘78, and I asked her to bring some back so I could see it too. She did.) That is how the collection got started.

Cape Cod Bay
Cape Cod Bay
Robins Bay
Robins Bay
Gulf of Mexico
Gulf of Mexico

The impetus behind the collection was reinforced in college — Bethany College (Bethany, WV). As a philosophy major I learned about the history of Western ideas. That included Heraclitus who believed the only constant was change, and water was the essencial element of the universe. These ideas were elaborated upon by other philosophers who thought there was not one essencial element, but four: earth, water, air, and fire. I felt like I was on to something, and whenever I heard of somebody going abroad I asked them bring me back some water. Burton Thurston, a Bethany professor, went to the Middle East on a diplomatic mission. He brought back Nile River water and water from the Red Sea. I could almost see Moses floating in his basket and escaping from the Egyptians.

The collection grew significantly in the Fall of 1982 because I went to Europe. During college many of my friends studied abroad. They didn’t do much studying as much as they did traveling. They were seeing and experiencing all of the things I was learning about through books. Great art. Great architecture. Cities whose histories go back millennia. Foreign languages, cultures, and foods. I wanted to see those things too. I wanted to make real the things I learned about in college. I saved my money from my summer peach picking job. My father cashed in a life insurance policy he had taken out on me when I was three weeks old. Living like a turtle with its house on its back, I did the back-packing thing across Europe for a mere six weeks. Along the way I collected water from the Seine at Notre Dame (Paris), the Thames (London), the Eiger Mountain (near Interlaken, Switzerland) where I almost died, the Agean Sea (Ios, Greece), and many other places. My Mediterranean Sea water from Nice is the prettiest. Because of the all the alge, the water from Venice is/was the most biologically active.

Over the subsequent years the collection has grown at a slower but regular pace. Atlantic Ocean (Myrtle Beach, South Carolina) on a day of playing hooky from work. A pond at Versailles while on my honeymoon. Holy water from the River Ganges (India). Water from Lock Ness. I’m going to grow a monster from DNA contained therein. I used to have some of a glacier from the Canadian Rockies, but it melted. I have water from Three Mile Island (Pennsylvania). It glows in the dark. Amazon River water from Peru. Water from the Missouri River where Lewis & Clarke decided it began. Etc.

Many of these waters I haven’t seen in years. Moves from one home to another have relegated them to cardboard boxes that have never been unpacked. Most assuredly some of the bottles have broken and some of the water has evaporated. Such is the life of a water collection.

Lake Huron
Lake Huron
Trg Bana Jelacica
Trg Bana Jelacica
Jimmy Carter Water
Jimmy Carter Water

Why do I collect water? I’m not quite sure. The whole body of water is the second largest thing I know. The first being the sky. Yet the natural bodies of water around the globe are finite. It would be possible to collect water from everywhere, but very difficult. Maybe I like the challenge. Collecting water is cheap, and every place has it. Water makes a great souvenir, and the collection process helps strengthen my memories. When other people collect water for me it builds between us a special relationship — a bond. That feels good.

What do I do with the water? Nothing. It just sits around my house occupying space. In my office and in the cardboard boxes in the basement. I would like to display it, but over all the bottles aren’t very pretty, and they gather dust easily. I sometimes ponder the idea of re-bottling the water into tiny vials and selling it at very expensive prices, but in the process the air would escape, and the item would lose its value. Other times I imagine pouring the water into a tub and taking a bath it it. How many people could say they bathed in the Nile River, Amazon River, Pacific Ocean, Atlantic Ocean, etc. all at the same time.

How water is collected

The actual process of collecting water is almost trivial. Here’s how:

  1. Travel someplace new and different – The world is your oyster.
  2. Identify a body of water – This should be endemic of the locality such as an ocean, sea, lake, pond, river, stream, or even a public fountain. Natural bodies of water a preferable. Processed water is not.
  3. Find a bottle – In earlier years this was difficult, and I usually purchased a bottle of wine with my meal, kept the bottle and cork, and used the combination as my container. Now-a-days it is easier to root round in a trash can for a used water bottle. They’re ubiquitous, and they too are often endemic of the locality.
  4. Collect the water – Just fill the bottle with mostly water but some of what the water is flowing over as well. The air comes along for the ride.
  5. Take a photograph – Hold the bottle at arm’s length and take a picture it. What you are really doing here is two-fold. Documenting the appearance of the bottle but also documenting the authenticity of the place. The picture’s background supports the fact that water really came from where the collector says.
  6. Label the bottle – On a small piece of paper write the name of the body of water, where it came from, who collected it, and when. Anything else is extra.
  7. Save – Keep the water around for posterity, but getting it home is sometimes a challenge. With the advent of 911 it is difficult to get the water through airport security and/or customs. I have recently found myself checking my bags and incurring a handling fee just to bring my water home. Collecting water is not as cheap as it used to be.

Who can collect water for me? Not just anybody. I have to know you. Don’t take it personally, but remember, part of the goal is relationship building. Moreover, getting water from strangers would jeopardize the collection’s authenticity. Is this really the water they say it is? Call it a weird part of the “collection development policy”.

Pacific Ocean
Pacific Ocean
Rock Run
Rock Run
Salton Sea
Salton Sea

Read all the posts in this series:

  1. This post
  2. How the collection is put on the Web
  3. A summary, future directions, and source code

Visit the water collection.

Web-scale discovery services

Thursday, August 27th, 2009

Last week (Tuesday, August 18) Marshall Breeding and I participated in a webcast sponsored by Serials Solutions and Library Journal on the topic of “‘Web-scale’ discovery services”.

Our presentations complimented one another in that we both described the current library technology environment and described how the creation of amalgamated indexes of book and journal article content have the potential to improve access to library materials.

Dodie Ownes summarized the event in an article for Library Journal. From there you can also gain access to an archive of the one-hour webcast. (Free registration required.) I have made my written remarks available on the Hesburgh Libraries website as well as mirrored them locally. From the remarks:

It is quite possible the do-it-yourself creation and maintenance of an index to local book holdings, institutional repository content, and articles/etexts is not feasible. This may be true for any number of reasons. You may not have the full complement of resources to allocate, whether that be time, money, people, or skills. You and your library may have a set of priorities forcing the do-it-yourself approach lower on the to-do list. You might find yourself stuck in never-ending legal negotiations for content from “closed” access providers. You might liken the process of normalizing myriads of data formats into a single index to Hercules cleaning the Augean stables.

technical expertise
technical expertise
money
money
people with vision
people with vision
energy
energy

If this be the case, then the purchasing (read, “licensing”) of a single index service might be the next best thing — Plan B.

I sincerely believe the creation of these “Web-scale” indexes is a step in the right direction, but I believe just as strongly that the problem to be solved now-a-days does not revolve around search and discovery, but rather use and context.

“Thank you Serials Solutions and Library Journal for the opportunity to share some of my ideas.”

Top Tech Trends for ALA Annual, Summer 2009

Monday, July 20th, 2009

This is a list of Top Tech Trends for the ALA Annual Meeting, Summer 2009.*

Green computing

The amount of computing that gets done on our planet has a measurable carbon footprint, and many of us, myself included, do not know exactly how much heat our computers put off and how much energy they consume. With the help from some folks from the University of Notre Dame’s Center for Research Computing, I learned my laptop computer spikes at 30 watts on boot, slows down to 20 watts during normal use, idles at 2 watts during sleep, and zooms up to 34 watts when the screen saver kicks in. Just think how much energy and heat your computer consumes and generates while waiting for the nightly update from your systems department. But realistically, it is our servers that make the biggest impact, and while energy consumption is one way to be more green, another is to figure out ways to harness the heat the computers generate. One trend is to put computers in places that need to be heated up, like green houses in the winter. Another idea is to put them in places where cool air is exhausted, like building ventilation ducks. What can you do? Turn your computer off when it is not in use since the computer electronics and such are not as sensitive to power on, power off cycles as they used to be.

“Digital Humanities”

There seems to be a growing number of humanities scholars who understand that computers can be applied to their research. See the Digital Humanities Manifesto as an example. With the advent of all the electronic texts being made available, it is not possible to read each and every text individually. In an effort to analyze large copra more quickly, people can create word clouds against these documents to summarize them. They can extract the statistically significant words and phrases to determine their “aboutness”. They can easily compute Fog, Flesch, and Flesch-Kincaid scores denoting the complexity of documents. (”Remember, ‘Why Johnny can’t read’?”) These people understand that humanities scholarship is not necessarily done in isolation, and the codex is not necessarily the medium of the day. They understand the advantages of open access publishing. For our profession, it is difficult to overstate the number of opportunities this trend affords librarianship. Anybody can find information. What people need now are tools to make information easier to analyze and use.

Tweeting with Twitter

Microblogging (think Twitter) is definitely hot. In some situations it can be a really useful application of computer technology. Frankly, I think the fascination will wear off and its functionality will become similar to the use of cellphone photographs at news-breaking events. Tweet, tweet, tweet.

Discovery interfaces and mega-indexes

If I were to pick the hottest trend in library technology, it would be fledgling implementation of large, all-encompassing indexes of journal and book content — integrating mega-indexes into the “discovery” interface. This is exemplified by Serials Solutions’ Summa, hinted at by an OCLC/EBSCO collaboration, and thought about by other library vendors. Google Scholar comes close but could benefit by adding more complete bibliographic data of books. OAIster worked for OAI-accessible content but needed to be indexed with a less proprietary tool. The folks at Index Data created something similar and included additional content, but the idea never seemed to catch on. Federated (broadcast) search tried and has yet to fulfill the promise. The driver behind this idea is the knowledge that many data silos don’t meet the needs of our users. Instead people want one box, one button, and one data set. Combine journal bibliographic data with book bibliographic data into a single index (not database). Sort search results by relevance. Provide a set of time-saving services against the result. In order for this technological technique to work each data set must be normalized into a single data structure and indexed (probably with an open source indexer called Lucene). In other words, there will be a large set of core elements such a title, author, note, subject, etc. All bibliographic data from all sets will be mapped to these fields and what doesn’t fall neatly into any one of them will be mapped to free text fields. Not perfect, not 100 percent, but hugely functional, and it meets user’s expectations. To see how this can be done with the volumes and volumes of medically-related open access content see the good work done by OpenPHI and their HealthLibrarian.

* This posting was originally “published” as a part of litablog.org, and it is duplicated here because many copies keep stuff safe.

Text mining: Books and Perl modules

Wednesday, June 3rd, 2009

This posting simply lists some of the books I’ve read and Perl modules I’ve explored in regards to the field of text mining.

Through my explorations of term frequency/inverse document frequency (TFIDF) I became aware of a relatively new field of study called text mining. In many ways, text mining is similar to data mining only applied to unstructured texts instead of database rows and columns. Think plain text books such as items from Project Gutenberg or the Open Content Alliance. Text mining is a process including automatic classification, clustering (similar but distinct from classification), indexing and searching, entity extraction (names, places, organization, dates, etc.), statistically significant keyword and phrase extraction, parts of speech tagging, and summarization.

As a librarian, I found the whole thing extremely fascinating, consequently I read more.

Books

I have found the following four books helpful. They have enabled me to learn about the principles of text mining.

  • Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. – Of all the books listed here, this one includes the most Perl programming examples, and it is not as scholarly as the balance of the list. Much of the book surrounds the description of regular expressions against texts. Its strongest suit is the creation of terminal-based concordance scripts. Very nice. Lot’s of fun. The concordances return very interesting results. The book does describe clustering techniques too, but the on the overall topic of automatic metadata generation the book is not very strong.
  • Konchady, M. (2006). Text mining application programming. Charles River Media programming series. Boston, Mass: Charles River Media. – This book is a readable survey of text mining covering parts of speech (POS) tagging, information extraction, search engines, clustering, classification, summarization, and question/answer processing. Many models for each aspect of text mining are described, compared, and contrasted. To put the author’s knowledge into practice, the book comes with a CD containing a Perl library for text mining, sample applications, and CGI scripts. This library is freely available on the Web.
  • Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. – Of the four books listed here, this one is probably the most dense. I found its Perl scripts used to parse text more useful than the ones in Bilisoly, but this one included no concordance applications. I also found the description of n-grams to be very interesting — the extraction of multi-word phrases. I suspect the model they describe can be extended to n number of words. This book also discusses parts of speech (POS) processing but it is the only one that describes how to really parse language. Think semantics, lexicons, discourse, and dialog. After the first couple of chapters the Perl examples disappear and give way to exclusively Prologue examples.
  • Weiss, S. M. (2005). Text mining: Predictive methods for analyzing unstructured information. New York: Springer. – The complexity of this book lies between Konchady and Nugues; it includes a greater number of mathematical models than Konchady, but it is easier to read than Nugues. Broad topics include textual documents as numeric vectors, using text for prediction, information retrieval, clustering & classification, and looking for information in documents. Each chapter includes a section called “Historical and Bibliographical Remarks” which has proved to be very interesting reading.

When it comes to the process of text mining I found each of these books useful in their own right. Each provided me with ways to reading texts, parsing texts, counting words, counting phrases, and through the application of statistical analysis create lists and readable summaries denoting the “aboutness” of given documents.

Perl modules

As a Perl hacker I am interested in writing scripts putting into practice some of the things I learn. Listed here are a number of modules that have gotten me further along in regard to text mining:

  • Lingua::EN::Fathom – This library outputs interesting statistics regarding a given document: number of words and the number of times each occurs, number of sentences, complexity of words, number of paragraphs, etc. Of greatest interest are numbers (Fog, Flesch, and Flesch-Kincaid) denoting the readability of the text. Quick. Easy. Useful.
  • Lingua::EN::Keywords – Given a text, this library outputs a list of what it thinks are the most significant individual words in a document, sans stop words. Not fancy.
  • Lingua::EN::NamedEntity – Given a text, I believe this library comes pre-trained to extract names, places, and organizations from texts. It returns a Perl data structure listing the probabilities of a word or phrase being any particular entity. It may need to be re-trained to work for your corpus.
  • Lingua::EN::Semtags::Engine – Given text this module will return words and phrases in a relevancy ranked order. Initially, I have had some problems using this module because it seems to take a long time to return. On the other hand, it looks promising since it returns both individual words as well as phrases.
  • Lingua::EN::Summarize – Given a text this library returns sentences it thinks encapsulates the essence of the document. The result is readable — grammatically correct. The process it uses to accomplish its task is self-proclaimed as unscientific.
  • Lingua::EN::Tagger – This library marks up a document in pseudo XML with tags denoting parts of speech in a given document. To do this work it also can extract words, noun phrases, and sentences from a text. Zippy. Probability-based. Developers are expected to parse the tagged output and do analysis against it, such as count the number of times particular parts of speech occur.
  • Lingua::StopWords – Returns a simple list of stop words. Easy, but I can’t figure out how customizable it is. “One person’s stop word list is another person research topic.”
  • Net::Dict – A network interface to DICT (dictionary) servers. While the DICT protocol is a bit long in the tooth, and not quite as cool as Web interfaces to things like Google or Wikipedia, this module does provide a handy way to look up definitions, a complimentary functionality to WordNet.
  • Text::Aspell – A Perl interface to GNU Aspell which is great for spell-checking applications.
  • TextMine – This is a set of modules written by Manu Konchady the author of Text Mining Application Programming. It includes submodules named Cluster, Entity, Index, Pos, Quanda (Q & A), Summary, Tokens, and WordNet. While this set of modules is the most comprehensive I’ve seen, and while they are probably the most theoretically based interfacing with things like WordNet to be thorough, my initial experience has been a bit frustrating since scripts written against the libraries do not turn very quickly. Maybe I’m feeding them documents that are too large and if so, then the libraries are not necessarily scalable.
  • WordNet – There are a bevy of modules providing functionality against WordNet — a “lexical database of English… Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.” Any truly thorough text mining application of English will take advantage of WordNet.

Text mining and librarianship

Given the volume of “born digital” material being created, it is not possible to apply traditional library methods against them. The hand-crafted, heavy human touch process is not scalable. Given the amounts of mass digitized text being generated from the Google Books Project and the Open Content Alliance, new opportunities for literary analysis make themselves evident. Again, the traditional library processes can not fill the bill in these regards.

Text mining techniques offer possible solutions to these problems. Count words. Count phrases. Compare these words, phrases, and counts to other texts. Determine their statistical significance. Assign them to documents in the form of subject headings, keywords, author names, and other added entries in our metadata formats. Given large numbers of books, articles, and other “wordy” documents, learn how to “save the time of the reader” by summarizing these documents and ranking them in some sort of order in addition to alphabetical or date. Compare and contrast full text works by learning what words and types of words are used in documents. Are the words religious in nature? Mathematic and scientific? Poetic? Such things will provide additional means for understanding and interpreting everything from scholarly journal articles to works of classic fiction and philosophy. These techniques are not intended to replace existing methods of understanding and organization, but rather to supplement and build upon them. This is an evolutionary process.`

If libraries and librarians desire to remain relevant in the evolving information environment, then they will need to do the good work they do differently. The problem to be solved now-a-days is less about access and more about use. Text mining is one way of making the content of libraries more useful.

Interent Archive content in “discovery” systems

Tuesday, June 2nd, 2009

This quick posting describes how Internet Archive content, specifically, content from the Open Content Alliance can be quickly and easily incorporated into local library “discovery” systems. VuFind is used here as the particular example:

  1. Get keys – The first step is to get a set of keys describing the content you desire. This can be acquired through the Internet Archive’s advanced search interface.
  2. Convert keys – The next step is to convert the keys into sets of URLs pointing to the content you want to download. Fortunately, all the URLs have a similar shape: http://www.archive.org/download/KEY/KEY.pdf, http://www.archive.org/download/KEY/KEY_meta.mrc, or http://www.archive.org/download/KEY/KEY__djvu.txt.
  3. Download – Feed the resulting URLs to your favorite spidering/mirroring application. I use wget.
  4. Update – Enhance the downloaded MARC records with 856$u valued denoting the location of your local PDF copy as well as the original (cononical) version.
  5. Index – Add the resulting MARC records to your “discovery” system.

Linked here is a small distribution of shell and Perl scripts that do this work for me and incorporate the content into VuFind. Here is how they can be used:

  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' /
  -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart

Cool next steps would be use text mining techniques against the downloaded plain text versions of the documents to create summaries, extract named entities, and identify possible subjects. These items could then be inserted into the MARC records to enhance retrieval. Ideally the full text would be indexed, but alas, MARC does not accomodate that. “MARC must die.”

TFIDF In Libraries: Part III of III (For thinkers)

Sunday, May 31st, 2009

This is the third of the three-part series on the topic of TFIDF in libraries. In Part I the why’s and wherefore’s of TFIDF were outlined. In Part II TFIDF subroutines and programs written in Perl were used to demonstrate how search results can be sorted by relevance and automatic classification can be done. In this last part a few more subroutines and a couple more programs are presented which: 1) weigh search results given an underlying set of themes, and 2) determine similarity between files in a corpus. A distribution including the library of subroutines, Perl scripts, and sample data are available online.

Big Names and Great Ideas

As an intellectual humanist, I have always been interested in “great” ideas. In fact, one of the reasons I became I librarian was because of the profundity of ideas physically located libraries. Manifested in books, libraries are chock full of ideas. Truth. Beauty. Love. Courage. Art. Science. Justice. Etc. As the same time, it is important to understand that books are not source of ideas, nor are they the true source of data, information, knowledge, or wisdom. Instead, people are the real sources of these things. Consequently, I have also always been interested in “big names” too. Plato. Aristotle. Shakespeare. Milton. Newton. Copernicus. And so on.

As a librarian and a liberal artist (all puns intended) I recognize many of these “big names” and “great ideas” are represented in a set of books called the Great Books of the Western World. I then ask myself, “Is there someway I can use my skills as a librarian to help support other people’s understanding and perception of the human condition?” The simple answer is to collection, organize, preserve, and disseminate the things — books — manifesting great ideas and big names. This is a lot what my Alex Catalogue of Electronic Texts is all about. On the other hand, a better answer to my question is to apply and exploit the tools and processes of librarianship to ultimately “save the time of the reader”. This is where the use of computers, computer technology, and TFIDF come into play.

Part II of this series demonstrated how to weigh search results based on the relevancy ranked score of a search term. But what if you were keenly interested in “big names” and “great ideas” as they related to a search term? What if you wanted to know about librarianship and how it related to some of these themes? What if you wanted to learn about the essence of sculpture and how it may (or may not) represent some of the core concepts of Western civilization? To answer such questions a person would have to search for terms like sculpture or three-dimensional works of art in addition to all the words representing the “big names” and “great ideas”. Such a process would be laborious to enter by hand, but trivial with the use of a computer.

Here’s a potential solution. Create a list of “big names” and “great ideas” by copying them from a place such as the Great Books of the Western World. Save the list much like you would save a stop word list. Allow a person to do a search. Calculate the relevancy ranking score for each search result. Loop through the list of names and ideas searching for each of them. Calculate their relevancey. Sum the weight of search terms with the weight of name/ideas terms. Return the weighted list. The result will be a relevancy ranked list reflecting not only the value of the search term but also the values of the names/ideas. This second set of values I call the Great Ideas Coefficient.

To implement this idea, the following subroutine, called great_ideas, was created. Given an index, a list of files, and a set of ideas, it loops through each file calculating the TFIDF score for each name/idea:

  sub great_ideas {

    my $index = shift;
    my $files = shift;
    my $ideas = shift;

    my %coefficients = ();

    # process each file
    foreach $file ( @$files ) {

      my $words = $$index{ $file };
      my $coefficient = 0;

      # process each big idea
      foreach my $idea ( keys %$ideas ) {

        # get n and t for tdidf
        my $n = $$words{ $idea };
        my $t = 0;
        foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

          # calculate; sum all tfidf scores for all ideas
          $coefficient = $coefficient + &tfidf( $n, $t, keys %$index, scalar @$files );

        }

      # assign the coefficient to the file
      $coefficients{ $file } = $coefficient;

    }

    return \%coefficients;

  }

A Perl script, ideas.pl, was then written taking advantage of the great_ideas subroutine. As described above, it applies the query to an index, calculates TFIDF for the search terms as well as the names/ideas, sums the results, and lists the results accordingly:

  # define
  use constant STOPWORDS => 'stopwords.inc';
  use constant IDEAS     => 'ideas.inc';

  # use/require
  use strict;
  require 'subroutines.pl';

  # get the input
  my $q = lc( $ARGV[ 0 ] );

  # index, sans stopwords
  my %index = ();
  foreach my $file ( &corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }

  # search
  my ( $hits, @files ) = &search( \%index, $q );
  print "Your search found $hits hit(s)\n";

  # rank
  my $ranks = &rank( \%index, [ @files ], $q );

  # calculate great idea coefficients
  my $coefficients = &great_ideas( \%index, [ @files ], &slurp_words( IDEAS ) );

  # combine ranks and coefficients
  my %scores = ();
  foreach ( keys %$ranks ) { $scores{ $_ } = $$ranks{ $_ } + $$coefficients{ $_ } }

  # sort by score and display
  foreach ( sort { $scores{ $b } <=> $scores{ $a } } keys %scores ) {

    print "\t", $scores{ $_ }, "\t", $_, "\n"

  }

Using the query tool described in Part II, a search for librarianship returns the following results:

  $ ./search.pl books
  Your search found 3 hit(s)
    0.00206045818083232   librarianship.txt
    0.000300606222548807  mississippi.txt
    5.91505974210339e-05  hegel.txt

Using the new program, ideas.pl, the same set of results are returned but in a different order, an order reflecting the existence of “big ideas” and “great ideas” in the texts:

  $ ./ideas.pl books
  Your search found 3 hit(s)
    0.101886904057731   hegel.txt
    0.0420767249559441  librarianship.txt
    0.0279062776599476  mississippi.txt

When it comes to books and “great” ideas, maybe I’d rather read hegel.txt as opposed to librarianship.txt. Hmmm…

Think of the great_ideas subroutine as embodying the opposite functionality as a stop word list. Instead of excluding the words in a given list from search results, use the words to skew search results in a particular direction.

The beauty of the the great_ideas subroutine is that anybody can create their own set of “big names” or “great ideas”. They could be from any topic. Biology. Mathematics. A particular subset of literature. Just as different sets of stop words are used in different domains, so can the application of a Great Ideas Coefficient.

Similarity between documents

TFIDF can be applied to the problem of finding more documents like this one.

The process of finding more documents like this is perennial. The problem is addressed in the field of traditional librarianship through the application of controlled vocabulary terms, author/title authority lists, the collocation of physical materials through the use of classification numbers, and bibliographic instruction as well as information literacy classes.

In the field of information retrieval, the problem is addressed through the application of mathematics. More specifically but simply stated, by plotting the TFIDF scores of two or more terms from a set of documents on a Cartesian plane it is possible to calculate the similarity between said documents by comparing the angle and length of the resulting vectors — a measure called “cosine similarity”. By extending the process to any number of documents and any number of dimensions it is relatively easy to find more documents like this one.

Suppose we have two documents: A and B. Suppose each document contains many words but those words were only science and art. Furthermore, suppose document A contains the word science 9 times and the word art 10 times. Given these values, we can plot the relationship between science and art on a graph, below. Document B can be plotted similarly supposing science occurs 6 times and the word art occurs 14 times. The resulting lines, beginning at the graph’s origin (O) to their end-points (A and B), are called “vectors” and they represent our documents on a Cartesian plane:

  s    |
  c  9 |         * A
  i    |        *
  e    |       *
  n  6 |      *      * B
  c    |     *     *
  e    |    *    *
       |   *   *
       |  *  *
       | * *
       O-----------------------
                10   14

                  art

  Documents A and B represented as vectors

If the lines OA and OB were on top of each other and had the same length, then the documents would be considered equal — exactly similar. In other words, the smaller the angle AOB is as well as the smaller the difference between the length lines OA and OB the more likely the given documents are the same. Conversely, the greater the angle of AOB and the greater the difference of the lengths of lines OA and OB the more unlike the two documents.

This comparison is literally expressed as the inner (dot) product of the vectors divided by the product of the Euclidian magnitudes of the vectors. Mathematically, it is stated in the following form and is called “cosine similarity”:

( ( A.B ) / ( ||A|| * ||B|| ) )

Cosine similarity will return a value between 0 and 1. The closer the result is to 1 the more similar the vectors (documents) compare.

Most cosine similarity applications apply the comparison to every word in a document. Consequently each vector has a large number of dimensions making calculations time consuming. For the purposes of this series, I am only interested in the “big names” and “great ideas”, and since The Great Books of the Western World includes about 150 of such terms, the application of cosine similarity is simplified.

To implement cosine similarity in Perl three additional subroutines needed to be written. One to calculate the inner (dot) product of two vectors. Another was needed to calculate the Euclidian length of a vector. These subroutines are listed below:

  sub dot {

    # dot product = (a1*b1 + a2*b2 ... ) where a and b are equally sized arrays (vectors)
    my $a = shift;
    my $b = shift;
    my $d = 0;
    for ( my $i = 0; $i <= $#$a; $i++ ) { $d = $d + ( $$a[ $i ] * $$b[ $i ] ) }
    return $d;

  }

  sub euclidian {

    # Euclidian length = sqrt( a1^2 + a2^2 ... ) where a is an array (vector)
    my $a = shift;
    my $e = 0;
    for ( my $i = 0; $i <= $#$a; $i++ ) { $e = $e + ( $$a[ $i ] * $$a[ $i ] ) }
    return sqrt( $e );

  }

The subroutine that does the actual comparison is listed below. Given a reference to an array of two books, stop words, and ideas, it indexes each book sans stop words, searches each book for a great idea, uses the resulting TFIDF score to build the vectors, and computes similarity:

  sub compare {

    my $books     = shift;
    my $stopwords = shift;
    my $ideas     = shift;

    my %index = ();
    my @a     = ();
    my @b     = ();

    # index
    foreach my $book ( @$books ) { $index{ $book } = &index( $book, $stopwords ) }

    # process each idea
    foreach my $idea ( sort( keys( %$ideas ))) {

      # search
      my ( $hits, @files ) = &search( \%index, $idea );

      # rank
      my $ranks = &rank( \%index, [ @files ], $idea );

      # build vectors, a & b
      my $index = 0;
      foreach my $file ( @$books ) {

        if    ( $index == 0 ) { push @a, $$ranks{ $file }}
        elsif ( $index == 1 ) { push @b, $$ranks{ $file }}
        $index++;

        }

      }

      # compare; scores closer to 1 approach similarity
      return ( cos( &dot( [ @a ], [ @b ] ) / ( &euclidian( [ @a ] ) * &euclidian( [ @b ] ))));

  }

Finally, a script, compare.pl, was written glueing the whole thing together. It’s heart is listed here:

  # compare each document...
  for ( my $a = 0; $a <= $#corpus; $a++ ) {

    print "\td", $a + 1;

    # ...to every other document
    for ( my $b = 0; $b <= $#corpus; $b++ ) {

      # avoid redundant comparisons
      if ( $b <= $a ) { print "\t - " }

      # process next two documents
      else {

        # (re-)initialize
        my @books = sort( $corpus[ $a ], $corpus[ $b ] );

        # do the work; scores closer to 1000 approach similarity
        print "\t", int(( &compare( [ @books ], $stopwords, $ideas )) * 1000 );

      }

    }

    # next line
    print "\n";

  }

In a nutshell, compare.pl loops through each document in a corpus and compares it to every other document in the corpus while skipping duplicate comparisons. Remember, only the dimensions representing “big names” and “great ideas” are calculated. Finally, it displays a similarity score for each pair of documents. Scores are multiplied by 1000 to make them easier to read. Given the sample data from the distribution, the following matrix is produced:

  $ ./compare.pl
    Comparison: scores closer to 1000 approach similarity

        d1   d2   d3   d4   d5   d6

    d1   -  922  896  858  857  948
    d2   -   -   887  969  944  971
    d3   -   -    -   951  954  964
    d4   -   -    -    -   768  905
    d5   -   -    -    -    -   933
    d6   -   -    -    -    -    - 

    d1 = aristotle.txt
    d2 = hegel.txt
    d3 = kant.txt
    d4 = librarianship.txt
    d5 = mississippi.txt
    d6 = plato.txt

From the matrix is it obvious that documents d2 (hegel.txt) and d6 (plato.txt) are the most similar since their score is the closest to 1000. This means the vectors representing these documents are closer to congruency than the other documents. Notice how all the documents are very close to 1000. This makes sense since all of the documents come from the Alex Catalogue and the Alex Catalogue documents are selected because of the “great idea-ness”. The documents should be similar. Notice which documents are the least similar: d4 (librarianship.txt) and d5 (mississippi.txt). The first is a history of librarianship. The second is a novel called Life on the Mississippi. Intuitively, we would expect this to be true; neither one of these documents are the topic of “great ideas”.

(Argg! Something is incorrect with my trigonometry. When I duplicate a document and run compare.pl the resulting cosine similarity value between the exact same documents is 540, not 1000. What am I doing wrong?)

Summary

This last part in the series demonstrated ways term frequency/inverse document frequency (TFIDF) can be applied to over-arching (or underlying) themes in a corpus of documents, specifically the “big names” and “great ideas” of Western civilization. It also demonstrated how TFIDF scores can be used to create vectors representing documents. These vectors can then be compared for similarity, and, by extension, the documents they represent can be compared for similarity.

The purpose of the entire series was to bring to light and take the magic out of a typical relevancy ranking algorithm. A distribution including all the source code and sample documents is available online. Use the distribution as a learning tool for your own explorations.

As alluded to previously, TFIDF is like any good folk song. It has many variations and applications. TFIDF is also like milled grain because it is a fundemental ingredient to many recipes. Some of these recipies are for bread, but some of them are for pies or just thickener. Librarians and libraries need to incorporate more mathematical methods into their processes. There needs to be a stronger marriage between the social characteristics of librarianship and the logic of mathematics. (Think arscience.) The application of TFIDF in libraries is just one example.

The decline of books

Friday, May 8th, 2009

[This posting is in response to a tiny thread on the NGC4Lib mailing list about the decline of books. --ELM]

Yes, books are on the decline, but in order to keep this trend in perspective it is important to not confuse the medium with the message. The issue is not necessarily about books as much as it is about the stuff inside the books.

Books — codexes — are a particular type of technology. Print words and pictures on leaves of paper. Number the pages. Add an outline of the book’s contents — a table of contents. Make the book somewhat searchable by adding an index. Wrap the whole thing between a couple of boards. The result is a thing that is portable, durable, long- lasting, and relatively free-standing as well as independent of other technology. But all of this is really a transport medium, a container for the content.

Consider the content of books. Upon close examination it is a recorded manifestation of humanity. Books — just like the Web — are a reflection of humankind because just anything you can think of can be manifested in printed form. Birth. Growth. Love. Marriage. Aging. Death. Poetry. Prose. Mathematics. Astronomy. Business. Instructions. Facts. Directories. Gardening. Theses and dissertations. News. White papers. Plans. History. Descriptions. Dreams. Weather. Stock quotes. The price of gold. Things for sale. Stories both real and fictional. Etc. Etc. Etc.

Consider the length of time humankind has been recording things in written form. Maybe five thousand years. What were the mediums used? Stone and clay tablets? Papyrus scrolls. Vellum. Paper. To what extent did people bemoan the death of clay tablets? To what extent did they bemoan the movement from scrolls to codexes? Probably the cultures who valued verbal traditions as opposed to written traditions (think of the American Indians) had more to complain about than the migration from one written from to another. The medium is not as important as the message.

Different types of content lend themselves to different mediums. Music can be communicated via the written score, but music is really intended to be experienced through hearing. Sculpture is, by definition, a three-dimensional medium, yet we take photographs of it, a two-dimensional medium. The poetry and prose lend themselves very well to the written word, but they can be seen as forms of storytelling, and while there are many advantages to stories being written down, there are disadvantages as well. No sound effects. Where to put the emphasis on phrases? Hand gestures to communicate subtle distinctions are lost. It is for all of these reasons that libraries (and museums and archives) also collect the mediums that better represent this content. Paintings. Sound recordings. Artifacts. CDs and DVDs.

The containers of information will continue to change, but I assert that the content will not. The content will continue to be a reflection of humankind. It will represent all of the things that it means to be men, woman, and children. It will continue to be an exposition of our collective thoughts, feelings, beliefs, and experiences.

Libraries and other “cultural heritage institutions” do not have and never did have a monopoly on recorded content, but now, more than ever, and as we have moved away from an industrial-based economy to a more service-based economy whose communication channels are electronic and global, the delivery of recorded content, in whatever form, is more profitable. Consequently there is more competition. Libraries need to get a grip on what they are all about. If it is about the medium — books, CDs, articles — then the future is grim. If it is about content and making that content useful to their clientele, then the opportunities are wide open. Shifting a person’s focus from the how to the what is challenging. Looking at the forest from the trees is sometimes overwhelming. Anybody can get information these days. We are still drinking from the proverbial fire hose. The problem to be solved is less about discovery and more about use. It is about placing content in context. Providing a means to understanding it, manipulating it, and using it to solve the problems revolving around what it means to be human.

We are a set of educated people. If we put our collective minds to the problem, then I sincerely believe libraries can and will remain relevant. In fact, that is why I instituted this [the NGC4Lib] mailing list.

Code4Lib Software Award: Loose ends

Monday, April 27th, 2009

Loose ends make me feel uncomfortable, and one of the loose ends in my professional life is the Code4Lib Software Award.

Code4Lib began as a mailing list in 2003 and has grown to about 1,200 subscribers from all over the world. New people subscribe to the list almost daily. Its Web presence started up in 2005. Our conferences have been stimulating, informative, and productive for all three years of their existence. Our latest venture — the journal — records, documents, and shares the practical experience of our community. Underlying all of this is an IRC channel where answers to library-related computer problems can be answered in real-time. Heck, there even exists three for four Code4Lib “franchises”. In sum, by exploiting both traditional and less traditional mediums the Code4Lib Community has grown and matured quickly over the past five years. In doing so it has provided valuable and long-lasting services to itself as well as the greater library profession.

It is for the reasons outlined above that I believe our community is ripe for an award. Good things happen in Code4Lib. These things begin with individuals, and I believe the good code written by these individuals ought to be formally recognized. Unfortunately, ever since I put forward the idea, I have heard more negative things than positive. To paraphrase, “It would be seen as an endorsement, and we don’t endorse… It would turn out to be just a popularity contest… There are so many characteristics of good software that any decision would seem arbitrary.”

Apparently the place for an award is not as obvious to others as it is to me. Apparently our community is not as ready for an award as I thought we were. That is why, for the time being, I am withdrawing my offer to sponsor one. Considering who I am, I simply don’t have the political wherewithal to make the award a reality, but I do predict there will be an award at some time, just not right now. The idea needs to ferment for a while longer.

TFIDF In Libraries: Part II of III (For programmers)

Monday, April 20th, 2009

This is the second of a three-part series called TFIDF In Libraries, where relevancy ranking techniques are explored through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techiques will be explored to the end of filtering search results or addressing the perennial task of “finding more documents like this one.” In the end it is the hoped to demonstrate that relevancy ranking is not magic nor mysterious but rather the process of applying statistical techiques to textual objects.

TFIDF, again

As described in Part I, term frequency/inverse document frequency (TFIDF) is a process of counting words in a document as well as throughout a corpus of documents to the end of sorting documents in statistically relevent ways.

Term frequency (TF) is essencially a percentage denoting the number of times a word appears in a document. It is mathematically expressed as C / T, where C is the number of times a word appears in a document and T is the total number of words in the same document.

Inverse document frequency (IDF) takes into acount that many words occur many times in many documents. Stop words and the word “human” in the MEDLINE database are very good examples. IDF is mathematically expressed as D / DF, where D is the total number of documents in a corpus and DF is the number of document in which a particular word is found. As D / DF increases so does the significance of the given word.

Given these two factors, TFIDF is literally the product of TF and IDF:

TFIDF = ( C / T ) * ( D / DF )

This is the basic form that has been used to denote relevance ranking for more than forty years, and please take note that it requires no advanced mathematical knowledge — basic arithmatic.

Like any good recipe or folk song, TFIDF has many variations. Google, for example, adds additional factors into their weighting scheme based on the popularity of documents. Other possibilities could include factors denoting the characteristics of the person using the texts. In order to accomodate for the wide variety of document sizes, the natural log of IDF will be employed throughout the balance of this demonstration. Therefore, for the purposes used here, TFIDF will be defined thus:

TFIDF = ( C / T ) * log( D / DF )

Simple Perl subroutines

In order to put theory into practice, I wrote a number of Perl subroutines implementing various aspects of relevancy ranking techniques. I then wrote a number of scripts exploiting the subroutines, essencially wrapping them in a user interface.

Two of the routines are trivial and will not be explained in any greater detail than below:

  • corpus – Returns an array of all the .txt files in the current directory, and is used to denote the library of content to be analyzed.
  • slurp_words – Returns a reference to a hash of all the words in a file, specifically for the purposes of implementing a stop word list.

Two more of the routines are used to support indexing and searching the corpus. Again, since neither is the focus of this posting, each will only be outlined:

  • index – Given a file name and a list of stop words, this routine returns a reference to a hash containing all of the words in the file (san stop words) as well as the number of times each word occurs. Strictly speaking, this hash is not an index but it serves our given purpose adequately.
  • search – Given an “index” and a query, this routine returns the number of times the query was found in the index as well as an array of files listing where the term was found. Search is limited. It only supports single-term queries, and there are no fields for limiting.

The heart of the library of subroutines is used to calculate TFIDF, ranks search results, and classify documents. Of course the TFIDF calculation is absolutely necessary, but ironically, it is the most straight-forward routine in the collection. Given values for C, T, D, and DF it returns decimal between 0 and 1. Trivial:

  # calculate tfidf
  sub tfidf {

    my $n = shift;  # C
    my $t = shift;  # T
    my $d = shift;  # D
    my $h = shift;  # DF

    my $tfidf = 0;

    if ( $d == $h ) { $tfidf = ( $n / $t ) }
    else { $tfidf = ( $n / $t ) * log( $d / $h ) }

    return $tfidf;

  }

Many readers will probably be most interested in the rank routine. Given an index, a list of files, and a query, this code calculates TFIDF for each file and returns the results as a reference to a hash. It does this by repeatedly calculating the values for C, T, D, and DF for each of the files and calling tfidf:

  # assign a rank to a given file for a given query
  sub rank {

    my $index = shift;
    my $files = shift;
    my $query = shift;

    my %ranks = ();

    foreach my $file ( @$files ) {

      # calculate n
      my $words = $$index{ $file };
      my $n = $$words{ $query };

      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

      # assign tfidf to file
      $ranks{ $file } = &tfidf( $n, $t, keys %$index, scalar @$files );

    }

    return \%ranks;

  }

The classify routine is an added bonus. Given the index, a file, and the corpus of files, this function calculates TFIDF for each word in the file and returns a refernece to a hash containing each word and its TFIDF value. In other words, instead of calculating TFIDF for a given query in a subset of documents, it calculates TFIDF for each word in an entire corpus. This proves useful in regards to automatic classification. Like rank, it repeatedly determines values for C, T, D, and DF and calls tfidf:

  # rank each word in a given document compared to a corpus
  sub classify {

    my $index  = shift;
    my $file   = shift;
    my $corpus = shift;

    my %tags = ();

    foreach my $words ( $$index{ $file } ) {

      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }

      foreach my $word ( keys %$words ) {

        # get n
        my $n = $$words{ $word };

        # calculate h
        my ( $h, @files ) = &search( $index, $word );

        # assign tfidf to word
        $tags{ $word } = &tfidf( $n, $t, scalar @$corpus, $h );

      }

    }

    return \%tags;

  }

Search.pl

Two simple Perl scripts are presented, below, taking advantage of the routines described, above. The first is search.pl. Given a single term as input this script indexes the .txt files in the current directory, searches them for the term, assigns TFIDF to each of the results, and displays the results in a relevancy ranked order. The essencial aspects of the script are listed here:

  # define
  use constant STOPWORDS => 'stopwords.inc';

  # include
  require 'subroutines.pl';

  # get the query
  my $q = lc( $ARGV[ 0 ] );

  # index
  my %index = ();
  foreach my $file ( &corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }

  # search
  my ( $hits, @files ) = &search( \%index, $q );
  print "Your search found $hits hit(s)\n";

  # rank
  my $ranks = &rank( \%index, [ @files ], $q );

  # sort by rank and display
  foreach my $file ( sort { $$ranks{ $b } <=> $$ranks{ $a } } keys %$ranks ) {

    print "\t", $$ranks{ $file }, "\t", $file, "\n"

  }

  # done
  print "\n";
  exit;

Output from the script looks something like this:

  $ ./search.pl knowledge
  Your search found 6 hit(s)
    0.0193061840120664    plato.txt
    0.00558586078987563   kant.txt
    0.00299602568022012   aristotle.txt
    0.0010031177985631    librarianship.txt
    0.00059150597421034   hegel.txt
    0.000150303111274403  mississippi.txt

From these results you can see that the document named plato.txt is the most relevent because it has the highest score, in fact, it is almost four times more relevant than the second hit, kant.txt. For extra credit, ask yourself, “At what point do the scores become useless, or when do the scores tell you there is nothing of significance here?”

Classify.pl

As alluded to in Part I of this series, TFIDF can be turned on its head to do automatic classification. Weigh each term in a corpus of documents, and list the most significant words for a given document. Classify.pl does this by denoting a lower bounds for TFIDF scores, indexing an entire corpus, weighing each term, and outputing all the terms whose scores are greater than the lower bounds. If no terms are greater than the lower bounds, then it lists the top N scores as defined by a configuration. The essencial aspects of classify.pl are listed below:

  # define
  use constant STOPWORDS    => 'stopwords.inc';
  use constant LOWERBOUNDS  => .02;
  use constant NUMBEROFTAGS => 5;

  # require
  require 'subroutines.pl';

  # initialize
  my @corpus = &corpus;

  # index
  my %index = ();
  foreach my $file (@corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }

  # classify each document
  foreach my $file ( @corpus ) {

    print $file, "\n";

    # list tags greater than a given score
    my $tags  = &classify( \%index, $file, [ @corpus ] );
    my $found = 0;
    foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {

      if ( $$tags{ $tag } > LOWERBOUNDS ) {

        print "\t", $$tags{ $tag }, "\t$tag\n";
        $found = 1;

      }

      else { last }

    }

    # accomodate tags with low scores
    if ( ! $found ) {

      my $n = 0;
      foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {

        print "\t", $$tags{ $tag }, "\t$tag\n";
        $n++;
        last if ( $n == NUMBEROFTAGS );

      }

    }

    print "\n";

  }

  # done
  exit;

For example, sample, yet truncated, output from classify.pl looks like this:

  aristotle.txt
    0.0180678691531642  being
    0.0112840859266579  substances
    0.0110363803118312  number
    0.0106083766432284  matter
    0.0098440843778661  sense

  mississippi.txt
    0.00499714142455761  mississippi
    0.00429324597184886  boat
    0.00418922035591656  orleans
    0.00374087743616293  day
    0.00333830388445574  river

Thus, assuming a lower TFIDF bounds of 0.02, the words being, substance, number, matter, and sense are the most significant in the document named aristotle.txt. But since none of the words in mississippi.txt have a score that high the top five words are returned instead. For more extra credit, think of ways classify.pl can be improved by answering, “How can the output be mapped to controlled vocabulary terms or expanded through the use of some other thesarus?”

Summary

The Perl subroutines and scripts described here implement TFIDF to do rudimentary ranking of search results and automatic classification. They are not designed to be production applications, just example tools for the purposes of learning. Turning the ideas implemented in these scripts into production applications have been the fodder for many people’s careers and entire branches of computer science.

You can download the scripts, subroutines, and sample data in order for you to learn more. You are encouraged to remove the .txt files from the distribution and replace them with your own data. I think your search results and automatic classification output will confirm in your mind that TFIDF is well-worth the time and effort of the library community. Given the amounts of full text books and journal articles freely available on the Internet, it behooves the library profession to learn to exploit these concepts because our traditional practices simply: 1) do not scale, or 2) do not meet with our user’s expectations. Furthermore, farming these sorts of solutions out to vendors is irresponsible.