Archive for April, 2009

Code4Lib Software Award: Loose ends

Monday, April 27th, 2009

Loose ends make me feel uncomfortable, and one of the loose ends in my professional life is the Code4Lib Software Award.

Code4Lib began as a mailing list in 2003 and has grown to about 1,200 subscribers from all over the world. New people subscribe to the list almost daily. Its Web presence started up in 2005. Our conferences have been stimulating, informative, and productive for all three years of their existence. Our latest venture — the journal — records, documents, and shares the practical experience of our community. Underlying all of this is an IRC channel where library-related computer questions can be answered in real-time. Heck, there even exist three or four Code4Lib “franchises”. In sum, by exploiting both traditional and less traditional media the Code4Lib Community has grown and matured quickly over the past five years. In doing so it has provided valuable and long-lasting services to itself as well as the greater library profession.

It is for the reasons outlined above that I believe our community is ripe for an award. Good things happen in Code4Lib. These things begin with individuals, and I believe the good code written by these individuals ought to be formally recognized. Unfortunately, ever since I put forward the idea, I have heard more negative things than positive. To paraphrase, “It would be seen as an endorsement, and we don’t endorse… It would turn out to be just a popularity contest… There are so many characteristics of good software that any decision would seem arbitrary.”

Apparently the place for an award is not as obvious to others as it is to me. Apparently our community is not as ready for an award as I thought we were. That is why, for the time being, I am withdrawing my offer to sponsor one. Considering who I am, I simply don’t have the political wherewithal to make the award a reality, but I do predict there will be an award at some time, just not right now. The idea needs to ferment for a while longer.

TFIDF In Libraries: Part II of III (For programmers)

Monday, April 20th, 2009

This is the second of a three-part series called TFIDF In Libraries, where relevancy ranking techniques are explored through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techniques will be explored to the end of filtering search results or addressing the perennial task of “finding more documents like this one.” In the end it is hoped to demonstrate that relevancy ranking is neither magic nor mysterious, but rather the process of applying statistical techniques to textual objects.

TFIDF, again

As described in Part I, term frequency/inverse document frequency (TFIDF) is a process of counting words in a document as well as throughout a corpus of documents to the end of sorting documents in statistically relevant ways.

Term frequency (TF) is essentially a percentage denoting the number of times a word appears in a document. It is mathematically expressed as C / T, where C is the number of times a word appears in a document and T is the total number of words in the same document.

Inverse document frequency (IDF) takes into account that many words occur many times in many documents. Stop words and the word “human” in the MEDLINE database are very good examples. IDF is mathematically expressed as D / DF, where D is the total number of documents in a corpus and DF is the number of documents in which a particular word is found. As D / DF increases so does the significance of the given word.

Given these two factors, TFIDF is literally the product of TF and IDF:

TFIDF = ( C / T ) * ( D / DF )
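
For example, borrowing some numbers from Part I of this series, suppose the word “rose” appears 6 times in a 46-word document and is found somewhere in all 3 documents of the corpus:

TFIDF = ( 6 / 46 ) * ( 3 / 3 ) = 0.13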

This is the basic form that has been used to compute relevance ranking for more than forty years, and please take note that it requires no advanced mathematical knowledge — basic arithmetic.

Like any good recipe or folk song, TFIDF has many variations. Google, for example, adds additional factors into its weighting scheme based on the popularity of documents. Other possibilities could include factors denoting the characteristics of the person using the texts. In order to accommodate the wide variety of document sizes, the natural log of IDF will be employed throughout the balance of this demonstration. Therefore, for the purposes used here, TFIDF will be defined thus:

TFIDF = ( C / T ) * log( D / DF )
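
With the logarithm in place a word found in every document now scores nothing at all, since log( 3 / 3 ) = 0; the tfidf subroutine below sidesteps this edge case by falling back to plain TF. Rarer words fare better. For example, applying the formula to Part I’s “milton” (6 occurrences in a 41-word document, found in only 1 of 3 documents) yields:

TFIDF = ( 6 / 41 ) * log( 3 / 1 ) = 0.146 * 1.099 = 0.161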

Simple Perl subroutines

In order to put theory into practice, I wrote a number of Perl subroutines implementing various aspects of relevancy ranking techniques. I then wrote a number of scripts exploiting the subroutines, essentially wrapping them in a user interface.

Two of the routines are trivial and will only be outlined below; illustrative sketches follow the list:

  • corpus – Returns an array of all the .txt files in the current directory, and is used to denote the library of content to be analyzed.
  • slurp_words – Returns a reference to a hash of all the words in a file, specifically for the purposes of implementing a stop word list.
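
Neither of these routines appears in full in this posting, but minimal sketches might look like the following. Consider them guesses for the sake of illustration; the actual code in the distribution may differ:

  # corpus - return a sorted list of all the .txt files in the current directory
  sub corpus {
  
    return sort glob( '*.txt' );
  
  }
  
  # slurp_words - return a reference to a hash of all the words in a file
  sub slurp_words {
  
    my $file  = shift;
    my %words = ();
    
    open FILE, $file or die "Can't open $file: $!\n";
    while ( <FILE> ) {
    
      # normalize to lower case and split on non-word characters
      foreach my $word ( split /\W+/, lc( $_ ) ) { $words{ $word } = 1 if ( $word ) }
    
    }
    close FILE;
    
    return \%words;
  
  }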

Two more of the routines are used to support indexing and searching the corpus. Again, since neither is the focus of this posting, each will only be outlined, with sketches following the list:

  • index – Given a file name and a list of stop words, this routine returns a reference to a hash containing all of the words in the file (sans stop words) as well as the number of times each word occurs. Strictly speaking, this hash is not an index but it serves our given purpose adequately.
  • search – Given an “index” and a query, this routine returns the number of times the query was found in the index as well as an array of files listing where the term was found. Search is limited. It only supports single-term queries, and there are no fields for limiting.
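
Again, these are hedged sketches only, not necessarily the code from the distribution. Note that the ampersand in calls like &index is needed because index is also a Perl built-in function:

  # index - given a file name and stop words, return a hash of word counts
  sub index {
  
    my $file  = shift;
    my $stops = shift;
    
    my %count = ();
    
    open FILE, $file or die "Can't open $file: $!\n";
    while ( <FILE> ) {
    
      foreach my $word ( split /\W+/, lc( $_ ) ) {
      
        # skip empty strings and stop words
        next if ( ! $word );
        next if ( exists $$stops{ $word } );
        $count{ $word }++;
      
      }
    
    }
    close FILE;
    
    return \%count;
  
  }
  
  # search - given an "index" and a query, return the hits and where they were found
  sub search {
  
    my $index = shift;
    my $query = shift;
    
    my @files = ();
    
    foreach my $file ( sort keys %$index ) {
    
      my $words = $$index{ $file };
      push @files, $file if ( exists $$words{ $query } );
    
    }
    
    return ( scalar @files, @files );
  
  }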

The heart of the library of subroutines is used to calculate TFIDF, rank search results, and classify documents. Of course the TFIDF calculation is absolutely necessary, but ironically, it is the most straightforward routine in the collection. Given values for C, T, D, and DF it returns a decimal score. Trivial:

  # calculate tfidf
  sub tfidf {
  
    my $n = shift;  # C
    my $t = shift;  # T
    my $d = shift;  # D
    my $h = shift;  # DF
    
    my $tfidf = 0;
    
    # when a word appears in every document log( D / DF ) equals 0,
    # so fall back to plain TF instead of zeroing out the word
    if ( $d == $h ) { $tfidf = ( $n / $t ) }
    else { $tfidf = ( $n / $t ) * log( $d / $h ) }
    
    return $tfidf;
    
  }
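
As a quick sanity check, the numbers for “rose” from the running example can be plugged in directly; because the word appears in all three documents D equals DF, and the routine returns plain TF:

  # rose appears 6 times in a 46-word document, in 3 of 3 documents
  my $score = &tfidf( 6, 46, 3, 3 );  # 0.130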

Many readers will probably be most interested in the rank routine. Given an index, a list of files, and a query, this code calculates TFIDF for each file and returns the results as a reference to a hash. It does this by repeatedly calculating the values for C, T, D, and DF for each of the files and calling tfidf:

  # assign a rank to a given file for a given query
  sub rank {
  
    my $index = shift;
    my $files = shift;
    my $query = shift;
    
    my %ranks = ();
    
    foreach my $file ( @$files ) {
    
      # calculate n
      my $words = $$index{ $file };
      my $n = $$words{ $query };
      
      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }
      
      # assign tfidf to file; the scalar is needed so keys %$index yields a count (D)
      $ranks{ $file } = &tfidf( $n, $t, scalar keys %$index, scalar @$files );
    
    }
    
    return \%ranks;

  }

The classify routine is an added bonus. Given the index, a file, and the corpus of files, this function calculates TFIDF for each word in the file and returns a reference to a hash containing each word and its TFIDF value. In other words, instead of calculating TFIDF for a given query in a subset of documents, it calculates TFIDF for each word in an entire corpus. This proves useful in regard to automatic classification. Like rank, it repeatedly determines values for C, T, D, and DF and calls tfidf:

  # rank each word in a given document compared to a corpus
  sub classify {
  
    my $index  = shift;
    my $file   = shift;
    my $corpus = shift;
    
    my %tags = ();
    
    # get the hash of words for the given file
    my $words = $$index{ $file };
    
    # calculate t
    my $t = 0;
    foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }
    
    foreach my $word ( keys %$words ) {
    
      # get n
      my $n = $$words{ $word };
      
      # calculate h
      my ( $h, @files ) = &search( $index, $word );
      
      # assign tfidf to word
      $tags{ $word } = &tfidf( $n, $t, scalar @$corpus, $h );
    
    }
    
    return \%tags;
  
  }

Search.pl

Two simple Perl scripts are presented below, taking advantage of the routines described above. The first is search.pl. Given a single term as input, this script indexes the .txt files in the current directory, searches them for the term, assigns TFIDF to each of the results, and displays the results in relevancy ranked order. The essential aspects of the script are listed here:

  # define
  use constant STOPWORDS => 'stopwords.inc';
  
  # include
  require 'subroutines.pl';
    
  # get the query
  my $q = lc( $ARGV[ 0 ] );

  # index
  my %index = ();
  foreach my $file ( &corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }
  
  # search
  my ( $hits, @files ) = &search( \%index, $q );
  print "Your search found $hits hit(s)\n";
  
  # rank
  my $ranks = &rank( \%index, [ @files ], $q );
  
  # sort by rank and display
  foreach my $file ( sort { $$ranks{ $b } <=> $$ranks{ $a } } keys %$ranks ) {
  
    print "\t", $$ranks{ $file }, "\t", $file, "\n"
  
  }
  
  # done
  print "\n";
  exit;

Output from the script looks something like this:

  $ ./search.pl knowledge
  Your search found 6 hit(s)
    0.0193061840120664    plato.txt
    0.00558586078987563   kant.txt
    0.00299602568022012   aristotle.txt
    0.0010031177985631    librarianship.txt
    0.00059150597421034   hegel.txt
    0.000150303111274403  mississippi.txt

From these results you can see that the document named plato.txt is the most relevant because it has the highest score; in fact, it is almost four times more relevant than the second hit, kant.txt. For extra credit, ask yourself, “At what point do the scores become useless, or when do the scores tell you there is nothing of significance here?”

Classify.pl

As alluded to in Part I of this series, TFIDF can be turned on its head to do automatic classification. Weigh each term in a corpus of documents, and list the most significant words for a given document. Classify.pl does this by denoting a lower bounds for TFIDF scores, indexing an entire corpus, weighing each term, and outputting all the terms whose scores are greater than the lower bounds. If no terms are greater than the lower bounds, then it lists the top N scores as defined by a configuration. The essential aspects of classify.pl are listed below:

  # define
  use constant STOPWORDS    => 'stopwords.inc';
  use constant LOWERBOUNDS  => .02;
  use constant NUMBEROFTAGS => 5;
  
  # require
  require 'subroutines.pl';
  
  # initialize
  my @corpus = &corpus;
  
  # index
  my %index = ();
  foreach my $file ( @corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }
  
  # classify each document
  foreach my $file ( @corpus ) {
  
    print $file, "\n";
    
    # list tags greater than a given score
    my $tags  = &classify( \%index, $file, [ @corpus ] );
    my $found = 0;
    foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {
    
      if ( $$tags{ $tag } > LOWERBOUNDS ) {
      
        print "\t", $$tags{ $tag }, "\t$tag\n";
        $found = 1;
      
      }
      
      else { last }
      
    }
      
    # accomodate tags with low scores
    if ( ! $found ) {
    
      my $n = 0;
      foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {
      
        print "\t", $$tags{ $tag }, "\t$tag\n";
        $n++;
        last if ( $n == NUMBEROFTAGS );
      
      }
  
    }
    
    print "\n";
  
  }
  
  # done
  exit;

Sample, yet truncated, output from classify.pl looks like this:

  aristotle.txt
    0.0180678691531642  being
    0.0112840859266579  substances
    0.0110363803118312  number
    0.0106083766432284  matter
    0.0098440843778661  sense
  
  mississippi.txt
    0.00499714142455761  mississippi
    0.00429324597184886  boat
    0.00418922035591656  orleans
    0.00374087743616293  day
    0.00333830388445574  river

Thus, assuming a lower TFIDF bounds of 0.02, the words being, substances, number, matter, and sense are the most significant in the document named aristotle.txt. But since none of the words in mississippi.txt have a score that high, the top five words are returned instead. For more extra credit, think of ways classify.pl can be improved by answering, “How can the output be mapped to controlled vocabulary terms or expanded through the use of some other thesaurus?”

Summary

The Perl subroutines and scripts described here implement TFIDF to do rudimentary ranking of search results and automatic classification. They are not designed to be production applications, just example tools for the purposes of learning. Turning the ideas implemented in these scripts into production applications has been the fodder for many people’s careers and entire branches of computer science.

You can download the scripts, subroutines, and sample data in order to learn more. You are encouraged to remove the .txt files from the distribution and replace them with your own data. I think your search results and automatic classification output will confirm in your mind that TFIDF is well worth the time and effort of the library community. Given the amount of full-text books and journal articles freely available on the Internet, it behooves the library profession to learn to exploit these concepts because our traditional practices simply: 1) do not scale, or 2) do not meet our users’ expectations. Furthermore, farming these sorts of solutions out to vendors is irresponsible.

Ralph Waldo Emerson’s Essays

Sunday, April 19th, 2009

It was with great anticipation that I read Ralph Waldo Emerson’s Essays (both the First Series as well as the Second Series), but my expectations were not met. In a sentence, I thought Emerson used too many words to say things that could have been expressed more succinctly.

The Essays themselves are a set of unsystematic short pieces of literature describing what one man thinks of various classic themes, such as but not limited to: history, intellect, art, experience, gifts, nature, etc. The genre itself — the literary essay or “attempts” — was apparently first popularized by Montaigne and mimicked by other “great” authors in the Western tradition including Bacon, Rousseau, and Thoreau. Considering this, maybe the poetic and circuitous nature of Emerson’s “attempts” should not be considered a fault.

Art

Because it was evident that later essays did not necessarily build on previous ones, I jumped around from chapter to chapter as whimsy dictated. Probably one of the first I read was “Art” where he describes the subject as the product of men detached from society.

It is the habit of certain minds to give an all-excluding fulness to the objects, the thought, the world, they alight upon, and to make that for the time the deputy of the world. These are the artists, the orators, the leaders of society. The power to detach and to magnify by detaching, is the essence of rhetoric in the hands of the orator and the poet.

But at the same time he seems to contradict himself earlier when he says:

No man can quite emancipate himself from the age and country, or produce a model in which the education, the religion, the politics, usages, and arts, of his times shall have no share. Though he were never so original, never so wilful and fantastic, he cannot wipe out of his work every trace of the thoughts amidst which it grew.

How can something be the product of a thing detached from society when it is not possible to become detached in the first place?

Intellect

I, myself, being a person of mind more than heart, was keenly interested in the essay entitled “Intellect” where Emerson describes it as something:

…void of affection, and sees an object as it stands in the light of science, cool and disengaged… Intellect pierces the form, overlaps the wall, detects intrinsic likeness between remote things, and reduces all things into a few principles.

At the same time, intellect is not necessarily genius, since genius also requires spontaneity:

…but the power of picture or expression, in the most enriched and flowing nature, implies a mixture of will, a certain control over the spontaneous states, without which no production is possible. It is a conversion of all nature into the rhetoric of thought under the eye of judgement, with the strenuous exercise of choice. And yet the imaginative vocabulary seems to be spontaneous also. It does not flow from experience only or mainly, but from a richer source. Not by any conscious imitation of particular forms are the grand strokes of the painter executed, but by repairing to the fountain-head of all forms in his mind.

The Poet

Emerson apparently carried around his journal wherever he went. He made a living writing and giving talks. Considering this, and considering the nature of his writing, I purposely left his essay entitled “The Poet” until last. Not surprisingly, he had a lot to say on the subject, and I found this to be the highlight of my readings:

The poet is the person in whom these powers [the reproduction of senses] are in balance, the man without impediment, who sees and handles that which others dream of, traverses the whole scale of experience, and is representative of man, in virtue of being the largest power to receive and to impart… The poet is the sayer, the namer, and represents beauty… The poet does not wait for the hero or the sage, but as they act and think primarily, so he writes primarily what will and must be spoken, reckoning the others, though primaries also, yet, in respect to him, secondaries and servants.

I found it encouraging that science was mentioned a few times during his discourse on the poet, since I believe a better understanding of one’s environment comes from the ability to think both artistically as well as scientifically, an idea I call arscience:

…science always goes abreast with the just elevation of the man, keeping step with religion and metaphysics; or, the state of science is an index of our self-knowledge… All the facts of the animal economy, — sex, nutriment, gestation, birth, growth — are symbols of passage of the world into the soul of man, to suffer there a change, and reappear a new and higher fact. He uses forms according to the life, and not according to the form. This is true science.

Back to the beginning

I think Emerson must have been a bit frustrated (or belittling himself in order to be perceived as more believable) with a search for truth when he says, “I look in vain for the poet whom I describe.” But later on he summarizes much of what the Essays describe when he says, “Art is the path of the creator to his work,” and he then goes on to say what I said at the beginning of this review:

The poet pours out verses in every solitude. Most of the things he says are conventional, no doubt; but by and by he says something which is original and beautiful. That charms him.

I was hoping to find more inspiration regarding the definition of Unitarianism throughout the book, but alas, the term was only mentioned a couple of times. Instead, I learned indirectly that Emerson has affected my thinking in more subtle ways. I have incorporated much of his thought into my own without knowing it. Funny how one’s education manifests itself.

Word cloud

Use this word cloud of the combined Essays to get an idea of what they are “about”:

nature  men  life  world  good  shall  soul  great  thought  like  love  power  know  let  mind  truth  make  society  persons  day  old  character  heart  genius  god  come  beauty  law  being  history  fact  true  makes  work  virtue  better  art  laws  self  form  right  eye  best  action  poet  friend  think  feel  eyes  beautiful  words  human  spirit  little  light  facts  speak  person  state  natural  intellect  sense  live  force  use  seen  thou  long  water  people  house  certain  individual  end  comes  whilst  divine  property  experience  look  forms  hour  read  place  present  fine  wise  moral  works  air  poor  need  earth  hand  common  word  thy  conversation  young  stand  

And since a picture is worth a thousand words, here is a simple graph illustrating how the 100 most frequently used words in the Essays (sans stop words) compare to one another:

[graph: frequency of the 100 most commonly used words in Emerson’s Essays]

TFIDF In Libraries: Part I of III (For Librarians)

Monday, April 13th, 2009

This is the first of a three-part series called TFIDF In Libraries, where “relevancy ranking” will be introduced. In this part, term frequency/inverse document frequency (TFIDF) — a common mathematical method of weighing texts for automatic classification and sorting search results — will be described. Part II will illustrate an automatic classification system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of pre-defined “Big Names” and/or “Big Ideas” — an idea apparently called “champion lists”.

The problem, straight Boolean logic

To many of us the phrase “relevancy ranked search results” is a mystery. What does it mean to be “relevant”? How can anybody determine relevance for me? Well, a better phrase might have been “statistically significant search results”. Taking such an approach — the application of statistical analysis against texts — does have its information retrieval advantages over straight Boolean logic. Take, for example, the following three documents, each consisting of a number of words, Table #1:

Document #1 Document #2 Document #3
Word Word Word
airplane book building
blue car car
chair chair carpet
computer justice ceiling
forest milton chair
justice newton cleaning
love pond justice
might rose libraries
perl shakespeare newton
rose slavery perl
shoe thesis rose
thesis truck science

A search for “rose” against the corpus will return three hits, but which one should I start reading? The newest document? The document by a particular author or in a particular format? Even if the corpus contained 2,000,000 documents and a search for “rose” returned a mere 100, the problem would remain. Which ones should I spend my valuable time accessing? Yes, I could limit my search in any number of ways, but unless I am doing a known item search it is quite likely the search results will return more than I can use, and information literacy skills will only go so far. Ranked search results — a list of hits based on term weighting — have proven to be an effective way of addressing this problem. All the technique requires is the application of basic arithmetic against the documents being searched.

Simple counting

We can begin by counting the number of times each of the words appear in each of the documents, Table #2:

Document #1 Document #2 Document #3
Word C Word C Word C
airplane 5 book 3 building 6
blue 1 car 7 car 1
chair 7 chair 4 carpet 3
computer 3 justice 2 ceiling 4
forest 2 milton 6 chair 6
justice 7 newton 3 cleaning 4
love 2 pond 2 justice 8
might 2 rose 5 libraries 2
perl 5 shakespeare 4 newton 2
rose 6 slavery 2 perl 5
shoe 4 thesis 2 rose 7
thesis 2 truck 1 science 1
Totals (T) 46 41 49

Given this simple counting method, searches for “rose” can be sorted by its “term frequency” (TF) — the quotient of the number of times a word appears in each document (C), and the total number of words in the document (T) — TF = C / T. In the first case, rose has a TF value of 6 / 46, or 0.13. In the second case TF is 5 / 41, or 0.12, and in the third case it is 7 / 49, or 0.14. Thus, by this rudimentary analysis, Document #3 is most significant in terms of the word “rose”, and Document #2 is the least; Document #3 has the highest percentage of content containing the word “rose”.

Accounting for common words

Unfortunately, this simple analysis needs to be offset to account for frequently occurring terms across the entire corpus. Good examples are stop words or the word “human” in MEDLINE. Such words are nearly meaningless because they appear so often. Consider Table #3, which includes the number of documents in which each word is found across the entire corpus (DF), and the quotient of the total number of documents (D, or in this case, 3) and DF — IDF = D / DF. Words with higher scores are more significant across the entire corpus. Search terms whose IDF (“inverse document frequency”) score approaches 1 are close to useless because they exist in just about every document:

Document #1 Document #2 Document #3
Word DF IDF Word DF IDF Word DF IDF
airplane 1 3.0 book 1 3.0 building 1 3.0
blue 1 3.0 car 2 1.5 car 2 1.5
chair 3 1.0 chair 3 1.0 carpet 1 3.0
computer 1 3.0 justice 3 1.0 ceiling 1 3.0
forest 1 3.0 milton 1 3.0 chair 3 1.0
justice 3 1.0 newton 2 1.5 cleaning 1 3.0
love 1 3.0 pond 1 3.0 justice 3 1.0
might 1 3.0 rose 3 1.0 libraries 1 3.0
perl 2 1.5 shakespeare 1 3.0 newton 2 1.5
rose 3 1.0 slavery 1 3.0 perl 2 1.5
shoe 1 3.0 thesis 2 1.5 rose 3 1.0
thesis 2 1.5 truck 1 3.0 science 1 3.0

Term frequency/inverse document frequency (TFIDF)

By taking into account these two factors — term frequency (TF) and inverse document frequency (IDF) — it is possible to assign “weights” to search results and therefore ordering them statistically. Put another way, a search result’s score (“ranking”) is the product of TF and IDF:

TFIDF = TF * IDF where:

  • TF = C / T where C = number of times a given word appears in a document and T = total number of words in a document
  • IDF = D / DF where D = total number of documents in a corpus, and DF = total number of documents containing a given word
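
For example, pulling the counts for the word airplane from Table #2 and its IDF value from Table #3, the score in Document #1 works out like this:

TFIDF = ( 5 / 46 ) * ( 3 / 1 ) = 0.109 * 3.0 = 0.326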

Table #4 is a combination of all the previous tables with the addition of the TFIDF score for each term:

Document #1
Word C T TF D DF IDF TFIDF
airplane 5 46 0.109 3 1 3.0 0.326
blue 1 46 0.022 3 1 3.0 0.065
chair 7 46 0.152 3 3 1.0 0.152
computer 3 46 0.065 3 1 3.0 0.196
forest 2 46 0.043 3 1 3.0 0.130
justice 7 46 0.152 3 3 1.0 0.152
love 2 46 0.043 3 1 3.0 0.130
might 2 46 0.043 3 1 3.0 0.130
perl 5 46 0.109 3 2 1.5 0.163
rose 6 46 0.130 3 3 1.0 0.130
shoe 4 46 0.087 3 1 3.0 0.261
thesis 2 46 0.043 3 2 1.5 0.065
Document #2
Word C T TF D DF IDF TFIDF
book 3 41 0.073 3 1 3.0 0.220
car 7 41 0.171 3 2 1.5 0.256
chair 4 41 0.098 3 3 1.0 0.098
justice 2 41 0.049 3 3 1.0 0.049
milton 6 41 0.146 3 1 3.0 0.439
newton 3 41 0.073 3 2 1.5 0.110
pond 2 41 0.049 3 1 3.0 0.146
rose 5 41 0.122 3 3 1.0 0.122
shakespeare 4 41 0.098 3 1 3.0 0.293
slavery 2 41 0.049 3 1 3.0 0.146
thesis 2 41 0.049 3 2 1.5 0.073
truck 1 41 0.024 3 1 3.0 0.073
Document #3
Word C T TF D DF IDF TFIDF
building 6 49 0.122 3 1 3.0 0.367
car 1 49 0.020 3 2 1.5 0.031
carpet 3 49 0.061 3 1 3.0 0.184
ceiling 4 49 0.082 3 1 3.0 0.245
chair 6 49 0.122 3 3 1.0 0.122
cleaning 4 49 0.082 3 1 3.0 0.245
justice 8 49 0.163 3 3 1.0 0.163
libraries 2 49 0.041 3 1 3.0 0.122
newton 2 49 0.041 3 2 1.5 0.061
perl 5 49 0.102 3 2 1.5 0.153
rose 7 49 0.143 3 3 1.0 0.143
science 1 49 0.020 3 1 3.0 0.061

Given TFIDF, a search for “rose” still returns three documents, now ordered as Documents #3, #1, and #2. A search for “newton” returns only two items, ordered as Documents #2 (0.110) and #3 (0.061). In the latter case, Document #2 is almost twice as “relevant” as Document #3. TFIDF scores can be summed to take into account Boolean unions (or) or intersections (and).
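
For example, summing the Table #4 scores for a search on “rose and newton” gives:

Document #2: 0.122 + 0.110 = 0.232
Document #3: 0.143 + 0.061 = 0.204

Consequently, Document #2 would bubble to the top of the combined result.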

Automatic classification

TFIDF can also be applied a priori to indexing/searching to create browsable lists — hence, automatic classification. Consider Table #5, where each word is listed in a sorted TFIDF order:

Document #1 Document #2 Document #3
Word TFIDF Word TFIDF Word TFIDF
airplane 0.326 milton 0.439 building 0.367
shoe 0.261 shakespeare 0.293 ceiling 0.245
computer 0.196 car 0.256 cleaning 0.245
perl 0.163 book 0.220 carpet 0.184
chair 0.152 pond 0.146 justice 0.163
justice 0.152 slavery 0.146 perl 0.153
forest 0.130 rose 0.122 rose 0.143
love 0.130 newton 0.110 chair 0.122
might 0.130 chair 0.098 libraries 0.122
rose 0.130 thesis 0.073 newton 0.061
blue 0.065 truck 0.073 science 0.061
thesis 0.065 justice 0.049 car 0.031

Given such a list it would be possible to take the first three terms from each document and call them the most significant subject “tags”. Thus, Document #1 is about airplanes, shoes, and computers. Document #2 is about Milton, Shakespeare, and cars. Document #3 is about buildings, ceilings, and cleaning.

Probably a better way to assign “aboutness” to each document is to first denote a TFIDF lower bounds and then assign terms with greater than that score to each document. Assuming a lower bounds of 0.2, Document #1 is about airplanes and shoes. Document #2 is about Milton, Shakespeare, cars, and books. Document #3 is about buildings, ceilings, and cleaning.

Discussion and conclusion

Since the beginning, librarianship has focused on the semantics of words in order to create a cosmos from an apparent chaos. “What is this work about? Read the descriptive information regarding a work (author, title, publication date, notes, etc.) to work out in your mind its importance.” Unfortunately, this approach leaves much up to interpretation. One person says this document is about horses, and the next person says it is about husbandry.

The mathematical approach is more objective and much more scalable. While not perfect, there is much less interpretation required with TFIDF. It is just about mathematics. Moreover, it is language independent; it is possible to weigh terms and provide relevance ranking without knowing the meaning of a single word in the index.

In actuality, the whole thing is not an either/or sort of question, but instead a both/and sort of question. Human interpretation provides an added value, definitely. At the same time the application of mathematics (“Can you say ‘science’?”) proves to be quite useful too. The approaches complement each other — they are arscient. Much of how we have used computers in libraries has simply been to automate existing processes. We have yet to learn how to truly take advantage of a computer’s functionality. It can remember things a whole lot better than we can. It can add a whole lot faster than we can. Because of this it is almost trivial to calculate ( C / T ) * ( D / DF ) over an entire corpus of 2,000,000 MARC records or even 1,000,000 full text documents.

None of these ideas are new. It is possible to read articles describing these techniques going back about 40 years. Why has our profession not used them to our advantage? Why is it taking us so long? If you have an answer, then enter it in the comment box below.

This first posting has focused on the fundamentals of TFIDF. Part II will describe a Perl program implementing relevancy ranking and automatic classification against sets of given text files. Part III will explore the idea of using TFIDF to enable users to find documents alluding to “great ideas” or “great people”.

A day at CIL 2009

Friday, April 3rd, 2009

This documents my day-long experiences at the Computers in Libraries annual conference, March 31, 2009. In a sentence, the meeting was well-attended and covered a wide range of technology issues.

[photo: Washington Monument]

The day began with an interview-style keynote address featuring Paul Holdengraber (New York Public Library) interviewed by Erik Boekesteijn (Library Concept Center). As the Director of Public Programs at the Public Library, Holdengraber’s self-defined task is to “levitate the library and make the lions on the front steps roar.” Well-educated, articulate, creative, innovative, humorous, and cosmopolitan, he facilitates sets of programs in the library’s reading room called “Live from the New York Public Library” where he interviews people in an effort to make the library — a cultural heritage institution — less like a mausoleum for the Old Masters and more like a place where great ideas flow freely. A couple of notable quotes included “My mother always told me to be porous because you have two ears and only one mouth” and “I want to take the books from the closed stacks and make people desire them.” Holdengraber’s enthusiasm for his job is contagious. Very engaging as well as interesting.

During the first of the concurrent sessions I gave a presentation called “Open source software: Controlling your computing environment” where I first outlined a number of definitions and core principles of open source software. I then tried to draw a number of parallels between open source software and librarianship. Finally, I described how open source software can be applied in libraries. During the presentation I listed four skills a library needs to master in order to take advantage of open source software (namely, relational databases, XML, indexing, and some sort of programming language), but in retrospect I believe basic systems administration skills are what is really required, since the majority of open source software is simply installed, configured, and used. Few people feel the need to modify its functionality, and therefore the aforementioned skills are not critical, only desirable.

[photo: Lincoln Memorial]

In “Designing the Digital Experience” by David King (Topeka & Shawnee County Public Library) attendees were presented with ways websites can be designed to digitally supplement the physical presence of a library. He outlined structural approaches to Web design such as the ones promoted by Jesse James Garrett, David Armano, and 37signals. He then compared & contrasted these approaches to the “community path” approaches which endeavor to create a memorable experience. Such things can be done, King says, through conversations, invitations, participation, creating a sense of familiarity, and the telling of stories. It is interesting to note that these techniques are not dependent on Web 2.0 widgets, but can certainly be implemented through their use. Throughout the presentation he brought all of his ideas home through the use of examples from the websites of Harley-Davidson, Starbucks, American Girl, and Webkinz. Not ironically, Holdengraber was doing the same thing for the Public Library, except in the real world, not through a website.

In a session after lunch called “Go Where The Client Is” Natalie Collins (NRC-CISTI) described how she and a few co-workers converted library catalog data containing institutional repository information as well as SWETS bibliographic data into NLM XML and made it available for indexing by Google Scholar. In the end, she discovered that this approach was much more useful to her constituents when compared to the cool (“kewl”) Web Services-based implementation they had created previously. Holly Hibner (Salem-South Lyon District Library) compared & contrasted the use of tablet PCs with iPods for use during roaming reference services. My two take-aways from this presentation were cool (“kewl”) services called drop.io and LinkBunch, websites making it easier to convert data from one format into another and to bundle lists of links together into a single URL, respectively.

[photo: Jefferson Memorial]

The last session for me that day was one on open source software implementations of “next generation” library catalogs, specifically Evergreen. Karen Collier and Andrea Neiman (both of Kent County Public Library) outlined their implementation process of Evergreen in rural Michigan. Apparently it began with the re-upping of their contract for their computer hardware. Such a thing would cost more than they expected. This led to more investigations which ultimately resulted in the selection of Evergreen. “Open source seemed like a logical conclusion.” They appear to be very happy with their decision. Karen Schneider (Equinox Software) gave a five-minute “lightning talk” on the who and what of Equinox and Evergreen. Straight to the point. Very nice. Ruth Dukelow (Michigan Library Consortium) described how participating libraries have been brought on board with Evergreen, and she outlined the reasons why Evergreen fit the bill: it supported MLCat compliance, it offered an affordable hosted integrated library system, it provided access to high quality MARC records, and it offered a functional system to non-technical staff.

I enjoyed my time there in Washington, DC at the conference. Thanks go to Ellyssa Kroski, Steven Cohen, and Jane Dysart for inviting me and allowing me to share some of my ideas. The attendees at the conference were not as technical as those you might find at Access or Code4Lib, and certainly not JCDL or ECDL. This is not a bad thing. The people were genuinely interested in the things presented, but I did overhear one person say, “This is completely over my head.” The highlight for me took place during the last session, where an attendee sang the praises of open source software for all the same reasons I had been expressing over the past twelve years. “It is so much like the principles of librarianship,” she said. That made my day.

Quick Trip to Purdue

Wednesday, April 1st, 2009

Last Friday, March 27, I was invited by Michael Witt (Interdisciplinary Research Librarian) at Purdue University to give a presentation to the library faculty on the topic of “next generation” library catalogs. During the presentation I made an effort to have the participants ask and answer questions such as “What is the catalog?”, “What is it expected to contain?”, “What functions is it expected to perform and for whom?”, and most importantly, “What problems is it expected to solve?”

I then described how most of the current “next generation” library catalog thingees are very similar. Acquire metadata records. Optionally store them in a database. Index them (with Lucene). Provide services against the index (search and browse). I then brought the idea home by describing in more detail how things like VuFind, Primo, Koha, Evergreen, etc. all use this model. I then made an attempt to describe how our “next generation” library catalogs could go so much further by providing services against the texts as well as services against the index. “Discovery is not the problem that needs to be solved.”

Afterwards a number of us went to lunch where we compared & contrasted libraries. It is a shame that the Purdue University, Indiana University, and University of Notre Dame libraries do not work more closely together. Our strengths complement each other in so many ways.

“Michael, thanks for the opportunity!”


[photo: something I saw on the way back home]

Library Technology Conference, 2009: A Travelogue

Wednesday, April 1st, 2009

This posting documents my experiences at the Library Technology Conference at Macalester College (St. Paul, Minnesota) on March 18-19, 2009. In a sentence, this well-organized regional conference provided professionals from nearby states an opportunity to listen, share, and discuss ideas concerning the use of computers in libraries.

[photo: Wallace Library]
[photo: Dayton Center]

Day #1, Wednesday

The Conference, sponsored by Macalester College — a small, well-respected liberal arts college in St. Paul — began with a keynote presentation by Stacey Greenwell (University of Kentucky) called “Applying the information commons concept in your library”. In her remarks the contagiously energetic Ms. Greenwell described how she and her colleagues implemented the “Hub”, an “active learning place” set in the library. After significant amounts of planning, focus group interviews, committee work, and on-going cooperation with the campus computing center, the Hub opened in March of 2007. The whole thing is designed to be a fun, collaborative learning commons equipped with computer technology and supported by librarian and computer consultant expertise. Some of the real winners in her implementation include the use of white boards, putting every piece of furniture on wheels, including “video walls” (displaying items from special collections, student art, basketball games, etc.), and hosting parties where as many as 800 students attend. Greenwell’s enthusiasm was inspiring.

Most of the Conference was made up of sets of concurrent sessions, and the first one I attended was given by Jason Roy and Shane Nackerund (both of the University of Minnesota) called “What’s cooking in the lab?” Roy began by describing both a top-down and bottom-up approach to the curation and maintenance of special collections content. Technically, their current implementation includes the usual cast of characters (DSpace, finding aids managed with DLXS, sets of images, and staff), but sometime in the near future he plans on implementing a more streamlined approach consisting of Fedora for the storage of content with sets of Web Services on top to provide access. It was also interesting to note their support for user-contributed content. Users supply images. Users tag content. Images and tags are used to supplement more curated content.

Nackerund demonstrated a number of tools he has been working on to provide enhanced library services. One was the Assignment Calculator — a tool to outline what steps need to be done to complete library-related, classroom-related tasks. He has helped implement a mobile library home page by exploiting Web Service interfaces to the underlying systems. While the Web Service APIs are proprietary, they are a step in the right direction for further exploitation. He has implemented sets of course pages — as opposed to subject guides — too. “I am in this class, what library resources should I be using?” (The creation of course guides seems to be a trend.) Finally, he is creating a recommender service of which the core is the creation of “affinity strings” — a set of codes used to denote the characteristics of an individual as opposed to specific identifiers. Of all the things from the Conference, the idea of affinity strings struck me the hardest. Very nice work, and documented in a Code4Lib Journal article to boot.

In the afternoon I gave a presentation called “Technology Trends and Libraries: So many opportunities”. In it I described why mobile computing, content “born digital”, the Semantic Web, search as more important than browse, and the wisdom of crowds represent significant future directions for librarianship. I also described the importance of not losing sight of the forest for the trees. Collection, organization, preservation, and dissemination of library content and services are still the core of the profession, and we simply need to figure out new ways to do the work we have traditionally done. “Libraries are not warehouses of data and information as much as they are gateways to learning and knowledge. We must learn to build on the past and evolve, instead of clinging to it like a comfortable sweater.”

Later in the afternoon Marian Rengal and Eric Celeste (both of the Minnesota Digital Library) described the status of the Minnesota Digital Library in a presentation called “Where we are”. Using ContentDM as the software foundation of their implementation, the library includes many images supported by “mostly volunteers just trying to do the right thing for Minnesota.” What was really interesting about their implementation is the way they have employed a building block approach. PMWiki to collaborate. The Flickr API to share. Pachyderm to create learning objects. One of the most notable quotes from the presentation was “Institutions need to let go of their content to a greater degree; let them have a life of their own.” I think this is something that needs to be heard by many of us in cultural heritage institutions. If we make our content freely available, then we will be facilitating the use of the content in unimagined ways. Such is a good thing.

[photo: St. Paul Cathedral]
[photo: Balboa facade]

Day #2, Thursday

The next day was filled with concurrent sessions. I first attended one by Alec Sonsteby (Concordia College) entitled “VuFind: the MnPALS Experience” where I learned how MnPALS — a library consortium — brought up VuFind as their “discovery” interface. They launched VuFind in August of 2008, and they seem pretty much satisfied with the results.

During the second round of sessions I led a discussion/workshop regarding “next generation” library catalogs. In it we asked and tried to answer questions such as “What is the catalog?”, “What does it contain?”, “What functions is it expected to fulfill and for whom?”, and most importantly, “What is the problem it is expected to solve?” I then described how many of the current crop of implementations function very similarly. Dump metadata records. Often store them in a database. Index them (with Lucene). Provide services against the index (search and browse). I then tried to outline how “next generation” library catalogs could do more, namely provide services against the texts as well as the index.

The last session I attended was about ERMs — Electronic Resource Management systems. Don Zhou (William Mitchell College of Law) described how he implemented Innovative Interfaces’ ERM. “The hard part was getting the data in.” Dani Roach and Carolyn DeLuca (both of the University of St. Thomas) described how they implemented a Serials Solutions… solution. “You need to be adaptive; we decided to do things one way and then went another… It is complex, not difficult, just complex. There have to be many tools to do ERM.” Finally, Galadriel Chilton (University of Wisconsin – La Crosse) described an open source implementation written in Microsoft Access, but “it does not do electronic journals.”

In the afternoon Eric C. was gracious enough to tour me around the Twin Cities. We saw the Cathedral of Saint Paul, the Mississippi River, and a facade by Balboa. But the most impressive thing I saw was the University of Minnesota’s “cave” — an onsite storage facility for the University’s libraries. All the books they want to withdraw go here, where they are sorted by size, placed into cardboard boxes assigned to bar codes, and put into rooms 100 yards long and three stories high. The facility is manned by two people, and in ten years they have lost only two books out of 1.3 million. The place is so huge you can literally drive a tractor-trailer truck into it. Very impressive, and I got a personal tour. “Thanks Eric!”

[photo: Eric and Eric]
[photo: St. Anthony Falls]

Summary

I sincerely enjoyed the opportunity to attend this conference. Whenever I give talks I feel the need to write up a one-page handout. That process forces me to articulate my ideas in writing. When I give the presentation it is not all about me, but rather about learning about the environments of my peers. It is an education all around. This particular regional conference was the right size, about 250 attendees. Many of the attendees knew each other. They caught up and learned things along the way. “Good job Ron Joslin!” The only thing I missed was a photograph of Mary Tyler Moore. Maybe next time.

