Archive for the ‘Hacks’ Category

ISBN numbers

Monday, February 2nd, 2009

I’m beginning to think about ISBN numbers and the Alex Catalogue of Electronic Texts. For example, I can add ISBN numbers to Alex, link them to my (fledgling) LibraryThing collection, and display lists of recently added items here:

Interesting, but I think the list will change over time, as new things get added to my collection. It would be nice to link to a specific item. Hmm…

Tarzan Of The Apes (Barnes & Noble Classics Series) (Barnes & Noble Classics)
Edgar Rice Burroughs; Barnes & Noble Classics 2006

On the other hand, I could exploit ISBN numbers and OpenLibrary using a WordPress plug-in called OpenBook Book Data by John Miedema. It displays cover art, a link to OpenLibrary as well as WorldCat

Again, very interesting. For more details, see the “OpenBook WordPress Plugin: Open Source Access to Bibliographic Data” in Code4Lib Journal.

A while ago I wrote a CGI script that took ISBN numbers as input, fed them to xISBN and/or ThingISBN to suggest alternative titles. I called it Send It To Me.

Then of course there is the direct link to Amazon.com.

I suppose it is nice to have choice.

Fun with WebService::Solr, Part III of III

Thursday, January 22nd, 2009

This is the last of a three-part series providing an overview of a set of Perl modules called WebService::Solr. In Part I, WebService::Solr was introduced with two trivial scripts. Part II put forth two command line driven scripts to index and search content harvested via OAI. Part III illustrates how to implement an Search/Retrieve via URL (SRU) search interface against an index created by WebService::Solr.

Search/Retrieve via URL

SRU

Search/Retrieve via URL (SRU) is a REST-like Web Service-based protocol designed to query remote indexes. The protocol essentially consists of three functions or “operations”. The first, explain, provides a mechanism to auto-discover the type of content and capabilities of an SRU server. The second, scan, provide a mechanism to browse an index’s content much like perusing the back-of-a-book index. The third, searchRetrieve, provides the means for sending a query to the index and getting back a response. Many of the librarians in the crowd will recognize SRU as the venerable Z39.50 protocol redesigned for the Web.

During the past year, time has been spent joining the SRU community with the OpenSearch community to form a single, more unified set of Search Web Service protocols. OpenSearch has very similar goals to SRU — to provide standardized interfaces for searching indexes — but the techniques between it an SRU are different. Where OpenSearch’s query language is simple, SRU’s is expressive. Where OpenSearch returns an RSS-like data stream, SRU includes the ability to return just about any XML format. OpenSearch may be easier to implement, but SRU is suited for a wider number of applications. To bring SRU and OpenSearch together, and to celebrate similarities as opposed to differences, an OASIS Abstract Protocol Definition has been drafted defining how the searching of Web-based databases and indexes can be done in a standardized way.

SRU is an increasingly important protocol for the library community because of a growing number of the WorldCat Grid Services are implemented using SRU. The Grid supports indexes such lists of library holdings (WorldCat), name and subject authority files (Identities), as well as names of libraries (the Registry). By sending SRU queries to these services and mashing up the results with the output of other APIs, all sorts of library and bibliographic applications can be created.

Integrating WebService::Solr into SRU

Personally, I have been creating SRU interfaces to many of my indexes for about four years. I have created these interfaces against mailing list archives, OAI-harvested content, and MARC records. The underlying content has been indexed with swish-e, Plucene, KinoSearch, and now Lucene through WebService::Solr.

Ironic or not, I use yet another set of Perl modules — available on CPAN and called SRU — written by Brian Cassidy to implement my SRU servers. The form of my implementations is rather simple. Get the input. Determine what operation is requested. Branch accordingly. Do the necessary processing. Return a response.

The heart of my SRU implementation is a subroutine called search. It is within this subroutine where indexer-specific hacking takes place. For example and considering WebService::Solr:

sub search {

  # initialize
  my $query   = shift;
  my $request = shift;
  my @results;

  # set up Solr
  my $solr = WebService::Solr->new( SOLR );

  # calculate start record and number of records
  my $start_record = 0;
  if ( $request->startRecord ) { $start_record = $request->startRecord - 1 }
  my $maximum_records = MAX; $maximum_records = $request->maximumRecords
     unless ( ! $request->maximumRecords );

  # search
  my $response   = $solr->search( $query, {
                                  'start' => $start_record,
                                  'rows'  => $maximum_records });
  my @hits       = $response->docs;
  my $total_hits = $response->pager->total_entries;

  # display the number of hits
  if ( $total_hits ) {

    foreach my $doc ( @hits ) {

      # slurp
      my $id          = $doc->value_for(  'id' );
      my $name        = &escape_entities( $doc->value_for(  'title' ));
      my $publisher   = &escape_entities( $doc->value_for(  'publisher' ));
      my $description = &escape_entities( $doc->value_for(  'description' ));
      my @creator     = $doc->values_for( 'creator' );
      my $contributor = &escape_entities( $doc->value_for(  'contributor' ));
      my $url         = &escape_entities( $doc->value_for(  'url' ));
      my @subjects    = $doc->values_for( 'subject' );
      my $source      = &escape_entities( $doc->value_for(  'source' ));
      my $format      = &escape_entities( $doc->value_for(  'format' ));
      my $type        = &escape_entities( $doc->value_for(  'type' ));
      my $relation    = &escape_entities( $doc->value_for(  'relation' ));
      my $repository  = &escape_entities( $doc->value_for(  'repository' ));

      # full results, but included entities; hmmm...
      my $record  = '<srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict"
                      xmlns:dc="http://purl.org/dc/elements/1.1/"
                      xmlns:srw_dc="info:srw/schema/1/dc-v1.1">';
      $record .= '<dc:title>' .  $name . '</dc:title>';
      $record .= '<dc:publisher>' .  $publisher . '</dc:publisher>';
      $record .= '<dc:identifier>' .  $url . '</dc:identifier>';
      $record .= '<dc:description>' .  $description . '</dc:description>';
      $record .= '<dc:source>' . $source . '</dc:source>';
      $record .= '<dc:format>' .  $format . '</dc:format>';
      $record .= '<dc:type>' .  $type . '</dc:type>';
      $record .= '<dc:contributor>' .   $contributor . '</dc:contributor>';
      $record .= '<dc:relation>' .   $relation . '</dc:relation>';
      foreach ( @creator ) { $record .= '<dc:creator>' .  $_ . '</dc:creator>' }
      foreach ( @subjects ) { $record .= '<dc:subject>' . $_ . '</dc:subject>' }
      $record .= '</srw_dc:dc>';
      push @results, $record;

    }

  }

  # done; return it
  return ( $total_hits, @results );

}

The subroutine is not unlike the search script outlined in Part II of this series. First the query, SRU::Request object, results, and local Solr objects are locally initialized. A pointer to the first desired hit as well as the maximum number of records to return are calculated. The search is done, and the total number of search results is saved for future reference. If the search was a success, then each of the hits are looped through while stuffing them into an XML element named record and scoped with a Dublin Core name space. Finally, the total number of records as well as the records themselves are returned to the main module where they are added to an SRU::Response object and returned to the SRU client.

This particular implementation is pretty rudimentary, and it does not really exploit the underlying functionality of Solr/Lucene. For example, it does not support facets, spell check, suggestions, etc. On the other hand, it does support paging, and since it is implemented under mod_perl it is just about as fast as it can get on my hardware.

Give the implementation a whirl. The underlying index includes about 20,000 records of various electronic books (from the Alex Catalogue of Electronic Texts, Project Gutenberg, and the HathiTrust), photographs (from my own adventures), journal titles, and journal articles (both from the Directory of Open Access Journals).

Summary

It is difficult for me to overstate the number of possibilities for librarianship considering the current information environment. Data and information abound! Learning has not stopped. It is sexy to be in the information business. All of the core principles of librarianship are at play in this environment. Collection. Preservation. Organization. Dissemination. The application of relational databases combined with indexers provide the means to put into practice these core principles in today’s world.

The Solr/Lucene combination is an excellent example, and WebService::Solr is just one way to get there. Again, I don’t expect every librarian to know and understand all of things outlined in this series of essays. On the other hand, I do think it is necessary for the library community as a whole to understand this technology in the same way they understand bibliography, conservation, cataloging, and reference. Library schools need to teach it, and librarians need to explore it.

Source code

Finally, plain text versions of this series’ postings, the necessary Solr schema.xml files, as well as all the source code is available for downloading. Spend about an hour putzing around. I’m sure you will come out the other end learning something.

Fun with WebService::Solr, Part II of III

Monday, January 12th, 2009

In this posting (Part II), I will demonstrate how to use WebService::Solr to create and search a more substantial index, specifically an index of metadata describing the content of the Directory of Open Access Journals. Part I of these series introduced Lucene, Solr, and WebService::Solr with two trivial examples. Part III will describe how to create an SRU interface using WebService::Solr.

Directory of Open Access Journals

solr logoThe Directory of Open Access Journals (DOAJ) is a list of freely available scholarly journals. As of this writing the Directory contains approximately 3,900 titles organized into eighteen broad categories such as Arts and Architecture, Law and Political Science, and General Science. Based on my tertiary examination, a large percentage of the titles are in the area of medicine.

Not only is it great that such a directory exists, but it is even greater that the Directory’s metadata — the data describing the titles in the Directory — is available for harvesting via OAI-PMH. While the metadata is rather sparse, it is more than adequate for creating rudimentary MARC records for importing into library catalogs, or better yet, incorporating into some other Web service. (No puns intended.)

In my opinion, the Directory is a especially underutilized. For example, not only are the Directory’s journal titles available for download, but so is the metadata of about 25,000 journal articles. Given these two things (metadata describing titles as well as articles) it would be entirely possible to seed a locally maintained index of scholarly journal content and incorporate that into library “holdings”. But alas, that is another posting and another story.

Indexing the DOAJ

It is almost trivial to create a search engine against DOAJ content when you know how to implement an OAI-PMH harvester and indexer. First, you need to know the OAI-PMH root URL for the Directory, and it happens to be http://www.doaj.org/oai Second, you need to peruse the OAI-PMH output sent by the Directory and map it to fields you will be indexing. In the case of this demonstration, the fields are id, title, publisher, subject, and URL. Consequently, I updated the schema from the first demonstration to look like this:

<!-- DC-like fields -->
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="publisher" type="text" indexed="true" stored="true" />
  <field name="subject" type="text" indexed="true" stored="true" multiValued="true" />
  <field name="url" type="text" indexed="false" stored="true" />
  <field name="text" type="text" indexed="true" stored="false" multiValued="true" />
  <field name="facet_subject" type="string" indexed="true" stored="true" multiValued="true" />
</fields>

<!-- key; for updating purposes -->
<uniqueKey>id</uniqueKey>

<!-- for non-field searches -->
<defaultSearchField>text</defaultSearchField>

<!-- AND is more precise -->
<solrQueryParser defaultOperator="AND" />

<!-- what gets searched by default -->
<copyField source="title" dest="text" />
<copyField source="subject" dest="text" />
<copyField source="publisher" dest="text" />

The astute reader will notice the addition of a field named facet_subject. This field, denoted as a string and therefore not parsed by the indexer, is destined to be a browsable facet in the search engine. By including this sort of field in the index it is be possible to return results like, “Your search identified 100 items, and 25 of them are associated with the subject Philosophy.” A very nice feature. Think of it as the explicit exploitation of controlled vocabulary terms for search results. Facets turn the use of controlled vocabularies inside out. The library community has something to learn here.

Once the schema was updated, I wrote the following script to index the journal title content from the Directory:

#!/usr/bin/perl

# index-doaj.pl - get doaj content and index it

# Eric Lease Morgan <eric_morgan@infomotions.com>
# January  12, 2009 - version 1.0

# define
use constant OAIURL => 'http://www.doaj.org/oai';
use constant PREFIX => 'oai_dc';
use constant SOLR   => 'http://localhost:210/solr';

# require
use Net::OAI::Harvester;
use strict;
use WebService::Solr;

# initialize oai and solr
my $harvester = Net::OAI::Harvester->new( baseURL => OAIURL );
my $solr      = WebService::Solr->new( SOLR );

# get all records and loop through them
my $records = $harvester->listAllRecords( metadataPrefix => PREFIX );
my $id      = 0;
while ( my $record = $records->next ) {

  # increment
  $id++;
  last if ( $id > 100 );  # comment this out to get everything

  # extract the desired metadata
  my $metadata     = $record->metadata;
  my $identifier   = $record->header->identifier;
  my $title        = $metadata->title      ? &strip( $metadata->title )     : '';
  my $url          = $metadata->identifier ? $metadata->identifier          : '';
  my $publisher    = $metadata->publisher  ? &strip( $metadata->publisher ) : '';
  my @all_subjects = $metadata->subject    ? $metadata->subject             : ();

  # normalize subjects
  my @subjects = ();
  foreach ( @all_subjects ) {

    s/DoajSubjectTerm: //;  # remove DOAJ label
    next if ( /LCC: / );    # don't want call numbers
    push @subjects, $_;

  }

  # echo
  print "      record: $id\n";
  print "  identifier: $identifier\n";
  print "       title: $title\n";
  print "   publisher: $publisher\n";
  foreach ( @subjects ) { print "     subject: $_\n" }
  print "         url: $url\n";
  print "\n";

  # create solr/lucene document
  my $solr_id        = WebService::Solr::Field->new( id        => $identifier );
  my $solr_title     = WebService::Solr::Field->new( title     => $title );
  my $solr_publisher = WebService::Solr::Field->new( publisher => $publisher );
  my $solr_url       = WebService::Solr::Field->new( url       => $url );

  # fill up a document
  my $doc = WebService::Solr::Document->new;
  $doc->add_fields(( $solr_id, $solr_title, $solr_publisher, $solr_url ));
  foreach ( @subjects ) {

    $doc->add_fields(( WebService::Solr::Field->new( subject => &strip( $_ ))));
    $doc->add_fields(( WebService::Solr::Field->new( facet_subject => &strip( $_ ))));

  }

  # save; no need for commit because it comes for free
  $solr->add( $doc );

}

# done
exit;

sub strip {

  # strip non-ascii characters; bogus since the OAI output is suppose to be UTF-8
  # see: http://www.perlmonks.org/?node_id=613773
  my $s =  shift;
  $s    =~ s/[^[:ascii:]]+//g;
  return $s;

}

The script is very much like the trivial example from Part I. It first defines a few constants. It then initializes both an OAI-PMH harvester as well as a Solr object. It then loops through each record of the harvested content extracting the desired data. The subject data, in particular, is normalized. The data is then inserted into WebService::Solr::Field objects which in turn are inserted into WebService::Solr::Document objects and added to the underlying Lucene index.

Searching the index

Searching the index is less trivial than the example in Part I because of the facets, below:

#!/usr/bin/perl

# search-doaj.pl - query a solr/lucene index of DOAJ content

# Eric Lease Morgan <eric_morgan@infomotions.com>
# January 12, 2009 - version 1.0

# define
use constant SOLR => 'http://localhost:210/solr';
use constant ROWS => 100;
use constant MIN  => 5;

# require
use strict;
use WebService::Solr;

# initalize
my $solr = WebService::Solr->new( SOLR );

# sanity check
my $query = $ARGV[ 0 ];
if ( ! $query ) {

  print "Usage: $0 <query>\n";
  exit;

}

# search; get no more than ROWS records and subject facets occuring MIN times
my $response  = $solr->search( $query, { 'rows'           => ROWS,
                                         'facet'          => 'true',
                                         'facet.field'    => 'facet_subject',
                                         'facet.mincount' => MIN });

# get the number of hits, and start display
my $hit_count = $response->pager->total_entries;
print "Your search ($query) found $hit_count document(s).\n\n";

# extract subject facets, and display
my %subjects = &get_facets( $response->facet_counts->{ facet_fields }->{ facet_subject } );
if ( $hit_count ) {

  print "  Subject facets: ";
  foreach ( sort( keys( %subjects ))) { print "$_ (" . $subjects{ $_ } . "); " }
  print "\n\n";

}

# display each hit
my $index = 0;
foreach my $doc ( $response->docs ) {

  # slurp
  my $id        = $doc->value_for( 'id' );
  my $title     = $doc->value_for( 'title' );
  my $publisher = $doc->value_for( 'publisher' );
  my $url       = $doc->value_for( 'url' );
  my @subjects  = $doc->values_for( 'subject' );

  # increment
  $index++;

  #echo
  print "     record: $index\n";
  print "         id: $id\n";
  print "      title: $title\n";
  print "  publisher: $publisher\n";
  foreach ( @subjects ) { print "    subject: $_\n" }
  print "        url: $url\n";
  print "\n";

}

# done
exit;

sub get_facets {

  # convert array of facet/hit-count pairs into a hash; obtuse
  my $array_ref = shift;
  my %facet;
  my $i = 0;
  foreach ( @$array_ref ) {

    my $k = $array_ref->[ $i ]; $i++;
    my $v = $array_ref->[ $i ]; $i++;
    next if ( ! $v );
    $facet{ $k } = $v;

  }

  return %facet;

}

The script needs a bit of explaining. Like before, a few constants are defined. A Solr object is initialized, and the existence of a query string is verified. The search method makes use of a few options, specifically, options to return ROW number of search results as well as specific facets occurring MIN number of times. The whole thing is stuffed into a WebService::Solr::Response object, which is, for better or for worse, a JSON data structure. Using the pager method against the response object, the number hits are returned which is assigned to a scalar and displayed.

The trickiest part of the script is the extraction of the facets done by the get_facets subroutine. In WebService::Solr, facets names and their values are returned in an array reference. get_facets converts this array reference into a hash, and is then displayed. Finally, each WebService::Solr::Response object is looped through and echoed. Notice how the the subject field is handled. It contains multiple values which are retrieved through the values_for method which returns an array, not a scalar. Below is sample output for the search “library”:

Your search (library) found 84 document(s).

  Subject facets: Computer Science (7); Library and Information
Science (68); Medicine (General) (7); information science (19);
information technology (8); librarianship (16); libraries (6);
library and information science (14); library science (5);

     record: 1
         id: oai:doaj.org:0029-2540
      title: North Carolina Libraries
  publisher: North Carolina Library Association
    subject: libraries
    subject: librarianship
    subject: media centers
    subject: academic libraries
    subject: Library and Information Science
        url: http://www.nclaonline.org/NCL/

     record: 2
         id: oai:doaj.org:1311-8803
      title: Bibliosphere
  publisher: NBU Library
    subject: Bulgarian libraries
    subject: librarianship
    subject: Library and Information Science
        url: http://www.bibliosphere.eu/ 

     record: 3
         id: ...

In a hypertext environment, each of the titles in the returned records would be linked with their associated URLs. Each of the subject facets listed at the beginning of the output would be hyperlinked to subsequent searches combining the original query plus the faceted term, such as “library AND subject:’Computer Science’”. An even more elaborate search interface would allow the user to page through search results and/or modify the value of MIN to increase or decrease the number of relevant facets displayed.

Making lists searchable

Librarians love lists. We create lists of books. Lists of authors of books. Lists of journals. Lists of journal articles. Recently we have become enamored with lists of Internet resources. We pay other people for lists, and we call these people bibliographic index vendors. OCLC’s bread and butter is a list of library holdings. Librarians love lists.

Lists aren’t very useful unless they are: 1) short, 2) easily sortable, or 3) searchable. For the most part, the profession has mastered the short, sortable list, but we are challenged when it comes to searching our lists. We insist on using database applications for this, even when we don’t know how to design a (relational) database. Our searching mentality is stuck in the age of mediated online search services such as DIALOG and BRS. The profession has not come to grips with the advances in information retrieval. Keyword searching, as opposed to field searching, has its merits. Tools like Lucene, KinoSearch, Zebra, swish-e, and a host of predecessors like Harvest, WAIS, and Veronica all facilitate(d) indexing/searching.

As well as organizing information — the creation of lists — the profession needs to learn how to create its own indexes and make them searchable. While I do not advocate every librarian know how to exploit things like WebService::Solr, I do advocate the use of these technologies to a much greater degree. Without them the library profession will always be a follower in the field of information technology as opposed to a leader.

Summary

This posting, Part II of III, illustrated how to index and search content from an OAI-PMH data repository. It also advocated the increased use of indexer/search engines by the library profession. In the next and last part of this series WebService::Solr will be used as a part of an Search/Retrieve via URL (SRU) interface.

Acknowledgements

Special thanks go to Brian Cassidy and Kirk Beers who wrote WebService::Solr. Additional thanks go to Ed Summers and Thomas Berger who wrote Net::OAI::Harvester. I am simply standing on the shoulders of giants.

Fun with WebService::Solr, Part I of III

Monday, January 5th, 2009

solr logo
This posting (Part I) is an introduction to a Perl module called WebService::Solr. In it you will learn a bit of what Solr is, how it interacts with Lucene (an indexer), and how to write two trivial Perl scripts: 1) an indexer, and 2) a search engine. Part II of this series will introduce less trivial scripts — programs to index and search content from the Directory of Open Access Journals (DOAJ). Part III will demonstrate how to use WebService::Solr to implement an SRU interface against the index of DOAJ content. After reading each Part you should have a good overview of what WebService::Solr can do, but more importantly, you should have a better understanding of the role indexers/search engines play in the world of information retrieval.

Solr, Lucene, and WebService::Solr

I must admit, I’m coming to the Solr party at least one year late, and as you may or may not know, Solr is a Java-based, Web Services interface to the venerable Lucene — the current gold standard when it comes to indexers/search engines. In such an environment, Lucene (also a Java-based system) is used to first create inverted indexes from texts or numbers, and second, provide a means for searching the index. Solr is a Web Services interface to Lucene. Instead of writing applications reading and writing Lucene indexes directly, you can send Solr HTTP requests which are parsed and passed on to Lucene. For example, one could feed Solr sets of metadata describing, say, books, and provide a way to search the metadata to identify items of interest. (”What a novel idea!”) Using such a Web Servcies technique the programmer is free to use the programming/scripting language of their choice. No need to know Java, although Java-based programs would definitely be faster and more efficient.

For better or for worse, my programming language of choice is Perl, and upon perusing CPAN I discovered WebService::Solr — a module making it easy to interface with Solr (and therefore Lucene). After playing with WebService::Solr for a few days I became pretty impressed, thus, this posting.

Installing and configuring Solr

Installing Solr is relatively easy. Download the distribution. Save it in a convenient location on your file system. Unpack/uncompress it. Change directories to the example directory, and fire up Solr by typing java -jar start.jar at the command line. Since the distribution includes Jetty (a pint-sized HTTP server), and as long as you have not made any configuration changes, you should now be able to connect to your locally hosted Solr administrative interface through your favorite Web browser. Try, http://localhost:8983/solr/

When it comes to configuring Solr, the most important files are found in the conf directory, specifically, solrconfig.xml and schema.xml. I haven’t tweaked the former. The later denotes the types and names of fields that will ultimately be in your index. Describing in detail the in’s and out’s of solrconfig.xml and schema.xml are beyond the scope of this posting, but for our purposes here, it is important to note two things. First I modified schema.xml to include the following Dublin Core-like fields:

  <!-- a set of "Dublin Core-lite" fields -->
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text" indexed="true" stored="true" />
   <field name="text" type="text" indexed="true" stored="false" />
  </fields>

  <!-- what field is the key, very important! -->
  <uniqueKey>id</uniqueKey>

  <!-- field to search by default; the power of an index -->
  <defaultSearchField>text</defaultSearchField>

  <!-- how to deal with multiple terms -->
  <solrQueryParser defaultOperator="AND" />

  <!-- copy content into the default field -->
  <copyField source="title" dest="text" />

Second, I edited a Jetty configuration file (jetty.xml) so it listens on port 210 instead of the default port, 8983. “Remember Z39.50?”

There is a whole lot more to configuring Solr than what is outlined above. To really get a handle on the indexing process the Solr documentation is required reading.

Installing WebService::Solr

Written by Brian Cassidy and Kirk Beers, WebService::Solr is a set Perl modules used to interface with Solr. Create various WebService::Solr objects (such as fields, documents, requests, and responses), and apply methods against them to create, modify, find, add, delete, query, and optimize aspects of your underlying Lucene index.

Since WebService::Solr requires a large number of supporting modules, installing WebService::Solr is best done with using CPAN. From the CPAN command line, enter install WebService::Solr. It worked perfectly for me.

Indexing content

My first WebService::Solr script, an indexer, is a trivial example, below:

 #!/usr/bin/perl

 # trivial-index.pl - index a couple of documents

 # define
 use constant SOLR => 'http://localhost:210/solr';
 use constant DATA => ( 'Hello, World!', 'It is nice to meet you.' );

 # require
 use strict;
 use WebService::Solr;

 # initialize
 my $solr = WebService::Solr->new( SOLR );

 # process each data item
 my $index = 0;
 foreach ( DATA ) {

   # increment
   $index++;

   # populate solr fields
   my $id  = WebService::Solr::Field->new( id  => $index );
   my $title = WebService::Solr::Field->new( title => $_ );

   # fill a document with the fields
   my $doc = WebService::Solr::Document->new;
   $doc->add_fields(( $id, $title ));

   # save
   $solr->add( $doc );
   $solr->commit;

 }

 # done
 exit;

To elaborate, the script first defines the (HTTP) location of our Solr instance as well as array of data containing two elements. It then includes/requires the necessary Perl modules. One to keep our programming technique honest, and the other is our reason de existence. Third, a WebService::Solr object is created. Fourth, a pointer is initialized, and a loop instantiated reading each data element. Inside the loop the pointer is incremented and local WebService::Solr::Field objects are created using the values of the pointer and the current data element. The next step is to instantiate a WebService::Solr:Document object and fill it up with the Field objects. Finally, the Document is added to the index, and the update is committed.

If everything went according to plan, the Lucene index should now contain two documents. The first with an id equal to 1 and a title equal to “Hello, World!”. The second with an id equal to 2 and a title equal to “It is nice to meet you.” To verify this you should be able to use the following script to search your index:

  #!/usr/bin/perl

  # trivial-search.pl - query a lucene index through solr

  # define
  use constant SOLR => 'http://localhost:210/solr';

  # require
  use strict;
  use WebService::Solr;

  # initialize
  my $solr = WebService::Solr->new( SOLR );

  # sanity check
  my $query = $ARGV[ 0 ];
  if ( ! $query ) {

    print "Usage: $0 <query>\n";
    exit;

  }

  # search & get hits
  my $response = $solr->search( $query );
  my @hits = $response->docs;

  # display
  print "Your search ($query) found " . ( $#hits + 1 ) . " document(s).\n\n";
  foreach my $doc ( @hits ) {

    # slurp
    my $id    = $doc->value_for( 'id' );
    my $title = $doc->value_for( 'title' );

    # echo
    print "     id: $id\n";
    print "  title: $title\n";
    print "\n";

  }

Try queries such as hello, “hello OR meet”, or “title: world” will return results. Because the field named text includes the content of the title field, as per our definition, queries without field specifications default to the text field. Nice. The power of an index.

Here is how the script works. It first denotes the location of Solr. It then includes/requires the necessary modules. Next, it creates a WebService::Solr object. Fourth, it makes sure there is a query on the command line. Fifth, it queries Solr creating a WebService::Solr::Response object, and this object is queried for an array of hits. Finally, the hits are looped through, creating and displaying the contents of each WebService::Solr::Document object (hit) found.

Summary

This posting provided an overview of Lucene, Solr, and a set of Perl modules called WebService::Solr. It also introduced the use of the modules to index content and search it. Part II will provide a more in-depth introduction to the use of WebService::Solr and Solr in general.

Fun with the Internet Archive

Wednesday, December 10th, 2008

I’ve been having some fun with Internet Archive content.

The process

cover artMore specifically, I have created a tiny system for copying scanned materials locally, enhancing it with a word cloud, indexing it, and providing access to whole thing. There is how it works:

  1. Identify materials of interest from the Archive and copy their URLs to a text file.
  2. Feed the text file to a wget (wget.sh) which copies the plain text, PDF, XML metadata, and GIF cover art locally.
  3. Create a rudumentary word cloud (cloud.pl) against each full text version of a document in an effort to suppliment the MARC metadata.
  4. Index each item using the MARC metadata and full text (index.pl). Each index entry also includes the links to the word cloud, GIF image, PDF file, and MARC data.
  5. Provide a simple one-box, one-button interface to the index (search.pl & search.cgi). Search results appear much like the Internet Archive’s but also include the word cloud.
  6. Go to Step #1; rinse, shampoo, and repeat.

The demonstration

Attached are all the scripts I’ve written for the as-of-yet-unamed process, and you can try the demonstration at http://dewey.library.nd.edu/hacks/ia/search.cgi, but remember, there are only about two dozen items presently in the index.

The possibilities

There are many ways the system can be improved, and they can be divided into two types: 1) servcies against the index, and 2) services against the items. Services against the index include things like paging search results, making the interface “smarter”, adding things like faceted browse, implementing an advaced search, etc.

Services against the items interest me more. Given the full text it might be possible to do things like: compare & contrast documents, cite documents, convert documents into many formats, trace idea forward & backward, do morphology against words, add or subtract from “my” collection, search “my” collection, share, annotate, rank & review, summarize, create relationships between documents, etc. These sort of features I believe to be a future direction for the library profession. It is more than just get the document; it is also about doing things with them once they are acquired. The creation of the word clouds is a step in that direction. It assists in the compare & contrast of documents.

The Internet Archive makes many of these things possible because they freely distribute their content — including the full text.

InternetArchive++

WorldCat Hackathon

Sunday, November 9th, 2008

I attended the first-ever WorldCat Hackathon on Friday and Saturday (November 7 & 8), and us attendees explored ways to take advantage of various public application programmer interfaces (APIs) supported by OCLC.

Web Services

logoThe WorldCat Hackathon was an opportunity for people to get together, learn about a number of OCLC-supported APIs, and take time to explore how they can be used. These APIs are a direct outgrowth of something that started at least 6 years ago with an investigation of how OCLC’s data can be exposed through Web Service computing techniques. To date OCLC’s services fall into the following categories, and they are described in greater detail as a part of the OCLC Grid Services Web page:

  • WorldCat Search API – Search and display content from WorldCat — a collection of mostly books owned by libraries
  • Registry Services – Search and display names, addresses, and information about libraries
  • Identifier Services – Given unique keys, find similar items found in WorldCat
  • WorldCat Identities – Search and display information about authors from a name authority list
  • Terminology Services – Search and display subject authority information
  • Metadata Crosswalk Service – Convert one metadata format (MARC, MARCXML, XML/DC, MODS, etc.) into another. (For details of how this works, see “Toward element-level interoperability in bibliographic metadata” in Issue #2 of the Code4Lib Journal).

The Hacks

The event was attended by approximately fifty (50) people. The prize going to the person coming the furthest went to someone from France. A number of OCLC employees attended. Most people were from academic libraries, and most people were from the surrounding states. About three-quarters of the attendees were “hackers”, and the balance were there to learn.

Taking place in the Science, Industry and Business Library (New York Public Library), the event began with an overview of each of the Web Services and the briefest outline of how they might be used. We then quickly broke into smaller groups to “hack” away. The groups fell into a number of categories: Drupal, VUFind, Find More Like This One/Miscellaneous, and language-specific hacks. We reconvened after lunch on the second day sharing what we had done as well as what we had learned. Some of the hacks included:

  • Term Finder – Enter a term. Query the Terminology Services. Get back a list of broader and narrower terms. Select items from results. Repeat. Using such a service a person can navigate a controlled vocabulary space to select the most appropriate subject heading.
  • Name Finder – Enter a first name and a last name. Get back a list of WorldCat Identities matching the queries. Display the subject terms associated with the works of this author. Select subject terms results are displayed in Term Finder.
  • Send It To Me – Enter an ISBN number. Determine whether or not the item is held locally. If so, then allow the user to borrow the item. If not, then allow the user to find other items like that item, purchase it, and/or facilitate an interlibrary load request. All three of these services were written by myself. The first two were written at during the Hackathon. The last was written more than a year ago. All three could be used on their own or incorporated into a search results page.
  • Find More Like This One in VUFind – Written by Scott Mattheson (Yale University Library) this prototype was in the form of a number of screen shots. It allows the user to first do a search in VUFind. If desired items are checked out, then it will search for other local copies.
  • Google Map Libraries – Greg McClellan (Brandeis University) combined the WorldCat Search API, Registries Services, the Google Maps to display the locations of nearby libraries who reportably own a particular item.
  • Recommend Tags – Chad Fennell (University of Minnesota Libraries) overrode a Drupal tagging function to work with MeSH controlled vocabulary terms. In other words, as items in Drupal are being tagged, this hack leads the person doing data entry to use MeSH headings.
  • Enhancing Metadata – Piotr Adamzyk (Metropolitan Museum of Art) has access to both bibliographic and image materials. Through the use of Yahoo Pipes technology he was able to read metadata from an OAI repository, map it to metadata found in WorldCat, and ultimately supplement the metadata describing the content of his collections.
  • Pseudo-Metasearch in VUFind – Andrew Nagy (Villanova University) demonstrated how a search could be first done in VUFind, and have subsequent searches done against WorldCat by simply clicking on a tabbed interface.
  • Find More Like This One – Mark Matienzo (NYPL Labs) created an interface garnering an OCLC number as input. Given this it returned subject headings an effort to return other items. It was at this point Ralph LeVan (OCLC) said, “Why does everybody use subject headings to find similar items? Why not map your query to Dewey numbers and find items expected to be placed right next to the given item on the shelf?” Good food for thought.
  • xISBN Bookmarklette – Liu Xiaoming (OCLC) demonstrated a Web browser tool. Enter your institution’s name. Get back a browser bookmarklette. Drag bookmarklette to your toolbar. Search things like Amazon. Select ISBN number from the Web page. Click bookmarklette. Determine whether or not your local library owns the item.

Summary

Obviously the hacks created in this short period of time by a small number of people illustrate just a tiny bit of what could be done with the APIs. More importantly and IMHO, what these APIs really demonstrate is the ways librarians can have more control over their computing environment if they were to learn to exploit these tools to their greatest extent. Web Service computing techniques are particularly powerful because they are not wedded to any specific user interface. They simply provide the means to query remote services and get back sets of data. It is then up to librarians and developers — working together — to figure out what to do the the data. As I’ve said somewhere previously, “Just give me the data.”

I believe the Hackathon was a success, and I encourage OCLC to sponsor more of them.

MBooks, revisited

Monday, September 8th, 2008

This posting makes available a stylesheet to render MARCXML from a collection of records called MBooks.

In a previous post — get-mbooks.pl — I described how to use OAI-PMH to harvest MARC records from the MBooks project. The program works; it does what it is suppose to do.

The MBooks collection is growing so I harvested the content again, but this time I wanted to index it. Using an indexer/search engine called Zebra, the process was almost trivial. (See “Getting Started With Zebra” for details.)

Since Zebra supports SRU (Search/Retrieve via URL) out of the box, searches against the index return MARCXML. This will be a common returned XML stream for a while, so I needed to write an XSLT stylesheet to render the output. Thus, mbooks.xsl was born.

What is really “kewl” about the stylesheet is the simple inline Javascript allowing the librarian to view the MARC tags in all their glory. For a little while you can see how this all fits together in a simple interface to the index.

Use mbooks.xsl as you see fit, but remember “Give back to the ‘Net.”

wordcloud.pl

Monday, August 25th, 2008

Attached should be simple Perl script called wordcloud.pl. Initialize it with a hash of words and associated integers. Output rudimentary HTML in the form of a word cloud. This hack was used to create the word cloud in a posting called “Last of the Mohicans and services against texts“.

Last of the Mohicans and services against texts

Monday, August 25th, 2008

Here is a word cloud representing James Fenimore Cooper’s The Last of the Mohicans; A narrative of 1757. It is a trivial example of how libraries can provide services against documents, not just the documents themselves.

scout  heyward  though  duncan  uncas  little  without  own  eyes  before  hawkeye  indian  young  magua  much  place  long  time  moment  cora  hand  again  after  head  returned  among  most  air  huron  toward  well  few  seen  many  found  alice  manner  david  hurons  voice  chief  see  words  about  know  never  woods  great  rifle  here  until  just  left  soon  white  heard  father  look  eye  savage  side  yet  already  first  whole  party  delawares  enemy  light  continued  warrior  water  within  appeared  low  seemed  turned  once  same  dark  must  passed  short  friend  back  instant  project  around  people  against  between  enemies  way  form  munro  far  feet  nor  

About the story

While I am not a literary scholar, I am able to read a book and write a synopsis.

Set during the French And Indian War in what was to become upper New York State, two young women are being escorted from one military camp to another. Along the way the hero, Natty Bumppo (also known by quite a number of other names, most notably “Hawkeye” or the “scout”), alerts the convoy that their guide, Magua, is treacherous. Sure enough, Magua kidnaps the women. Fights and battles ensue in a pristine and idyllic setting. Heroic deeds are accomplished by Hawkeye and the “last of the Mohicans” — Uncas. Everybody puts on disguises. In the end, good triumphs over evil but not completely.

Cooper’s style is verbose. Expressive. Flowery. On this level it was difficult to read. Too many words. In the other hand the style was consistent, provided a sort of pattern, and enabled me to read the novel with a certain rhythm.

There were a couple of things I found particularly interesting. First, the allusion to “relish“. I consider this to be a common term now-a-days, but Cooper thought it needed elaboration when used to describe food. Cooper used the word within a relatively short span of text to describe condiment as well as a feeling. Second, I wonder whether or not Cooper’s description of Indians built on existing stereotypes or created them. “Hugh!”

Services against texts

The word cloud I created is simple and rudimentary. From my perspective, it is just a graphical representation of a concordance, and a concordance has to be one of the most basic of indexes. This particular word cloud (read “concordance” or “index”) allows the reader to get a sense of a text. It puts words in context. It allows the would-be reader to get an overview of the document.

This particular implementation is not pretty, nor is it quick, but it is functional. How could libraries create other services such as these? Everybody can find and get data and information these days. What people desire is help understanding and using the documents. Providing services against texts such as word clouds (concordances) might be one example.

Alex Lite: A Tiny, standards-compliant, and portable catalogue of electronic texts

Saturday, July 12th, 2008

One the beauties of XML its ability to be transformed into other plain text files, and that is what I have done with a simple software distribution called Alex Lite.

My TEI publishing system(s)

A number of years ago I created a Perl-based TEI publishing system called “My personal TEI publishing system“. Create a database designed to maintain authority lists (titles and subjects), sets of XSLT files, and TEI/XML snippets. Run reports against the database to create complete TEI files, XHTML files, RSS files, and files designed to be disseminated via OAI-PMH. Once the XHTML files are created, use an indexer to index them and provide a Web-based interface to the index. Using this system I have made accessible more than 150 of my essays, travelogues, and workshop handouts retrospectively converted as far back as 1989. Using this system, many (if not most) of my writings have been available via RSS and OAI-PMH since October 2004.

A couple of years later I morphed the TEI publishing system to enable me to mark-up content from an older version of my Alex Catalogue of Electronic Texts. Once marked up I planned to transform the TEI into a myriad of ebook formats: plain text, plain HTML, “smart” HTML, PalmPilot DOC and eReader, Rocket eBook, Newton Paperback, PDF, and TEI/XML. The mark-up process was laborious and I have only marked up about 100 texts, and you can see the fruits of these labors, but the combination of database and XML technology has enabled me to create Alex Lite.

Alex Lite

Alex Lite the result of a report written against my second TEI publishing system. Loop through each item in the database and update an index of titles. Create a TEI file against each item. Using XSLT, convert each TEI file into a plain HTML file, a “pretty” XHTML file, and a FO (Formatting Objects) file. Use a FO processor (like FOP) to convert the FO into PDF. Loop through each creator in the database to create an author index. Glue the whole thing together with an index.html file. Save all the files to a single directory and tar up the directory.

The result is a single file that can be downloaded, unpacked, and provide immediate access to sets of electronic books in an standards-compliant, operating system independent manner. Furthermore, no network connection is necessary except for the initial acquisition of the distribution. This directory can then be networked or saved to a CD-ROM. Think of the whole thing as if it were a library.

Give it a whirl; download a version of Alex Lite. Here is a list of all the items in the tiny collection:

  1. Alger Jr., Horatio (1834-1899)
    • The Cash Boy
    • Cast Upon The Breakers
  2. Bacon, Francis (1561-1626)
    • The Essays
    • The New Atlantis
  3. Burroughs, Edgar Rice (1875-1850)
    • At The Earth’s Core
    • The Beasts Of Tarzan
    • The Gods Of Mars
    • The Jungle Tales Of Tarzan
    • The Monster Men
    • A Princess Of Mars
    • The Return Of Tarzan
    • The Son Of Tarzan
    • Tarzan And The Jewels Of Opar
    • Tarzan Of The Apes
    • The Warlord Of Mars
  4. Conrad, Joseph (1857-1924)
    • The Heart Of Darkness
    • Lord Jim
    • The Secret Sharer
  5. Doyle, Arthur Conan (1859-1930)
    • The Adventures Of Sherlock Holmes
    • The Case Book Of Sherlock Holmes
    • His Last Bow
    • The Hound Of The Baskervilles
    • The Memoirs Of Sherlock Holmes
  6. Machiavelli, Niccolo (1469-1527)
    • The Prince
  7. Plato (428-347 B.C.)
    • Charmides, Or Temperance
    • Cratylus
    • Critias
    • Crito
    • Euthydemus
    • Euthyphro
    • Gorgias
  8. Poe, Edgar Allan (1809-1849)
    • The Angel Of The Odd–An Extravaganza
    • The Balloon-Hoax
    • Berenice
    • The Black Cat
    • The Cask Of Amontillado
  9. Stoker, Bram (1847-1912)
    • Dracula
    • Dracula’s Guest
  10. Twain, Mark (1835-1910)
    • The Adventures Of Huckleberry Finn
    • A Connecticut Yankee In King Arthur’s Court
    • Extracts From Adam’s Diary
    • A Ghost Story
    • The Great Revolution In Pitcairn
    • My Watch: An Instructive Little Tale
    • A New Crime
    • Niagara
    • Political Economy

XSLT

As alluded to above, the beauty of XML is its ability to be transformed into other plain text formats. XSLT allows me to convert the TEI files into other files for different mediums. The distribution includes only simple HTML, “pretty” XHTML, and PDF versions of the texts, but for the XSLT affectionatos in the crowd who may want to see the XSLT files, I have included them here:

  • tei2htm.xsl – used to create plain HTML files complete with metadata
  • tei2html.xsl – used to create XHTML files complete with metadata as well as simple CSS-enabled navigation
  • tei2fo.xsl – used to create FO files which were fed to FOP in order to create things designed for printing on paper

Here’s a sample TEI file, Edgar Allen Poe’s The Cask Of Amontillado.

Future work

I believe there is a lot of promise in the marking-up of plain text into XML, specifically works of fiction and non-fictin into TEI. Making available such marked-up texts paves the way for doing textual analysis against them and for enhancing them with personal commentary. It is too bad that the mark-up process, even simple mark-up, is so labor intensive. Maybe I’ll do more of this sort of thing in my copius spare time.