This is the last of a three-part series providing an overview of a set of Perl modules called WebService::Solr. In Part I, WebService::Solr was introduced with two trivial scripts. Part II put forth two command-line driven scripts to index and search content harvested via OAI. Part III illustrates how to implement a Search/Retrieve via URL (SRU) search interface against an index created by WebService::Solr.
Search/Retrieve via URL
Search/Retrieve via URL (SRU) is a REST-like Web Service-based protocol designed to query remote indexes. The protocol essentially consists of three functions or “operations”. The first, explain, provides a mechanism to auto-discover the type of content and capabilities of an SRU server. The second, scan, provides a mechanism to browse an index’s content, much like perusing a back-of-the-book index. The third, searchRetrieve, provides the means for sending a query to the index and getting back a response. Many of the librarians in the crowd will recognize SRU as the venerable Z39.50 protocol redesigned for the Web.
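For example, the three operations might be invoked against a hypothetical SRU server with URLs like the ones below. The host (example.org) and index names are illustrative only, not real endpoints:

  http://example.org/sru?operation=explain&version=1.1
  http://example.org/sru?operation=scan&version=1.1&scanClause=dc.title=dog
  http://example.org/sru?operation=searchRetrieve&version=1.1&query=dog&startRecord=1&maximumRecords=10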
During the past year, time has been spent joining the SRU community with the OpenSearch community to form a single, more unified set of Search Web Service protocols. OpenSearch has very similar goals to SRU — to provide standardized interfaces for searching indexes — but the techniques between it and SRU are different. Where OpenSearch’s query language is simple, SRU’s is expressive. Where OpenSearch returns an RSS-like data stream, SRU includes the ability to return just about any XML format. OpenSearch may be easier to implement, but SRU is suited for a wider number of applications. To bring SRU and OpenSearch together, and to celebrate similarities as opposed to differences, an OASIS Abstract Protocol Definition has been drafted defining how the searching of Web-based databases and indexes can be done in a standardized way.
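To make the contrast concrete, below are a few illustrative CQL (Contextual Query Language) queries of the sort SRU’s searchRetrieve operation accepts. The dc.* index names are conventional but vary from server to server:

  dog and cat
  dc.title = "origin of species"
  dc.creator any "darwin huxley" and dc.date > "1850"

An equivalent OpenSearch request, by comparison, is usually little more than a list of keywords filled into a URL template.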
SRU is an increasingly important protocol for the library community because a growing number of the WorldCat Grid Services are implemented using SRU. The Grid supports indexes such as lists of library holdings (WorldCat), name and subject authority files (Identities), as well as names of libraries (the Registry). By sending SRU queries to these services and mashing up the results with the output of other APIs, all sorts of library and bibliographic applications can be created.
Integrating WebService::Solr into SRU
Personally, I have been creating SRU interfaces to many of my indexes for about four years. I have created these interfaces against mailing list archives, OAI-harvested content, and MARC records. The underlying content has been indexed with swish-e, Plucene, KinoSearch, and now Lucene through WebService::Solr.
Ironic or not, I use yet another set of Perl modules — available on CPAN and called SRU — written by Brian Cassidy to implement my SRU servers. The form of my implementations is rather simple. Get the input. Determine what operation is requested. Branch accordingly. Do the necessary processing. Return a response.
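In outline, and assuming a plain CGI environment, the main module might be sketched like this. It is a simplification, not my production code; it glosses over the explain and scan branches, and it naively hands the CQL query straight to the search subroutine described below, where a real server would first translate CQL into Solr syntax:

  #!/usr/bin/perl

  # sru.cgi - a minimal sketch of an SRU server's dispatch logic

  use strict;
  use warnings;
  use CGI;
  use SRU::Request;
  use SRU::Response;

  # get the input and parse it into an SRU::Request object
  my $cgi      = CGI->new;
  my $request  = SRU::Request->newFromURI( $cgi->self_url );
  my $response = SRU::Response->newFromRequest( $request );

  # determine what operation is requested, and branch accordingly
  if ( $request->type eq 'searchRetrieve' ) {

      # do the necessary processing
      my ( $total, @records ) = &search( $request->query, $request );
      foreach my $data ( @records ) {

          $response->addRecord( SRU::Response::Record->new(
              recordSchema => 'info:srw/schema/1/dc-v1.1',
              recordData   => $data
          ));

      }

      # report the total size of the result set, not just this page
      $response->numberOfRecords( $total );

  }

  # return a response
  print $cgi->header( -type => 'text/xml' );
  print $response->asXML;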
The heart of my SRU implementation is a subroutine called search. It is within this subroutine where indexer-specific hacking takes place. For example, consider the following implementation against WebService::Solr:
sub search {

    # initialize
    my $query   = shift;
    my $request = shift;
    my @results;

    # set up Solr
    my $solr = WebService::Solr->new( SOLR );

    # calculate start record and number of records
    my $start_record = 0;
    if ( $request->startRecord ) { $start_record = $request->startRecord - 1 }
    my $maximum_records = MAX;
    $maximum_records = $request->maximumRecords if ( $request->maximumRecords );

    # search
    my $response = $solr->search( $query, {
        'start' => $start_record,
        'rows'  => $maximum_records });
    my @hits       = $response->docs;
    my $total_hits = $response->pager->total_entries;

    # loop through each hit, building a Dublin Core record
    if ( $total_hits ) {

        foreach my $doc ( @hits ) {

            # slurp the stored fields
            my $id          = $doc->value_for( 'id' );
            my $name        = &escape_entities( $doc->value_for( 'title' ));
            my $publisher   = &escape_entities( $doc->value_for( 'publisher' ));
            my $description = &escape_entities( $doc->value_for( 'description' ));
            my @creator     = $doc->values_for( 'creator' );
            my $contributor = &escape_entities( $doc->value_for( 'contributor' ));
            my $url         = &escape_entities( $doc->value_for( 'url' ));
            my @subjects    = $doc->values_for( 'subject' );
            my $source      = &escape_entities( $doc->value_for( 'source' ));
            my $format      = &escape_entities( $doc->value_for( 'format' ));
            my $type        = &escape_entities( $doc->value_for( 'type' ));
            my $relation    = &escape_entities( $doc->value_for( 'relation' ));
            my $repository  = &escape_entities( $doc->value_for( 'repository' ));

            # full results, but included entities; hmmm...
            my $record = '<srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict"
                          xmlns:dc="http://purl.org/dc/elements/1.1/"
                          xmlns:srw_dc="info:srw/schema/1/dc-v1.1">';
            $record .= '<dc:title>'       . $name        . '</dc:title>';
            $record .= '<dc:publisher>'   . $publisher   . '</dc:publisher>';
            $record .= '<dc:identifier>'  . $url         . '</dc:identifier>';
            $record .= '<dc:description>' . $description . '</dc:description>';
            $record .= '<dc:source>'      . $source      . '</dc:source>';
            $record .= '<dc:format>'      . $format      . '</dc:format>';
            $record .= '<dc:type>'        . $type        . '</dc:type>';
            $record .= '<dc:contributor>' . $contributor . '</dc:contributor>';
            $record .= '<dc:relation>'    . $relation    . '</dc:relation>';
            foreach ( @creator )  { $record .= '<dc:creator>' . $_ . '</dc:creator>' }
            foreach ( @subjects ) { $record .= '<dc:subject>' . $_ . '</dc:subject>' }
            $record .= '</srw_dc:dc>';

            # save
            push @results, $record;

        }

    }

    # done; return it
    return ( $total_hits, @results );

}
The subroutine is not unlike the search script outlined in Part II of this series. First the query, SRU::Request object, results, and local Solr objects are initialized. A pointer to the first desired hit as well as the maximum number of records to return are calculated. The search is done, and the total number of search results is saved for future reference. If the search was a success, then the hits are looped through, each one stuffed into an XML element named record and scoped with a Dublin Core namespace. Finally, the total number of records as well as the records themselves are returned to the main module, where they are added to an SRU::Response object and returned to the SRU client.
This particular implementation is pretty rudimentary, and it does not really exploit the underlying functionality of Solr/Lucene. For example, it does not support facets, spell check, suggestions, etc. On the other hand, it does support paging, and since it is implemented under mod_perl it is just about as fast as it can get on my hardware.
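That said, exposing a Solr feature like faceting is not much work on the Solr side of the equation. The following fragment is a sketch, assuming the schema defines a facetable field named subject; WebService::Solr simply passes the extra parameters through to Solr:

  # ask Solr to compute subject facets alongside the search results
  my $response = $solr->search( $query, { 'start'       => $start_record,
                                          'rows'        => $maximum_records,
                                          'facet'       => 'true',
                                          'facet.field' => 'subject' });

  # facet_counts exposes Solr's facet output as a Perl data structure
  my $subjects = $response->facet_counts->{ facet_fields }->{ subject };

Mapping such facets into an SRU response is another matter, which is part of the reason this implementation leaves them out.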
Give the implementation a whirl. The underlying index includes about 20,000 records of various electronic books (from the Alex Catalogue of Electronic Texts, Project Gutenberg, and the HathiTrust), photographs (from my own adventures), journal titles, and journal articles (both from the Directory of Open Access Journals).
Summary
It is difficult for me to overstate the number of possibilities for librarianship considering the current information environment. Data and information abound! Learning has not stopped. It is sexy to be in the information business. All of the core principles of librarianship are at play in this environment. Collection. Preservation. Organization. Dissemination. The application of relational databases combined with indexers provides the means to put these core principles into practice in today’s world.
The Solr/Lucene combination is an excellent example, and WebService::Solr is just one way to get there. Again, I don’t expect every librarian to know and understand all of the things outlined in this series of essays. On the other hand, I do think it is necessary for the library community as a whole to understand this technology in the same way they understand bibliography, conservation, cataloging, and reference. Library schools need to teach it, and librarians need to explore it.
Source code
Finally, plain text versions of this series’ postings, the necessary Solr schema.xml files, as well as all the source code are available for downloading. Spend about an hour putzing around. I’m sure you will come out the other end learning something.