This is the last of a three-part series providing an overview of a set of Perl modules called WebService::Solr. In Part I, WebService::Solr was introduced with two trivial scripts. Part II put forth two command line driven scripts to index and search content harvested via OAI. Part III illustrates how to implement a Search/Retrieve via URL (SRU) search interface against an index created by WebService::Solr.
Search/Retrieve via URL
Search/Retrieve via URL (SRU) is a REST-like Web Service-based protocol designed to query remote indexes. The protocol essentially consists of three functions or “operations”. The first, explain, provides a mechanism to auto-discover the type of content and capabilities of an SRU server. The second, scan, provides a mechanism to browse an index’s content, much like perusing a back-of-the-book index. The third, searchRetrieve, provides the means for sending a query to the index and getting back a response. Many of the librarians in the crowd will recognize SRU as the venerable Z39.50 protocol redesigned for the Web.
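Cast as URLs, the three operations might look like the following when sent to a hypothetical server; the host name and query values are illustrative only:

  explain        - http://example.com/sru/?operation=explain&version=1.1
  scan           - http://example.com/sru/?operation=scan&version=1.1&scanClause=dinosaur
  searchRetrieve - http://example.com/sru/?operation=searchRetrieve&version=1.1&query=dinosaur&maximumRecords=10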
During the past year, time has been spent joining the SRU community with the OpenSearch community to form a single, more unified set of Search Web Service protocols. OpenSearch has very similar goals to SRU — to provide standardized interfaces for searching indexes — but the techniques of the two protocols are different. Where OpenSearch’s query language is simple, SRU’s is expressive. Where OpenSearch returns an RSS-like data stream, SRU includes the ability to return just about any XML format. OpenSearch may be easier to implement, but SRU is suited to a wider number of applications. To bring SRU and OpenSearch together, and to celebrate similarities as opposed to differences, an OASIS Abstract Protocol Definition has been drafted defining how the searching of Web-based databases and indexes can be done in a standardized way.
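The difference in expressiveness is easy to illustrate. An OpenSearch query is typically a simple bag of keywords, while the same search articulated in SRU’s Common Query Language (CQL) can qualify each term; the index names below are illustrative:

  OpenSearch - darwin evolution
  CQL        - dc.creator = darwin and dc.subject any "evolution selection"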
SRU is an increasingly important protocol for the library community because a growing number of the WorldCat Grid Services are implemented using SRU. The Grid supports indexes such as lists of library holdings (WorldCat), name and subject authority files (Identities), as well as names of libraries (the Registry). By sending SRU queries to these services and mashing up the results with the output of other APIs, all sorts of library and bibliographic applications can be created.
Integrating WebService::Solr into SRU
Personally, I have been creating SRU interfaces to many of my indexes for about four years. I have created these interfaces against mailing list archives, OAI-harvested content, and MARC records. The underlying content has been indexed with swish-e, Plucene, KinoSearch, and now Lucene through WebService::Solr.
Ironic or not, I use yet another set of Perl modules — available on CPAN and called SRU — written by Brian Cassidy to implement my SRU servers. The form of my implementations is rather simple. Get the input. Determine what operation is requested. Branch accordingly. Do the necessary processing. Return a response.
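A minimal sketch of such a main module, assuming the factory methods documented in the SRU distribution and omitting all error handling, might look like the following; the branch bodies are elided:

use CGI;
use SRU::Request;
use SRU::Response;

# get the input
my $cgi = CGI->new;

# determine what operation is requested, and build a matching response
my $request  = SRU::Request->newFromURI( $cgi->self_url );
my $response = SRU::Response->newFromRequest( $request );

# branch accordingly, and do the necessary processing
if    ( ref $request eq 'SRU::Request::Explain' )        { ... }
elsif ( ref $request eq 'SRU::Request::Scan' )           { ... }
elsif ( ref $request eq 'SRU::Request::SearchRetrieve' ) { ... }

# return a response
print $cgi->header( -type => 'text/xml' ), $response->asXML;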
The heart of my SRU implementation is a subroutine called search. It is within this subroutine where indexer-specific hacking takes place. Here, for example, is the version written against WebService::Solr:
sub search {

    # initialize
    my $query   = shift;
    my $request = shift;
    my @results;

    # set up Solr
    my $solr = WebService::Solr->new( SOLR );

    # calculate start record and number of records
    my $start_record = 0;
    if ( $request->startRecord ) { $start_record = $request->startRecord - 1 }
    my $maximum_records = MAX;
    $maximum_records = $request->maximumRecords if $request->maximumRecords;

    # search
    my $response = $solr->search( $query, { 'start' => $start_record,
                                            'rows'  => $maximum_records });
    my @hits       = $response->docs;
    my $total_hits = $response->pager->total_entries;

    # loop through each hit, building a Dublin Core record
    if ( $total_hits ) {

        foreach my $doc ( @hits ) {

            # slurp the fields of this hit
            my $id          = $doc->value_for( 'id' );
            my $name        = &escape_entities( $doc->value_for( 'title' ));
            my $publisher   = &escape_entities( $doc->value_for( 'publisher' ));
            my $description = &escape_entities( $doc->value_for( 'description' ));
            my @creator     = $doc->values_for( 'creator' );
            my $contributor = &escape_entities( $doc->value_for( 'contributor' ));
            my $url         = &escape_entities( $doc->value_for( 'url' ));
            my @subjects    = $doc->values_for( 'subject' );
            my $source      = &escape_entities( $doc->value_for( 'source' ));
            my $format      = &escape_entities( $doc->value_for( 'format' ));
            my $type        = &escape_entities( $doc->value_for( 'type' ));
            my $relation    = &escape_entities( $doc->value_for( 'relation' ));
            my $repository  = &escape_entities( $doc->value_for( 'repository' ));

            # full results, but included entities; hmmm...
            my $record = '<srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict"
                           xmlns:dc="http://purl.org/dc/elements/1.1/"
                           xmlns:srw_dc="info:srw/schema/1/dc-v1.1">';
            $record .= '<dc:title>'       . $name        . '</dc:title>';
            $record .= '<dc:publisher>'   . $publisher   . '</dc:publisher>';
            $record .= '<dc:identifier>'  . $url         . '</dc:identifier>';
            $record .= '<dc:description>' . $description . '</dc:description>';
            $record .= '<dc:source>'      . $source      . '</dc:source>';
            $record .= '<dc:format>'      . $format      . '</dc:format>';
            $record .= '<dc:type>'        . $type        . '</dc:type>';
            $record .= '<dc:contributor>' . $contributor . '</dc:contributor>';
            $record .= '<dc:relation>'    . $relation    . '</dc:relation>';
            foreach ( @creator )  { $record .= '<dc:creator>' . $_ . '</dc:creator>' }
            foreach ( @subjects ) { $record .= '<dc:subject>' . $_ . '</dc:subject>' }
            $record .= '</srw_dc:dc>';

            # save this record for the response
            push @results, $record;

        }

    }

    # done; return the total number of hits as well as the records themselves
    return ( $total_hits, @results );

}
The subroutine is not unlike the search script outlined in Part II of this series. First the query, the SRU::Request object, the list of results, and the local Solr object are initialized. A pointer to the first desired hit as well as the maximum number of records to return are calculated. The search is done, and the total number of search results is saved for future reference. If the search was a success, then each of the hits is looped through and stuffed into an XML element named record, scoped with a Dublin Core namespace. Finally, the total number of records as well as the records themselves are returned to the main module, where they are added to an SRU::Response object and returned to the SRU client.
This particular implementation is pretty rudimentary, and it does not really exploit the underlying functionality of Solr/Lucene. For example, it does not support facets, spell check, suggestions, etc. On the other hand, it does support paging, and since it is implemented under mod_perl it is just about as fast as it can get on my hardware.
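For what it is worth, exploiting facets would not require much more code, since WebService::Solr passes extra search options straight through to Solr. The following sketch (my illustration, not part of the implementation above) asks Solr to facet the results on a field named subject, a field name particular to my schema:

# same search as before, but ask Solr to facet the results on subject
my $response = $solr->search( $query, { 'start'       => $start_record,
                                        'rows'        => $maximum_records,
                                        'facet'       => 'true',
                                        'facet.field' => 'subject' });

# the raw Solr output, including the facet counts, is available
# through the response's content method
my $facets = $response->content->{ 'facet_counts' }{ 'facet_fields' }{ 'subject' };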
Give the implementation a whirl. The underlying index includes about 20,000 records of various electronic books (from the Alex Catalogue of Electronic Texts, Project Gutenberg, and the HathiTrust), photographs (from my own adventures), journal titles, and journal articles (both from the Directory of Open Access Journals).
Summary
It is difficult for me to overstate the number of possibilities for librarianship in the current information environment. Data and information abound! Learning has not stopped. It is sexy to be in the information business. All of the core principles of librarianship are at play in this environment. Collection. Preservation. Organization. Dissemination. The application of relational databases combined with indexers provides the means to put these core principles into practice in today’s world.
The Solr/Lucene combination is an excellent example, and WebService::Solr is just one way to get there. Again, I don’t expect every librarian to know and understand all of the things outlined in this series of essays. On the other hand, I do think it is necessary for the library community as a whole to understand this technology in the same way it understands bibliography, conservation, cataloging, and reference. Library schools need to teach it, and librarians need to explore it.
Source code
Finally, plain text versions of this series’ postings, the necessary Solr schema.xml files, as well as all of the source code are available for downloading. Spend about an hour putzing around. I’m sure you will come out the other end having learned something.
Tags: indexing, Search/Retrieve via URL (SRU), WebService::Solr
I’ve been spending some time with the OpenSearch spec lately in anticipation of using it in a particular future project I have in mind.
I’ve been quite impressed with it. I think the idea of merging SRU into OpenSearch is a good one.
It’s not really true to say that “OpenSearch returns an RSS-like data stream” — rather, an OpenSearch description can specify exactly what format is returned using a MIME type, and can even specify different URLs for retrieving different formats. So I think it’s rather compatible with SRU here. There’s no reason an OpenSearch description can’t declare results returned in MARCXML or anything else. (I hope MARCXML, MODS, etc., have declared MIME types? If not, they need them ASAP, for a billion reasons!)
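For instance, a description document along these lines could advertise two formats for the same engine; the MARCXML MIME type shown is hypothetical, since no such type has been registered:

<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>My Catalog</ShortName>
  <Url type="application/rss+xml"
       template="http://example.com/search?q={searchTerms}&amp;format=rss"/>
  <!-- hypothetical MIME type; MARCXML has no registered type -->
  <Url type="application/marcxml+xml"
       template="http://example.com/search?q={searchTerms}&amp;format=marcxml"/>
</OpenSearchDescription>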
Also, while an OpenSearch description document doesn’t declare any particular syntax for its query, there’s no reason an OpenSearch query _couldn’t_ be in CQL.
It would seem useful to me to extend OpenSearch to allow a description document to specify that CQL is supported, and specify what search indexes the server provides, or any other CQL-related metadata.
One of the most useful parts of OpenSearch is how it allows easy extensibility using custom namespaces. I am actually not that familiar with SRU, and haven’t read that OASIS document yet, but I hope they take the approach of trying to fit SRU into the existing OpenSearch standard, rather than creating a new ‘umbrella’ standard on top of it. I think the OpenSearch standard would be quite suited to this.
I perused the OpenSearch listserv recently, and I didn’t see any mention of OASIS or SRU. I hope the OASIS folks are actually talking to the OpenSearch folks about this, rather than just doing things in their own silo. The OpenSearch folks seem to me to be quite on top of things, and interested in making sure OpenSearch supports new use cases in the ‘right’ way.
Heya Jonathan… Don’t suppose you could enumerate some of your arguments as to why there should be MIME types for MODS/MARCXML etc.?
There has always been a bit of an elephant in the room with the issue of identifying schemas, from the “XML” and element set name wranglings of Z39.50 to the advanced SRU position of “it’s just a URI” (let’s hope naming the problem happens to solve it too).
If the answer to these problems is “just give everything a MIME type,” then that’s cool (we certainly need some registry), but what should the MIME type for MODS actually look like?
To be fair, Ed Summers wrote the SRU modules — I’ve just helped maintain them.
Hi there,
I have a problem with Solr’s fq (filter query) parameter. If I use q=system&fq=systype:LA on the Solr admin page, it works.
I tried both of these methods using WebService::Solr, and they failed. Can you give me some input to solve this problem?
Thanks
my $response = $solr->search( $query, {
    'start' => $start_record,
    'rows'  => $maximum_records,
    'fq'    => 'systype:LA'
});
my $response = $solr->search( $query, {
    'start'       => $start_record,
    'rows'        => $maximum_records,
    'facet.query' => 'systype',
    'systype'     => 'LA'
});
Hi there,
In my previous post, I forgot to include the multiple fq values. Basically, I want to pass multiple fq parameters through WebService::Solr. Is that possible? If so, can you give me some advice?
my $response = $solr->search( $query, {
    'start' => $start_record,
    'rows'  => $maximum_records,
    'fq'    => 'systype:LA',
    'fq'    => 'location:US'
});
This code failed.
Dave, I think you are having problems for two reasons. First, since it is possible to specify multiple facets in a query, you need to associate the value of fq with a reference to an array. Second, you need to pass the search options of your query as a reference to a hash. I think I learned these tricks from Brian Cassidy.
I had similar issues in my Alex Catalogue. Specifically, it is possible to enter a query, get a search result, and refine it through any number of facets (Solr fq values). To accommodate this functionality I wrote the following to build my Solr query:
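# get the query and any previously selected facets from the input;
# the parameter names here are illustrative
my $cgi    = CGI->new;
my $query  = $cgi->param( 'query' );
my @facets = $cgi->param( 'facet' );

# build the search options; fq points to a reference to an array
my %search_options = ();
$search_options{ 'start' } = $start_record;
$search_options{ 'rows' }  = $maximum_records;
$search_options{ 'fq' }    = [ @facets ];

# search; note the reference to %search_options
my $response = $solr->search( $query, \%search_options );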
Notice how $search_options{ 'fq' } points to a reference to an array. And notice later how the search method includes a reference to %search_options.
Maybe the following will work for you:
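# build an array of facets
my @fqs = ( 'systype:LA', 'location:US' );

# build the search options
my %search_options = ();
$search_options{ 'start' } = $start_record;
$search_options{ 'rows' }  = $maximum_records;
$search_options{ 'fq' }    = [ @fqs ];

# do the work; search
my $response = $solr->search( $query, \%search_options );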
Good luck.
Hi Eric
Thanks for your inputs. However, I am still unable to get it to work. I tried both methods. Did they work for you? My search found 0 documents when I used multiple fqs.
# build an array of facets
my @fqs = ( 'systype:LA', 'location:US' );

# build the search options
my %search_options = ();
$search_options{ 'start' } = $start_record;
$search_options{ 'rows' }  = $maximum_records;
$search_options{ 'fq' }    = [ @fqs ];

# do the work; search
my $response = $solr->search( $query, \%search_options );