Fun with WebService::Solr, Part II of III

In this posting (Part II), I will demonstrate how to use WebService::Solr to create and search a more substantial index, specifically an index of metadata describing the content of the Directory of Open Access Journals. Part I of this series introduced Lucene, Solr, and WebService::Solr with two trivial examples. Part III will describe how to create an SRU interface using WebService::Solr.

Directory of Open Access Journals

The Directory of Open Access Journals (DOAJ) is a list of freely available scholarly journals. As of this writing the Directory contains approximately 3,900 titles organized into eighteen broad categories such as Arts and Architecture, Law and Political Science, and General Science. Based on my cursory examination, a large percentage of the titles are in the area of medicine.

Not only is it great that such a directory exists, but it is even greater that the Directory’s metadata — the data describing the titles in the Directory — is available for harvesting via OAI-PMH. While the metadata is rather sparse, it is more than adequate for creating rudimentary MARC records for importing into library catalogs, or better yet, incorporating into some other Web service. (No puns intended.)
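For example, the Directory’s raw OAI-PMH output can be previewed directly in a Web browser with requests like the following; the verbs are standard OAI-PMH, and the Directory’s base URL is given in the next section:

http://www.doaj.org/oai?verb=Identify
http://www.doaj.org/oai?verb=ListRecords&metadataPrefix=oai_dc

Each returned record carries sparse but serviceable Dublin Core, roughly like this skeleton, where elided values are denoted with ellipses. The DoajSubjectTerm and LCC prefixes really do appear in the Directory’s subject elements, and the indexing script below compensates for them:

<oai_dc:dc>
  <dc:title>...</dc:title>
  <dc:publisher>...</dc:publisher>
  <dc:subject>DoajSubjectTerm: ...</dc:subject>
  <dc:subject>LCC: ...</dc:subject>
  <dc:identifier>http://...</dc:identifier>
</oai_dc:dc>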

In my opinion, the Directory is especially underutilized. For example, not only are the Directory’s journal titles available for download, but so is the metadata of about 25,000 journal articles. Given these two things (metadata describing titles as well as articles) it would be entirely possible to seed a locally maintained index of scholarly journal content and incorporate that into library “holdings”. But alas, that is another posting and another story.

Indexing the DOAJ

It is almost trivial to create a search engine against DOAJ content when you know how to implement an OAI-PMH harvester and indexer. First, you need to know the OAI-PMH root URL for the Directory, and it happens to be http://www.doaj.org/oai. Second, you need to peruse the OAI-PMH output sent by the Directory and map it to the fields you will be indexing. In the case of this demonstration, the fields are id, title, publisher, subject, and url. Consequently, I updated the schema from the first demonstration to look like this:

<!-- DC-like fields -->
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="publisher" type="text" indexed="true" stored="true" />
  <field name="subject" type="text" indexed="true" stored="true" multiValued="true" />
  <field name="url" type="text" indexed="false" stored="true" />
  <field name="text" type="text" indexed="true" stored="false" multiValued="true" />
  <field name="facet_subject" type="string" indexed="true" stored="true" multiValued="true" />
</fields>

<!-- key; for updating purposes -->
<uniqueKey>id</uniqueKey>

<!-- for non-field searches -->
<defaultSearchField>text</defaultSearchField>

<!-- AND is more precise -->
<solrQueryParser defaultOperator="AND" />

<!-- what gets searched by default -->
<copyField source="title" dest="text" />
<copyField source="subject" dest="text" />
<copyField source="publisher" dest="text" />

The astute reader will notice the addition of a field named facet_subject. This field, denoted as a string and therefore not parsed by the indexer, is destined to be a browsable facet in the search engine. By including this sort of field in the index, it is possible to return results like, “Your search identified 100 items, and 25 of them are associated with the subject Philosophy.” A very nice feature. Think of it as the explicit exploitation of controlled vocabulary terms for search results. Facets turn the use of controlled vocabularies inside out. The library community has something to learn here.
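Under the hood, faceting is requested with a handful of Solr parameters. The search script later in this posting effectively asks Solr for its facets with a request like the following sketch, where the host and port match the constants used throughout this demonstration:

http://localhost:210/solr/select?q=library&facet=true&facet.field=facet_subject&facet.mincount=5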

Once the schema was updated, I wrote the following script to index the journal title content from the Directory:

#!/usr/bin/perl

# index-doaj.pl - get doaj content and index it

# Eric Lease Morgan <eric_morgan@infomotions.com>
# January  12, 2009 - version 1.0


# define
use constant OAIURL => 'http://www.doaj.org/oai';
use constant PREFIX => 'oai_dc';
use constant SOLR   => 'http://localhost:210/solr';

# require
use Net::OAI::Harvester;
use strict;
use WebService::Solr;

# initialize oai and solr
my $harvester = Net::OAI::Harvester->new( baseURL => OAIURL );
my $solr      = WebService::Solr->new( SOLR );

# get all records and loop through them
my $records = $harvester->listAllRecords( metadataPrefix => PREFIX );
my $id      = 0;
while ( my $record = $records->next ) {

  # increment
  $id++;
  last if ( $id > 100 );  # comment this out to get everything

  # extract the desired metadata
  my $metadata     = $record->metadata;
  my $identifier   = $record->header->identifier;
  my $title        = $metadata->title      ? &strip( $metadata->title )     : '';
  my $url          = $metadata->identifier ? $metadata->identifier          : '';
  my $publisher    = $metadata->publisher  ? &strip( $metadata->publisher ) : '';
  my @all_subjects = $metadata->subject    ? $metadata->subject             : ();

  # normalize subjects
  my @subjects = ();
  foreach ( @all_subjects ) {

    s/DoajSubjectTerm: //;  # remove DOAJ label
    next if ( /LCC: / );    # don't want call numbers
    push @subjects, $_;

  }

  # echo
  print "      record: $id\n";
  print "  identifier: $identifier\n";
  print "       title: $title\n";
  print "   publisher: $publisher\n";
  foreach ( @subjects ) { print "     subject: $_\n" }
  print "         url: $url\n";
  print "\n";

  # create solr/lucene document
  my $solr_id        = WebService::Solr::Field->new( id        => $identifier );
  my $solr_title     = WebService::Solr::Field->new( title     => $title );
  my $solr_publisher = WebService::Solr::Field->new( publisher => $publisher );
  my $solr_url       = WebService::Solr::Field->new( url       => $url );

  # fill up a document
  my $doc = WebService::Solr::Document->new;
  $doc->add_fields(( $solr_id, $solr_title, $solr_publisher, $solr_url ));
  foreach ( @subjects ) {

    $doc->add_fields(( WebService::Solr::Field->new( subject => &strip( $_ ))));
    $doc->add_fields(( WebService::Solr::Field->new( facet_subject => &strip( $_ ))));

  }

  # save; no need for commit because it comes for free
  $solr->add( $doc );

}

# done
exit;


sub strip {

  # strip non-ascii characters; bogus since the OAI output is supposed to be UTF-8
  # see: http://www.perlmonks.org/?node_id=613773
  my $s =  shift;
  $s    =~ s/[^[:ascii:]]+//g;
  return $s;

}

The script is very much like the trivial example from Part I. It first defines a few constants. It then initializes both an OAI-PMH harvester and a Solr object. Next, it loops through each record of the harvested content, extracting the desired data. The subject data, in particular, is normalized. The data is then inserted into WebService::Solr::Field objects, which in turn are inserted into WebService::Solr::Document objects and added to the underlying Lucene index.
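A note about commits: as the comments at the end of this posting explain, this version of WebService::Solr automatically commits with each add. Were autocommit disabled, the updates could be flushed explicitly once the loop finishes. A minimal sketch:

# assuming autocommit is off, flush and (optionally) optimize after bulk indexing
$solr->commit;
$solr->optimize;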

Searching the index

Searching the index is less trivial than the example in Part I because of the facets. The following script does the work:

#!/usr/bin/perl

# search-doaj.pl - query a solr/lucene index of DOAJ content

# Eric Lease Morgan <eric_morgan@infomotions.com>
# January 12, 2009 - version 1.0


# define
use constant SOLR => 'http://localhost:210/solr';
use constant ROWS => 100;
use constant MIN  => 5;

# require
use strict;
use WebService::Solr;

# initialize
my $solr = WebService::Solr->new( SOLR );

# sanity check
my $query = $ARGV[ 0 ];
if ( ! $query ) {

  print "Usage: $0 <query>\n";
  exit;

}

# search; get no more than ROWS records and subject facets occurring MIN times
my $response  = $solr->search( $query, { 'rows'           => ROWS,
                                         'facet'          => 'true', 
                                         'facet.field'    => 'facet_subject', 
                                         'facet.mincount' => MIN });

# get the number of hits, and start display
my $hit_count = $response->pager->total_entries;
print "Your search ($query) found $hit_count document(s).\n\n";

# extract subject facets, and display
my %subjects = &get_facets( $response->facet_counts->{ facet_fields }->{ facet_subject } );
if ( $hit_count ) {

  print "  Subject facets: ";
  foreach ( sort( keys( %subjects ))) { print "$_ (" . $subjects{ $_ } . "); " }
  print "\n\n";
  
}

# display each hit
my $index = 0;
foreach my $doc ( $response->docs ) {

  # slurp
  my $id        = $doc->value_for( 'id' );
  my $title     = $doc->value_for( 'title' );
  my $publisher = $doc->value_for( 'publisher' );
  my $url       = $doc->value_for( 'url' );
  my @subjects  = $doc->values_for( 'subject' );

  # increment
  $index++;

  #echo
  print "     record: $index\n";
  print "         id: $id\n";
  print "      title: $title\n";
  print "  publisher: $publisher\n";
  foreach ( @subjects ) { print "    subject: $_\n" }
  print "        url: $url\n";
  print "\n";

}

# done 
exit;


sub get_facets {

  # convert the flat array of facet/hit-count pairs into a hash
  my $array_ref = shift;
  my %facet;
  my $i = 0;
  while ( $i < scalar( @$array_ref )) {

    my $k = $array_ref->[ $i ]; $i++;
    my $v = $array_ref->[ $i ]; $i++;
    next if ( ! $v );
    $facet{ $k } = $v;

  }

  return %facet;

}

The script needs a bit of explaining. Like before, a few constants are defined. A Solr object is initialized, and the existence of a query string is verified. The search method makes use of a few options, specifically, options to return no more than ROWS search results as well as subject facets occurring at least MIN times. The whole thing is stuffed into a WebService::Solr::Response object, which is, for better or for worse, essentially a JSON data structure. Using the pager method against the response object, the number of hits is returned, assigned to a scalar, and displayed.

The trickiest part of the script is the extraction of the facets, done by the get_facets subroutine. In WebService::Solr, facet names and their hit counts are returned in an array reference. get_facets converts this array reference into a hash, which is then displayed. Finally, each WebService::Solr::Document object in the response is looped through and echoed. Notice how the subject field is handled. Because it contains multiple values, they are retrieved through the values_for method, which returns an array, not a scalar.
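As an aside, because Solr returns each facet field as a flat list of name/count pairs, the conversion could also be written, assuming every name is followed by its count, as a single hash assignment:

# convert the flat list of facet/hit-count pairs directly into a hash
my %subjects = @{ $response->facet_counts->{ facet_fields }->{ facet_subject } };

Below is sample output for the search “library”: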

Your search (library) found 84 document(s).

  Subject facets: Computer Science (7); Library and Information
Science (68); Medicine (General) (7); information science (19);
information technology (8); librarianship (16); libraries (6);
library and information science (14); library science (5);

     record: 1
         id: oai:doaj.org:0029-2540
      title: North Carolina Libraries
  publisher: North Carolina Library Association
    subject: libraries
    subject: librarianship
    subject: media centers
    subject: academic libraries
    subject: Library and Information Science
        url: http://www.nclaonline.org/NCL/

     record: 2
         id: oai:doaj.org:1311-8803
      title: Bibliosphere
  publisher: NBU Library
    subject: Bulgarian libraries
    subject: librarianship
    subject: Library and Information Science
        url: http://www.bibliosphere.eu/ 

     record: 3
         id: ...

In a hypertext environment, each of the titles in the returned records would be linked with their associated URLs. Each of the subject facets listed at the beginning of the output would be hyperlinked to subsequent searches combining the original query with the faceted term, such as “library AND subject:'Computer Science'”. An even more elaborate search interface would allow the user to page through search results and/or modify the value of MIN to increase or decrease the number of relevant facets displayed.
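Such a facet drill-down might be implemented by simply re-submitting the combined query through the same search method used above. A minimal sketch, where the subject value and variable name are illustrative:

# hypothetical drill-down: the original query narrowed by a clicked facet
my $drill_down = $solr->search( 'library AND subject:"Computer Science"', { 'rows' => ROWS } );

(As one of the comments below points out, Solr’s fq parameter is an arguably cleaner way to apply such a constraint.)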

Making lists searchable

Librarians love lists. We create lists of books. Lists of authors of books. Lists of journals. Lists of journal articles. Recently we have become enamored with lists of Internet resources. We pay other people for lists, and we call these people bibliographic index vendors. OCLC’s bread and butter is a list of library holdings. Librarians love lists.

Lists aren’t very useful unless they are: 1) short, 2) easily sortable, or 3) searchable. For the most part, the profession has mastered the short, sortable list, but we are challenged when it comes to searching our lists. We insist on using database applications for this, even when we don’t know how to design a (relational) database. Our searching mentality is stuck in the age of mediated online search services such as DIALOG and BRS. The profession has not come to grips with the advances in information retrieval. Keyword searching, as opposed to field searching, has its merits. Tools like Lucene, KinoSearch, Zebra, swish-e, and a host of predecessors like Harvest, WAIS, and Veronica all facilitate(d) indexing/searching.

As well as organizing information — the creation of lists — the profession needs to learn how to create its own indexes and make them searchable. While I do not advocate every librarian know how to exploit things like WebService::Solr, I do advocate the use of these technologies to a much greater degree. Without them the library profession will always be a follower in the field of information technology as opposed to a leader.

Summary

This posting, Part II of III, illustrated how to index and search content from an OAI-PMH data repository. It also advocated the increased use of indexers/search engines by the library profession. In the next and last part of this series, WebService::Solr will be used as part of a Search/Retrieve via URL (SRU) interface.

Acknowledgements

Special thanks go to Brian Cassidy and Kirk Beers who wrote WebService::Solr. Additional thanks go to Ed Summers and Thomas Berger who wrote Net::OAI::Harvester. I am simply standing on the shoulders of giants.


Comments

  1. Commit doesn’t come for free. It seems you have autocommit enabled in solrconfig.xml – though it is usually more advisable to control this from the indexing client and disable Solr-side autocommit.

  2. Regarding adding facet constraints to a new search when drilling down, rather than ANDing it together like you have as “library AND subject:’Computer Science’”, I recommend using Solr’s fq (filter query) parameter. Using fq, your example would look like this: &q=library&fq=subject:”Computer Science”

  3. Thank you for taking a detailed look at the code.

    Alas, no and yes regarding autocommit. No, my solrconfig.xml has not turned on autocommit. Yes, despite what the WebService::Solr pod says about the commit method, the underlying WebService::Solr code turns on autocommit by default. For better or for worse, WebService::Solr gives you autocommit for free.

    I believe the best course of action here is to update the WebService::Solr pod to reflect what the code actually does.
