Fun with WebService::Solr, Part III of III

This is the last of a three-part series providing an overview of a set of Perl modules called WebService::Solr. In Part I, WebService::Solr was introduced with two trivial scripts. Part II put forth two command line driven scripts to index and search content harvested via OAI. Part III illustrates how to implement a Search/Retrieve via URL (SRU) search interface against an index created by WebService::Solr.

Search/Retrieve via URL

SRU

Search/Retrieve via URL (SRU) is a REST-like Web Service-based protocol designed to query remote indexes. The protocol essentially consists of three functions or “operations”. The first, explain, provides a mechanism to auto-discover the type of content and capabilities of an SRU server. The second, scan, provides a mechanism to browse an index’s content much like perusing the back-of-a-book index. The third, searchRetrieve, provides the means for sending a query to the index and getting back a response. Many of the librarians in the crowd will recognize SRU as the venerable Z39.50 protocol redesigned for the Web.
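
For example, a searchRetrieve request is nothing more than a URL. Against a hypothetical server at example.org, the following SRU (version 1.1) requests ask for the first ten records matching the query dogs and for the server’s explain document, respectively:

http://example.org/sru?operation=searchRetrieve&version=1.1&query=dogs&maximumRecords=10
http://example.org/sru?operation=explain&version=1.1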

During the past year, time has been spent joining the SRU community with the OpenSearch community to form a single, more unified set of Search Web Service protocols. OpenSearch has very similar goals to SRU — to provide standardized interfaces for searching indexes — but the techniques between it and SRU are different. Where OpenSearch’s query language is simple, SRU’s is expressive. Where OpenSearch returns an RSS-like data stream, SRU includes the ability to return just about any XML format. OpenSearch may be easier to implement, but SRU is suited for a wider number of applications. To bring SRU and OpenSearch together, and to celebrate similarities as opposed to differences, an OASIS Abstract Protocol Definition has been drafted defining how the searching of Web-based databases and indexes can be done in a standardized way.

SRU is an increasingly important protocol for the library community because a growing number of the WorldCat Grid Services are implemented using SRU. The Grid supports indexes such as lists of library holdings (WorldCat), name and subject authority files (Identities), as well as names of libraries (the Registry). By sending SRU queries to these services and mashing up the results with the output of other APIs, all sorts of library and bibliographic applications can be created.

Integrating WebService::Solr into SRU

Personally, I have been creating SRU interfaces to many of my indexes for about four years. I have created these interfaces against mailing list archives, OAI-harvested content, and MARC records. The underlying content has been indexed with swish-e, Plucene, KinoSearch, and now Lucene through WebService::Solr.

Ironic or not, I use yet another set of Perl modules — available on CPAN and called SRU — written by Brian Cassidy to implement my SRU servers. The form of my implementations is rather simple. Get the input. Determine what operation is requested. Branch accordingly. Do the necessary processing. Return a response.
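
In skeletal form, and leaving aside the SRU modules’ request/response objects, that flow looks something like the sketch below. The do_* subroutines are hypothetical placeholders, not part of any distribution:

#!/usr/bin/perl

# sru-shape.pl - the overall shape of my SRU servers, reduced to a sketch

# require
use strict;
use CGI;

# get the input and determine what operation is requested
my $cgi       = CGI->new;
my $operation = $cgi->param( 'operation' ) || 'explain';

# branch accordingly and do the necessary processing
my $xml = '';
if    ( $operation eq 'explain' )        { $xml = &do_explain( $cgi ) }
elsif ( $operation eq 'scan' )           { $xml = &do_scan( $cgi ) }
elsif ( $operation eq 'searchRetrieve' ) { $xml = &do_search( $cgi ) }
else                                     { $xml = &do_diagnostic( $cgi ) }

# return a response
print $cgi->header( -type => 'text/xml' );
print $xml;
exit;

# placeholders only; the real versions build the explain, scan,
# searchRetrieve, and diagnostic responses with the help of the SRU modules
sub do_explain    { return '<!-- explain response goes here -->' }
sub do_scan       { return '<!-- scan response goes here -->' }
sub do_search     { return '<!-- searchRetrieve response goes here -->' }
sub do_diagnostic { return '<!-- diagnostic response goes here -->' }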

The heart of my SRU implementation is a subroutine called search. It is within this subroutine where the indexer-specific hacking takes place. For example, here is the version written against WebService::Solr:

sub search {

  # initialize
  my $query   = shift;
  my $request = shift;
  my @results;
  
  # set up Solr
  my $solr = WebService::Solr->new( SOLR );
    
  # calculate start record and number of records
  my $start_record = 0;
  if ( $request->startRecord ) { $start_record = $request->startRecord - 1 }
  my $maximum_records = MAX;
  if ( $request->maximumRecords ) { $maximum_records = $request->maximumRecords }

  # search
  my $response   = $solr->search( $query, {
                                  'start' => $start_record,
                                  'rows'  => $maximum_records });
  my @hits       = $response->docs;
  my $total_hits = $response->pager->total_entries;
  
  # display the number of hits
  if ( $total_hits ) {
  
    foreach my $doc ( @hits ) {
                
      # slurp
      my $id          = $doc->value_for(  'id' );
      my $name        = &escape_entities( $doc->value_for(  'title' ));
      my $publisher   = &escape_entities( $doc->value_for(  'publisher' ));
      my $description = &escape_entities( $doc->value_for(  'description' ));
      my @creator     = $doc->values_for( 'creator' );
      my $contributor = &escape_entities( $doc->value_for(  'contributor' ));
      my $url         = &escape_entities( $doc->value_for(  'url' ));
      my @subjects    = $doc->values_for( 'subject' );
      my $source      = &escape_entities( $doc->value_for(  'source' ));
      my $format      = &escape_entities( $doc->value_for(  'format' ));
      my $type        = &escape_entities( $doc->value_for(  'type' ));
      my $relation    = &escape_entities( $doc->value_for(  'relation' ));
      my $repository  = &escape_entities( $doc->value_for(  'repository' ));

      # full results, but included entities; hmmm...
      my $record  = '<srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict"
                      xmlns:dc="http://purl.org/dc/elements/1.1/"
                      xmlns:srw_dc="info:srw/schema/1/dc-v1.1">';
      $record .= '<dc:title>' .  $name . '</dc:title>';
      $record .= '<dc:publisher>' .  $publisher . '</dc:publisher>';
      $record .= '<dc:identifier>' .  $url . '</dc:identifier>';
      $record .= '<dc:description>' .  $description . '</dc:description>';
      $record .= '<dc:source>' . $source . '</dc:source>';
      $record .= '<dc:format>' .  $format . '</dc:format>';
      $record .= '<dc:type>' .  $type . '</dc:type>';
      $record .= '<dc:contributor>' .   $contributor . '</dc:contributor>';
      $record .= '<dc:relation>' .   $relation . '</dc:relation>';
      foreach ( @creator ) { $record .= '<dc:creator>' .  $_ . '</dc:creator>' }
      foreach ( @subjects ) { $record .= '<dc:subject>' . $_ . '</dc:subject>' }
      $record .= '</srw_dc:dc>';
      push @results, $record;
            
    }
    
  }
  
  # done; return it
  return ( $total_hits, @results );
  
}

The subroutine is not unlike the search script outlined in Part II of this series. First the query, the SRU::Request object, the list of results, and the local Solr object are initialized. A pointer to the first desired hit as well as the maximum number of records to return are calculated. The search is done, and the total number of search results is saved for future reference. If the search was a success, then each of the hits is looped through and stuffed into a record: an srw_dc:dc element scoped with Dublin Core namespaces. Finally, the total number of records as well as the records themselves are returned to the main module where they are added to an SRU::Response object and returned to the SRU client.

This particular implementation is pretty rudimentary, and it does not really exploit the underlying functionality of Solr/Lucene. For example, it does not support facets, spell check, suggestions, etc. On the other hand, it does support paging, and since it is implemented under mod_perl it is just about as fast as it can get on my hardware.

Give the implementation a whirl. The underlying index includes about 20,000 records of various electronic books (from the Alex Catalogue of Electronic Texts, Project Gutenberg, and the HathiTrust), photographs (from my own adventures), journal titles, and journal articles (both from the Directory of Open Access Journals).

Summary

It is difficult for me to overstate the number of possibilities for librarianship considering the current information environment. Data and information abound! Learning has not stopped. It is sexy to be in the information business. All of the core principles of librarianship are at play in this environment. Collection. Preservation. Organization. Dissemination. The application of relational databases combined with indexers provides the means to put these core principles into practice in today’s world.

The Solr/Lucene combination is an excellent example, and WebService::Solr is just one way to get there. Again, I don’t expect every librarian to know and understand all of the things outlined in this series of essays. On the other hand, I do think it is necessary for the library community as a whole to understand this technology in the same way they understand bibliography, conservation, cataloging, and reference. Library schools need to teach it, and librarians need to explore it.

Source code

Finally, plain text versions of this series’ postings, the necessary Solr schema.xml files, as well as all the source code are available for downloading. Spend about an hour putzing around. I’m sure you will come out the other end learning something.

Fun with WebService::Solr, Part II of III

In this posting (Part II), I will demonstrate how to use WebService::Solr to create and search a more substantial index, specifically an index of metadata describing the content of the Directory of Open Access Journals. Part I of this series introduced Lucene, Solr, and WebService::Solr with two trivial examples. Part III will describe how to create an SRU interface using WebService::Solr.

Directory of Open Access Journals

The Directory of Open Access Journals (DOAJ) is a list of freely available scholarly journals. As of this writing the Directory contains approximately 3,900 titles organized into eighteen broad categories such as Arts and Architecture, Law and Political Science, and General Science. Based on my cursory examination, a large percentage of the titles are in the area of medicine.

Not only is it great that such a directory exists, but it is even greater that the Directory’s metadata — the data describing the titles in the Directory — is available for harvesting via OAI-PMH. While the metadata is rather sparse, it is more than adequate for creating rudimentary MARC records for importing into library catalogs, or better yet, incorporating into some other Web service. (No puns intended.)
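
For example, a complete harvest of the title metadata begins with an ordinary OAI-PMH ListRecords request made against the Directory’s base URL (given in the next section):

http://www.doaj.org/oai?verb=ListRecords&metadataPrefix=oai_dc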

In my opinion, the Directory is especially underutilized. For example, not only are the Directory’s journal titles available for download, but so is the metadata of about 25,000 journal articles. Given these two things (metadata describing titles as well as articles) it would be entirely possible to seed a locally maintained index of scholarly journal content and incorporate that into library “holdings”. But alas, that is another posting and another story.

Indexing the DOAJ

It is almost trivial to create a search engine against DOAJ content when you know how to implement an OAI-PMH harvester and indexer. First, you need to know the OAI-PMH root URL for the Directory, and it happens to be http://www.doaj.org/oai. Second, you need to peruse the OAI-PMH output sent by the Directory and map it to fields you will be indexing. In the case of this demonstration, the fields are id, title, publisher, subject, and URL. Consequently, I updated the schema from the first demonstration to look like this:

<!-- DC-like fields -->
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="publisher" type="text" indexed="true" stored="true" />
  <field name="subject" type="text" indexed="true" stored="true" multiValued="true" />
  <field name="url" type="text" indexed="false" stored="true" />
  <field name="text" type="text" indexed="true" stored="false" multiValued="true" />
  <field name="facet_subject" type="string" indexed="true" stored="true" multiValued="true" />
</fields>

<!-- key; for updating purposes -->
<uniqueKey>id</uniqueKey>

<!-- for non-field searches -->
<defaultSearchField>text</defaultSearchField>

<!-- AND is more precise -->
<solrQueryParser defaultOperator="AND" />

<!-- what gets searched by default -->
<copyField source="title" dest="text" />
<copyField source="subject" dest="text" />
<copyField source="publisher" dest="text" />

The astute reader will notice the addition of a field named facet_subject. This field, declared as a string type and therefore not tokenized by the indexer, is destined to be a browsable facet in the search engine. By including this sort of field in the index it is possible to return results like, “Your search identified 100 items, and 25 of them are associated with the subject Philosophy.” A very nice feature. Think of it as the explicit exploitation of controlled vocabulary terms for search results. Facets turn the use of controlled vocabularies inside out. The library community has something to learn here.

Once the schema was updated, I wrote the following script to index the journal title content from the Directory:

#!/usr/bin/perl

# index-doaj.pl - get doaj content and index it

# Eric Lease Morgan <eric_morgan@infomotions.com>
# January  12, 2009 - version 1.0


# define
use constant OAIURL => 'http://www.doaj.org/oai';
use constant PREFIX => 'oai_dc';
use constant SOLR   => 'http://localhost:210/solr';

# require
use Net::OAI::Harvester;
use strict;
use WebService::Solr;

# initialize oai and solr
my $harvester = Net::OAI::Harvester->new( baseURL => OAIURL );
my $solr      = WebService::Solr->new( SOLR );

# get all records and loop through them
my $records = $harvester->listAllRecords( metadataPrefix => PREFIX );
my $id      = 0;
while ( my $record = $records->next ) {

  # increment
  $id++;
  last if ( $id > 100 );  # comment this out to get everything

  # extract the desired metadata
  my $metadata     = $record->metadata;
  my $identifier   = $record->header->identifier;
  my $title        = $metadata->title      ? &strip( $metadata->title )     : '';
  my $url          = $metadata->identifier ? $metadata->identifier          : '';
  my $publisher    = $metadata->publisher  ? &strip( $metadata->publisher ) : '';
  my @all_subjects = $metadata->subject    ? $metadata->subject             : ();

  # normalize subjects
  my @subjects = ();
  foreach ( @all_subjects ) {

    s/DoajSubjectTerm: //;  # remove DOAJ label
    next if ( /LCC: / );    # don't want call numbers
    push @subjects, $_;

  }

  # echo
  print "      record: $id\n";
  print "  identifier: $identifier\n";
  print "       title: $title\n";
  print "   publisher: $publisher\n";
  foreach ( @subjects ) { print "     subject: $_\n" }
  print "         url: $url\n";
  print "\n";

  # create solr/lucene document
  my $solr_id        = WebService::Solr::Field->new( id        => $identifier );
  my $solr_title     = WebService::Solr::Field->new( title     => $title );
  my $solr_publisher = WebService::Solr::Field->new( publisher => $publisher );
  my $solr_url       = WebService::Solr::Field->new( url       => $url );

  # fill up a document
  my $doc = WebService::Solr::Document->new;
  $doc->add_fields(( $solr_id, $solr_title, $solr_publisher, $solr_url ));
  foreach ( @subjects ) {

    $doc->add_fields(( WebService::Solr::Field->new( subject => &strip( $_ ))));
    $doc->add_fields(( WebService::Solr::Field->new( facet_subject => &strip( $_ ))));

  }

  # save; no need for commit because it comes for free
  $solr->add( $doc );

}

# done
exit;


sub strip {

  # strip non-ascii characters; bogus since the OAI output is supposed to be UTF-8
  # see: http://www.perlmonks.org/?node_id=613773
  my $s =  shift;
  $s    =~ s/[^[:ascii:]]+//g;
  return $s;

}

The script is very much like the trivial example from Part I. It first defines a few constants. It then initializes both an OAI-PMH harvester as well as a Solr object. It then loops through each record of the harvested content, extracting the desired data. The subject data, in particular, is normalized. The data is then inserted into WebService::Solr::Field objects, which in turn are added to a WebService::Solr::Document object and saved to the underlying Lucene index.

Searching the index

Searching the index is less trivial than the example in Part I because of the facets, as the script below illustrates:

#!/usr/bin/perl

# search-doaj.pl - query a solr/lucene index of DOAJ content

# Eric Lease Morgan <eric_morgan@infomotions.com>
# January 12, 2009 - version 1.0


# define
use constant SOLR => 'http://localhost:210/solr';
use constant ROWS => 100;
use constant MIN  => 5;

# require
use strict;
use WebService::Solr;

# initialize
my $solr = WebService::Solr->new( SOLR );

# sanity check
my $query = $ARGV[ 0 ];
if ( ! $query ) {

  print "Usage: $0 <query>\n";
  exit;

}

# search; get no more than ROWS records and subject facets occurring MIN times
my $response  = $solr->search( $query, { 'rows'           => ROWS,
                                         'facet'          => 'true', 
                                         'facet.field'    => 'facet_subject', 
                                         'facet.mincount' => MIN });

# get the number of hits, and start display
my $hit_count = $response->pager->total_entries;
print "Your search ($query) found $hit_count document(s).\n\n";

# extract subject facets, and display
my %subjects = &get_facets( $response->facet_counts->{ facet_fields }->{ facet_subject } );
if ( $hit_count ) {

  print "  Subject facets: ";
  foreach ( sort( keys( %subjects ))) { print "$_ (" . $subjects{ $_ } . "); " }
  print "\n\n";
  
}

# display each hit
my $index = 0;
foreach my $doc ( $response->docs ) {

  # slurp
  my $id        = $doc->value_for( 'id' );
  my $title     = $doc->value_for( 'title' );
  my $publisher = $doc->value_for( 'publisher' );
  my $url       = $doc->value_for( 'url' );
  my @subjects  = $doc->values_for( 'subject' );

  # increment
  $index++;

  #echo
  print "     record: $index\n";
  print "         id: $id\n";
  print "      title: $title\n";
  print "  publisher: $publisher\n";
  foreach ( @subjects ) { print "    subject: $_\n" }
  print "        url: $url\n";
  print "\n";

}

# done 
exit;


sub get_facets {

  # convert the returned list of facet/hit-count pairs into a hash
  my $array_ref = shift;
  my %facet;
  for ( my $i = 0; $i < @$array_ref; $i += 2 ) {

    my $k = $array_ref->[ $i ];
    my $v = $array_ref->[ $i + 1 ];
    next if ( ! $v );
    $facet{ $k } = $v;

  }

  return %facet;

}

The script needs a bit of explaining. Like before, a few constants are defined. A Solr object is initialized, and the existence of a query string is verified. The search method makes use of a few options, specifically, options to return no more than ROWS search results as well as subject facets occurring at least MIN times. The whole thing is stuffed into a WebService::Solr::Response object which is, for better or for worse, little more than a wrapper around the JSON data structure returned by Solr. Using the pager method against the response object, the number of hits is returned, assigned to a scalar, and displayed.

The trickiest part of the script is the extraction of the facets, done by the get_facets subroutine. In WebService::Solr, facet names and their hit counts are returned in a single array reference. get_facets converts this array reference into a hash, which is then displayed. Finally, each WebService::Solr::Document object (hit) is looped through and echoed. Notice how the subject field is handled. It contains multiple values which are retrieved through the values_for method, which returns an array, not a scalar. Below is sample output for the search “library”:

Your search (library) found 84 document(s).

  Subject facets: Computer Science (7); Library and Information
Science (68); Medicine (General) (7); information science (19);
information technology (8); librarianship (16); libraries (6);
library and information science (14); library science (5);

     record: 1
         id: oai:doaj.org:0029-2540
      title: North Carolina Libraries
  publisher: North Carolina Library Association
    subject: libraries
    subject: librarianship
    subject: media centers
    subject: academic libraries
    subject: Library and Information Science
        url: http://www.nclaonline.org/NCL/

     record: 2
         id: oai:doaj.org:1311-8803
      title: Bibliosphere
  publisher: NBU Library
    subject: Bulgarian libraries
    subject: librarianship
    subject: Library and Information Science
        url: http://www.bibliosphere.eu/ 

     record: 3
         id: ...

In a hypertext environment, each of the titles in the returned records would be linked with their associated URLs. Each of the subject facets listed at the beginning of the output would be hyperlinked to subsequent searches combining the original query with the faceted term, such as “library AND subject:'Computer Science'”. An even more elaborate search interface would allow the user to page through search results and/or modify the value of MIN to increase or decrease the number of relevant facets displayed.
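
For what it is worth, generating those facet hyperlinks requires very little code. Below is a sketch assuming a CGI script named search.cgi and the uri_escape function from URI::Escape; the script name and its query parameter are hypothetical:

#!/usr/bin/perl

# facet-link.pl - turn a facet value into a link that re-runs the query
# limited to that facet; search.cgi and its query parameter are made up

# require
use strict;
use URI::Escape;

sub facet_link {

  my ( $query, $facet ) = @_;

  # combine the original query with the faceted term, quoted as a phrase
  my $new_query = uri_escape( qq($query AND subject:"$facet") );
  return qq(<a href="search.cgi?query=$new_query">$facet</a>);

}

# for example
print facet_link( 'library', 'Computer Science' ), "\n";

Note the double quotes around the faceted term; Lucene expects multi-word phrases to be double-quoted.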

Making lists searchable

Librarians love lists. We create lists of books. Lists of authors of books. Lists of journals. Lists of journal articles. Recently we have become enamored with lists of Internet resources. We pay other people for lists, and we call these people bibliographic index vendors. OCLC’s bread and butter is a list of library holdings. Librarians love lists.

Lists aren’t very useful unless they are: 1) short, 2) easily sortable, or 3) searchable. For the most part, the profession has mastered the short, sortable list, but we are challenged when it comes to searching our lists. We insist on using database applications for this, even when we don’t know how to design a (relational) database. Our searching mentality is stuck in the age of mediated online search services such as DIALOG and BRS. The profession has not come to grips with the advances in information retrieval. Keyword searching, as opposed to field searching, has its merits. Tools like Lucene, KinoSearch, Zebra, swish-e, and a host of predecessors like Harvest, WAIS, and Veronica all facilitate(d) indexing/searching.

As well as organizing information — the creation of lists — the profession needs to learn how to create its own indexes and make them searchable. While I do not advocate every librarian know how to exploit things like WebService::Solr, I do advocate the use of these technologies to a much greater degree. Without them the library profession will always be a follower in the field of information technology as opposed to a leader.

Summary

This posting, Part II of III, illustrated how to index and search content from an OAI-PMH data repository. It also advocated the increased use of indexers/search engines by the library profession. In the next and last part of this series WebService::Solr will be used as part of a Search/Retrieve via URL (SRU) interface.

Acknowledgements

Special thanks go to Brian Cassidy and Kirk Beers who wrote WebService::Solr. Additional thanks go to Ed Summers and Thomas Berger who wrote Net::OAI::Harvester. I am simply standing on the shoulders of giants.

Fun with WebService::Solr, Part I of III

This posting (Part I) is an introduction to a Perl module called WebService::Solr. In it you will learn a bit of what Solr is, how it interacts with Lucene (an indexer), and how to write two trivial Perl scripts: 1) an indexer, and 2) a search engine. Part II of this series will introduce less trivial scripts — programs to index and search content from the Directory of Open Access Journals (DOAJ). Part III will demonstrate how to use WebService::Solr to implement an SRU interface against the index of DOAJ content. After reading each Part you should have a good overview of what WebService::Solr can do, but more importantly, you should have a better understanding of the role indexers/search engines play in the world of information retrieval.

Solr, Lucene, and WebService::Solr

I must admit, I’m coming to the Solr party at least one year late. As you may or may not know, Solr is a Java-based, Web Services interface to the venerable Lucene — the current gold standard when it comes to indexers/search engines. In such an environment, Lucene (also a Java-based system) is used to first create inverted indexes from texts or numbers and, second, to provide a means for searching the index. Instead of writing applications that read and write Lucene indexes directly, you send Solr HTTP requests which are parsed and passed on to Lucene. For example, one could feed Solr sets of metadata describing, say, books, and provide a way to search the metadata to identify items of interest. (“What a novel idea!”) Using such a Web Services technique, the programmer is free to use the programming/scripting language of their choice. No need to know Java, although Java-based programs would definitely be faster and more efficient.
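
For example, once Solr is up and running (installation is described below), a query is nothing more than an HTTP GET request sent to Solr’s select handler. Assuming the default port and an index containing a title field, a URL like the following returns an XML list of matching documents:

http://localhost:8983/solr/select?q=title:world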

For better or for worse, my programming language of choice is Perl, and upon perusing CPAN I discovered WebService::Solr — a module making it easy to interface with Solr (and therefore Lucene). After playing with WebService::Solr for a few days I became pretty impressed, thus, this posting.

Installing and configuring Solr

Installing Solr is relatively easy. Download the distribution. Save it in a convenient location on your file system. Unpack/uncompress it. Change directories to the example directory, and fire up Solr by typing java -jar start.jar at the command line. Since the distribution includes Jetty (a pint-sized HTTP server), and as long as you have not made any configuration changes, you should now be able to connect to your locally hosted Solr administrative interface through your favorite Web browser. Try http://localhost:8983/solr/.

When it comes to configuring Solr, the most important files are found in the conf directory, specifically, solrconfig.xml and schema.xml. I haven’t tweaked the former. The latter defines the types and names of fields that will ultimately be in your index. Describing in detail the ins and outs of solrconfig.xml and schema.xml is beyond the scope of this posting, but for our purposes here, it is important to note two things. First, I modified schema.xml to include the following Dublin Core-like fields:

  <!-- a set of "Dublin Core-lite" fields -->
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text" indexed="true" stored="true" />
    <field name="text" type="text" indexed="true" stored="false" />
  </fields>

  <!-- what field is the key, very important! -->
  <uniqueKey>id</uniqueKey>

  <!-- field to search by default; the power of an index -->
  <defaultSearchField>text</defaultSearchField>

  <!-- how to deal with multiple terms -->
  <solrQueryParser defaultOperator="AND" />

  <!-- copy content into the default field -->
  <copyField source="title" dest="text" />

Second, I edited a Jetty configuration file (jetty.xml) so it listens on port 210 instead of the default port, 8983. “Remember Z39.50?”
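
As an aside, the example Jetty also reads its port from a Java system property, so (assuming the stock jetty.xml) the same effect can probably be had without editing any files at all:

java -Djetty.port=210 -jar start.jar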

There is a whole lot more to configuring Solr than what is outlined above. To really get a handle on the indexing process the Solr documentation is required reading.

Installing WebService::Solr

Written by Brian Cassidy and Kirk Beers, WebService::Solr is a set of Perl modules used to interface with Solr. Create various WebService::Solr objects (such as fields, documents, requests, and responses), and apply methods against them to create, modify, find, add, delete, query, and optimize aspects of your underlying Lucene index.

Since WebService::Solr requires a large number of supporting modules, installing WebService::Solr is best done using CPAN. From the CPAN command line, enter install WebService::Solr. It worked perfectly for me.

Indexing content

My first WebService::Solr script, an indexer, is a trivial example, below:

 #!/usr/bin/perl
 
 # trivial-index.pl - index a couple of documents
 
 # define
 use constant SOLR => 'http://localhost:210/solr';
 use constant DATA => ( 'Hello, World!', 'It is nice to meet you.' );
 
 # require
 use strict;
 use WebService::Solr;
 
 # initialize
 my $solr = WebService::Solr->new( SOLR );
 
 # process each data item
 my $index = 0;
 foreach ( DATA ) {
 
   # increment
   $index++;
     
   # populate solr fields
   my $id  = WebService::Solr::Field->new( id  => $index );
   my $title = WebService::Solr::Field->new( title => $_ );
 
   # fill a document with the fields
   my $doc = WebService::Solr::Document->new;
   $doc->add_fields(( $id, $title ));
 
   # save
   $solr->add( $doc );
   $solr->commit;
 
 }
 
 # done
 exit;

To elaborate, the script first defines the (HTTP) location of our Solr instance as well as an array of data containing two elements. It then includes/requires the necessary Perl modules: one to keep our programming technique honest, and the other as our raison d’être. Third, a WebService::Solr object is created. Fourth, a pointer is initialized, and a loop is begun to read each data element. Inside the loop the pointer is incremented and local WebService::Solr::Field objects are created using the values of the pointer and the current data element. The next step is to instantiate a WebService::Solr::Document object and fill it up with the Field objects. Finally, the Document is added to the index, and the update is committed.

If everything went according to plan, the Lucene index should now contain two documents. The first with an id equal to 1 and a title equal to “Hello, World!”. The second with an id equal to 2 and a title equal to “It is nice to meet you.” To verify this you should be able to use the following script to search your index:

  #!/usr/bin/perl
  
  # trivial-search.pl - query a lucene index through solr
  
  # define
  use constant SOLR => 'http://localhost:210/solr';
  
  # require
  use strict;
  use WebService::Solr;
  
  # initialize
  my $solr = WebService::Solr->new( SOLR );
  
  # sanity check
  my $query = $ARGV[ 0 ];
  if ( ! $query ) {
  
    print "Usage: $0 <query>\n";
    exit;
    
  }
  
  # search & get hits
  my $response = $solr->search( $query );
  my @hits = $response->docs;
  
  # display
  print "Your search ($query) found " . ( $#hits + 1 ) . " document(s).\n\n";
  foreach my $doc ( @hits ) {
  
    # slurp
    my $id    = $doc->value_for( 'id' );
    my $title = $doc->value_for( 'title' );
    
    # echo
    print "     id: $id\n";
    print "  title: $title\n";
    print "\n";
      
  }

Try queries such as hello, “hello OR meet”, or “title: world”; each will return results. Because the field named text includes the content of the title field, as per our definition, queries without field specifications default to the text field. Nice. The power of an index.

Here is how the script works. It first denotes the location of Solr. It then includes/requires the necessary modules. Next, it creates a WebService::Solr object. Fourth, it makes sure there is a query on the command line. Fifth, it queries Solr creating a WebService::Solr::Response object, and this object is queried for an array of hits. Finally, the hits are looped through, creating and displaying the contents of each WebService::Solr::Document object (hit) found.

Summary

This posting provided an overview of Lucene, Solr, and a set of Perl modules called WebService::Solr. It also introduced the use of the modules to index content and search it. Part II will provide a more in-depth introduction to the use of WebService::Solr and Solr in general.