Fun with WebService::Solr, Part I of III « Infomotions Mini-Musings

Fun with WebService::Solr, Part I of III

solr logo
This posting (Part I) is an introduction to a Perl module called WebService::Solr. In it you will learn a bit of what Solr is, how it interacts with Lucene (an indexer), and how to write two trivial Perl scripts: 1) an indexer, and 2) a search engine. Part II of this series will introduce less trivial scripts — programs to index and search content from the Directory of Open Access Journals (DOAJ). Part III will demonstrate how to use WebService::Solr to implement an SRU interface against the index of DOAJ content. After reading each Part you should have a good overview of what WebService::Solr can do, but more importantly, you should have a better understanding of the role indexers/search engines play in the world of information retrieval.

Solr, Lucene, and WebService::Solr

I must admit, I’m coming to the Solr party at least one year late, and as you may or may not know, Solr is a Java-based, Web Services interface to the venerable Lucene — the current gold standard when it comes to indexers/search engines. In such an environment, Lucene (also a Java-based system) is used to first create inverted indexes from texts or numbers, and second, provide a means for searching the index. Solr is a Web Services interface to Lucene. Instead of writing applications reading and writing Lucene indexes directly, you can send Solr HTTP requests which are parsed and passed on to Lucene. For example, one could feed Solr sets of metadata describing, say, books, and provide a way to search the metadata to identify items of interest. (“What a novel idea!”) Using such a Web Servcies technique the programmer is free to use the programming/scripting language of their choice. No need to know Java, although Java-based programs would definitely be faster and more efficient.

For better or for worse, my programming language of choice is Perl, and upon perusing CPAN I discovered WebService::Solr — a module making it easy to interface with Solr (and therefore Lucene). After playing with WebService::Solr for a few days I became pretty impressed, thus, this posting.

Installing and configuring Solr

Installing Solr is relatively easy. Download the distribution. Save it in a convenient location on your file system. Unpack/uncompress it. Change directories to the example directory, and fire up Solr by typing java -jar start.jar at the command line. Since the distribution includes Jetty (a pint-sized HTTP server), and as long as you have not made any configuration changes, you should now be able to connect to your locally hosted Solr administrative interface through your favorite Web browser. Try, http://localhost:8983/solr/

When it comes to configuring Solr, the most important files are found in the conf directory, specifically, solrconfig.xml and schema.xml. I haven’t tweaked the former. The later denotes the types and names of fields that will ultimately be in your index. Describing in detail the in’s and out’s of solrconfig.xml and schema.xml are beyond the scope of this posting, but for our purposes here, it is important to note two things. First I modified schema.xml to include the following Dublin Core-like fields:

  <!-- a set of "Dublin Core-lite" fields -->
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text" indexed="true" stored="true" />
   <field name="text" type="text" indexed="true" stored="false" />
  </fields>

  <!-- what field is the key, very important! -->
  <uniqueKey>id</uniqueKey>

  <!-- field to search by default; the power of an index -->
  <defaultSearchField>text</defaultSearchField>

  <!-- how to deal with multiple terms -->
  <solrQueryParser defaultOperator="AND" />

  <!-- copy content into the default field -->
  <copyField source="title" dest="text" />

Second, I edited a Jetty configuration file (jetty.xml) so it listens on port 210 instead of the default port, 8983. “Remember Z39.50?”

There is a whole lot more to configuring Solr than what is outlined above. To really get a handle on the indexing process the Solr documentation is required reading.

Installing WebService::Solr

Written by Brian Cassidy and Kirk Beers, WebService::Solr is a set Perl modules used to interface with Solr. Create various WebService::Solr objects (such as fields, documents, requests, and responses), and apply methods against them to create, modify, find, add, delete, query, and optimize aspects of your underlying Lucene index.

Since WebService::Solr requires a large number of supporting modules, installing WebService::Solr is best done with using CPAN. From the CPAN command line, enter install WebService::Solr. It worked perfectly for me.

Indexing content

My first WebService::Solr script, an indexer, is a trivial example, below:

 #!/usr/bin/perl
 
 # trivial-index.pl - index a couple of documents
 
 # define
 use constant SOLR => 'http://localhost:210/solr';
 use constant DATA => ( 'Hello, World!', 'It is nice to meet you.' );
 
 # require
 use strict;
 use WebService::Solr;
 
 # initialize
 my $solr = WebService::Solr->new( SOLR );
 
 # process each data item
 my $index = 0;
 foreach ( DATA ) {
 
   # increment
   $index++;
     
   # populate solr fields
   my $id  = WebService::Solr::Field->new( id  => $index );
   my $title = WebService::Solr::Field->new( title => $_ );
 
   # fill a document with the fields
   my $doc = WebService::Solr::Document->new;
   $doc->add_fields(( $id, $title ));
 
   # save
   $solr->add( $doc );
   $solr->commit;
 
 }
 
 # done
 exit;

To elaborate, the script first defines the (HTTP) location of our Solr instance as well as array of data containing two elements. It then includes/requires the necessary Perl modules. One to keep our programming technique honest, and the other is our reason de existence. Third, a WebService::Solr object is created. Fourth, a pointer is initialized, and a loop instantiated reading each data element. Inside the loop the pointer is incremented and local WebService::Solr::Field objects are created using the values of the pointer and the current data element. The next step is to instantiate a WebService::Solr:Document object and fill it up with the Field objects. Finally, the Document is added to the index, and the update is committed.

If everything went according to plan, the Lucene index should now contain two documents. The first with an id equal to 1 and a title equal to “Hello, World!”. The second with an id equal to 2 and a title equal to “It is nice to meet you.” To verify this you should be able to use the following script to search your index:

  #!/usr/bin/perl
  
  # trivial-search.pl - query a lucene index through solr
  
  # define
  use constant SOLR => 'http://localhost:210/solr';
  
  # require
  use strict;
  use WebService::Solr;
  
  # initialize
  my $solr = WebService::Solr->new( SOLR );
  
  # sanity check
  my $query = $ARGV[ 0 ];
  if ( ! $query ) {
  
    print "Usage: $0 <query>\n";
    exit;
    
  }
  
  # search & get hits
  my $response = $solr->search( $query );
  my @hits = $response->docs;
  
  # display
  print "Your search ($query) found " . ( $#hits + 1 ) . " document(s).\n\n";
  foreach my $doc ( @hits ) {
  
    # slurp
    my $id    = $doc->value_for( 'id' );
    my $title = $doc->value_for( 'title' );
    
    # echo
    print "     id: $id\n";
    print "  title: $title\n";
    print "\n";
      
  }

Try queries such as hello, “hello OR meet”, or “title: world” will return results. Because the field named text includes the content of the title field, as per our definition, queries without field specifications default to the text field. Nice. The power of an index.

Here is how the script works. It first denotes the location of Solr. It then includes/requires the necessary modules. Next, it creates a WebService::Solr object. Fourth, it makes sure there is a query on the command line. Fifth, it queries Solr creating a WebService::Solr::Response object, and this object is queried for an array of hits. Finally, the hits are looped through, creating and displaying the contents of each WebService::Solr::Document object (hit) found.

Summary

This posting provided an overview of Lucene, Solr, and a set of Perl modules called WebService::Solr. It also introduced the use of the modules to index content and search it. Part II will provide a more in-depth introduction to the use of WebService::Solr and Solr in general.

Tags: indexing, Lucene, Solr, WebService::Solr

This entry was posted on Monday, January 5th, 2009 at 7:23 pm and is filed under Hacks. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

4 Responses to “Fun with WebService::Solr, Part I of III”

Eric Lease Morgan says:

January 10, 2009 at 2:48 pm

The following is a comment from Brian Cassidy, a co-author of WebService::Solr. “Re-printed” with permission:
I just noticed your 1st of a series of articles on WebService::Solr. Looks great so far! :)

Here are some comments about the code:

In the first script, you could probably use the much simpler syntax for add() rather than then full-on OOP way with Document and Field objects:
```
my $id  = WebService::Solr::Field->new( id  => $index );
my $title = WebService::Solr::Field->new( title => $_ );
my $doc = WebService::Solr::Document->new;
$doc->add_fields(( $id, $title ));
$solr->add( $doc );
$solr->commit;
```
could be as simple as:

$solr->add( { id => $index, title => $_ } );

The full object notation is available when you need it, but for simple docs, I wouldn’t bother with it.

Also, note that I’ve left off the call to commit(). WebService::Solr has an “autocommit” bit that is on by default, so, after operations that require a commit, it’s already done for you.

I should probably also mention that it would be more efficient to send multiple documents over the line rather than one at a time — though this case wouldn’t show much difference.

In the searching script you write:

print "Your search ($query) found " . ( $#hits + 1 ) . " document(s).\n\n";

While this is true for this particular index, in reality

my @hits = $response->docs;

only returns the documents attached to this particular response — which could be, for example, 10 out of 100 actual results.

The true number of matches can be found with the pager object:

$response->pager->total_entries

Hopefully I’m not giving you a bunch of details that you’re already aware of — If i am, I apologize.

Anyway, I can’t wait for the rest of the series! :)
“Thanks, Brian!
Eric Lease Morgan says:

January 22, 2009 at 9:49 pm

The source code to Parts I, II, and III are available at ./wp-content/uploads/2009/01/fun-with-webservice-solr.tar.gz
Infomotions Mini-Musings » Blog Archive » Indexing and searching the Alex Catalogue / Eric Lease Morgan says:

August 17, 2009 at 9:24 pm

[…] well, and interfacing with it was made much simpler through the use of a set of Perl modules called WebService::Solr. On the other hand, there are many ways the index could be improved such as implementing […]
dave says:

April 7, 2010 at 2:14 am

Hi there,

I can’t figure out this error. Can you help? I didn’t get any error on install of WebService::Solr.

Thanks

./trivial-search.pl Hello
Apr 6, 2010 11:09:40 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={wt=json&q=Hello} hits=2 status=0 QTime=1
Could not parse JSON response: malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before “\n\n<met…") at /usr/lib/perl5/site_perl/5.8.8/WebService/Solr/Response.pm line 42.

Error 400

HTTP ERROR: 400undefined field name
RequestURI=/solr/selectPowered by Jetty://

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09
URL: ./

Archives

Categories