Fun with WebService::Solr, Part I of III

This posting (Part I) is an introduction to a Perl module called WebService::Solr. In it you will learn a bit of what Solr is, how it interacts with Lucene (an indexer), and how to write two trivial Perl scripts: 1) an indexer, and 2) a search engine. Part II of this series will introduce less trivial scripts — programs to index and search content from the Directory of Open Access Journals (DOAJ). Part III will demonstrate how to use WebService::Solr to implement an SRU interface against the index of DOAJ content. After reading each Part you should have a good overview of what WebService::Solr can do, but more importantly, you should have a better understanding of the role indexers/search engines play in the world of information retrieval.

Solr, Lucene, and WebService::Solr

I must admit, I’m coming to the Solr party at least one year late. As you may or may not know, Solr is a Java-based, Web Services interface to the venerable Lucene — the current gold standard when it comes to indexers/search engines. In such an environment, Lucene (also a Java-based system) is used first to create inverted indexes from texts or numbers, and second to provide a means for searching those indexes. Solr sits in front of Lucene as a Web Services layer. Instead of writing applications that read and write Lucene indexes directly, you send Solr HTTP requests, which it parses and passes on to Lucene. For example, one could feed Solr sets of metadata describing, say, books, and provide a way to search the metadata to identify items of interest. (“What a novel idea!”) Using such a Web Services technique the programmer is free to use the programming/scripting language of their choice. No need to know Java, although Java-based programs would definitely be faster and more efficient.
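
To make the “HTTP requests” idea concrete, here is the sort of request involved. The sketch below (in Perl, assuming LWP::Simple and URI::Escape are installed and a stock Solr instance is listening on the default port, 8983) simply fetches a select URL and prints the raw XML response:

  #!/usr/bin/perl
  
  # raw-search.pl - query Solr over plain HTTP; a sketch only
  # assumes a stock Solr instance listening on localhost:8983
  
  # require
  use strict;
  use LWP::Simple;
  use URI::Escape;
  
  # build a select URL by hand; q is the query and rows limits the hits
  my $query = 'hello';
  my $url   = 'http://localhost:8983/solr/select?q=' . uri_escape( $query ) . '&rows=10';
  
  # fetch and display the raw XML response
  my $xml = get( $url );
  die "No response from Solr at $url\n" unless defined $xml;
  print $xml;
  
  # done
  exit;

Everything described in the rest of this posting ultimately boils down to HTTP requests like this one.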

For better or for worse, my programming language of choice is Perl, and upon perusing CPAN I discovered WebService::Solr — a module making it easy to interface with Solr (and therefore Lucene). After playing with WebService::Solr for a few days I became pretty impressed, thus, this posting.

Installing and configuring Solr

Installing Solr is relatively easy. Download the distribution. Save it in a convenient location on your file system. Unpack/uncompress it. Change directories to the example directory, and fire up Solr by typing java -jar start.jar at the command line. Since the distribution includes Jetty (a pint-sized HTTP server), and as long as you have not made any configuration changes, you should now be able to connect to your locally hosted Solr administrative interface through your favorite Web browser. Try http://localhost:8983/solr/.
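
If you would rather check from a script than from a browser, a few lines of Perl will do the trick. This is only a sketch; it assumes the default port and that LWP::Simple is installed:

  #!/usr/bin/perl
  
  # is-solr-up.pl - a quick sanity check; a sketch assuming the default port
  
  use strict;
  use LWP::Simple;
  
  my $url = 'http://localhost:8983/solr/';
  
  if ( defined get( $url ) ) { print "Solr seems to be listening at $url\n" }
  else                       { print "No response from $url; is Solr running?\n" }
  
  exit;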

When it comes to configuring Solr, the most important files are found in the conf directory, specifically solrconfig.xml and schema.xml. I haven’t tweaked the former. The latter denotes the types and names of fields that will ultimately be in your index. Describing the ins and outs of solrconfig.xml and schema.xml in detail is beyond the scope of this posting, but for our purposes here it is important to note two things. First, I modified schema.xml to include the following Dublin Core-like fields:

  <!-- a set of "Dublin Core-lite" fields -->
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text" indexed="true" stored="true" />
   <field name="text" type="text" indexed="true" stored="false" />
  </fields>

  <!-- what field is the key, very important! -->
  <uniqueKey>id</uniqueKey>

  <!-- field to search by default; the power of an index -->
  <defaultSearchField>text</defaultSearchField>

  <!-- how to deal with multiple terms -->
  <solrQueryParser defaultOperator="AND" />

  <!-- copy content into the default field -->
  <copyField source="title" dest="text" />

Second, I edited a Jetty configuration file (jetty.xml) so it listens on port 210 instead of the default port, 8983. “Remember Z39.50?”

There is a whole lot more to configuring Solr than what is outlined above. To really get a handle on the indexing process the Solr documentation is required reading.

Installing WebService::Solr

Written by Brian Cassidy and Kirk Beers, WebService::Solr is a set of Perl modules used to interface with Solr. You create various WebService::Solr objects (such as fields, documents, requests, and responses) and apply methods against them to add to, delete from, query, and optimize your underlying Lucene index.

Since WebService::Solr requires a large number of supporting modules, installing WebService::Solr is best done using CPAN. From the CPAN command line, enter install WebService::Solr. It worked perfectly for me.
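
Once the module is installed, its flavor is easy to get. The sketch below exercises a couple of housekeeping methods (delete_by_query and optimize, assuming they are available in your version of the module); treat it as an illustration of the object-oriented style rather than a definitive recipe, and point the URL at your own Solr instance:

  #!/usr/bin/perl
  
  # housekeeping.pl - a sketch of a few other WebService::Solr methods
  # assumes the delete_by_query and optimize methods are available in
  # your version of the module; adjust the Solr URL for your installation
  
  # require
  use strict;
  use WebService::Solr;
  
  # initialize
  my $solr = WebService::Solr->new( 'http://localhost:210/solr' );
  
  # remove everything matching a query, then make the deletion permanent
  $solr->delete_by_query( 'title:world' );
  $solr->commit;
  
  # merge the index's segments for faster searching
  $solr->optimize;
  
  # done
  exit;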

Indexing content

My first WebService::Solr script, an indexer, is the trivial example below:

 #!/usr/bin/perl
 
 # trivial-index.pl - index a couple of documents
 
 # define
 use constant SOLR => 'http://localhost:210/solr';
 use constant DATA => ( 'Hello, World!', 'It is nice to meet you.' );
 
 # require
 use strict;
 use WebService::Solr;
 
 # initialize
 my $solr = WebService::Solr->new( SOLR );
 
 # process each data item
 my $index = 0;
 foreach ( DATA ) {
 
   # increment
   $index++;
     
   # populate solr fields
   my $id  = WebService::Solr::Field->new( id  => $index );
   my $title = WebService::Solr::Field->new( title => $_ );
 
   # fill a document with the fields
   my $doc = WebService::Solr::Document->new;
   $doc->add_fields(( $id, $title ));
 
   # save
   $solr->add( $doc );
   $solr->commit;
 
 }
 
 # done
 exit;

To elaborate, the script first defines the (HTTP) location of our Solr instance as well as an array of data containing two elements. It then includes/requires the necessary Perl modules: one to keep our programming technique honest, and the other, our raison d’être. Third, a WebService::Solr object is created. Fourth, a counter is initialized, and a loop reads each data element. Inside the loop the counter is incremented, and local WebService::Solr::Field objects are created using the values of the counter and the current data element. The next step is to instantiate a WebService::Solr::Document object and fill it up with the Field objects. Finally, the Document is added to the index, and the update is committed.
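
As a small variation on the same script, the commit can be moved outside the loop so Solr is asked to make the additions searchable only once, after the whole batch has been added; nothing else changes:

  #!/usr/bin/perl
  
  # trivial-index-batch.pl - the same indexer, but with a single commit
  # a sketch of a variation on the script above; one commit per batch
  # saves Solr from re-opening its index after every add
  
  # define
  use constant SOLR => 'http://localhost:210/solr';
  use constant DATA => ( 'Hello, World!', 'It is nice to meet you.' );
  
  # require
  use strict;
  use WebService::Solr;
  
  # initialize
  my $solr  = WebService::Solr->new( SOLR );
  my $index = 0;
  
  # process each data item
  foreach ( DATA ) {
  
    # populate solr fields and fill a document
    $index++;
    my $id    = WebService::Solr::Field->new( id    => $index );
    my $title = WebService::Solr::Field->new( title => $_ );
    my $doc   = WebService::Solr::Document->new;
    $doc->add_fields(( $id, $title ));
  
    # add now, commit later
    $solr->add( $doc );
  
  }
  
  # one commit for the whole batch
  $solr->commit;
  
  # done
  exit;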

If everything went according to plan, the Lucene index should now contain two documents. The first with an id equal to 1 and a title equal to “Hello, World!”. The second with an id equal to 2 and a title equal to “It is nice to meet you.” To verify this you should be able to use the following script to search your index:

  #!/usr/bin/perl
  
  # trivial-search.pl - query a lucene index through solr
  
  # define
  use constant SOLR => 'http://localhost:210/solr';
  
  # require
  use strict;
  use WebService::Solr;
  
  # initialize
  my $solr = WebService::Solr->new( SOLR );
  
  # sanity check
  my $query = $ARGV[ 0 ];
  if ( ! $query ) {
  
    print "Usage: $0 <query>\n";
    exit;
    
  }
  
  # search & get hits
  my $response = $solr->search( $query );
  my @hits = $response->docs;
  
  # display
  print "Your search ($query) found " . ( $#hits + 1 ) . " document(s).\n\n";
  foreach my $doc ( @hits ) {
  
    # slurp
    my $id    = $doc->value_for( 'id' );
    my $title = $doc->value_for( 'title' );
    
    # echo
    print "     id: $id\n";
    print "  title: $title\n";
    print "\n";
      
  }

Queries such as hello, “hello OR meet”, or “title:world” will return results. Because the content of the title field is copied into the field named text, as per our schema definition, queries without field specifications are run against the text field by default. Nice. The power of an index.

Here is how the script works. It first denotes the location of Solr. It then includes/requires the necessary modules. Next, it creates a WebService::Solr object. Fourth, it makes sure there is a query on the command line. Fifth, it queries Solr, creating a WebService::Solr::Response object, which is then asked for an array of hits. Finally, the script loops through the hits, displaying the contents of each WebService::Solr::Document object (hit) found.
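
The search method will also accept a hash reference of extra parameters that get passed along to Solr. Below is a sketch of fetching one “page” of hits at a time; it assumes the two-argument form of search and the standard Solr rows and start parameters:

  #!/usr/bin/perl
  
  # paged-search.pl - a sketch of fetching one "page" of hits at a time
  # assumes search( $query, \%options ) and the standard rows/start parameters
  
  # define
  use constant SOLR => 'http://localhost:210/solr';
  
  # require
  use strict;
  use WebService::Solr;
  
  # initialize
  my $solr  = WebService::Solr->new( SOLR );
  my $query = $ARGV[ 0 ] || 'hello';
  
  # ask for the first ten hits only
  my $response = $solr->search( $query, { rows => 10, start => 0 } );
  
  # display
  foreach my $doc ( $response->docs ) {
  
    print $doc->value_for( 'id' ), "\t", $doc->value_for( 'title' ), "\n";
  
  }
  
  # done
  exit;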

Summary

This posting provided an overview of Lucene, Solr, and a set of Perl modules called WebService::Solr. It also introduced the use of the modules to index content and search it. Part II will provide a more in-depth introduction to the use of WebService::Solr and Solr in general.

Indexing MARC records with MARC4J and Lucene

In anticipation of the eXtensible Catalog (XC) project, I wrote my first Java programs a few months ago to index MARC records, and you can download them from here.

The first uses MARC4J and Lucene to parse and index MARC records. The second uses Lucene to search the index created by the first program. They are very simple programs — functional and not feature-rich. For the budding Java programmer in libraries, these programs could be used as part of a rudimentary self-paced tutorial. From the distribution’s README:

This is the README file for two Java programs called Index and Search.

Index and Search are my first (real) Java programs. Using Marc4J, Index
reads a set of MARC records, parses them (for authors, titles, and call
numbers), and feeds the data to Lucene for indexing. To get the program
going you will need to:

  1. Get the MARC4J .jar files, and make sure they are in your CLASSPATH.
  2. Get the Lucene .jar files, and make sure they are in your CLASSPATH.
  3. Edit Index.java so the value of InputStream points to a set of MARC records.
  4. Create a directory named index in the same directory as the source code.
  5. Compile the source (javac Index.java).
  6. Run the program (java Index).

The program should echo the parsed data to the screen and create an
index in the index directory. It takes me about fifteen minutes to index
700,000 records.

The second program, Search, is designed to query the index created by
the first program. To get it to run you will need to:

  1. Get the Lucene .jar files, and make sure they are in your CLASSPATH.
  2. Make sure the index created by Index is located in the same directory as the source code.
  3. Compile the source (javac Search.java).
  4. Run the program (java Search <query>, where <query> is a word or phrase).

The result should be a list of items from the index. Simple.

Enjoy?!