Posts Tagged ‘Java’

Indexing MARC records with MARC4J and Lucene

Wednesday, July 9th, 2008

In anticipation of the eXtensible Catalog (XC) project, I wrote my first Java programs a few months ago to index MARC records, and you can download them from here.

The first uses MARC4J and Lucene to parse and index MARC records. The second uses Lucene to search the index created from the first program. They are very simple programs — functional and not feature-rich. For the budding Java programmer in libraries, these programs could be used as a part a rudimentary self-paced tutorial. From the distribution’s README:

This is the README file for two Java programs called Index and Search.

Index and Search are my first (real) Java programs. Using Marc4J, Index
reads a set of MARC records, parses them (for authors, titles, and call
numbers), and feeds the data to Lucene for indexing. To get the program
going you will need to:

  1. Get the MARC4J .jar files, and make sure they are in your CLASSPATH.
  2. Get the Lucene .jar files, and make sure they are in your CLASSPATH.
  3. Edit Index.java so the value of InputStream points to a set of MARC records.
  4. Create a directory named index in the same directory as the source code.
  5. Compile the source (javac Index.java).
  6. Run the program (java Index).

The program should echo the parsed data to the screen and create an
index in the index directory. It takes me about fifteen minutes to index
700,000 records.

The second program, Search, is designed to query the index created by
the first program. To get it to run you will need to:

  1. Get the Lucene .jar files, and make sure they are in your CLASSPATH.
  2. Make sure the index created by Index is located in the same directory as the source code.
  3. Compile the source (javac Search.java).
  4. Run the program (java Search where is a word or phrase).

The result should be a list items from the index. Simple.

Enjoy?!