Indexing, indexing, indexing

The ability to create your own indexes of electronic texts is frequently under-utilized in Library Land. It's a shame, since creating your own indexes empowers you to build focused, customizable information services that you would otherwise have to wait for a commercial vendor to provide, if ever. This column describes what indexing is and why it should be an integral part of your information services. It then reviews a number of free, Unix-based indexing systems: freewais-sf [1], Harvest [2], SWISH-E [3], and ht://Dig [4].

What is an index and indexing?

An index is a list of pointers. More specifically, it is an ordered list of terms or phrases (denoting authors, titles, or concepts) paired with pointers to content within a set of one or more documents where the terms or phrases are the focus of the content.
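
For example, a simple computerized index can be modeled as an "inverted index," a structure mapping each term to the places where it occurs. The sketch below is my own illustration in Python (the document names and texts are made up), not something taken from the systems reviewed later:

    # A minimal inverted index: each term maps to the documents (and word
    # positions) where it occurs; these are the "pointers" described above.
    from collections import defaultdict

    def build_index(documents):
        """documents maps a document name to its full text."""
        index = defaultdict(list)
        for name, text in documents.items():
            for position, term in enumerate(text.lower().split()):
                index[term].append((name, position))
        return index

    documents = {
        "hamlet.txt": "To be or not to be that is the question",
        "sonnet18.txt": "Shall I compare thee to a summer's day",
    }
    index = build_index(documents)
    print(index["to"])   # [('hamlet.txt', 0), ('hamlet.txt', 4), ('sonnet18.txt', 4)]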

According to the history books, indexing has been taking place ever since the creation of the written word. That said, for all intents and purposes, I don't think widespread indexing existed until Mr. H. W. Wilson started his book-selling efforts from his college dormitory in the early part of this century. [5] Only in the last half of this century have we seen the formal creation of an indexing profession and publications describing it.

You are probably familiar with many of the principles of indexing since they are very much akin to the principles of cataloging and classification. But unlike cataloging and classification, indexing focuses less on describing the "aboutness" of documents and more on extracting terms and page numbers from texts and arranging them into an organized structure. This structure is used to refer the reader to particular sections of the text. Indexing provides "searchability" to a text; think of how much more you get out of the Encyclopaedia Britannica because it has an index.

With the advent of computers, indexing, like everything else, is going through a few changes. In the early days of computerized indexing, researchers tried to automatically create back-of-book indexes and title indexes. Well-known solutions included KWIC (key word in context) implementations. Many of these solutions were not as effective as desired since the computer did not know which terms were "important." Even with the use of some linguistic analysis and thesaurus applications, early implementations were considered less than effective.
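
To make the point concrete, here is a rough KWIC-style routine of my own, a hypothetical sketch rather than any historical implementation. Notice that without a stop-word list it happily makes entries for words like "and" and "the," which is exactly the weakness described above:

    # A crude KWIC (key word in context) index: every word of a title
    # becomes an entry, shown with the words around it for context.
    def kwic(titles, width=3):
        entries = []
        for title in titles:
            words = title.split()
            for i, word in enumerate(words):
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                entries.append((word.lower(), left, word, right))
        return sorted(entries)

    for key, left, word, right in kwic(["Cataloging and Classification", "Indexing the Internet"]):
        print(f"{left:>30} [{word}] {right}")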

In the late 1960s and early 1970s, computerized indexing efforts focused not on creating back-of-book indexes, but rather on building entire searchable databases of terms from indexed documents. These databases are lists of terms, positions, and pointers (page numbers, file names, URLs, paragraph numbers, etc.) back to the original document(s). This brute-force method allows the searcher to apply sophisticated queries (Boolean logic, nesting, adjacency, and truncation techniques) against all the terms in the database. The MEDLARS family of databases and ERIC are examples of early and ongoing, successful applications of this technique.
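
In modern terms, such a database behaves something like the toy term list below. It is my own illustration with made-up postings, not MEDLARS or ERIC, but it shows how Boolean logic, nesting, and truncation reduce to simple set operations:

    # A tiny term database: each term points to the set of documents
    # that contain it (hypothetical data, for illustration only).
    postings = {
        "indexing":  {"doc1", "doc2", "doc4"},
        "libraries": {"doc2", "doc3"},
        "internet":  {"doc1", "doc3", "doc4"},
    }

    # indexing AND libraries
    print(postings["indexing"] & postings["libraries"])   # {'doc2'}

    # (indexing OR libraries) AND internet; nesting is just grouping
    print((postings["indexing"] | postings["libraries"]) & postings["internet"])

    # right-hand truncation, e.g. index*: the union of every matching term
    print(set().union(*(docs for term, docs in postings.items() if term.startswith("index"))))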

Despite the seeming inefficiencies of automated indexing techniques, creating your own indexes of electronic texts (or structured database records) is very inexpensive and provides access, if not perfect access, to collections hitherto unavailable. Used correctly, an indexer can provide enhanced access to a year's worth of email, the complete works of Shakespeare, a dictionary, your World Wide Web pages, pathfinders, and finding aids. The list can go on and on. I'm sure you have a number of collections of electronic texts you would like to make searchable for yourself as well as your clientele. Any of the tools reviewed below can help you accomplish this goal.

Some indexing systems reviewed

There are a number of indexing systems freely available on the Internet, as well as a growing number of commercial solutions. Below are brief reviews of four of the free solutions. They all require the Unix operating system to run.

Freewais-sf is the grand-daddy of the selections, having begun its life as WAIS, the infamous wide area information server. After WAIS became a commercial application, a number of people took up the freely available source code in an effort to make it a better system. Freewais-sf is the result of one of these efforts, the work of Ulrich Pfeifer for his Ph.D. dissertation. Freewais-sf's strength lies in its field-searching abilities and documented relevance ranking.

Harvest began as a federally funded project at the University of Colorado. It's a modular system of applications consisting primarily of a "gatherer" and a "broker." Given URLs or file system specifications, the gatherer collects documents and summarizes them into a format called SOIF (summary object interchange format). SOIF is a metadata structure similar to MARC. The broker's job is to actually index the SOIF data. In its present distribution, brokers can index the SOIF data using SWISH or WAIS techniques. Harvest's particular strength lies in its ability to easily gather and summarize a wide variety of file formats. Because of this modularity, it will even be able to summarize formats that have not been invented yet. Harvest is great for indexing well-structured HTML documents since it will take META tags and turn them into field-searchable data elements. Harvest is the system I use to index my professional home pages and create Index Morganagus. [6]
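
As a rough illustration of the gathering idea, the sketch below pulls META tags out of an HTML page and turns them into named fields. It is my own Python approximation of the concept, not Harvest code and not the actual SOIF syntax, and the sample HTML is made up:

    # A toy "gatherer": extract META tags from an HTML page and turn them
    # into named fields, roughly the way a summarizer might prepare a
    # document for field searching.
    from html.parser import HTMLParser

    class MetaHarvester(HTMLParser):
        def __init__(self):
            super().__init__()
            self.fields = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                if "name" in attrs and "content" in attrs:
                    self.fields[attrs["name"].lower()] = attrs["content"]

    html = """<html><head>
    <meta name="author" content="Eric Lease Morgan">
    <meta name="keywords" content="indexing; libraries">
    </head><body>...</body></html>"""

    harvester = MetaHarvester()
    harvester.feed(html)
    print(harvester.fields)   # {'author': 'Eric Lease Morgan', 'keywords': 'indexing; libraries'}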

SWISH-E is an acronym for Simple Web Indexing System for Humans, Enhanced. Originally developed by Kevin Hughes (who also wrote the ever-popular hypermail program for archiving mailing list messages), SWISH-E is easier to install than Harvest or freewais-sf, but offers fewer features.

ht://Dig is an indexer and search engine "developed at San Diego State University as a way to search the various web servers on the campus network." This system's unique strength is its ability to merge discrete indexes into one larger index for cross-server searching. It too is easier to install than freewais-sf and Harvest.
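
The merging idea itself is straightforward. The sketch below is mine, with made-up terms and URLs, and is not ht://Dig's actual merge mechanism; it simply combines two discrete term-to-URL indexes into one larger index:

    # Merge two discrete indexes into one larger index so that a single
    # search covers both servers.
    def merge(*indexes):
        merged = {}
        for index in indexes:
            for term, documents in index.items():
                merged.setdefault(term, set()).update(documents)
        return merged

    library_index = {"hours": {"http://library.example.edu/hours.html"}}
    campus_index  = {"hours": {"http://www.example.edu/visit.html"},
                     "admissions": {"http://www.example.edu/apply.html"}}

    print(merge(library_index, campus_index)["hours"])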

Each of these systems has something to offer that the others don't. Depending on the sort of data I am indexing, I choose either freewais-sf or Harvest. Freewais-sf is fast, and its indexes do not have to be rebuilt regularly. Furthermore, its field-searching capabilities are superior to Harvest's and are more intuitive. Unfortunately, freewais-sf is not as well supported as Harvest and is much more difficult to install.

Harvest, on the other hand, excels at providing indexed access to well- and consistently structured HTML documents. Harvest services can run more automatically than freewais-sf services, but its output leaves a little bit to be desired. If I had to choose one solution, I would choose freewais-sf since it allows me to structure my documents in just about any format I choose, and I can then index them and provide field-searching features against the index.

Below is a chart summarizing the features of these indexing systems.

Feature                  freewais-sf   Harvest   SWISH-E   ht://Dig
indexes plain text            1           1         1         1
indexes HTML                  1           1         1         1
indexes PDF                   0           1         0         0
indexes images                1           1         1         1
indexes local files           1           1         1         0
includes robot                0           1         0         1
freetext searching            1           1         1         1
soundex searching             1           0         0         1
field searching               1           1         0         0
nesting                       1           1         0         0
right-hand truncation         1           1         0         0
Boolean logic                 1           1         1         1
relevance ranking             1           1         1         1
user-select ranking           0           1         0         0
thesaurus creation            1           0         0         0
stop words                    1           0         0         0
does not go stale             1           0         1         1
totals                       14          13         8         8

A different indexing task

A novel use of indexing is the indexing of your own scholarly articles or reports. Suppose you have written an HTML text that is a few pages long. The text includes a number of footnotes, and these footnotes contain URLs. The document alone may not need indexing. On the other hand, if you index your document, as well as the top pages of the URLs in the footnotes, then you enrich your document with the content of those remote pages.
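
One way to think about the process is sketched below: gather the local document plus the top pages of the URLs it cites, then hand the whole collection to whatever indexer you prefer. The filename, the URL pattern, and the fetching details are my own assumptions, not a prescription from any of the tools above:

    # Gather a local HTML document plus the top pages of the URLs found in
    # it, so that both can be fed to an indexer. (A sketch only; the
    # filename is made up and error handling is minimal.)
    import re
    import urllib.request

    def collect(filename):
        pages = {}
        with open(filename, encoding="utf-8") as handle:
            text = handle.read()
        pages[filename] = text
        for url in re.findall(r'href="(http[^"]+)"', text):
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    pages[url] = response.read().decode("utf-8", "replace")
            except OSError:
                pass   # skip pages that cannot be fetched
        return pages

    # pages = collect("pointers.html")   # then hand the pages to your indexer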

I did this with a handout I created for a presentation called "Pointers 4 Searching, Searching 4 Pointers." [7] The presentation was about searching the Internet. My document describes the concept of Boolean logic, and a search of the Pointers index returns a number of other documents cited in the handout that also describe Boolean logic. Thus, if my description is confusing, it is possible to search the Pointers index in hopes of finding a more meaningful description.

I encourage you to practice some indexing on your own with the tools reviewed above as well as some of the tools reviewed on the American Society of Indexers Home Page. [8] Search the Pointers index and/or the index created for this column, and tell me what you think.

Notes

  1. http://amaunet.cs.uni-dortmund.de/projects/ir/freeWAIS-sf/fwsf_toc.html
  2. http://www.tardis.ed.ac.uk/harvest/
  3. http://sunsite.berkeley.edu/SWISH-E/
  4. http://htdig.sdsu.edu/
  5. http://www.hwwilson.com/history.html
  6. http://sunsite.berkeley.edu/IndexMorganagus/
  7. http://www.lib.ncsu.edu/staff/morgan/pointers/
  8. http://www.well.com/user/asi/

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This article was originally published in Computers in Libraries.
Date created: 1998-04-17
Date updated: 2004-11-07
Subject(s): indexing;
URL: http://infomotions.com/musings/indexing/