Really Rudimentary Catalog

This text describes the purpose of the Really Rudimentary Catalog, ponders the usefulness of library catalogs in general, and finally outlines the technology behind this catalog's implementation. You should be able to download this software from the following URL: http://infomotions.com/books/really-rudimentary-catalog.tar.gz

The text that follows is not necessarily scholarly, but more like a columnist's viewpoint.

Purpose
Library catalogs
Technology of this implementation

Purpose

The purpose of this catalog is simply to inventory the books I own. Through the cataloging process I will be able to get a better idea of what books are in my collection and how they are related to each other. With the extra information embedded in the descriptions of the books, I hope to also identify books that I might not have considered when searching for information such as recipes or medieval history.

Library catalogs

For better or for worse, my vocation is also my avocation. In other words, what I do for a living is also what I do as a hobby. (Don't worry. I do have a life.) That being the case, I am going to take this opportunity to reflect on the function and utility of traditional library catalogs. As a person who has worked in libraries for about twenty years, I believe I'm qualified to at least say something on the subject.

When you read about the history of books and libraries you learn about clay tablets, papyrus scrolls, the loss of the Library of Alexandria, codexes, Irish monks and scriptoriums, national and public libraries, the Dewey Decimal System, and card catalogs. Through the reading you learn that the number of books constituting "great" libraries has increased over time. In the Middle Ages a library of 500 books would have been considered large. Nowadays research libraries in North America boast of collections containing between 3 and 10 million volumes. Over time, as the size of library collections increased, so did the difficulty of knowing what books were in a collection. Furthermore, books used to be very expensive. They were rare, required great technical skill to create, were possibly encrusted with precious gems and gold, and literally embodied a wealth of information. Individuals wanted to keep track of their valuable assets and thus the library catalog was born.

The earliest known catalogs were simply accession lists enumerating items in a collection. As collections grew, accession lists were supplemented with lists organized by authors and titles. Later, much later, items in collections were additionally organized by subjects. It wasn't until relatively recently that the sets of standardized subject headings librarians call "controlled vocabularies" were used to describe items in collections.

With the invention of the "card catalog" also came increased standardization and specialization of what it meant to catalog a book. Consider the old library card catalog. First, each card in the catalog had a limited amount of space for writing/typing, about 3" x 5". Second, there were lots of items in libraries to describe but a limited number of people to do the work. The cards of card catalogs contained information about a book's author, title, physical description, imprint (publication data), the briefest of notes such as "Includes index and bibliographic references", subject headings, "added entries" constituting additional authors and titles, and finally a call number pointing to the physical location of the book. That's a lot of information to put on a single index card, and it usually didn't fit! Consequently the information was spread out among two or more cards. Furthermore, to increase access each "card" was duplicated many times over. One "card set" would include author cards, title cards, added entry cards, and subject cards. Each book may have been described with more than one subject, and consequently there would have to be a card for each of those subjects. Standard practice was to have no more than three to five subject headings per book in order to reduce the number of cards a library would need to generate. All in all, it was not uncommon for a single book to generate between twenty and thirty cards, all of which had to be filled correctly.

With these things in mind, it is important to understand that there were only five ways to locate books of interest in a library:

patrons could use the card catalog to look up an author
patrons could use the card catalog to look up a title
patrons could use the card catalog to look up controlled vocabulary terms -- subjects
patrons could ask librarians (or other people) for advice
patrons could browse the shelves for items of interest

In 1965 the United States' Library of Congress invented a data structure called MARC (MAchine Readable Cataloging). MARC was designed to encode the information of catalog cards in a digital format. Once the cataloging data was digitized it was saved to a tape and shipped to other libraries, where it was decoded and used to print catalog cards. Considering the library profession's long-standing tradition of resource sharing, this was a very good idea because it ultimately reduced the amount of original cataloging any one library had to do. "Why should I catalog this book if the Library of Congress has already done it?"

The MARC record data structure is an interesting beast in that it is a wonderful application of technology considering the time. MARC is essentially a sequential file format, meaning records are intended to be read from beginning to end. (Remember, they were designed for tape, not necessarily a random access format such as a disk drive.) MARC records are divided into three parts: 1) the leader, 2) the directory, and 3) the bibliographic data. The leader is always twenty-four characters long, and its first five characters are a number, padded on the left with zeros, denoting the length of the record. Therefore MARC records can be no longer than 99,999 characters. Other information in the leader denotes things like where the bibliographic data begins (which implies the length of the directory) and what type of record it is. The directory is a sequence of twelve-character entries, each denoting the tag, length, and starting offset of one field in the bibliographic section of the record. Finally, the bibliographic section of a MARC record is divided into fields and subfields where a book's descriptive data is encoded. There are literally 1,000 possible fields in the bibliographic section of a MARC record, and they echo the fields of traditional card catalog cards: control numbers, authors, titles, physical descriptions, notes, subjects, added entries, locations, and more notes.
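
To make the leader/directory/data layout concrete, here is a minimal sketch in Python. It is not part of Really Rudimentary Catalog (whose scripts are written in Perl), and the function names build_record, parse_record, and split_subfields are made up for illustration. It assumes plain ASCII text, so that characters and bytes are the same thing, and it builds its own tiny record rather than reading a real file.

```python
FIELD_END = "\x1e"    # terminates the directory and every field
RECORD_END = "\x1d"   # terminates the whole record
SUBFIELD = "\x1f"     # precedes each one-letter subfield code


def build_record(fields):
    """Assemble a small MARC-style record from (tag, data) pairs (demo only)."""
    directory, body, offset = "", "", 0
    for tag, data in fields:
        data += FIELD_END
        # directory entry: 3-char tag, 4-digit length, 5-digit starting offset
        directory += f"{tag}{len(data):04d}{offset:05d}"
        body += data
        offset += len(data)
    directory += FIELD_END
    base = 24 + len(directory)          # where the bibliographic data begins
    total = base + len(body) + 1        # +1 for the record terminator
    leader = f"{total:05d}nam  22{base:05d} a 4500"   # always 24 characters
    return leader + directory + body + RECORD_END


def parse_record(record):
    """Return (record_length, [(tag, data), ...]) by walking the directory."""
    leader = record[:24]
    record_length = int(leader[0:5])
    base = int(leader[12:17])
    directory = record[24:base - 1]     # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[0:3], int(entry[3:7]), int(entry[7:12])
        # the stored length includes the field terminator, so leave it off
        fields.append((tag, record[base + start:base + start + length - 1]))
    return record_length, fields


def split_subfields(data):
    """Split a data field (tags 010 and up) into indicators and subfields."""
    indicators, *parts = data.split(SUBFIELD)
    return indicators, [(part[0], part[1:]) for part in parts if part]


sample = build_record([
    ("001", "12465496"),
    ("100", "1 " + SUBFIELD + "aSmith, Art," + SUBFIELD + "d1960-"),
    ("245", "10" + SUBFIELD + "aBack to the table :" + SUBFIELD + "cArt Smith."),
])
record_length, fields = parse_record(sample)
for tag, data in fields:
    print(tag, data.replace(SUBFIELD, "$"))
```

Notice that the parser never scans the data for field boundaries. Everything it needs is in the leader and the directory, which is what made the format workable on tape.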

Here is a MARC record describing a particular book on cooking. For readability, carriage returns have been added and non-printable field/subfield delimiters have been removed:

     01064cam 22003014a
     45000010009000000050017000090080041000269060045000679250
     04400112955021000156010001700366020001500383040001800398
     04200080041604300120042405000240043608200170046010000230
     04772450070005002500012005702600034005823000041006165000
     02000657650002300677650002400700650001200724099002600736
     12465496 20021004112447.0 010710s2001 nyua 001 0 eng a7 bcbc
     corignew d1 eocip f20 gy-gencatlg 0 aacquire b2 shelf copies
     xpolicy default apc14 2001-07-11 to ASCD ijf09 2001-07-11
     ejf12 2001-07-12; jf12 to Dewey 07-12-01 aaa15 2001-07-13
     aps16 2002-01-03 bk rec'd, to CIP ver. fjf12 2002-01-16 CIP
     ver. to BCCD; ajg07 2002-10-04 to BCCD, copy 2 a 2001039459
     a0786868546 aDLC cDLC dDLC apcc an-us--- 00 aTX715 b.S65132
     2001 00 a641.5973 221 1 aSmith, Art, d1960- 10 aBack to the
     table : bthe reunion of food and family / cArt Smith. a1st
     ed. aNew York : bHyperion, cc2001. a288 p. : bill. (some
     col.) : c24 cm. aIncludes index. 0 aCookery, American. 0
     aDinners and dining. 0 aFamily. asmith-back-1072625392

Once decoded into a (more) human-readable form, this record might look like this:

     001  12465496
     005  20021004112447.0
     008  010710s2001    nyua          001 0 eng
     010      2001039459
     020    0786868546
     040    DLC DLC DLC
     042    pcc
     043    n-us---
     050 00 TX715 .S65132 2001
     082 00 641.5973 21
     099    smith-back-1072625392
     100 1  Smith, Art, 1960-
     245 10 Back to the table : the reunion of food and family / Art Smith.
     250    1st ed.
     260    New York : Hyperion, c2001.
     300    288 p. : ill. (some col.) : 24 cm.
     500    Includes index.
     650  0 Cookery, American.
     650  0 Dinners and dining.
     650  0 Family.
     906    7 cbc orignew 1 ocip 20 y-gencatlg
     925 0  acquire 2 shelf copies policy default
     955    pc14 2001-07-11 to ASCD jf09 2001-07-11 jf12 2001-07-12; jf12 to
            Dewey 07-12-01 aa15 2001-07-13 ps16 2002-01-03 bk rec'd, to CIP
            ver. jf12 2002-01-16 CIP ver. to BCCD; jg07 2002-10-04 to BCCD,
            copy 2

If we assume the purpose of a library is primarily to serve patrons, not librarians, then the information from the above record that is pertinent to a patron might be reformatted to look like this, and even then it is still sort of cryptic:

author: Smith, Art, 1960-
title: Back to the table
pagination: 288 p. : ill. (some col.) : 24 cm.
publisher: New York : Hyperion, c2001.
notes: Includes index.
subjects: Cookery, American. Dinners and dining. Family
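
The reformatting above amounts to picking a handful of tags, giving them friendly labels, and ignoring the rest. Here is a minimal sketch of that idea in Python. It is not how marc2html.pl works, and the names LABELS, RECORD, and patron_view are hypothetical. It also assumes the fields have already been decoded into (tag, text) pairs and that the title has been trimmed to its main part.

```python
# Tags a patron is likely to care about, in display order (an assumption).
LABELS = [
    ("100", "author"),
    ("245", "title"),
    ("300", "pagination"),
    ("260", "publisher"),
    ("500", "notes"),
    ("650", "subjects"),
]

# A few decoded fields from the record above, as (tag, text) pairs.
RECORD = [
    ("001", "12465496"),
    ("020", "0786868546"),
    ("100", "Smith, Art, 1960-"),
    ("245", "Back to the table"),
    ("260", "New York : Hyperion, c2001."),
    ("300", "288 p. : ill. (some col.) : 24 cm."),
    ("500", "Includes index."),
    ("650", "Cookery, American."),
    ("650", "Dinners and dining."),
    ("650", "Family."),
]


def patron_view(fields):
    """Keep only the labeled tags; join repeated fields onto one line."""
    lines = []
    for tag, label in LABELS:
        values = [text for field_tag, text in fields if field_tag == tag]
        if values:
            lines.append(f"{label}: {' '.join(values)}")
    return "\n".join(lines)


print(patron_view(RECORD))
```

Control numbers, ISBNs, and the Library of Congress's internal processing notes simply fall away because they have no label.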

With all of our scholarship, technology, and patron-oriented rhetoric, library catalogs are still first and foremost inventory lists. There is not enough information in this record, or in the vast majority of other MARC records, to judge whether or not this book is worth reading! Such records do not provide the information necessary for qualitative judgments. Library catalogs need to include in their descriptive data qualitative information, the information traditionally received from patron-librarian interactions. There is no information suggesting that this book is popular, authoritative, easy to read, recommended, seminal, scholarly, respected, etc. Nor do cataloging records very often describe the intended audience of a book. For the most part, a person must know how to read between the lines of cataloging data or know the nature of the library itself in order to extract this information. The library profession believes this is a necessary skill for scholarship. I believe this is an impediment for users.

Library catalogs are essentially inventories listing the holdings of a library. Library catalogs are not, and never were, intended to provide suggestions to patrons. That was the job of librarians. Librarians provided the context. With the advent of the Internet, people's expectations regarding information have changed significantly. People expect libraries to change with the times. As fewer and fewer people visit the physical library, fewer and fewer people talk with librarians. Without the librarians the context is lost. It is important for automated library systems to provide this context. These systems, like Amazon.com, must provide qualitative information alongside the objective information of cataloging data. Because the librarian is less involved, this additional data will assist patrons in selecting items of interest.

Technology of this implementation

My particular catalog, Really Rudimentary Catalog, is just more of the same. The underlying scripts used to implement this catalog do no more for the patron than library catalogs anywhere else. Just the same, I am sharing my code and describing how it fits together in the hope that things can be improved.

Really Rudimentary Catalog is a simple "card catalog" program. It consists of:

bin/acquire.pl - a rather brain-dead acquisitions program that downloads MARC records from the Library of Congress
bin/marc2html.pl - another brain-dead program that converts one or more files of MARC records into enhanced HTML files as well as a browsable author index, title index, and subject index pages
bin/make-opac.sh - a Unix shell script that calls marc2html.pl, swish-e (an indexer), and the next program, make-marc-dictionary.pl
bin/make-marc-dictionary.pl - a cool hack inspired by Bill Mosely that reads a swish-e index and converts it into an ASPELL dictionary
lib/Alex/*.pm - two incompletely written Perl modules, the most important being Patron.pm allowing the search interface (index.cgi) to get, set, and display "cookie" data from a user's search session
./search.cgi - a CGI script interfacing with the indexed data
marc/* - a set of sample MARC records
html/* - a set of HTML files converted from MARC records
etc/* - sample author, title, and subject indexes as well as the necessary configuration file for swish-e

To use the system:

Edit the configuration sections of all the scripts.
Install all the necessary Perl modules. (See the scripts' use statements.)
Acquire MARC records from the Library of Congress or your favorite integrated library system.
Index your data using bin/make-opac.sh.
Search your index using search.cgi.

This system is barely supported, but don't hesitate to give me a call anyway. Really.

This software is distributed under the GNU General Public License.

Post script

P.S. This essay was written in two sittings. The first took place in the periodical reading room of Emory University's main library while hiring programmers for Project OCKHAM. The second sitting took place on the way home from a presentation at the Library of Congress on the topic of XML.

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This text was never formally published.
Date created: 2004-02-14
Date updated: 2005-02-08
Subject(s): MARC (Machine Readable Cataloging); computer programs and scripts; OPAC (Online Public Access Catalogs);
URL: http://infomotions.com/musings/rudimentary-catalog/
