Next Generation Data Format

In the United States library catalogs traditionally use the MARC standard for bibliographic records. Many questions revolve around the future of MARC and how it interacts with other metadata standards such as Dublin Core, MODS, and VRA Core. This presentation explores these and other issues related to the "next generation" catalog. (This presentation is also available as an abridged, one-page PDF document as well as a set of PowerPoint slides.)

How we got to where we are

MARC is a data structure -- a container designed to hold information, specifically bibliographic information. Designed in 1965, its original purpose was to provide a means for transmitting the data necessary to print library catalog cards. Think catalog cards for a minute. Tracings. A title & author statement. (The Adventures of Huckleberry Finn / by Mark Twain) A series statement, maybe. Pagination. (246 p. : col. ill.) A note or two. (Includes index.) Subject headings and added entries.

MARC is a supremely good example of a computing technology for its time. Disk space was very expensive, and therefore each and every byte was not to be wasted. There were no computer networks. Data was shared between computers through the use of magnetic tape. The concept of the relational database was at least a decade away, and data was organized in "flat files" where fields were delimited by specific characters and/or organized through the use of offsets from the beginning of the record. Given such an environment, MARC is a wonderful medium complete with multiple and redundant checksums, a exploitation of the ASCII character set, and seemingly massive expandability. All the programmer had to do was provide a means for data entry, concatenate the data together while delimiting the fields and subfields, calculate the MARC directory (one type of redundancy), add a leader section (a bit if metadata about the record and a second type of redundancy), and terminate the whole thing with a specific delimiting character. The results was one long line of ASCII text that could very easily be read and written to tape.

The development of MARC begs questions regarding the definition of library catalogs. What is the purpose of the catalog? Through the use of authority lists, controlled vocabularies, and very specific filing procedures, the purpose of a library catalog is to list the inventory of a specific library as well as to create relationships between works ultimately building an intellectual cosmos out of an apparent chaos. What functions is it intended to support? The answer boils down to two things. First, it was designed to allow the patron (they weren't "users" then) to find known items. "Does this library own a copy of The Adventures of Huckleberry Finn?" Answering the question was as easy as knowing the filing rules and fingering through the cards. Second, the catalog allowed the patron to create lists of items owned by the library based on author names and/or topics. "What books does this library own that relate to origami and paper folding?" In this case the patron needed to articulate a specific author name or topic, and again, finger through the cards. Through the use of See and See Also references, as well as a knowledge of how to read a card's list of subject headings, the patron could get a pretty good idea of what a library owned without needing to visit the stacks.

Fast forward around twenty years to the mid-1980s. Online bibliographic database vendors such as DAILOG and BRS are all the rage in Library Land. Leafing through voluminous paper indexes for article citations is becoming less and less desirable. OCLC is providing copy cataloging services and facilitating inter-library loan. The MARC records initially used to print catalog cards are now being used to create "online" card catalogs. These catalogs are "homegrown" and serve "users". Relational databases are out of the laboratory, full-text indexing techniques are beginning to emerge, and if one listens carefully, talk can be heard about SGML. Keyword searches become possible in the "online" catalog, and while still beneficial, there is less of a necessity to know "authoritative" terms in order find things. People's expectations are beginning to change. "Now that the catalog is 'online' does that mean I can get the book over the computer too?"

Fast forward around twenty more years to the present day. Computers are on every desk. In fact, they are so prevalent that at any one time we may be carrying one or more of them around in our pockets. They have more memory, more disk space, and more computing power than was ever imagined in 1965. The entire set of MARC records stored at OCLC can fit on an iPod. Relational database technology, once considered a black art, now abounds. It is so freely available (all puns intended) that anybody with a rudimentary understanding of normalization techniques and SQL -- an English-like query language -- can efficiently store and manage just about any type of data. Amazon.com, primarily a book seller, uses it to drive its online store. Yahoo, one of the premier Internet indexers, uses it to save the content it harvests from the Web. Relational databases, albeit poorly implemented, form the technological heart of library catalogs. Indexing techniques -- the soul of the information retrieval community -- that have been evolving for more than forty years are best exemplified and exploited by Google. Enter a word or two. Get back a list of possible documents. Click. Get the document. Fast. Easy. Seemingly intelligent. Very reliable and highly relevant. XML, a simplified and more elegant implementation of the ideas behind SGML, is now the data transmission format of choice. A person needs to know only six or seven syntactical rules and a plain text editor in order to create well-formed XML. No non-printable ASCII characters. No limits to size. No calculated checksums. Combined with the Internet, XML and JSON (another simple and elegant data structure) are the building blocks of Web 2.0, a computing environment where content is gathered from here, there, and another place to create "mash-ups". Think blogs, RSS, and ATOM. Think Google Maps and Google Ads. Think Flickr and FaceBook. Finally, with the advent of globally networked computers and the rise of a service-based economy, the idea of brokering data and information has become much more prevalent. It is fashionable to be an information provider. Wikipedia, while not perfect but just a good as Encyclopedia Britanica, is a great resource for answering simple facty-type questions. What's a reference librarian to do?

Meanwhile, back at the ranch, MARC still forms the core of our "integrated library systems". MARC records are copied from OCLC, parsed, and stuffed into databases, but because these database are poorly normalized, it is a major task to maintain the content. For example, there no trivial one-line SQL command that can be used to facilitate global find/replace operations. To provide search against these databases users are expected to know how bibliographic information has been traditionally organized by qualifying queries and specifying fields: title, author, subject term, control number, etc. This, after all, is the way one searches databases. On the other hand, the rest of the information world is using indexing technologies to provide search. No need to specify fields. Instead, free text queries and relevancy ranking derived from statistical analysis and harnessing the power of Web is the norm. Moreover, these "integrated library systems" only integrate with themselves, and there is certainly no possibility of Web 2.0 integration. To make matters worse, libraries license (not purchase) access to hundreds of journal indexes and journal titles. Most of these things have their own closed and proprietary interfaces. True, some of them support a library-only protocol called Z39.50, but it is widely regarded as far too complicated to implement to its fullest extent. Think metasearch. Has it fulfilled our expectations? Because our library catalogs are "closed" and because the journal literature is licensed, access to library content is a matter of guessing which "silo" contains the desired information and learning how to use its particular interface. To add insult to injury, once relevant items are identified, how easy is it to get the content? Remember Google. Click. Get the document.

Considering the current state of affairs it is no wonder why people's expectations have changed and why they barely consider using a library to find anything.

Not all gloom and doom

The purpose of this essay is not to paint of picture of gloom and doom. Instead, its purpose is to: 1) illustrate how the core elements of librarianship are relevant in today's environment, and 2) outline a method for reframing these elements. Put another way, it is not the "what" of librarianship that needs to change but rather the "how". The profession's core mission -- to collect, organize, archive, and disseminate data, information, and knowledge for our respective communities -- is certainly valued by society and demonstrated by the exploding numbers of institutions who are doing the same things. On the other hand, the techniques of librarianship -- the "how" -- do not take advantage of the environment, and they do not go far enough to satisfy the expectations of our patrons, now called "users", "clients", and "consumers". The methods of librarianship need to change. They need to evolve. I suggest this can be accomplished with a four-step plan, outlined below.

Refine the definition of librarianship

Refine the definition of librarianship considering an environment where globally networked computers are ubiquitous.

As alluded to above, the core of librarianship can probably be boiled down to four processes: collection, organization, preservation, and dissemination. Bibliographers articulate and identify materials for a library's specific audience. These materials are organized are brought together to form an apparent cosmos by catalogers. In order to preserve the historical record, these materials are archived for future generations. In order to serve today's patron, these same materials are disseminated through various public services and reference librarians. Considering the current environment -- a milieu of electronic information and licensed content -- these materials may or may not be physically brought into our physical libraries. By harnessing the power of the 'Net, cataloging processes can be supplemented by tagging and reviews. Preservation brings on new challenges, but the principles remain the same. Make many copies of things and some of them will survive. When it comes to dissemination, consider the computer as the primary interface and medium between reference librarians and patrons.

Each of our libraries will have a slightly different take on all of these issues because we will each have a slightly different set of patrons, yet the core mission of libraries will be similar across the spectrum. Constantly ask yourself, "What is my library's goal? Where do we want to be in one year? Three years? Five years? Ten years?" This is a never-ending process. The specific answers will change over time, but the framework will probably remain similar.

Reduce dependence

Reduce your dependence on third-party, "closed source" vendors to provide you with content and software solutions.

Many of our problems stem from the fact that our profession has increasingly outsourced many of our services to third parties. This, in turn, has made it difficult to integrate our content and services. To some degree, this can be traced back to 1901, H. W. Wilson, and Reader's Guide to Periodical Literature. Subscribe to Wilson's index and save library time. Convenient, yes. Integrated with the whole of a library's holdings, no. Now there are two places to search for information on any given topic.

As computers appeared, many libraries had their own "homegrown" library catalogs. For reasons that are not perfectly clear to me, these systems gave way to commercial systems. Through this process the profession shielded itself from technology. We computerized our work environment by mimicking our paper process -- automation. Through RFPs we got exactly what we asked for. This created another silo since our systems did not work with other systems.

In the current environment we are increasingly licensing our content instead of owning it. Can we really depend on publishers to maintain content for decades? Is preservation best served by hosting content in a single location or purchasing insurance against its loss? Are access-only collections a viable long-term solution?

By supporting open access content and open source software to a greater degree than is currently demonstrated, the library profession can ensure increased viability in the future. Open access content, by definition, can be copied without restriction. This addresses preservation issues and the ownership of content. Open source software makes our computing environment more flexible and transparent. It provides a way for libraries to have more control over their technical infrastructure.

It is highly unlikely that open access and open source will completely overshadow for-profit publishing and "closed source" software. More than likely, each of these distribution and software models will exist side-by-side. Neither is a perfect solution, but by advocating and endorsing "all things open" -- a trend in a de-centralized network such as the Internet -- libraries can more easily continue to fulfill their core mission.

Exploit technology

Exploit and combine the use of relational databases, indexing technologies, XML, and the Internet.

Again, the MARC record as a data structure was a wonderful thing for its time, but there have been a myriad of computer technologies since 1965 that can better assist a library in getting its work done. The first is relational databases. Originally developed by IBM, relational databases provide the means to efficiently create and maintain data. A relational database is really a set of one or more lists of discrete data sets "joined" together by a set of common elements called "keys". Since data/information in databases is associated with these keys -- pointers -- and not necessarily specific values it is possible to do global find/replace operations throughout the database by editing only a single field. (Databases are not Excel spreadsheets!). Ironically, databases are not very good when it comes to search because users must know the structure of the database in order to create queries.

If you want to provide search, then you want to employ an index. Indexers are a result of the information retrieval community. They work exactly like back-of-the-book indexes. Feed an indexer a document. The indexer parses the document into individual words. It then saves the words, their position in the document, and a document identifier to disk. A search engine is the other half of the indexer. Given a word the search engine will loop through the word list and return document identifiers matching the query. If Boolean operations are specified, then the search engine also takes into account word positions. Indexer/search engines excel at search for two reasons. First, a knowledge of the data's underlying structure is not necessary making the index easier to search. Second, statistical analysis can be employed to calculate the relevance of a given document, thus providing alternative ways to sort the search results.

Databases and indexers are two sides of the information retrieval coin. Use the first to create and maintain content. Write reports against the database and feed them to an indexer/search engine to support search.

Learn to how two read and write XML files. Like MARC, XML is a data structure -- a container for data. Debates rage in some of the library community regarding the benefits of MARC versus XML. MARC is smaller. MARC is our standard. MARC is language agnostic. The number of MARC records is huge. Our tools know how to use MARC, and conversion to XML is too drastic. On the other hand, XML requires the knowledge of very few rules in order to construct it. XML can be written and read with a plain text editor. XML can be used to mark-up just about any kind of data. Try marking up Shakespeare's sonnets in MARC. Despite all of these things, none of these reasons are the best reason to migrate from MARC to XML. Instead, the best reason is because everybody else is using XML. A large part of what it means to do librarianship is the dissemination of data, information, and knowledge. If we want to share our content, then we need to share it in a language everybody else can understand. That language is XML.

Finally, learn how to exploit the Internet. With the advent of our global network it is easier to communicate and get the input from a much larger number of people. It is easier to reach people physically far away. It is easier to seek out and find kindred spirits with thoughts and ideas like your own. The era of centralized authority is waning. While centralized authority will not vanish completely, its importance is diminishing. Libraries need to use this to their advantage. Social networking can be very powerful. Think MySpace, LibraryThing, Fickr, YouTube, and Facebook. There are experts out there who can contribute to just about any endeavor. Examples for libraries include tagging (folksonomies), reviews, rankings, linking, sharing, and blogging. Put another way, many people are doing library work without professional library degrees. Learn to harness their energies and put them to use in your digital library collections and services. Put yet a third way, "'A rising tide floats all boats.' All we have to do is put our boat in the water."

Work collaboratively

Work with sets of peers and stakeholders inside and outside your library to design and implement solutions to shared problems.

The theme of this essay is, "The core processes of librarianship are valued by the wider community, but the way the profession puts these processes into practice is antiquated." If you were to put the first three steps of change into action (refine librarianship, reduce dependence, and exploit technology), then you will be primed to take your new-found skills on the road.

When it comes to digitizing content in an academic library, consider digitizing content that is of interest to your local students, instructors, and scholars. While there is significant evidence that special collections content is of wide interest to the outside community, digitizing the content needed by your local constituents will pay off quicker. Your local constituents are going to sing your praises louder than the anonymous people from the Internet.

Increasingly data sets are becoming a part of the scholarly landscape. No longer is it satisfactory to do a set of experiments or conduct a number of surveys and then write about the results. It is becoming expected to make the data used to come to the scholarly conclusions available as well. Who is going to collection, organize, preserve, and disseminate this data? Sounds like a job for librarians to me. Seek out the scientists and figure out ways to solve each other's problems.

While many librarians advocate open access publishing, it is important to not advocate too hard because it often comes off as being pushy. On the other hand, figure out ways to make it easy for the institution to collection locally developed materials. They might include things beyond formally published articles. They could also include much of the gray literature: pre-prints, conference presentations, student research, etc. If the metadata describing these thigns is not perfect, then so what. The world is not going to come to an end and you will moving in the right direction as opposed to doing nothing at all.

Just as libraries have a tradition of co-operative collection development and cataloging, libraries could be a natural when it comes to working together to create technical solutions to library problems. Practice with your technology. Write down what you've learned. Share it on the Internet. Repeat. Your peers will want to know.

"Next generation" library catalogs

All of this brings us back to the topic of the day -- "next generation" library catalogs. Considering today's environment, people have no problem finding content. It abounds. It is so plentiful we still feel like we are "drinking from the 'proverbial' firehose." With all this information, whether we find it on Google or through a library system, the big question really is, "What are people going to do with the information?" Yes, library catalogs need to change. They need to be easier to use and seemingly more intelligent. They need to include content that goes beyond the physical (or licensed) holdings of a particular library. The networked environment in which we live has changed the expectations of our users. Our catalogs need to reflect these changing expectations or we need to face the consequences of not keeping up with the times. To these ends, it is almost a trivial computing task to combine the technology of the day to create a "next generation" library catalog:

define a collection development policy
build the collection
describe and/or manifest the collection using XML
manage the XML using relational databases
make the XML searchable by indexing it
provide access to the index

From a computer technology point of view, this process is easy. Just for this presentation I indexed a set of XML documents that could form the beginnings of a "next generation" library catlog. The XML comes in many forms: MARCXML, TEI, EAD, and VRA Core, representing books, full-text documents, archival materials, and images, respectively. The index is rudimentary in that is maps the XML to Dublin Core elements, but the indexing process does not have to be that simple. It is entirely possible to include in the index any one or more XML-specific elements and make those elements searchable. Access to the index is provided through an SRU (Search/Retrieve via URL) interface, thus returning more XML that can be transformed into many different types of displays. For more detail see the:

In my opinion, the real opportunity for "next generation" library catalog systems is not search, but services. The profession needs to figure out ways to enable patrons to use the data they find. Libraries, as opposed to Google, are in a unique position in this regard because libraries are more able to place content into the context of the user. We have a better idea of who our users are and what they want to do with with their content. We know whether or not they are students, instructors, or scholars. We know whether or not they are grade school students or senior citizens. We know what classes they take and in what departments they reside. Based on this sort of information it is possible to taylor search results and provide additional services against the content -- especially if it exists as full-text. Each of these services can be expressed using an action verb, and here is a list of possibilities:

Add to my collection - Once an item is displayed, select it to add it your "your library" and optionally add it to a sub-section of your library. This is like bookmarking.
Annotate - As you read a text, the system will allow you to comment on the text, and associate each comment with a particular word, sentence, paragraph, section, etc. This would function much like scholarly annotated version of text like The Annotated Alice.
Cite - Select a text. Choose an option. Return a citation of the item in any number of formats. MLA. Chicago. Etc. It would be nice of each text could be associated with a URI.
Compare & contrast - This is one of the "kewlest" function. Select any number of texts. Two. Five. Fifty. One hundred. Select compare. The system reads each text and analyses the words it finds. It counts the words (much like a concordance) and returns a report listing things such as: these words appear in 90% of the documents, these words appear in 45% of the documents, this document contains most of the words contained in the other documents, this document contains the most number of unique words. By going through this process a person would be able to see which documents were similar to each other and which documents were dissimilar. Optionally, the user could seed the compare & contrast documents with word or phrases to focus on a particular idea.
Create different version of - Given a document in TEI, transform the document into PDF, something designed for your iPhone, or XHTML. Given a document created for one ebook reader, convert it into a version for another reader.
Create flip book - A simple approach it to create an interface allowing the person to "flip" through the book very quickly similar to the way a person makes pages turn very quickly in their hands. Even better, extract all the images from one or more texts and create slide show of the images. This will allow a person to scan/browse many texts quickly and select a text accordingly.
Create tag cloud from - As a graphic illustration of what a text (or collection of texts) is about, count the number of times words appear in a document and lay them out according to their rank, much like Del.icio.us tag clouds, only bigger.
Delete from my collection - This is the inverse of Add To My Collection.
Do concordance against - A concordance counts words, allows a person to list the words in alphabetic or numeric order, and then points you to the location of the words in the text. This is just about the oldest form of indexing and it was originally applied to the bible hundreds of years ago.
Do rudimentary morphology - Given a word and a dictionary/thesaurus, extract from one or more texts the way the word was used, both forward and backward. The system would take advantage of alternations in spellings as well as meanings. Tricky!
Find opposite - Determine the "aboutness" of a document. Use a thesaurus to find the antonym of the "aboutness" and find new documents.
Find similar - This is the same as Find Opposite except the system looks for synonyms.
Hilight - This is similar to Annotate except the annotations are graphical in nature. This is similar to using a hi-lighter pen on books or diagraming sentences.
Incorporate into syllabus - Link a given document to another document and supplement the link with a short blurb such as an assignment.
Map to controlled vocabulary term - Exploit user tags and/or statistical analysis to determine the "aboutness" of a document or the documents' author authority heading, and find the closest match in a "subject heading" list such as LCSH or Dewey.
Plot on a map - Given a gazetteer, find all the places in a document and plot them on a map in terms of both time and place. Associate each point with a passage in the text.
Print - Move the document from the screen to paper. It would be even cooler if the printed version were printed like books are "suppose" to print -- odd pages appear on the right, chapter headers appear on odd pages, title page and verso, "properly" numbered pages, back-of-the-book indexes, colophon, etc.
Rate - Give the text a numeric rating such as 1 - 5.
Review - Write a description of the text and its content. This a verbose version of Rate.
Save - Copy a version of the text from the remote site to your local file system.
Search - Enter a term and return/navigate the user to sections of one or more documents.
Search my collection - Query only the documents you have put on your "bookshelf".
Share - Create one or more lists of documents and publish the lists.
Summarize - This is very similar to Review but is not necessarily intend to include value judgments. This can be done by a human, but it can also be done by a computer through various extraction techniques.
Tag - Associate with your own controlled (or not so controlled) vocabulary term.
Trace author - Determine author(s) of document, and find other work by or about them.
Trace citation - Extract citations from texts bibliography. See who else has used used those citations, both before the text was written and/or after the text was written.
Translate - Convert the text(s) or passages into other languages.

Summary/conclusion

Libraries have an enormous set of opportunities available to them. It is fashionable to do library work. It is just that the work does not necessarily manifest through books, but rather the content of books and journals and images and data sets, etc. Moreover, it is not about tried-and-true library techniques (MARC, Library of Congress Subject Headings, Boolean logic, etc.) Instead it is about using the methods and technologies of the times. It is not the "what" of librarianship that needs to change. It is the methods.

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This presentation was originally given at the Massachusetts Library Association annual meeting in North Falmouth (May 7, 2008)
Date created: 2008-05-03
Date updated: 2008-05-03
Subject(s): Massachusetts Library Association; next generation library catalogs; presentations;
URL: http://infomotions.com/musings/ngc4mla/