Open Library Developer's Meeting: One Web Page for Every Book Ever Published

I attended an Open Library Developer's Meeting on Friday, February 29, 2008 in San Francisco's Presidio, and this travel log outlines my experiences there. In a sentence, it was one of the more inspiring meetings I have ever attended.

Open Library -- The movie!

The meeting was introduced by Brewster Kahle, who leads the Internet Archive [1] and sponsors and supports things like the Wayback Machine [2], Open Library [3], and the Open Content Alliance [4]. Kahle outlined the goal and purpose of the Open Library, and like any well-articulated goal, it can be distilled down to one easy-to-understand phrase, "To create a system where every book ever published has at least one web page." Kahle certainly understands that the content of libraries goes well beyond books, but books, he said, represent the bulk of the issue to be addressed. Books are the starting point.

While Brewster Kahle financially supports the Open Library, Aaron Swartz is the project's leader. Swartz was introduced, and he gave an overview of how the goal of the Open Library is to be implemented: by amassing as many records describing as many books as possible, saving them to a database, and making the content of the database available in a myriad of ways, including wiki-like web pages, search, Web Services computing, and computer language APIs. By building relationships with stakeholders and content providers (publishers, retailers, libraries, software publishers, etc.), Open Library might provide cover art, summaries, scholarly reviews, popular reviews, full-text versions of the texts in many formats, rankings, discussion forums for each title, tight integration with Wikipedia, tight integration with citation management tools, links to library holdings, scanning & printing on demand services, the ability to harvest records for inclusion with local library catalogs, or even (gasp) the ability to buy the book. By exploiting authority and controlled vocabulary lists, not only will every book have a Web page, but each author name and subject term might have a page as well, thus creating a veritable "Web of books." This functionality and vision are very similar to those articulated by many proponents of "next generation" library catalogs.

The floor was then given to Anand Chitipothu and Edward Betts, who described the system's underlying infrastructure and data processing techniques, respectively. In brief, the infrastructure -- code named "Infogami" -- is an Infobase relational database with a minimum of tables and fields. Each book record exists as an object in the database and is associated with sets of configurable name/value pairs denoting object characteristics. These characteristics might include author(s), subject(s), identifier(s), etc. [5] The database supports records for people and revisions of content, thus paving the way for user authentication and wiki-like pages. Infogami is written in Python and supports an internal API for the reading and writing of content using JSON as the exchange format. Content for Infogami (and therefore Open Library) is expected to come from a number of sources such as MARC records, ONIX files, and authority files. As the data is ingested, the hope is to normalize it, de-duplicate it, enhance it, and FRBR-ize it. All of this is especially tricky for quite a number of reasons: the lack of single identifiers, inconsistent encoding practices, only near-perfect crosswalks, changes in terminology, language translation difficulties, etc.
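To make the name/value idea concrete, here is a minimal sketch in Python. It is not Infogami's actual code or API: the class name Record, its methods, the key "b/walden", and the JSON layout are all invented for illustration. It only shows how a record with free-form properties and a revision counter might round-trip through JSON.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical record: a key, a revision number, and name/value pairs."""
    key: str
    properties: dict = field(default_factory=dict)
    revision: int = 1

    def update(self, **changes):
        """Return a new revision; the earlier revision is left untouched."""
        merged = {**self.properties, **changes}
        return Record(self.key, merged, self.revision + 1)

    def to_json(self):
        """Serialize the record to a JSON string (assumed exchange layout)."""
        return json.dumps(
            {
                "key": self.key,
                "revision": self.revision,
                "properties": self.properties,
            },
            sort_keys=True,
        )

    @classmethod
    def from_json(cls, text):
        """Rebuild a record from the JSON produced by to_json()."""
        data = json.loads(text)
        return cls(data["key"], data["properties"], data["revision"])


# The key and property names below are made up for the example.
book = Record("b/walden", {"title": "Walden", "authors": ["Henry David Thoreau"]})
revised = book.update(subjects=["Nature"])
print(revised.to_json())
```

Because the properties are an open-ended dictionary, a new characteristic (a subject, an identifier) can be added without changing any table or class definition, which seems to be the point of keeping the schema to a minimum.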

Open Library Developer's Meeting attendees

The afternoon was given up to a number of break-out sessions to discuss specific issues. I participated in the discussion surrounding APIs to integrate Wikipedia with Open Library. The desire is to allow people to cite books in Wikipedia by: 1) searching for title or key, 2) returning a list of results, 3) selecting an item, 4) having the system create a citation for the item, and 5) inserting the citation into a Wikipedia article. Then, on a regular basis, these citations will be checked and updated, ensuring their integrity. Everybody in the group supported the concept of REST-ful Web Service computing to accomplish the task, but not everybody's definition of REST-ful computing was congruent, and a bit of a religious war ensued. I tried to advocate for the use of SRU as the search protocol, but ultimately people leaned towards OpenSearch. "SRU is too complicated." Regarding citation checking, the discussion centered on requesting records for one or more identifiers from Open Library and getting back a stream of metadata. Here I tried to advocate for another existing, well-established standard -- OAI-PMH -- but again I was shot down. "Too complicated." In reality I think two things worked against the adoption of SRU and OAI. First, my description of their functionality was not as eloquent as it could have been, and second, the Open Library personnel had never heard of, and knew nothing about, either protocol. This is another example of library standards being too library-centric. Think Z39.50.
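The five citation steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory CATALOG stands in for whatever search service Open Library eventually exposes, the record keys and function names are hypothetical, and the output uses Wikipedia's {{cite book}} template with a handful of its common parameters.

```python
# A tiny stand-in catalog; a real system would query Open Library instead.
CATALOG = [
    {"key": "b/walden", "title": "Walden", "author": "Henry David Thoreau",
     "publisher": "Ticknor and Fields", "year": "1854"},
    {"key": "b/moby-dick", "title": "Moby-Dick", "author": "Herman Melville",
     "publisher": "Harper & Brothers", "year": "1851"},
]


def search(query, catalog=CATALOG):
    """Steps 1 and 2: match on exact key or title substring, return a list."""
    q = query.strip().lower()
    return [r for r in catalog
            if q == r["key"].lower() or q in r["title"].lower()]


def cite(record):
    """Step 4: render a record as a {{cite book}} template call."""
    fields = ("title", "author", "publisher", "year")
    parts = [f"{name}={record[name]}" for name in fields if record.get(name)]
    return "{{cite book |" + " |".join(parts) + "}}"


def insert_citation(article_text, citation):
    """Step 5: append the citation to the article text as a <ref> footnote."""
    return article_text.rstrip() + "<ref>" + citation + "</ref>"


results = search("walden")   # steps 1 and 2
chosen = results[0]          # step 3: the user picks an item
print(insert_citation("Thoreau lived at the pond.", cite(chosen)))
```

The periodic integrity check described above would amount to re-running cite() against fresh metadata for each stored key and comparing the result with what the article currently contains.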

The other break-out sessions included one on identifiers, where the consensus was towards not using local database keys but instead maybe hashes of author names and titles. It seems to me that the search for global identifiers -- think URIs -- is like searching for the Holy Grail. The book display group advocated the use of JPEG2000 as the file format and hoped to create "flip books" supporting things like cut & paste, zoom, pan, thumbnails, rotate, KWIC highlighting, etc. The user experience group wanted to make sure the Web browser interface supported things like ease-of-use without any training needed, find, get, interact, an expert interface, a novice interface, and a well thought out information architecture created through the exploitation of user-centered design techniques.
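The idea of hashing author names and titles can be illustrated as follows. The group did not settle on a normalization scheme or a hash function, so everything here is an assumption: the function names, the choice of SHA-1, and the normalization rules (strip accents, lowercase, drop punctuation, collapse whitespace) are one plausible combination, not a decision made at the meeting.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Strip accents, lowercase, drop punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[a-z0-9]+", ascii_only.lower())
    return " ".join(words)


def book_id(author, title):
    """Hypothetical identifier: SHA-1 hex digest of normalized author and title."""
    basis = normalize(author) + "|" + normalize(title)
    return hashlib.sha1(basis.encode("utf-8")).hexdigest()


# Differences in case and punctuation collapse to the same identifier.
a = book_id("Thoreau, Henry David", "Walden; or, Life in the Woods")
b = book_id("thoreau henry david", "WALDEN, OR LIFE IN THE WOODS")
print(a == b)
```

The sketch also hints at why the problem is hard: "Henry David Thoreau" and "Thoreau, Henry David" normalize to different strings and so yield different identifiers, as would a translated or abbreviated title.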

The meeting was brought to a close by allowing everybody in the room to briefly state what they would like to see in the future of the Open Library. Some of the comments included:

* international languages
* a repository of identifiers complete with an API
* an integrated workflow
* integration with LibraryFind, contributed records, Wikipedia, etc.
* services against the data like xISBN or "my catalog"
* faceted browsing
* seeing libraries implemented at "Web scale"
* contributed content
* good documentation
* more conversations with OCLC
* a deepening of the 'Net and putting these things in reach of our children, and
* overwhelmingly, creating a data set that rethinks the definition of library collections and applies Web 2.0 technologies to them.

As we were having our picture taken, the big joke was, "LOCKSS -- Lots of cameras keep stuff safe."

As an extra treat, most of us went across the street to the Internet Archive's offices. There we got a demonstration of their book binding machine (that didn't quite work correctly), we heard an inspiring description of Kahle's vision for future libraries, and we saw one of the Archive's "scribes" -- a book scanning device. Finally, we had the pleasure of walking through the Presidio to Kahle's home, where we enjoyed wine and cheese while looking at San Francisco below us and the Golden Gate Bridge to our left.

Reflection

This was one of the more inspiring meetings I have attended in a long time. The Internet Archive was generous with their time, energy, and money. People came to the table with open minds and open agendas. The vision was clearly articulated. The problems were neatly outlined. Solutions were brainstormed. People went away charged, and the Open Library developers were probably re-invigorated.

The ultimate idea of Open Library goes far beyond Fred Kilgour's original idea of a cooperative catalog and OCLC. Yet, at the same time, the core of Mr. Kilgour's idea is at the heart of Open Library -- a very large, centralized library. I don't believe there ever will be or ever should be one and only one library for all of humankind, because libraries ultimately serve individual constituents, and it is impossible for any single institution (read "library") to be all things to all people. On the other hand, the idea of a large, centralized repository of knowledge does have a certain appeal. It can be looked upon as a respected authority and a touchstone for ideas. Considering the existing institutions that hold and distribute library-based content, Open Library looks like a promising upstart. At the very least, I believe it will demonstrate what a loosely federated network of committed individuals with diverse sets of skills and a spirit of cooperation can do to solve large problems.

Notes

[1] Internet Archive - http://www.archive.org/
[2] Wayback Machine - http://www.archive.org/web/web.php
[3] Open Library - http://demo.openlibrary.org/
[4] Open Content Alliance - http://www.opencontentalliance.org/
[5] Infogami, besides having a name alluding to one of my favorite hobbies, seems to have a schema very much resembling MyLibrary where information resources are represented as (Perl) objects associated with any number of locally defined facet/term combinations - http://mylibrary.library.nd.edu/category/facets/.

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This travel log first appeared on the Hesburgh Libraries website at http://www.library.nd.edu/daiad/morgan/travel/open-library/.
Date created: 2008-03-14
Date updated: 2008-05-24
Subject(s): Presidio; Open Library; travel log;
URL: http://infomotions.com/musings/open-library/