Mass digitization

This travel log outlines my experiences at a symposium sponsored by the University of Michigan called Scholarship and Libraries in Transition: A Dialog about the Impacts of Mass Digitization Projects, March 10-11, 2006. In short, the symposium facilitated quite a number of presentations and panel discussions about mass digitization. Discussion topics ranged from mass digitization's impacts on libraries to possibilities for scholarship to changes in publishing to economic effects to public policy issues. Most of the discussions were framed by the Google Print project, and just about everybody provided their point of view regarding the effect of Google on the academy.

The Symposium

Rackham Auditorium ceiling

On Thursday, March 9, Pascal Calarco, Nigel Butterwick, Parker Ladwig, Joe Thomas, and I converged on Ann Arbor to attend the symposium. Bright and early the next morning the symposium began. The first panel discussion focused on libraries. I found the remarks of Michael Keller (Stanford University) and Karin Wittenborg (University of Virginia) the most interesting. Both of them acknowledged that mass digitization allows libraries to rethink the role of physical space, but more importantly, it allows libraries to rethink what libraries do regarding collections. Both of them alluded to the possibilities of enhanced, value-added services against collections of electronic texts, but when pressed for elaborations none were forthcoming.

The keynote presentation was given by Tim O'Reilly (O'Reilly Media). A man after my own heart, he is a strong advocate of open source software and has very successfully demonstrated ways to build sets of economics around it. As a Harvard graduate and a for-profit publisher of computer-related materials, O'Reilly asked the provocative question, "What job do books do?" To his mind the World Wide Web is the largest ebook. He sees books as akin to content databases, not necessarily sets of pages between covers. The same way people "rip" CDs, he hopes we can begin to "rip" books. Google Print is the beginning of such a process but on a huge scale. By unbundling the content of books from their pages people will be able to "mash up" content in ways never before imagined. O'Reilly pointed to a website called Last.fm as an example. It uses music instead of text. He went on to postulate that making the content of books (even older books) searchable makes them more valuable and more useful. He more or less demonstrated this by studying the use of searchable books from his publishing house.

O'Reilly guessed that about 5% of the books in existence are "in-print" books -- books currently sold by publishers. About 20% of the books in existence are in the public domain -- "old" books. The remaining 75% of books are orphaned works. This ratio was used time and time again throughout the symposium.

Finally, he advocated that libraries figure out how to harness the "collective intelligence" of users to enhance the use of books and to generate new knowledge. He pointed to Google's PageRank, eBay, Amazon.com, and Craigslist as websites exploring and using collective intelligence. Another example was the combined interaction between an iPod (a device), iTunes (software to interact with community), and music (content). The same sort of techniques could be applied to the content of books.

The panel on research, teaching, and learning seemed to focus much of its attention on search. For example, Ed Tenner (Princeton University) thought that present-day search gave users a false sense of accomplishment and search engines should have an academic mode. Jean-Claude Guedon's comments were a bit more innovative. He thought mass digitization would allow academia to create huge concordances and provide the means to identify lesser-used words and create new pathways to knowledge. Guedon also advocated the creation of relationships between people through their interaction with common texts.
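To make Guedon's idea a bit more concrete, here is a minimal sketch of a concordance and a list of lesser-used words, written in Python. It is only an illustration under my own assumptions, not anything presented at the symposium: the function names, the crude tokenizer, and the sample sentence are all made up, and a real concordance over millions of digitized pages would need an index rather than a linear scan.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(text, keyword, width=3):
    """List every occurrence of keyword with up to `width` words of context on each side."""
    words = tokenize(text)
    target = keyword.lower()
    lines = []
    for i, word in enumerate(words):
        if word == target:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


def lesser_used_words(text, max_count=1):
    """Return the words occurring at most max_count times, sorted alphabetically."""
    counts = Counter(tokenize(text))
    return sorted(w for w, c in counts.items() if c <= max_count)


# Hypothetical sample text, standing in for a digitized book.
sample = "The whale swam. The whale dove deep. A harpooner watched the whale."

for line in concordance(sample, "whale", width=2):
    print(line)
print(lesser_used_words(sample))
```

The same two functions could be pointed at any plain-text file; the interesting library work would be in doing this across an entire digitized collection at once.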

Two themes became apparent during the publishers' panel. First, even with the advent of the Internet and the Google Print project, there will still be the need for publishers, but publishers will increasingly focus on niche markets and greater collaboration. This was described by Suzanne BeDell (ProQuest) who pointed to the Text Creation Partnership as an example. Alicia Wise (Publishers Licensing Society) reinforced the idea and emphasized that mass digitization does not capture the subtle nuances of publishing. Second, Dan Greenstein (California Digital Library) put forth the idea that information is increasingly becoming a commodity and a public good -- information as a utility, and this public good needs to be held by a trusted third party such as libraries, archives, and museums.

The first day came to a close with a presentation by Adam Smith (Google) who described the Google Print project in greater detail. He compared it to a virtual card catalog, an index. The most interesting part of his talk was his almost off-the-cuff outline of a development process. Create a service. Get it out early. Watch users. Iterate. There was not much news in his presentation, and he did his sincere best to field questions from the audience.

The second day began with a panel discussion on the economics of mass digitization. Paul Courant (University of Michigan) echoed the idea of information as a public good and said mass digitization turns a local public good into a global public good. More importantly he thought mass digitization enables scholarship (which implies scholarly communication) to take place over greater amounts of space and time. Hal Varian (University of California, Berkeley) compared and contrasted the opt-in versus the opt-out model of inclusion into the Google Print project. From his point of view, the opt-out model is much less expensive than the opt-in model. He advocated it. Karl Pohrt (Shaman Drum Bookshop) described how the Internet has changed his profession. In short he believes people do less reading, and he was dismayed because the Internet cannot provide ways to browse physical shelves.

The last panel discussion surrounded public policy. Bruce James (United States Government Printing Office) said that people are looking for federal documents more than ever, and while the Government does not want to give any individual company complete control of the printing process, it does look to relationships and collaborations as solutions to its publication and distribution problems. He said he is also looking into methods of stamping or watermarking documents in order to denote their authenticity. James Hilton (University of Michigan) thought that too many intellectual property "fences" built up around smaller and smaller domains will be the demise of the academy. Intellectual property rights are increasingly becoming insidious. The patenting of things like 1-click shopping by Amazon.com, consumer price setting by PriceLine, and the simple building of huge piles of patents by IBM are all examples. Intellectual property should be like ever-expanding fields of wheat nourishing humanity, not like oil derricks sucking non-replenishable resources out of the ground.

Clifford Lynch (Coalition for Networked Information) brought the symposium to a close with his usual flair for summarizing the issues and describing possibilities for the future. First of all, he likes the phrase "large-scale digitization" instead of "mass digitization" because large-scale digitization is more planned. Second, books are not the be-all and end-all for library collections. There are manuscripts, data-sets, music, the images of amateur photographers, multi-media objects, etc. As "large-scale digitization" efforts come to fruition we need to ask ourselves, what are we going to do with these collections? How are we going to plan for success? For example, what are we, librarians, going to say when people want to download the entire collection? Will we let them? He advocated digitization as a form of preservation (insurance), especially when the copies are replicated. Finally he wondered how people would use these collections. For example, our presidential collections are massive. How will people write authoritative biographies of past presidents using these collections? With these large collections of primary materials at their disposal, biographers will need the assistance of computers to analyze and synthesize content.

Personal observations

I appreciated the opportunity to attend the symposium. It was attended by people from all over the country. The venue was beautiful. The topic was timely. The price was right, and it was very well organized. Kudos to the organizing committee. Ann Arbor is a nice place to visit.

On the topic of mass digitization I have a number of personal observations. First, like Tim O'Reilly, I have never considered a book to be sets of pages between covers. Books are containers, and libraries are not about books. Libraries are about what is inside the books. Books are merely manifestations of data, information, and knowledge. Yes, some books are special in and of themselves, but for the most part they are simply "content databases", not things to be treasured and hidden away in dark rooms. "Books are for use", and I write in my books all the time. A well-used book naturally opens up to the most important parts. Don't get me wrong. I appreciate the book, a codex, as a technology. I make and bind my own notebooks. They are portable, durable, self-sustained, and last a good long time. At the same time digitized books offer a greater degree of utility than traditional books, as long as the digitized books are not limited by some sort of digital rights management system. Mass digitization will only increase the opportunities for this utility, and with these increases will also come increases in user expectations.

Larger and larger quantities of books and journal articles are being digitized or "born digital". Combine this with the user's ability to locally store gigabytes of data on portable storage devices such as flash drives and iPods. Imagine the ability to carry around the entire corpus of the published literature from the 18th and 19th centuries on such a device. Imagine the ability to supplement this "collection" with all the relevant literary criticism. In such a world, what is the role of the library and the librarian? Obviously it is not about collections because the user has the collection. Instead, the role of the library and the librarian is about services against the collection. The role is the creation and distribution of tools giving the student, researcher, or casual reader the ability to make better use of the collection. Examples might include:

As the amount of digital content grows so does the likelihood that the content will be duplicated. Increasingly people will be creating their own personal "collections". If we gave people the opportunity to download the entire works of Mark Twain, then it would get downloaded. Once students and researchers create these "collections" then they are going to want to do analysis against them. This holds true for current electronic serial literature as well, and if libraries were not bound by licensing restrictions we would allow such downloads. In this case the scientist will want to compare and contrast too, not just the humanist. Track this author. Track this citation. Trace forwards and backwards this particular chemical model. These kinds of services are not the sole purview of librarians, but they do represent interesting library work, and the development of tools like the ones outlined above represents growth opportunities for libraries. They represent ways libraries can remain relevant. They represent ways librarians can use computers to revolutionize the use of libraries and not just mimic older technologies (the card catalog) with newer ones (the OPAC).
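As a hedged illustration of what a small "compare and contrast" service against a personal collection might look like, the Python sketch below compares the vocabularies of two texts. Everything in it is an assumption made for the sake of example: the function names, the letters-only tokenizer, the choice of a simple vocabulary-overlap ratio, and the two sample sentences. Real analysis would call for better tokenization and weighting, but the shape of the tool is the point.

```python
import re


def vocabulary(text):
    """Return the set of distinct lowercase words in a text (letters only)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def compare(text_a, text_b):
    """Compare two texts by vocabulary: shared words, unique words, and an overlap ratio."""
    a, b = vocabulary(text_a), vocabulary(text_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Shared words divided by all distinct words; 0.0 when both texts are empty.
        "overlap": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical sample sentences, standing in for two downloaded works.
result = compare("The river runs south", "The river freezes")
print(result["shared"], result["only_a"], result["only_b"], result["overlap"])
```

A student with two novels on a flash drive could run the same function on the full texts and get a first, rough sense of where the works converge and diverge.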

After the time of mass digitization a library's collection will not be as important as it is today. Everybody will be carrying the collection around in their pocket. Instead what people will need are sets of services -- tools -- to apply against the collections, making the content more useful. In a digital environment the things of traditional librarianship (books) will give way to their content, and this makes services increasingly important. Libraries, especially libraries hosting digital materials, need to be about the combination of collections and services. This was alluded to many times throughout the symposium but not very thoroughly. O'Reilly touched on these ideas with his "mash ups" and "collective intelligence". Keller briefly mentioned them in his remarks. Guedon postulated the creation of interpersonal relationships. Lynch outlined these ideas in the greatest detail. Unfortunately these ideas seemed to generate no sparks among the audience. I was disappointed.

I can summarize my personal observations in this way. Collections without services are useless, and services without collections are empty. You can't have one and not the other and call your thing a library. Librarians need to provide equal amounts of both in order to practice balanced librarianship, especially in a digital environment.


Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This text was never formally published.
Date created: 2006-03-18
Date updated: 2006-03-18
Subject(s): mass digitization; Ann Arbor, MI; travel log;
URL: http://infomotions.com/musings/mass-digitization/