Joint Conference on Digital Libraries, 2006

This text outlines my experiences at the Joint Conference on Digital Libraries in Chapel Hill (North Carolina), June 12-14, 2006. In a sentence, the conference was a nice mix of formal and informal discussions surrounding digital research library topics, and it was good to see a large number of familiar faces. The conference's content presented few surprises.

Getting books online

The Giant John McCullough Mushroom

The first day began with a plenary panel discussion on the topic of digitized books moderated by Cliff Lynch (Coalition for Networked Information). Lynch framed the discussion with a pair of questions, "What are we going to do with the books when they get online? What happens if we succeed?" These were the same questions he asked at the Mass Digitization Symposium in Ann Arbor a few months earlier. These sorts of questions never go out of style.

Daniel Clancy (Google) prefaced his remarks on the Google Print project by saying "information that is good enough and easy to get is preferable to the 'best' information that is difficult to get." I have noticed that this statement, much to the chagrin of most of the library profession, can be backed up by numerous OCLC environmental scans. People are driven by ease-of-use and convenience. Clancy went on to compare and contrast the Google Partner Program and the Google Print project. He echoed the percentages of potentially available content for the Google Print program (15% is in print, 65% are "orphan" works, and 20% is in the public domain). The Google Print program is trying to scan thirty million books, and he sees the scanning as the easy part. The biggest challenge is balancing the desires of copyright holders with the public interest. He sees the interrelationships contained in books as analogous to the hyperlinked World Wide Web, and he wants to make these relationships more explicit and available electronically. As always, he, as a Google employee, was cagey when it came to describing the "how's" of the digitization process.

David Ferriero (New York Public Library) described why the Library decided to participate in the Google Print project. Namely, they were already digitizing printed research materials, and they saw a partnership with Google as a way to accomplish their goals more quickly as well as an opportunity to collaborate with the other "G5" libraries. The New York Public Library is only digitizing public domain works. Ferriero enumerated and elaborated upon a short "worry list" of issues: 1) copyright, 2) privacy, 3) collection and duplication, and 4) preservation.

Daniel Greenstein (California Digital Library) described the Open Content Alliance. The Alliance's goal is to digitize out-of-copyright works. It is a joint venture between the University of California, the University of Toronto, Microsoft, Adobe, the Internet Archive, and other institutions. He described the organization as "thin." They are presently scanning about 10,000 books/month while operating ten scanners during two shifts a day. They plan to make their first books available in October. After he compared and contrasted the Alliance with Google Print, the only significant difference between the projects seemed to be that the Alliance will allow third parties to index its content. Google will not. Greenstein presented his "worry list" in the form of a research agenda: 1) opportunities to investigate search, recommender, collaborative filtering, and similar services more thoroughly, 2) content integration/curation, 3) allowing people to institutionalize this content and serve the needs of individuals and groups, and 4) demonstrating print-on-demand services.

The discussion after the presentations centered on a debate over whether or not current optical character recognition (OCR) accuracy rates (about 98%) were good enough. Some said yes. Others said no. There was also an interesting discussion of whether or not the content of the digitized materials would be too heavily weighted toward the English language. Apparently 490 languages are represented in the content's millions of books, and about 50% of those materials are written in English.
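To make the "good enough" debate concrete, here is a small worked calculation. It is only a sketch under stated assumptions: the 98% figure is treated as a per-character accuracy, character errors are treated as independent, and the page size (2,000 characters) and word length (5 characters) are hypothetical round numbers, none of which came from the panel.

```python
def expected_char_errors(chars_per_page: int, char_accuracy: float) -> float:
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * (1.0 - char_accuracy)


def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is right.

    Assumes character errors are independent, which real OCR errors are not.
    """
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Hypothetical page of 2,000 characters and words of 5 characters.
    print(expected_char_errors(2000, 0.98))
    print(word_accuracy(0.98, 5))
```

Under these assumptions a page carries roughly forty wrong characters and about one word in ten contains an error, which may explain why the room split on the question.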

Named entities

Alison Jones (Tufts University) presented a paper called "The challenge of Virginia Banks" and described ways they were automatically marking up texts in TEI and then creating browsable lists of names. Starting with a gazetteer, their system looks for patterns in texts and marks things up accordingly. For example, anything following the word "Lieutenant" was probably the name of a person, and anything preceding "Mass." was probably a city or town. While not perfect, the system was able to extract a large number of names from texts, and she is confident the system is moving in the right direction.
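The two example rules can be sketched with regular expressions. This is a minimal illustration, not the Tufts system: the patterns, the function name mark_up, and the choice of the TEI elements persName and placeName are my own assumptions, and the gazetteer step is left out entirely.

```python
import re

# Hypothetical rules modelled on the two examples in the talk.
# A capitalized word or two after "Lieutenant" is treated as a person.
PERSON_AFTER_TITLE = re.compile(
    r"\b(Lieutenant)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)"
)
# A capitalized word or two before ", Mass." is treated as a place.
PLACE_BEFORE_STATE = re.compile(
    r"\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?),\s+(Mass\.)"
)


def mark_up(text: str) -> str:
    """Wrap probable personal and place names in TEI-style tags."""
    text = PERSON_AFTER_TITLE.sub(r"\1 <persName>\2</persName>", text)
    text = PLACE_BEFORE_STATE.sub(r"<placeName>\1</placeName>, \2", text)
    return text


if __name__ == "__main__":
    print(mark_up("Lieutenant John Smith marched from Springfield, Mass. at dawn."))
```

Rules this simple will mislabel plenty of strings, which is presumably why the talk paired pattern matching with a gazetteer.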

Moises G. de Carvalho (Federal University of Minas Gerais) in "Learning to de-duplicate" demonstrated the use of genetic programming techniques to find and remove duplicate records from digital library collections.

Byung-Won On (Pennsylvania State University) advocated the use of a process called Quasi-Clique to cluster different named entities representing the exact same person in a paper entitled "An Effective approach to Entity Resolution Problem Using Quasi-Clique and its application to digital libraries." The process uses a graph to determine similarities and duplicates.

Duncan M. McRae-Spencer (University of Southampton) in "Also by the same author" described AKTiveAuthor as a technique for moving towards the Semantic Web by exploiting the fact that authors often cite themselves in papers.

In the discussion after the presentations, it was said that the problem of named entities (or authority control, as it is called in traditional libraries) is in a Golden Age. Solutions to these problems will be achieved through a combination of automation and human intervention. I thought expectations were being raised.

Archives Hub

I went to lunch with Clare Llewellyn and John Harrison who both work on the Archives Hub at the University of Liverpool. We discussed the Archives Hub and asked ourselves how it might be improved both technically and in conjunction with other JISC-sponsored programs. I was thoroughly impressed with their infrastructure and planned improvements, but I still wonder how the Archives Hub may be integrated with some of the other JISC services.

Augmenting interoperability across scholarly repositories

In the afternoon I listened to Herbert Van de Sompel (Los Alamos National Laboratory) describe the ideas behind a "Pathways" project. The project was very similar to the project he described at OAI4 -- a shared data model combined with a set of shared data services. He likes to create these models and then implement them in software, allowing newer and different technologies to be used interchangeably. He sees three interfaces in this system: 1) harvest, 2) obtain, and 3) put. I would add many more services than that, but most of my peers think my ideas regarding services go too far.

Carl Lagoze (Cornell University) similarly described Pathways and thought some of the features of the system should be the ability to get things as needed (live transfer). He also thought some sort of registry, like the Ockham Registry, would be a good idea. Much of the presentation came out of a set of discussions that may be available at http://msc.mellon.org/Meetings/Interop/.

Document analysis

Shaelei Feng (University of Massachusetts at Amherst) in "A Hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books" described a way to align scanned texts to improve optical character recognition systems.

In "Combining DOM tree and geometric layout analysis for online medical journal article segmentation" Jie Zou (National Library of Medicine) divided HTML documents into zones for the purpose of improving information retrieval.

Xiaonan Lu (Pennsylvania State University) described a method for extracting and providing information retrieval services against non-textual elements (images) of documents in "Automatic categorization of figures in scientific documents." The process depended on lines being identified in the images, and the authors concluded that they need to go beyond lines to include shapes.

Open information

The second day of the conference was opened by Jonathan Zittrain (Harvard University and Oxford University). Zittrain is a lawyer and an academic who recently argued a copyright case before the United States Supreme Court. A witty, quick-minded, engaging, and articulate speaker, Zittrain turned open information a bit on its head and discussed issues surrounding privacy on the 'Net. The popularity of MySpace was alluded to. He described how digital information, once made available, is very difficult to destroy. For example, retractions are notoriously difficult to accomplish in a digital environment. He sees the role of the library in a digital environment as the long-term keeper of information, but we must ask ourselves if we are going to delete data/information when necessary. He outlined eight reasons why information might need to be retracted: 1) national security, 2) privacy, 3) copyright, 4) fair use, 5) reputation/slander, 6) author's choice exemplified by the copyright laws in France, 7) government involvement, and 8) third party issues. He asked more questions about the roles of libraries and thought those roles center on recommendation/filtering and the building of collections. Zittrain is able to see really big pictures, "Because it is the size of Montana we often times do not know what we are doing." Privacy might be the size of Montana. Finally, privacy might be dying because it might be a generational thing. Some people don't mind being entirely public.

Digital Library Curriculum

Yongqing Ma (Victoria University of Wellington) in "Digital library education" described the results of a survey comparing and contrasting digital library curriculums. She advocates a shared digital library model for improving curriculums.

David Nichols (University of Waikato) and Stephen Downie (University of Illinois at Urbana-Champaign) in "Learning by building digital libraries" described how they were teaching students about digital libraries by creating collections with Greenstone. While their efforts were described as successful, I doubted that the students learned very much; the instruction described seemed shallow.

In "What do digital librarians do?" the results of another survey were presented, but the sample size was so small (thirty-nine people) that I wonder how relevant the results can be. The defining characteristic of "digital librarians" was the mixture of computer technology skills used in library settings. Not very definitive.

Jeffrey Pomerantz (University of North Carolina at Chapel Hill) in "Curriculum development for digital libraries" advocated a shared curriculum too. He advocated a set of modules for teaching purposes: 1) collection development, 2) digital objects, 3) metadata/cataloging, 4) architecture/interoperability, 5) visualization, 6) services, 7) intellectual property, 8) social issues, and 9) archiving and preservation.

Supporting education

Carl Lagoze in "Metadata aggregation and 'automated digital libraries'" gave a retrospective of the NSDL Core Integration process. He did not seem positive about the outcomes of the process and did not seem to believe the project was a success. He outlined seven reality checks: 1) people don't want to create metadata, 2) there is a knowledge gap in the creation of metadata (domain expertise, metadata knowledge, and technology knowledge), 3) harvested metadata is not necessarily useful metadata, 4) OAI-PMH is not necessarily a 'low barrier' protocol because XML is a pain, 5) the human cost of harvesting is high, 6) matching metadata records to the things described is difficult, and 7) good metadata does not make for a complete digital library. While he was sort of down on the results, I thought progress had still been made. With relatively few staff he was able to create a large and useful index of digital objects. The index (collection) may not be deep, but if the time, energy, and skills of traditional librarians were brought to bear, more impressive results could be obtained. I believe digital libraries need to learn more from traditional librarianship, and traditional librarianship needs to adopt the techniques of digital library researchers attending conferences such as JCDL and ECDL. Lagoze's paper won the Best Paper Award.

In "Using resources across educational digital libraries" Mimi Recker (Utah State University) described the Instructional Architect as a tool allowing users to search, annotate, and publish things from NSDL content. Like Lagoze's Core Integration service, Recker's service is used by few people, but the people who do use it seem to use it a lot.

Information retrieval

Illhoi Yoo (Drexel University) compared and contrasted a number of clustering techniques and in the end advocated the use of bisecting K-means as the most effective one in "A Comprehensive comparison study of document clustering for a biomedical digital library Medline".
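For readers unfamiliar with the technique, bisecting K-means starts with everything in one cluster and repeatedly splits one cluster in two until the desired number of clusters is reached. The sketch below is a toy version under assumptions of my own: it uses squared Euclidean distance on small, distinct numeric tuples, splits the cluster with the largest within-cluster error, and seeds each split deterministically rather than randomly. None of these choices are taken from the paper, which worked with Medline documents.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Centroid of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def _sse(points):
    """Sum of squared distances from each point to the centroid."""
    centre = _mean(points)
    return sum(_dist2(p, centre) for p in points)


def _two_means(points, iterations=20):
    """Split one cluster in two with plain 2-means.

    Seeding is an assumption made for repeatability: the point farthest
    from the centroid, then the point farthest from that one.
    """
    centre = _mean(points)
    first = max(points, key=lambda p: _dist2(p, centre))
    second = max(points, key=lambda p: _dist2(p, first))
    centres = [first, second]
    best = None
    for _ in range(iterations):
        left = [p for p in points
                if _dist2(p, centres[0]) <= _dist2(p, centres[1])]
        right = [p for p in points
                 if _dist2(p, centres[0]) > _dist2(p, centres[1])]
        if not left or not right:
            break  # keep the last split that had two non-empty halves
        best = (left, right)
        centres = [_mean(left), _mean(right)]
    return best


def bisecting_kmeans(points, k):
    """Return up to k clusters; points are assumed to be distinct tuples."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        # Split the cluster with the largest within-cluster error.
        target = max(splittable, key=_sse)
        clusters.remove(target)
        clusters.extend(_two_means(target))
    return clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0),
            (10, 10), (10, 11), (11, 10),
            (20, 0), (20, 1), (21, 0)]
    for cluster in bisecting_kmeans(data, 3):
        print(sorted(cluster))
```

A real document-clustering run would use term vectors and usually cosine similarity, and would typically try several random splits per bisection and keep the best one.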

Metadata in action

Day three began with George Buchanan (University of Wales) providing an overview of FRBR for computer scientists in "FRBR: Enriching and integrating digital libraries". He noted there is little discussion of FRBR in the digital library community, and he compared it to an object-oriented model of computing made up of publications, people, and subjects. He described a GUI FRBR editor and an alerting service based on FRBR. The most interesting issue, he thought, was the question "whose FRBR do you trust?"

Diana Tanase (Shodor Education Foundation) in "Scaffolding the infrastructure of the computational science digital library" described a website allowing computational scientists to collect, organize, and disseminate their simulations.

Alfredo Sanchez (Universidad de las Americas Puebla) described two applications he created making it easier to distribute content via OAI in a presentation called "Dynamic generation of OAI servers". VOAI creates a database and compiled code for data storage. XOAI works with XML databases through XQueries to generate OAI output.

At LANL there is a need to aggregate and homogenize the bibliographic information made available to them from publishers. Beth Goldsmith (Los Alamos National Laboratory) described how they used XSLT to convert various XML feeds into MARCXML for this purpose in "Looking back, looking forward: A metadata standard for LANL's aDORe repository".

William Moen (University of North Texas) in "Learning from artifacts" described the MCDU project. The purpose of the project is/was to evaluate the effectiveness of MARC in terms of cost-effectiveness and utility. While the project is incomplete it is his hope to create a required/highly recommended list of elements/tags used to describe information artifacts. I believe the most interesting thing Moen said was "There will not be one single metadata element set. There will always be a need for multiple descriptive schemes."

Summary

The conference was well-organized and attended by the usual suspects. The biggest take-away I got from the conference was the growing, albeit slowly growing, understanding that traditional librarianship has things to offer the research digital library community. The people here seem to realize the need for human interaction when it comes to the creation and maintenance of metadata. Yes, there is still optimism when it comes to automated means for the creation of metadata. Yes, there are problems using metadata harvested from OAI repositories. At the same time the traditional library community needs to take greater advantage of the computing techniques used by the digital library community. The two groups need to work together more. One spends a bit too much time re-inventing the wheel. The other puts its head in the sand and makes implementation decisions based on anecdotal evidence. Finally, a word of advice to some of the presenters: practice your public speaking skills.

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This text was never formally published.
Date created: 2006-06-27
Date updated: 2006-06-27
Subject(s): digital libraries; JCDL; travel log;
URL: http://infomotions.com/musings/jcdl-2006/