Open Repositories, 2007: A Travelogue

This text documents my experiences at the Open Repositories 2007 conference, January 22-26, San Antonio (Texas). In a sentence I learned two things: 1) institutional repository software such as Fedora, DSpace, and EPrints is increasingly being used for more than open access publishing efforts, and 2) the Web Services API of Fedora makes it relatively easy for developers using any programming language to interface with the underlying core.

Gruene
Gruene water tower
not Gruene, brown

I was off to the races, and began with Matt Zumwalt's "Simplifying Fedora Frontends with XForms and Fedora Disseminators". Zumwalt strongly advocated the use of W3C XForms to make the creation of user interfaces against Fedora systems easier. To do this you first need to understand your content model and what you want to do with it. Next, create a set of XForms outlining those functions, and finally transform the XForms into XHTML for display in a browser. This process lends itself well to Fedora because of its API. It also separates business logic from presentation. I was glad to learn about XForms.

Peter Murray in "Building an Institutional Repository Interface Using EJB3 and JBoss Seam" advocated the use of Fedora as the underlying technology for an OhioLink-wide repository system. The OhioLink system is made up of large and small institutions. To some degree they all want to participate in repository efforts, but each desires to have different content, different functionality, and a different look & feel. Fedora, with its API, will be able to support these requirements and allow the system to be centralized at the same time. For more information see: info.drc.ohiolink.edu/.

Chi Nguyen in "Federated Authentication and Authorization for Fedora" described sets of middleware he was writing to support more finely tuned access control for repositories. He characterized repositories as needing to be collaborative and to provide fast access to content. For these reasons authentication and authorization are important. He gave an overview of Shibboleth, described how he was creating a Shibboleth module, and praised Fedora along the way because his work required neither modifications to Fedora nor a GUI.

Leslie Johnson in "Best Practices in Developing a Digital Library Repository Using Fedora" described a number of "objects" the University of Virginia employs to create and maintain their repository: 1) workflow objects, 2) aggregation objects, 3) media objects, and 4) work objects. Each of these objects plays a part in the creation of their digital libraries and they revolve around questions regarding metadata, preservation, and usefulness. They are dealing with legacy data and data born yesterday. They are dealing with tiny objects (4K images) and huge data sets. The policies and procedures applied against these objects are constantly in flux, and consequently the "best practices" are in continual development. "Documenting these best practices is important to the institution", Johnson says.

In "Fedora Outreach and Communications: A Working Group Report" both Stacy Pennington and Carol Minton Morris described the results of a survey against Fedora-hosting institutions to learn how it was being used. While the sample size was small (if not tiny, fifteen surveys) they discovered Fedora is being used for a wide variety of purposes that can be classified into four groups: 1) e-science, 2) museum & library displays, 3) education, and 4) scholarly communication. Pennington and Morris plan to use this information to develop sets of communication statements in an effort to broaden the base of Fedora installations.

Chris Wilper and Aaron Birkland advocated an alternative triple-store for Fedora in a presentation called "MPTStore: Implementing a Fast, Scalable, and Stable RDBMS-backed Triple-store for Fedora and the NSDL". In short, they experimented with the parsing of resource descriptions into triples (names, predicates, and values) and storing them in separate fields of a relational database instead of a single field. Based on their experiments, system performance was improved dramatically and this is something they advocate in future implementations of Fedora. For more information see: mptstore.sourceforge.net
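To make the idea of "separate fields" concrete, here is a minimal sketch of triples stored as three columns of a relational table. It is not MPTStore's actual schema, which I have not reproduced here. The table name, the index, the sample identifiers, and the helper functions are all hypothetical, and SQLite stands in for whatever database a real installation would use.

```python
import sqlite3

# Hypothetical resource descriptions, already parsed into three-part triples.
TRIPLES = [
    ("info:fedora/demo:1", "dc:title", "Gruene water tower"),
    ("info:fedora/demo:1", "dc:creator", "Morgan"),
    ("info:fedora/demo:2", "dc:title", "Gruene Mansion"),
]


def build_store(triples):
    """Load triples into an in-memory table with one column per part."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    # Separate columns can be indexed, which a single text field cannot.
    conn.execute("CREATE INDEX idx_predicate ON triples (predicate, object)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


def subjects_with(conn, predicate, obj):
    """Return the subjects having the given predicate/object pair."""
    rows = conn.execute(
        "SELECT subject FROM triples "
        "WHERE predicate = ? AND object = ? ORDER BY subject",
        (predicate, obj),
    )
    return [row[0] for row in rows]


store = build_store(TRIPLES)
print(subjects_with(store, "dc:title", "Gruene Mansion"))
```

Because each part of the triple lives in its own column, the database can answer "which objects have this title?" with an ordinary indexed query rather than by scanning and parsing a blob of text.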

Sandy Payette in "Sustainability and the Vision for the Fedora Commons" outlined where the Fedora Project is headed as it takes its next step toward becoming self-sustaining. She echoed what had been mentioned previously, namely that Fedora is used for a number of different purposes: collaboration, data & publication sharing, repositories of blogs & wikis, reviews & annotations of objects, museum & library exhibits, lesson plans, etc. She outlined a number of "keys to the puzzle" that Fedora seems to piece together: collaboration, repository, preservation, and process management. "Fedora attempts to address the entire life-cycle of a digital object." Working with the Mellon Foundation, it is quite likely a thing called the Fedora Commons will come into being. A non-profit organization, it will support software developers, cultural heritage centers, and community building. The goal seems to be to reach a steady state by the year 2010, with revenue generated through dues.

In "Content Interchange and the Invisible Repository" Scott Yeadon advocated for a greater number of metadata interchange specifications to enable the sharing of content. Building on the strengths of XML, many interchange specifications can be articulated. Features of such specifications will make things like ingest, dissemination, sustainability, and changes in repository software easier. The DSpace SIP, AIP, and DIP do this to a great degree. He demonstrated how such objects can be exported from things like the Open Journal System (OJS) and into DSpace. Other ways these "packages" can be used were demonstrated with Google Maps. For more details try: dspace-dev.anu.edu.au/dspace-xmlui/handle/1030.58/19507.

The opening keynote, entitled "Open Source for Open Repositories: New Models for Software Development and Sustainability", was given by James L. Hilton. He began with two over-arching statements: 1) "Open repositories have the same potential as the printing press", and 2) "We have a moment in time to build repositories in a collaborative environment." He elaborated by first mentioning Larry Ellison as his hero for open source. Why? Because Ellison, through fear, has made people re-think long-lost, closed questions regarding software supporting the enterprise. "We could become hostages to the software we use to do our everyday business." A similar "hero" is Michael Chasen of Blackboard and for the same reasons. He went on to equate open source software with a "free" puppy (whereas I like to use the analogy of a "free" kitten), and he strongly emphasized the need for collaboration and the "right partners" in order for open source software to work. To back up his statements he described his work in the development of Sakai -- an open source courseware system. Of particular interest to me was the way each Sakai development partner promised to implement the system when it was completed. Additionally, ongoing development and maintenance of the system is expected to be supported by dues-paying members of a foundation. (This is the same model being pursued by the Fedora Project.) He sees many benefits to open source software: destiny control, the building of communities, the unbundling of software ownership & support, and the ability to leverage links between software. Finally, he outlined a number of challenges regarding open source software: are the intellectual property right issues "clean", are patents looming, there is no scape-goat ("Who can we sue?"), it is not a silver bullet, licensing matters, communities just don't happen without governance and collaboration, discipline is required, the process requires trust, and there needs to be a balance between pragmatics and ideals.
Hilton is a man after my own heart.

downtown Gruene
Gruene Mansion
Gruene home

Andrew Treloar presented "The ARROW Project at 3 Years: Looking Backwards, Aiming Forwards". In it he described the creation and maintenance of a repository project using Fedora for the underlying software with the support of a third party -- VTLS. In the end he articulated the need for tight specifications and lots of communication as keys to success.

Leslie Johnson in "How the Principles and Activities of Digital Curation Guide Repository Management and Operations" described the characteristics of her university library's digitization guidelines. Some of these characteristics included: unique and rare materials, things useful for teaching & learning, things where there are value-added possibilities, formats that can be used over & over, whether or not they (the library) can fulfill a sense of trustworthiness, whether or not they can support authentication & authorization, and finally, whether or not they can provide sustainability and preservation of the materials.

Atsuke Takano shared her experiences with the creation of one of the first library-sponsored institutional repositories in Japan in a presentation called "CURATOR: Its Development Strategy". One of the most interesting phrases she used to describe the repository's collection policy was "principled promiscuity" -- meaning, more or less, just collect everything. The repository was slow going, starting in 2002 and currently containing about 2,000 items. The repository does not use Fedora, DSpace, or Eprints. It is indexed by SCIRUS, and it contains the usual suspects in terms of content: images, data sets, working papers, articles, etc.

poster

As a part of the Poster Session Minute Madness Tyler Walters and Eric Lease Morgan encouraged people to visit their (our) poster. The poster described how each of our respective institutions (Georgia Tech and University of Notre Dame) adds value to its repository by providing additional services against its content. For example, Georgia Tech provides a What's New? sort of feature as well as support for adding content for faculty. Notre Dame illustrated the use of "widgets" to syndicate content as well as displaying Google ranks. Keywords describing the poster included: Web of Science to DSpace, javascript, RSS, Google PageRanks, DSpace, DigiTool, ETD-db, SRU, OAI, MyLibrary, highlight my department, working papers, What's new?, syndication, name authority, services to faculty, open source, images, theses & dissertations, undergrad research, metadata, and make my publications list.

MacKenzie Smith advocated the articulation and creation of repository policies in the form of METS files in "Policy Frameworks for Institutional Repositories". Through the creation of these policies it will be possible to share the policies between systems as well as be able to configure services around them. "Policies are a matter of measuring trust against the data for your repository."

Joan Smith described her latest work with mod_oai in "Using OAI-PMH Resource Harvesting and MPEG-21 DIDL for Digital Preservation". Much of Smith's motivation surrounded the ability to preserve entire websites. Such a thing is not necessarily possible using crawlers like wget, etc., because links may be broken. Furthermore, content may not be distributed in a form amenable to many clients. The use of mod_oai coupled with DIDL is intended to address this problem. Install mod_oai as a part of an Apache HTTP server. Configure a few plug-ins to generate automatic sets of keywords, point your OAI client to your Web server, harvest content, and mirror. See: www.modoai.org.
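The "point your OAI client to your Web server" step can be sketched as follows. This is only an illustration of the general OAI-PMH ListRecords exchange, not mod_oai's own client code. The base URL, the "oai_didl" metadata prefix, and the sample response are assumptions made for the sake of the example, and no network request is actually made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # When resuming, the token is sent in place of the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token or None)


# Hypothetical, heavily abbreviated response from a harvested Web server.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>http://www.example.org/index.html</identifier></header></record>
    <record><header><identifier>http://www.example.org/logo.png</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://www.example.org/modoai", "oai_didl"))
identifiers, token = parse_identifiers(SAMPLE)
print(identifiers, token)
```

A harvester would keep requesting with each new resumption token until none is returned, mirroring every resource along the way.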

Miguel Ferreira described CRiB as a Service-Oriented Architecture for recommending which format a digital object should be migrated to for preservation purposes in "CRiB: Preservation Services for Digital Repositories".

In "Making Fedora Easier to Implement with Fez: A Free Open Source Content Model and Workflow Management Front-end to Fedora" Christian Kortekaas described the creation and management of user interfaces to Fedora including features such as: access control, GUI, indexing, Google Suggest, and preservation services employing PREMIS.

In "The Eprints Application Profile: A FRBR Approach to Modeling Repository Metadata" Julie Allinson made a case for using the Dublin Core Abstract Model for describing repository content. The advantages and disadvantages of "flat" Dublin Core were articulated. It was then compared to the Abstract Model, with its roots in FRBR. By adding entities and relationships to Dublin Core, Allinson hopes to make the description of content more meaningful. Alas, after doing some follow-up reading on the Dublin Core Abstract Model, I still feel a bit in the dark.

Matthias Razum in "eSciDoc: A Scholarly Information and Communication Platform for the Max Planck Society" described a repository effort intended to work more like human memory and less like information silos. To do so he advocated storing things centrally, providing services against the store, protecting the services with access control, and finally building applications around this central core while making any interfaces as independent as possible.

C. Lee Giles described the challenges of building chemistry repositories in "ChemXSeer: A Chemistry Web Portal for Scientific Literature and Datasets". The discipline of chemistry has traditionally not been as open as computer science or physics when it comes to sharing its content. Incorporating some of the ethos of the chemistry discipline in repository applications makes the process more difficult. Similarly, chemistry is primarily not text-based, making traditional text searching a challenge. Building on the success of CiteSeer, ChemXSeer hopes to bring aspects of open access publishing to chemistry.

Tony Hey gave the closing keynote address called "e-Science and Scholarly Communication". His well-balanced presentation described what he saw as an evolution in the scientific process. The oldest science (think Galileo) was experimental. Science then became more theoretical (think Einstein). Using computation he sees science moving in a new direction. By gathering, sharing, and synthesizing data & information in more systematic ways, Hey sees science evolving. He sees science as becoming data-centric. Take a mini-volcano as an example. Place sensors at the volcano. Monitor the sensors. Record the volcano's activity. Share the data freely and with many communities. Allow people to draw their own conclusions from the data and combine it with other data/information. He sees new types of peer-review in the form of social networking, and new types of ranking beyond journal impact. The keys to success in this regard surround acquisition, preservation, and open access of data & information.

Reflections

cabbage

In the past couple of years I have had the opportunity to attend a number of technology-heavy, library-related conferences. This was one of those conferences. During my attendance and upon my return I am often dismayed by the gap between the innovative ideas being shared at the conferences ("research") and the degree to which these ideas are being implemented in academic libraries. "If academic libraries are a part of the 'research' community, then why aren't these ideas being put into practice more quickly? After all, the ideas don't have very far to travel." Examples can be seen from the information retrieval (IR) world regarding automatic classification and relevance ranking systems, and the non-adoption of these techniques in online catalogs. I suppose this gap has always existed. I just have the opportunity to see it more often than others.

The reason I mention this is because this particular conference was not quite like that. These repository applications were being used for production services as well as research -- exploratory -- investigations. Going into this conference I had mistakenly thought it was going to be about open access publishing and scholarly communication. No, it was more about collecting, organizing, archiving & preserving, and then re-disseminating digital content. Put another way, the conference was about practical digital library activities.

Of the three applications (Fedora, DSpace, and Eprints), Fedora seems to have the most promise in terms of scalability. While all the systems are open, Fedora seems to have the most open API. On the other hand DSpace seems to have the widest adoption, probably because it is more like a turn-key application. Fedora is more like a tool-kit.

Finally, since I was so impressed with the Fedora model, I got to thinking about one of my own pet projects -- MyLibrary. It would not be difficult to write a set of programs implementing a Web Services API against MyLibrary. MyLibrary's strength is its ability to read and write to a specific database schema. It essentially includes three tables: 1) resources whose "content model" is rooted in Dublin Core, 2) people whose "content model" is rooted in FOAF, and 3) facet/term combinations providing controlled vocabulary services against the resources and people. The problem with MyLibrary is the diminishing number of people who know how to write things in Perl. PHP, Java, Ruby On Rails, etc. are becoming more popular. Through Web Services this would be a non-problem, almost. Send a (CGI) script a URL containing a set of well-defined name-value pairs. Use the MyLibrary Perl API to process the query. Return to the client a similarly well-defined stream of XML. Allow the client to transform the data into services and applications. If MyLibrary had a Web Services API, then people could use PHP, Java, Ruby On Rails, etc. and still benefit from MyLibrary's "content model". Food for my thought. Hmmm...
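The request/response cycle I have in mind might look something like the sketch below. It is written in Python rather than Perl simply to show that the idea is language-neutral, and none of it is the real MyLibrary API. The parameter names ("cmd", "term"), the XML element names, and the in-memory list standing in for the resources table are all hypothetical.

```python
from urllib.parse import parse_qs
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the resources table and its facet/term pairs.
RESOURCES = [
    {"id": "1", "title": "Open Repositories, 2007: A Travelogue",
     "facet": "Formats", "term": "Essays"},
    {"id": "2", "title": "Fedora",
     "facet": "Tools", "term": "Repositories"},
]


def handle_request(query_string):
    """Turn name-value pairs into a well-defined stream of XML."""
    params = parse_qs(query_string)
    cmd = params.get("cmd", [""])[0]
    root = ET.Element("response")
    if cmd != "search":
        ET.SubElement(root, "error").text = "unknown command"
        return ET.tostring(root, encoding="unicode")
    term = params.get("term", [""])[0]
    for resource in RESOURCES:
        if resource["term"] == term:
            element = ET.SubElement(root, "resource", id=resource["id"])
            ET.SubElement(element, "title").text = resource["title"]
    return ET.tostring(root, encoding="unicode")


# The query string is what a CGI script would receive after the "?" in a URL.
print(handle_request("cmd=search&term=Repositories"))
```

A client written in PHP, Java, or anything else would only need to build the URL and parse the returned XML, never touching the underlying database schema.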

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This file was never formally published.
Date created: 2007-02-11
Date updated: 2007-02-13
Subject(s): Gruene, Texas; institutional repositories; digital libraries; travel log;
URL: http://infomotions.com/musings/open-repositories-2007/