Code4Lib Conference, 2011

This posting documents my experience at the 2011 Code4Lib Conference, February 8-10 in Bloomington (Indiana). In a sentence, the Conference was well-organized, well-attended, and demonstrated the overall health and vitality of this loosely structured community. At the same time, I think the format of the Conference will need to evolve if it is to contribute significantly to the library profession.

student center

Day #1 (Tuesday, February 8)

The Conference officially started on Tuesday, February 8 after the previous day’s round of pre-conference activities. Brad Wheeler (Indiana University) gave the introductory remarks. He alluded to the “new normal”, and said significant change only happens when there are great leaders or financial meltdowns such as the one we are currently experiencing. In order to find stability in the current environment he advocated true dependencies and collaborations, and he outlined three tensions: 1) innovation versus solutions at scale, 2) local-ness versus cloudiness, and 3) proprietary versus open. All of these things, he said, are false dichotomies. “There needs to be a balance and mixture of all these tensions.” Wheeler used his experience with Kuali as an example and described personal behavior, a light-weight organization, and local goals as the “glue” making Kuali work. Finally, he said the library community needs to go beyond “toy” projects and create something significant.

The keynote address, Critical collaborations: Programmers and catalogers? Really?, was given by Diane Hillmann (Metadata Management). In it she advocated greater collaboration between catalogers and coders. “Catalogers and coders do not talk with each other. Both groups get to the nitty-gritty before there is an understanding of the problem.” She said change needs to happen, and it should start within our own institutions by learning new skills and having more cross-departmental meetings. Like Wheeler, she had her own set of tensions: 1) “cool” services versus the existing online public access catalog, and 2) legacy data versus prospective data. She said both communities have things to learn from each other. For example, catalogers need to learn to use data that is not created by catalogers, and catalogers need not always look for leadership from “on high”. I asked what the coders needed to learn, but I wasn’t sure what the answer was. She strongly advocated RDA (Resource Description and Access), and said, “It is ready.” I believe she was looking to the people in the audience as people who could create demonstration projects to show to the wider community.

Karen Coombs (OCLC) gave the next presentation, Visualizing library data. In it she demonstrated a number of ways library information can be graphed through the use of various mash-up technologies: 1) a map of holdings, 2) QR codes describing libraries, 3) author timelines, 4) topic timelines, 5) FAST headings in a tag cloud, 6) numbers of libraries, 7) tree relationships between terms, and 8) pie charts of classifications. “Use these things to convey information that is not a list of words”.

In Hey, Dilbert, where’s my data?, Thomas Barker (University of Pennsylvania) described how he is aggregating various library data sets into a single source for analysis.

Tim McGeary (Lehigh University) shared a Kuali update in Kuali OLE: Architecture of diverse and linked data. OLE (Open Library Environment) is the beginning of an open source library management system. Coding began this month (February) with goals to build community, implement a “next-generation library catalog”, re-examine business operations, break away from print models of doing things, create an enterprise-level system, and reflect the changes in scholarly work. He outlined the structure of the system and noted three “buckets” for holding different types of content: 1) descriptive — physical holdings, 2) semantic — conceptual content, and 3) relational — financial information. They are scheduled to release their first bits of code by July.

Cary Gordon (The Cherry Hill Company) gave an overview of Drupal 7 functionality in Drupal 7 as a rapid application development tool. Of most interest to me was the Drupal credo, “Sacrifice the API. Preserve the data.” In the big scheme of things, this makes a lot of sense to me.

After lunch, first up was Josh Bishoff (University of Illinois) with Enhancing the mobile experience: mobile library services at Illinois. The most important take-away was the difference between a mobile user experience and a desktop user experience. They are not the same. “This is not a software problem but rather an information architecture problem.”

Scott Hanrath (University of Kansas) described his participation in the development of Anthologize in One week, one tool: Ultra-rapid open source development among strangers. He enumerated the group’s three criteria for success: 1) usefulness, 2) low walls & high ceilings, and 3) feasibility. He also attributed the project’s success to extraordinary outreach efforts — marketing, good graphic design, blurbs, logos, etc.


In VuFind beyond MARC: Discovering everything else, Demian Katz (Villanova University) described how VuFind supports the indexing of non-MARC metadata through the use of “record drivers”. Acquire metadata. Map it to Solr fields. Index it while denoting it as a special metadata type. Search. Branch according to metadata type. Display. He used Dublin Core OAI-PMH metadata as an example.
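The map-then-index pattern described above can be sketched in a few lines. This is a minimal illustration, not VuFind’s actual record-driver code; the Solr field names (title, author, record_format) and the sample record are invented for the example.

```python
# Sketch of the "record driver" pattern: map an incoming oai_dc record
# to flat Solr-style fields, tagging the document with its metadata type
# so the display layer can branch on it later.
import xml.etree.ElementTree as ET

DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def dc_to_solr(record_xml: str) -> dict:
    """Map a Dublin Core record to a flat, Solr-style document."""
    root = ET.fromstring(record_xml)
    return {
        "record_format": "dublin_core",  # lets the front end branch by type
        "title": [e.text for e in root.findall("dc:title", DC_NS)],
        "author": [e.text for e in root.findall("dc:creator", DC_NS)],
        "topic": [e.text for e in root.findall("dc:subject", DC_NS)],
    }

record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Walden</dc:title>
  <dc:creator>Thoreau, Henry David</dc:creator>
  <dc:subject>Natural history</dc:subject>
</record>"""

print(dc_to_solr(record))
```

A real driver would then POST these dictionaries to Solr and, at display time, dispatch on the record_format field.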

The last formal presentation of the day was entitled Letting in the light: Using Solr as an external search component by Jay Luker and Benoit Thiell (Astrophysics Data System). ADS is a bibliographic information system for astronomers. It uses a pre-print server originally developed at CERN. They wanted to keep as much of the functionality of the original server as possible but enhance it with Solr indexing. They described how they hacked the two systems to allow the searching and retrieving of millions of records at a time. Of all the presentations at the Conference, this one was the most computer science-like.

The balance of the day was given over to breakout sessions, lightning talks, a reception in the art museum, and craft beer drinking in the hospitality suite. Later that evening I retired to my room and hacked on Twitter feeds. “What do library programmers do for a good time?”

Day #2 (Wednesday, February 9)

The next day began with a presentation by my colleagues at Notre Dame, Rick Johnson and Dan Brubakerhorst. In A Community-based approach to developing a digital exhibit at Notre Dame using the Hydra Framework, they described how they are building and maintaining a digital library framework based on a myriad of tools: Fedora, ActiveFedora, Solr, Hydrangea, Ruby, Blacklight. They gave examples of ingesting EAD files. They are working on an ebook management application. Currently they are building a digitized version of city plans.

I think the most inspiring presentation was by Margaret Heller (Dominican University) and Nell Taylor (Chicago Underground Library) called Chicago Underground Library’s community-based cataloging system. Taylor began and described a library of gray literature. Poems. Comics. All manner of self-publications were being collected and loosely cataloged in order to increase the awareness of the materials and record their existence. The people doing the work have little or no cataloging experience. They decided amongst themselves what metadata they were going to use. They wanted to focus on locations and personal characteristics of the authors/publishers of the material. The whole thing reminded me of the times I suggested cataloging local band posters because somebody will find everything interesting at least once.

Gabriel Farrell (Drexel University) described the use of a non-relational database called CouchDB in Beyond sacrilege: A CouchApp catalog. With a REST-ful interface, complete with change log replication and different views, CouchApp seems to be cool as well as “kewl”.

Matt Zumwalt (MediaShelf) in Opinionated metadata: Bringing a bit o’ sanity to the world of XML metadata described OM, which looked like a programmatic way of working with XML in Ruby, but I thought his advice on how to write good code was more interesting. “Start with people’s stories, not the schema. Allow the vocabulary to reflect the team. And talk to the other team members.”

Ben Anderson (eXtensible Catalog) in Enhancing the performance and extensibility of XC’s metadata services toolkit outlined the development path and improvements to the Metadata Services Toolkit (MST). He had a goal of making the MST faster and more robust, and he did much of this by taking greater advantage of MySQL as opposed to processing various things in Solr.

power supply
water cooler

In Ask Anything! a.k.a. the ‘Human Search Engine’ moderated by Dan Chudnov (Library of Congress), a number of people stood up, asked the group a question, and waited for an answer. The technique worked pretty well and enabled many people to identify many others who: 1) had similar problems, or 2) offered solutions. For better or for worse, I asked the group if they had any experience with issues of data curation, and I was “rewarded” for my effort with the responsibility to facilitate a birds-of-a-feather session later in the day.

Standing in for Mike Grave, Tim Shearer (University of North Carolina at Chapel Hill) presented GIS on the cheap. Using different content from different sources, Grave is geo-tagging digital objects by assigning them latitudes and longitudes. Once this is done, his Web interfaces read the tagging and place the objects on a map. He is using a JavaScript library called OpenLayers for the implementation.
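The cheap part of the approach is that once objects carry latitudes and longitudes, serializing them for a map library is trivial. Here is a hedged sketch: the object records are invented, and I am assuming the common pattern of emitting GeoJSON, which OpenLayers can consume directly.

```python
# Turn lat/long-tagged digital objects into a GeoJSON FeatureCollection
# that a JavaScript map library can plot.
import json

def to_geojson(objects):
    """Convert (title, lat, lon) records into a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            # GeoJSON coordinate order is longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"title": title},
        }
        for title, lat, lon in objects
    ]
    return {"type": "FeatureCollection", "features": features}

objects = [
    ("Old Well photograph", 35.912, -79.051),
    ("Campus map, 1907", 35.910, -79.048),
]
print(json.dumps(to_geojson(objects), indent=2))
```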

In Let’s get small: A Microservices approach to library websites, Sean Hannan (Johns Hopkins University) showed us how he uses a myriad of tools and libraries to build websites. While the number of tools and libraries seemed overwhelming, I was impressed by the system’s completeness. He was practicing the Unix Way of website maintenance.

When a person mentions the word “archives” at a computer conference, one of the next words people increasingly mention is “forensics”, and Mark Matienzo (Yale University) in Fiwalk with me: Building emergent pre-ingest workflows for digital archival records using open source forensic software described how he uses forensic techniques to read, organize, and preserve digital media — specifically hard drives. He advocated a specific workflow for doing his work, a process for analyzing a disk’s content with a program called Gumshoe, and Advanced Forensic Framework 4 (AFF4) for doing forensics against file formats. Ultimately he hopes to write an application binding the whole process together.

I paid a lot of attention to David Lacy (Villanova University) when he presented (Yet another) home-grown digital library system, built upon open source XML technologies and metadata standards because the work he has done directly affects a system I am working on, colloquially called the “Catholic Portal”. Lacy described a digital library system complete with METS files, a build process, an XML database, and an OAI-PMH server. Content is digitized, described, and ingested into VuFind. I feel embarrassed that I had not investigated this more thoroughly before.

Break-out (birds-of-a-feather) sessions were up next and I facilitated one on data curation. Between ten and twelve of us participated, and in a nutshell we outlined a whole host of activities and issues surrounding the process of data management. After listing them all and listening to the things discussed more thoroughly by the group I was able to prioritize. (“Librarians love lists.”) At the top was, “We won’t get it right the first time”, and I certainly agree. Data management and data curation are the new kids on the block and consequently represent new challenges. At the same time, our profession seems obsessed with creating processes and implementations without evaluating those processes as needed. In our increasingly dynamic environment, such a way of thinking is not feasible. We will have to practice. We will have to show our ignorance. We will have to experiment. We will have to take risks. We will have to innovate. All of these things assume imperfection from the get-go. At the same time the issues surrounding data management have a whole lot in common with issues surrounding just about any other medium. The real challenge is the application of our traditional skills to the current environment. A close second in the priorities was the perceived need for cross-institutional teams — groups of people including the office of research, libraries, computing centers, legal counsel, and of course the researchers who generate data. Everybody has something to offer. Everybody has parts of the puzzle. But no one has all the pieces, all the experience, nor all the resources. Successful data management projects — defined in any number of ways — require skills from across the academe.
Other items of note on the list included issues surrounding: human subjects, embargoing, institutional repositories versus disciplinary repositories, a host of ontologies, format migration, storage and back-up versus preservation and curation, “big data” and “little data”, entrenching one’s self in the research process, and unfunded mandates.

text mining

As a part of the second day’s Lightning Talks I shared a bit about text mining. I demonstrated how the sizes of texts — measured in words — could be things we denote in our catalogs thus enabling people to filter results in an additional way. I demonstrated something similar with Fog, Flesch, and Kincaid scores. I illustrated these ideas with graphs. I alluded to the “colorfulness” of texts by comparing & contrasting Thoreau with Austen. I demonstrated the idea of “in the same breath” implemented through network diagrams. And finally, I tried to describe how all of these techniques could be used in our “next generation library catalogs” or “discovery systems”. The associated video, here, was scraped from the high quality work done by Indiana University. “Thanks guys!”
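Scores like these are straightforward to compute from raw counts. Below is a minimal sketch of the Flesch-Kincaid grade level using a crude vowel-group syllable counter; real readability tools use proper syllabification, so treat the numbers as rough approximations.

```python
# Flesch-Kincaid grade level: 0.39 * (words/sentences)
#                           + 11.8 * (syllables/words) - 15.59
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels; floor of one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (syllables / len(words)) - 15.59)

print(round(kincaid_grade("I went to the pond. The pond was still."), 2))
```

Stored as a catalog field, a score like this would let a reader filter search results by reading difficulty, which was the point of the talk.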

At the end of the day we were given the opportunity to visit the University’s data center. It sounded a lot like a busman’s holiday to me so I signed up for the 6 o’clock show. I got on the little bus with a few other guys. One was from Australia. Another was from Florida. They were both wondering whether or not the weather was cold. It being around 10° Fahrenheit I had to admit it was. The University is proud of their data center. It can withstand tornado-strength forces. It is built into the side of a hill. It is only half full, if that, which is another way of saying, “They have a lot of room to expand.” We saw the production area. We saw the research area. I was hoping to see lots of blinking lights and colorful, twisty cables, but the lights were few and the cables were all blue. We saw Big Red. I wanted to see where the network came in. “It is over there, in that room”. Holding up my hands I asked, “How big is the pipe?”. “Not very large,” was the reply, “and the fiber optic cable is only the size of a piece of hair.” I thought the whole thing was incongruous. All this infrastructure and it literally hangs on the end of a thread. One of the few people I saw employed by the data center made a comment while I was taking photographs. “Those are the nicest packaged cables you will ever see.” She was very proud of her handiwork, and I was happy to take a few pictures of them.

Big Red

Day #3 (Thursday, February 10)

The last day of the conference began with a presentation by Jason Casden and Joyce Chapman (North Carolina State University Libraries) with Building an open source staff-facing tablet app for library assessment. In it they first described how patron statistics were collected. Lots of paper. Lots of tallies. Lots of data entry. Little overall coordination. To resolve this problem they created a tablet-based tool allowing the statistics collector to roam through the library, quickly tally how many people were located where and doing what, and update a centralized database rather quickly. Their implementation was an intelligent use of modern technology. Kudos.

Ian Mulvany (Mendeley) was a bit of an entrepreneur when he presented Mendeley’s API and university libraries: Three examples to create value on behalf of Jan Reichelt. His tool, Mendeley, is intended to solve real problems for scholars: making them more efficient as writers, and more efficient as discoverers. To do this he provides a service where PDF files are saved centrally, analyzed for content, and enhanced through crowdsourcing. Using Mendeley’s API, things such as reading lists, automatic repository deposit, or “library dashboard” applications could be written. As of this writing Mendeley is sponsoring a contest with cash prizes to see who can create the most interesting application from their API. Frankly, the sort of application described by Reichelt is the sort of application I think the library community should have created a few years ago.

In Practical relevancy testing, Naomi Dushay (Stanford University) advocated doing usability testing against the full LAMP stack. To do this she uses a program called Cucumber to design usability tests, run them, look at the results, adjust software configurations, and repeat.

Kevin Clarke (NESCent) in Sharing between data repositories first compared & contrasted two repository systems: Dryad and TreeBASE. Both have their respective advantages & disadvantages. As a librarian he understands why it is a good idea to have the same content in both systems. To this end he outlined and described how such a goal could be accomplished using a file packaging format called BagIt.

The final presentation of the conference was given by Eric Hellman (Gluejar, Inc) and called Why (Code4) libraries exist. In it he posited that more than half of the books sold in the near future will be in ebook format. If this happens, then, he asked, will libraries become obsolete? His answer was seemingly both no and yes. “Libraries need to change in order to continue to exist, but who will drive this change? Funding agencies? Start-up companies? Publishers? OCLC? ILS vendors?” None of these, he said. Instead, it may be the coders, but we (the Code4Lib community) have a number of limitations. We are dispersed, poorly paid, self-trained, and too practical. In short, none of the groups he outlined entirely have what it takes to keep libraries alive. On the other hand, he said, maybe libraries are not really about books. Instead, maybe, they are about space, people, and community. In the end Hellman said, “We need to teach, train, and enable people to use information.”

conference center
hidden flywheel


All in all the presentations were pretty much what I expected and pretty much what was intended. Everybody was experiencing some sort of computing problem in their workplace. Everybody used different variations of the LAMP stack (plus an indexer) to solve their problems. The presenters shared their experience with these solutions. Each presentation was like a variation on a 12-bar blues. A basic framework is assumed, and the individual uses the framework to create beauty. If you like the idea of the blues framework, then you would have liked the Code4Lib presentations. I like the blues.

In the past eight months I’ve attended at least four professional conferences: Digital Humanities 2010 (July), ECDL 2010 (September), Data Curation 2010 (December), and Code4Lib 2011 (February). Each one had about 300 people in attendance. Each one had something to do with digital libraries. Two were more academic in nature. Two were more practical. All four were communities unto themselves; at each conference there were people of the in-crowd, newcomers, and folks in between. Many, but definitely not most, of the people I saw were a part of the other conferences, but none of them were at all four. All of the conferences shared a set of common behavioral norms and at the same time owned a set of inside jokes. We need to be careful and not go around thinking our particular conference or community is the best. Each has something to offer the others. I sincerely do not think there is a “best” conference.

The Code4Lib community has a lot to offer the wider library profession. If the use of computers in libraries is only going to grow (which is an understatement), then a larger number of people who practice librarianship will need/want to benefit from Code4Lib’s experience. Yet the existing Code4Lib community is reluctant to change the format of the conference to accommodate a greater number of people. Granted, larger numbers of attendees make it more difficult to find venues and to enable a single shared conference experience, and they necessitate increased governance and bureaucracy. Such are the challenges of a larger group. I think the Code4Lib community is growing and experiencing growing pains. The mailing list increases by at least one or two new subscribers every week. The regional Code4Lib meetings continue. The journal is doing just fine. Code4Lib is a lot like the balance of the library profession. Practical. Accustomed to working on a shoestring. Service oriented. Without evolving in some way, the knowledge of Code4Libbers is not going to have a substantial effect on the wider library community. This makes me sad.

Next year’s conference — Code4Lib 2012 — will be held in Seattle (Washington). See you there?


ECDL 2010: A Travelogue

This posting outlines my experiences at the European Conference on Digital Libraries (ECDL), September 7-9, 2010 in Glasgow (Scotland). From my perspective, many of the presentations were about information retrieval and metadata, and the advances in these fields felt incremental at best. This does not mean I did not learn anything, but it does reinforce my belief that find is no longer the current problem to be solved.

University of Glasgow
vaulted ceiling
Adam Smith

Day #1 (Tuesday, September 7)

After the usual logistic introductions, the Conference was kicked off with a keynote address by Susan Dumais (Microsoft) entitled The Web changes everything: Understanding and supporting people in dynamic information environments. She began, “Change is the hallmark of digital libraries… digital libraries are dynamic”, and she wanted to talk about how to deal with this change. “Traditional search & browse interfaces only see a particular slice of digital libraries. An example includes the Wikipedia article about Bill Gates.” She enumerated at least two change metrics: the number of changes and the time between changes. She then went about taking snapshots of websites, measuring the changes, and ultimately dividing the observations into at least three “speeds”: fast, medium, and slow. In general the quickly changing sites (fast) had a hub & spoke architecture. The medium change speed represented popular sites such as mail and Web applications. The slowly changing sites were generally entry pages or sites accessed via search. “Search engines need to be aware of what people seek and what changes over time. Search engines need to take change into account.” She then demonstrated an Internet Explorer plug-in (DiffIE) which highlights the changes in a website over time. She advocated weighting search engine results based on observed changes in a website’s content.
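The amount-of-change metric can be approximated for a pair of page snapshots with a standard diff ratio. This is only a sketch of the idea, not Dumais’s method, and the snapshots below are invented; a crawler would supply real ones along with their timestamps (giving the second metric, time between changes).

```python
# Score how much a page changed between two snapshots: 0.0 means
# identical, 1.0 means completely rewritten.
import difflib

def change_fraction(old: str, new: str) -> float:
    return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()

snap_monday = "Bill Gates is an American business magnate."
snap_friday = "Bill Gates is an American business magnate and philanthropist."

print(round(change_fraction(snap_monday, snap_monday), 3))  # identical
print(round(change_fraction(snap_monday, snap_friday), 3))
```

Binning these scores over many crawls is one plausible way to arrive at the fast/medium/slow speeds she described.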

Visualization was the theme of Sascha Tönnies’s (L3S Research) Uncovering hidden qualities — Benefits of quality measures for automatically generated metadata. She described the use of tag clouds with changes in color and size. She experimented with “growbag” graphs, which looked a lot like network graphs. She also explored the use of concentric circle diagrams (CCD), and based on her observations people identified with them very well. “In general, people liked the CCD graph the best because the radius intuitively represented a distance from the central idea.”

In what appeared to me to be the interpretation of metadata schemes through the use of triples, Panorea Gaitanou (Ionian University) described a way to query many cultural heritage institution collections in Query transformation in a CIDOC CRM Based cultural metadata integration environment. She called the approach MDL (Metadata Description Language). Lots of mapping and lots of XPath.

Michael Zarro (Drexel University) evaluated user comments written against the Library of Congress Flickr Commons Project in User-contributed descriptive metadata for libraries and cultural institutions. As a result, he was able to group the comments into at least four types. The first, personal/historical, were exemplified by things like, “I was there, and that was my grandfather’s house.” The second, links out, pointed to elaborations such as articles on Wikipedia. The third, corrections/translations, were amendments or clarifications. The last, links in, were pointers to Flickr groups. The second type of annotations, links out, were the most popular.

purple flower

Developing services to support research data management and sharing was a panel discussion surrounding the topic of data curation. My take-away from Sarah Jones’s (DCC) remarks was, “There are no incentives for sharing research data”, and when given the opportunity for sharing, data owners react by saying things like, “I’m giving my baby away… I don’t know the best practices… What are my roles and responsibilities?” Veerle Van den Eynden (United Kingdom Data Archive) outlined how she puts together infrastructure, policy, and support (such as workshops) to create successful data archives. “infrastructure + support + policy = data sharing” She enumerated time, attitudes, and privacy/confidentiality as the bigger challenges. Robin Rice (EDINA) outlined services similar to Van den Eynden’s but was particularly interested in social science data and its re-use. There is a much longer tradition of sharing social science data, and it is definitely not intended to be a dark archive. She enumerated a similar but different set of barriers to sharing: ownership, freedom of errors, fear of scooping, poor documentation, and lack of rewards. Rob Grim (Tilburg University) was the final panelist. He said, “We want to link publications with data sets as in Economists Online, and we want to provide a number of additional services against the data.” He described a data-sharing incentive, “I will only give you my data if you provide me with sets of services against it such as who is using it as well as where it is being cited.” Grim described the social issues surrounding data sharing as the most important. He compared & contrasted sharing with preservation, and re-use with archiving. “Not only is it important to have the data but it is also important to have the tools that created the data.”

From what I could gather, Claudio Gennaro (IST-CNR) in An Approach to content-based image retrieval based on the Lucene search engine library converted the binary content of images into strings, indexed the strings with Lucene, and then used Lucene’s “find more like this one” features to… find more like this one.

Stina Westman (Aalto University) gave a paper called Evaluation constructs for visual video summaries. She said, “I want to summarize video and measure things like quality, continuity, and usefulness for users.” To do this she enumerated a number of summarizing types: 1) storyboard, 2) scene clips, 3) fast forward technologies, and 4) user-controlled fast forwarding. After measuring satisfaction, scene clips provided the best recognition but storyboards were more enjoyable. The clips and fast forward technologies were perceived as the best video surrogates. “Summaries’ usefulness is directly proportional to the effort to use them and the coverage of the summary… There is little difference between summary types… There is little correlation between the type of performance and satisfaction.”

Frank Shipman (Texas A&M University) in his Visual expression for organizing and accessing music collections in MusicWiz asked himself, “Can we provide access to music collections without explicit metadata; can we use implicit metadata instead?” The implementation of his investigation was an application called MusicWiz which is divided into a user interface and an inference engine. It consists of six modules: 1) artist, 2) metadata, 3) audio signal, 4) lyrics, 5) a workspace expression, and 6) similarity. In the end Shipman found “benefits and weaknesses to organizing personal music collections based on context-independent metadata… Participants found the visual expression facilitated their interpretation of mood… [but] the lack of traditional metadata made it more difficult to locate songs…”


Day #2 (Wednesday, September 8)

Liina Munari (European Commission) gave the second day’s keynote address called Digital libraries: European perspectives and initiatives. In it she presented a review of the Europeana digital library funding and future directions. My biggest take-away was the following quote: “Orphan works are the 20th Century black hole.”

Stephan Strodl (Vienna University of Technology) described a system called Hoppla facilitating back-up and providing automatic migration services. Based on OAIS, it gets its input from email, a hard disk, or the Web. It provides data management, access, preservation, and storage management. The system draws on the experience of others to implement these services. It seemingly offers suggestions on how to get the work done, but it does not actually do the back-ups. The title of his paper was Automating logical preservation for small institutions with Hoppla.

Alejandro Bia (Miguel Hernández University) in Estimating digitization costs in digital libraries using DiCoMo advocated making a single estimate for digitizing, and then making the estimate work. “Most of the cost in digitization is the human labor. Other things are known costs.” Based on past experience Bia graphed a curve of digitization costs and applied the curve to estimates. Factors that go into the curve include: skill of the labor, familiarity with the material, complexity of the task, the desired quality of the resulting OCR, and the legibility of the original document. The whole process reminded me of Medieval scriptoriums.

city hall
stair case

Andrew McHugh (University of Glasgow) presented In pursuit of an expressive vocabulary for preserved New Media art. He is trying to preserve (conserve) New Media art by advocating the creation of medium-independent descriptions written by the artist so the art can be migrated forward. He enumerated a number of characteristics of the art to be described: functions, version, materials & dependencies, context, stakeholders, and properties.

In An Analysis of the evolving coverage of computer science sub-fields in the DBLP digital library Florian Reitz (University of Trier) presented an overview of the Digital Bibliography & Library Project (DBLP) — a repository of computer science conference presentations and journal articles. The (incomplete) collection was evaluated, and in short he saw the strengths and coverage of the collection change over time. In a phrase, he did a bit of traditional collection analysis against a non-traditional library.

A second presentation, Analysis of computer science communities based on DBLP, was then given on the topic of the DBLP, this time by Maria Biryukov (University of Luxembourg). She first tried to classify computer science conferences into sets of subfields in an effort to rank which conferences were “better”. One way this was done was through an analysis of who participated, the number of citations, the number of conference presentations, etc. She then tracked where a person presented and was able to see flows and patterns of publishing. Her conclusion — “Authors publish all over the place.”

In Citation graph based ranking in Invenio by Ludmila Marian (European Organization for Nuclear Research) the question was asked, “In a database of citations consisting of millions of documents, how can good precision be achieved if users only supply approximately 2-word queries?” The answer, she says, may lie in citation analysis. She weighted papers based on the number and locations of citations in a manner similar to Google PageRank, but in the end she acknowledged the imperfection of the process since older publications seemed to unnaturally float to the top.
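For what it is worth, the kind of citation-graph weighting she described can be sketched in a few lines of Python. Everything here — the toy graph, the damping factor, the function name — is illustrative only, and not Invenio’s actual implementation:

```python
# Toy sketch of citation-graph ranking in the spirit of PageRank.
# The graph, damping factor, and iteration count are illustrative only.

def citation_rank(citations, damping=0.85, iterations=50):
    """citations maps a paper id to the list of paper ids it cites."""
    papers = set(citations) | {p for cited in citations.values() for p in cited}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in papers}
        for paper in papers:
            cited = citations.get(paper, [])
            if cited:
                # A paper passes its weight to the papers it cites.
                for p in cited:
                    new[p] += damping * rank[paper] / len(cited)
            else:
                # Papers with no outgoing citations spread weight evenly.
                for p in papers:
                    new[p] += damping * rank[paper] / n
        rank = new
    return rank

ranks = citation_rank({"a": ["c"], "b": ["c"], "c": []})
# "c" is cited by both "a" and "b", so it ends up with the highest rank.
```

The imperfection she noted follows naturally from such a scheme: the longer a paper has been around, the more citations it accumulates, so older publications float upward.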

Day #3 (Thursday, September 9)

Sandra Toze (Dalhousie University) wanted to know how digital libraries support group work. In her Examining group work: Implications for the digital library as sharium she described the creation of an extensive lab for group work. Computers. Video cameras. Whiteboards. Etc. Students used her lab and worked in the manner she expected: doing administrative tasks, communicating, problem solving, and generating artifacts. She observed that the “sharium” was a valid environment for doing work, but she also noticed that information seeking was done only by individuals while the other tasks were done by the group as a whole. I found this latter fact particularly interesting.

In an effort to build and maintain reading lists Gabriella Kazai (Microsoft) presented Architecture for a collaborative research environment based on reading list sharing. The heart of the presentation was a demonstration of ScholarLynk as well as Research Desktop — tools to implement “living lists” of links to knowledge sources. I went away wondering whether or not such tools save people time and increase knowledge.

The last presentation I attended was by George Lucchese (Texas A&M University), called CritSpace: A Workplace for critical engagement within cultural heritage digital libraries, where he described an image processing tool intended to be used by humanities scholars. The tool does image processing, provides a workspace, and allows researchers to annotate their content.

Bothwell Castle
Stirling Castle
Doune Castle

Observations and summary

It has been just more than one month since I was in Glasgow attending the Conference, and much of the “glow” (all puns intended) has worn off. The time spent was productive. For example, I was able to meet up with James McNulty (Open University), who spent time at Notre Dame with me. I attended eighteen presentations, all deemed innovative and scholarly by way of extensive review. I discussed digital library issues with numerous people and made an even greater number of new acquaintances. Throughout the process I did some very pleasant sightseeing, both with conference attendees and on my own. At the same time, I do not feel as if my knowledge of digital libraries was significantly increased. Yes, attendance was intellectually stimulating, as demonstrated by the number of to-do list items written in my notebook during the presentations, but the topics of discussion seemed worn out and not significant: interesting, but exemplifying only subtle changes from previous research.

My attendance was also a mission. More specifically, I wanted to compare & contrast the work going on here with the work being done at the 2010 Digital Humanities conference. In the end, I believe the two groups are not working together but rather, as one attendee put it, “talking past one another.” Both groups — ECDL and Digital Humanities — have something in common: libraries and librarianship. But on one side are computer scientists, and on the other side are humanists. The first want to implement algorithms and apply them to many processes. If such a thing gets out of hand, then the result is akin to a person owning a hammer and everything looking like a nail. The second group is ultimately interested in describing the human condition and addressing questions about values. This second process is exceedingly difficult, if not impossible, to measure. Consequently any sort of evaluation is left up to a great deal of subjectivity. Many people would think these two processes are contradictory and/or conflicting. In my opinion, they are anything but in conflict. Rather, these two processes are complementary. One fills the deficiencies of the other. One is more systematic where the other is more judgmental. One relates to us as people, and the other attempts to make observations devoid of human messiness. In reality, despite the existence of these “two cultures”, I see the work of the scientists and the work of the humanists as equally necessary in order for me to make sense of the world around me. It is nice to know libraries and librarianship seem to represent a middle ground in this regard. Not ironically, that is one of the most important reasons I explicitly chose my profession. I desired to practice both art and science — arscience. It is just too bad that these two groups do not work more closely together. There seems to be too much desire for specialization instead. (Sigh.)

Because of a conflict in acronyms, the ECDL conference has all but been renamed to Theory and Practice of Digital Libraries (TPDL), and next year’s meeting will take place in Berlin. Even though this was my third or fourth time attending ECDL, I doubt I will attend next year. I do not think information retrieval and metadata standards are as important as they have been. Don’t get me wrong. I didn’t say they were unimportant, just not as important as they used to be. Consequently, I think I will be spending more of my time investigating the digital humanities, where content has already been found and described, and is now being evaluated and put to use.

River Clyde
River Teith

WiLSWorld, 2010

I had the recent honor, privilege, and pleasure of attending WiLSWorld (July 21-22, 2010 in Madison, Wisconsin), and this posting outlines my experiences there. In a sentence, I was pleased to see the increasing understanding of “discovery” interfaces defined as indexes as opposed to databases, and it is now my hope we — as a profession — can move beyond search & find towards use & understand.

Wednesday, July 21

With an audience of about 150 librarians of all types from across Wisconsin, the conference began with a keynote speech by Tim Spalding (LibraryThing) entitled “Social cataloging and the future”. The heart of his presentation was a thing he called the Ladder of Social Cataloging, which has six “rungs”: 1) personal cataloging, 2) sharing, 3) implicit social cataloging, 4) social networking, 5) explicit social cataloging, and 6) collaboration. Much of what followed were demonstrations of how each of these things is manifested in LibraryThing. There were a number of meaty quotes sprinkled throughout the talk:

…We [LibraryThing] are probably not the biggest book club anymore… Reviews are less about buying books and more about sharing minds… Tagging is not about something for everybody else, but rather about something for yourself… LibraryThing was about my attempt to discuss the things I wanted to discuss in graduate school… We have “flash mobs” cataloging peoples’ books such as the collections of Thomas Jefferson, John Adams, Ernest Hemingway, etc… Traditional subject headings are not manifested in degrees; all LCSH are equally valid… Library data can be combined but separate from patron data.

I was duly impressed with this presentation. It really brought home the power of crowd sourcing and how it can be harnessed in a library setting. Very nice.

Peter Gilbert (Lawrence University) then gave a presentation called “Resource discovery: I know it when I see it”. In his words, “The current problem to solve is to remove all of the silos: books, articles, digitized content, guides to subjects, etc.” The solution, in his opinion, is to implement “discovery systems” similar to Blacklight, eXtensible Catalog, Primo & Primo Central, Summon, VUFind, etc. I couldn’t have said it better myself. He gave a brief overview of each system.

Ken Varnum (University of Michigan Library) described a website redesign process in “Opening what’s closed: Using open source tools to tear down vendor silos”. As he said, “The problem we tried to solve in our website redesign was the overwhelming number of branch library websites. All different. Almost schizophrenic.” The solution grew out of a different premise for websites: “Information, not location.” He went on to describe a rather typical redesign process complete with focus group interviews, usability studies, and advisory groups, but there were a couple of very interesting tidbits. First, inserting the names and faces of librarians in search results has proved popular with students. Second, I admired the “participatory design” process he employed. Print a design. Allow patrons to use pencils to add, remove, or comment on aspects of the layout. I also think the addition of a professional graphic designer helped their process.

I then attended Peter Gorman‘s (University of Wisconsin-Madison) “Migration of digital content to Fedora”. Gorman had the desire to amalgamate institutional content, books, multimedia, and finding aids (EAD files) into a single application… yet another “discovery system” description. His solution was to store content in Fedora, index the content, and provide services against the index. Again, a presenter after my own heart. Better than anyone had done previously, Gorman described Fedora’s content model, complete with identifiers (keys), sets of properties (relationships, audit trails, etc.), and data streams (JPEG, XML, TIFF, etc.). His description was clear and very easy to digest. The highlight was a description of Fedora “behaviors”. These are things people are intended to do with data streams. Examples include enlarging a thumbnail image or transforming an online finding aid into something designed for printing. These “behaviors” are very much akin — if not identical — to the “services against texts” I have been advocating for a few years.
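As I understand it, the content model Gorman described can be caricatured in a few lines of Python. This is only my sketch of the idea — identifiers, properties, datastreams, and behaviors — and not Fedora’s actual API:

```python
# A caricature of the content model described above: an object with an
# identifier, properties, named datastreams, and "behaviors" (services
# run against datastreams). This is not Fedora's API, just the idea.

class DigitalObject:
    def __init__(self, pid, properties=None):
        self.pid = pid                      # persistent identifier (key)
        self.properties = properties or {}  # relationships, audit trail, etc.
        self.datastreams = {}               # e.g. "TIFF", "MODS", "EAD"
        self.behaviors = {}                 # services against datastreams

    def add_datastream(self, name, content):
        self.datastreams[name] = content

    def add_behavior(self, name, func):
        self.behaviors[name] = func

    def run(self, behavior, datastream):
        # A "behavior" is a service applied to a datastream, such as
        # enlarging a thumbnail or transforming a finding aid for printing.
        return self.behaviors[behavior](self.datastreams[datastream])

obj = DigitalObject("demo:1", {"owner": "library"})
obj.add_datastream("OCR", "full text of a digitized book")
obj.add_behavior("word_count", lambda text: len(text.split()))
print(obj.run("word_count", "OCR"))  # → 6
```

The appeal of the model, to my mind, is exactly this separation: the datastreams are inert content, and the behaviors are the services run against them.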

Thursday, July 22

The next day I gave a presentation called “Electronic texts and the evolving definition of librarianship”. This was an extended version of my presentation at ALA given a few weeks ago. To paraphrase, “As we move from databases towards indexes to facilitate search, the problems surrounding find are not as acute. Given the increasing availability of digitized full text content, library systems have the opportunity to employ ‘digital humanities computing techniques’ against collections and enable people to do ‘distant reading’.” I then demonstrated how the simple counting of words and phrases, the use of concordances, and the application of TFIDF can facilitate rudimentary comparing & contrasting of corpora. Giving this presentation was an enjoyable experience because it provided me the chance to verbalize and demonstrate much of my current “great books” research.
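For the curious, the sort of rudimentary analysis I demonstrated can be sketched in Python. The documents below are made up, and the formula is the textbook TFIDF (term frequency times inverse document frequency), not necessarily the exact variant used in the presentation:

```python
# A minimal sketch of TFIDF against tiny, made-up "corpora". TFIDF is a
# term's frequency in a document times the log of how rare the term is
# across all documents; distinctive words score high.
import math
from collections import Counter

def tfidf(term, doc, docs):
    words = doc.lower().split()
    tf = Counter(words)[term] / len(words)
    df = sum(1 for d in docs if term in d.lower().split())
    return tf * math.log(len(docs) / df) if df else 0.0

docs = ["whales and the sea", "the sea and ships", "horses and carts"]
# "whales" appears in only one document, "sea" in two, so "whales" is
# the more distinctive term in the first document.
print(tfidf("whales", docs[0], docs) > tfidf("sea", docs[0], docs))  # True
```

Even a toy like this is enough to begin comparing & contrasting corpora: compute the scores for each document and look at which terms rise to the top.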

Later in the morning I helped facilitate a discussion on the process a library could go through to implement the ideas outlined in my presentation, but the vast majority of people attended the presentation by Keith Mountin (Apple Computer, Inc.) called “The iPad and its application in libraries”.


Madison was just as nice as I remember. Youthful. Liberal. Progressive. Thanks go to Deb Shapiro and Mark Beatty. They invited me to sit with them on the capitol lawn and listen to the local orchestra play Beatles music. The whole thing was very refreshing.

The trip back from the conference was a hellacious experience in air travel, but it did give me the chance to have an extended chat with Tim Spalding in the airport. We discussed statistics and statistical measures that can be applied to content we are generating. Many of the things he is doing with metadata I may be able to do with full text. The converse is true as well. Moreover, by combining our datasets we may find that the sum is greater than the parts — all puns intended. Both Tim and I agreed this is something we should both work towards. Afterwards I ate macaroni & cheese with a soft pretzel and a beer. It seemed apropos for Wisconsin.

This was my second or third time attending WiLSWorld. Like the previous meetings, the good folks at WiLS — specifically Tom Zilner, Mark Beatty, and Shirley Schenning — put together a conference providing librarians from across Wisconsin with a set of relatively inexpensive professional development opportunities. Timely presentations. Plenty of time for informal discussions. All in a setting conducive to getting away and thinking a bit outside the box. “Thank you.”

Digital Humanities 2010: A Travelogue

I was fortunate enough to be able to attend a conference called Digital Humanities 2010 (London, England) between July 4th and 10th. This posting documents my experiences and take-aways. In a sentence, the conference provided a set of much needed intellectual stimulation and challenges as well as validated the soundness of my current research surrounding the Great Books.

lunch castle castle

Pre-conference activities

All day Monday, July 5, I participated in a workshop called Text mining in the digital humanities facilitated by Marco Büchler, et al. of the University of Leipzig. A definition of “e-humanities” was given: “The application of computer science to do qualitative evaluation of texts without the use of things like TEI.” I learned that graphing texts illustrates concepts quickly — “A picture is worth a thousand words.” Also, I learned I should consider creating co-occurrence graphs — pictures illustrating what words co-occur with a given word. Finally, according to the Law of Least Effort, the strongest content words in a text are usually not the ones that occur most frequently, nor the ones occurring least, but rather the words occurring somewhere in between. A useful quote: “Text mining allows one to search even without knowing any search terms.” Much of this workshop’s content came from the eAQUA Project.

On Tuesday I attended the first half of a THATCamp led by Dan Cohen (George Mason University) where I learned THATCamps are expected to be: 1) fun, 2) productive, and 3) collegial. The whole thing came off as a “bar camp” for scholarly conferences. As a part of the ‘Camp I elected to participate in the Developer’s Challenge and submitted an entry called “How ‘great’ is this article?”. My hack compared texts from the English Women’s Journal to the Great Books Coefficient in order to determine “greatness”. My entry did not win. Instead the prize went to Patrick Juola with honorable mentions going to Loretta Auvil, Marco Büchler, and Thomas Eckart.

Wednesday morning I learned more about text mining in a workshop called Introduction to text analysis using JiTR and Voyeur led by Stéfan Sinclair (McMaster University) and Geoffrey Rockwell (University of Alberta). The purpose of the workshop was “to learn how to integrate text analysis into a scholar’s/researcher’s workflow.” More specifically, we learned how to use a tool called Voyeur, an evolution of TAPoR. The “kewlest” thing I learned was the definition of word density, (U / W) × 1000, where U is the total number of unique words in a text and W is the total number of words in the text. The closer the result is to 1000, the richer and more dense a text is. In general, denser documents are more difficult to read. (For a good time, I wrote a program to compute density given an arbitrary plain text file.)
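The formula is simple enough to implement in a few lines of Python. Here is a minimal sketch of the calculation, not the actual program:

```python
# Word density as defined above: (U / W) * 1000, where U is the number of
# unique words and W the total number of words. A minimal sketch only.

def density(text):
    words = text.lower().split()
    if not words:
        return 0.0
    return (len(set(words)) / len(words)) * 1000

print(density("the cat saw the dog"))  # → 800.0 (4 unique words out of 5)
```

To use it against an arbitrary plain text file, simply read the file into a string and pass it to the function.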

In keeping with the broad definition of humanities, I was “seduced” in the afternoon by listening to recordings of a website called CHARM (Center for History and Analysis of Recorded Music). The presentation described and presented digitized classical music from the very beginnings of recorded music. All apropos since the BBC was located just across the street from King’s College where the conference took place. When this was over we retired to the deck for tea and cake. There I learned the significant recording time differences between 10″ and 12″ 78/rpm records. Like many mediums, the recording artist needed to make accommodations accordingly.

me abbey abbey

Plenty of presentations

The conference officially began Wednesday evening and ended Saturday afternoon. According to my notes, I attended as many as eighteen sessions. (Wow!?) Listed below are summaries of most of the ones I attended:

  • Charles Henry (Council on Library and Information Resources) and Hold up a mirror – In this keynote presentation Henry compared & contrasted manifestations (oral, written, and digital) of Homer, Beowulf, and a 9-volume set of religious ceremonies compiled in the 18th century. He then asked the question, “How can machines be used to capture the interior of the working mind?” Or, in my own words, “How can computers be used to explore the human condition?” The digital versions of the items listed above were used as example answers, and a purpose of the conference was to address this question in other ways. He said, “There are many types of performance, preservation, and interpretation.”
  • Patrick Juola (Duquesne University) and Distant reading and mapping genre space via conjecture-based distance measures – Juola began by answering the question, “What do you do with a million books?”, and enumerated a number of things: 1) search, 2) summarize, 3) sample, and 4) visualize. These sorts of processes against texts are increasingly called “distant reading” and are contrasted with the more traditional “close reading”. He then went on to describe his “Conjecturator” — a system where assertions are randomly generated and then evaluated. He demonstrated this technique against a set of Victorian novels. His presentation was not dissimilar to the one he gave at the digital humanities conference in Chicago the previous year.
  • Jan Rybicki (Pedagogical University) and Deeper delta across genres and language: Do we really need the most frequent words? – In short Rybicki said, “Doing simple frequency counts [to do authorship analysis] does not work very well for all languages, and we are evaluating ‘deeper deltas’” — an allusion to the work of J.F. Burrows and D.L. Hoover. Specifically, using a “moving window” of stop words he looked for similarities in authorship between a number of texts, and he believes his technique has proved to be more or less successful.
  • David Holmes (The College of New Jersey) and The Diary of a public man: A Case study in traditional and non-traditional author attribution – Soon after the Civil War a book called The Diary Of A Public Man was written by an anonymous author. Using stylometric techniques, Holmes asserts the work really was written as a diary and was authored by William Hurlbert.
  • David Hoover (New York University) and Teasing out authorship and style with t-tests and zeta – Hoover used t-tests and zeta tests to validate whether or not a particular author finished a particular novel from the 1800s. Using these techniques he was able to illustrate writing styles and how they changed dramatically between one chapter in the book and another. He asserted that such analysis would have been extremely difficult through rudimentary casual reading.
  • Martin Holmes (University of Victoria) and Using the universal similarity metric to map correspondences between witnesses – Holmes described how he compares the similarity of texts through the use of a compression algorithm. Compress texts. Compare their resulting lengths. The closer the lengths, the greater the similarity. The process works for a variety of file types and languages, even when there is no syntactical knowledge.
  • Dirk Roorda (Data Archiving and Networked Services) and The Ecology of longevity: The Relevance of evolutionary theory for digital preservation – Roorda drew parallels between biology and preservation. For example, biological systems use and retain biological characteristics. Preservation systems re-use and thus preserve content. Biological systems make copies and evolve. Preservation can be about migrating formats forward, thus creating different forms. Biological systems employ sexual selection. “Look how attractive I am.” Repositories or digital items displaying “seals of approval” function similarly. Finally, he went on to describe how these principles could be integrated in a preservation system where fees are charged for storing content and providing access to it. He emphasized such systems would not necessarily be designed to handle intellectual property rights.
  • Lewis Ulman (Ohio State University) & Melanie Schlosser (Ohio State University) and The Specimen case and the garden: Preserving complex digital objects, sustaining digital projects – Ulman and Schlosser described a dichotomy manifesting itself in digital libraries. On one hand there is a practical need for digital library systems to be similar to each other because “boutique” systems are very expensive to curate and maintain. At the same time specialized digital library applications are needed because they represent the frontiers of research. How to accommodate both? That was their question. “No one group (librarians, information technologists, faculty) will be able to do preservation alone. They need to work together. Specifically, they need to connect, support, and curate.”
  • George Buchanan (City University) and Digital libraries of scholarly editions – Similar to Ulman/Schlosser above, Buchanan said, “It is difficult to provide library services against scholarly editions because each edition is just too different from the next to create a [single] system.” He advocated the Greenstone digital library system.
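As an aside, Holmes’s compress-and-compare technique is easy to sketch. The following uses the normalized compression distance with zlib — an assumption on my part, since the talk did not name a particular compressor:

```python
# Compression-based similarity in the spirit of Holmes's talk: compress two
# texts separately and concatenated; similar texts compress better together.
# This is the normalized compression distance (NCD) using zlib, which is an
# assumption -- the talk did not specify a compressor.
import zlib

def ncd(a, b):
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

hamlet = "to be or not to be, that is the question " * 5
richard = "now is the winter of our discontent made glorious summer " * 5
# A text is far "closer" to itself than to an unrelated text.
print(ncd(hamlet, hamlet) < ncd(hamlet, richard))  # True
```

Because the compressor knows nothing about language, the same trick works on any file type, which is exactly the appeal Holmes described.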

book ice cream beer

  • Joe Raben (Queens College of the City University of New York) and Humanities computing in an age of social change – In this presentation, given after being honored with the community’s Busa Award, Raben first outlined the history of the digital humanities. It included the work done by Father Busa, who collaborated with IBM in the 1960s to create a concordance against some of Thomas Aquinas’s work. It included a description of a few seminal meetings and the formation of the Computers and the Humanities journal. He alluded to “machine readable texts” — a term which is no longer in vogue but reminded me of “machine readable cataloging” (MARC) and how the library profession has not moved on. He advocated for a humanities wiki where ideas and objects could be shared. It sounded a lot like the website. He discussed the good work of a Dante project hosted at Princeton University, and I was dismayed because Notre Dame’s significant collection of Dante materials has not played a role in this particular digital library. A humanist through and through, he said, “Computers are increasingly controlling our lives and the humanities have not affected how we live in the same way.” To this I say, computers represent passing trends compared to the more ingrained values of the human condition. The former are quick to change; the latter change oh so very slowly, yet they are more pervasive. Compared to computer technology, I believe the humanists have had more long-lasting effects on the human condition.
  • Lynne Siemens (University of Victoria) and A Tale of two cities: Implications of the similarities in collaborative approaches within the digital libraries and digital humanities communities – Siemens reported on the results of a survey in an effort to determine how and why digital librarians and digital humanists collaborate. “There are cultural differences between librarians and academics, but teams [including both] are necessary. The solution is to assume the differences rather than the similarities. Everybody brings something to the team.”
  • Fenella France (Library of Congress) and Challenges of linking digital heritage scientific data with scholarly research: From navigation to politics – France described some of the digital scanning processes of the Library of Congress, and some of their consequences. For example, their technique allowed archivists to discover how Thomas Jefferson wrote, crossed out, and then replaced the word “subjects” with “citizens” in a draft of the Declaration of Independence. A couple of interesting quotes included, “We get into the optical archeology of the documents”, and “Digitization is access, not preservation.”
  • Joshua Sternfeld (National Endowment for the Humanities) and Thinking archivally: Search and metadata as building blocks for a new digital historiography – Sternfeld advocated for new forms of digital library evaluation. “There is a need for more types of reviews against digital resource materials. We need a method for doing: selection, search, and reliability… The idea of provenance — the order of document creation — needs to be implemented in the digital realm.”
  • Wendell Piez (Mulberry Technologies, Inc.) and Towards hermeneutic markup: An Architectural outline – Hermeneutic markup is annotation against a text that is purely about interpretation. “We don’t really have the ability to do hermeneutic markup… Existing schemas are fine, but every once in a while exceptions need to be made and such things break the standard.” Numerous times Piez alluded to the “overlap problem” — the inability to demarcate something crossing the essentially strict hierarchical nature of XML elements. Textual highlighting is a good example. Piez gave a few examples of how the overlap problem might be resolved and how hermeneutic markup may be achieved.
  • Jane Hunter (University of Queensland) and The Open Annotation collaboration: A Data model to support sharing and interoperability of scholarly annotations – Working with a number of other researchers, Hunter said, “The problem is that there is an extraordinarily wide variety of tools, lack of consistency, no standards, and no sharable interoperability when it comes to Web-based annotation.” Their goal is to create a data model to enable such functionality. While the model is not complete, it is being based on RDF, SANE, and OATS.
  • Susan Brown (University of Alberta and University of Guelph) and How do you visualize a million links? – Brown described a number of ways she is exploring visualization techniques. Examples included link graphs, tag clouds, bread board searches, cityscapes, and something based on “six degrees of separation”.
  • Lewis Lancaster (University of California, Berkeley) and From text to image to analysis: Visualization of Chinese Buddhist canon – Lancaster has been doing research against a (huge) set of Korean glyphs for quite a number of years. Just like other writing systems, the glyphs change over time. Through the use of digital humanities computing techniques, he has been able to discover patterns and bigrams much more quickly than he could previously. “We must present our ideas as images because language is too complex and takes too much time to ingest.”

church gate alley


In the spirit of British fast food, I have a number of take-aways. First and foremost, I learned that my current digital humanities research into the Great Books is right on target. It asks questions of the human condition and tries to answer them through the use of computing techniques. This alone was worth the total cost of my attendance.

Second, as a relative outsider to the community, I perceived a pervasive us-versus-them mentality being described. Us digital humanists and those traditional humanists. Us digital humanists and those computer programmers and systems administrators. Us digital humanists and those librarians and archivists. Us digital humanists and those academic bureaucrats. If you consider yourself a digital humanist, then please don’t take this observation the wrong way. I believe communities inherently do this as a matter of course. It is a process used to define one’s self. The heart of much of this particular differentiation seems to be yet another example of C.P. Snow‘s The Two Cultures. As a humanist myself, I identify with the perception. I think the processes of art and science complement each other; they neither contradict nor conflict. A balance of both is needed in order to adequately create a cosmos out of the apparent chaos of our existence — a concept I call arscience.

Third, I had ample opportunities to enjoy myself as a tourist. The day I arrived I played disc golf with a few “cool dudes” at Lloyd Park in Croydon. On the Monday I went to the National Theatre and saw Welcome to Thebes — a depressing tragedy where everybody dies. On the Tuesday I took in Windsor Castle. Another day I carried my Culver Citizen newspaper to have its photograph taken in front of Big Ben. Throughout my time there I experienced interesting food, a myriad of languages & cultures, and the almost overwhelming size of London. Embarrassingly, I had forgotten how large the city really is.

Finally, I actually enjoyed reading the formally published conference abstracts — all three pounds and 400 pages of them. The volume was thorough, complete, and even included an author index. More importantly, I discovered more than a few quotes supporting an idea for library systems that I have been calling “services against texts”:

The challenge is to provide the researcher with a means to perceiving or specifying subsets of data, extracting the relevant information, building the nodes and edges, and then providing the means to navigate the vast number of nodes and edges. (Susan Brown in “How do you visualize a million links” on page 106)

However, current DL [digital library] systems lack critical features: they have too simple a model of documents, and lack scholarly apparatus. (George Buchanan in “Digital libraries of scholarly editions” on page 108.)

This approach takes us to what F. Moretti (2005) has termed ‘distant reading,’ a method that stresses summarizing large bodies of text rather than focusing on a few texts in detail. (Ian Gregory in “GIS, texts and images: New approaches to landscape appreciation in the Lake District” on page 159).

And the best quote is:

In smart digital libraries, a text should not only be an object but a service: not a static entity but an interactive method. The text should be computationally exploitable so that it can be sampled and used, not simply reproduced in its entirety… the reformulation of the dictionary not as an object, but a service. (Toma Tasovac in “Reimaging the dictionary, or why lexicography needs digital humanities” on page 254)

In conclusion, I feel blessed with the ability to have attended the conference. I learned a lot, and I will recommend it to any librarian or humanist.

ALA 2010

This is the briefest of travelogues describing my experience at the 2010 ALA Annual Meeting in Washington (DC).

Pat Lawton and I gave a presentation at the White House Four Points Hotel on the “Catholic Portal“. Essentially it was a status report. We shared the podium with Jon Miller (University of Southern California) who described the International Mission Photography Archive — an extensive collection of photographs taken by missionaries from many denominations.

I then took the opportunity to visit my mother in Pennsylvania, but the significant point is the way I got out of town. I had lost my maps, and my iPad came to the rescue. The Google Maps application was very, very useful.

On Monday I shared a podium with John Blyberg (Darien Library) and Tim Spalding (LibraryThing) as a part of a Next-Generation Library Catalog Special Interest Group presentation. John provided an overview of the latest and greatest features of SOPAC. He emphasized a lot of user-centered design. Tim described library content and services as not (really) being a part of the Web. In many ways I agree with him. I outlined how a few digital humanities computing techniques could be incorporated into library collections and services in a presentation I called “The Next Next-Generation Library Catalog“. That afternoon I participated in a VUFind users-group meeting, and I learned that I am pretty much on target in regards to the features of this “discovery system”. Afterwards a number of us from the Catholic Research Resources Alliance (CRRA) listened to folks from Crivella West describe their vision of librarianship. The presentation was very interesting because they described how they have taken many collections of content and mined them for answers to questions. This is digital humanities to the extreme. Their software — the Knowledge Kiosk — is being used to analyze the content of John Henry Newman at the Newman Institute.

Tuesday morning was spent more with the CRRA. We ratified next year’s strategic plan. In the afternoon I visited a few of my friends at the Library of Congress (LOC). There I learned a bit how the LOC may be storing and archiving Twitter feeds. Interesting.

Inaugural Code4Lib “Midwest” Regional Meeting

I believe the Inaugural Code4Lib “Midwest” Regional Meeting (June 11 & 12, 2010 at the University of Notre Dame) was a qualified success.

About twenty-six people attended. (At least that was the number of people who went to lunch.) They came from Michigan, Ohio, Iowa, Indiana, and Illinois. Julia Bauder won the prize for traveling the furthest distance — from Grinnell, Iowa.

Day #1

We began with Lightning Talks:

  • ePub files by Michael Kreyche
  • FRBR and MARC data by Kelley McGrath
  • Great Books by myself
  • jQuery and the OPAC by Ken Irwin
  • Notre Dame and the Big Ten by Michael Witt
  • Solr & Drupal by Rob Casson
  • Subject headings via a Web Service by Michael Kreyche
  • Taverna by Rick Johnson and Banu Lakshminarayanan
  • VUFind on a hard disk by Julia Bauder

We dined in the University’s South Dining Hall, and toured a bit of the campus on the way back, taking in the “giant marble”, the Architecture Library, and the Dome.

In the afternoon we broke up into smaller groups and discussed things including institutional repositories, mobile devices & interfaces, ePub files, and FRBR. In the evening we enjoyed varieties of North Carolina barbecue, and then retreated to the campus bar (Legend’s) for a few beers.

I’m sorry to say the Code4Lib Challenge was not successful. We hackers were either too engrossed to notice whether or not anybody came to the event, or nobody showed up to challenge us. Maybe next time.

Day #2

There were fewer participants on Day #2. We spent the time listening to Ken elaborate on the uses and benefits of jQuery. I hacked at something I’m calling “The Great Books Survey”.

The event was successful in that it provided plenty of opportunity to discuss shared problems and solutions. Personally, I learned I need to explore statistical correlations, regressions, multivariate analysis, and principal component analysis to a greater degree.

A good time was had by all, and it is quite possible the next “Midwest” Regional Meeting will be hosted by the good folks in Chicago.

For more detail about Code4Lib “Midwest”, see the wiki:

Cyberinfrastructure Days at the University of Notre Dame

ci days
On Thursday and Friday, April 29 and 30, 2010 I attended a Cyberinfrastructure Days event at the University of Notre Dame. Through this process my personal definition of “cyberinfrastructure” was updated, and my basic understanding of “digital humanities computing” was confirmed. This posting documents the experience.

Day #1 – Thursday, April 29

The first day was devoted to cyberinfrastructure and the humanities.

After all of the necessary introductory remarks, John Unsworth (University of Illinois – Urbana-Champaign) gave the opening keynote presentation entitled “Reading at library scale: New methods, attention, prosthetics, evidence, and argument“. In his talk he posited the impossibility of reading everything currently available. There is just too much content. Given some of the computing techniques at our disposal, he advocated additional ways to “read” material, but cautioned the audience in three ways: 1) there needs to be an attention to prosthetics, 2) an appreciation for evidence and statistical significance, and 3) a sense of argument so the skeptic may be able to test the method. To me this sounded a whole lot like applying scientific methods to the process of literary criticism. Unsworth briefly described MONK and elaborated how part-of-speech tagging had been done against the corpus. He also described how Dunning’s Log-Likelihood statistic can be applied to texts in order to determine what a person does (and doesn’t) include in their writings.
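As an aside, Dunning’s statistic is straightforward to compute. The sketch below is my own illustration of the idea (the word counts are invented for the example), and it has nothing to do with the MONK project’s actual code:

```python
import math

def dunning_g2(a, b, total_a, total_b):
    """Dunning's log-likelihood (G2) for a word occurring a times
    in corpus A (total_a words overall) and b times in corpus B
    (total_b words overall)."""
    # Expected counts if the word were equally likely in both corpora.
    e_a = total_a * (a + b) / (total_a + total_b)
    e_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e_a)
    if b > 0:
        g2 += b * math.log(b / e_b)
    return 2 * g2

# A word used 40 times in a 10,000-word text but only 5 times in a
# 12,000-word reference corpus scores about 37.7 -- distinctive of
# the first text. (These counts are made up for illustration.)
score = dunning_g2(40, 5, 10000, 12000)
```

A score above roughly 3.84 (the 95% chi-square critical value) suggests the difference in frequency is unlikely to be chance, which is how the statistic flags what an author does (and doesn’t) write about.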

Stéfan Sinclair (McMaster University) followed with “Challenges and opportunities of Web-based analytic tools for the humanities“. He gave a brief history of the digital humanities in terms of computing. Mainframes and concordances. Personal computers and even more concordances. Webbed interfaces and locally hosted texts. He described digital humanities as something that has evolved in cycles since at least 1967. He argued the new tools will be Web apps — things that can be embedded into Web pages and used against just about any text. His Voyeur Tools were an example. Like Unsworth, he advocated the use of digital humanities computing techniques because they can supplement the analysis of texts. “These tools allow you to see things that are not evident.” Sinclair will be presenting a tutorial at the annual digital humanities conference this July. I hope to attend.

In a bit of a change of pace, Russ Hobby (Internet2) elaborated on the nuts & bolts of cyberinfrastructure in “Cyberinfrastructure components and use“. In this presentation I learned that many scientists are interested in the… science, and they don’t really care about the technology supporting it. They have an instrument in the field. It is collecting and generating data. They want to analyze that data. They are not so interested in how it gets transported from one place to another, how it is stored, or in what format. As I knew, they are interested in looking for patterns in the data in order to describe and predict events in the natural world. “Cyberinfrastructure is like a car. ‘Car, take me there.'” Cyberinfrastructure is about controls, security systems, storage sets, computation, visualization, support & training, collaboration tools, publishing, communication, finding, networking, etc. “We are not there to answer the question, but more to ask them.”

In the afternoon I listened to Richard Whaling (University of Chicago) present on “Humanities computing at scale“. Given from the point of view of a computer scientist, this presentation was akin to Hobby’s. On one hand there are people who do analysis, and on the other there are people who create the analysis tools. Whaling is more like the latter. I thought his discussion on the format of texts was most interesting. “XML is good for various types of rendering, but not necessarily so good for analysis. XML does not necessarily go deep enough with the encoding because the encoding is too expensive; XML is not scalable. Nor is SQL. Indexing is the way to go.” This perspective jibes with my own experience. Encoding texts in XML (TEI) is so very tedious, and the tools to do any analysis against the result are few and far between. Creating the perfect relational database (SQL) is like seeking the Holy Grail, and SQL is not designed to do full text searching nor “relevancy ranking”. Indexing texts and doing retrieval against the result has proven to be much more fruitful for me, but such an approach is an example of “bag of words” computing, and thus words (concepts) often get placed out of context. Despite that, I think the indexing approach holds the most promise. Check out Perseus under Philologic and Digital South Asia Library to see some of Whaling’s handiwork.
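To make the “indexing” and “bag of words” ideas concrete, here is a toy inverted index in Python. It is only my sketch of the general technique (the sample texts are arbitrary), not a description of anybody’s actual software:

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of documents containing it.
    Word order -- that is, context -- is discarded, which is
    exactly the 'bag of words' trade-off."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word.strip('.,;:!?')].add(doc_id)
    return index

docs = {
    "walden": "I went to the woods because I wished to live deliberately",
    "leaves": "I celebrate myself, and sing myself",
}
index = build_index(docs)
index["myself"]  # {'leaves'}
index["woods"]   # {'walden'}
```

Retrieval is then just set lookup (and intersection for multi-word queries), which is why indexing scales where hand-encoded XML and relational schemas struggle.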

Chris Clarke (University of Notre Dame), in “Technology horizons for teaching and learning“, enumerated ways the University of Notre Dame is putting into practice many of the things described in the most recent Horizon Report. Examples included the use of ebooks, augmented reality, gesture-based computing, and visual data analysis. I thought the presentation was a great way to bring the forward-thinking report down to Earth and place it into a local context. Very nice.

William Donaruma (also from the University of Notre Dame) described the process he was going through to create 3-D movies in a presentation called “Choreography in a virtual space“. Multiple — very expensive — cameras. Dry ice. Specific positioning of the dancers. Special glasses. All of these things played into the creation of an illusion of three dimensions on a two-dimensional surface. I will not call it three-dimensional until I can walk around the object in question. The definition of three-dimensional needs to be qualified.

The final presentation of the day took place after dinner. The talk, “The Transformation of modern science” was given virtually by Edward Seidel (National Science Foundation). Articulate. Systematic. Thorough. Insightful. These are the sorts of words I use to describe Seidel’s talk. Presented remotely through a desktop camera and displayed on a screen to the audience, we were given a history of science and a description of how it has changed from single-man operations to large-group collaborations. We were shown the volume of information created previously and compared it to the volume of information generated now. All of this led up to the most salient message — “All future National Science Foundation grant proposals must include a data curation plan.” Seidel mentioned libraries, librarians, and librarianship quite a number of times during the talk. Naturally my ears perked up. My profession is about the collection, preservation, organization, and dissemination of data, information, and knowledge. The type of content to which these processes are applied — books, journal articles, multi-media recordings, etc — is irrelevant. Given a collection policy, it can all be important. The data generated by scientists and their machines is no exception. Is our profession up to the challenge, or are we too much wedded to printed, bibliographic materials? It is time for librarians to aggressively step up to the plate, or else. Here is an opportunity being laid at our feet. Let’s pick it up!

Day #2 – Friday, April 30

The second day centered more around the sciences as opposed to the humanities.

The day began with a presentation by Tony Hey (Microsoft Research) called “The Fourth Paradigm: Data-intensive scientific discovery“. Hey described cyberinfrastructure as the new name for e-science. He then echoed much of the content of Seidel’s message from the previous evening and described the evolution of science in a set of paradigms: 1) theoretical, 2) experimental, 3) computational, and 4) data-intensive. He elaborated on the infrastructure components necessary for data-intensive science: 1) acquisition, 2) collaboration & visualization, 3) analysis & mining, 4) dissemination & sharing, 5) archiving & preservation. (Gosh, that sounds a whole lot like my definition of librarianship!) He saw Microsoft’s role as one of providing the necessary tools to facilitate e-science (or cyberinfrastructure) and thus the Fourth Paradigm. Hey’s presentation sounded a lot like open access advocacy. More Association of Research Libraries directors as well as university administrators need to hear what he has to say.

Boleslaw Szymanski (Rensselaer Polytechnic Institute) described how better science could be done in a presentation called “Robust asynchronous optimization for volunteer computing grids“. Like Hobby and Whaling mentioned (above), Szymanski separated the work of the scientist and the work of cyberinfrastructure. “Scientists do not want to be bothered with the computer science of their work.” He then went on to describe a distributed computing technique for studying the galaxy — MilkyWay@home. He advocated cloud computing as a form of asynchronous computing.

The third presentation of the day was entitled “Cyberinfrastructure for small and medium laboratories” by Ian Foster (University of Chicago). The heart of this presentation was advocacy for software as a service (SaaS) computing for scientific laboratories.

Ashok Srivastava (NASA) was the first up in the second session with “Using Web 2.0 and collaborative tools at NASA“. He spoke to one of the basic principles of good science when he said, “Reproducibility is a key aspect of science, and with access to the data this reproducibility is possible.” I’m not quite sure my fellow librarians and humanists understand the importance of such a statement. Unlike work in the humanities — which is often built on subjective and intuitive interpretation — good science relies on the ability for many to come to the same conclusion based on the same evidence. Open access data makes such a thing possible. Much more of Srivastava’s presentation was about DASHlink, “a virtual laboratory for scientists and engineers to disseminate results and collaborate on research problems in health management technologies for aeronautics systems.”

“Scientific workflows and bioinformatics applications” by Ewa Deelman (University of Southern California) was up next. She echoed many of the things I heard from library pundits a few years ago when it came to institutional repositories. In short, “Workflows are what are needed in order for e-science to really work… Instead of moving the data to the computation, you have to move the computation to the data.” This is akin to two ideas. First, like Hey’s idea of providing tools to facilitate cyberinfrastructure, Deelman advocates integrating the cyberinfrastructure tools into the work of scientists. Second, e-science is more than mere infrastructure. It also approaches the “services against text” idea which I have been advocating for a few years.

Jeffrey Layton (Dell, Inc.) rounded out the session with a presentation called “I/O pattern characterization of HPC applications“. In it he described how he used the output of strace commands — which can be quite voluminous — to evaluate storage input/output patterns. “Storage is cheap, but it is only one of a bigger set of problems in the system.”

By this time I was full, my iPad had arrived in the mail, and I went home.


It just so happens I was given the responsibility of inviting a number of the humanists to the event, specifically: John Unsworth, Stéfan Sinclair, and Richard Whaling. It was an honor, and I appreciate the opportunity. “Thank you.”

I learned a number of things, and a few other things were reinforced. First, the word “cyberinfrastructure” is the newly minted term for “e-science”. Many of the presenters used these two words interchangeably. Second, while my experience with the digital humanities is still in its infancy, I am definitely on the right track. Concordances certainly don’t seem to be going out of style any time soon, and my use of indexes is a movement in the right direction. Third, the cyberinfrastructure people see themselves as support to the work of scientists. This is similar to the work of librarians who see themselves supporting their larger communities. Personally, I think this needs to be qualified since I believe it is possible for me to expand the venerable sphere of knowledge too. Providing library (or cyberinfrastructure) services does not preclude me from advancing our understanding of the human condition and/or describing the natural world. Lastly, open source software and open access publishing were common underlying themes but rarely explicitly stated. I wonder whether or not the idea of “open” is a four-letter word.

Michael Hart in Roanoke (Indiana)

On Saturday, February 27, Paul Turner and I made our way to Roanoke (Indiana) to listen to Michael Hart tell stories about electronic texts and Project Gutenberg. This posting describes our experience.

Roanoke and the library

To celebrate its 100th birthday, the Roanoke Public Library invited Michael Hart of Project Gutenberg fame to share his experience regarding electronic texts in a presentation called “Books & eBooks: Past, Present & Future Libraries”. The presentation was scheduled to start around 3 o’clock, but Paul Turner and I got there more than an hour early. We wanted to have time to visit the Library before it closed at 2 o’clock. The town of Roanoke (Indiana) — a bit southwest of Fort Wayne — was tiny by just about anybody’s standard. It sported a single blinking red light, a grade school, a few churches, one block of shops, and a couple of eating establishments. According to the man in the bar, the town got started because of the locks that had been built around town.

The Library was pretty small too, but it burst with pride. About 1,800 square feet in size, it was overflowing with books and videos. There were a couple of comfy chairs for adults, a small table, a set of four computers to do Internet things, and at least a few clocks on the wall. They were very proud of the fact that they had become an Evergreen library as a part of the Evergreen Indiana initiative. “Now it is possible to see what is owned in other, nearby libraries, and borrow things from them as well,” said the Library’s Board Director.

Michael Hart

The presentation itself was not held in the Library but in a nearby church. About fifty (50) people attended. We sat in the pews and contemplated the symbolism of the stained glass windows and wondered how the various hardware placed around the altar was going to be incorporated into the presentation.

Full of smiles and joviality, Michael Hart appeared in a tailless tuxedo, cummerbund, and top hat. “I am now going to pull a library out of my hat,” he proclaimed, and proceeded to withdraw a memory chip. “This chip contains tens of thousands of books, and now I’m going to pull a million books out of my pocket,” and he proceeded to display a USB drive. Before the year 2020 he sees us capable of carrying around a billion books on some sort of portable device. Such was the essence of his presentation — computer technology enables the distribution and acquisition of “books” in ways never before possible. Through this technology he wants to change the world. “I consider myself to be like Johnny Appleseed, and I’m spreading the word,” at which time I raised my hand and told him Johnny Appleseed (John Chapman) was buried just up the road in Fort Wayne.

Mr. Hart displayed and described a lot of antique hardware. A hard drive that must have weighed fifty (50) pounds. Calculators. Portable computers. Etc. He illustrated how storage mediums were getting smaller and smaller while being able to save more and more data. He was interested in the packaging of data and displayed a memory chip a person can buy from Walmart containing “all of the hit songs from the 50’s and 60’s”. (I wonder how the copyright issues around that one had been addressed.) “The same thing,” he said, “could be done for books but there is something wrong with the economics and the publishing industry.”

Roanoke (Indiana)
Roanoke (Indiana)
public library
public library

He outlined how Project Gutenberg works. First a book is identified as a possible candidate for the collection. Second, the legalities of making the book available are explored. Next, a suitable edition of the book is located. Fourth, the book’s content is transcribed or scanned. Finally, hundreds of people proofread the result and ultimately make it available. Hart advocated getting the book out sooner rather than later. “It does not have to be perfect, and we can always fix the errors later.”

He described how the first Project Gutenberg item came into existence. In a very round-about and haphazard way, he enrolled in college. Early on he gravitated towards the computer room because it was air conditioned. Through observation he learned how to use the computer, and to do his part in making the expense of the computer worthwhile, he typed out the United States Declaration of Independence on July 4th, 1971.

“Typing the books is fun,” he said. “It provides a means for reading in ways you had never read them before. It is much more rewarding than scanning.” As a person who recently learned how to bind books and as a person who enjoys writing in books, I asked Mr. Hart to compare & contrast ebooks, electronic texts, and codexes. “The things Project Gutenberg creates are electronic texts, not ebooks. They are small, portable, easily copyable, and readable by any device. If you can’t read a plain text document on your computer, then you have much bigger problems. Moreover, there is an enormous cost-benefit compared to printed books. Electronic texts are cheap.” Unfortunately, he never really answered the question. Maybe I should have phrased it differently and asked him, the way Paul did, to compare the experience of reading physical books and electronic texts. “I don’t care if it looks like a book. Electronic texts allow me to do more reading.”

“Two people invented open source. Me and Richard Stallman,” he said. Well, I don’t think this is exactly true. Rather, Richard Stallman invented the concept of GNU software, and Michael Hart may have invented the concept of open access publishing. But the subtle differences between open source software and open access publishing are lost on most people. In both cases the content is “free”. I guess I’m too close to the situation. I too see open source software distribution and open access publishing having more things in common than differences.

stained glass

“I knew Project Gutenberg was going to be a success when I was talking on the telephone with a representative of the Common Knowledge project and heard a loud crash on the other end of the line. It turns out the representative’s son and friends had broken an Adirondack chair while clamoring to read an electronic text.” In any case, he was fanatically passionate about giving away electronic texts. He cited the World eBook Fair, and came to the presentation with plenty of CD’s for distribution.

In the end I had my picture taken with Mr. Hart. We then all retired to the basement for punch and cake where we sang Happy Birthday to Michael. Two birthdays celebrated at the same time.


Michael and Eric
Michael and Eric

Many people are drawn to the library profession as a matter of principle. Service to others. Academic freedom. Preservation of the historical record. I must admit that I am very much the same way. I was drawn to librarianship for two reasons. First, as a person with a BA in philosophy, I saw libraries as places full of ideas, literally. Second, I saw the profession as a growth industry because computers could be used to disseminate the content of books. In many ways my gut feelings were accurate, but at the same time they were misguided because much of librarianship surrounds workflows, processes that are only a couple of steps away from factory work, and the curation of physical items. To me, just like Mr. Hart, the physical item is not as important as what it manifests. It is not about the book. Rather, it is what is inside the book. We librarians have tied our identities to the physical book in such a way as to be limiting. We have pegged ourselves, portrayed a short-sighted vision, and consequently painted ourselves into a corner. Is the carpenter a hammer expert? Is the surgeon a scalpel technician? No, they are builders and healers, respectively. Why must librarianship be identified with books?

I have benefited from Mr. Hart’s work. My Alex Catalogue of Electronic Texts contains many Project Gutenberg texts. Unlike the books from the Internet Archive, the texts are much more amenable to digital humanities computing techniques because they have been transcribed by humans and not scanned by computers. At the same time, the Project Gutenberg texts are not formatted as well for printing or screen display as PDF versions of the same. This is why the use of electronic texts and ebooks is not an either/or situation but rather a both/and, especially when it comes to analysis. Read a well-printed book. Identify item of interest. Locate item in electronic version of book. Do analysis. Return to printed book. The process could work just as well the other way around. Ask a question of the electronic text. Get one or more answers. Examine them in the context of the printed word. Both/and, not either/or.

The company was great, and the presentation was inspiring. I applaud Michael Hart for his vision and seemingly undying enthusiasm. His talk made me feel like I really am on the right track, but change takes time. The free distribution of data and information — whether the meaning of free be denoted as liberty or gratis — is the right thing to do for society in general. We all benefit, and therefore the individual benefits as well. The political “realities” of the situation are more like choices and not Platonic truths. They represent immediate objectives as opposed to long-term strategic goals. I guess this is what you get when you mix the corporeal and ideal natures of humanity.

Who would have known that a trip to Roanoke would turn out to be a reflection of what it means to be human?

Valencia and Madrid: A Travelogue

I recently had the opportunity to visit Valencia and Madrid (Spain) to share some of my ideas about librarianship. This posting describes some of the things I saw and learned along the way.

La Capilla de San Francisco de Borja
La Capilla de San Francisco de Borja
Capilla del Santo Cáliz
Capilla del Santo Cáliz

LIS-EPI Meeting

In Valencia I was honored to give the opening remarks at the 4th International LIS-EPI Meeting. Hosted by the Universidad Politécnica de Valencia and organized by Fernanda Mancebo as well as Antonia Ferrer, the Meeting provided an opportunity for librarians to come together and share their experiences in relation to computer technology. My presentation, “A few possibilities for librarianship by 2015”, outlined a few near-term futures for the profession. From the introduction:

The library profession is at a cross roads. Computer technology coupled with the Internet have changed the way content is created, maintained, evaluated, and distributed. While the core principles of librarianship (collection, organization, preservation, and dissemination) are still very much apropos to the current milieu, the exact tasks of the profession are not as necessary as they once were. What is a librarian to do? In my opinion, there are three choices: 1) creating services against content as opposed to simply providing access to it, 2) curating collections that are unique to our local institutions, or 3) providing sets of services that are a combination of #1 and #2.

And from the conclusion:

If libraries are representing a smaller and smaller role in the existing information universe, then two choices present themselves. First, the profession can accept this fact, extend it out to its logical conclusion, and see that libraries will eventually play an insignificant role in society. Libraries will not be libraries at all but more like purchasing agents and middle men. Alternatively, we can embrace the changes in our environment, learn how to take advantage of them, exploit them, and change the direction of the profession. This second choice requires a period of transition and change. It requires resources spent against innovation and experimentation with the understanding that innovation and experimentation more often generate failures as opposed to successes. The second option carries with it greater risk but also greater rewards.

robot sculpture
robot sculpture

Josef Hergert

Providing a similar but different vision from my own, Josef Hergert (University of Applied Sciences HTW Chur) described how librarianship ought to be embracing Web 2.0 techniques in a presentation called “Learning and Working in Time of Web 2.0: Reconstructing Information and Knowledge”. To say Hergert was advocating information literacy would be to over-simplify his remarks, yet if you broaden the definition of information literacy to include the use of blogs, wikis, social bookmarking sites — Web 2.0 technologies — then the phrase information literacy is right on target. A number of notable quotes included:

  • We are experiencing many changes in the environment: non-commercial sharing of content, legislative overkill, and “pirate parties”… The definition of “authorship” is changing.
  • The teaching of information literacy courses will help overcome some of the problems.
  • The process of learning is changing because of the Internet… We are now experiencing a greater degree of informal learning as opposed to formal learning… We need as librarians to figure out how to exploit the environment to support learning both formal and informal.
  • The current environment is more than paper, but also about a network of people, and the librarian can help create these networks with [Web 2.0 tools].
  • Provide not only the book but the environment and tools to do the work.

As an aside, I have been using networked computer technologies for more than twenty years. Throughout that time a number of truisms have become apparent. “If you don’t want it copied, then don’t put it on the ‘Net; give back to the ‘Net”, “On the Internet nobody knows that you are a dog”, and “It is like trying to drink from a fire hose” are just a few. Hergert used the newest one, “If it is not on the Internet, then it doesn’t exist.” For better or for worse, I think this is true. Convenience is a very powerful elixir. The ease of acquiring networked data and information is so great compared to the time and energy needed to get data and information in analog format that people will settle for what is simply “good enough”. In order to remain relevant, libraries must put their (full text) content on the ‘Net or be seen as an impediment to learning as opposed to learning’s facilitator.

While I would have enjoyed learning what the other Meeting presenters had to say, it was unrealistic for me to attend the balance of the conference. The translators were going back to Switzerland, and I would not have been able to understand what the presenters were saying. In this regard I sort of felt like the Ugly American, but I have come to realize that the use of English is a purely practical matter. It has nothing to do with a desire to understand American culture.

Biblioteca Valenciana

The next day a few others and I had the extraordinary opportunity to get an inside tour of the Biblioteca Valenciana (Valencia Library). Starting out as a monastery, the building was transformed into quite a number of other things, such as a prison, before it became a library. We got to go into the archives, see some of their treasures, and learn about the library’s history. They were very proud of their Don Quixote collection, and we saw their oldest book — a treatise on the Black Death which included recipes for treatments.

Biblioteca Nacional de España

In Madrid I visited the Biblioteca Nacional de España (National Library of Spain) and went to their museum. It was free, and I saw an exhibition of original Copernicus, Galileo, Brahe, Kepler, and Newton editions embodying Western scientific progress. Very impressive, and very well done, especially considering the admission fee.

Biblioteca Nacional de España
Biblioteca Nacional

International Institute

Finally, I shared the presentation from the LIS-EPI Meeting at the International Institute. While I advocated changes in the ways our profession does its work, the attendees at both venues wondered how to go about these changes. “We are expected to provide a certain set of services to our patrons here and now. What do we do to learn these new skills?” My answer was grounded in applied research & development. Time must be spent experimenting and “playing” with the new technologies. This should be considered an investment in the profession and its personnel, an investment that will pay off later in new skills and greater flexibility. We work in academia. It behooves us to work academically. This includes explorations into applying our knowledge in new and different ways.


Many thanks go to many people for making this professional adventure possible. I am indebted to Monica Pareja from the United States Embassy in Madrid. She kept me out of trouble. I thank Fernanda Mancebo and Antonia Ferrer who invited me to the Meeting. Last and certainly not least, I thank my family for allowing me to go to Spain in the first place since the event happened over the Thanksgiving holiday. “Thank you, one and all.”


Colloquium on Digital Humanities and Computer Science: A Travelogue

On November 14-16, 2009 I attended the 4th Annual Chicago Colloquium on Digital Humanities and Computer Science at the Illinois Institute of Technology in Chicago. This posting outlines my experiences there, but in a phrase, I found the event to be very stimulating. In my opinion, libraries ought to be embracing the techniques described here and integrating them into their collections and services.

Paul Galvin Library
Paul Galvin Library

Day #0 – A pre-conference workshop

Upon arrival I made my way directly to a pre-conference workshop entitled “Machine Learning, Sequence Alignment, and Topic Modeling at ARTFL” presented by Mark Olsen and Clovis Gladstone. In the workshop they described at least two applications they were using to discover common phrases between texts. The first was called Philomine and the second was called Text::Pair. Both work similarly, but Philomine needs to be integrated with PhiloLogic, while Text::Pair is a stand-alone Perl module. Using these tools, n-grams are extracted from texts, indexed on the file system, and made available for searching. By entering phrases into a local search engine, hits are returned that include the phrases and the works where each phrase was found. I believe Text::Pair could be successfully integrated into my Alex Catalogue.
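The core idea behind these tools can be sketched in a few lines of code. The following is a minimal illustration (in Python rather than Perl, and nothing like the real implementations): extract word n-grams from each text and intersect the resulting sets.

```python
def ngrams(text, n=3):
    """Yield word n-grams from a text as tuples."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

def common_phrases(text_a, text_b, n=3):
    """Return the set of word n-grams shared by two texts."""
    return set(ngrams(text_a, n)) & set(ngrams(text_b, n))

shared = common_phrases(
    "in the beginning was the word and the word was with god",
    "and the word was made flesh",
    n=4,
)
print(shared)  # {('and', 'the', 'word', 'was')}
```

The real systems add an index so the matching scales to whole corpora, but the intersection of n-gram sets is the heart of the technique.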

orange, green, and gray
orange, green, and gray
orange and green
orange and green

Day #1

The Colloquium formally began the next day with an introduction by Russell Betts (Illinois Institute of Technology). His most notable quote was, “We have infinite computer power at our fingertips, and without much thought you can create an infinite amount of nonsense.” Too true.

Marco Büchler (University of Leipzig) demonstrated textual reuse techniques in a presentation called “Citation Detection and Textual Reuse on Ancient Greek Texts”. More specifically, he used textual reuse to highlight differences between texts, graph ancient history, and explore computer science algorithms.

Patrick Juola’s (Duquesne University) “conjecturator” was the heart of the next presentation, called “Mapping Genre Spaces via Random Conjectures”. In short, Juola generated thousands and thousands of “facts” of the form [subject1] uses [subject2] more or less than [subject3]. He then tested each of these facts for truth against a corpus. Ironically, he was doing much of what Betts alluded to in the introduction — creating nonsense. On the other hand, the approach was innovative.
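A toy version of such a conjecturator might look like the sketch below (entirely my own construction; the corpora, word list, and relative-frequency test are illustrative assumptions, not Juola's actual method):

```python
import random
from collections import Counter

def conjecture_test(corpora, words, trials=5, seed=0):
    """Generate random conjectures of the form
    '[author A] uses [word W] more than [author B]'
    and test each one against relative word frequencies."""
    rng = random.Random(seed)
    freqs = {a: Counter(t.lower().split()) for a, t in corpora.items()}
    totals = {a: sum(c.values()) for a, c in freqs.items()}
    results = []
    for _ in range(trials):
        a, b = rng.sample(list(corpora), 2)  # pick two authors
        w = rng.choice(words)                # pick a word
        holds = freqs[a][w] / totals[a] > freqs[b][w] / totals[b]
        results.append((a, w, b, holds))
    return results

corpora = {
    "austen": "it is a truth universally acknowledged that a single man",
    "melville": "call me ishmael some years ago never mind how long",
}
for a, w, b, holds in conjecture_test(corpora, ["a", "me", "that"]):
    print(f"{a} uses '{w}' more than {b}: {holds}")
```

Most randomly generated conjectures are nonsense, of course; the interesting part is sifting the few that hold across a large corpus.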

By exploiting a parts-of-speech (POS) parser, Devin Griffiths (Rutgers University) sought the use of analogies as described in “On the Origin of Theories: The Semantic Analysis of Analogy in Scientific Corpus”. Assuming that an analogy can be defined as a noun-verb-noun-conjunction-noun-verb-noun phrase, Griffiths looked for analogies in Darwin’s Origin of Species, graphed the number of analogies against locations in the text, and drew conclusions accordingly. He asserted that the use of analogy was very important during the Victorian Age, and he tried to demonstrate this assertion through a digital humanities approach.
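The pattern-matching idea can be illustrated with a small sketch (my own; the toy lexicon here stands in for a real POS tagger, which the actual project would have used):

```python
import re

# Toy POS lexicon standing in for a real tagger (hypothetical; a real
# project would use an actual POS tagger such as NLTK's).
LEXICON = {
    "selection": "N", "breeds": "V", "variation": "N", "as": "C",
    "nature": "N", "shapes": "V", "species": "N", "the": "D",
}

def find_analogies(words):
    """Find noun-verb-noun-conjunction-noun-verb-noun runs in a word
    sequence: the pattern used to approximate analogies."""
    tags = "".join(LEXICON.get(w, "X") for w in words)
    # A zero-width lookahead lets overlapping candidates be found.
    return [words[m.start():m.start() + 7]
            for m in re.finditer(r"(?=NVNCNVN)", tags)]

sentence = "selection breeds variation as nature shapes species".split()
print(find_analogies(sentence))  # one match: the full seven-word phrase
```

Once candidate phrases are extracted this way, counting them per chapter gives exactly the sort of position-versus-frequency graph described above.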

The use of LSIDs (large screen information displays) was discussed by Geoffrey Rockwell (McMaster University). While I did not take a whole lot of notes from this presentation, I did get a couple of ideas: 1) figure out a way for a person to “step into” a book, or 2) display a graphic representation of a text on a planetarium ceiling. Hmm…

Kurt Fendt (MIT) described a number of ways timelines could be used in the humanities in his presentation called “New Insights: Dynamic Timelines in Digital Humanities”. Through the process I became aware of the SIMILE timeline application/widget. Very nice.

I learned of the existence of a number of digital humanities grants as described by Michael Hall (NEH). These include start-up grants as well as grants on advanced topics.

The first keynote speech, “Humanities as Information Sciences”, was given by Vasant Honavar (Iowa State University) in the afternoon. Honavar began with a brief history of thinking and philosophy, which he believes led to computer science. “The heart of information processing is taking one string and transforming it into another.” (Again, think of the introductory remarks.) He advocated the creation of symbols, feeding them into a processor, and coming up with solutions out the other end. Language, he posited, is an information-rich artifact and therefore something that can be analyzed with computing techniques. I liked how he compared science with the humanities: science observes physical objects, and the humanities observe human creations. Honavar was a bit arscient, and therefore someone to be admired.

subway tunnel
subway tunnel
skyscraper predecessor
skyscraper predecessor

Day #2

In “Computational Phonostylistics: Computing the Sounds of Poetry” Marc Plamondon (Nipissing University) described how he counted phonemes in both Tennyson’s and Browning’s poetry to test whether Tennyson’s poetry is “musical” and Browning’s poetry is “harsh”; that is, whether one favors soft, liquid sounds and the other hard, plosive or fricative ones. To do this he assumed one set of characters is soft and another set is hard. He then counted the number of times each of these sets of characters occurred in the respective poets’ works. The result was a graph illustrating the musicality or harshness of the poetry. One of the more interesting quotes from Plamondon’s presentation: “I am interested in quantifying aesthetics.”
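Plamondon’s counting approach is easy to prototype. Here is a toy sketch (my own; the letter sets are illustrative and certainly over-simplified stand-ins for his phoneme categories):

```python
def musicality(text, soft="lmnrwvy", hard="kptgbd"):
    """Score a text by the ratio of 'soft' (liquid/nasal) letters to
    'hard' (plosive) letters -- a crude stand-in for proper phoneme
    counts; the letter sets here are illustrative guesses."""
    text = text.lower()
    soft_n = sum(text.count(c) for c in soft)
    hard_n = sum(text.count(c) for c in hard)
    return soft_n / hard_n if hard_n else float("inf")

tennyson = "the moan of doves in immemorial elms and murmuring of innumerable bees"
browning = "grand rough old martin luther bloomed fables flowers on furze"
print(musicality(tennyson), musicality(browning))
```

Even this crude letter-counting version scores the famously liquid Tennyson line higher than the Browning one; the real work lies in choosing defensible phoneme categories rather than raw letters.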

In C.W. Forstall’s (SUNY Buffalo) presentation “Features from Frequency: Authorship and Stylistic Analysis Using Repetitive Sound” we learned how he, too, counts sound n-grams to characterize style. He applied the technique to D.H. Lawrence as well as to the Iliad and the Odyssey, and the technique worked to his satisfaction.

The second keynote presentation was given by Stephen Wolfram (Wolfram Research) via teleconference. It was called “What Can Be Made Computable in the Humanities?” He began by describing Mathematica as a tool he uses to explore the world around him. All of this assumes that the world consists of patterns, and that these patterns can be described through the use of numbers. He elaborated through something he called the Principle of Computational Equivalence — once a system passes a certain threshold, it generates real complexity. Such a principle puts pressure on finding the simplest descriptive model possible. (Such things are standard scientific/philosophic principles. Nothing new here.)

Looking for patterns was the name of his game, and one such game was applied to music. Discover the patterns in a type of music. Feed the patterns to a computer. Have the computer generate the music. Most of the time the output works pretty well. He calls this WolframTones.

He went on to describe WolframAlpha as an attempt to make the world’s knowledge computable. Essentially a front-end to Mathematica, WolframAlpha is a vast collection of content associated with numbers: people and their birth dates, the agricultural output of countries, the price of gold over time, temperatures from across the world, etc. Queries are accepted into the system, searches are done against its content, and results are returned in the form of best-guess answers complete with graphs and charts. WolframAlpha exposes mathematical processing to the general public in ways that have not been done previously.

Wolfram described two particular challenges in the creation of WolframAlpha. One was the collection of content. Unlike Google, Wolfram Research does not necessarily crawl the Internet; rather, it selectively collects the content of a “reference library” and integrates it into the system. The second, and more challenging, has been the design of the user interface. People do not enter structured queries, but structured output is expected, and interpreting people’s input is a difficult task in and of itself. From my point of view, he is probably learning more about human thought processes than about the natural world.

red girder sculpture
red girder sculpture
gray sculpture
gray sculpture

Some thoughts

This meeting was worth every single penny, especially considering that there was absolutely no registration fee. Free, except for my travel costs, hotel, and the price of the banquet. Unbelievable!

Just as importantly, the presentations given at this meeting demonstrate the maturity of the digital humanities. These things are not just toys but practical tools for evaluating (mostly) texts. Given the increasing amount of full text available in library collections, I see very little reason why these sorts of digital humanities applications could not be incorporated into library collections and services. Collect full text content. Index it. Provide access to the index. Get back a set of search results. Select one or more items. Read them. Select one or more items again, and then select an option such as graph analogies, graph phonemes, or list common phrases between texts. People need to do more than read the texts. People need to use the texts, to analyze them, to compare & contrast them with other texts. The tools described at this conference demonstrate that such things are more than possible. All that has to be done is to integrate them into our current (library) systems.
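The “collect, index, search” pipeline described above can be sketched with a toy inverted index (a bare-bones illustration of the idea, not production library software):

```python
from collections import defaultdict

def build_index(collection):
    """Build a tiny inverted index: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, *words):
    """Return the ids of documents containing every query word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

collection = {
    "origin": "natural selection acts on variation in species",
    "walden": "i went to the woods to live deliberately",
}
print(search(build_index(collection), "selection", "species"))  # {'origin'}
```

Once such an index exists, the analysis tools described at the Colloquium (common phrases, phoneme counts, analogy graphs) become options a reader could invoke on any result set.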

So many opportunities. So little time.

Collecting water and putting it on the Web (Part I of III)

This is Part I of an essay about my water collection, specifically the whys and hows of it. Part II describes the process of putting the collection on the Web. Part III is a summary, provides opportunities for future study, and links to the source code.

I collect water

It may sound strange, but I have been collecting water since 1978, and to date I believe I have around 200 bottles containing water from all over the world. Most of the water I’ve collected myself, but a good deal of it has also been collected by friends and relatives.

The collection began the summer after I graduated from high school. One of my best friends, Marlin Miller, decided to take me to Ocean City (Maryland) since I had never seen the ocean. We arrived around 2:30 in the morning, and my first impression was the sound. I didn’t see the ocean. I just heard it, and it was loud. The next day I purchased a partially melted glass bottle for 59¢ and put some water, sand, and air inside. I was going to keep some of the ocean so I could experience it anytime I desired. (Actually, I believe my first water is/was from the Pacific Ocean, collected by a girl named Cindy Bleacher. She visited there in the late Spring of ’78, and I asked her to bring some back so I could see it too. She did.) That is how the collection got started.

Cape Cod Bay
Cape Cod Bay
Robins Bay
Robins Bay
Gulf of Mexico
Gulf of Mexico

The impetus behind the collection was reinforced in college — Bethany College (Bethany, WV). As a philosophy major I learned about the history of Western ideas. That included Heraclitus, who believed the only constant was change, and Thales, who held that water was the essential element of the universe. These ideas were elaborated upon by other philosophers who thought there was not one essential element but four: earth, water, air, and fire. I felt like I was on to something, and whenever I heard of somebody going abroad I asked them to bring me back some water. Burton Thurston, a Bethany professor, went to the Middle East on a diplomatic mission. He brought back Nile River water and water from the Red Sea. I could almost see Moses floating in his basket and escaping from the Egyptians.

The collection grew significantly in the Fall of 1982 because I went to Europe. During college many of my friends studied abroad. They didn’t do so much studying as they did traveling. They were seeing and experiencing all of the things I was learning about through books. Great art. Great architecture. Cities whose histories go back millennia. Foreign languages, cultures, and foods. I wanted to see those things too. I wanted to make real the things I learned about in college. I saved my money from my summer peach-picking job. My father cashed in a life insurance policy he had taken out on me when I was three weeks old. Living like a turtle with its house on its back, I did the back-packing thing across Europe for a mere six weeks. Along the way I collected water from the Seine at Notre Dame (Paris), the Thames (London), the Eiger Mountain (near Interlaken, Switzerland) where I almost died, the Aegean Sea (Ios, Greece), and many other places. My Mediterranean Sea water from Nice is the prettiest. Because of all the algae, the water from Venice is/was the most biologically active.

Over the subsequent years the collection has grown at a slower but regular pace. Atlantic Ocean water (Myrtle Beach, South Carolina) from a day of playing hooky from work. A pond at Versailles while on my honeymoon. Holy water from the River Ganges (India). Water from Loch Ness; I’m going to grow a monster from the DNA contained therein. I used to have some of a glacier from the Canadian Rockies, but it melted. I have water from Three Mile Island (Pennsylvania); it glows in the dark. Amazon River water from Peru. Water from the Missouri River where Lewis & Clark decided it began. Etc.

Many of these waters I haven’t seen in years. Moves from one home to another have relegated them to cardboard boxes that have never been unpacked. Most assuredly some of the bottles have broken and some of the water has evaporated. Such is the life of a water collection.

Lake Huron
Lake Huron
Trg Bana Jelacica
Trg Bana Jelacica
Jimmy Carter Water
Jimmy Carter Water

Why do I collect water? I’m not quite sure. The whole of the world’s water is the second largest thing I know, the first being the sky. Yet the natural bodies of water around the globe are finite. It would be possible to collect water from everywhere, but very difficult. Maybe I like the challenge. Collecting water is cheap, and every place has it. Water makes a great souvenir, and the collection process helps strengthen my memories. When other people collect water for me it builds between us a special relationship — a bond. That feels good.

What do I do with the water? Nothing. It just sits around my house occupying space, in my office and in the cardboard boxes in the basement. I would like to display it, but overall the bottles aren’t very pretty, and they gather dust easily. I sometimes ponder the idea of re-bottling the water into tiny vials and selling it at very expensive prices, but in the process the air would escape, and the item would lose its value. Other times I imagine pouring the water into a tub and taking a bath in it. How many people could say they bathed in the Nile River, Amazon River, Pacific Ocean, Atlantic Ocean, etc. all at the same time?

How water is collected

The actual process of collecting water is almost trivial. Here’s how:

  1. Travel someplace new and different – The world is your oyster.
  2. Identify a body of water – This should be endemic to the locality, such as an ocean, sea, lake, pond, river, stream, or even a public fountain. Natural bodies of water are preferable. Processed water is not.
  3. Find a bottle – In earlier years this was difficult, and I usually purchased a bottle of wine with my meal, kept the bottle and cork, and used the combination as my container. Nowadays it is easier to root around in a trash can for a used water bottle. They’re ubiquitous, and they too are often endemic to the locality.
  4. Collect the water – Just fill the bottle with mostly water but some of what the water is flowing over as well. The air comes along for the ride.
  5. Take a photograph – Hold the bottle at arm’s length and take a picture of it. What you are really doing here is two-fold: documenting the appearance of the bottle and documenting the authenticity of the place. The picture’s background supports the fact that the water really came from where the collector says it did.
  6. Label the bottle – On a small piece of paper write the name of the body of water, where it came from, who collected it, and when. Anything else is extra.
  7. Save – Keep the water around for posterity, but getting it home is sometimes a challenge. Since 9/11 it has been difficult to get the water through airport security and/or customs. I have recently found myself checking my bags and incurring a handling fee just to bring my water home. Collecting water is not as cheap as it used to be.

Who can collect water for me? Not just anybody. I have to know you. Don’t take it personally, but remember, part of the goal is relationship building. Moreover, getting water from strangers would jeopardize the collection’s authenticity. Is this really the water they say it is? Call it a weird part of the “collection development policy”.

Pacific Ocean
Pacific Ocean
Rock Run
Rock Run
Salton Sea
Salton Sea

Read all the posts in this series:

  1. This post
  2. How the collection is put on the Web
  3. A summary, future directions, and source code

Visit the water collection.

Microsoft Surface at Ball State

A number of colleagues from the University of Notre Dame and I visited folks from Ball State University and Ohio State University to see, touch, and discuss all things Microsoft Surface.

There were plenty of demonstrations surrounding music, photos, and page turners. The folks at Ball State were finishing up applications for the dedication of the new “information commons”. These applications included an exhibit of orchid photos and an interactive map. Move the scroll bar. Get a different map based on time. Tap locations. See pictures of buildings. What was really interesting about the latter was the way it pulled photographs from the library’s digital repository through sets of Web services. A very nice piece of work. Innovative and interesting. They really took advantage of the technology as well as figured out ways to reuse and repurpose library content. They are truly practicing digital librarianship.

The information commons was nothing to sneeze at either. Plenty of television cameras, video screens, and multi-national news feeds. Just right for a school with a focus on broadcasting.

Ball State University. Hmm…

Mass Digitization Mini-Symposium: A Reverse Travelogue

The Professional Development Committee of the Hesburgh Libraries at the University of Notre Dame hosted a “mini-symposium” on the topic of mass digitization on Thursday, May 21, 2009. This text documents some of what the speakers had to say. Given the increasingly wide availability of free full text information provided through mass digitization, the forum offered an opportunity for participants to learn how such a thing might affect learning, teaching, and scholarship. *

Setting the Stage

presenters and organizers
Presenters and organizers

After introductions by Leslie Morgan, I gave a talk called “Mass digitization in 15 minutes” where I described some of the types of library services and digital humanities processes that could be applied to digitized literature. “What might libraries be like if 51% or more of our collections were available in full text?”

Maura Marx

The Symposium really got underway with the remarks of Maura Marx (Executive Director of the Open Knowledge Commons) in a talk called “Mass Digitization and Access to Books Online.” She began by giving an overview of mass digitization (such as the efforts of the Google Books Project and the Internet Archive) and compared it with large-scale digitization efforts. “None of this is new,” she said, and gave examples including Project Gutenberg, the Library of Congress Digital Library, and the Million Books Project. Because the Open Knowledge Commons is an outgrowth of the Open Content Alliance, she was able to describe in detail the mechanical digitizing process of the Internet Archive with its costs approaching 10¢/page. Along the way she advocated the HathiTrust as a preservation and sharing method, and she described it as a type of “radical collaboration.” “Why is mass digitization so important?” She went on to list and elaborate upon six reasons: 1) search, 2) access, 3) enhanced scholarship, 4) new scholarship, 5) public good, and 6) the democratization of information.

The second half of Ms. Marx’s presentation outlined three key issues regarding the Google Books Settlement. Specifically, the settlement will give Google a sort of “most favored nation” status because it prevents Google from getting sued in the future, but it does not protect other possible digitizers the same way. Second, it circumvents, through contract law, the problem of orphan works; the settlement sidesteps many of the issues regarding copyright. Third, the settlement is akin to a class action suit, but in reality the majority of people affected by the suit are unknown since they fall into the class of orphan works holders. To paraphrase, “How can a group of unknown authors and publishers pull together a class action suit?”

She closed her presentation with a more thorough description of Open Knowledge Commons agenda which includes: 1) the production of digitized materials, 2) the preservation of said materials, and 3) and the building of tools to make the materials increasingly useful. Throughout her presentation I was repeatedly struck by the idea of the public good the Open Knowledge Commons was trying to create. At the same time, her ideas were not so naive to ignore the new business models that are coming into play and the necessity for libraries to consider new ways to provide library services. “We are a part of a cyber infrastructure where the key word is ‘shared.’ We are not alone.”

Gary Charbonneau

Gary Charbonneau (Systems Librarian, Indiana University – Bloomington) was next and gave his presentation called “The Google Books Project at Indiana University“.

Indiana University, in conjunction with a number of other CIC (Committee on Institutional Cooperation) libraries, has begun working with Google on the Google Books Project. Like many previous Google Books partners, Charbonneau was not authorized to share many details regarding the Project; he was only authorized “to paint a picture” with the metaphoric “broad brush.” He described the digitization process as rather straightforward: 1) pull books from a candidate list, 2) charge them out to Google, 3) put the books on a truck, 4) wait for them to return in a few weeks or so, and 5) charge the books back into the library. In return for this work they get: 1) attribution, 2) access to snippets, and 3) sets of digital files which are in the public domain. About 95% of the works are still under copyright, and none of the books come from their rare book library — the Lilly Library.

Charbonneau thought the real value of the Google Book search was the deep indexing, something mentioned by Marx as well.

Again, not 100% of the library’s collection is being digitized, but there are plans to get closer to that goal. For example, they are considering plans to digitize their “Collections of Distinction” as well as some of their government documents. Like Marx, he advocated the HathiTrust but he also suspected commercial content might make its way into its archives.

One of the more interesting things Charbonneau mentioned was in regard to URLs. Specifically, there are currently no plans to insert the URLs of digitized materials into the 856 $u field of MARC records denoting the location of items. Instead they plan to use an API (application programming interface) to display the location of files on the fly.

Indiana University hopes to complete their participation in the Google Books Project by 2013.

Sian Meikle

The final presentation of the day was given by Sian Meikle (Digital Services Librarian, University of Toronto Libraries) whose comments were quite simply entitled “Mass Digitization.”

The massive (no pun intended) University of Toronto library system consisting of a whopping 18 million volumes spread out over 45 libraries on three campuses began working with the Internet Archive to digitize books in the Fall of 2004. With their machines (the “scribes”) they are able to scan about 500 pages/hour and, considering the average book is about 300 pages long, they are scanning at a rate of about 100,000 books/year. Like Indiana and the Google Books Project, not all books are being digitized. For example, they can’t be too large, too small, brittle, tightly bound, etc. Of all the public domain materials, only 9% or so do not get scanned. Unlike the output of the Google Book Project, the deliverables from their scanning process include images of the texts, a PDF file of the text, an OCRed version of the text, a “flip book” version of the text, and a number of XML files complete with various types of metadata.
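As a back-of-the-envelope check (my own arithmetic, using only the figures quoted above), 100,000 books per year at 300 pages each implies some 60,000 scanning-station hours annually, so the throughput must come from many “scribes” running in parallel:

```python
# Back-of-the-envelope check of the reported throughput (input numbers
# from the talk; the derived machine-hours figure is my own).
pages_per_hour = 500          # one "scribe" scanning station
pages_per_book = 300          # average book length
books_per_year = 100_000      # reported yearly output

total_pages = books_per_year * pages_per_book   # 30,000,000 pages/year
machine_hours = total_pages / pages_per_hour    # 60,000.0 hours/year
print(machine_hours)
```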

Considering Meikle’s experience with mass digitized materials, she was able to make a number of observations and distinctions. For example, we — the library profession — need to understand the difference between “born digital” materials and digitized materials. Because of formatting, technology, errors in OCR, etc., the different manifestations have different strengths and weaknesses. Some things are more easily searched. Some things are displayed better on screens. Some things are designed for paper and binding. Another distinction is access. According to some of her calculations, materials in electronic form get “used” more than their printed form, where “used” means borrowed or downloaded. Sometimes the ratio is as high as 300-to-1: three hundred downloads for every one borrow. Furthermore, she has found that, proportionately, English language items are not used as heavily as materials in other languages. One possible explanation is that material in other languages can be harder to locate in print. Yet another difference is the type of reading one format offers over another; compare and contrast “intentional reading” with “functional reading.” Books on computers make it easy to find facts and snippets. Books on paper tend to lend themselves better to the understanding of bigger ideas.

Lastly, Meikle alluded to ways the digitized content will be made available to users. Specifically, she imagines it will become a part of an initiative called the Scholar’s Portal — a single index of journal article literature, full text books, and bibliographic metadata. In my mind, such an idea is the heart of the “next generation” library catalog.

Summary and Conclusion

The symposium was attended by approximately 125 people. Most were from the Hesburgh Libraries of the University of Notre Dame. Some were from regional libraries. There were a few University faculty in attendance. The event was a success in that it raised the awareness of what mass digitization is all about, and it fostered communication during the breaks as well as after the event was over.

The opportunities for librarianship and scholarship in general are almost boundless considering the availability of full text content. The opportunities are even greater when the content is free of licensing restrictions. While the idea of complete collections totally free of restrictions is a fantasy, the idea of significant amounts of freely available full text content is easily within our grasp. During the final question and answer period, someone asked, “What skills and resources are necessary to do this work?” The answer was agreed upon by the speakers, “What is needed? An understanding that the perfect answer is not necessary prior to implementation.” There were general nods of agreement from the audience.

Now is a good time to consider the possibilities of mass digitization and to be prepared to deal with them before they become the norm as opposed to the exception. This symposium, generously sponsored by the Hesburgh Libraries Professional Development Committee, as well as library administration, provided the opportunity to consider these issues. “Thank you!”


* This posting was originally “published” as a part of the Hesburgh Libraries of the University of Notre Dame website, and it is duplicated here because “Lots of copies keep stuff safe.”

A day at CIL 2009

This documents my day-long experiences at the Computers in Libraries annual conference, March 31, 2009. In a sentence, the meeting was well-attended and covered a wide range of technology issues.

Washington Monument

The day began with an interview-style keynote address featuring Paul Holdengraber (New York Public Library) interviewed by Erik Boekesteijn (Library Concept Center). As the Director of Public Programs at the Public Library, Holdengraber’s self-defined task is to “levitate the library and make the lions on the front steps roar.” Well-educated, articulate, creative, innovative, humorous, and cosmopolitan, he facilitates sets of programs in the library’s reading room called “Live from the New York Public Library” where he interviews people in an effort to make the library — a cultural heritage institution — less like a mausoleum for the Old Masters and more like a place where great ideas flow freely. A couple of notable quotes included “My mother always told me to be porous because you have two ears and only one mouth” and “I want to take the books from the closed stacks and make people desire them.” Holdengraber’s enthusiasm for his job is contagious. Very engaging as well as interesting.

During the first of the concurrent sessions I gave a presentation called “Open source software: Controlling your computing environment” where I first outlined a number of definitions and core principles of open source software. I then tried to draw a number of parallels between open source software and librarianship. Finally, I described how open source software can be applied in libraries. During the presentation I listed four skills a library needs in order to take advantage of open source software (namely, relational databases, XML, indexing, and some sort of programming language), but in retrospect I believe basic systems administration skills are what is really required, since the majority of open source software is simply installed, configured, and used. Few people feel the need to modify its functionality, and therefore the aforementioned skills are not critical, only desirable.

Lincoln Memorial

In “Designing the Digital Experience” by David King (Topeka & Shawnee County Public Library) attendees were presented with ways websites can be created to digitally supplement the physical presence of a library. He outlined structural approaches to Web design such as the ones promoted by Jesse James Garrett, David Armano, and 37signals. He then compared & contrasted these approaches with the “community path” approaches, which endeavor to create a memorable experience. Such things can be done, King says, through conversations, invitations, participation, creating a sense of familiarity, and the telling of stories. It is interesting to note that these techniques are not dependent on Web 2.0 widgets, but can certainly be implemented through their use. Throughout the presentation he brought all of his ideas home through the use of examples from the websites of Harley-Davidson, Starbucks, American Girl, and Webkinz. Not ironically, Holdengraber was doing the same thing for the Public Library, except in the real world, not through a website.

In a session after lunch called “Go Where The Client Is” Natalie Collins (NRC-CISTI) described how she and a few co-workers converted library catalog data containing institutional repository information, as well as SWETS bibliographic data, into NLM XML and made it available for indexing by Google Scholar. In the end, she discovered that this approach was much more useful to her constituents when compared to the cool (“kewl”) Web Services-based implementation they had created previously. Holly Hibner (Salem-South Lyon District Library) compared & contrasted the use of tablet PCs with iPods during roaming reference services. My two take-aways from this presentation were a pair of cool (“kewl”) services: one website making it easier to convert data from one format into another, and LinkBunch, which bundles a list of links together into a single URL.

Jefferson Memorial

The last session for me that day was one on open source software implementations of “next generation” library catalogs, specifically Evergreen. Karen Collier and Andrea Neiman (both of Kent County Public Library) outlined their implementation process of Evergreen in rural Michigan. Apparently it began with the re-upping of their contract for their computer hardware. Such a thing would cost more than they expected. This led to more investigations which ultimately resulted in the selection of Evergreen. “Open source seemed like a logical conclusion.” They appear to be very happy with their decision. Karen Schneider (Equinox Software) gave a five-minute “lightning talk” on the who and what of Equinox and Evergreen. Straight to the point. Very nice. Ruth Dukelow (Michigan Library Consortium) described how participating libraries have been brought on board with Evergreen, and she outlined the reasons why Evergreen fit the bill: it supported MLCat compliance, it offered an affordable hosted integrated library system, it provided access to high quality MARC records, and it offered a functional system to non-technical staff.

I enjoyed my time there in Washington, DC at the conference. Thanks go to Ellyssa Kroski, Steven Cohen, and Jane Dysart for inviting me and allowing me to share some of my ideas. The attendees at the conference were not as technical as you might find at Access or Code4Lib, and certainly not JCDL nor ECDL. This is not a bad thing. The people were genuinely interested in the things presented, but I did overhear one person say, “This is completely over my head.” The highlight for me took place during the last session, where one attendee was singing the praises of open source software for all the same reasons I had been expressing them over the past twelve years. “It is so much like the principles of librarianship,” she said. That made my day.

Quick Trip to Purdue

Last Friday, March 27, I was invited by Michael Witt (Interdisciplinary Research Librarian) at Purdue University to give a presentation to the library faculty on the topic of “next generation” library catalogs. During the presentation I made an effort to have the participants ask and answer questions such as “What is the catalog?”, “What is it expected to contain?”, “What functions is it expected to perform and for whom?”, and most importantly, “What problems is it expected to solve?”

I then described how most of the current “next generation” library catalog thingees are very similar. Acquire metadata records. Optionally store them in a database. Index them (with Lucene). Provide services against the index (search and browse). I then brought the idea home by describing in more detail how things like VuFind, Primo, Koha, Evergreen, etc. all use this model. I then made an attempt to describe how our “next generation” library catalogs could go so much further by providing services against the texts as well as services against the index. “Discovery is not the problem that needs to be solved.”
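The model above (acquire records, optionally store them, index them, provide services against the index) is simple enough to sketch in a few lines. In this illustrative sketch a plain Python dictionary stands in for Lucene; none of this is any particular system's API, just the shape of the idea.

```python
# A toy "next generation" catalog pipeline: acquire, store, index, search.
# The dict-based inverted index stands in for Lucene; all data is made up.

def tokenize(text):
    return text.lower().split()

class TinyCatalog:
    def __init__(self):
        self.store = {}    # record id -> metadata record
        self.index = {}    # token -> set of record ids

    def acquire(self, record):
        """Store a metadata record and index the words of its title."""
        self.store[record["id"]] = record
        for token in tokenize(record["title"]):
            self.index.setdefault(token, set()).add(record["id"])

    def search(self, query):
        """Return records whose titles contain every query word."""
        ids = None
        for token in tokenize(query):
            hits = self.index.get(token, set())
            ids = hits if ids is None else ids & hits
        return [self.store[i] for i in sorted(ids or [])]

catalog = TinyCatalog()
catalog.acquire({"id": "1", "title": "Walden"})
catalog.acquire({"id": "2", "title": "Civil Disobedience and Other Essays"})
print([r["id"] for r in catalog.search("civil disobedience")])
```

Every system named above differs wildly in the details, but each follows this acquire-store-index-serve pattern.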

Afterwards a number of us went to lunch where we compared & contrasted libraries. It is a shame the Purdue University, Indiana University, and University of Notre Dame libraries do not work more closely together. Our strengths complement each other in so many ways.

“Michael, thanks for the opportunity!”

Something I saw on the way back home.

Library Technology Conference, 2009: A Travelogue

This posting documents my experiences at the Library Technology Conference at Macalester College (St. Paul, Minnesota) on March 18-19, 2009. In a sentence, this well-organized regional conference provided professionals from nearby states an opportunity to listen, share, and discuss ideas concerning the use of computers in libraries.

Wallace Library
campus center
Dayton Center

Day #1, Wednesday

The Conference, sponsored by Macalester College — a small, well-respected liberal arts college in St. Paul — began with a keynote presentation by Stacey Greenwell (University of Kentucky) called “Applying the information commons concept in your library”. In her remarks the contagiously energetic Ms. Greenwell described how she and her colleagues implemented the “Hub”, an “active learning place” set in the library. After significant amounts of planning, focus group interviews, committee work, and on-going cooperation with the campus computing center, the Hub opened in March of 2007. The whole thing is designed to be a fun, collaborative learning commons equipped with computer technology and supported by librarian and computer consultant expertise. Some of the real winners in her implementation include the use of white boards, putting every piece of furniture on wheels, including “video walls” (displaying items from special collections, student art, basketball games, etc.), and hosting parties where as many as 800 students attend. Greenwell’s enthusiasm was inspiring.

Most of the Conference was made up of sets of concurrent sessions, and the first one I attended was given by Jason Roy and Shane Nackerud (both of the University of Minnesota) called “What’s cooking in the lab?” Roy began by describing both a top-down and bottom-up approach to the curation and maintenance of special collections content. Technically, their current implementation includes a usual cast of characters (DSpace, finding aids managed with DLXS, sets of images, and staff), but sometime in the near future he plans on implementing a more streamlined approach consisting of Fedora for the storage of content with sets of Web Services on top to provide access. It was also interesting to note their support for user-contributed content. Users supply images. Users tag content. Images and tags are used to supplement more curated content.

Nackerud demonstrated a number of tools he has been working on to provide enhanced library services. One was the Assignment Calculator — a tool outlining the steps needed to complete library-related, classroom-related tasks. He has helped implement a mobile library home page by exploiting Web Service interfaces to the underlying systems. While the Web Service APIs are proprietary, they are a step in the right direction for further exploitation. He has implemented sets of course pages — as opposed to subject guides — too. “I am in this class, what library resources should I be using?” (The creation of course guides seems to be a trend.) Finally, he is creating a recommender service whose core is the creation of “affinity strings” — a set of codes used to denote the characteristics of an individual as opposed to specific identifiers. Of all the things from the Conference, the idea of affinity strings struck me the hardest. Very nice work, and documented in a Code4Lib Journal article to boot.
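The affinity string idea can be sketched in a few lines: reduce a user to a code describing characteristics rather than identity, then aggregate usage per code to drive recommendations. The code format below (status-college-level) is my own invention for illustration; the real service defines its own encoding.

```python
# Sketch of "affinity strings": users become a code of characteristics
# (not an identifier), and checkouts are aggregated per code to make
# privacy-friendly recommendations. The string format here is invented.
from collections import Counter, defaultdict

def affinity_string(status, college, level):
    """e.g. an undergraduate in engineering taking 200-level courses."""
    return f"{status}-{college}-{level}"

checkouts = defaultdict(Counter)  # affinity string -> item -> use count

def record_use(user, item):
    checkouts[affinity_string(*user)][item] += 1

def recommend(user, n=2):
    """Most-used items among people sharing this user's affinity string."""
    return [item for item, _ in checkouts[affinity_string(*user)].most_common(n)]

alice = ("U", "ENG", "200")   # no names or IDs, only characteristics
bob = ("U", "ENG", "200")
record_use(alice, "Intro to Statics")
record_use(bob, "Intro to Statics")
record_use(bob, "Calculus II")
print(recommend(("U", "ENG", "200")))
```

The appeal is that two users with the same affinity string are indistinguishable, so recommendations never require storing who borrowed what.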

In the afternoon I gave a presentation called “Technology Trends and Libraries: So many opportunities”. In it I described why mobile computing, content “born digital”, the Semantic Web, search as more important than browse, and the wisdom of crowds represent significant future directions for librarianship. I also described the importance of not losing sight of the forest for the trees. Collection, organization, preservation, and dissemination of library content and services are still the core of the profession, and we simply need to figure out new ways to do the work we have traditionally done. “Libraries are not warehouses of data and information as much as they are gateways to learning and knowledge. We must learn to build on the past and evolve, instead of clinging to it like a comfortable sweater.”

Later in the afternoon Marian Rengal and Eric Celeste (both of the Minnesota Digital Library) described the status of the Minnesota Digital Library in a presentation called “Where we are”. Using ContentDM as the software foundation of their implementation, the library includes many images supported by “mostly volunteers just trying to do the right thing for Minnesota.” What was really interesting about their implementation is the way they have employed a building block approach. PMWiki to collaborate. The Flickr API to share. Pachyderm to create learning objects. One of the most notable quotes from the presentation was “Institutions need to let go of their content to a greater degree; let them have a life of their own.” I think this is something that needs to be heard by many of us in cultural heritage institutions. If we make our content freely available, then we will be facilitating the use of the content in unimagined ways. Such is a good thing.

St. Paul Cathedral
Balboa facade

Day #2, Thursday

The next day was filled with concurrent sessions. I first attended one by Alec Sonsteby (Concordia College) entitled “VuFind: the MnPALS Experience” where I learned how MnPALS — a library consortium — brought up VuFind as their “discovery” interface. They launched VuFind in August of 2008, and they seem pretty much satisfied with the results.

During the second round of sessions I led a discussion/workshop regarding “next generation” library catalogs. In it we asked and tried to answer questions such as “What is the catalog?”, “What does it contain?”, “What functions is it expected to fulfill and for whom?”, and most importantly, “What is the problem it is expected to solve?” I then described how many of the current crop of implementations function very similarly. Dump metadata records. Often store them in a database. Index them (with Lucene). Provide services against the index (search and browse). I then tried to outline how “next generation” library catalogs could do more, namely provide services against the texts as well as the index.

The last session I attended was about ERMs — Electronic Resource Management systems. Don Zhou (William Mitchell College of Law) described how he implemented Innovative Interface’s ERM. “The hard part was getting the data in.” Dani Roach and Carolyn DeLuca (both of University of St. Thomas) described how they implemented a Serials Solutions… solution. “You need to be adaptive; we decided to do things one way and then went another… It is complex, not difficult, just complex. There have to be many tools to do ERM.” Finally, Galadriel Chilton (University of Wisconsin – La Crosse) described an open source implementation written in Microsoft Access, but “it does not do electronic journals.”

In the afternoon Eric C. was gracious enough to tour me around the Twin Cities. We saw the Cathedral of Saint Paul, the Mississippi River, and a facade by Balboa. But the most impressive thing I saw was the University of Minnesota’s “cave” — an onsite storage facility for the University’s libraries. All the books they want to withdraw go here, where they are sorted by size, placed into cardboard boxes assigned bar codes, and put into rooms 100 yards long and three stories high. The facility is staffed by two people, and in ten years they have only lost two books out of 1.3 million. The place is so huge you can literally drive a tractor-trailer truck into it. Very impressive, and I got a personal tour. “Thanks Eric!”

eric and eric
Eric and Eric
St. Anthony Falls


I sincerely enjoyed the opportunity to attend this conference. Whenever I give talks I feel the need to write up a one-page handout. That process forces me to articulate my ideas in writing. When I give the presentation it is not all about me, but rather about learning about the environments of my peers. It is an education all around. This particular regional conference was the right size, about 250 attendees. Many of the attendees knew each other. They caught up and learned things along the way. “Good job Ron Joslin!” The only thing I missed was a photograph of Mary Tyler Moore. Maybe next time.

Code4Lib Conference, Providence (Rhode Island) 2009

This posting documents my experience at the Code4Lib Conference in Providence, Rhode Island between February 23-26, 2009. To summarize my experiences: I went away with a better understanding of linked data; it is an honor to be a part of this growing and maturing community; and this conference is yet another example of how many opportunities exist for libraries if only we think more about the whats of librarianship as opposed to the hows.

Day #0 (Monday, February 23) – Pre-conferences

On the first day I facilitated a half-day pre-conference workshop, one of many, called XML In Libraries. Designed as a full-day event, this workshop was not one of my better efforts. (“I sincerely apologize.”) Everybody brought their own computer, but some of them could not get on the ‘Net. The first half of the workshop should be trimmed down significantly since many of the attendees knew what was being explained. Finally, the hands-on part of the workshop with jEdit was less than successful because it refused to work for me and many of the participants. Lessons learned, and things to keep in mind for next time.

For the better part of the afternoon, I sat in on the WorldCat Grid Services pre-conference where we were given an overview of SRU from Ralph Levan. There was then a discussion on how the Grid Services could be put into use.
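SRU is worth a concrete illustration, since a searchRetrieve request is nothing more than an ordinary URL built from a handful of standard parameters. The sketch below constructs one per the SRU 1.1 specification; the base URL is a placeholder, not any real OCLC endpoint.

```python
# Building an SRU 1.1 searchRetrieve request as a plain GET URL.
# Parameter names come from the SRU spec; the base URL is a placeholder.
from urllib.parse import urlencode

def sru_url(base, query, maximum_records=10, record_schema="marcxml"):
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": query,                    # a CQL query
        "maximumRecords": maximum_records,
        "recordSchema": record_schema,
    }
    return base + "?" + urlencode(params)

url = sru_url("", 'dc.title = "origin of species"')
print(url)
```

Because the whole protocol rides on GET parameters like these, any HTTP client (or a browser address bar) doubles as an SRU client.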

During the last part of the pre-conference afternoon I attended the linked data session. Loosely structured and by far the best attended event, it gave me an overview of what linked data services are and some of the best practices for implementing them. I had a very nice chat with Ross Singer who helped me bring some of these concepts home to my Alex Catalogue. Conveniently, the Catalogue is well on its way to being exposed via a linked data model since I have previously written sets of RDF/XML files against its underlying content. The key seems to be to link together as many HTTP-based URIs as possible while providing content-negotiation services in order to disseminate your information in the most readable/usable formats possible.
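Content negotiation deserves a small sketch: the same URI answers with RDF/XML for machines and HTML for browsers, chosen from the request's Accept header. This is a simplified illustration (real negotiation also weighs q-values; here the first recognized media type wins), and the representations are stubs.

```python
# Simplified content negotiation for a linked data resource: one URI,
# two representations, chosen from the Accept header. First match wins;
# a production server would also honor q-values.

REPRESENTATIONS = {
    "application/rdf+xml": "<rdf:RDF>...</rdf:RDF>",  # for machines
    "text/html": "<html>...</html>",                  # for people
}

def negotiate(accept_header, default="text/html"):
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()  # drop any ";q=..." suffix
        if media_type in REPRESENTATIONS:
            return media_type
    return default

print(negotiate("application/rdf+xml,text/html;q=0.9"))
print(negotiate("text/plain"))
```

A linked data crawler sending `Accept: application/rdf+xml` gets triples; a person with a browser gets a readable page, all at one URI.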

Day #1 (Tuesday, February 24)

Code4Lib is a single-track conference, and its 300 or so attendees gathered in a refurbished Masonic Lodge — in the shadows of the Rhode Island State House — for the first day of the conference.

Roy Tennant played Master of Ceremonies for Day #1 and opened the event with an outline of what he sees as the values of the Code4Lib community: egalitarianism, participation, democracy, anarchy, informality, and playfulness. From my point of view, that sums things up pretty well. In an introduction for first-timers, Mark Matienzo (aka anarchist) described the community as “a bit clique-ish”, a place where there are a lot of inside jokes (think bacon, neck beards, and ++), and a venue where “social capital” is highly valued. Many of these things can most definitely be seen “in channel” by participating in the IRC #code4lib chat room.

In his keynote address, A Bookless Future For Libraries, Stefano Mazzocchi encouraged the audience to think of the “iPod for books” as an ecosystem necessity, not a possibility. He did this by first chronicling the evolution of information technology (speech to cave drawing to clay tablets to fiber to printing to electronic publishing). He outlined the characteristics of electronic publishing: dense, widely available, network accessible, distributed business models, no batteries, lots of equipment, next to zero marginal costs, and poor resolution. He advocated the Semantic Web (a common theme throughout the conference), and used Freebase as a real-world example. One of the most intriguing pieces of information I took away from this presentation was the idea of making games out of data entry in order to get people to contribute content. For example, make it fun to guess whether a person is alive, dead, male, or female. Based on the aggregate responses of the “crowd” it is possible to make pretty reasonable guesses as to the truth of facts.
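The crowd idea reduces to a very small computation: collect many individually unreliable guesses about a fact and take the majority. A minimal sketch, with made-up votes:

```python
# The "wisdom of the crowd" in miniature: aggregate many unreliable
# guesses about a fact by majority vote. Votes below are invented.
from collections import Counter

def crowd_guess(votes):
    """Return the most common answer and its share of the vote."""
    counts = Counter(votes)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(votes)

votes = ["dead", "dead", "alive", "dead", "alive", "dead"]
answer, share = crowd_guess(votes)
print(answer, share)
```

If each player is right more often than not, the aggregated answer becomes trustworthy long before any single player is.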

Next, Andres Soderback described his implementation of the Semantic Web world in Why Libraries Should Embrace Linked Data. More specifically, he said library catalogs should be open, linkable, and hackable; provide links; be a part of the network; and not be an end in themselves. He went on to say that “APIs suck” because they are specific, take too much control, are not hackable enough, and are not really “Web-able”. Not incidentally, he had previously exposed his entire library catalog — the National Library of Sweden — as a set of linked data, but it broke after the short-lived site by Ed Summers had been taken down.

Ross Singer described an implementation and extension to the Atom Publishing Protocol in his Like A Can Opener For Your Data Silo: Simple Access Through AtomPub and Jangle. I believe the core of his presentation can be best described through an illustration where an Atom client speaks to Jangle through Atom/RSS, Jangle communicates with (ILS-) specific applications through “connectors”, and the results are returned back to the client:

                   +--------+       +-----------+ 
  +--------+       |        | <---> | connector |
  | client | <---> | Jangle |       +-----------+ 
  +--------+       |        | <---> | connector |  
                   +--------+       +-----------+

I was particularly impressed with Glen Newton‘s LuSql: (Quickly And Easily) Getting Your Data From Your DBMS Into Lucene because it described a Java-based command-line interface for querying SQL databases and feeding the results to the community’s current favorite indexer — Lucene. Very nice.

Terence Ingram‘s presentation RESTafarian-ism At The NLA can be summarized in the phrase “use REST in moderation” because too many REST-ful services linked together are difficult to debug and troubleshoot, and fall prey to over-engineering.

Based on the number of comments in previous blog postings, Birkin James Diana‘s presentation The Dashboard Initiative was a hit. It described sets of simple configurable “widgets” used to report trends against particular library systems and services.

In Open Up Your Repository With A SWORD Ed Summers and Mike Giarlo described a protocol, developed with funding from the good folks at JISC, used to deposit materials into an (institutional) repository via the AtomPub protocol.

In an effort to view editorial changes over time against sets of EAD files, Mark Matienzo tried to apply version control software techniques against his finding aids. He described these efforts in How Anarchivist Got His Groove Back 2: DVCS, Archival Description, And Workflow, but it seems as if he wasn’t as successful as he had hoped because of the hierarchical nature of his source (XML) data.

Godmar Back in LibX 2.0 described how he was enhancing the LibX API to allow for greater functionality by enhancing its ability to interact with an increased number of external services such as the ones from Personally, I wonder how well content providers will accept the idea of having content inserted into “their” pages by the LibX extension.

The last formal presentation of the day, djatoka For djummies, was given by Kevin Clark and John Fereira. In it they described the features, functions, advantages, and disadvantages of a specific JPEG2000 image server. Interesting technology that could be exploited more if there were a 100% open source solution.

Day #1 then gave way to about a dozen five-minute “lightning talks”. In this session I shared the state of the Alex Catalogue in Alex4: Yet Another Implementation, and in retrospect I realize I didn’t say a single word about technology but only things about functionality. Hmmm…

Day #2 (Wednesday, February 25)

On the second day of the conference I had the honor of introducing the keynote speaker, Sebastian Hammer. Having known him for at least a few years, I described him as the co-author of the venerable open source Yaz and Zebra software — the same Z39.50 software that drives quite a number of such implementations across Library Land. I also alluded to the time I visited him and his co-workers at Index Data in Copenhagen where we talked shop and shared a very nice lunch in their dot-com-like flat. I thought there were a number of meaty quotes from his presentation. “If you have something to say, then say it in code… I like to write code but have fun along the way… We are focusing our efforts on creating tools instead of applications… We try to create tools to enable libraries to do the work that they do. We think this is fun… APIs are glorified loyalty schemes… We need to surrender our data freely… Standardization is hard and boring but essential… Hackers must become advocates within our organizations.” Throughout his talk he advocated local libraries that: preserve cultural heritage, converge authoritative information, support learning & research, and are pillars of democracy.

Timothy McGeary gave an update on the OLE Project in A New Frontier – The Open Library Environment (OLE). He stressed that the Project is not about the integrated library system but bigger: special collections, video collections, institutional repositories, etc. Moreover, he emphasized that all these things are expected to be built around a Service Oriented Architecture, and there is a push to use existing tools for traditional library functions such as the purchasing department for acquisitions or identity management systems for patron files. Throughout his presentation he stressed that this project is all about putting into action a “community source process”.

In Blacklight As A Unified Discovery Platform Bess Sadler described Blacklight as “yet another ‘next-generation’ library catalog”. This seemingly off-hand comment should not be taken as such because the system implements many of the up-and-coming ideas our fledgling “discovery” tools espouse.

Joshua Ferraro walked us through the steps for creating open bibliographic (MARC) data using a free, browser-based cataloging service in a presentation called A New Platform for Open Data – Introducing ‡ Web Services. Are these sorts of services, freely provided by the likes of LibLime and the Open Library, the sorts of services that make OCLC reluctant to freely distribute “their” sets of MARC records?

Building on LibLime’s work, Chris Catalfo described and demonstrated a plug-in for creating Dublin Core metadata records using ‡ Web Services in Extending ‡biblios, The Open Source Web Based Metadata Editor.

Jodi Schneider and William Denton gave the best presentation I’ve ever heard on FRBR in their What We Talk About When We Talk About FRBR. More specifically, they described “strong” FRBR-ization complete with Works, Expressions, Manifestations, and Items owned by Persons, Families, and Corporate Bodies and having subjects grouped into Concepts, Objects, and Events. Very thorough and easy to understand. schneider++ & denton++ # for a job well-done
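The FRBR group 1 entities lend themselves to a plain data-structure sketch: a Work is realized through Expressions, embodied in Manifestations, and exemplified by Items. The field names below are illustrative, not any particular schema's.

```python
# The FRBR group 1 hierarchy as simple dataclasses. Work -> Expression
# -> Manifestation -> Item; field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Item:                      # a single physical/digital copy
    barcode: str

@dataclass
class Manifestation:             # a particular publication
    publisher: str
    year: int
    items: list = field(default_factory=list)

@dataclass
class Expression:                # a realization, e.g. a translation
    language: str
    manifestations: list = field(default_factory=list)

@dataclass
class Work:                      # the abstract intellectual creation
    title: str
    expressions: list = field(default_factory=list)

moby = Work("Moby Dick", [
    Expression("English", [
        Manifestation("Harper & Brothers", 1851, [Item("35556000123456")]),
    ]),
])
print(moby.expressions[0].manifestations[0].items[0].barcode)
```

Walking the chain from Work down to Item makes the "strong" FRBR-ization the presenters described concrete: every barcode ultimately hangs off one abstract Work.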

In Complete Faceting Toke Eskildsen described Summa, his institution’s implementation from the State and University Library of Denmark.

Erik Hatcher outlined a number of ways Solr can be optimized for better performance in The Rising Sun: Making The Most Of Solr Power. Solr certainly seems to be on its way to becoming the norm for indexing in the Code4Lib community.

A citation parsing application was described by Chris Shoemaker in FreeCite – An Open Source Free-Text Citation Parser. His technique did not seem to be based so much on punctuation (syntax) as on word groupings. I think we have something to learn from his technique.

Richard Wallis advocated the use of a Javascript library to update and insert added functionality to OPAC screens in his Great Facets, Like Your Relevance, But Can I Have Links To Amazon And Google Book Search? His tool — Juice — shares OPAC-specific information.

The Semantic Web came full-circle through Sean Hannan‘s Freebasing For Fun And Enhancement. One of the take-aways I got from this conference is to learn more ways Freebase can be used (exploited) in my everyday work.

During the Lightning Talks I very briefly outlined an idea that has been brewing in my head for a few years, specifically, the idea of an Annual Code4Lib Open Source Software Award. I don’t exactly know how such a thing would get established or be made sustainable, but I do think our community is ripe for such recognition. Good work is done by our people, and I believe it needs to be tangibly acknowledged. I am willing to commit to making this a reality by this time next year at Code4Lib Conference 2010.


I did not have the luxury of staying for the last day of the Conference. I’m sure I missed some significant presentations. Yet, the things I did see were impressive. They demonstrated ingenuity, creativity, and at the same time, practicality — the desire to solve real-world, present-day problems. These things require the use of both sides of a person’s brain. Systematic thinking and intuition; an attention to detail but the ability to see the big picture at the same time. In other words, arscience.


group photo

Ball State, the movie!

Over the past few months the names of some fellow librarians at Ball State University repeatedly crossed my path. The first was Jonathan Brinley, who is/was a co-editor of the Code4Lib Journal. The second was Kelley McGrath, who was mentioned to me as a top-notch cataloger. The third was Todd Vandenbark, who was investigating the use of MyLibrary. Finally, a former Notre Damer, Marcy Simons, recently started working at Ball State. Because Ball State is relatively close, I decided to take the opportunity to visit these good folks during this rather slow part of the academic year.

Compare & contrast

After I arrived we made our way to lunch. We compared and contrasted our libraries. For example, they had many — say, about 200 — public workstations. The library was hustling and bustling. About 18,000 students go to Ball State, and seemingly many of them go home on the weekends. Ball State was built with money from the canning jar industry, but upon a visit to the archives no canning jars could be seen. I didn’t really expect any.

Shop talk

Over lunch we talked a lot about FRBR and the possibilities of creating work-level records from the myriad of existing item-level (MARC) records. Since the work-related content is oftentimes encoded as free text in some sort of 500 field, I wonder how feasible the process would be. Ironically, an article, “Identifying FRBR Work-Level Data in MARC Bibliographic Records for Manifestations of Moving Images” by Kelley, had been published the day before in the Code4Lib Journal. Boy, it certainly is a small world.
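One naive way to approach the problem is to normalize author and title into a key and bucket item-level records by it. A minimal sketch, with made-up records; real FRBR-ization is far messier, precisely because the work-level data often hides in free-text fields:

```python
# Naively clustering item-level records into work-level groups by a
# normalized (author, title) key. Records are invented; real MARC data
# needs far more normalization than this.
from collections import defaultdict

def work_key(record):
    author = record["author"].lower().strip().rstrip(".")
    title = record["title"].lower().strip().rstrip(" /:")
    return (author, title)

def group_works(records):
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record["id"])
    return dict(works)

records = [
    {"id": "1", "author": "Melville, Herman", "title": "Moby Dick /"},
    {"id": "2", "author": "melville, herman.", "title": "Moby Dick"},
    {"id": "3", "author": "Thoreau, Henry David", "title": "Walden"},
]
print(group_works(records))
```

Even this toy version shows why the problem is hard: the key only works when the variant headings differ in trivial ways, which free-text 500 fields rarely do.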

I always enjoy “busman’s holidays” and visiting other libraries. I find we oftentimes have more things in common than differences.

A Day with OLE

This posting documents my experience at the Open Library Environment (OLE) project workshop that took place at the University of Chicago, December 11, 2008. In a sentence, the workshop provided an opportunity to describe and flowchart a number of back-end library processes in an effort to help design an integrated library system.

What is OLE?


full-scale gargoyle

As you may or may not know, the Open Library Environment is a Mellon-funded initiative in cooperation with a growing number of academic libraries to explore the possibilities of building an integrated library system. Since this initiative is more about library back-end and business processes (acquisitions, cataloging, circulation, reserves, ILL, etc.), it is complementary to the eXtensible Catalog (XC) project, which is more about creating a “discovery” layer against and on top of existing integrated library systems’ public access interfaces.

Why OLE?

Why do this sort of work? There are a few reasons. First, vendor consolidation makes the choices of commercial solutions few. Not a good idea; we don’t like monopolies. Second, existing applications do not play well with other (campus) applications. Better integration is needed. Third, existing library systems were designed for print materials, and with the advent of greater and greater amounts of electronic materials the pace of change has been too slow.

OLE is an effort to help drive and increase change in Library Land, and this becomes even more apparent when you consider all of the Mellon-related library initiatives it is supporting: Portico (preservation), JSTOR and ArtSTOR (collections), XC (discovery), OLE (business processes/technical services).

The day’s events

The workshop took place at the Regenstein Library (University of Chicago). There were approximately thirty or forty attendees from universities such as Grinnell, Indiana, Notre Dame, Minnesota, Illinois, Iowa, and of course, Chicago.

After being given a short introduction/review of what OLE is and why, we were broken into four groups (cataloging/authorities, circulation/reserves/ILL, acquisitions, and serials/ERM), and we were first asked to enumerate the processes of our respective library activities. We were then asked to classify these activities into four categories: core process, shifting/changing process, processes that could be stopped, and processes that we wanted but don’t have. All of us, being librarians, were not terribly surprised by the enumerations and classifications. The important thing was to articulate them, record them, and compare them with similar outputs from other workshops.

After lunch (where I saw the gargoyle and made a few purchases at the Seminary Co-op Bookstore) we returned to our groups to draw flowcharts of any of our respective processes. The selected processes included checking in a journal issue, checking in an electronic resource, keeping up and maintaining a file of borrowers, acquiring a firm order book, cataloging a rare book, and cataloging a digital version of a rare book. This whole flowcharting process was amusing since the workflows of each participant’s library needed to be amalgamated into a single process. “We do it this way, and you do it that way.” Obviously there is more than one way to skin a cat. In the end the flowcharts were discussed, photographed, and packaged up to ship back to the OLE home planet.

What do you really want?

The final, wrap-up event of the day was a sharing and articulation of what we really wanted in an integrated library system. “If there were one thing you could change, then what would it be?” Based on my notes, the most popular requests were:

  1. make the system interoperable with sets of APIs (4 votes)
  2. allow the system to accommodate multiple metadata formats (3 votes)
  3. include a robust reporting mechanism; give me the ARL Generate Statistics Button (2 votes)
  4. implement a staff interface allowing work to be done without editing records (2 votes)
  5. implement consortial borrowing across targets (2 votes)
  6. separate the discovery processes from the business processes (2 votes)

Other wish list items I thought were particularly interesting included: integrating the collections process into the system, making sure the application was operating system independent, and implementing Semantic Web features.


I’m glad I had the opportunity to attend. It gave me a chance to get a better understanding of what OLE is all about, and I saw it as a professional development session where I learned more about where things are going. The day’s events were well-structured, well-organized, and manageable given the time constraints. I only regret there was too little “blue skying” by attendees. Much of the time was spent outlining how our work is done now. I hope any future implementation explores new ways of doing things in order to take better advantage of the changing environment as opposed to simply automating existing processes.

WorldCat Hackathon

I attended the first-ever WorldCat Hackathon on Friday and Saturday (November 7 & 8), where we attendees explored ways to take advantage of various public application programmer interfaces (APIs) supported by OCLC.

Web Services

The WorldCat Hackathon was an opportunity for people to get together, learn about a number of OCLC-supported APIs, and take time to explore how they can be used. These APIs are a direct outgrowth of something that started at least six years ago with an investigation of how OCLC’s data can be exposed through Web Service computing techniques. To date OCLC’s services fall into the following categories, and they are described in greater detail as a part of the OCLC Grid Services Web page:

  • WorldCat Search API – Search and display content from WorldCat — a collection of mostly books owned by libraries
  • Registry Services – Search and display names, addresses, and information about libraries
  • Identifier Services – Given unique keys, find similar items found in WorldCat
  • WorldCat Identities – Search and display information about authors from a name authority list
  • Terminology Services – Search and display subject authority information
  • Metadata Crosswalk Service – Convert one metadata format (MARC, MARCXML, XML/DC, MODS, etc.) into another. (For details of how this works, see “Toward element-level interoperability in bibliographic metadata” in Issue #2 of the Code4Lib Journal).
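
As a concrete illustration, here is a small Python sketch of how the first of these services might be called. The OpenSearch base URL and the "wskey" parameter reflect my reading of the WorldCat Search API documentation and may well have changed since, so treat them as assumptions; the actual HTTP request is left out so the parsing can be shown against a canned Atom response.

```python
# Sketch of querying the WorldCat Search API's OpenSearch interface.
# BASE and the "wskey" (developer key) parameter are assumptions based
# on the published documentation; verify before use.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE = "http://www.worldcat.org/webservices/catalog/search/opensearch"
ATOM = "{http://www.w3.org/2005/Atom}"

def build_query(terms, wskey, maximum=10):
    # assemble the OpenSearch request URL for a keyword search
    return BASE + "?" + urlencode({"q": terms, "wskey": wskey, "count": maximum})

def titles(atom_xml):
    # pull the entry titles out of the Atom feed the service returns
    feed = ET.fromstring(atom_xml)
    return [entry.findtext(ATOM + "title") for entry in feed.findall(ATOM + "entry")]

# a canned Atom response, just to exercise the parsing
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Moby Dick</title></entry>
  <entry><title>Walden</title></entry>
</feed>"""

print(build_query("origin of species", "YOUR_KEY"))
print(titles(sample))
```

The point is less the particulars than the pattern: build a URL, get back XML, and do whatever you like with the results.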

The Hacks

The event was attended by approximately fifty (50) people. The prize for coming the furthest went to someone from France. A number of OCLC employees attended. Most people were from academic libraries, and most were from the surrounding states. About three-quarters of the attendees were “hackers”, and the balance were there to learn.

Taking place in the Science, Industry and Business Library (New York Public Library), the event began with an overview of each of the Web Services and the briefest outline of how they might be used. We then quickly broke into smaller groups to “hack” away. The groups fell into a number of categories: Drupal, VUFind, Find More Like This One/Miscellaneous, and language-specific hacks. We reconvened after lunch on the second day to share what we had done as well as what we had learned. Some of the hacks included:

  • Term Finder – Enter a term. Query the Terminology Services. Get back a list of broader and narrower terms. Select items from results. Repeat. Using such a service a person can navigate a controlled vocabulary space to select the most appropriate subject heading.
  • Name Finder – Enter a first name and a last name. Get back a list of WorldCat Identities matching the queries. Display the subject terms associated with the works of this author. Select subject terms, and the results are displayed in Term Finder.
  • Send It To Me – Enter an ISBN number. Determine whether or not the item is held locally. If so, then allow the user to borrow the item. If not, then allow the user to find other items like it, purchase it, and/or facilitate an interlibrary loan request. All three of these services were written by me. The first two were written during the Hackathon; the last was written more than a year ago. All three could be used on their own or incorporated into a search results page.
  • Find More Like This One in VUFind – Written by Scott Mattheson (Yale University Library) this prototype was in the form of a number of screen shots. It allows the user to first do a search in VUFind. If desired items are checked out, then it will search for other local copies.
  • Google Map Libraries – Greg McClellan (Brandeis University) combined the WorldCat Search API, Registry Services, and Google Maps to display the locations of nearby libraries that reportedly own a particular item.
  • Recommend Tags – Chad Fennell (University of Minnesota Libraries) overrode a Drupal tagging function to work with MeSH controlled vocabulary terms. In other words, as items in Drupal are being tagged, this hack leads the person doing data entry to use MeSH headings.
  • Enhancing Metadata – Piotr Adamzyk (Metropolitan Museum of Art) had access to both bibliographic and image materials. Through the use of Yahoo Pipes technology he was able to read metadata from an OAI repository, map it to metadata found in WorldCat, and ultimately supplement the metadata describing the content of his collections.
  • Pseudo-Metasearch in VUFind – Andrew Nagy (Villanova University) demonstrated how a search could be first done in VUFind, and have subsequent searches done against WorldCat by simply clicking on a tabbed interface.
  • Find More Like This One – Mark Matienzo (NYPL Labs) created an interface taking an OCLC number as input. Given this, it returned subject headings in an effort to find other, similar items. It was at this point Ralph LeVan (OCLC) said, “Why does everybody use subject headings to find similar items? Why not map your query to Dewey numbers and find items expected to be placed right next to the given item on the shelf?” Good food for thought.
  • xISBN Bookmarklet – Liu Xiaoming (OCLC) demonstrated a Web browser tool. Enter your institution’s name. Get back a browser bookmarklet. Drag the bookmarklet to your toolbar. Search sites like Amazon. Select an ISBN number from the Web page. Click the bookmarklet. Determine whether or not your local library owns the item.
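
For what it’s worth, the kind of lookup behind the xISBN bookmarklet and my Send It To Me hack can be sketched in a few lines of Python. The endpoint and the JSON shape below are from my memory of the documentation and ought to be verified against OCLC’s own pages; the parsing is demonstrated against a canned response.

```python
# Sketch of an xISBN "getEditions" lookup: given one ISBN, find the
# ISBNs of related editions. The URL pattern and response format are
# assumptions based on OCLC's xISBN documentation.
import json

XISBN = "http://xisbn.worldcat.org/webservices/xid/isbn/"

def editions_url(isbn):
    # build a getEditions request for a given ISBN
    return XISBN + isbn + "?method=getEditions&format=json"

def related_isbns(response_text):
    # pull the flat list of related ISBNs out of an xISBN JSON response
    data = json.loads(response_text)
    if data.get("stat") != "ok":
        return []
    return [i for item in data.get("list", []) for i in item.get("isbn", [])]

# a canned response, just to exercise the parsing
sample = '{"stat":"ok","list":[{"isbn":["0451526538"]},{"isbn":["2070360024"]}]}'
print(related_isbns(sample))
```

Checking the returned ISBNs against local holdings is then a matter of one more query against your own catalog.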


Obviously the hacks created in this short period of time by a small number of people illustrate just a tiny bit of what could be done with the APIs. More importantly, IMHO, what these APIs really demonstrate is the way librarians can have more control over their computing environment if they were to learn to exploit these tools to their greatest extent. Web Service computing techniques are particularly powerful because they are not wedded to any specific user interface. They simply provide the means to query remote services and get back sets of data. It is then up to librarians and developers — working together — to figure out what to do with the data. As I’ve said somewhere previously, “Just give me the data.”

I believe the Hackathon was a success, and I encourage OCLC to sponsor more of them.


I attended a VUFind meeting at PALINET in Philadelphia today, November 6, and this posting summarizes my experiences there.

As you may or may not know, VUFind is a “discovery layer” intended to be applied against a traditional library catalog. Originally written by Andrew Nagy of Villanova University, it has been adopted by a handful of libraries across the globe and is being investigated by quite a few more. Technically speaking, VUFind is an open source project based on Solr/Lucene. Extract MARC records from a library catalog. Feed them to Solr/Lucene. Provide access to the index as well as services against the search results.
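
That extract-map-feed cycle can be sketched as follows. This is a stdlib-only illustration that assumes the MARC records have already been parsed (a library like pymarc would do that part in real life), and the field names only loosely mirror VUFind’s Solr schema; they are illustrative, not authoritative.

```python
# Sketch of mapping parsed bibliographic records to the JSON documents
# Solr's update handler accepts. Field names ("id", "title", "author",
# "format") are a guess at a VUFind-like schema, not the real one.
import json

def to_solr_doc(record):
    # map a parsed record (here just a dict of MARC tag -> value)
    # onto Solr field names; real records are far richer than this
    return {
        "id": record["001"],             # control number as the unique key
        "title": record.get("245", ""),  # title statement
        "author": record.get("100", ""), # main entry, personal name
        "format": "Book",
    }

def solr_update_payload(records):
    # Solr's JSON update handler accepts a list of documents,
    # e.g. POSTed to /solr/biblio/update
    return json.dumps([to_solr_doc(r) for r in records])

parsed = [{"001": "ocm123", "245": "Walden", "100": "Thoreau, Henry David"}]
print(solr_update_payload(parsed))
```

Once the documents are in the index, everything else — faceting, relevance ranking, “more like this” — comes along with Solr for nearly free.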

The meeting was attended by about thirty people. The three people from Tasmania won the prize for coming the furthest, but there were also people from Stanford, Texas A&M, and a number of more regional libraries. The meeting had a barcamp-like agenda. Introduce ourselves. Brainstorm topics for discussion. Discuss. Summarize. Go to bar afterwards. Alas, I didn’t get to go to the bar, but I was there for the balance. The following bullet points summarize each discussion topic:

  • Jangle – A desire was expressed to implement some sort of API (application programming interface) to VUFind in order to ensure a greater degree of interoperability. The DLF-DI was mentioned quite a number of times, but Jangle was the focus of the discussion. Unfortunately, not a whole lot of people around the room knew about Jangle, the Atom Publishing Protocol, or REST-ful computing techniques in general. Since an API was desired, the XC (eXtensible Catalog) project also came up; there was some knowledge of it around the room, as well as curiosity/frustration as to why more collaboration could not be done with XC. Apparently the XC process and their software are not as open and transparent as I had thought. (Note to self: ping the folks at XC and bring this issue to their attention.) In the end, implementing something like Jangle was endorsed.
  • Non-MARC content – It was acknowledged that non-MARC content ought to be included in any sort of “discovery layer”. A number of people had experimented with including content from their local institutional repositories, digital libraries, and/or collection of theses & dissertations. The process is straight-forward. Get set of metadata. Map it to VUFind/Solr fields. Feed it to the indexer. Done. Other types of data people expressed an interest in incorporating included: EAD, TEI, images, various types of data sets, and mathematical models. From here the discussion quickly evolved into the next topic…
  • Solrmarc – Through the use of a Java library called MARC4J, a Solr plug-in has been created by the folks at the University of Virginia. This plug-in — Solrmarc — makes it easier to read MARC data and feed it to Solr. There was a lot of discussion about whether this plug-in should be extended to include other data types, such as the ones outlined above, or distributed as-is, more akin to a GNU-style “do one thing and do it well” tool. From my perspective, no specific direction was articulated.
  • Authority control – We all knew the advantage of incorporating authority lists (names, authors, titles) into VUFind. The general idea was to acquire authority lists, incorporate the data into the underlying index, and implement “find more like this one” types of services against search results based on the related records linked through authorities. There was then much discussion on how to initially acquire the necessary authority data. We were a bit stymied. After lunch a slightly different tack was taken. Acquire some authority data, say about 1,000 records. Incorporate it into an implementation of VUFind. Demonstrate the functionality to wider audiences. Tackle the problem of getting more complete and updated authority data later.
  • De-duplication/FRBR – This was probably the shortest discussion point, and it really surrounded FRBR. We ended up asking ourselves, “To what degree do we want to incorporate Web Services such as xISBN into VUFind to implement FRBR-like functionality, or to what degree should ‘real’ FRBRization take place?” Compared to other things, de-duplication/FRBR seemed to be taking a lower priority.
  • Serials holdings – This discussion was around indexing and/or displaying serials holdings information. There was much talk about the ways various integrated library systems allow libraries to export holdings information, whether or not it was merged with bibliographic information, and how consistent it was from system to system. In general it was agreed that this holdings information ought to be indexed to enable searches such as “Time Magazine 2004”, but displaying the results was seen as problematic. “Why not use your link resolver to address this problem?” was asked. This whole issue, too, was given a lower priority since serial holdings are increasingly electronic in nature.
  • Federated search – It was agreed that federated search s?cks, but it is a necessary evil. Techniques for incorporating it into VUFind ranged from: 1) side-stepping the problem by licensing bibliographic data from vendors, 2) side-stepping the problem by acquiring binary Lucene indexes of bibliographic data from vendors, 3) creating some sort of “smart” interface that looks at VUFind search results to automatically select and search federated search targets whose results are hidden behind a tab until selected by the user, or 4) allowing the user to assume some sort of predefined persona (Thomas Jefferson, Isaac Newton, Kurt Gödel, etc.) to point toward the selection of search targets. LibraryFind was mentioned as a store for federated search targets. Pazpar2 was mentioned as a tool to do the actual searching.
  • Development process – The final discussion topic regarded the on-going development process. To what degree should the whole thing be more formalized? Should VUFind be hosted by a third party? Code4Lib? PALINET? A newly created corporation? Is it a good idea to partner with similar initiatives such as OLE (Open Library Environment), XC, DLF-DI, or Blacklight? On one hand, such formalization would give the process more credibility and open more possibilities for financial support, but on the other hand the process would also become more administratively heavy. Personally, I liked the idea of allowing PALINET to host the system. It seems to be an excellent opportunity for such a library-support organization.
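
As an aside, the non-MARC workflow discussed above (get a set of metadata, map it to VUFind/Solr fields, feed it to the indexer) might look something like this for OAI-harvested Dublin Core. The crosswalk below is my own guess at a sensible mapping, not VUFind’s actual one, and the field names are illustrative.

```python
# Sketch of mapping an oai_dc record onto VUFind-ish Solr fields.
# The DC namespace is standard; the target field names are assumptions.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

def dc_to_solr(oai_dc_xml, identifier):
    # parse an oai_dc fragment and crosswalk a few elements
    root = ET.fromstring(oai_dc_xml)
    return {
        "id": identifier,  # the OAI identifier serves as the unique key
        "title": root.findtext(DC + "title", default=""),
        "author": root.findtext(DC + "creator", default=""),
        "format": "Electronic",
    }

# a canned oai_dc fragment, just to exercise the crosswalk
fragment = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Civil Disobedience</dc:title>
  <dc:creator>Thoreau, Henry David</dc:creator>
</oai_dc:dc>"""

print(dc_to_solr(fragment, "oai:example:1"))
```

Once institutional repository or ETD records are flattened into the same field names, they sit in the index right beside the MARC-derived records.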

The day was wrapped up by garnering volunteers to see after each of the discussion points in the hopes of developing them further.

I appreciated the opportunity to attend the meeting, especially since it is quite likely I will be incorporating VUFind into a portal project called the Catholic Research Resources Alliance. I find it amusing the way many “next generation” library catalog systems — “discovery layers” — are gravitating toward indexing techniques, specifically Lucene. Currently, these systems include VUFind, XC, Blacklight, and Primo. All of them provide a means to feed data to an indexer and then provide user access to the index.

Of all the discussions, I enjoyed the one on federated search the most because it toyed with the idea of making the interfaces to our indexes smarter. While this smacks of artificial intelligence, I sincerely think this is an opportunity to incorporate library expertise into search applications.