I attended a VUFind meeting at PALINET in Philadelphia today, November 6, and this posting summarizes my experiences there.
As you may or may not know, VUFind is a “discovery layer” intended to be applied against a traditional library catalog. Originally written by Andrew Nagy of Villanova University, it has been adopted by a handful of libraries across the globe and is being investigated by quite a few more. Technically speaking, VUFind is an open source project based on Solr/Lucene. Extract MARC records from a library catalog. Feed them to Solr/Lucene. Provide access to the index as well as services against the search results.
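To make that concrete, here is a minimal sketch of the extract-and-index pipeline in Python. It assumes a local Solr instance and the pymarc library; the core name and field names are made up for the example, and VUFind's actual schema and import routines differ.

```python
# A rough illustration of the extract -> index pipeline described above.
# Assumptions: a Solr instance at localhost:8983 with a core named
# "biblio" (hypothetical), the pymarc library, and a file of MARC records.
import json
import requests
from pymarc import MARCReader

SOLR_UPDATE_URL = 'http://localhost:8983/solr/biblio/update'

# Step 1: extract MARC records and map a few fields to flat index fields
docs = []
with open('catalog-dump.mrc', 'rb') as handle:
    for record in MARCReader(handle):
        docs.append({
            'id': record['001'].value() if record['001'] else None,
            'title': record['245']['a'] if record['245'] else None,
            'author': record['100']['a'] if record['100'] else None,
        })

# Step 2: feed the documents to Solr and commit so they become searchable
response = requests.post(
    SOLR_UPDATE_URL,
    params={'commit': 'true'},
    headers={'Content-Type': 'application/json'},
    data=json.dumps(docs),
)
response.raise_for_status()

# Step 3: provide access to the index at .../solr/biblio/select?q=...
```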
The meeting was attended by about thirty people. The three people from Tasmania won the prize for coming the farthest, but there were also people from Stanford, Texas A&M, and a number of more regional libraries. The meeting had a barcamp-like agenda. Introduce ourselves. Brainstorm topics for discussion. Discuss. Summarize. Go to the bar afterwards. Alas, I didn’t get to go to the bar, but I was there for the balance. The following bullet points summarize each discussion topic:
- Jangle – A desire was expressed to implement some sort of API (application programmer interface) to VUFind in order to ensure a greater degree of interoperability. The DLF ILS-DI was mentioned quite a number of times, but Jangle was the focus of the discussion. Unfortunately, not a whole lot of people around the room knew about Jangle, the Atom Publishing Protocol, or RESTful computing techniques in general. Because creating an API was desired, the conversation also turned to the XC (eXtensible Catalog) project, which some people around the room knew of, and there was curiosity/frustration as to why more collaboration could not be done with XC. Apparently the XC process and their software are not as open and transparent as I had thought. (Note to self: ping the folks at XC and bring this issue to their attention.) In the end, implementing something like Jangle was endorsed.
- Non-MARC content – It was acknowledged that non-MARC content ought to be included in any sort of “discovery layer”. A number of people had experimented with including content from their local institutional repositories, digital libraries, and/or collections of theses & dissertations. The process is straightforward. Get a set of metadata. Map it to VUFind/Solr fields. Feed it to the indexer. Done. (A sketch of this sort of mapping appears after this list.) Other types of data people expressed an interest in incorporating included: EAD, TEI, images, various types of data sets, and mathematical models. From here the discussion quickly evolved into the next topic…
- Solrmarc – Through the use of a Java library called MARC4J, a Solr plug-in has been created by the folks at the University of Virginia. This plug-in — Solrmarc — makes it easier to read MARC data and feed it to Solr. There was a lot of discussion about whether this plug-in should be extended to include other data types, such as the ones outlined above, or whether Solrmarc should be distributed as-is, more akin to a GNU “do one thing and one thing well” type of tool. From my perspective, no specific direction was articulated.
- Authority control – We all knew the advantage of incorporating authority lists (names, authors, titles) into VUFind. The general idea was to acquire authority lists. Incorporate this data into the underlying index. Implement “find more like this one” types of services against search results based on the related records linked through authorities. (A sketch of one way to prototype this appears after this list.) There was then much discussion on how to initially acquire the necessary authority data. We were a bit stymied. After lunch a slightly different tack was taken. Acquire some authority data, say about 1,000 records. Incorporate it into an implementation of VUFind. Demonstrate the functionality to wider audiences. Tackle the problem of getting more complete and updated authority data later.
- De-duplication/FRBR – This was probably the shortest discussion point, and it really centered on FRBR. We ended up asking ourselves, “To what degree do we want to incorporate Web Services such as xISBN into VUFind to implement FRBR-like functionality, or to what degree should ‘real’ FRBRization take place?” Compared to other things, de-duplication/FRBR seemed to be taking a lower priority.
- Serials holdings – This discussion was around indexing and/or displaying serials holdings information. There was much talk about the ways various integrated library systems allow libraries to export holdings information, whether or not it was merged with bibliographic information, and how consistent it was from system to system. In general it was agreed that this holdings information ought to be indexed to enable searches such as “Time Magazine 2004”, but displaying the results was seen as problematic. “Why not use your link resolver to address this problem?” was asked. This whole issue, too, was given a lower priority since serial holdings are increasingly electronic in nature.
- Federated search – It was agreed that federated search s?cks, but it is a necessary evil. Techniques for incorporating it into VUFind ranged from: 1) side-stepping the problem by licensing bibliographic data from vendors, 2) side-stepping the problem by acquiring binary Lucene indexes of bibliographic data from vendors, 3) creating some sort of “smart” interface that looks at VUFind search results to automatically select and search federated search targets whose results are hidden behind a tab until selected by the user, or 4) allowing the user to assume some sort of predefined persona (Thomas Jefferson, Isaac Newton, Kurt Gödel, etc.) to point toward the selection of search targets. LibraryFind was mentioned as a store for federated search targets. Pazpar2 was mentioned as a tool to do the actual searching.
- Development process – The final discussion topic concerned the ongoing development process. To what degree should the whole thing be more formalized? Should VUFind be hosted by a third party? Code4Lib? PALINET? A newly created corporation? Is it a good idea to partner with similar initiatives such as OLE (Open Library Environment), XC, ILS-DI, or Blacklight? On one hand, such formalization would give the process more credibility and open more possibilities for financial support, but on the other hand the process would also become more administratively heavy. Personally, I liked the idea of allowing PALINET to host the system. It seems to be an excellent opportunity for such a library-support organization.
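As promised under the non-MARC content bullet above, here is a sketch of what the mapping step might look like. A hedged example only: the Dublin Core input and the index field names are invented for illustration and do not reflect VUFind's actual schema.

```python
# A minimal sketch of mapping a non-MARC record (here, Dublin Core, as
# might be harvested from an institutional repository via OAI-PMH) to
# flat index fields. All field names are illustrative.
def dc_to_solr(dc_record):
    """Map a Dublin Core record (a dict of element -> list of values)
    to a flat dictionary suitable for feeding to the indexer."""
    return {
        'id': dc_record.get('identifier', [None])[0],
        'title': dc_record.get('title', [None])[0],
        'author': dc_record.get('creator', [None])[0],
        'topic': dc_record.get('subject', []),    # multi-valued field
        'format': 'Thesis/Dissertation',          # hard-coded facet value
    }

thesis = {
    'identifier': ['etd-2008-001'],
    'title': ['An Example Electronic Thesis'],
    'creator': ['Doe, Jane'],
    'subject': ['Library science', 'Indexing'],
}
print(dc_to_solr(thesis))  # ready to be fed to the indexer
```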
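Likewise, regarding the authority control bullet, one way the “find more like this one” idea could be prototyped is to lean on Solr's built-in MoreLikeThis handler and treat the authority-controlled fields as the similarity fields. This is a sketch under assumptions: the handler must be enabled in solrconfig.xml, and the URL and field names are made up.

```python
# Find records similar to a given record by comparing the fields that
# would be under authority control. Assumes Solr's MoreLikeThis handler
# is mapped to /mlt; core and field names are hypothetical.
import requests

def more_like_this(record_id):
    response = requests.get(
        'http://localhost:8983/solr/biblio/mlt',
        params={
            'q': 'id:%s' % record_id,
            'mlt.fl': 'author,topic',  # authority-controlled fields
            'mlt.mintf': 1,            # count every term occurrence
            'mlt.mindf': 1,            # even terms appearing in few docs
            'rows': 5,
            'wt': 'json',
        },
    )
    response.raise_for_status()
    return response.json()['response']['docs']

for doc in more_like_this('etd-2008-001'):
    print(doc.get('title'))
```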
The day was wrapped up by garnering volunteers to see after each of the discussion points in the hopes of developing them further.
I appreciated the opportunity to attend the meeting, especially since it is quite likely I will be incorporating VUFind into a portal project called the Catholic Research Resources Alliance. I find it amusing the way many “next generation” library catalog systems — “discovery layers” — are gravitating toward indexing techniques and specifically Lucene. Currently, these systems include VUFind, XC, Blacklight, and Primo. All of them provide a means to feed data to an indexer and then provide user access to the index.
Of all the discussions, I enjoyed the one on federated search the most because it toyed with the idea of making the interfaces to our indexes smarter. While this smacks of artificial intelligence, I sincerely think this is an opportunity to incorporate library expertise into search applications.
Tags: "next generation" library catalogs, open source software, VUFind
Eric, thanks for this write-up. I wish I could have been there, since it looks like there was some good, meaty discussion. I have a couple of questions, though… I’ll put them in the order of your points:
1) “[I]mplementing something like Jangle was endorsed.” Out of curiosity, why “something like” Jangle instead of Jangle itself? Jangle is still clay that can be molded to meet what people need, not hardened stone. Any and all suggestions are welcome to make it do what developers need.
2) I think I lean towards keeping SolrMARC, well, SolrMARC, although I can see the argument for a pluggable framework of different sorts of metadata parsers, as well. I still think most of our other formats don’t have the immediate technical and non-technical “problems” that MARC carries with it.
3) There are a few problems that I see with authority records. The first is the lack of name authorities in the wild (the subject authorities are all I know of that are available). The second is the fundamental problem of matching the authority to records, since it’s just string matching.
4) I’m still not sure why, even in an increasingly electronic environment, you shouldn’t be able to search for “Time Magazine 2004”. Couldn’t the electronic holdings be imported from the link resolver knowledgebase or ERMS?
5) One of the plans I had when I still worked at Georgia Tech was to create a consortium-wide “cache” for the federated search project (the major universities in Georgia consortially use Metalib), using something like a Solr or even Sphinx store to keep recent results, a place the federated search searches “first” while federating through the licensed targets in the background. With around 80,000 FTE (GT, UGA, Georgia State, and Emory) contributing to the cache, I think you’d have more than enough search results in there to make it work. The biggest hurdle would be working out who has access to what, but I still think that’s pretty doable (since they’d all be using the same search engine in the first place).
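In rough, made-up code, the flow I have in mind looks something like this (in real life the cache would be a Solr or Sphinx store and the background search would federate through Metalib):

```python
# A toy sketch of a cache-first federated search; all names are invented.
import threading

cache = {}  # query -> list of result dicts, standing in for the shared store

def federated_search(query):
    # Stand-in for the slow call out to the licensed targets
    return [{'title': 'A result for %s' % query, 'owner': 'gatech'}]

def refresh_cache(query):
    cache[query] = federated_search(query)

def search(query, allowed_owners):
    # Serve whatever is already cached, filtered by who has access to what...
    hits = [r for r in cache.get(query, []) if r['owner'] in allowed_owners]
    # ...while the real federated search refreshes the cache in the background.
    threading.Thread(target=refresh_cache, args=(query,)).start()
    return hits

print(search('civil war', {'gatech', 'uga'}))  # first search warms the cache
```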
Thanks for the update, Eric.
“There was a lot of discussion about whether this plug-in should be extended to include other data types, such as the ones outlined above, or whether Solrmarc should be distributed as-is, more akin to a GNU “do one thing and one thing well” type of tool.”
Jeez, this seems obvious to me. No, Solrmarc shouldn’t do things that aren’t MARC. Solrmarc is a plug-in for Solr to index MARC. If you want to write a plug-in for Solr to index other things, you can. Why oh why would you want a plug-in for Solr to index half a dozen things all wrapped into one plug-in? That doesn’t make any sense at all. Ross suggests the sense of “a pluggable framework of different sorts of metadata parsers”; well, Solr already IS this, and SolrMARC is a plug-in written for Solr’s ‘pluggable framework’, to do MARC!
I don’t want to be mean, but the fact that this took a lot of discussion doesn’t give me confidence in the software engineering experience represented in the room, and these are the people making software engineering decisions about VuFind?
“In general it was agreed that this holdings information ought to be indexed to enable searches such as “Time Magazine 2004”, but displaying the results was seen as problematic.”
Well, the problem here is that most of our catalogs don’t actually contain sufficient semantic metadata to answer this question, regardless of what a discovery layer does. That means that until that problem is fixed, you won’t be able to have your discovery layer answer that question. I still think it’s important to have your discovery layer _list_ what issues of Time Magazine you hold, even if it can’t actually operate on the listing semantically.
“Why not use your link resolver to address this problem?” was asked.
Oh boy, I think this was asked by someone with no experience with link resolvers. 1) Because, again, this data, for print, doesn’t exist anywhere that the link resolver can use to get the semantic info to answer the question. 2) If it did exist in your ILS, the link resolver would have to get it from the ILS. I wouldn’t hold my breath for most of our commercial link resolver vendors to provide this functionality; VuFind could provide it a lot quicker. 3) If you’re talking electronic, then, yes, our link resolvers can generally do it.
I think they missed the boat on this one. In general, from your review of the discussion, I see people rationalizing hard-but-important problems as “well, gee, that’s not really necessary after all.” It’s one thing to say “It’s hard, let’s work on easier stuff first.” But don’t convince yourself it doesn’t really matter just because it’s hard.
PS: I said “If you’re talking electronic, then, yes, our link resolvers can generally do it.” — I mean, answer the question “Do we have Time magazine 2004?”
I think there’s still a need for the discovery tool to tell people, in a list, the full extent of what issues of Time Magazine the library has, in print, or in electronic.
Maybe it would do this working _with_ the link resolver, as it’s already assumed the discovery tool has to work with the ILS, naturally. But, in an academic library, or at least in my academic library, this is clearly a need.
Eric,
Thanks for posting this. I wish I could have been there, but the OLE Project meetings had to take priority. Keep us posted on your incorporation of VUFind into the Catholic Research Resources Alliance.
Tim
A few comments to the comments…
First, “something like Jangle” means: implement Jangle, but more so, implement as many standards-driven things as possible. For example, while it was not discussed, I would imagine that if an OpenSearch and/or SRU interface were suggested, people would have said, “Sounds like a good idea.”
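To illustrate how little would be involved, here is a hedged sketch of querying a hypothetical SRU interface to VUFind. The endpoint URL is made up, but the parameters are standard SRU:

```python
# Query a hypothetical SRU endpoint; the URL is invented, the parameters
# are standard SRU 1.1.
import requests

response = requests.get(
    'http://localhost/vufind/sru',
    params={
        'operation': 'searchRetrieve',
        'version': '1.1',
        'query': 'dc.title = "origin of species"',  # a CQL query
        'maximumRecords': '10',
        'recordSchema': 'dc',
    },
)
print(response.text)  # an XML response listing matching records
```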
Regarding authority records, yes, the largest problem was finding authority records in the wild.
Yes, I believe the idea of importing holdings information from an ERM was mentioned.
Last, yes, “working on the easier things first” would be a better way of prioritizing the items that were outlined during the meeting.
Eric — In your summary under the heading of non-MARC data, you said the process was as simple as “Get set of metadata. Map it to VUFind/Solr fields. Feed it to the indexer. Done.” I’m not all that familiar with VUfind, but maybe you or someone else knows the answer. Is step #2 effectively “Map [the metadata] to MARC fields”?