Archive for August, 2009

Web-scale discovery services

Thursday, August 27th, 2009

Last week (Tuesday, August 18) Marshall Breeding and I participated in a webcast sponsored by Serials Solutions and Library Journal on the topic of “‘Web-scale’ discovery services”.

Our presentations complimented one another in that we both described the current library technology environment and described how the creation of amalgamated indexes of book and journal article content have the potential to improve access to library materials.

Dodie Ownes summarized the event in an article for Library Journal. From there you can also gain access to an archive of the one-hour webcast. (Free registration required.) I have made my written remarks available on the Hesburgh Libraries website as well as mirrored them locally. From the remarks:

It is quite possible the do-it-yourself creation and maintenance of an index to local book holdings, institutional repository content, and articles/etexts is not feasible. This may be true for any number of reasons. You may not have the full complement of resources to allocate, whether that be time, money, people, or skills. You and your library may have a set of priorities forcing the do-it-yourself approach lower on the to-do list. You might find yourself stuck in never-ending legal negotiations for content from “closed” access providers. You might liken the process of normalizing myriads of data formats into a single index to Hercules cleaning the Augean stables.

technical expertise
technical expertise
money
money
people with vision
people with vision
energy
energy

If this be the case, then the purchasing (read, “licensing”) of a single index service might be the next best thing — Plan B.

I sincerely believe the creation of these “Web-scale” indexes is a step in the right direction, but I believe just as strongly that the problem to be solved now-a-days does not revolve around search and discovery, but rather use and context.

“Thank you Serials Solutions and Library Journal for the opportunity to share some of my ideas.”

How to make a book (#1 of 3)

Sunday, August 23rd, 2009

This is a series of posts where I will describe and illustrate how to make books. In this first post I will show you how to make a book with a thermo-binding machine. In the second post I will demonstrate how to make a book by simply tearing and folding paper. In the third installment, I will make a traditional book with a traditional cover and binding. The book — or more formally, the codex — is a pretty useful format for containing information.

Fellowes TB 250 thermo-binding machine

The number of full text books found on the Web is increasing at a dramatic pace. A very large number of these books are in the public domain and freely available for downloading. While computers make it easy to pick through smaller parts of books, it is diffcult to read and understand them without printing. Once they are printed you are then empowered to write in the margins, annotate them as you see fit, and share them with your friends. On the other hand, reams of unbound paper is difficult to handle. What to do?

Enter a binding machine, specifically a thermo-binding machine like the Fellowes TB 250. This handy-dandy gizmo allows you to print bunches o’ stuff, encase it in inexpensive covers, and bind it into books. Below is an outline of the binding process and a video demonstration is also available online:

  1. Buy the hardware – The machine costs less than $100 and available from any number of places on the Web. Be sure to purchase covers in a variety of sizes.
  2. Print and gather your papers – Be sure to “jog” your paper nice and neatly.
  3. Turn the machine on – This makes the heating element hot.
  4. Place the paper into the cover – The inside of each cover’s spine is a ribbon of glue. Make sure the paper is touching the glue.
  5. Place the book into the binder – This melts the glue.
  6. Remove the book, and press the glue – The larger the book the more important it is to push the adhesive into the pages.
  7. Go to Step #5, at least once – This makes the pages more secure in the cover.
  8. Remove, and let cool – The glue is hot. Let it set.
  9. Enjoy your book – This is the fun part. Read and scribble in your book to your heart’s content.

Binding and the Alex Catalogue

The Alex Catalogue of Electronic Texts is a collection of fulltext books brought together for the purposes of furthering a person’s liberal arts eduction. While it supports tools for finding, analyzing, and comparing texts, the items are intended to be read in book form as well. Consider printing and binding the PDF or fully transcribed versions of the texts. Your learning will be much more thorough, and you will be able to do more “active” reading.

Binding and libraries

Binding machines are cheap, and they facilitate a person’s learning by enabling users to organize their content. Maybe providing a binding service for library patrons is apropos? Make it easy for people to print things they find in a library. Make it easy for them to use some sort of binding machine. Enable them to take more control over the stuff of their learning, teaching, and research. It certainly sounds like good idea to me. After all, in this day and age, libraries aren’t so much about providing access to information as they are about making information more useful. Binding — books on demand — is just one example.

Book review of Larry McMurtry’s Books

Sunday, August 23rd, 2009

I read with interest Larry McMurtry’s Books: A Memoir (Simon & Schuster, 2008), but from my point of view, I would be lying if I said I thought the book had very much to offer.

The book’s 259 pages are divided into 109 chapters. I was able to read the whole thing in six or seven sittings. It is an easy read, but only because the book doesn’t say very much. I found the stories rarely engaging and never very deep. They were full of obscure book titles and the names of “famous” book dealers.

Much of this should not be a surprise, since the book is about one person’s fascination with books as objects, not books as containers of information and knowledge. From page 38 of my edition:

Most young dealers of the Silicon Chip Era regard a reference library as merely a waste of space. Old-timers on the West Coast, such as Peter Howard of Serendipity Books in Berkeley or Lou and Ben Weinstein of the (recently closed) Heritage Books Shop in Los Angeles, seem to retain a fondness of reference books that goes beyond the practical. Everything there is to know about a given volume may be only a click away, but there are still a few of us who’d rather have the book than the click. A bookman’s love of books is a love of books, not merely the information in them.

Herein lies the root of my real problem with the book, it shares with the reader one person’s chronology of a love of books and book selling. It describes various used bookstores and give you an idea of what it is like to be a book dealer. Unfortunately, I believe McMurtry misses the point about books. They are essentially a means to an end. A tool. A medium for the exchange of ideas. The ideas they contain and the way they contain them are the important thing. There are advantages & disadvantages to the book as a technology, and these advantages & disadvantages ought not be revered or exaggerated to dismiss the use of books or computers.

I also think McMurtry’s perception of libraries, which seems to be commonly held in and outside my profession, points to one of librarianship’s pressing issues. From page 221:

But they [computers] don’t really do what books do, and why should they usurp the chief function of a public library, which is to provide readers access to books? Books can accommodate the proximity of computers but it doesn’t seem to work the other way around. Computers now literally drive out books from the place they should, by definition, be books’ own home: the library.

Is the chief function of a public library to provide readers access to books? Are libraries defined as the “home” of books? Such a perception may have been more or less true in an environment where data, information, and knowledge were physically manifested, but in an environment where the access to information is increasingly digital the book as a thing is not as important. Books are not central to the problems to be solved.

Can computers do what books do? Yes and no. Computers can provide access to information. They make it easier to “slice and dice” their content. They make it easier to disseminate content. They make information more findable. The information therein is trivial to duplicate. On the other hand, books require very little technology. They are relatively independent of other technologies, and therefore they are much more portable. Books are easy to annotate. Just write on the text or scribble in the margin. A person can browse the contents of a book much faster than the contents of electronic text. Moreover, books are owned by their keepers, not licensed, which is increasingly the case with digitized material. There are advantages & disadvantages to both computers and books. One is not necessarily better than the other. Each has their place.

As a librarian, I had trouble with the perspectives of Larry McMurtry’s Books: A Memoir. It may be illustrative of the perspectives of book dealers, book sellers, etc., but I think the perspective misses the point. It is not so much about the book as much as it is about what the book contains and how those contents can be used. In this day and age, access to data and information abounds. This is a place where libraries increasingly have little to offer because libraries have historically played the role of middleman. Producers of information can provide direct access to their content much more efficiently than libraries. Consequently a different path for libraries needs to be explored. What does that path look like? Well, I certainly have ideas about that one, but that is a different essay.

Browsing the Alex Catalogue

Friday, August 21st, 2009

The Alex Catalogue is browsable by author names, subject tags, and titles. Just select a browsable list, then a letter, and finally an item.

Browsability is an important feature of any library catalog. It gives you an opportunity to see what the collection contains without entering a query. It is also possible to use browsability to identify similar names, terms, or titles. “Oh look, I hadn’t thought of that idea, and look at the alternative spellings I can use.”

Creating the browsable list is rather trivial. Since all of the underlying content is saved in a relational database, it is rather easy to loop through the fields of “controlled” vocabulary terms and “authority” lists to identify matching etext titles. These lists include:

The later is probably the most interesting since it gives you an idea of the most common words and two-word phrases used in the corpus. For example, look at the list of words starting with the letter “k” and all the ways the word “kant” has been extracted from collection

Indexing and searching the Alex Catalogue

Monday, August 17th, 2009

The Alex Catalogue of Electronic Texts uses state-of-the-art software to index both the metadata and full text of its content. While the interface accepts complex Boolean queries, it is easier to enter a single word, a number of words, or a phrase. The underlying software will interpret what you enter and do much of hard query syntax work for you.

Indexing

The Catalogue consists of a number of different types of content harvested from different repositories. Most of the content is in the form of electronic texts (“etexts” as opposed to “ebooks”). Think Project Gutenberg, but also items from a defunct gopher archive from Virginia Tech, and more recently digitized materials from the Internet Archive. All of these items benefit from metadata and full text indexing. In other words, things like title words, author names, and computer-generated subject tags are made searchable as well as the full texts of the items.

The collection is supplemented with additional materials such as open access journal titles, open access journal article titles, some content from the HaitiTrust, as well as photographs taken by myself. Presently the full text of these secondary items is not included, just metadata: titles, authors, notes, and subjects. Search results return pointers to the full texts.

Regardless of content type, all metadata and full text is managed in an underlying MyLibrary database. To make the content searchable reports are written against the database and fed to Solr/Lucene for indexing. The Solr/Lucene data structure is rather simple consisting only of a number of Dublin Core-like fields, a default search field, and three facets (creator, subject/tag, and sub-collection). From a 30,000 foot view, this is the process used to index the content of the Catalogue:

  1. extract metadata and full text records from the database
  2. map each record’s fields to the Solr/Lucene data structure
  3. insert each record into Solr/Lucene; index the record
  4. go to Step #1 until all records have been indexed
  5. optimize the index for faster retrieval

Solr/Lucene works pretty well, and interfacing with it was made much simpler through the use of a set of Perl modules called WebService::Solr. On the other hand, there are many ways the index could be improved such as implementing facilitates for sorting and adding weights to various fields. An indexer’s work is never done.

Searching

Because of people’s expectations, searching the index is a bit more complicated and not as straight-forward, but only because the interface is trying to do you some favors.

Solr/Lucene supports single-word, multiple-word, and phrase searches through the use of single or double quote marks. If multi-word queries are entered without Boolean operators, then a Boolean and is assumed.

Since people often enter multiple-word queries, and it is difficult to know whether or not they are really wanting to do a phrase search, the Alex Catalogue converts ambiguous multiple-word queries into more robust Boolean queries. For example a search for “william shakespeare” (sans the quote marks) will get converted into “(william AND shakespeare) OR ‘william shakespeare'” (again, sans the double quote marks) on behalf of the user. This is considered a feature of the Catalogue.

To some degree Solr/Lucene tokenizes query terms, and consequently searches for “book” and “books” return the same number of hits.

Search results are returned in a relevance ranked order. Some time in the future there will be the option of sorting results by date, author, title, and/or a couple of other criteria. Unlike other catalogs, Alex only has a single display — search results. There is no intermediary detailed display; the Catalogue only displays search results or the full text of the item.

In the hopes of making it easier for the user to refine their search, the results page allows the user to automatically turn queries into subject, author, or title searches. It takes advantage of a thesaurus (WordNet) to suggest alternative queries. The system returns “facets” (author names, subject tags, or material types) allowing the user to limit their query with additional terms and narrow search results. The process is not perfect and there are always ways of improving the interface. Usability is never done either.

Summary

Do not try to out think the Alex Catalogue. Enter a word or two. Refine your query using the links on the resulting page. Read & enjoy the discovered texts. Repeat.

Microsoft Surface at Ball State

Friday, August 14th, 2009

Me and a number of colleagues from the University of Notre Dame visited folks from Ball State University and Ohio State University to see, touch, and discuss all things Microsoft Surface.

There were plenty of demonstrations surrounding music, photos, and page turners. The folks of Ball State were finishing up applications for the dedication of the new “information commons”. These applications included an exhibit of orchid photos and an interactive map. Move the scroll bar. Get a differnt map based on time. Tap locations. See pictures of buildings. What was really interesting about the later was the way it pulled photographs from the library’s digital repository through sets of Web services. A very nice piece of work. Innovative and interesting. They really took advantage of the technology as well as figured out ways to reuse and repurpose library content. They are truly practicing digital librarianship.

The information commons was nothing to sneeze at either. Plenty of television cameras, video screens, and multi-national news feeds. Just right for a school with a focus on broadcasting.

Ball State University. Hmm…