Category Archives: Reviews

Fun with RSS and the RSS aggregator called Planet

This posting outlines how I refined a number of my RSS feeds and then aggregated them into a coherent whole using Planet.

Many different RSS feeds

I have, more or less, been creating RSS (Really Simple Syndication) feeds since 2002. My first foray was not really with RSS but rather with RDF. At that time the functions of RSS and RDF were blurred. In any event, I used RDF as a way of syndicating randomly selected items from my water collection. I never really pushed the RDF, and nothing really became of it. See “Collecting water and putting it on the Web” for details.

In December of 2004 I started marking up my articles, presentations, and travelogues in TEI and saving the result in a database. The webified version of these efforts was something called Musings on Information and Librarianship. I described the database supporting the process in a specific entry called “My personal TEI publishing system”. A program — make-rss.pl — was used to make the feed.

Since then blogs have become popular, and almost by definition, blogs support RSS in a really big way. My RSS was functional, but by comparison, everybody else’s was exceptional. For many reasons I started drifting away from my personal publishing system in 2008 and started moving towards WordPress. This manifested itself in this blog — Mini-Musings.

To make things more complicated, I started blogging on other sites for specific purposes. About a year ago I started blogging for the “Catholic Portal”, and more recently I’ve been blogging about research data management/curation — Days in the Life of a Librarian — at the University of Notre Dame.

In September of 2009 I started implementing a reading list application. Print an article. Read it. Draw and scribble on it. (Read, “Annotate it.”) Scan it. Convert it into a PDF document. Do OCR against it. Save the result to a Web-accessible file system. Do data entry against a database to describe it. Index the metadata and extracted OCR. And finally, provide a searchable/browsable interface to the whole lot. The result is a fledgling system I call “What’s Eric Reading?” Since I wanted to share my wealth (after all, I am a librarian) I created an RSS feed against this system too.

I was on a roll. I went back to my water collection and created a full-fledged RSS feed against it as well. See the simple Perl script — water2rss.pl — to see how easy it is.
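
The actual water2rss.pl is not reproduced here, but the heart of such a script, assuming the CPAN module XML::RSS and a made-up handful of collection items, might look something like this:

  #!/usr/bin/perl
  # sketch only; the feed metadata and items below are placeholders
  use strict;
  use warnings;
  use XML::RSS;

  # create the channel
  my $rss = XML::RSS->new( version => '2.0' );
  $rss->channel(
      title       => 'Water collection',
      link        => 'http://example.com/water/',
      description => 'Randomly selected waters from the collection',
  );

  # add one item per water; in real life these would come from a database or an XML file
  my @waters = (
      { title => 'Water from the Mississippi River', link => 'http://example.com/water/?id=1' },
      { title => 'Water from the Atlantic Ocean',    link => 'http://example.com/water/?id=2' },
  );
  $rss->add_item( title => $_->{ title }, link => $_->{ link } ) foreach @waters;

  # save the feed to disk where the Web server can find it
  $rss->save( 'water.xml' );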

Ack! I now have six different active RSS feeds, not counting the feeds I can get from Flickr and YouTube:

  1. Catholic Portal
  2. Life of a Librarian
  3. Mini-musings
  4. Musings
  5. What’s Eric Reading?
  6. Water collection

That’s too many, even for an ego surfer like myself. What to do? How can I consolidate these things? How can I present my writings in a single interface? How can I make it easy to syndicate all of this content in a standards-compliant way?

Planet

The answer to my questions is/was Planet — “an awesome ‘river of news’ feed reader. It downloads news feeds published by web sites and aggregates their content together into a single combined feed, latest news first.”

A couple of years ago the Code4Lib community created an RSS “planet” called Planet Code4Lib — “Blogs and feeds of interest to the Code4Lib community, aggregated.” I think it is maintained by Jonathan Rochkind, but I’m not sure. It is pretty nice since it brings together the RSS feeds from quite a number of library “hackers”. Similarly, there is another planet called Planet Cataloging which does the same thing for library cataloging feeds. This one is maintained by Jennifer W. Baxmeyer and Kevin S. Clarke. The combined planets work very well together, except when individual blogs are in both aggregations. When this happens I end up reading the same blog postings twice. Not a big deal. You get what you pay for.

After a tiny bit of investigation, I decided to use Planet to aggregate and serve my RSS feeds. Installation and configuration were trivial. Download and unpack the distribution. Select an HTML template. Edit a configuration file denoting the location of RSS feeds and where the output will be saved. Run the program. Tweak the template. Repeat until satisfied. Run the program on a regular basis, preferably via cron. Done. My result is called Planet Eric Lease Morgan.
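
For the curious, the heart of the setup is a single INI-style configuration file. The following is only a sketch; the paths, template file names, and feed URLs are placeholders, and the options supported by your copy of Planet may differ:

  [Planet]
  name = Planet Eric Lease Morgan
  link = http://example.com/planet/
  owner_name = Eric Lease Morgan
  template_files = index.html.tmpl rss20.xml.tmpl
  output_dir = /path/to/web/space/planet

  [http://example.com/musings/feed/]
  name = Mini-Musings

  [http://example.com/water/water.xml]
  name = Water collection

Planet is then run from cron with something like python planet.py config.ini once an hour, and the HTML and combined feed are rebuilt in the output directory.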

Planet Eric Lease Morgan

The graphic design may not be extraordinarily beautiful, but the content is not necessarily intended to be read via an HTML page. Instead the content is intended to be read from inside one’s favorite RSS reader. Planet not only aggregates content but syndicates it too. Very, very nice.

What I learned

I learned a number of things from this process. First I learned that standards evolve. “Duh!”

Second, my understanding of open source software and its benefits was reinforced. I would not have been able to do nearly as much if it weren’t for open source software.

Third, the process provided me with a means to reflect on the processes of librarianship. My particular processes for syndicating content needed to evolve in order to remain relevant. I had to go back and modify a number of my programs in order for everything to work correctly and validate. The library profession seemingly hates to do this. We have a mindset of “Mark it and park it.” We have a mindset of “I only want to touch a book or record once.” In the current environment, this is not healthy. Change is more the norm than not. The profession needs to embrace change, but then again, all institutions, almost by definition, abhor change. What’s a person to do?

Fourth, the process enabled me to come up with a new quip. The written word, when read, transcends both space and time. Fun!?

Finally, here’s an idea for the progressive librarians in the crowd. Use the Planet software to aggregate RSS feeds fitting your library’s collection development policy. Programmatically loop through the resulting links to copy/mirror the remote content locally. Curate the resulting collection. Index it. Integrate the subcollection and index into your wider collection of books, journals, etc. Repeat.
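
The “loop through the resulting links” step is not much code. Here is a rough sketch, assuming the CPAN modules XML::RSS, LWP::Simple, and URI::Escape, and a placeholder feed URL; curation and indexing would happen downstream:

  #!/usr/bin/perl
  # sketch: read an aggregated feed and mirror each linked item locally
  use strict;
  use warnings;
  use XML::RSS;
  use LWP::Simple qw( get getstore );
  use URI::Escape qw( uri_escape );

  my $feed = 'http://example.com/planet/rss20.xml';   # placeholder; use your planet's combined feed
  my $dir  = './mirror';                              # local cache awaiting curation and indexing
  mkdir $dir unless -d $dir;

  # parse the combined feed
  my $rss = XML::RSS->new;
  $rss->parse( get( $feed ) );

  # copy each linked item to the local file system
  foreach my $item ( @{ $rss->{ items } } ) {
      my $link = $item->{ link } or next;
      getstore( $link, "$dir/" . uri_escape( $link ) . '.html' );
  }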

Book reviews for Web app development

This is a set of tiny book reviews covering the topic of Web app development for the iPhone, iPad, and iPod Touch.

Unless you’ve been living under a rock for the past three or four years, you know of the increasing popularity of personal mobile computing devices. This popularity has manifested itself through “smart phones” like the iPhone and “tablet computers” like the iPad and, to some extent, the iPod Touch. These devices, as well as other smart phones and tablet computers, get their network connections from the ether, their screens are smaller than the monitors of desktop computers, and they employ touch screens for input instead of keyboards and mice. All of these things significantly change the user’s experience and thus their expectations.

As a librarian I am interested in providing information services to my clientele. In this increasingly competitive environment where the provision of information services includes players like Google, Amazon, and Facebook, it behooves me to adapt to the wider environment of my clientele as opposed to the other way around. This means I need to learn how to provide information services through mobile computing devices. Google does it. I have to do it too.

Applications for mobile computing devices fall into two categories: 1) native applications, and 2) “Web apps”. The former are binary programs written in compiled languages like Objective-C (or quite possibly Java). These types of applications are operating system-specific, but they are also able to take full advantage of the underlying hardware. This means applications for things like iPhone or iPad can interoperate with the devices’ microphone, camera, speakers, geo-location functions, network connection, local storage, etc. Unfortunately, I don’t know any compiled languages to any great degree, and actually I have little desire to do so. After all, I’m a lazy Perl programmer, and I’ve been that way for almost twenty years.

The second class of applications are Web apps. In reality, these things are simply sets of HTML pages specifically designed for mobile devices. These “applications” have the advantage of being operating system independent but are dead in the water without a robust network connection. These applications, in order to be interactive and meet user expectations, also need to take full advantage of CSS and Javascript, and when it comes to Javascript it becomes imperative to learn and understand how to do AJAX and AJAX-like data acquisition. If I want to provide information services through mobile devices, then the creation of Web apps seems much more feasible. I know how to create well-formed and valid HTML. I can employ the classic LAMP stack to do any hard-core computing. There are a growing number of CSS frameworks making it easy to implement the mobile interface. All I have to do is learn Javascript, and this is not nearly as difficult as it used to be with the emergence of Javascript debuggers and numerous Javascript libraries. For me, Web apps seem to be the way to go.

Over the past couple of years I went out and purchased the following books to help me learn how to create Web apps. Each of them is briefly described below, but first, here’s a word about WebKit. There are at least three rendering engines driving the majority of Web browsers these days: Gecko, which is the heart of Firefox; WebKit, which is the heart of Safari and Chrome; and whatever Microsoft uses as the heart of Internet Explorer. Since I do not own any devices that run the Android or Windows operating systems, all of my development is limited to Gecko- or WebKit-based browsers. Luckily, WebKit seems to be increasing in popularity, and this makes it easier for me to rationalize my development for the iPhone, iPad, and iPod Touch. The books reviewed below also lean in this direction.

  • Beginning iPhone And iPad Web Apps (2010, 488 pgs.) by Chris Apers and Daniel Paterson – This is one of my more recent purchases, and I think I like this book the best. First and foremost, it is the most agnostic of all the books, even though some of the examples use WebKit. True to its title, it describes the use of HTML5, CSS, and Javascript to implement mobile interfaces. This includes whole chapters on the use of vector graphics and fonts, audio and video content, special effects with (WebKit-specific) CSS, touch and gesture events with Javascript, location-aware programming, and client-side data storage. Moreover, this book is the best of the bunch when it comes to describing how mobile interfaces are different from browser-based interfaces. Mobile interfaces are not just smaller versions of their older siblings! If you are going to buy one book, then buy this one. I think it will serve you for the longest period of time.
  • Building iPhone Apps With HTML, CSS, and Javascript (2010, 166 pgs.) by Jonathan Stark – Being shorter than the previous book, this one is not as thorough but still covers all the bases. On the other hand, unlike the previous title, it does describe how to use a Javascript library for mobile (JQTouch), and how to use PhoneGap to convert a Web app into a native application with many of the native application benefits. This book is a quick read and a good introduction.
  • Dashcode For Dummies (2011, 436 pgs.) by Jesse Feiler – Dashcode is a development environment originally designed to facilitate the creation of Macintosh OS X dashboard widgets. As you may or may not know, these widgets are self-contained HTML/Javascript/CSS files intended to support simple utility functions. Tell the time. Display the weather. Convert currencies. Render XML files. Etc. Dashcode evolved and now enables the developer to create Web apps for the Macintosh family of i-devices. I bought this book because I own these devices, and I thought the book might help me exploit their particular characteristics. It does not. Dashcode includes no internal links to the underlying hardware. This book describes how to use Dashcode very well, but Dashcode applications are not really the kind I want to create. I suppose I could use Dashcode to create the skin of my application but the overhead may be excessive and the result may be too device dependent.
  • Developing Hybrid Applications For The iPhone (2009, 195 pgs.) by Lee S. Barney – By introducing the idea of a “hybrid” application, this book picks up where the Dashcode book left off. It does this by describing two Javascript packages (QuickConnectiPhone and PhoneGap) allowing the developer to interact with the underlying hardware. I’ve read this book a couple of times, I’ve looked over it a few more, and in the end I am still challenged. I’m excited about accessing things like the hardware’s camera, GPS functionality, and file system, but after reading this book I’m still confused about how to actually do it. The content of this book is an advanced topic to be tackled after the basics have been mastered.
  • Safari And WebKit Development For iPhone OS 3.0 (2010, 383 pgs.) by Richard Wagner – This book is practical, and the one I relied upon the most, but only before I bought Beginning iPhone And iPad Web Apps. It gives an overview of WebKit, Javascript, and CSS. It advocates Web app frameworks like iUI, iWebKit, and UIUIKit. It describes how to design interfaces for the small screen of the iPhone and iPod Touch. It has a chapter on the specific Javascript events supported by the iPhone and iPod Touch. Like a couple of the other books, it describes how to use the HTML5 canvas to render graphics. I was excited to learn how to interact with the phone, maps, and SMS functions of the devices, but learned that this is done simply through specialized URLs. When the book talks about “offline applications” it is really talking about local database storage — another feature of HTML5. A couple things I should have explored but haven’t yet include bookmarklets and data URLs. The book describes how to take advantage of these concepts. This book is really a second edition of a similar book with a different title written by the same author in 2008. Its content is not as current as it could be, but the fundamentals are there.

Based on the things I’ve learned from these books, I’ve created several mobile interfaces. Each of them deserves its own blog posting, so I will only outline them here:

  1. iMobile – A rough mobile interface to much of the Infomotions domain. Written a little more than a year ago, it combines backend Perl scripts with the iUI Javascript framework to render content. Now that I look back on it, the hacks there are pretty impressive, if I do say so myself. Of particular interest is the image gallery which gets its content from OAI-PMH data stored on the server, and my water collection which reads an XML file of my own design and plots where the water was collected on a Google map. iMobile was created from the knowledge I gained from Safari And WebKit Development For iPhone OS 3.0.
  2. DH@ND – The home page for a fledgling initiative called Digital Humanities at the University of Notre Dame. The purpose of the site is to support sets of tools enabling students and scholars to simultaneously do “close reading” and “distant reading”. It was built using the principles gleaned from the books above combined with a newer Javascript framework called JQueryMobile. There are only two things presently of note there. The first is Alex Lite for Mobile, a mobile interface to a tiny catalogue of classic novels. Browse the collection by author or title. Download and read selected books in ePub, PDF, or HTML formats. The second is Geo-location. After doing named-entity extraction against a limited number of classic novels, this interface displays a word cloud of place names. The user can then click on place names and have them plotted on a Google Map.

Remember, the sites listed above are designed for mobile devices, primarily ones driven by the WebKit engine. If you don’t use a mobile device to view the sites, then your mileage will vary.

[Screenshots: Image Gallery, Water Collection, Alex Lite, Geo-location]

Web app development is beyond a trend. It has all but become an expectation. Web app implementation requires an evolution in thinking about Web design as well as an additional skill set which includes advanced HTML, CSS, and Javascript. These are not your father’s websites. There are a number of books out there that can help you learn about these topics. Listed above are just a few of them.

Ruler & Compass by Andrew Sutton

I recently read, and most thoroughly enjoyed learning from, a book called Ruler & Compass by Andrew Sutton.

The other day, while perusing the bookstore for a basic statistics book, I came across Ruler & Compass by Andrew Sutton. Having always been intrigued by geometry and the use of only a straight edge and compass to describe a Platonic cosmos, I purchased this very short book, a ruler, and a compass with little hesitation. I then rushed home to draw points, lines, and circles for the purposes of constructing angles, perpendiculars, bisected angles, tangents, all sorts of regular polygons, and combinations of all the above to create beautiful geometric patterns. I was doing mathematics, but not a single number was to be seen. Yes, I did create ratios, but not with integers; instead, with the inherent lengths of lines. Fascinating!

[Constructions: triangle, square, pentagon, hexagon, ellipse, “golden” ratio]

Geometry is a lot like both music and computer programming. All three supply the craftsman with a set of basic tools. Points. Lines. Circles. Tones. Durations. Keys. If-then statements. Variables. Outputs. Given these “things” a person is empowered to combine, compound, synthesize, analyze, create, express, and describe. They are media for both the artist and the scientist. Using them effectively requires thinking as well as “thinquing”. All three are arscient processes.

Anybody could benefit from reading Sutton’s book and spending a few lovely hours practicing the geometric constructions contained therein. I especially recommend this activity to my fellow librarians. The process is not only intellectually stimulating but invigorating. Librarianship is not all about service or collections. It is also about combining and reconstituting core principles — collection, organization, preservation, and dissemination. There is an analogy waiting to be seen here. Reading and doing the exercises in Ruler & Compass will make it plainly visible.

Book review of Larry McMurtry’s Books

I read with interest Larry McMurtry’s Books: A Memoir (Simon & Schuster, 2008), but I would be lying if I said I thought the book had very much to offer.

The book’s 259 pages are divided into 109 chapters. I was able to read the whole thing in six or seven sittings. It is an easy read, but only because the book doesn’t say very much. I found the stories rarely engaging and never very deep. They were full of obscure book titles and the names of “famous” book dealers.

Much of this should not be a surprise, since the book is about one person’s fascination with books as objects, not books as containers of information and knowledge. From page 38 of my edition:

Most young dealers of the Silicon Chip Era regard a reference library as merely a waste of space. Old-timers on the West Coast, such as Peter Howard of Serendipity Books in Berkeley or Lou and Ben Weinstein of the (recently closed) Heritage Books Shop in Los Angeles, seem to retain a fondness of reference books that goes beyond the practical. Everything there is to know about a given volume may be only a click away, but there are still a few of us who’d rather have the book than the click. A bookman’s love of books is a love of books, not merely the information in them.

Herein lies the root of my real problem with the book: it shares with the reader one person’s chronology of a love of books and book selling. It describes various used bookstores and gives you an idea of what it is like to be a book dealer. Unfortunately, I believe McMurtry misses the point about books. They are essentially a means to an end. A tool. A medium for the exchange of ideas. The ideas they contain and the way they contain them are the important thing. There are advantages & disadvantages to the book as a technology, and these advantages & disadvantages ought not to be revered or exaggerated in order to dismiss the use of books or computers.

I also think McMurtry’s perception of libraries, which seems to be commonly held in and outside my profession, points to one of librarianship’s pressing issues. From page 221:

But they [computers] don’t really do what books do, and why should they usurp the chief function of a public library, which is to provide readers access to books? Books can accommodate the proximity of computers but it doesn’t seem to work the other way around. Computers now literally drive out books from the place they should, by definition, be books’ own home: the library.

Is the chief function of a public library to provide readers access to books? Are libraries defined as the “home” of books? Such a perception may have been more or less true in an environment where data, information, and knowledge were physically manifested, but in an environment where access to information is increasingly digital, the book as a thing is not as important. Books are not central to the problems to be solved.

Can computers do what books do? Yes and no. Computers can provide access to information. They make it easier to “slice and dice” their content. They make it easier to disseminate content. They make information more findable. The information therein is trivial to duplicate. On the other hand, books require very little technology. They are relatively independent of other technologies, and therefore they are much more portable. Books are easy to annotate. Just write on the text or scribble in the margin. A person can browse the contents of a book much faster than the contents of electronic text. Moreover, books are owned by their keepers, not licensed, as is increasingly the case with digitized material. There are advantages & disadvantages to both computers and books. One is not necessarily better than the other. Each has its place.

As a librarian, I had trouble with the perspectives of Larry McMurtry’s Books: A Memoir. It may be illustrative of the perspectives of book dealers, book sellers, etc., but I think the perspective misses the point. It is not so much about the book as much as it is about what the book contains and how those contents can be used. In this day and age, access to data and information abounds. This is a place where libraries increasingly have little to offer because libraries have historically played the role of middleman. Producers of information can provide direct access to their content much more efficiently than libraries. Consequently a different path for libraries needs to be explored. What does that path look like? Well, I certainly have ideas about that one, but that is a different essay.

Text mining: Books and Perl modules

This posting simply lists some of the books I’ve read and Perl modules I’ve explored in regards to the field of text mining.

Through my explorations of term frequency/inverse document frequency (TFIDF) I became aware of a relatively new field of study called text mining. In many ways, text mining is similar to data mining, only applied to unstructured texts instead of database rows and columns. Think plain text books such as items from Project Gutenberg or the Open Content Alliance. Text mining is a process including automatic classification, clustering (similar to but distinct from classification), indexing and searching, entity extraction (names, places, organizations, dates, etc.), statistically significant keyword and phrase extraction, parts of speech tagging, and summarization.
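
For those who have not bumped into it, the arithmetic behind TFIDF is modest: a term’s frequency within a document multiplied by the logarithm of the number of documents divided by the number of documents containing the term. A sketch in Perl, assuming a small corpus of placeholder plain text files, might look like this:

  #!/usr/bin/perl
  # sketch: rank the words of one document by TFIDF against a tiny corpus
  use strict;
  use warnings;

  my @corpus = ( 'walden.txt', 'essays.txt', 'emma.txt' );   # placeholder file names

  # tally term frequencies per document and document frequencies across the corpus
  my ( %tf, %df );
  foreach my $file ( @corpus ) {
      open my $fh, '<', $file or die "Can't open $file: $!";
      my %seen;
      while ( <$fh> ) {
          foreach my $word ( map { lc } grep { /\w/ } split /\W+/ ) {
              $tf{ $file }{ $word }++;
              $df{ $word }++ unless $seen{ $word }++;
          }
      }
      close $fh;
  }

  # tfidf = term frequency * log( number of documents / document frequency )
  my $doc   = $corpus[ 0 ];
  my %tfidf = map { $_ => $tf{ $doc }{ $_ } * log( @corpus / $df{ $_ } ) } keys %{ $tf{ $doc } };

  # the highest scoring words hint at the "aboutness" of the document
  my @keywords = ( sort { $tfidf{ $b } <=> $tfidf{ $a } } keys %tfidf )[ 0 .. 9 ];
  print join( ', ', grep { defined } @keywords ), "\n";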

As a librarian, I found the whole thing extremely fascinating; consequently, I read more.

Books

I have found the following four books helpful. They have enabled me to learn about the principles of text mining.

  • Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. – Of all the books listed here, this one includes the most Perl programming examples, and it is not as scholarly as the balance of the list. Much of the book surrounds the description of regular expressions against texts. Its strongest suit is the creation of terminal-based concordance scripts. Very nice. Lots of fun. The concordances return very interesting results. The book does describe clustering techniques too, but on the overall topic of automatic metadata generation the book is not very strong.
  • Konchady, M. (2006). Text mining application programming. Charles River Media programming series. Boston, Mass: Charles River Media. – This book is a readable survey of text mining covering parts of speech (POS) tagging, information extraction, search engines, clustering, classification, summarization, and question/answer processing. Many models for each aspect of text mining are described, compared, and contrasted. To put the author’s knowledge into practice, the book comes with a CD containing a Perl library for text mining, sample applications, and CGI scripts. This library is freely available on the Web.
  • Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. – Of the four books listed here, this one is probably the most dense. I found its Perl scripts used to parse text more useful than the ones in Bilisoly, but this one included no concordance applications. I also found the description of n-grams (the extraction of multi-word phrases) to be very interesting. I suspect the model they describe can be extended to any number of words. This book also discusses parts of speech (POS) processing, but it is the only one that describes how to really parse language. Think semantics, lexicons, discourse, and dialog. After the first couple of chapters the Perl examples disappear and give way to exclusively Prolog examples.
  • Weiss, S. M. (2005). Text mining: Predictive methods for analyzing unstructured information. New York: Springer. – The complexity of this book lies between Konchady and Nugues; it includes a greater number of mathematical models than Konchady, but it is easier to read than Nugues. Broad topics include textual documents as numeric vectors, using text for prediction, information retrieval, clustering & classification, and looking for information in documents. Each chapter includes a section called “Historical and Bibliographical Remarks” which has proved to be very interesting reading.

When it comes to the process of text mining, I found each of these books useful in its own right. Each provided me with ways to read texts, parse texts, count words, count phrases, and, through the application of statistical analysis, create lists and readable summaries denoting the “aboutness” of given documents.

Perl modules

As a Perl hacker I am interested in writing scripts putting into practice some of the things I learn. Listed here are a number of modules that have gotten me further along in regard to text mining:

  • Lingua::EN::Fathom – This library outputs interesting statistics regarding a given document: the number of words and the number of times each occurs, the number of sentences, the complexity of words, the number of paragraphs, etc. Of greatest interest are the numbers (Fog, Flesch, and Flesch-Kincaid) denoting the readability of the text. Quick. Easy. Useful. (A sketch using this module appears after this list.)
  • Lingua::EN::Keywords – Given a text, this library outputs a list of what it thinks are the most significant individual words in a document, sans stop words. Not fancy.
  • Lingua::EN::NamedEntity – Given a text, I believe this library comes pre-trained to extract names, places, and organizations from texts. It returns a Perl data structure listing the probabilities of a word or phrase being any particular entity. It may need to be re-trained to work for your corpus.
  • Lingua::EN::Semtags::Engine – Given a text, this module will return words and phrases in a relevancy ranked order. Initially, I have had some problems using this module because it seems to take a long time to return. On the other hand, it looks promising since it returns both individual words as well as phrases.
  • Lingua::EN::Summarize – Given a text, this library returns sentences it thinks encapsulate the essence of the document. The result is readable — grammatically correct. The process it uses to accomplish its task is self-proclaimed as unscientific.
  • Lingua::EN::Tagger – This library marks up a document in pseudo XML with tags denoting parts of speech in a given document. To do this work it also can extract words, noun phrases, and sentences from a text. Zippy. Probability-based. Developers are expected to parse the tagged output and do analysis against it, such as count the number of times particular parts of speech occur.
  • Lingua::StopWords – Returns a simple list of stop words. Easy, but I can’t figure out how customizable it is. “One person’s stop word list is another person’s research topic.”
  • Net::Dict – A network interface to DICT (dictionary) servers. While the DICT protocol is a bit long in the tooth, and not quite as cool as Web interfaces to things like Google or Wikipedia, this module does provide a handy way to look up definitions, a functionality complementary to WordNet.
  • Text::Aspell – A Perl interface to GNU Aspell which is great for spell-checking applications.
  • TextMine – This is a set of modules written by Manu Konchady, the author of Text Mining Application Programming. It includes submodules named Cluster, Entity, Index, Pos, Quanda (Q & A), Summary, Tokens, and WordNet. This set of modules is the most comprehensive I’ve seen, and probably the most theoretically grounded, interfacing with things like WordNet to be thorough. Still, my initial experience has been a bit frustrating since scripts written against the libraries do not run very quickly. Maybe I’m feeding them documents that are too large, and if so, then the libraries are not necessarily scalable.
  • WordNet – There are a bevy of modules providing functionality against WordNet — a “lexical database of English… Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.” Any truly thorough text mining application of English will take advantage of WordNet.
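
To give a flavor of how a couple of these modules are used, here is a minimal sketch combining Lingua::EN::Fathom and Lingua::StopWords. The file name comes from the command line, and the crude keyword counting at the end is my own doing, not a feature of either module:

  #!/usr/bin/perl
  # sketch: report readability statistics and a crude keyword list for a plain text file
  use strict;
  use warnings;
  use Lingua::EN::Fathom;
  use Lingua::StopWords qw( getStopWords );

  my $file = shift or die "Usage: $0 <file>\n";

  # readability, courtesy of Lingua::EN::Fathom
  my $fathom = Lingua::EN::Fathom->new;
  $fathom->analyse_file( $file );
  printf "words: %d  sentences: %d\n", $fathom->num_words, $fathom->num_sentences;
  printf "fog: %.1f  flesch: %.1f  kincaid: %.1f\n", $fathom->fog, $fathom->flesch, $fathom->kincaid;

  # crude keywords: count the words myself, sans stop words
  my $stopwords = getStopWords( 'en' );
  my %count;
  open my $fh, '<', $file or die "Can't open $file: $!";
  while ( <$fh> ) {
      $count{ $_ }++ foreach grep { /\w/ and not $stopwords->{ $_ } } map { lc } split /\W+/;
  }
  close $fh;
  my @keywords = ( sort { $count{ $b } <=> $count{ $a } } keys %count )[ 0 .. 24 ];
  print join( ' ', grep { defined } @keywords ), "\n";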

Text mining and librarianship

Given the volume of “born digital” material being created, it is not possible to apply traditional library methods against it. The hand-crafted, heavy human touch process is not scalable. Given the amounts of mass-digitized text being generated from the Google Books Project and the Open Content Alliance, new opportunities for literary analysis make themselves evident. Again, the traditional library processes cannot fill the bill in these regards.

Text mining techniques offer possible solutions to these problems. Count words. Count phrases. Compare these words, phrases, and counts to other texts. Determine their statistical significance. Assign them to documents in the form of subject headings, keywords, author names, and other added entries in our metadata formats. Given large numbers of books, articles, and other “wordy” documents, learn how to “save the time of the reader” by summarizing these documents and ranking them in some sort of order beyond the alphabetical or chronological. Compare and contrast full text works by learning what words and types of words are used in documents. Are the words religious in nature? Mathematic and scientific? Poetic? Such things will provide additional means for understanding and interpreting everything from scholarly journal articles to works of classic fiction and philosophy. These techniques are not intended to replace existing methods of understanding and organization, but rather to supplement and build upon them. This is an evolutionary process.

If libraries and librarians desire to remain relevant in the evolving information environment, then they will need to do the good work they do differently. The problem to be solved nowadays is less about access and more about use. Text mining is one way of making the content of libraries more useful.

Ralph Waldo Emerson’s Essays

It was with great anticipation that I read Ralph Waldo Emerson’s Essays (both the First Series as well as the Second Series), but my expectations were not met. In a sentence: I thought Emerson used too many words to say things that could have been expressed more succinctly.

The Essays themselves are a set of unsystematic short pieces of literature describing what one man thinks of various classic themes, such as but not limited to: history, intellect, art, experience, gifts, nature, etc. The genre itself — the literary essay or “attempts” — was apparently first popularized by Montaigne and mimicked by other “great” authors in the Western tradition including Bacon, Rousseau, and Thoreau. Considering this, maybe the poetic and circuitous nature of Emerson’s “attempts” should not be considered a fault.

Art

Because it was evident that later essays did not necessarily build on previous ones, I jumped around from chapter to chapter as whimsy dictated. Probably one of the first I read was “Art” where he describes the subject as the product of men detached from society.

It is the habit of certain minds to give an all-excluding fulness to the objects, the thought, the world, they alight upon, and to make that for the time the deputy of the world. These are the artists, the orators, the leaders of society. The power to detach and to magnify by detaching, is the essence of rhetoric in the hands of the orator and the poet.

But at the same time he seems to contradict himself earlier when he says:

No man can quite emancipate himself from the age and country, or produce a model in which the education, the religion, the politics, usages, and arts, of his times shall have no share. Though he were never so original, never so wilful and fantastic, he cannot wipe out of his work every trace of the thoughts amidst which it grew.

How can something be the product of someone detached from society when it is not possible to become detached in the first place?

Intellect

I, myself, being a person of mind more than heart, was keenly interested in the essay entitled “Intellect” where Emerson describes it as something:

…void of affection, and sees an object as it stands in the light of science, cool and disengaged… Intellect pierces the form, overlaps the wall, detects intrinsic likeness between remote things, and reduces all things into a few principles.

At the same time, intellect is not necessarily genius, since genius also requires spontaneity:

…but the power of picture or expression, in the most enriched and flowing nature, implies a mixture of will, a certain control over the spontaneous states, without which no production is possible. It is a conversion of all nature into the rhetoric of thought under the eye of judgement, with the strenuous exercise of choice. And yet the imaginative vocabulary seems to be spontaneous also. It does not flow from experience only or mainly, but from a richer source. Not by any conscious imitation of particular forms are the grand strokes of the painter executed, but by repairing to the fountain-head of all forms in his mind.

The Poet

Emerson apparently carried around his journal wherever he went. He made a living writing and giving talks. Considering this, and considering the nature of his writing, I purposely left his essay entitled “The Poet” until last. Not surprisingly, he had a lot to say on the subject, and I found this to be the highlight of my readings:

The poet is the person in whom these powers [the reproduction of senses] are in balance, the man without impediment, who sees and handles that which others dream of, traverses the whole scale of experience, and is representative of man, in virtue offering the largest power to receive and to impart… The poet is the sayer, the namer, and represents beauty… The poet does not wait for the hero or the sage, but as they act and think primarily, so he writes primarily what will and must be spoken, reckoning the others, though primaries also, yet, in respect to him, secondaries and servants.

I found it encouraging that science was mentioned a few times during his discourse on the poet, since I believe a better understanding of one’s environment comes from the ability to think both artistically as well as scientifically, an idea I call arscience:

…science always goes abreast with the just elevation of the man, keeping step with religion and metaphysics; or, the state of science is an index of our self-knowledge… All the facts of the animal economy, — sex, nutriment, gestation, birth, growth — are symbols of passage of the world into the soul of man, to suffer there a change, and reappear a new and higher fact. He uses forms according to the life, and not according to the form. This is true science.

Back to the beginning

I think Emerson must have been a bit frustrated with his search for truth (or was belittling himself in order to be perceived as more believable) when he says, “I look in vain for the poet whom I describe.” But later on he summarizes much of what the Essays describe when he says, “Art is the path of the creator to his work,” and he then goes on to say what I said at the beginning of this review:

The poet pours out verses in every solitude. Most of the things he says are conventional, no doubt; but by and by he says something which is original and beautiful. That charms him.

I was hoping to find more inspiration regarding the definition of Unitarianism throughout the book, but alas, the term was only mentioned a couple of times. Instead, I learned, more indirectly, that Emerson has affected my thinking in subtle ways. I have incorporated much of his thought into my own without knowing it. Funny how one’s education manifests itself.

Word cloud

Use this word cloud of the combined Essays to get an idea of what they are “about”:

nature  men  life  world  good  shall  soul  great  thought  like  love  power  know  let  mind  truth  make  society  persons  day  old  character  heart  genius  god  come  beauty  law  being  history  fact  true  makes  work  virtue  better  art  laws  self  form  right  eye  best  action  poet  friend  think  feel  eyes  beautiful  words  human  spirit  little  light  facts  speak  person  state  natural  intellect  sense  live  force  use  seen  thou  long  water  people  house  certain  individual  end  comes  whilst  divine  property  experience  look  forms  hour  read  place  present  fine  wise  moral  works  air  poor  need  earth  hand  common  word  thy  conversation  young  stand  

And since a picture is worth a thousand words, here is a simple graph illustrating how the 100 most frequently used words in the Essays (sans stop words) compare to one another:

[Graph: frequencies of the 100 most common words in Emerson’s Essays]

Henry David Thoreau’s Walden

As I sit here beside my fire at the cabin, I reflect on the experiences documented by Henry David Thoreau in his book entitled Walden.

Being human

On one level, the book is about a man who goes off to live in a small cabin by a pond named Walden. It describes how he built his home, tended his garden, and walked through the woods. On another level, it is a collection of self-observations and reflections on what it means to be human. “I went to the woods because I wished to live deliberately, to front only the essential facts of life, and see if I could not learn what it has to teach, and not, when I came to die, discover that I had not lived… I wanted to live deep and suck out all the marrow of life, to live so sturdily and Spartan-like as to put to rout all that was not life, to cut a broad swath and shave close, to drive life into a corner, and reduce it to its lowest terms, and, if it proved to be mean, why then to get the whole and genuine meanness of it, and publish its meanness to the world.”

Selected chapters

The book doesn’t really have a beginning, a middle, and an end. There is no hero, no protagonist, no conflict, and no climax. Instead, the book is made up of little stories amassed over the period of one and a half years while living alone. The chapter called “Economy” is an outline of the necessities of life such as clothing, shelter, and food. It cost him $28 to build his cabin, and he grew much of his own food. “Yet men have come to such a pass that they frequently starve not for want of necessities, but for want of luxuries.”

I also enjoyed the chapter called “The Bean-Field”. “I have come to love my rows, my beans, though so many more than I wanted.” Apparently he had as many as seven miles of beans, if they were all strung in a row. Even over two acres of ground, I find that hard to believe. He mentions woodchucks often in the chapter, as well as throughout the book, and he dislikes them because they eat his crop. I always thought woodchucks — ground hogs — were particularly interesting since they were abundant around the property where I grew up. In relation to economy, Thoreau spent just less than $14 on gardening expenses, and after selling his crop made a profit of almost $9. “Daily the beans saw me come to their rescue armed with a hoe, and thin their ranks of the enemies, filling up the trenches with weedy dead.”

The chapter called “Sounds” is full of them or allusions to them: voice, rattle, whistle, scream, shout, ring, announce, hissing, bells, sung, lowing, serenaded, music chanted, cluck, buzzing, screech, wailing, trilled, sighs, hymns, threnodies, gurgling, hooting, baying, trump, bellowing, crow, bark, laughing, cackle, creaking, and snapped. Almost a cacophony, but at the same time a possible symphony. It depends on your perspective.

While he lived alone, he was never seemingly lonely. In fact, he seemed to attract visitors or sought them out himself. Consider the wood chopper who was extra skilled at his job. Reflect on the Irish family who lived “rudely”. Compare and contrast the well-to-do professional with manners to the man who lived in a hollow log. (I wonder whether or not that second man really existed.)

Thoreau’s descriptions of the pond itself were arscient. [1] He describes its color, its depth, and its overall size. He ponders where it got its name, its relation to surrounding ponds, and where its water comes from and goes. He fishes in it regularly, and walks upon its ice in the winter. He describes how men harvest its ice and how the pond keeps most of the effort. He appreciates the appearance of the pond as he observes it during different times of year as well as from different vantage points. In my mind, it is a good thing to observe anything and just about everything from many points of view, both literally and figuratively.

Conclusion

The concluding chapter has a number of meaty thoughts. “I left the woods for as good a reason as I went there. Perhaps it seemed to me that I had several more lives to live, and could not spare any more time for that one… I learned this, at least, by my experiment: that if one advances confidently in the direction of his dreams, and endeavors to live the life which he has imagined, he will meet with a success unexpected in common hours… If a man does not keep pace with his companions, perhaps it is because he hears a different drummer. Let him step to the music which he hears, however measured or far away… However mean your life is, meet it and live it; do not shun it and call it hard names… Love your life, poor as it is… Rather than love, than money, than fame, give me truth.”

Word cloud

As a service against the text, and as a means to learning about it more quickly, I give you the following word cloud (think “concordance”) complete with links to the places in the text where the words can be found:

life  pond  most  house  day  though  water  many  time  never  about  woods  without  much  yet  long  see  before  first  new  ice  well  down  little  off  know  own  old  nor  good  part  winter  far  way  being  last  after  heard  live  great  world  again  nature  shore  morning  think  work  once  same  walden  thought  feet  spring  earth  here  perhaps  night  side  sun  things  surface  few  thus  find  found  summer  must  true  got  also  years  village  enough  myself  half  poor  seen  air  better  put  read  till  small  within  wood  cannot  fire  ground  deep  end  bottom  left  nothing  went  away  place  almost  least  

Note

[1] Arscience — art-science — is a term I use to describe a way of thinking incorporating both artistic and scientific elements. Arscient thinking is poetic, intuitive, free-flowing, and at the same time it is systematic, structured, and repeatable. To my mind, a person requires both in order to create a cosmos from the apparent chaos of our surroundings.

ASIS&T Bulletin on open source software

The following is a verbatim duplication of an introduction I wrote for a special issue of the ASIS&T Bulletin on open source software in libraries. I appreciate the opportunity to bring the issue together because I sincerely believe open source software provides a way for libraries to have more control over their computing environment. This is especially important for a profession that is about learning, teaching, scholarship, data, information, and knowledge. Special thanks goes to Irene L. Travis who brought the opportunity to my attention. Thank you.

Open Source Software in Libraries

It is a privilege and an honor to be the guest editor for this special issue of the Bulletin of the American Society for Information Science and Technology on open source software. In it you will find a number of articles describing open source software and how it has been used in libraries. Open source software or free and open source software is defined and viewed in a variety of ways, and the definition will be refined and enriched by our authors. However, very briefly, for those readers unfamiliar with it, open source software is software that is distributed under one of a number of licensing arrangements that (1) require that the software’s source code be made available and accessible as part of the package and (2) permit the acquirer of the software to modify the code freely to fit their own needs, provided that (3) if they distribute the software modifications they create, they do so under an open source license. If these basic elements are met, there is no requirement that the resulting software be distributed at no cost or non-commercially, although much widely used open source software such as the web browser Firefox is also distributed without charge.

In This Issue

The articles begin with Scot Colford’s “Explaining Free and Open Source Software,” in which he describes how the process of using open source software is a lot like baking a cake. He goes on to outline how open source software is all around us in our daily computing lives.

Karen Schneider’s “Thick of the Fray” lists some of the more popular open source software projects in libraries and describes how these sorts of projects would not have been nearly as feasible in an era without the Internet.

Marshall Breeding’s “The Viability of Open Source ILS” provides a balanced comparison between open source software integrated library systems and closed source software integrated library systems. It is a survey of the current landscape.

Bob Molyneux’s “Evergreen in Context” is a case study of one particular integrated library system, and it is a good example of the open source adage “scratching an itch.”

In “The Development and Usage of the Greenstone Digital Library Software,” Ian Witten provides an additional case study but this time of a digital library application. It is a good example of how many different types of applications are necessary to provide library service in a networked environment.

Finally, Thomas Krichel expands the idea of open source software to include open data and open libraries. In “From Open Source to Open Libraries,” you will learn that many of the principles of librarianship are embodied in the principles of open source software. In a number of ways, librarianship and open source software go hand-in-hand.

What Is Open Source Software About?

Open source software is about quite a number of things. It is about taking more complete control over one’s computer infrastructure. In a profession that is a lot about information, this sort of control is increasingly necessary. Put another way, open source software is about “free.” Not free as in gratis, but free as in liberty. Open source software is about community – the type of community that is only possible in a globally networked computer environment. There is no way any single vendor of software will be able to gather together and support all the programmers that a well-managed open source software project can support. Open source software is about opportunity and flexibility. In our ever-dynamic environment, these characteristics are increasingly important.

Open source software is not a panacea for libraries, and while it does not require an army of programmers to support it, it does require additional skills. Just as all libraries – to some degree or another – require collection managers, catalogers and reference librarians, future-thinking libraries require people who are knowledgeable about computers. This background includes knowledge of relational databases, indexers, data formats such as XML and scripting languages to glue them together and put them on the web. These tools are not library-specific, and all are available as open source.

Through reading the articles in this issue and discussing them with your colleagues, you should become more informed regarding the topic of open source software. Thank you for your attention and enjoy.

On the move with the Mobile Web

On The Move With The Mobile Web by Ellyssa Kroski provides a nice overview of mobile technology and what it presently means for libraries.

What is in the Report

In my most recent list of top technology trends I mentioned mobile devices. Because of this, Kroski had a copy of the Library Technology Report she authored, above, sent to me. Its forty-eight pages essentially consist of six chapters (articles) on the topic of the Mobile Web:

  1. What is the Mobile Web? – An overview of Web technology and its use on hand-held, portable devices. I liked the enumeration of Mobile Web benefits such as: constant connectivity, location-aware services, limitless access, and interactive capabilities. Also, texting was described here as a significant use of the Mobile Web. Ironically, I sent my first text message just prior to the 2008 ALA Annual Meeting.
  2. Mobile devices – A listing and description of the hardware, software (operating systems as well as applications), networks, and companies working in the sphere of the Mobile Web. Apparently three companies (Verizon, AT&T, and Sprint Nextel) have 70% of the market share in terms of network accessibility in the United States.
  3. What can you do with the Mobile Web? – Another list and description but this time of application types: email, text messaging, ringtones & wallpaper, music & radio, software & games, instant messaging, social networking, ebooks, social mapping networks (sort of scary if you ask me), search, mapping, audiobooks, television, travel, browsers, news, blogging, food ordering, and widgets.
  4. Library mobile initiatives – A listing and description of what some libraries are doing with the Mobile Web. Ball State University’s Mobile Web presence seems to be out in front in this regard, and PubMed seems pretty innovative as well. For some commentary regarding iPhone-specific applications for libraries see Peter Brantley’s “The Show Room Library“.
  5. How to create a mobile experience – This is more or less a set of guidelines for implementing Mobile Web services. Some of the salient points include: it is about providing information to people who don’t have a computer; think a lot about location-based services; and understand the strengths & weaknesses of the technology. I found this chapter to be the most useful.
  6. Getting started with the Mobile Web – A list of fun things to do to educate yourself on what the Mobile Web can do.

Each chapter is complete with quite a number of links and citations for further reading.

Cellphone barcodes

Through my reading of this Report my knowledge of the Mobile Web increased. The most interesting thing I learned was the existence of Semapedia, a project that “strives to tag real-world objects with 2D barcodes that can be read by camera phones.” Go to Semapedia. Enter a Wikipedia URL. Get back a PDF document containing “barcodes” that your cellphone should be able to read (with the appropriate application). Label real-world things with the barcode. Scan the code with your cellphone. See a Wikipedia article describing the thing. Interesting. Below is one of these barcodes for the word “blog” which links to the Mobile Web-ready Wikipedia entry on blogs:

[Semapedia barcode for the word “blog”]

Read the report

I still believe the Mobile Web is going to play a larger role in people’s everyday lives. (Duh!) By extension, I believe it is going to play a larger role in libraries. Ellyssa Kroski’s On The Move With The Mobile Web will give you a leg up on the technology.

TPM — technological protection measures

I learned a new acronym a few weeks ago — TPM — which stands for “technological protection measures”. In the May 2008 issue of College & Research Libraries, Kristin R. Eschenfelder wrote an article called “Every library’s nightmare?” enumerating various types of protection measures employed by publishers to impede the use of electronic scholarly material.

Types of restrictions

In today’s environment, where digital information is increasingly bought, sold, and/or licensed, publishers feel the need to protect their product from duplication. As described by Eschenfelder, these protections — restrictions — come in two forms: soft and hard.

Soft restrictions are “configurations of hardware or software that make certain uses such as printing, saving, copy/pasting, or e-mailing more difficult — but not impossible — to achieve.” The soft restrictions have been divided into the following subtypes:

  • extent of use – page print limits; PDF download limits; data export limits; suspicious use tracking
  • obfuscation – need to select items before options become available
  • omission – not providing buttons or links to enact certain uses
  • decomposition – saving a document results in many files, making recreating or e-mailing the document difficult
  • frustration – page chunking in e-books
  • warning – copyright warnings; end-user licenses on startup

Hard restrictions are “configurations of software or hardware that strictly prevent certain uses.” The hard restrictions have been divided into the following subtypes:

  • restricted copy and paste OCR – OCR exposed for searching, but not for copying and pasting of text
  • secure container TPM – use rights vary by resource

To investigate what types of restrictions were put into everyday practice, Eschenfelder studied a total of about seventy-five resources from three different disciplines (engineering, history, art history) and tallied the types of restrictions employed.

Salient quotes

A few salient quotes from the article exemplify Eschenfelder’s position on TPM:

  • “This paper suggests that the soft restrictions that are present in licensed products may have already changed users’ and librarians’ expectations about the use rights they ought to expect from vendors and their products.” (Page 207)
  • “One concern is that the library community has already accepted many of the soft use restrictions identified in this paper.” (Page 219)
  • “[Librarians] should also advocate for removal of use restrictions, or encourage new vendors to offer competing restriction-free products.” (Page 219)
  • “A more realistic solution might be a shared knowledge base of vendor interfaces and known use restrictions.” (Page 219)
  • “The paper argues that soft use restrictions deserve more attention from the library community, and that librarians should not accept these restrictions as the natural order of things.” (Page 220)

My commentary

I agree with Eschenfelder.

Many people who work in libraries seem to be there because of the values libraries embody. Examples include but are not limited to: intellectual freedom, education, diversity, equal access to information, preservation of the historical record for future generations, etc. Heaven knows, people who work in libraries are not in it for the money! I fall into the equal access to information camp, and that is why I advocate things like open access publishing and open source software development.

TPM inhibits the free and equal access to information, and I think Eschenfelder makes a good point when she says the “library community has already accepted many of the soft use restrictions.” Why do we accept them? Librarians are not required to purchase and/or license these materials. We have a choice. If much of the scholarly publishing industry is driven by the marketplace — supply & demand — then why don’t/can’t we just say, “No”? Nobody is forcing us to spend our money this way. If vendors don’t provide the sort of products and services we desire, then the marketplace will change. Right?

In any event, consider educating yourself on the types of TPM and read Eschenfelder’s article.

Against The Grain is not

Against The Grain is not your typical library-related serial.

Last year I had the opportunity to present at the 27th Annual Charleston Conference where I shared my ideas regarding the future of search and how some of those ideas can be implemented in “next-generation” library catalogs. In appreciation of my efforts I was given a one-year subscription to Against The Grain. From the website’s masthead:

Against the Grain (ISSN: 1043-2094) is your key to the latest news about libraries, publishers, book jobbers, and subscription agents. It is a unique collection of reports on the issues, literature, and people that impact the world of books and journals. ATG is published on paper six times a year, in February, April, June, September, and November and December/January.

I try to read the issues as they come out, but I find it difficult. This is not because the content is poor, but rather because there is so much of it! In a few words and phrases, Against The Grain is full, complete, dense, tongue-in-cheek, slightly esoteric, balanced, graphically challenging and at the same time graphically interesting, informative, long, humorous, supported by advertising, somewhat scholarly, personal, humanizing, a realistic reflection of present-day librarianship (especially in regards to technical services in academic libraries), predictable, and consistent. For example, every issue contains a “rumors” article listing bunches and bunches of people, where they are going, and what they are doing. Moreover, the articles are printed in a relatively small typeface in a three-column format. Very dense. To make things easier to read, sort of, all names and titles are bolded. I suppose the dutiful reader could simply scan for names of interest and read accordingly, but there are so many of them. (Incidentally, the bolded names pointed me to the Tenth Fiesole Retreat, which piqued my interest because I had given a modified SIG-IR presentation on MyLibrary at the Second Fiesole Retreat. Taking place at Oxford, that was a really cool meeting!)

Don’t get me wrong. I like Against The Grain, but it is so full of information and has been so thoroughly put together that I feel almost embarrassed not reading it. I feel like the amount of work put into each issue warrants the same amount of effort on my part to read it.

The latest issue (volume 20, number 3, June 2008) includes a number of articles about Google. For me, the most interesting articles included:

  • “Kinda just like Google” by Jimmy Ghaphery – an examination of the number of search targets appearing on ARL library home pages. Almost all of them include a search of the catalog. Slightly fewer include searches of meta-search engines. Fewer still include searches of Google and its relatives, and fewer than that, if not non-existent, were searches of locally created indexes like institutional repositories or digital collections. Too many search boxes?
  • “Giggling Over Google” by Lilia Murray – a description of how Google Docs and Google Custom Search engines can be used and harnessed in libraries. Well-documented. Well-written. Advocates the creation of more Custom Search Engines by librarians. Sounds like a great idea to me.
  • “Keeping the Enemy Close” by John Wender – compares and contrasts the advantages and disadvantages of including/supporting Google Scholar in an academic library setting. I liked the allusion to Carl Shapiro and Hal Varian’s idea of “information as an ‘experience good’”. Kinda like, “A bird in the hand is worth two in the bush.”
  • “Measuring the ‘Google Effect’ at JSTOR” by Bruce Heterick – a description of how JSTOR’s usage skyrocketed after its content was indexed by Google.
  • “Prescription vs. Description in the information-seeking process, or should we encourage our patrons to use Google Scholar?” by Bruce Sanders – contrasts “prescription” and “description” librarianship. One encourages competent, sophisticated searching of databases. The other tailors the library Website to make patrons’ search strategies as effective as possible. An interesting comparison.
  • “Medium rare books, PODS wars, instant books brought to you by algorithms” by John D. Riley – describes how a fortune of books was found in the stacks of the Forbes Library as opposed to the library’s special collections.

If you have the time, spend it reading Against The Grain.

E-journal archiving solutions

A JISC-funded report on e-journal archiving solutions is an interesting read, and it seems as if no particular solution is the hands-down “winner”.

Terry Morrow, et al. recently wrote a report sponsored by JISC called “A Comparative study of e-journal archiving solutions”. Its goal was to compare & contrast various technical solutions to archiving electronic journals and present an informed opinion on the subject.

Begged and unanswered questions

The report begins by setting the stage. Of particular note is the increased movement to e-only journal solutions many libraries are adopting. This e-only approach begs unanswered questions regarding the preservation and archiving of electronic journals — two similar but different aspects of content curation. To what degree will e-journals suffer from technical obsolescence? Just as importantly, how will the change in publishing business models, where access, not content, is provided through license agreements, affect perpetual access and long-term preservation of e-journals?

Two preservation techniques

The report outlines two broad techniques to accomplish the curation of e-journal content. On one hand there is “source file” preservation, where content (articles) is provided by the publisher to a third-party archive. This is the raw data of the articles — possibly SGML files, XML files, Word documents, etc. — as opposed to the “presentation” files intended for display. This approach is seen as being more complete, but relies heavily on active publisher and third-party participation. This is the model employed by Portico. The other technique is harvesting. In this case the “presentation” files are archived from the Web. This method is more akin to the traditional way libraries preserved and archived their materials. This is the model employed by LOCKSS.

Compare & contrast

In order to come to their conclusions, Morrow et al. compared & contrasted six different e-journal preservation initiatives while looking through the lens of four possible trigger events. These initiatives (technical archiving solutions) included:

  1. British Library e-Journal Digital Archive – a fledgling initiative by a national library
  2. CLOCKSS – a dark archive of articles using the same infrastructure as LOCKSS
  3. e-Depot – a national library initiative from The Netherlands
  4. LOCKSS – an open source and distributed harvesting implementation
  5. OCLC ECO – an aggregation of aggregators, not really preservation
  6. Portico – a Mellon-backed “source file” approach

The trigger events included:

  1. cancellation of an e-journal title
  2. e-journal no longer available from a publisher
  3. publisher ceased operation
  4. catastrophic hardware or network failure

These characteristics made up a matrix and enabled Morrow, et al. to describe what would happen with each initiative under each trigger event. In summary, they would all function, but it seems the LOCKSS solution would provide immediate access to content whereas most of the other solutions would only provide delayed access. Unfortunately, the LOCKSS initiative seems to have less publisher backing than the Portico initiative. On the other hand, the Portico initiative costs more money and places a lot of new responsibilities on publishers.

In today’s environment, where information is more routinely sold and licensed, I wonder what level of trust can be given to publishers. What’s in it for them? In the end, neither solution — LOCKSS nor Portico — can be considered ideal, and both ought to be employed at the present time. One size does not fit all.

Recommendations

In the end there were ten recommendations:

  1. carry out risk assessments
  2. cooperate with one or more external e-journal archiving solutions
  3. develop standard cross-industry definitions of trigger events and protocols
  4. ensure archiving solutions cover publishers of value to UK libraries
  5. explicitly state perpetual access policies
  6. follow the Transfer Code of Practice
  7. gather and share statistical information about the likelihood of trigger events
  8. provide greater detail about coverage
  9. review and update this study on a regular basis
  10. take the initiative by specifying archiving requirements when negotiating licenses

Obviously the report went into much greater detail regarding all of these recommendations and how they were derived. Read the report for the details.

There are many aspects that make up librarianship. Preservation is just one of them. Unfortunately, when it comes to preservation of electronic, born-digital content, the jury is still out. I’m afraid we are suffering from a wealth of content right now, but in the future this content may not be accessible because society has not thought very far into the future regarding preservation and archiving. I hope we are not creating a Digital Dark Age as we speak. Implementing ideas from this report will help keep this problem from becoming a reality.

Web 2.0 and “next-generation” library catalogs

A First Monday article systematically comparing & contrasting Web 1.0 and Web 2.0 website technology recently caught my interest, and I think it points a way to making more informed decisions regarding “next-generation” library catalog interfaces and Internet-based library services in general.

Web 1.0 versus Web 2.0

Graham Cormode and Balachander Krishnamurthy, in “Key differences between Web 1.0 and Web 2.0”, First Monday, 13(6): June 2008, thoroughly describe the characteristics of Web 2.0 technology. The article outlines the features of Web 2.0, describes the structure of Web 2.0 sites, identifies problems with measuring Web 2.0 usage, and covers technical issues.

I really liked how it listed some of the identifying characteristics. Web 2.0 sites usually:

  • encourage user-generated content
  • exploit AJAX
  • have a strong social component
  • support some sort of public API
  • support the ability to form connections between people
  • support the posting of content in many forms
  • treat users as first class entities in the system

The article included a nice matrix of popular websites across the top and services down the side. At the intersection of the rows and columns, check marks were placed denoting whether or not the website supported the services. Of all the websites, Facebook, YouTube, Flickr, and MySpace ranked as being the most Web 2.0-esque. Not surprising.

The compare & contrast between Web 1.0 and Web 2.0 sites was particularly interesting, and can be used as a sort of standard/benchmark for comparing existing (library) websites to the increasingly expected Web 2.0 format. For example, Web 1.0 sites are characterized as being:

  • stateless
  • shaped like a “bow-tie” where there is a front page linked to many sub-pages and supplemented with many cross links between sub-pages
  • covering a single topic

Whereas Web 2.0 websites generally:

  • include a broader mixture of content types
  • produce groups or feeds of content
  • rely on user-provided content
  • represent a shared space
  • require some sort of log-in function
  • see “portalization” as a trend

For readers who feel they do not understand the meaning of Web 2.0, the items outlined above and elaborated upon in the article will make the definition of Web 2.0 clearer. Good reading.

Library “catalogs”

The article also included an interesting graphic, Figure 1, illustrating the paths from content creator to consumer in Web 2.0. The image, linked from the article, is reproduced below:

Figure 1: Paths from content creator to consumer in Web 2.0

The far left denotes people creating content. The far right denotes people using content. In the middle are services. When I look at the image I see everything from the center to the far right of the following illustration (of my own design):

[Illustration: infrastructure for a “next-generation” library catalog]

This illustration represents a model for a “next-generation” library catalog. On the far left is content aggregation. In the center is content normalization and indexing. On the right are services against the content. The right half of the illustration above is analogous to the entire illustration from Cormode and Krishnamurthy.

Like the movement from Web 1.0 to Web 2.0, library websites (online “catalogs”) need to be more about users, their content, and services applied against it. “Next-generation” library catalogs will fall short if they are only enhanced implementations of search and browse interfaces. With the advent of digitization, everybody has content. What is needed are tools — services — to make it more useful.

Encoded Archival Description (EAD) files everywhere

I’m beginning to see Encoded Archival Description (EAD) files everywhere, but maybe it is because I am involved with a project called the Catholic Research Resources Alliance (CRRA).

As you may or may not know, EAD files are the “MODS files” of the archival community. These XML files provide the means to administratively describe archival collections as well as describe the things in the collections at the container, folder, or item level.

Columbia University and MARC records

During the past few months, I helped edit and shepherd an article for Code4Lib Journal by Terry Catapano, Joanna DiPasquale, and Stuart Marquis called “Building an archival collections portal”. The article describes the environment and outlines the process folks at Columbia University use to make sets of their archival collections available on the Web. Their particular process begins with sets of MARC records dumped from their integrated library system. Catapano, DiPasquale, and Marquis then crosswalk the MARC to EAD, feed the EAD to Solr/Lucene, and provide access to the resulting index. Their implementation uses a mixture of Perl, XSLT, PHP, and Javascript. What was most interesting was the way they began the process with MARC records.
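
The article does not include their actual code, but the first step (reading a batch of MARC records and pulling out the fields destined for the crosswalk) might look something like the sketch below. It uses the CPAN module MARC::Batch; the filename and the choice of fields are my own assumptions, not Columbia’s.

  #!/usr/bin/perl
  # marc-peek.pl - a minimal sketch; read MARC records and extract a few fields
  use strict;
  use warnings;
  use MARC::Batch;

  # 'archives.mrc' is a hypothetical dump from an integrated library system
  my $batch = MARC::Batch->new( 'USMARC', 'archives.mrc' );

  while ( my $record = $batch->next ) {

      # a few fields one might carry over into EAD or an index record
      my $title  = $record->title;                          # 245
      my $author = $record->author || '';                   # 1xx
      my $url    = $record->subfield( '856', 'u' ) || '';   # link to a finding aid, if any

      print join( "\t", $title, $author, $url ), "\n";
  }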

Florida State University and tests/tools

Today I read an article by Plato L. Smith II from Information Technology and Libraries (volume 27, number 2, pages 26-30) called “Preparing locally encoded electronic finding aid inventories for union environments: A Publishing model for Encoded Archival Description”. [COinS] Smith describes how the Florida State University Libraries create their EAD files with NoteTab Light templates and then convert them into HTML and PDF documents using XSLT. They provide access to the results through the use of a content management system — DigiTool. What I found most intriguing about this article were the links to the tests/tools used to enrich their EAD files, namely the RLG EAD Report Card and the Online Archive of California Best Practices Guidelines, Appendix B. While I haven’t set it up yet, the former should check EAD files for conformity (beyond validity), and the latter will help create DACS-compliant EAD Formal Public Identifiers.
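
Smith’s shop does the EAD-to-HTML conversion with XSLT. For what it is worth, here is a bare-bones Perl version of that transformation step. It is a sketch, not their code, and it assumes XML::LibXML and XML::LibXSLT are installed; the file names are placeholders.

  #!/usr/bin/perl
  # ead2html.pl - a minimal sketch; transform an EAD finding aid into HTML via XSLT
  use strict;
  use warnings;
  use XML::LibXML;
  use XML::LibXSLT;

  # hypothetical input files; any EAD file and EAD-to-HTML stylesheet will do
  my $parser     = XML::LibXML->new;
  my $ead        = $parser->parse_file( 'finding-aid.xml' );
  my $stylesheet = XML::LibXSLT->new->parse_stylesheet_file( 'ead2html.xsl' );

  # apply the stylesheet and send the HTML to STDOUT
  my $results = $stylesheet->transform( $ead );
  print $stylesheet->output_string( $results );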

Catholic Research Resources Alliance portal

Both of these articles will help me implement the Catholic Research Resources Alliance (CRRA) portal. From a recent workshop I facilitated:

The ultimate goal of the CRRA is to facilitate research in Catholic scholarship. The focus of this goal is directed towards scholars but no one is excluded from using the Alliance’s resources. To this end, participants in the Alliance are expected to make accessible rare, unique, or infrequently held materials. Alliance members include but are not limited to academic libraries, seminaries, special collections, and archives. Similarly, content might include but is not limited to books, manuscripts, letters, directories, newspapers, pictures, music, videos, etc. To date, some of the Alliance members are Boston College, Catholic University, Georgetown University, Marquette University, Seton Hall University, University of Notre Dame, and University of San Diego.

Like the Columbia University implementation, the portal is expected to allow Alliance members to submit MARC records describing individual items. The Catapano, DiPasquale, and Marquis article will help me map my MARC fields to my local index. Like the Florida State University implementation, the portal is expected to allow Alliance members to submit EAD files. The Smith article will help me create unique identifiers. For Alliance members who have neither MARC nor EAD files, the portal is expected to allow Alliance members to submit their content via a fill-in-the-blank interface which I am adopting from the good folks at the Archives Hub.

The CRRA portal application is currently based on MyLibrary and an indexer/search engine called KinoSearch. After they are submitted to the portal, EAD files and MARC records are parsed and saved to a MySQL database using the Perl-based MyLibrary API. Various reports are then written against the database, again, using the MyLibrary API. These reports are used to create on-the-fly browsable lists of formats, names, subjects, and CRRA “themes”. They are used to create sets of XML files for OAI-PMH harvesting. They are used to feed data to KinoSearch to create an index. (For example, see mylibrary2files.pl and then ead2kinosearch.pl.) Finally, the whole thing is brought together with a single Perl script for searching (via SRU) and browsing.
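
The MyLibrary and KinoSearch calls are specific to the portal’s code, but searching via SRU is just an HTTP GET with a handful of standard parameters. A toy client, with a made-up base URL and query, could be as simple as the following.

  #!/usr/bin/perl
  # sru-search.pl - a minimal sketch; send an SRU searchRetrieve request and dump the XML
  use strict;
  use warnings;
  use LWP::Simple;
  use URI::Escape;

  # hypothetical SRU interface; substitute the portal's actual search script
  my $base  = 'http://example.org/portal/search.cgi';
  my $query = 'catholic social teaching';

  # version, operation, query, and maximumRecords are standard SRU 1.1 parameters
  my $url = $base
          . '?version=1.1'
          . '&operation=searchRetrieve'
          . '&maximumRecords=10'
          . '&query=' . uri_escape( $query );

  # the response is a searchRetrieveResponse XML stream ready for parsing or XSLT
  my $xml = get( $url ) or die "Unable to search $url\n";
  print $xml;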

It is nice to see a growing interest in EAD. I think the archival community has a leg up on its library brethren regarding metadata. They are using XML more and more. Good for them!

Finally, let’s hear it for the ‘Net, free-flowing communication, and open source software. Without these things I would not have been able to accomplish nearly as much as I have regarding the portal. “Thanks guys and gals!”

eXtensible Catalog (XC): A very transparent approach

An article by Jennifer Bowen entitled “Metadata to support next-generation library resource discovery: Lessons from the eXtensible Catalog, Phase 1” recently appeared in the June 2008 issue of Information Technology & Libraries. [1]

The article outlines next-steps for the XC Project and enumerates a number of goals for their “‘next-generation’ library catalog” application/system:

  1. provide access to all library resources, digital and non-digital
  2. bring metadata about library resources into a more open Web environment
  3. provide an interface with new Web functionality such as Web 2.0 features and faceted browsing
  4. conduct user research to inform system development
  5. publish the XC code as open-source software

Because I am somewhat involved in the XC Project from past meetings and as a Development Partner, the article did not contain a lot of new news for me, but it did elaborate on a number of points.

Its underlying infrastructure is a good example. Like many “next-generation” library catalog applications/systems, it proposes to aggregate content from a wide variety of sources, normalize the data into a central store (the “hub”), index the content, and provide access to the central store or index through a number of services. This is how Primo, VUFind, and AquaBrowser operate. Many others work in a similar manner; all of these systems have more things in common than differences. Unlike other applications/systems, XC seems to embrace a more transparent and community-driven process.

One of the things that intrigued me most was goal #2. “XC will reveal library metadata not only through its own separate interface…, but will also allow library metadata to be revealed through other Web applications.” This is definitely the way to go. A big part of librarianship is making data, information, and knowledge widely accessible. Our current systems do this very poorly. XC is moving in the right direction in this regard. Kudos.

Another thing that caught my eye was a requirement for goal #3, “The XC system will capture metadata generated by users from any one of the system’s user environments… and harvest it back into the system’s metadata services hub for processing.” This too sounds like a good idea. People are the real sources of information. Let’s figure out ways to harness the knowledge, expertise, and experiences of our users.

What is really nice about XC is the approach they are taking. It is not all about their software and their system. Instead, it is about building on the good work of others and providing direct access to their improvements. “Projects such as the eXtensible Catalog can serve as a vehicle for moving forward by providing an opportunity for libraries to experiment and to then take informed action to move the library community toward a next generation of resource discovery systems.”

I wish more librarians would be thinking about their software development processes in the manner of XC.

[1] The article is immediately available online at http://hdl.handle.net/1802/5757.

DLF ILS Discovery Interface Task Group Technical Recommendation

I read with great interest the DLF ILS Discovery Interface Task Group (ILS-DI) Technical Recommendation [1], and I definitely think it is a step in the right direction for making the content of library systems more accessible.

In regards to the integrated systems of libraries, the primary purpose of the Recommendations is to:

  • improve discovery and use of library resources
  • articulate a clear set of expectations for developers
  • make recommendations applicable to existing and future systems
  • ensure the recommendations are feasible
  • support interoperation and cooperation
  • be responsive to the user and developer community

To this end the Recommendations list a set of abstract functions integrated library systems “should” implement, and they enumerate a number of concrete bindings that can be used to implement these functions. Each of the twenty-five (25) functions can be grouped into one of four overall categories:

  1. data aggregation – harvest content en masse from the underlying system
  2. search – supply a query and get back a list of matching records
  3. patron services – support things like renew, hold, recall, etc.
  4. OPAC integration – provide ways to link to outside services

The Recommendations also group the functions into levels of interoperability:

  1. Level 1: basic interface – simple harvest, search, and display record
  2. Level 2: supplemental – Level 1 plus more robust harvest and search
  3. Level 3: alternative – Level 2 plus patron services
  4. Level 4: robust – Level 3 plus reserves functions and support of an explain function

After describing the things outlined above in greater detail, the Recommendations get down to business, listing each function, its parameters, and why it is recommended, and suggesting one or more “bindings” — possible ways the function can be implemented. Compared to most recommendations in my experience, this one is very easy to read, and it is definitely approachable by anybody who calls themselves a librarian. A few examples illustrate the point.

The Recommendations suggest a number of harvest functions. These functions allow a harvesting system to specify a number of date ranges and get back a list of records that have been created or edited within those ranges. These records may be bibliographic, holdings, or authority in nature. These records may be in MARC format, but it is strongly suggested they be in some flavor of XML. The search functions allow a remote application to query the system and get back a list of matching records. Like the harvest functions, records may be returned in MARC but XML is preferred. Patron functions support finding patrons, listing patron attributes, allowing patrons to place holds, recalls, or renewals on items, etc.

There was one thing I especially liked about the Recommendations. Specifically, whenever possible, the bindings were based on existing protocols and “standards”. For example, they advocated the use of OAI-PMH, SRU, OpenSearch, NCIP, ISO Holdings, SIP2, MODS, MADS, and MARCXML.
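
To make the harvest functions concrete, consider the OAI-PMH binding: harvesting everything created or edited within a date range is exactly what the protocol’s listRecords verb does. Here is a sketch using the CPAN module Net::OAI::Harvester; the base URL and dates are placeholders, not a real integrated library system.

  #!/usr/bin/perl
  # harvest.pl - a minimal sketch; harvest records modified within a date range via OAI-PMH
  use strict;
  use warnings;
  use Net::OAI::Harvester;

  # a hypothetical OAI-PMH interface exposed by (or bolted onto) an integrated library system
  my $harvester = Net::OAI::Harvester->new( baseURL => 'http://example.org/ils/oai' );

  # ask for everything created or edited during the first half of 2008
  my $records = $harvester->listRecords(
      metadataPrefix => 'oai_dc',
      'from'         => '2008-01-01',
      'until'        => '2008-06-30',
  );

  # loop through the results, printing titles as we go
  while ( my $record = $records->next ) {
      my $metadata = $record->metadata;
      print $metadata->title, "\n";
  }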

From my reading, there were only two slightly off-kilter things about the Recommendations. First, they advocate the possible use of an additional namespace to fill in some blanks left by existing XML vocabularies. I suppose this was necessary in order to glue the whole thing together. Second, it took me a while to get my head around the functions supporting links to external services — the OPAC interaction functions. These functions are expected to return Web pages that are static, writable, or transformative in nature. I’ll have to think about these some more.

It is hoped vendors of integrated library systems will support these functions natively or that they will be supported through some sort of add-on system. The eXtensible Catalog (XC) is a good example here. The use of Ex Libris’s X-Server interface is another. At the very least a number of vendors have said they would make efforts to implement Level 1 functionality, and this agreement has been called the “Berkeley Accord” and includes: AquaBrowser, BiblioCommons, California Digital Library, Ex Libris, LibLime, OCLC, Polaris Library Systems, SirsiDynix, Talis, and VTLS.

Within my own sphere of hack-dom, I think I could enhance my Alex Catalogue of Electronic Texts to support these Recommendations. Create a (MyLibrary) database. Populate it with the metadata and full-text data of electronic books, open access journal articles, Open Content Alliance materials, records from Wikipedia, and photographic images of my own creation. Write reports in the form of browsable lists or feeds expected to be fed to an indexer. Add an OAI-PMH interface. Make sure the indexer is accessible via SRU. Implement a “my” page for users and enhance it to support the Recommendations. Ironically, much of this work has already been done.

In summary, and as I mentioned previously, these Recommendations are a step in the right direction. The implementation of a “next generation” library catalog is not about re-inventing a better wheel and trying to corner the market with superior or enhanced functionality. Instead it is about providing a platform for doing the work libraries do. For the most part, libraries and their functions have more things in common than they have differences. These Recommendations articulate a lot of these commonalities. Implement them, and kudos to Team DLF ILS-DI.

[1] PDF version of Recommendation – http://tinyurl.com/3lqxx2