Infomotions Mini-Musings

Final blog posting

Eric Lease Morgan — Sat, 19 Dec 2020 16:02:35 +0000

This is probably my final blog posting using the WordPress software, and I hope to pick up posting on Infomotions’ Musings.

WordPress is a significant piece of software, and while its functionality is undeniable, maintaining the software is a constant process. It has become too expensive for me.

Moreover, blog software, such as WordPress, was supposed to enable two additional types of functionality that have not really come to fruition. The first is/was syndication. Blog software was expected to support things like RSS feeds. While blog software does support RSS, people do not seem to create/maintain lists of blogs and RSS feeds for reading. The idea of RSS has not come to fruition in the expected way. Similarly, blogs were expected to support commenting in the form of academic dialog, but that has not really come to fruition either; blog comments are usually terse and do not really foster discussion.

For these reasons, I am forgoing WordPress, and I hope to return to the use of my personal TEI publishing process. I feel as if my personal process will be more long-lasting.

In order to make this transition, I have used a WordPress plug-in called Simply Static. Install the software, play with the settings, create a static site, review results, and repeat if necessary. The software seems to work pretty well. Also, playing the role of librarian, I have made an effort to classify my blog postings while diminishing the number of items in the “miscellaneous” category.

By converting my blog to a static site and removing WordPress from my site, I feel as if I am making the Infomotions web presence simpler and easier to maintain. Sure, I am losing some functionality, but that loss is smaller than the amount of time, effort, and worry I incur by running software I know too little about.

Charting & graphing with Tableau Public

Eric Lease Morgan — Fri, 04 May 2018 16:02:36 +0000

They say, “A picture is worth a thousand words”, and through use of something like Tableau this can become a reality in text mining.

After extracting features from a text, you will have almost invariably created lists. Each of the items on the lists will be characterized with bits of context, thus transforming the raw data into information. These lists will probably take the shape of matrices (sets of rows & columns), but other data structures exist as well, such as networked graphs. Once the data has been transformed into information, you will want to make sense of the information, that is, turn the information into knowledge. Charting & graphing the data is one way to make that happen.

For example, the reader may have associated each word in a text with a part-of-speech, and then this association was applied across a corpus. The reader might then ask, “To what degree are words associated with each part-of-speech similar or different across items in the corpus? Do different items include similar or different personal pronouns, and therefore, are some documents more male, more female, or more gender neutral?” Alternatively, suppose the named entities have been extracted from items in a corpus, then the reader may want to know, “What countries, states, and/or cities are mentioned in the text/corpus, and to what degree? Are these texts ‘European’ in scope?”

A charting & graphing application like Tableau (or Tableau Public) can address these questions. [1] The first can be answered by enabling the reader to select one or more items from a corpus, select one or more parts-of-speech, count & tabulate the intersected words, and display the result as a word cloud. The second question can be addressed similarly. Allow the reader to select items from a corpus, extract the names of places (countries, states, and cities), and plot the geographic coordinates on a global map. Once these visualizations are complete, they can be saved on the Web for others to use, for example:

Creating visualizations with Tableau (or Tableau Public) takes practice. Not only does the reader need to have structured data in hand, but the reader also needs to be patient in learning the interface. To the author’s mind, the whole thing is reminiscent of the venerable HyperCard program from the 1980s, where one was presented with a number of “cards”, and programming interfaces were created by placing “objects” on them.

This workshop comes with two previously created Tableau workbooks located in the etc directory (word-clouds.twbx and maps.twbx). Describing the process to create them is beyond the scope of this workshop, but an outline follows:

amass sets of data, like parts-of-speech or named entities
import the data into Tableau
in the case of the named entities, convert the data to “Geographic Roles”
drag data elements to the Marks, Rows, or Columns cards
make liberal use of the Show Me feature
drag data elements to the Filters card
observe the visualizations and turn your information into knowledge
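
The first step of the outline, amassing data, might be sketched in Python as follows. This is only an illustration: the rows, the column names, and the function names are hypothetical, and they are not taken from the workshop’s files. The sketch counts how often each word occurs per document and part-of-speech, and it writes the result as CSV text, a shape that a tool like Tableau can import.

```python
import csv
import io
from collections import Counter

# hypothetical sample rows: (document, word, part-of-speech tag)
ROWS = [
    ("walden", "pond", "NN"),
    ("walden", "pond", "NN"),
    ("walden", "he", "PRP"),
    ("emma", "she", "PRP"),
    ("emma", "she", "PRP"),
    ("emma", "house", "NN"),
]

def tabulate(rows):
    """Count how often each (document, pos, word) combination occurs."""
    return Counter((doc, pos, word.lower()) for doc, word, pos in rows)

def to_csv(counts):
    """Serialize the counts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document", "pos", "word", "count"])
    for (doc, pos, word), count in sorted(counts.items()):
        writer.writerow([doc, pos, word, count])
    return buffer.getvalue()

counts = tabulate(ROWS)
print(to_csv(counts))
```

Saving the printed text to a file with a .csv extension ought to be enough for the import step that follows.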

Tableau is not really intended to be used against textual data/information; Tableau is more useful and more functional when applied to tabular numeric data. After all, the program is called… Tableau. This does not mean Tableau cannot be exploited by the text miner. It just means it requires practice and an ability to articulate a question to be answered with the help of a visualization.

Links

[1] Tableau Public – https://public.tableau.com/

Extracting parts-of-speech and named entities with Stanford tools

Eric Lease Morgan — Fri, 27 Apr 2018 01:25:07 +0000

Extracting specific parts-of-speech as well as “named entities”, and then counting & tabulating them can be quite insightful.

Parts-of-speech include nouns, verbs, adjectives, adverbs, etc. Named entities are specific types of nouns, including but not limited to, the names of people, places, organizations, dates, times, money amounts, etc. By creating features out of parts-of-speech and/or named entities, the reader can answer questions such as:

What is discussed in this document?
What do things do in this document?
How are things described, and how might those descriptions be characterized?
To what degree is the text male, female, or gender neutral?
Who is mentioned in the text?
To what places are things referring?
What happened in the text?

There are a number of tools enabling the reader to extract parts-of-speech, including the venerable Brill part-of-speech tagger implemented in a number of programming languages, CLAWS, the Apache OpenNLP, and a specific part of the Stanford NLP suite of tools called the Stanford Log-linear Part-Of-Speech Tagger. [1] Named entities can be extracted with the Stanford Named Entity Recognizer (NER). [2] This workshop exploits the Stanford tools.

The Stanford Log-linear Part-Of-Speech Tagger is written in Java, making it a bit difficult for most readers to use in the manner it was truly designed, the author included. Luckily, the distribution comes with a command-line interface allowing the reader to use the tagger sans any Java programming. Because any part-of-speech or named entity extraction application is the result of a machine learning process, it is necessary to use a previously created computer model. The Stanford tools come with quite a few models from which to choose. The command-line interface also enables the reader to specify different types of output: tagged, XML, tab-delimited, etc. Because of all these options, and because the whole thing uses Java “archives” (read programming libraries or modules), the command-line interface is daunting, to say the least.

After downloading the distribution, the reader ought to be able to change to the bin directory, and execute either one of the following commands:

$ stanford-postagger-gui.sh
> stanford-postagger-gui.bat

The result will be a little window prompting for a sentence. Upon entering a sentence, tagged output will result. This is a toy interface, but demonstrates things quite nicely.

The full-blown command-line interface is a bit more complicated. From the command-line one can do either of the following, depending on the operating system:

$ stanford-postagger.sh models/english-left3words-distsim.tagger walden.txt
> stanford-postagger.bat models\english-left3words-distsim.tagger walden.txt

The result will be a long stream of tagged sentences, which I find difficult to parse. Instead, I prefer the inline XML output, which is much more difficult to execute but much more readable. Try either:

$ java -cp stanford-postagger.jar: edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-left3words-distsim.tagger -outputFormat inlineXML -outputFormatOptions lemmatize -textFile walden.txt
> java -cp stanford-postagger.jar: edu.stanford.nlp.tagger.maxent.MaxentTagger -model models\english-left3words-distsim.tagger -outputFormat inlineXML -outputFormatOptions lemmatize -textFile walden.txt

In these cases, the result will be a long string of ill-formed XML. With a bit of massaging, this XML is much easier to parse with just about any computer programming language, believe it or not. The tagger can also be run in server mode, which makes batch processing a whole lot easier. The workshop’s distribution comes with a server and client application for exploiting these capabilities, but, unfortunately, they won’t run on Windows computers unless some sort of Linux shell has been installed. Some people can issue the following command to launch the server from the workshop’s distribution:

$ ./bin/pos-server.sh

The reader can run the client like this:

$ ./bin/pos-client.pl walden.txt

The result will be a well-formed XML file, which can be redirected to a file, processed by another script converting it into a tab-delimited file, and finally saved to a second file for reading by a spreadsheet, database, or data analysis tool:

$ ./bin/pos-client.pl walden.txt > walden.pos; ./bin/pos2tab.pl walden.pos > walden.tsv
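
The conversion from XML to tab-delimited text can be approximated in Python. The following is a minimal sketch and not the workshop’s pos2tab.pl script. The sample input, its element names (sentence, word), and its attribute names (id, lemma, pos) are assumptions about the shape of the tagger’s XML, so check them against real output. The sketch wraps the input in a single root element, which is one way of “massaging” XML that lacks one; input that already has a root element or an XML declaration should skip that step.

```python
import xml.etree.ElementTree as ET

# hypothetical sample in the style of the tagger's XML output; it has
# two top-level elements and no single root, so it is not well-formed
SAMPLE = '''
<sentence id="0">
  <word wid="0" pos="PRP" lemma="I">I</word>
  <word wid="1" pos="VBD" lemma="go">went</word>
</sentence>
<sentence id="1">
  <word wid="0" pos="NN" lemma="pond">Pond</word>
</sentence>
'''

def xml2rows(xml):
    """Wrap the XML in a root element, parse it, and return (sentence, token, lemma, pos) rows."""
    root = ET.fromstring("<document>" + xml + "</document>")
    rows = []
    for sentence in root.iter("sentence"):
        for word in sentence.iter("word"):
            rows.append((sentence.get("id"), word.text, word.get("lemma"), word.get("pos")))
    return rows

# output one tab-delimited line per token
rows = xml2rows(SAMPLE)
for row in rows:
    print("\t".join(row))
```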

For the purposes of this workshop, the whole of the harvested data has been pre-processed with the Stanford Log-linear Part-Of-Speech Tagger. The result has been mirrored in the parts-of-speech folder/directory. The reader can open the files in the parts-of-speech folder/directory for analysis. For example, you might open them in OpenRefine and try to see what verbs appear most frequently in a given document. My guess is the answer will be the lemmas “be” or “have”. The next set of most frequently used verb lemmas will probably be more indicative of the text.
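
The same question about verbs can also be asked programmatically. The following is a small sketch, assuming a tab-delimited file with a header row and columns named token, lemma, and pos; those column names and the sample data are hypothetical, so adjust them to match the real files. It relies on the fact that Penn Treebank verb tags (VB, VBD, VBZ, etc.) all begin with “VB”.

```python
import csv
import io
from collections import Counter

# hypothetical tab-delimited content; the column names are assumptions
TSV = "token\tlemma\tpos\nwas\tbe\tVBD\nis\tbe\tVBZ\nhad\thave\tVBD\nwent\tgo\tVBD\npond\tpond\tNN\n"

def verb_lemmas(handle):
    """Count the lemmas of all rows whose part-of-speech tag denotes a verb."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["lemma"] for row in reader if row["pos"].startswith("VB"))

# for a real file, replace io.StringIO( TSV ) with an open file handle
frequencies = verb_lemmas(io.StringIO(TSV))
print(frequencies.most_common(3))
```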

The process of extracting named entities as features is very similar with the Stanford NER. The original Stanford NER distribution comes with a number of jar files, models, and configuration/parameter files. After downloading the distribution, the reader can run a little GUI application, import some text, and run NER. The result will look something like this:

The simple command-line interface takes a single file as input, and it outputs a stream of tagged sentences. For example:

$ ner.sh walden.txt
> ner.bat walden.txt

Each tag denotes an entity (i.e. the name of a person, the name of a place, the name of an organization, etc.). Like the result of all machine learning algorithms, the tags are not necessarily correct, but upon closer examination, most of them are pretty close. Like the POS Tagger, this workshop’s distribution comes with a set of scripts/programs that can make the Stanford NER tool locally available as a server. It also comes with a simple client to query the server. Like the workshop’s POS tool, the reader (with a Macintosh or Linux computer) can extract named entities all in two goes:

$ ./bin/pos-server.sh
$ ./bin/pos-client.pl walden.txt > walden.ner; ./bin/pos2tab.pl walden.ner > walden.tsv

Like the workshop’s pre-processed part-of-speech files, the workshop’s corpus has been pre-processed with the NER tool. The preprocessed files ought to be in a folder/directory named… named-entities. And like the parts-of-speech files, the “ner” files are tab-delimited text files readable by spreadsheets, databases, OpenRefine, etc. For example, you might open one of them in OpenRefine and see what names of people trend in a given text. Try to create a list of places (which is not always easy), export them to a file, and open them with Tableau Public for the purposes of making a geographic map.

Extracting parts-of-speech and named entities straddles simple text mining and natural language processing. Simple text mining is often about counting & tabulating features (words) in a text. These features have little context sans proximity to other features. On the other hand, parts-of-speech and named entities denote specific types of things, namely specific types of nouns, verbs, adjectives, etc. While these things do not necessarily denote meaning, they do provide more context than simple features. Extracting parts-of-speech and named entities is (more or less) an easy text mining task with more benefit than cost. Extracting parts-of-speech and named entities goes beyond the basics.

Creating a plain text version of a corpus with Tika

Eric Lease Morgan — Thu, 26 Apr 2018 00:22:19 +0000

It is imperative to create plain text versions of corpus items.

Text mining cannot be done without plain text data. This means HTML files need to be rid of markup. It means PDF files need to have been “born digitally” or they need to have been processed with optical character recognition (OCR), and then the underlying text needs to be extracted. Word processor files need to be converted to plain text, and the result saved accordingly. The days of plain ol’ ASCII text files need to be forgotten. Instead, the reader needs to embrace Unicode, and whenever possible, make sure characters in the text files are encoded as UTF-8. With UTF-8 encoding, one gets all of the nice accent marks so foreign to United States English, but one also gets all of the pretty emoticons increasingly sprinkling our day-to-day digital communications. Moreover, the data needs to be as “clean” as possible. When it comes to OCR, do not fret too much. Given the large amounts of data the reader will process, “bad” OCR (OCR with less than 85% accuracy) can still be quite effective.
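
Making sure a file is encoded as UTF-8 can be sketched in a few lines of Python. The function name and the list of fallback encodings are assumptions for the sake of illustration. Guessing an encoding is a heuristic at best, so the result ought to be spot-checked by eye.

```python
def to_utf8(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes by trying each encoding in turn, and return UTF-8 encoded bytes."""
    for encoding in encodings:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # last resort: latin-1 maps every byte to a character, so this never fails
    return raw.decode("latin-1").encode("utf-8")

# a legacy (Windows-1252) byte string that is not valid UTF-8
legacy = "café".encode("cp1252")
print(to_utf8(legacy).decode("utf-8"))
```

To convert a file, read it in binary mode (open with 'rb'), pass the bytes to the function, and write the result back in binary mode.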

Converting harvested data into plain text used to be laborious as well as painful, but then a Java application called Apache Tika came on the scene. [1] Tika comes in two flavors: application and server. The application version can take a single file as input, and it can output metadata as well as any underlying text. The application can also work in batch mode taking a directory as input and saving the results to a second directory. Tika’s server version is much more expressive, more powerful, and very HTTP-like, but it requires more “under the hood” knowledge to exploit to its fullest potential.

For the sake of this workshop, versions of the Tika application and Tika server are included in the distribution, and they have been saved in the lib directory with the names tika-desktop.jar and tika-server.jar. The reader can run the desktop/GUI version of the Tika application by merely double-clicking on it. The result will be a dialog box.

Drag a PDF (or just about any) file on to the window, and Tika will extract the underlying text. To use the command-line interface, something like this could be run to output the help text:

$ java -jar ./lib/tika-desktop.jar --help
> java -jar .\lib\tika-desktop.jar --help

And then something like these commands to process a single file or a whole directory of files:

$ java -jar ./lib/tika-desktop.jar -t
$ java -jar ./lib/tika-desktop.jar -t -i -o
> java -jar .\lib\tika-desktop.jar -t
> java -jar .\lib\tika-desktop.jar -t -i -o

Try transforming a few files individually as well as in batch. What does the output look like? To what degree is it readable? To what degree has the formatting been lost? Text mining does not take formatting into account, so there is no huge loss in this regard.

Without some sort of scripting, the use of Tika to convert harvested data into plain text can still be tedious. Consequently, the whole of the workshop’s harvested data has been pre-processed with a set of Perl and bash scripts (which probably won’t work on Windows computers unless some sort of Linux shell has been installed):

$ ./bin/tika-server.sh – runs Tika in server mode on TCP port 8080, and waits patiently for incoming connections
$ ./bin/tika-client.pl – takes a file as input, sends it to the server, and returns the plain text while handling the HTTP magic in the middle
$ ./bin/file2txt.sh – a front-end to the second script taking a file and directory name as input, transforming the file into plain text, and saving the result with the same name but in the given directory and with a .txt extension
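
For readers more comfortable with Python than Perl, the “HTTP magic” might be sketched as follows. This is not the workshop’s tika-client.pl script. It assumes a Tika server is listening locally on port 8080, as described above, and it uses the server’s /tika resource, which accepts a document sent with HTTP PUT and returns plain text when asked to with an Accept header. The function names are hypothetical.

```python
import urllib.request

# assumption: a Tika server is running locally on TCP port 8080
TIKA_URL = "http://localhost:8080/tika"

def build_request(content, url=TIKA_URL):
    """Create (but do not send) an HTTP PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url,
        data=content,
        method="PUT",
        headers={"Accept": "text/plain"},
    )

def extract_text(path, url=TIKA_URL):
    """Send the file at the given path to the server, and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, something like print( extract_text( 'walden.pdf' ) ) ought to output the underlying text; the file name is only an example.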

The entirety of the harvested data has been transformed into plain text for the purposes of this workshop. (“Well, almost all.”) The result has been saved in the folder/directory named “corpus”. Peruse the corpus directory. Compare & contrast its contents with the contents of the harvest directory. Can you find any omissions, and if so, then can you guess why/how they occurred?

Identifying themes and clustering documents using MALLET

Eric Lease Morgan — Thu, 26 Apr 2018 00:02:06 +0000

Topic modeling is an unsupervised machine learning process. It is used to create clusters (read “subsets”) of documents, and each cluster is characterized by sets of one or more words. Topic modeling is good at answering questions like, “If I were to describe this collection of documents in a single word, then what might that word be? How about two?” or make statements like, “Once I identify clusters of documents of interest, allow me to read/analyze those documents in greater detail.” Topic modeling can also be used for keyword (“subject”) assignment; topics can be identified and then documents can be indexed using those terms. In order for a topic modeling process to work, a set of documents first needs to be assembled. The topic modeler then, at the very least, takes an integer as input, which denotes the number of topics desired. All other possible inputs can be assumed, such as use of a stop word list or denoting the number of times the topic modeler ought to internally run before it “thinks” it has come to the best conclusion.

MALLET is the granddaddy of topic modeling tools, and it supports other functions such as text classification and parsing. [1] It is essentially a set of Java-based libraries/modules designed to be incorporated into Java programs or executed from the command line.

A subset of MALLET’s functionality has been implemented in a program called topic-modeling-tool, and the tool bills itself as “A GUI for MALLET’s implementation of LDA.” [2] Topic-modeling-tool provides an easy way to read what possible themes exist in a set of documents or how the documents might be classified. It does this by creating topics, displaying the results, and saving the data used to create the results for future use. Here’s one way:

Create a set of plain text files, and save them in a single directory.
Run/launch topic-modeling-tool.
Specify where the set of plain text files exist.
Specify where the output will be saved.
Denote the number of topics desired.
Execute the command with “Learn Topics”.

The result will be a set of HTML, CSS, and CSV files saved in the output location. The “answer” can also be read in the tool’s console.

A more specific example is in order. Here’s how to answer the question, “If I were to describe this corpus in a single word, then what might that one word be?”:

Repeat Steps #1-#4, above.
Specify a single topic to be calculated.
Press “Optional Settings…”.
Specify “1” as the number of topic words to print.
Press okay.
Execute the command with “Learn Topics”.

What one word can be used to describe your collection?

Iterate the modeling process by slowly increasing the number of desired topics and number of topic words. Personally, I find it interesting to implement a matrix of topics to words. For example, start with one topic and one word. Next, denote two topics with two words. Third, specify three topics with three words. Continue the process until the sets of words (“topics”) seem to make intuitive sense. After a while you may observe clear semantic distinctions between each topic as well as commonalities between each of the topic words. Distinctions and commonalities may include genders, places, names, themes, numbers, OCR “mistakes”, etc.
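
The same iteration can be scripted against MALLET’s own command-line interface instead of the GUI. The following sketch only builds the command lines; it does not run them. The location of the mallet executable, the file names, and the function names are assumptions, while the options (import-dir, train-topics, --num-topics, --num-top-words, --output-topic-keys) come from MALLET’s command-line interface.

```python
def import_command(corpus_dir, model_file, mallet="bin/mallet"):
    """Build the command that imports a directory of plain text files."""
    return [mallet, "import-dir", "--input", corpus_dir, "--output", model_file,
            "--keep-sequence", "--remove-stopwords"]

def train_command(model_file, topics, words, keys_file, mallet="bin/mallet"):
    """Build the command that models the given number of topics and topic words."""
    return [mallet, "train-topics", "--input", model_file,
            "--num-topics", str(topics), "--num-top-words", str(words),
            "--output-topic-keys", keys_file]

def iterations(model_file, maximum):
    """One topic with one word, two topics with two words, and so on."""
    return [train_command(model_file, n, n, f"keys-{n}.txt") for n in range(1, maximum + 1)]

# display the commands -- season to taste
print(" ".join(import_command("corpus", "corpus.mallet")))
for command in iterations("corpus.mallet", 3):
    print(" ".join(command))
```

Each list can be handed to subprocess.run( command, check=True ) to actually execute it, and the resulting keys files can then be compared side by side.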

Introduction to the NLTK

Eric Lease Morgan — Wed, 25 Apr 2018 23:47:28 +0000

The venerable Python Natural Language Toolkit (NLTK) is well worth the time of anybody who wants to do text mining more programmatically. [0]

For much of my career, Perl has been the language of choice when it came to processing text, but in the recent past it seems to have fallen out of favor. I really don’t know why. Maybe it is because so many other computer languages have come into existence in the past couple of decades: Java, PHP, Python, R, Ruby, Javascript, etc. Perl is more than capable of doing the necessary work. Perl is well-supported, and there are a myriad of supporting tools/libraries for interfacing with databases, indexers, TCP networks, data structures, etc. On the other hand, few people are being introduced to Perl; people are being introduced to Python and R instead. Consequently, the Perl community is shrinking, and the communities for other languages are growing. Writing something in a “dead” language is not very intelligent, but that may be over-stating the case. On the other hand, I’m not going to be able to communicate with very many people if I speak Latin and everybody else is speaking French, Spanish, or German. It behooves the reader to write software in a language apropos to the task as well as a language used by many others.

Python is a good choice for text mining and natural language processing. The Python NLTK provides functionality akin to much of what has been outlined in this workshop, but it goes much further. More specifically, it interfaces with WordNet, a sort of thesaurus on steroids. It interfaces with MALLET, the Java-based classification & topic modeling tool. It is very well-supported and continues to be maintained. Moreover, Python is mature in & of itself. There are a host of Python “distributions/frameworks”. There are any number of supporting libraries/modules for interfacing with the Web, databases & indexes, the local file system, etc. Even more importantly for text mining (and natural language processing) techniques, Python is supported by a set of robust machine learning libraries/modules called scikit-learn. If the reader wants to write text mining or natural language processing applications, then Python is really the way to go.

In the etc directory of this workshop’s distribution is a “Jupyter Notebook” named “An introduction to the NLTK.ipynb”. [1] Notebooks are sort of interactive Python interfaces. After installing Jupyter, the reader ought to be able to run the Notebook. This specific Notebook introduces the use of the NLTK. It walks you through the processes of reading a plain text file; parsing the file into words (“features”); normalizing the words; counting & tabulating the results; graphically illustrating the results; and finding co-occurring words, words with similar meanings, and words in context. It also dabbles a bit in parts-of-speech and named entity extraction.

The heart of the Notebook’s code follows. Given a sane Python installation, one can run this program by saving it with a name like introduction.py, saving a file named walden.txt in the same directory, changing to the given directory, and then running the following command:

python introduction.py

The result ought to be a number of textual outputs in the terminal window as well as a few graphics.

Errors may occur, probably because other Python libraries/modules have not been installed. Follow the error messages’ instructions, and try again. Remember, “Your mileage may vary.”

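# configure; define the location of a plain text file for analysis (an absolute path is safest)
FILE = 'walden.txt'

# import / require the parts of the Toolkit used below
from nltk import FreqDist, Text, ne_chunk, ngrams, pos_tag, sent_tokenize, word_tokenize

# slurp up the given file, reading it as UTF-8 and closing it when done; display the result
with open( FILE, 'r', encoding='utf-8' ) as handle :
    data = handle.read()
print( data )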

# tokenize the data into features (words); display them
features = word_tokenize( data )
print( features )

# normalize the features to lower case and exclude punctuation
features = [ feature for feature in features if feature.isalpha() ]
features = [ feature.lower() for feature in features ]
print( features )

# create a set of (English) stopwords, and then remove them from the features
from nltk.corpus import stopwords
stops    = set( stopwords.words( 'english' ) )
features = [ feature for feature in features if feature not in stops ]

# count & tabulate the features, and then plot the results -- season to taste
frequencies = FreqDist( features )
plot = frequencies.plot( 10 )

# create a list of unique words (hapaxes); display them
hapaxes = frequencies.hapaxes()
print( hapaxes )

# count & tabulate ngrams (here, bigrams) from the features -- season to taste; display some
bigrams     = list( ngrams( features, 2 ) )
frequencies = FreqDist( bigrams )
print( frequencies.most_common( 10 ) )

# create a list of each token's length, and plot the result; How many "long" words are there?
lengths = [ len( feature ) for feature in features ]
plot    = FreqDist( lengths ).plot( 10 )

# initialize a stemmer, stem the features, count & tabulate, and output
from nltk.stem import PorterStemmer
stemmer     = PorterStemmer()
stems       = [ stemmer.stem( feature ) for feature in features ]
frequencies = FreqDist( stems )
print( frequencies.most_common( 10 ) )

# re-create the features and create a NLTK Text object, so other cool things can be done
features = word_tokenize( data )
text     = Text( features )

# count & tabulate, again; list a given word -- season to taste
frequencies = FreqDist( text )
print( frequencies[ 'love' ] )

# do keyword-in-context searching against the text (concordancing); the method prints its own output
text.concordance( 'love' )

# create a dispersion plot of given words
plot = text.dispersion_plot( [ 'love', 'war', 'man', 'god' ] )

# output the "most significant" bigrams, considering surrounding words (size of window) -- season to taste
text.collocations( num=10, window_size=4 )

# given a set of words, what words are nearby
text.common_contexts( [ 'love', 'war', 'man', 'god' ] )

# list the words (features) most associated with the given word
text.similar( 'love' )

# create a list of sentences, and display one -- season to taste
sentences = sent_tokenize( data )
sentence  = sentences[ 14 ]
print( sentence )

# tokenize the sentence and parse it into parts-of-speech, all in one go
sentence = pos_tag( word_tokenize( sentence ) )
print( sentence )

# extract named entities from a sentence, and print the results
entities = ne_chunk( sentence )
print( entities )

# done
quit()

Using Voyant Tools to do some “distant reading”

Eric Lease Morgan — Wed, 25 Apr 2018 02:44:53 +0000

Voyant Tools is often the first go-to tool used by either: 1) new students of text mining and the digital humanities, or 2) people who know what kind of visualization they need/want. [1] Voyant Tools is also one of the longest supported tools described in this bootcamp.

As stated in the tool’s documentation: “Voyant Tools is a web-based text reading and analysis environment. It is a scholarly project that is designed to facilitate reading and interpretive practices for digital humanities students and scholars as well as for the general public.” To that end it offers a myriad of visualizations and tabular reports characterizing a given text or texts. Voyant Tools works quite well, but like most things, the best use comes with practice, a knowledge of the interface, and an understanding of what the reader wants to express. To all these ends, Voyant Tools counts & tabulates the frequencies of words, plots the results in a number of useful ways, supports topic modeling, and supports the comparison of documents across a corpus. Examples include but are not limited to: word clouds, dispersion plots, networked analysis, “stream graphs”, etc.

Figures: dispersion chart, network diagram, “stream” chart, word cloud, concordance, topic modeling

Voyant Tools’ initial interface consists of six panes. Each pane encloses a feature/function of Voyant. In the author’s experience, Voyant Tools is better experienced by first expanding one of the panes to a new window (“Export a URL”), and then deliberately selecting one of the tools from the “window” icon in the upper left-hand corner. There will then be displayed a set of about two dozen tools for use against a document or corpus.

initial layout

focused layout

Using Voyant Tools the reader can easily ask and answer the following sorts of questions:

What words or phrases appear frequently in this text?
How do those words trend throughout the given text?
What words are used in context with a given word?
If the text were divided into T topics, then what might those topics be?
Visually speaking, how do given texts or sets of text cluster together?
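
The second question, how a word trends throughout a text, boils down to dividing the text into segments and counting the word in each. The following Python sketch illustrates the idea. It is not how Voyant Tools is implemented; the function name, the simplistic tokenizer, and the sample text are assumptions.

```python
import re

def trend(text, word, segments=10):
    """Split the text into (at most) the given number of equal-sized segments, and count the word in each."""
    tokens = re.findall(r"[a-z]+", text.lower())
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    return [tokens[i:i + size].count(word) for i in range(0, len(tokens), size)]

SAMPLE = "pond pond woods woods pond woods woods woods"
print(trend(SAMPLE, "pond", segments=4))
```

Plotting the resulting list of counts as a line chart gives a rough picture of where in the text the word rises and falls.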

After a more thorough examination of the reader’s corpus, and after making the implicit more explicit, Voyant Tools can be more informative. Randomly clicking through its interface is usually daunting to the novice. While Voyant Tools is easy to use, it requires a combination of text mining knowledge and practice in order to be used effectively. Only then will useful “distant” reading be done.

[1] Voyant Tools – https://voyant-tools.org/

Using a concordance (AntConc) to facilitate searching keywords in context

Eric Lease Morgan — Mon, 23 Apr 2018 11:53:24 +0000

Concordances are among the oldest of text mining tools, dating back to at least the 13th century when they were used to analyze and “read” religious texts. Stated in modern-day terms, concordances are key-word-in-context (KWIC) search engines. Given a text and a query, concordances search for the query in the text, and return both the query as well as the words surrounding the query. For example, a query for the word “pond” in a book called Walden may return something like the following:

  1.    the shore of Walden Pond, in Concord, Massachuset
  2.   e in going to Walden Pond was not to live cheaply 
  3.    thought that Walden Pond would be a good place fo
  4.    retires to solitary ponds to spend it. Thus also 
  5.    the woods by Walden Pond, nearest to where I inte
  6.    I looked out on the pond, and a small open field 
  7.   g up. The ice in the pond was not yet dissolved, t
  8.   e whole to soak in a pond-hole in order to swell t
  9.   oping about over the pond and cackling as if lost,
  10.  nd removed it to the pond-side by small cartloads,
  11.  up the hill from the pond in my arms. I built the

The use of a concordance enables the reader to learn the frequency of the given query as well as how it is used within a text (or corpus).
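
A bare-bones concordance is easy to write, and doing so makes the idea concrete. The following Python sketch is a minimal illustration and not how any particular concordance program works. The function name and the sample text are hypothetical. Like the output above, it matches the query anywhere, so “pond” also finds “ponds” and “pond-hole”.

```python
import re

def concordance(text, query, width=20):
    """Return lines showing each match of the query with width characters of context on either side."""
    flat = re.sub(r"\s+", " ", text)
    lines = []
    for match in re.finditer(re.escape(query), flat, flags=re.IGNORECASE):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # right-justify the left context so the matches line up in a column
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

SAMPLE = "I lived alone, in the woods, on the shore of Walden Pond. The pond was not yet dissolved."
for number, line in enumerate(concordance(SAMPLE, "pond"), start=1):
    print(f"{number:3}. {line}")
```

The number of lines returned is the frequency of the query, and reading down the aligned column shows how the query is used.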

Digital concordances offer a wide range of additional features. For example, queries can be phrases or regular expressions. Search results can be sorted by the words on the left or on the right of the query. Queries can be clustered by the proximity of their surrounding words, and the results can be sorted accordingly. Queries and their nearby terms can be scored not only by their frequencies but also by the probability of their existence. Concordances can calculate the position of a query in a text and illustrate the result in the form of a dispersion plot or histogram.

AntConc is a free, cross-platform concordance program that does all of the things listed above, as well as a few others. [1] The interface is not as polished as some other desktop applications, and sometimes the usability can be frustrating. On the other hand, given practice, the use of AntConc can be quite illuminating. After downloading and running AntConc, give these tasks a whirl:

use the File menu to open a single file
use the Word List tab to list token (word) frequencies
use the Settings/Tool Preferences/Word List Category to denote a set of stop words
use the Word List tab to regenerate word frequencies
select a word of interest from the frequency list to display the KWIC; sort the result
use the Concordance Plot tab to display the dispersion plot
select the Collocates tab to see what words are near the selected word
sort the collocates by frequency and/or word; use the result to view the concordance

The use of a concordance is often done just after the creation of a corpus. (Remember, a corpus can include one or more text files.) But the use of a concordance is much more fruitful and illuminating if the features of a corpus are previously made explicit. Concordances know nothing about parts-of-speech nor grammar. Thus they have little information about the words they are analyzing. To a concordance, every word is merely a token, the tiniest bit of data, whereas features are more akin to information because they have value. It is better to be aware of the information at your disposal as opposed to simple data. Do not rush to the use of a concordance before you have some information at hand.

[1] AntConc – http://www.laurenceanthony.net/software/antconc/

Word clouds with Wordle

Eric Lease Morgan — Sun, 22 Apr 2018 14:36:02 +0000

A word cloud, sometimes called a “tag cloud”, is a fun, easy, and popular way to visualize the characteristics of a text. Usually used to illustrate the frequency of words in a text, a word cloud makes some features (“words”) bigger than others, sometimes colorizes the features, and amasses the result in a sort of “bag of words” fashion.

Many people disparage the use of word clouds. This is probably because word clouds may have been overused, because the characteristics they illustrate are sometimes sophomoric, or because too much value has been given to their meaning. Despite these facts, a word cloud is an excellent way to initialize the analysis of texts.

There are many word cloud applications and programming libraries, but Wordle is probably the easiest to use as well as the most popular. † [1] To get started, use your Web browser and go to the Wordle site. Click the Create tab and type some text into the resulting text box. Submit the form. Your browser may ask for permission to run a Java application, and if granted, the result ought to be a simple word cloud. The next step is to play with Wordle’s customizations: fonts, colors, layout, etc. To begin doing useful analysis, open a file from the workshop’s corpus, and copy/paste it into Wordle. What does the result tell you? Copy/paste a different file into Wordle and then compare/contrast the two word clouds.

By default, Wordle makes an effort to normalize the input. It removes stop words, lower-cases letters, removes numbers, etc. Wordle then counts & tabulates the frequencies of each word to create the visualization. But the frequency of words only tells one part of a text’s story. There are other measures of interest. For example, the reader might want to create a word cloud of ngram frequencies, the frequencies of parts-of-speech, or even the log-likelihood scores of significant words. To create these sorts of visualizations as word clouds, the reader must first create a colon-delimited list of features/scores, and then submit them under Wordle’s Advanced tab. The challenging part of this process is creating the list of features/scores, and the process can be done using a combination of the tools described in the balance of the workshop.
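
One way to create such a colon-delimited list is sketched below in Python, using two-word ngrams (bigrams) as the features and their counts as the scores. The function names and the sample sentence are assumptions, and whether Wordle accepts a space inside a feature is not something stated here, so the reader may need to substitute another character.

```python
import re
from collections import Counter

def bigram_scores(text):
    """Count adjacent word pairs (bigrams) in lower-cased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

def colon_delimited(scores, limit=50):
    """Format the `limit` highest-scoring features as feature:score lines."""
    lines = []
    for (first, second), count in scores.most_common(limit):
        lines.append(f"{first} {second}:{count}")
    return "\n".join(lines)

# hypothetical sample; in practice, read a file from the corpus
sample = "Walden Pond is a pond. I walked to Walden Pond."
print(colon_delimited(bigram_scores(sample)))
```

The printed lines can then be copied and pasted into the text box under the Advanced tab.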

† Since Wordle is a Web-based Java application, it is also a good test case to see whether or not Java is installed and configured on your desktop computer.

[1] Wordle – http://www.wordle.net/

An introduction to the NLTK: A Jupyter Notebook

Eric Lease Morgan — Fri, 13 Apr 2018 03:31:59 +0000

The attached file introduces the reader to the Python Natural Language Toolkit (NLTK).

The Python NLTK is a set of modules and corpora enabling the reader to do natural language processing against corpora of one or more texts. It goes beyond text mining and provides tools to do machine learning, but this Notebook barely scratches that surface.

This is my first Python Jupyter Notebook. As such I’m sure there will be errors in implementation, style, and functionality. For example, the Notebook may fail because the value of FILE is too operating system dependent, or the given file does not exist. Other failures may/will include the lack of additional modules. In these cases, simply read the error messages and follow the instructions. “Your mileage may vary.”

That said, through the use of this Notebook, the reader ought to be able to get a flavor for what the Toolkit can do without the need to completely understand the Python language.

What is text mining, and why should I care?

Eric Lease Morgan — Wed, 28 Mar 2018 12:17:05 +0000

[This is the first of a number of postings on the topic of text mining. More specifically, this is the first draft of an introductory section of a hands-on bootcamp scheduled for ELAG 2018. As I write the bootcamp’s workbook, I hope to post things here. Your comments are most welcome. –ELM]

Text mining is a process used to identify, enumerate, and analyze syntactic and semantic characteristics of a corpus, where a corpus is a collection of documents usually in the form of plain text files. The purpose of this process is to bring to light previously unknown facts, look for patterns & anomalies in the facts, and ultimately have a better understanding of the corpus as a whole.

The simplest of text mining processes merely count & tabulate a document’s “tokens” (usually words but sometimes syllables). The counts & tabulations are akin to the measurements and observations made in the physical and social sciences. Statistical methods can then be applied to the observations for the purposes of answering questions such as:

What is the average length of documents in the collection, and do they exhibit a normal distribution?
What are the most common words/phrases in a document?
What are the most common words/phrases in a corpus?
What are the unique words/phrases in a document?
What are the infrequent words/phrases in a corpus?
What words/phrases exist in every document and to what extent?
Where do given words/phrases appear in a text?
What other words surround a given word/phrase?
What words/phrases are truly representative of a document or corpus?
If a document or corpus were to be described in a single word, then what would that word be? How about described in three words? How about describing a document with three topics where each topic is denoted with five words?
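
Several of the questions above can be answered with nothing more than counting & tabulating. The following Python is a minimal sketch under stated assumptions: the two-document corpus is made up for the example, the tokenizer is deliberately naive, and the variable names are illustrative only.

```python
import re
from collections import Counter
from statistics import mean

def tokenize(text):
    """Split lower-cased text into runs of letters; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

# hypothetical corpus; in practice, read each plain text file from disk
corpus = {
    "one.txt": "The pond was frozen. The ice on the pond was thick.",
    "two.txt": "The woods were quiet, and the pond was still.",
}
counts = {name: Counter(tokenize(text)) for name, text in corpus.items()}

# average length of documents, measured in tokens
average_length = mean(sum(c.values()) for c in counts.values())

# most common words in the corpus as a whole
total = sum(counts.values(), Counter())
most_common = total.most_common(3)

# words existing in every document, and words found only in the first document
in_every = set.intersection(*(set(c) for c in counts.values()))
unique_to_one = set(counts["one.txt"]) - set(counts["two.txt"])

print(average_length, most_common, sorted(in_every), sorted(unique_to_one))
```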

The answers to these questions bring to light a corpus’s previously unknown features enabling the reader to use & understand a corpus more fully. Given the answers to these sorts of questions, a person can learn when Don Quixote actually tilts at windmills, to what degree Thoreau’s Walden uses the word “ice” in the same breath as “pond”, or how the definition of “scientific practice” has evolved over time.

Given models created from the results of natural language processing, other characteristics (sentences, parts-of-speech, named entities, etc.) can be parsed. These values can also be counted & tabulated enabling the reader to answer new sets of questions:

How difficult is a document to read?
What is being discussed in a corpus? To what degree are the things the names of people, organizations, places, dates, money amounts, etc? What percentage of the personal pronouns are male, female, or neutral?
What is the action in a corpus? What things happen in a document? Are things explained? Said? Measured?
How are things in the corpus described? Overall, are the connotations positive or negative? Do the connotations evolve within a document?

The documents in a corpus are often associated with metadata such as authors, titles, dates, subjects/keywords, numeric rankings, etc. This metadata can be combined with measurements & observations to answer questions like:

How has the use of specific words/phrases waxed & waned over time?
To what degree do authors write about a given concept?
What are the significant words/phrases described with a given genre?
Are there correlations between words/phrases and a given document’s usefulness score?

Again, text mining is a process, and the process usually includes the following steps:

Articulating a research question
Amassing a corpus to study
Coercing the corpus into a form amenable to computer processing
Taking measurements and making observations
Analyzing the results and drawing conclusions

Articulating a research question can be as informally stated as, “I’d like to know more about this corpus” or “I’d like to garner an overview of the corpus before I begin reading it in earnest.” On the other hand, articulating a research question can be as formal as a dissertation’s thesis statement. The purpose of articulating a research question, no matter how formal, is to give you a context for your investigations. Knowing a set of questions to answer helps you determine what tools you will employ in your inquiries.

Creating a corpus is not always as easy as you might think. The corpus can be as small as a single document, or as large as millions. The “documents” in the corpus can be anything from tweets from a Twitter feed, Facebook postings, survey comments, magazine or journal articles, reference manuals, books, screenplays, musical lyrics, etc. The original documents may have been born digital or not. If not, then they will need to be digitized in one way or another. It is better if each item in the corpus is associated with metadata, such as authors, titles, dates, keywords, etc. Actually obtaining the documents may be impeded by copyrights, licensing restrictions, or hardware limitations. Once the corpus is obtained, it is useful to organize it into a coherent whole. There are a lot of possibilities when it comes to corpus creation.

Coercing a corpus into a form amenable to computer processing is a chore in and of itself. In all cases, the document’s text needs to be in “plain” text. This means the document includes only characters, numbers, punctuation marks, and a limited number of symbols. Plain text files include no graphical formatting. No bold. No italics, no “codes” denoting larger or smaller fonts, etc. Documents are usually manifested as files on a computer’s file system. The files are usually brought together as lists, and each item in the list has many attributes, namely the metadata describing each item. Furthermore, each document may need to be normalized, and normalization may include changing the case of all letters to lower case, parsing the document into words (usually called “features”), identifying the lemmas or stems of a word, eliminating stop/function words, etc. Coercing your corpus into a coherent whole is not to be underestimated. Remember the old adage, “Garbage in, garbage out.”
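
A few of the normalization steps described above can be sketched in Python. This is a minimal example, not a complete recipe: the stop word list is tiny and made up for illustration, the function name normalize is an assumption, and lemmatizing or stemming is left out because it requires additional tools.

```python
import re

# illustrative stop words only; real lists are much longer
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "was"}

def normalize(text, stop_words=STOP_WORDS):
    """Lower-case the text, parse it into words, and drop stop words."""
    lowered = text.lower()
    # runs of letters only; digits, punctuation, and underscores are discarded
    tokens = re.findall(r"[^\W\d_]+", lowered)
    return [token for token in tokens if token not in stop_words]

print(normalize("The ice in the pond was not yet dissolved."))
```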

Ironically, taking measurements and making observations is the easy part. There are a myriad of tools for this purpose, and the bulk of this workshop describes how to use them. One important note: it is imperative to format the measurements and observations in a way amenable to analysis. This usually means a tabular format where each column denotes a different observable characteristic. Without formatting measurements and observations in tabular formats, it will be difficult to chart and graph any results.
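
As a small illustration of such a tabular format, the following Python writes word frequencies as tab-delimited rows, one observation per row and one characteristic per column. The column names (document, word, frequency), the function name to_table, and the sample counts are assumptions chosen for the example.

```python
import csv
import io
from collections import Counter

def to_table(counts_by_document):
    """Return tab-delimited text with one row per document/word pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["document", "word", "frequency"])  # header row
    for document, counts in sorted(counts_by_document.items()):
        for word, frequency in counts.most_common():
            writer.writerow([document, word, frequency])
    return buffer.getvalue()

# hypothetical measurements
table = to_table({"walden.txt": Counter(["pond", "ice", "pond"])})
print(table, end="")
```

A table like this one can be opened directly in a spreadsheet or a charting application.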

Analyzing the results and drawing conclusions is the subprocess of looping back to Step #1. It is where you attempt to actually answer the questions previously asked. Keep in mind that human interpretation is a necessary part of this subprocess. Text mining does not present you with truth, only facts. It is up to you to interpret the facts. For example, suppose the month is January and the thermometer outside reads 32º Fahrenheit (0º Centigrade), then you might think nothing is amiss. On the other hand, suppose the month is August, and the thermometer still reads 32º, then what might you think? “It is really cold,” or maybe, “The thermometer is broken.” Either way, you bring context to the observations and interpret things accordingly. Text mining analysis works in exactly the same way.

Finally, text mining is not a replacement for the process of traditional reading. Instead, it ought to be considered as complementary, supplemental, and a natural progression of traditional reading. With the advent of ubiquitous globally networked computers, the amount of available data and information continues to grow exponentially. Text mining provides a means to “read” massive amounts of text quickly and easily. The process is akin to the inclusion of now-standard parts of a scholarly book: title page, verso complete with bibliographic and provenance metadata, table of contents, preface, introduction, divisions into sections and chapters, footnotes, bibliography, and a back-of-the-book index. All of these features make a book’s content more accessible. Text mining processes applied to books are the next step in accessibility. Text mining is often described as “distant” or “scalable” reading, and it is often contrasted with “close” reading. This is a false dichotomy, but only after text mining becomes more the norm will the dichotomy fade.

All that said, the totality of this hands-on workshop is based on the following outline:

What is text mining, and why should I care?
Creating a corpus
Creating a plain text version of a corpus with Tika
Using Voyant Tools to do some “distant” reading
Using a concordance, like AntConc, to facilitate searching keywords in context
Creating a simple word list with a text editor
Cleaning & analyzing word lists with OpenRefine
Charting & graphing word lists with Tableau Public
Increasing meaning by extracting parts-of-speech with the Stanford POS Tagger
Increasing meaning by extracting named entities with the Stanford NER
Identifying themes and clustering documents using MALLET

By the end of the workshop you will have increased your ability to:

identify patterns, anomalies, and trends in a corpus
practice both “distant” and “scalable” reading
enhance & complement your ability to do “close” reading
use & understand any corpus of poetry or prose

The workshop is operating system agnostic, and all the software is freely available on the ‘Net, or already installed on your computer. Active participation requires zero programming, but readers must bring their own computer, and they must be willing to learn how to use a text editor such as NotePad++ or BBEdit. NotePad, WordPad and TextEdit are totally insufficient.

How to do text mining in 69 words

Eric Lease Morgan — Tue, 15 Aug 2017 13:38:34 +0000

Doing just about any type of text mining is a matter of: 0) articulating a research question, 1) acquiring a corpus, 2) cleaning the corpus, 3) coercing the corpus into a data structure one’s software can understand, 4) counting & tabulating characteristics of the corpus, and 5) evaluating the results of Step #4. Everybody wants to do Step #4 & #5, but the initial steps usually take more time than desired.

Achieving perfection

Eric Lease Morgan — Fri, 03 Jun 2016 09:48:09 +0000

Through the use of the Levenshtein algorithm, I am achieving perfection when it comes to searching VIAF. Well, almost.

I am making significant progress with VIAF Finder [0], but now I have exploited the use of the Levenshtein algorithm. In fact, I believe I am now able to programmatically choose VIAF identifiers for more than 50 or 60 percent of the authority records.

The Levenshtein algorithm measures the “distance” between two strings. [1] This distance is really the number of keystrokes necessary to change one string into another. For example, the distance between “eric” and “erik” is 1. Similarly the distance between “Stefano B” and “Stefano B.” is still 1. Along with a colleague (Stefano Bargioni), I took a long, hard look at the source code of an OpenRefine reconciliation service which uses VIAF as the backend database. [2] The code included the calculation of a ratio to denote the relative distance of two strings. This ratio is the length of the longest string minus the Levenshtein distance, divided by the length of the longest string. From the first example, the distance is 1 and the length of the string “eric” is 4, thus the ratio is (4 – 1) / 4, which equals 0.75. In other words, 75% of the characters are correct. In the second example, “Stefano B.” is 10 characters long, and thus the ratio is (10 – 1) / 10, which equals 0.9. In other words, the second example is more correct than the first example.
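
The distance and the ratio can be sketched in Python as follows. This is a textbook dynamic-programming version written for illustration; it is not the code of VIAF Finder nor of the reconciliation service, and the function names levenshtein and ratio are assumptions.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def ratio(a, b):
    """(length of longest string - distance) / length of longest string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return (longest - levenshtein(a, b)) / longest

print(levenshtein("eric", "erik"), ratio("eric", "erik"))
print(levenshtein("Stefano B", "Stefano B."), ratio("Stefano B", "Stefano B."))
```

A ratio of 1.0 denotes an exact match, and the closer the ratio is to 0, the more the two strings differ.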

Using the value of MARC 1xx$a of an authority file, I can then query VIAF. The SRU interface returns 0 or more hits. I can then compare my search string with the search results to create a ranked list of choices. Based on this ranking, I am able to more intelligently choose VIAF identifiers. For example, from my debugging output, if I get 0 hits, then I do nothing:

       query: Lucariello, Donato
        hits: 0

If I get too many hits, then I still do nothing:

       query: Lucas Lucas, Ramón
        hits: 18
     warning: search results out of bounds; consider increasing MAX

If I get 1 hit, then I automatically save the result, which seems to be correct/accurate most of the time, even though the Levenshtein distance may be large:

       query: Lucaites, John Louis
        hits: 1
       score: 0.250     John Lucaites (57801579)
      action: perfection achieved (updated name and id)

If I get many hits, and one of them exactly matches my query, then I “achieved perfection” and I save the identifier:

       query: Lucas, John Randolph
        hits: 3
       score: 1.000     Lucas, John Randolph (248129560)
       score: 0.650     Lucas, John R. 1929- (98019197)
       score: 0.500     Lucas, J. R. 1929- (2610145857009722920913)
      action: perfection achieved (updated name and id)

If I get many hits, and many of them are exact matches, then I simply use the first one (even though it might not be the “best” one):

       query: Lucifer Calaritanus
        hits: 5
       score: 1.000     Lucifer Calaritanus (189238587)
       score: 1.000     Lucifer Calaritanus (187743694)
       score: 0.633     Luciferus Calaritanus -ca. 370 (1570145857019022921123)
       score: 0.514     Lucifer Calaritanus gest. 370 n. Chr. (798145857991023021603)
       score: 0.417     Lucifer, Bp. of Cagliari, d. ca. 370 (64799542)
      action: perfection achieved (updated name and id)

If I get many hits, and none of them are perfect, but the ratio is above a configured threshold (0.949), then that is good enough for me (even if the selected record is not the “best” one):

       query: Palanque, Jean-Remy
        hits: 5
       score: 0.950     Palanque, Jean-Rémy (106963448)
       score: 0.692     Palanque, Jean-Rémy, 1898- (46765569)
       score: 0.667     Palanque, Jean Rémy, 1898- (165029580)
       score: 0.514     Palanque, J. R. (Jean-Rémy), n. 1898 (316408095)
       score: 0.190     Marrou-Davenson, Henri-Irénée, 1904-1977 (2473942)
      action: perfection achieved (updated name and id)
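
The selection rules illustrated by the debugging output above can be summarized in a short Python sketch. It is not the Perl code of VIAF Finder. The function name choose, the (score, name, identifier) tuples, and the default of 10 for the maximum number of hits are assumptions; only the threshold of 0.949 comes from the description above.

```python
def choose(hits, max_hits=10, threshold=0.949):
    """Return the selected (score, name, identifier) hit, or None if nothing is selected.

    max_hits is a hypothetical stand-in for the configured MAX value.
    """
    if not hits or len(hits) > max_hits:
        return None     # zero hits, or search results out of bounds
    if len(hits) == 1:
        return hits[0]  # a single hit is accepted regardless of its score
    # sorted() is stable, so the first of several exact matches stays first
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    best = ranked[0]
    if best[0] > threshold:
        return best     # an exact match (1.0) or one that is good enough
    return None

hits = [
    (1.000, "Lucas, John Randolph", "248129560"),
    (0.650, "Lucas, John R. 1929-", "98019197"),
    (0.500, "Lucas, J. R. 1929-", "2610145857009722920913"),
]
print(choose(hits))
```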

By exploiting the Levenshtein algorithm, and by learning from the good work of others, I have been able to programmatically select VIAF identifiers for more than half of my authority records. When one has as many as 120,000 records to process, this is a good thing. Moreover, this use of the Levenshtein algorithm seems to produce more complete results when compared to the VIAF AutoSuggest API. AutoSuggest identified approximately 20 percent of my VIAF identifiers, while my Levenshtein algorithm/logic identifies more than 40 or 50 percent. AutoSuggest is much faster though. Much.

Fun with the intelligent use of computers, and think of the possibilities.

[0] VIAF Finder – ./../2016/05/viaf-finder/index.html

[1] Levenshtein – http://bit.ly/1Wz3qZC

[2] reconciliation service – https://github.com/codeforkjeff/refine_viaf

VIAF Finder

Eric Lease Morgan — Fri, 27 May 2016 13:34:13 +0000

This posting describes VIAF Finder. In short, given the values from MARC fields 1xx$a, VIAF Finder will try to find and record a VIAF identifier. [0] This identifier, in turn, can be used to facilitate linked data services against authority and bibliographic data.

Quick start

Here is the way to quickly get started:

download and uncompress the distribution to your Unix-ish (Linux or Macintosh) computer [1]
put a file of MARC records named authority.mrc in the ./etc directory, and the file name is VERY important
from the root of the distribution, run ./bin/build.sh

VIAF Finder will then commence to:

create a “database” from the MARC records, and save the result in ./etc/authority.db
use the VIAF API (specifically the AutoSuggest interface) to identify VIAF numbers for each record in your database, and if numbers are identified, then the database will be updated accordingly [2]
repeat Step #2 but through the use of the SRU interface
repeat Step #3 but limiting searches to authority records from the Vatican
repeat Step #3 but limiting searches to the authority named ICCU
done

Once done the reader is expected to programmatically loop through ./etc/authority.db to update the 024 fields of their MARC authority data.

Manifest

Here is a listing of the VIAF Finder distribution:

00-readme.txt – this file
bin/build.sh – “One script to rule them all”
bin/initialize.pl – reads MARC records and creates a simple “database”
bin/make-dist.sh – used to create a distribution of this system
bin/search-simple.pl – rudimentary use of the SRU interface to query VIAF
bin/search-suggest.pl – rudimentary use of the AutoSuggest interface to query VIAF
bin/subfield0to240.pl – sort of demonstrates how to update MARC records with 024 fields
bin/truncate.pl – extracts the first n number of MARC records from a set of MARC records, and useful for creating smaller, sample-sized datasets
etc – the place where the reader is expected to save their MARC files, and where the database will (eventually) reside
lib/subroutines.pl – a tiny set of… subroutines used to read and write against the database

Usage

If the reader hasn’t figured it out already, in order to use VIAF Finder, the Unix-ish computer needs to have Perl and various Perl modules — most notably, MARC::Batch — installed.

If the reader puts a file named authority.mrc in the ./etc directory, and then runs ./bin/build.sh, then the system ought to run as expected. A set of 100,000 records over a wireless network connection will finish processing in a matter of many hours, if not the better part of a day. Speed will be increased over a wired network, obviously.

But in reality, most people will not want to run the system out of the box. Instead, each of the individual tools will need to be run individually. Here’s how:

save a file of MARC (authority) records anywhere on your file system
not recommended, but optionally edit the value of DB in bin/initialize.pl
run ./bin/initialize.pl feeding it the name of your MARC file, as per Step #1
if you edited the value of DB (Step #2), then edit the value of DB in bin/search-suggest.pl, and then run ./bin/search-suggest.pl
if you want to possibly find more VIAF identifiers, then repeat Step #4 but with ./bin/search-simple.pl and with the “simple” command-line option
optionally repeat Step #5, but this time use the “named” command-line option, and the possible named values are documented as a part of the VIAF API (e.g., “bav” denotes the Vatican)
optionally repeat Step #6, but with other “named” values
optionally repeat Step #7 until you get tired
once you get this far, the reader may want to edit bin/build.sh, specifically configuring the value of MARC, and running the whole thing again — “one script to rule them all”
done

A word of caution is now in order. VIAF Finder reads & writes to its local database. To do so it slurps up the whole thing into RAM, updates things as processing continues, and periodically dumps the whole thing just in case things go awry. Consequently, if you want to terminate the program prematurely, try to do so a few steps after the value of “count” has reached the maximum (500 by default). A few times I have prematurely quit the application at the wrong time and blown my whole database away. This is the cost of having a “simple” database implementation.

To do

Alas, search-simple.pl contains a memory leak. Search-simple.pl makes use of the SRU interface to VIAF, and my SRU queries return XML results. Search-simple.pl then uses the venerable XML::XPath Perl module to read the results. Well, after a few hundred queries the totality of my computer’s RAM is taken up, and the script fails. One work-around would be to request the SRU interface to return a different data structure. Another solution is to figure out how to destroy the XML::XPath object. Incidentally, because of this memory leak, the integer fed to search-simple.pl was implemented, allowing the reader to restart the process at a different point in the dataset. Hacky.

Database

The use of the database is key to the implementation of this system, and the database is really a simple tab-delimited table with the following columns:

id (MARC 001)
tag (MARC field name)
_1xx (MARC 1xx)
a (MARC 1xx$a)
b (MARC 1xx$b and usually empty)
c (MARC 1xx$c and usually empty)
d (MARC 1xx$d and usually empty)
l (MARC 1xx$l and usually empty)
n (MARC 1xx$n and usually empty)
p (MARC 1xx$p and usually empty)
t (MARC 1xx$t and usually empty)
x (MARC 1xx$x and usually empty)
suggestions (a possible sublist of names, Levenshtein scores, and VIAF identifiers)
viafid (selected VIAF identifier)
name (authorized name from the VIAF record)

Most of the fields will be empty, especially fields b through x. The intention is/was to use these fields to enhance or limit SRU queries. Field #13 (suggestions) is for future, possible use. Field #14 is key, literally. Field #15 is a possible replacement for MARC 1xx$a. Field #15 can also be used as a sort of sanity check against the search results. “Did VIAF Finder really identify the correct record?”
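
Because the database is a plain tab-delimited table, it can be read with just about any programming language. The following Python is a minimal sketch assuming the file has no header row and that its columns appear in the order listed above; the sample row is made up for illustration, and the function name read_database is an assumption.

```python
import csv
import io

# column names in the order listed above
FIELDS = ["id", "tag", "_1xx", "a", "b", "c", "d", "l", "n", "p", "t", "x",
          "suggestions", "viafid", "name"]

def read_database(handle):
    """Read tab-delimited rows into a list of dictionaries keyed by column name."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)

# hypothetical row; a real database would be opened with something like
# open("./etc/authority.db", newline="", encoding="utf-8")
row = ["0001", "100", "Lucas, John Randolph", "Lucas, John Randolph"]
row += [""] * 9                                  # b through x, plus suggestions
row += ["248129560", "Lucas, John Randolph"]     # viafid and name
sample = io.StringIO("\t".join(row) + "\n")

for record in read_database(sample):
    if record["viafid"]:
        # candidates for updating the 024 fields of the MARC authority data
        print(record["id"], record["viafid"], record["name"], sep="\t")
```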

Consider pouring the database into your favorite text editor, spreadsheet, database, or statistical analysis application for further investigation. For example, write a report against the database allowing the reader to see the details of the local authority record as well as the authority data in VIAF. Alternatively, open the database in OpenRefine in order to count & tabulate variations of the data it contains. [4] Your eyes will widen, I assure you.

Commentary

First, this system was written during my “artist’s education adventure” which included a three-month stint in Rome. More specifically, this system was written for the good folks at Pontificia Università della Santa Croce. “Thank you, Stefano Bargioni, for the opportunity, and we did some very good collaborative work.”

Second, I first wrote search-simple.pl (SRU interface) and I was able to find VIAF identifiers for about 20% of my given authority records. I then enhanced search-simple.pl to include limitations to specific authority sets. I then wrote search-suggest.pl (AutoSuggest interface), and not only was the result many times faster, but the result was just as good, if not better, than the previous result. This felt like two steps forward and one step back. Consequently, the reader may not ever need nor want to run search-simple.pl.

Third, while the AutoSuggest interface was much faster, I was not able to determine how suggestions were made. This makes the AutoSuggest interface seem a bit like a “black box”. One of my next steps, during the copious spare time I still have here in Rome, is to investigate how to make my scripts smarter. Specifically, I hope to exploit the use of the Levenshtein distance algorithm. [5]

Finally, I would not have been able to do this work without the “shoulders of giants”. Specifically, Stefano and I took long & hard looks at the code of people who have done similar things. For example, the source code of Jeff Chiu’s OpenRefine Reconciliation service demonstrates how to use the Levenshtein distance algorithm. [6] And we found Jakob Voß’s viaflookup.pl useful for pointing out AutoSuggest as well as elegant ways of submitting URLs to remote HTTP servers. [7] “Thanks, guys!”

Fun with MARC-based authority data!

Links

[0] VIAF – http://viaf.org/

[1] VIAF Finder distribution – http://infomotions.com/sandbox/pusc/etc/viaf-finder.tar.gz

[2] VIAF API – http://www.oclc.org/developer/develop/web-services/viaf.en.html

[4] OpenRefine – http://openrefine.org/

[5] Levenshtein distance – https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance

[6] Chiu’s reconciliation service – https://github.com/codeforkjeff/refine_viaf

[7] Voß’s viaflookup.pl – https://gist.github.com/nichtich/832052/3274497bfc4ae6612d0c49671ae636960aaa40d2

Making stone soup: Working together for the advancement of learning and teaching

Eric Lease Morgan — Mon, 09 May 2016 12:26:41 +0000

It is simply not possible for any of us to do our jobs well without the collaboration of others. Yet specialization abounds, jargon proliferates, and professional silos are everywhere. At the same time we all have a shared goal: to advance learning and teaching. How are we to balance these two seemingly conflicting characteristics in our workplace? How can we satisfy the demands of our day-to-day jobs and at the same time contribute to the work of others? ‡

The answer is not technical but instead rooted in what it means to be a part of a holistic group of people. The answer is rooted in things like the abilities to listen, to share, to learn, to go beyond tolerance and towards respect, to take a sincere interest in the other person’s point of view, to discuss, and to take to heart the idea that nobody really sees the whole picture.

As people — members of the human race — we form communities with both our strengths & our weaknesses, with things we know would benefit the group & things we would rather not share, with both our beauties and our blemishes. This is part of what it means to be people. There is no denying it, and if we try, then we are only being less of who we really are. To deny it is an unrealistic expectation. We are not gods. We are not actors. We are people, and being people — real people — is a good thing.

Within any community, there are norms of behavior. Without norms of behavior, there is really no community, only chaos and anarchy. In anarchy and chaos, physical strength is oftentimes the defining characteristic of decision-making, but when the physically strong are outnumbered by the emotionally mature and intellectually aware, then chaos and anarchy are overthrown for a more holistic set of decision-making processes. Examples include democracy, consensus building, and even the possibility of governance through benevolent dictatorship.

A community’s norms are both written and unwritten. Our workplaces are good examples of such communities. On one hand there may be policies & procedures, but these policies & procedures usually describe workflows, the methods used to evaluate employees, or to some extent strategic plans. They might outline how meetings are conducted or how teams are to accomplish their goals. On the other hand, these policies & procedures do not necessarily outline how to talk with fellow employees around the virtual water cooler, how to write email messages, nor how to greet each other on a day-to-day basis. Just as importantly, our written norms of behavior do not describe how to treat and communicate with people outside one’s own set of personal expertise. Don’t get me wrong. This does not mean I am advocating written norms for such things, but such things do need to be discussed and agreed upon. Such are the beginnings of stone soup.

Increasingly we seem to work in disciplines of specialization, and these specializations, necessarily, generate their own jargon. “Where have all the generalists gone? Considering our current environment, is it really impossible to be a Renaissance Man^h^h^h Person?” Increasingly, the answer to the first question is, “The generalists have gone the way of Leonardo da Vinci.” And the answer to the second question is, “Apparently so.”

For example, some of us lean more towards formal learning, teaching, research, and scholarship. These are the people who have thoroughly studied and now teach a particular academic discipline. These same people have written dissertations, which, almost by definition, are very specialized, if not unique. They live in a world pursuant of truth while balancing the worlds of rigorous scholarly publishing and student counseling.

There are those among us who thoroughly know the ins and outs of computer technology. These people can enumerate the differences between a word processor and a text editor. They can compare & contrast operating systems. These people can configure & upgrade software. They can make computers communicate on the Internet. They can trouble-shoot computer problems when the computers seem — for no apparent reason — to just break.

Finally, there are those among us who specialize in the collection, organization, preservation, and dissemination of data, information, and knowledge. These people identify bodies of content, systematically describe it, make every effort to preserve it for posterity, and share it with their respective communities. These people deal with MARC records, authority lists, and subject headings.

Despite these truisms, we — our communities — need to figure out how to work together, how to bridge the gaps in our knowledge (a consequence of specialization), and how to achieve our shared goals. This is an aspect of our metaphoric stone soup.

So now the problem can be re-articulated. We live and work in communities of unwritten and poorly articulated norms. To complicate matters, because of our specializations, we all approach our situations from different perspectives and use different languages to deal with the situations. As I was discussing this presentation with a dear friend & colleague, the following poem attributed to Prissy Galagarian was brought to my attention†, and it eloquently states the imperative:

  The Person Next to You

  The person next to you is the greatest miracle
   and the greatest mystery you will ever
   meet at this moment.

  The person next to you is an inexhaustible
   reservoir of possibility,
   desire and dread,
   smiles and frowns, laughter and tears,
   fears and hopes,
   all struggling to find expression.

  The person next to you believes in something,
   stands for something, counts for something,
   lives for something, labors for something,
   waits for something, runs from something,
   runs to something.

  The person next to you has problems and fears,
   wonders how they're doing,
   is often undecided and unorganized
   and painfully close to chaos!
   Do they dare speak of it to you?

  The person next to you can live with you
   not just alongside you,
   not just next to you.

  The person next to you is a part of you.
   for you are the person next to them.

How do we overcome these impediments in order to achieve our mutual goals of the workplace? The root of the answer lies in our ability to really & truly respect our fellow employees.

Working together towards a shared goal is a whole lot like making “stone soup”. Do you know the story of “stone soup”? A man comes into a village, and asks the villagers for food. Every time he asks he is told that there is nothing to give. Despite an apparent lack of anything, the man sets up a little fire, puts a pot of water on, and drops a stone into the pot. Curious people come by, and they ask, “What are you doing?” He says, “I’m making stone soup, but I think it needs a bit of flavor.” Wanting to participate, people begin to add their own things to the soup. “I think I have some carrots,” says one villager. “I believe I have a bit of celery,” says another. Soon the pot is filled with bits of this and that and the other thing: onions, salt & pepper, a beef bone, a few tomatoes, a couple of potatoes, etc. In the end, a rich & hearty stew is made, enough for everybody to enjoy. Working together, without judgement nor selfishness, the end result is a goal well-accomplished.

This can happen in the workplace as well. It can happen in our community where the goal is teaching & learning. And in the spirit of cooking, here’s a sort of recipe:

Understand that you do not have all the answers, and in fact, nobody does; nobody has the complete story nor sees the whole picture. Only after working on a task, and completing it at least once, will a holistic perspective begin to develop.
Understand that nobody’s experience is necessarily more important than the others’, including your own. Everybody has something to offer, and while your skills & expertise may be imperative to success, so are the skills & expertise of others. And if there is an established hierarchy within your workplace, understand that the hierarchy is all but arbitrary, and maintained by people with an over-developed sense of power. We all have more things in common than differences.
Spend the time to get to know your colleagues, and come to a sincere appreciation of who they are as a person as well as a professional. This part of the “recipe” may include formal or informal social events inside or outside the workplace. Share a drink or a meal. Take a walk outside or through the local museum. Do this in groups of two or more. Such activities provide a way for everybody involved to reflect upon an outside stimulus. Through this process the interesting characteristics of the others will become apparent. Appreciate these characteristics. Do not judge them, but rather respect them.
Remember, listening is a wonderful skill, and when the other person talks for a long time, they will go away thinking they had a wonderful conversation. Go beyond hearing what a person says. Internalize what they say. Ask meaningful & constructive questions, and speak their name frequently during discussions. These things will demonstrate your true intentions. Through this process the others will become a part of you, and you will become a part of them.
Combine the above ingredients, bring them to a boil, and then immediately lower the temperature allowing everything to simmer for a good long time. Keeping the pot boiling will only overheat the soup and make a mess. Simmering will keep many of the individual parts intact, enable the flavors to mellow, and give you time to set the table for the next stage of the process.

Finally, making stone soup does not require fancy tools. A cast iron pot will work just as well as one made from aluminium or teflon. What is needed is a container large enough to hold the ingredients and withstand the heat. It doesn’t matter whether or not the heat source is gas, electric, or fire. It just has to be hot enough to allow boiling and then simmering. Similarly, stone soup in the workplace does not require Google Drive, Microsoft Office 365, nor any type of wiki. Sure, those things can facilitate project work, but they are not the means for getting to know your colleagues. Only through personal interaction will such knowledge be garnered.

Working together for the advancement of learning & teaching — or just about any other type of project work — is a lot like making stone soup. Everybody contributes a little something, and the result is a nourishing meal for all.

‡ This essay was written as a presentation for the AMICAL annual conference which took place in Rome (May 12-14, 2016), and this essay is available as a one-page handout.

† http://fraternalthoughts.blogspot.it/2011/02/person-next-to-you.html

Editing authorities at the speed of four records per minute

Eric Lease Morgan — Thu, 07 Apr 2016 07:52:37 +0000

This missive outlines and documents an automated process I used to “cleanup” and “improve” a set of authority records, or, to put it another way, how I edited authorities at the speed of four records per minute.

As you may or may not know, starting in September 2015, I commenced upon a sort of “leave of absence” from my employer.† This leave took me to Tuscany, Venice, Rome, Provence, Chicago, Philadelphia, Boston, New York City, and back to Rome. In Rome I worked for the American Academy in Rome doing short-term projects in the library. The first project revolved around authority records. More specifically, the library’s primary clientele were Americans, but the catalog’s authority records included a smattering of Italian headings. The goal of the project was to automatically convert as many of the “invalid” Italian headings into “authoritative” Library of Congress headings.

Identify “invalid” headings

When I first got to Rome I had the good fortune to hang out with Terry Reese, the author of the venerable MarcEdit.‡ He was there giving workshops. I participated in the workshops. I listened, I learned, and I was grateful for a Macintosh-based version of Terry’s application.

When the workshops were over and Terry had gone home I began working more closely with Sebastian Hierl, the director of the Academy’s library.❧ Since the library was relatively small (about 150,000 volumes), and because the Academy used Koha for its integrated library system, it was relatively easy for Sebastian to give me the library’s entire set of 124,000 authority records in MARC format. I fed the authority records into MarcEdit, and ran a report against them. Specifically, I asked MarcEdit to identify the “invalid” records, which really means, “Find all the records not found in the Library of Congress database.” The result was a set of approximately 18,000 records or approximately 14% of the entire file. I then used MarcEdit to extract the “invalid” records from the complete set, and this became my working data.

Search & download

I next created a rudimentary table denoting the “invalid” records and the subsequent search results for them. This tab-delimited file included values of MARC field 001, MARC field 1xx, an integer denoting the number of times I searched for a matching record, an integer denoting the number of records I found, an identifier denoting a Library of Congress authority record of choice, and a URL providing access to the remote authority record. This table was initialized using a script called authority2list.pl. Given a file of MARC records, it outputs the table.
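As an illustration only, the shape of such a table can be sketched in Python. This is not authority2list.pl itself; the column labels, the function name `initialize_table`, and the sample heading are my own assumptions, and the sketch starts from (001, 1xx) pairs rather than from a real MARC file.

```python
import csv
import io

# Hypothetical labels for the six values described above.
COLUMNS = ["id", "heading", "searches", "hits", "lc_id", "url"]

def initialize_table(headings):
    """Given (001, 1xx) pairs, return a tab-delimited table with empty search results."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for identifier, heading in headings:
        # Nothing has been searched yet: zero searches, zero hits, no identifier, no URL.
        writer.writerow([identifier, heading, 0, 0, "", ""])
    return buffer.getvalue()

# A made-up local record, used only to show the layout.
table = initialize_table([("a001", "Alighieri, Dante, 1265-1321")])
print(table)
```

Each later step then only has to read a row, update the counts and the identifier, and write the row back out.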

I then systematically searched the Library of Congress for authority headings. This was done with a script called search.pl. Given the table created in the previous step, this script looped through each authority, did a rudimentary search for a valid entry, and output an updated version of the table. This script was a bit “tricky”.❦ It first searched the Library of Congress by looking for the value of MARC 1xx$a. If no records were found, then no updating was done and processing continued. If one record was found, then the Library of Congress identifier was saved to the output and processing continued. If many records were found, then a more limiting search was done by adding a date value extracted from MARC 1xx$d. Depending on the second search result, the output was updated (or not), and processing continued. Out of the original 18,000 “invalid” records, about 50% of them were identified with no (zero) Library of Congress records, about 30% were associated with multiple headings, and the remaining 20% (approximately 3,600 records) were identified with one and only one Library of Congress authority record.
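The decision logic described above might be sketched as follows. This is a simplification in Python, not the original search.pl: the `search` argument stands in for the real query against the Library of Congress, and the names, identifiers, and canned results below are entirely made up for the sake of the example.

```python
def choose_authority(name, date, search):
    """Return the one matching identifier, or None when zero or many records match.

    `search` is a hypothetical callable: query string in, list of identifiers out.
    """
    hits = search(name)                      # first pass: the value of 1xx$a only
    if len(hits) == 1:
        return hits[0]
    if len(hits) > 1 and date:
        hits = search(f"{name} {date}")      # second pass: narrow with the date from 1xx$d
        if len(hits) == 1:
            return hits[0]
    return None                              # nothing found, or still ambiguous

# Canned search results standing in for the remote service.
FAKE_INDEX = {
    "Rossi, Mario": ["n-1", "n-2"],
    "Rossi, Mario 1901-1980": ["n-2"],
    "Bianchi, Anna": ["n-3"],
}

def fake_search(query):
    return FAKE_INDEX.get(query, [])

print(choose_authority("Bianchi, Anna", "", fake_search))
print(choose_authority("Rossi, Mario", "1901-1980", fake_search))
print(choose_authority("Verdi, Luca", "", fake_search))
```

Passing the search function in as an argument keeps the logic testable without touching the network.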

I now had a list of 3,600 “valid” authority records, and I needed to download them. This was done with a script called harvest.pl. This script is really a wrapper around a program called GNU Wget. Given my updated table, the script looped through each row, and if it contained a URL pointing to a Library of Congress authority record, then the record was cached to the file system. Since the downloaded records were formatted as MARCXML, I then needed to transform them into MARC21. This was done with a pair of scripts: xml2marc.sh and xml2marc.pl. The former simply looped through each file in a directory, and the latter did the actual transformation but along the way updated MARC 001 to the value of the local authority record.

Verify and merge

In order to allow myself as well as others to verify that correct records had been identified, I wrote another pair of programs: marc2compare.pl and compare2html.pl. Given two MARC files, marc2compare.pl created a list of identifiers, original authority values, proposed authority values, and URLs pointing to full descriptions of each. This list was intended to be poured into a spreadsheet for compare & contrast purposes. The second script, compare2html.pl, simply took the output of the first and transformed it into a simple HTML page making it easier for a librarian to evaluate correctness.

Assuming the 3,600 records were correct, the next step was to merge/overlay the old records with the new records. This was a two-step process. The first step was accomplished with a script called rename.pl. Given two MARC files, rename.pl first looped through the set of new authorities saving each identifier to memory. It then looped through the original set of authorities looking for records to update. When records to update were found, each was marked for deletion by prefixing MARC 001 with “x-”. The second step employed MarcEdit to actually merge the set of new authorities with the original authorities. Consequently, the authority file increased in size by 3,600 records. It was then up to other people to load the authorities into Koha, re-evaluate the authorities for correctness, and if everything was okay, then delete each authority record prefixed with “x-”.
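The two steps can be illustrated with a small Python sketch. It is not rename.pl, and the merge here is a plain list concatenation rather than anything MarcEdit does. Records are modeled as dictionaries keyed by MARC tag, which is an assumption made for brevity, and the sample records are invented.

```python
def mark_for_deletion(originals, replacements, prefix="x-"):
    """Prefix the 001 of every original record that has a replacement."""
    new_ids = {record["001"] for record in replacements}   # identifiers saved to memory
    marked = []
    for record in originals:
        if record["001"] in new_ids:
            record = {**record, "001": prefix + record["001"]}
        marked.append(record)
    return marked

def merge(originals, replacements):
    """Keep every original (some now marked) and append the new records."""
    return mark_for_deletion(originals, replacements) + list(replacements)

originals = [{"001": "1", "100": "old heading A"}, {"001": "2", "100": "old heading B"}]
replacements = [{"001": "2", "100": "new heading B"}]

merged = merge(originals, replacements)
print([record["001"] for record in merged])
```

Because nothing is deleted at this stage, the merged file grows by exactly the number of replacement records, which matches the growth described above.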

Done.❀

Summary and possible next steps

In summary, this is how things happened. I:

got a complete dump of the 123,329 original authority records
extracted 17,593 “invalid” records
searched LOC for “valid” records and found 3,627 of them
harvested the found records
prefixed the 3,627 001 fields in the original file with “x-”
merged the original authority records with the harvested records
made the new set of 126,956 updated records available

There were many possible next steps. One possibility was to repeat the entire process but with an enhanced search algorithm. This could be difficult considering the fact that searches using merely the value of 1xx$a returned zero hits for half of the working data. A second possibility was to identify authoritative records from a different system such as VIAF or Worldcat. Even if this was successful, I wonder how possible it would have been to actually download authority records as MARC. A third possibility was to write a sort of disambiguation program allowing librarians to choose from a set of records. This could have been accomplished by searching for authorities, presenting possibilities, allowing librarians to make selections via an HTML form, caching the selections, and finally, batch updating the master authority list. Here at the Academy we denoted the last possibility as the “cool” one.

Now here’s an interesting way to look at the whole thing. This process took me about two weeks’ worth of work, and in those two weeks I processed 18,000 authority records. That comes out to 9,000 records/week. There are 40 hours in a work week, and consequently, I processed 225 records/hour. Each hour is made up of 60 minutes, and therefore I processed approximately 4 records/minute, and that is 1 record every fifteen seconds for the last two weeks. Wow!?
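For the curious, the arithmetic can be written out in a few lines of Python. The variable names are mine. Without rounding, the rate is 3.75 records/minute, or one record every 16 seconds, which rounds to the figures given above.

```python
records = 18_000
weeks = 2
hours_per_week = 40

per_week = records / weeks              # 9000.0 records per week
per_hour = per_week / hours_per_week    # 225.0 records per hour
per_minute = per_hour / 60              # 3.75 records per minute
seconds_per_record = 60 / per_minute    # 16.0 seconds per record

print(per_week, per_hour, per_minute, seconds_per_record)
```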

Finally, I’d like to thank the Academy (with all puns intended). Sebastian, his colleagues, and especially my office mate (Kristine Iara) were all very supportive throughout my visit. They provided intellectual stimulation and something to do while I contemplated my navel during the “adventure”.

Notes

† Strictly speaking, my adventure was not a sabbatical nor a leave of absence because: 1) as a librarian I was not authorized to take a sabbatical, and 2) I did not have any healthcare issues. Instead, after bits of negotiation, my contract was temporarily changed from full-time faculty to adjunct faculty, and I worked for my employer 20% of the time. The other 80% of time was spent on my “adventure”. And please don’t get me wrong, this whole thing was a wonderful opportunity for which I will be eternally grateful. “Thank you!”

‡ During our overlapping times there in Rome, Terry & I played tourist which included the Colosseum, a happenstance mass at the Pantheon, a Palm Sunday Mass in St. Peter’s Square with tickets generously given to us by Joy Nelson of ByWater Solutions, and a day-trip to Florence. Along the way we discussed librarianship, open source software, academia, and life in general. A good time was had by all.

❧ Ironically, Sebastian & I were colleagues during the dot-com boom when we both worked at North Carolina State University. The world of librarianship is small.

❦ This script — search.pl — was really a wrapper around an application called curl, and thanks go to Jeff Young of OCLC who pointed me to the ATOM interface of the LC Linked Data Service. Without Jeff’s helpful advice, I would have wrestled with OCLC’s various authentication systems and Web Service interfaces.

❀ Actually, I skipped a step in this narrative. Specifically, there are some records in the authority file that were not expected to be touched, even if they are “invalid”. This set of records was associated with a specific call number pattern. Two scripts (fu-extract.pl and fu-remove.pl) did the work. The first extracted a list of identifiers not to touch and the second removed them from my table of candidates to validate.

Failure to communicate

Eric Lease Morgan — Tue, 22 Mar 2016 10:35:50 +0000

In my humble opinion, what we have here is a failure to communicate.

Libraries, especially larger libraries, are increasingly made up of many different departments, including but not limited to departments such as: cataloging, public services, collections, preservation, archives, and now-a-days departments of computer staff. From my point of view, these various departments fail to see the similarities between themselves, and instead focus on their differences. This focus on the differences is amplified by the use of dissimilar vocabularies and subdiscipline-specific jargon. This use of dissimilar vocabularies causes a communications gap and, left unresolved, ultimately creates animosity between groups. I believe this is especially true between the more traditional library departments and the computer staff. This communications gap is an impediment when it comes to achieving the goals of librarianship, and any library — whether it be big or small — needs to address these issues lest it waste both its time and money.

Here are a few examples outlining failures to communicate:

MARC – MARC is a data structure. The first 24 characters are called the leader. The second section is called the directory, and the third section is intended to contain bibliographic data. The whole thing is sprinkled with ASCII characters 30, 31, and 29 denoting the ends of fields, the starts of subfields, and the end of the record itself. MARC does not denote the kinds of data it contains. Yet, many catalogers say they know MARC. Instead, what they really know are sets of rules defining what goes into the first and third sections of the data structure. These rules are known as AACR2/RDA. Computer staff see MARC (and MARCXML) as a data structure. Librarians see MARC as the description of an item akin to a catalog card.
Databases & indexes – Databases & indexes are two sides of the same information retrieval coin. “True” databases are usually relational in nature and normalized accordingly. “False” databases are flat files — simple tables akin to Excel spreadsheets. Librarians excel (no puns intended) at organizing information, and this usually manifests itself through the creation of various lists. Lists of books. Lists of journals. Lists of articles. Lists of authoritative names. Lists of websites. Etc. In today’s world, the most scalable way to maintain lists is through the use of a database, yet most librarians wouldn’t be able to draw an entity relationship diagram — the literal illustration of a database’s structure — to save their lives. With advances in computer technology, the problem of find is no longer solved through the searching of databases but instead through the creation of an index. In reality, modern indexes are nothing more than enhancements of traditional back-of-the-book indexes — lists of words and associated pointers to where those words can be found in a corpus. Computer staff see databases as MySQL and indexes as Solr. Librarians see databases as a matrix of rows & columns, and the searching of databases in a light of licensed content such as JSTOR, Academic Search Premier, or New York Times.
Collections – Collections, from the point of view of a librarian, are sets of curated items with a common theme. Taken as a whole, these collections embody a set of knowledge or a historical record intended for use by students & researchers for the purposes of learning & scholarship. The physical arrangement of the collection — especially in archives — as well as the intellectual arrangement of the collection is significant because they bring together like items or represent the development of an idea. This is why libraries have classification schemes and archives physically arrange their materials in the way they do. Unfortunately, computer staff usually do not understand the concept of “curation” and usually see the arrangements of books — classification numbers — as rather arbitrary.
Services – Many librarians see the library profession as being all about service. These services range from literacy programs to story hours. They range from the answering of reference questions to the circulation of books. They include social justice causes, stress relievers during exam times, and free access to computers with Internet connections. Services are important because they provide the means for an informed public, teaching & learning, and the improvement of society in general. Many of these concepts are not in the forefront of the minds of computer staff. Instead, their idea of service is making sure the email system works, people can log into their computers, computer hardware & software are maintained, and making sure the connections to the Internet are continual.
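To make the first example above a bit more concrete, here is a minimal Python sketch of MARC as “just” a data structure: a 24-character leader, a directory of 12-character entries (tag, field length, starting position), and the data itself. The function names `build_record` and `parse_record`, the leader values, and the sample record are my own illustrations, and the sketch assumes ASCII-only data so that character counts equal byte counts. Real-world work is better left to an established MARC library.

```python
FIELD_END = "\x1e"    # ASCII 30, ends each field and the directory
SUBFIELD = "\x1f"     # ASCII 31, starts each subfield
RECORD_END = "\x1d"   # ASCII 29, ends the record

def build_record(fields):
    """Assemble leader + directory + data from a list of (tag, data) pairs."""
    directory, body = "", ""
    for tag, data in fields:
        data += FIELD_END
        directory += f"{tag}{len(data):04d}{len(body):05d}"   # tag, length, start
        body += data
    directory += FIELD_END
    base = 24 + len(directory)          # where the data begins
    length = base + len(body) + 1       # plus one for the record terminator
    leader = f"{length:05d}nz  a22{base:05d}n  4500"
    return leader + directory + body + RECORD_END

def parse_record(record):
    """Read the structure back without knowing what any field means."""
    leader = record[:24]
    base = int(leader[12:17])
    directory = record[24:base - 1]     # drop the terminator closing the directory
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = record[base + start:base + start + length - 1]  # drop field terminator
        fields.append((tag, data.split(SUBFIELD)))
    return leader, fields

record = build_record([
    ("001", "a001"),
    ("100", "1 " + SUBFIELD + "aAlighieri, Dante," + SUBFIELD + "d1265-1321"),
])
leader, fields = parse_record(record)
print(leader)
print(fields)
```

Notice that nothing in the parser knows what a 100 field is for. That knowledge lives in the cataloging rules, not in the data structure, which is exactly where the two groups talk past each other.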

As a whole, what the profession does not understand is that everybody working in a library has more things in common than differences. Everybody is (supposed to be) working towards the same set of goals. Everybody plays a part in achieving those goals, and it behooves everybody to learn & respect the roles of everybody else. A goal is to curate collections. This is done through physical, intellectual, and virtual arrangement, but it also requires the use of computer technology. Collection managers need to understand more of the computer technology, and the technologist needs to understand more about curation. The application of AACR2/RDA is an attempt to manifest inventory and the dissemination of knowledge. The use of databases & indexes also manifests inventory and dissemination of knowledge. Catalogers and database administrators ought to communicate on similar levels. Similarly, there is much more to preservation of materials than putting bits on tape. “Yikes!”

What is the solution to these problems? In my opinion, there are many possibilities, but the solution ultimately rests with individuals willing to take the time to learn from their co-workers. It rests in the ability to respect — not merely tolerate — another point of view. It requires time, listening, discussion, reflection, and repetition. It requires getting to know other people on a personal level. It requires learning what others like and dislike. It requires comparing & contrasting points of view. It demands “walking a mile in the other person’s shoes”, and can be accomplished by things such as the physical intermingling of departments, cross-training, and simply by going to coffee on a regular basis.

Again, all of us working in libraries have more similarities than differences. Learn to appreciate the similarities, and the differences will become insignificant. The consequence will be a more holistic set of library collections and services.

Using BIBFRAME for bibliographic description

Eric Lease Morgan — Sun, 06 Mar 2016 20:21:44 +0000

Bibliographic description is an essential process of librarianship. In the distant past this process took the form of simple inventories. In the last century we saw bibliographic description evolve from the catalog card to the MARC record. With the advent of globally networked computers and the hypertext transfer protocol, we are seeing the emergence of a new form of description called BIBFRAME which is based on the principles of RDF (Resource Description Framework). This essay describes, illustrates, and demonstrates how BIBFRAME can be used to fulfill the promise and purpose of bibliographic description.†

Librarianship as collections & services

Libraries are about a number of things. Some of those things surround the collection and preservation of materials, most commonly books. Some of those things surround services, most commonly the lending of books.†† But it is asserted here that collections are not really about books nor any other physical medium because those things are merely the manifestation of the real things of libraries: data, information, and knowledge. It is left to another essay as to the degree libraries are about wisdom. Similarly, the primary services of libraries are not really about the lending of materials, but instead the services surround learning and intellectual growth. Librarians cannot say they have lent somebody a book and conclude they have done their job. No, more generally, libraries provide services enabling the reader to use & understand the content of acquired materials. In short, it is asserted that libraries are about the collection, organization, preservation, dissemination, and sometimes evaluation of data, information, knowledge, and sometimes wisdom.

With the advent of the Internet the above definition of librarianship is even more plausible since the materials of libraries can now be digitized, duplicated (almost) exactly, and distributed without diminishing access to the whole. There is no need to limit the collection to physical items, provide access to the materials through surrogates, nor lend the materials. Because these limitations have been (mostly) removed, it is necessary for libraries to think differently about their collections and services. To the author’s mind, librarianship has not shifted fast enough nor far enough. As a long standing and venerable profession, and as an institution complete with its own set of governance, diversity, and sheer size, change & evolution happen very slowly. The evolution of bibliographic description is a perfect example.

Bibliographic description: an informal history

Bibliographic description happens in between the collections and services of libraries, and the nature of bibliographic description has evolved with technology. Think of the oldest libraries. Think clay tablets and papyrus scrolls. Think of the size of library collections. If a library’s collection was larger than a few hundred items, then the library was considered large. Still, the collections were so small that an inventory was relatively easy for sets of people (librarians) to keep in mind.

Think medieval scriptoriums and the development of the codex. Consider the time, skill, and labor required to duplicate an item from the collection. Consequently, books were very expensive but now had a much longer shelf life. (All puns are intended.) This increased the size of collections, but remembering everything in a collection was becoming more and more difficult. This, coupled with the desire to share the inventory with the outside world, created the demand for written inventories. Initially, these inventories were merely accession lists — a list of things owned by a library and organized by the date they were acquired.

With the advent of the printing press, even more books were available but at a much lower cost. Thus, the size of library collections grew. As it grew it became necessary to organize materials not necessarily by their acquisition date nor physical characteristics but rather by various intellectual qualities — their subject matter and usefulness. This required the librarian to literally articulate and manifest things of quality, and thus the profession begins to formalize the process of analytics as well as supplement their inventory lists with this new (which is not really new) information.

Consider some of the things beginning in the 18th and 19th centuries: the idea of the “commons”, the idea of the informed public, the idea of the “free” library, and the size of library collections numbering tens of thousands of books. These things eventually paved the way in the 20th century to open stacks and the card catalog — the most recent incarnation of the inventory list written in its own library short-hand and complete with its ever-evolving controlled vocabulary and authority lists — becoming available to the general public. Computers eventually happen and so does the MARC record. Thus, the process of bibliographic description (cataloging) literally becomes codified. The result is library jargon solidified in an obscure data structure. Moreover, in an attempt to make the surrogates of library collections more meaningful, the information of bibliographic description bloats to fill much more than the traditional three to five catalog cards of the past. With the advent of the Internet comes less of a need for centralized authorities. Self-service and convenience become the norm. When was the last time you used a travel agent to book airfare or reserve a hotel room?

Librarianship is now suffering from a great amount of reader dissatisfaction. True, most people believe libraries are “good things”, but most people also find libraries difficult to use and not meeting their expectations. People search the Internet (Google) for items of interest, and then use library catalogs to search for known items. There is then a strong desire to actually get the item, if it is found. After all, “Everything is on the ‘Net”. Right? To this author’s mind, the solution is two-fold: 1) digitize everything and put the result on the Web, and 2) employ a newer type of bibliographic description, namely RDF. The former is something for another time. The latter is elaborated upon below.

Resource Description Framework

Resource Description Framework (RDF) is essentially relational database technology for the Internet. It is comprised of three parts: keys, relationships, and values. In the case of RDF and akin to relational databases, keys are unique identifiers and usually in the form of URIs (now called “IRIs” — Internationalized Resource Identifiers — but think “URL”). Relationships take the form of ontologies or vocabularies used to describe things. These ontologies are very loosely analogous to the fields in a relational database table, and there are ontologies for many different sets of things, including the things of a library. Finally, the values of RDF can also be URIs but are ultimately distilled down to textual and numeric information.

RDF is a conceptual model — a sort of cosmology for the universe of knowledge. RDF is made real through the use of “triples”, a simple “sentence” with three distinct parts: 1) a subject, 2) a predicate, and 3) an object. Each of these three parts corresponds to the keys, relationships, and values outlined above. To extend the analogy of the sentence further, think of subjects and objects as if they were nouns, and think of predicates as if they were verbs. And here is a very important distinction between RDF and relational databases. In relational databases there is the idea of a “record” where an identifier is associated with a set of values. Think of a book that is denoted by a key, and the key points to a set of values for titles, authors, publishers, dates, notes, subjects, and added entries. In RDF there is no such thing as the record. Instead there are only sets of literally interlinked assertions — the triples.
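The distinction between a record and a set of triples can be sketched in a few lines of Python. This is only an illustration of the idea and not an RDF serialization: the URI is hypothetical, the field names stand in for the predicates of a real ontology, and the function name `record_to_triples` is my own.

```python
# A relational-style "record": one key pointing to a set of values.
record = {
    "key": "http://example.org/book/1",   # hypothetical identifier
    "title": "Walden",
    "author": "Thoreau, Henry David",
    "date": "1854",
}

def record_to_triples(record):
    """Break a record into independent (subject, predicate, object) assertions."""
    subject = record["key"]
    return [(subject, field, value) for field, value in record.items() if field != "key"]

triples = record_to_triples(record)
for triple in triples:
    print(triple)
```

Once the record is broken apart, each assertion stands on its own, and any of them can be added, removed, or linked to without reference to the others.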

Triples (sometimes called “statements”) are often illustrated as arced graphs where subjects and objects are nodes and predicates are lines connecting the nodes:

[ subject ] --- predicate ---> [ object ]

The “linking” in RDF statements happens when sets of triples share common URIs. By doing so, the subjects of statements end up having many characteristics, and the objects of statements point to other subjects in other RDF statements. This linking process transforms independent sets of RDF statements into a literal web of interconnections, and this is where the Semantic Web gets its name. For example, below is a simple web of interconnecting triples:

              / --- a predicate ---------> [ an object ]
[ subject ] - | --- another predicate ---> [ another object ]
              \ --- a third predicate ---> [ a third object ]
                                                   |
                                                   |
                                          yet another predicate
                                                   |
                                                   |
                                                  \ /

                                         [ yet another object ]

An example is in order. Suppose there is a thing called Rome, and it will be represented with the following URI: http://example.org/rome. We can now begin to describe Rome using triples:

subjects                 predicates         objects
-----------------------  -----------------  -------------------------
http://example.org/rome  has name           "Rome"
http://example.org/rome  has founding date  "1000 BC"
http://example.org/rome  has description    "A long long time ago,..."
http://example.org/rome  is a type of       http://example.org/city
http://example.org/rome  is a sub-part of   http://example.org/italy

The corresponding arced graph would look like this:

                               / --- has name ------------> [ "Rome" ]
                              |  --- has description -----> [ "A long time ago..." ]
[ http://example.org/rome ] - |  --- has founding date ---> [ "1000 BC" ]
                              |  --- is a sub-part of  ---> [ http://example.org/italy ]
                               \ --- is a type of --------> [ http://example.org/city ]

In turn, the URI http://example.org/italy might have a number of relationships asserted against it also:

subjects                  predicates         objects
------------------------  -----------------  -------------------------
http://example.org/italy  has name           "Italy"
http://example.org/italy  has founding date  "1923 AD"
http://example.org/italy  is a type of       http://example.org/country
http://example.org/italy  is a sub-part of   http://example.org/europe
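
The way shared URIs link one set of statements to another can be sketched in a few lines of Python. This is only an illustration and not how a triple store works internally: the triples are plain tuples copied from the Rome and Italy tables above, the predicates are the informal labels used in this posting, and the function name ancestors is made up for the example.

```python
# Triples copied from the Rome and Italy tables above: (subject, predicate, object).
TRIPLES = [
    ("http://example.org/rome", "has name", "Rome"),
    ("http://example.org/rome", "is a type of", "http://example.org/city"),
    ("http://example.org/rome", "is a sub-part of", "http://example.org/italy"),
    ("http://example.org/italy", "has name", "Italy"),
    ("http://example.org/italy", "is a type of", "http://example.org/country"),
    ("http://example.org/italy", "is a sub-part of", "http://example.org/europe"),
]


def ancestors(uri, triples):
    """Follow "is a sub-part of" arcs from uri and return every URI reached."""
    found = []
    current = uri
    while True:
        parents = [o for s, p, o in triples
                   if s == current and p == "is a sub-part of"]
        # Stop when there is nothing further to follow (or a loop is detected).
        if not parents or parents[0] in found:
            return found
        current = parents[0]
        found.append(current)


print(ancestors("http://example.org/rome", TRIPLES))
```

Because the object of one statement about Rome is the subject of the statements about Italy, the function can walk from Rome to Italy to Europe without any notion of a record.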

Now suppose there were things called Paris, London, and New York. They can be represented in RDF as well:

subjects                    predicates          objects
--------------------------  -----------------   -------------------------
http://example.org/paris    has name            "Paris"
http://example.org/paris    has founding date   "100 BC"
http://example.org/paris    has description     "You see, there's this tower..."
http://example.org/paris    is a type of        http://example.org/city
http://example.org/paris    is a sub-part of    http://example.org/france
http://example.org/london   has name            "London"
http://example.org/london   has description     "They drink warm beer here."
http://example.org/london   has founding date   "100 BC"
http://example.org/london   is a type of        http://example.org/city
http://example.org/london   is a sub-part of    http://example.org/england
http://example.org/newyork  has founding date   "1640 AD"
http://example.org/newyork  has name            "New York"
http://example.org/newyork  has description     "It is a place that never sleeps."
http://example.org/newyork  is a type of        http://example.org/city
http://example.org/newyork  is a sub-part of    http://example.org/unitedstates

Furthermore, each of the “countries” can have relationships denoted against it:

subjects                         predicates         objects
-------------------------------  -----------------  -------------------------
http://example.org/unitedstates  has name           "United States"
http://example.org/unitedstates  has founding date  "1776 AD"
http://example.org/unitedstates  is a type of       http://example.org/country
http://example.org/unitedstates  is a sub-part of   http://example.org/northamerica
http://example.org/england       has name           "England"
http://example.org/england       has founding date  "1066 AD"
http://example.org/england       is a type of       http://example.org/country
http://example.org/england       is a sub-part of   http://example.org/europe
http://example.org/france        has name           "France"
http://example.org/france        has founding date  "900 AD"
http://example.org/france        is a type of       http://example.org/country
http://example.org/france        is a sub-part of   http://example.org/europe

The resulting arced graph of all these triples might look like this:

[IMAGINE A COOL LOOKING ARCED GRAPH HERE.]

From this graph, new information can be inferred as long as one is able to trace connections from one node to another node through one or more arcs. For example, using the arced graph above, questions such as the following can be asked and answered:

What things are denoted as types of cities, and what are their names?
What is the oldest city?
What cities were founded after the year 1 AD?
What countries are sub-parts of Europe?
How would you describe Rome?
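
To make the idea concrete, here is a minimal Python sketch that answers a few of these questions. It assumes the triples are held as plain tuples (a subset of the tables above), that founding dates are always written as a number followed by BC or AD, and that the helper names value and year are invented for the example. A real system would ask these questions with SPARQL against a triple store.

```python
# A subset of the triples from the tables above: (subject, predicate, object).
EX = "http://example.org/"
TRIPLES = [
    (EX + "rome", "has name", "Rome"),
    (EX + "rome", "has founding date", "1000 BC"),
    (EX + "rome", "is a type of", EX + "city"),
    (EX + "paris", "has name", "Paris"),
    (EX + "paris", "has founding date", "100 BC"),
    (EX + "paris", "is a type of", EX + "city"),
    (EX + "newyork", "has name", "New York"),
    (EX + "newyork", "has founding date", "1640 AD"),
    (EX + "newyork", "is a type of", EX + "city"),
    (EX + "italy", "has name", "Italy"),
    (EX + "italy", "is a type of", EX + "country"),
    (EX + "italy", "is a sub-part of", EX + "europe"),
    (EX + "france", "has name", "France"),
    (EX + "france", "is a type of", EX + "country"),
    (EX + "france", "is a sub-part of", EX + "europe"),
]


def value(subject, predicate):
    """Return the first object asserted for subject and predicate, or None."""
    for s, p, o in TRIPLES:
        if s == subject and p == predicate:
            return o
    return None


def year(text):
    """Turn '1000 BC' into -1000 and '1640 AD' into 1640."""
    number, era = text.split()
    return -int(number) if era == "BC" else int(number)


# What things are denoted as types of cities, and what are their names?
cities = [s for s, p, o in TRIPLES if p == "is a type of" and o == EX + "city"]
names = [value(c, "has name") for c in cities]

# What is the oldest city?
oldest = min(cities, key=lambda c: year(value(c, "has founding date")))

# What cities were founded after the year 1 AD?
after_1_ad = [value(c, "has name") for c in cities
              if year(value(c, "has founding date")) > 1]

# What countries are sub-parts of Europe?
in_europe = [value(s, "has name") for s, p, o in TRIPLES
             if p == "is a sub-part of" and o == EX + "europe"
             and value(s, "is a type of") == EX + "country"]

print(names, value(oldest, "has name"), after_1_ad, in_europe)
```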

In summary, RDF is a data model, a method for organizing discrete facts into a coherent information system, and to this author, this sounds a whole lot like a generalized form of bibliographic description and a purpose of library catalogs. The model is built on the idea of triples whose parts are URIs or literals. Through the liberal reuse of URIs in and between sets of triples, questions surrounding the information can be answered and new information can be inferred. RDF is the what of the Semantic Web. Everything else (ontologies & vocabularies, URIs, RDF “serializations” like RDF/XML, triple stores, SPARQL, etc.) are the how’s. None of them will make any sense unless the reader understands that RDF is about establishing relationships between data for the purposes of sharing information and increasing the “sphere of knowledge”.

Linked data

Linked data is RDF manifested. It is a process of codifying triples and systematically making them available on the Web. It first involves selecting, creating (“minting”), and maintaining sets of URIs denoting the things to be described. When it comes to libraries, there are many places where authoritative URIs can be gotten including: OCLC’s Worldcat, the Library of Congress’s linked data services, Wikipedia, institutional repositories, or even licensed indexes/databases.

Second, manifesting RDF as linked data involves selecting, creating, and maintaining one or more ontologies used to posit relationships. Like URIs, there are many existing bibliographic ontologies for the many different types of cultural heritage institutions: libraries, archives, and museums. Example ontologies include but are by no means limited to: BIBFRAME, bib.schema.org, the work of the (aged) LOCAH project, EAC-CPF, and CIDOC CRM.

The third step to implementing RDF as linked data is to actually create and maintain sets of triples. This is usually done through the use of a “triple store” which is akin to a relational database. But remember, there is no such thing as a record when it comes to RDF! There are a number, but not a huge number, of toolkits and applications implementing triple stores. 4store is (or was) a popular open source triple store implementation. Virtuoso is another popular implementation that comes in both open source and commercial versions.

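The fourth step in the linked data process is the publishing (making freely available on the Web) of RDF. This is done in a combination of two ways. The first is to write a report against the triple store resulting in a set of “serializations” saved at the other end of a URL. Serializations are textual manifestations of RDF triples. In the “old days”, the serialization of one or more triples was manifested as XML, and might have looked something like this, describing the Declaration of Independence using the Dublin Core and FOAF (Friend of a Friend) ontologies:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
  <dcterms:creator>
	<foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
	  <foaf:gender>male</foaf:gender>
	</foaf:Person>
  </dcterms:creator>
</rdf:Description>
</rdf:RDF>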

Many people think the XML serialization is too verbose and thus difficult to read. Consequently other serializations have been invented. Here is the same small set of triples serialized as Turtle:

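@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://en.wikipedia.org/wiki/Declaration_of_Independence> dcterms:creator <http://id.loc.gov/authorities/names/n79089957> .
<http://id.loc.gov/authorities/names/n79089957> a foaf:Person;
  foaf:gender "male".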

Here is yet another example, but this time serialized as JSON, a data structure first implemented as a part of the Javascript language:

{
"http://en.wikipedia.org/wiki/Declaration_of_Independence": {
  "http://purl.org/dc/terms/creator": [
	{
	  "type": "uri", 
	  "value": "http://id.loc.gov/authorities/names/n79089957"
	}
  ]
}, 
 "http://id.loc.gov/authorities/names/n79089957": {
   "http://xmlns.com/foaf/0.1/gender": [
	 {
	   "type": "literal", 
	   "value": "male"
	 }
   ], 
   "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
	 {
	   "type": "uri", 
	   "value": "http://xmlns.com/foaf/0.1/Person"
	 }
   ]
 }
}
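
One advantage of the JSON serialization is that it can be read with nothing more than a standard JSON parser. The following Python sketch flattens the nested subject/predicate/object structure back into individual statements. The document is abbreviated from the one above (the type statement is left out to keep things short), and the function name to_statements is invented for the example.

```python
import json

# The RDF/JSON document from above, abbreviated to keep the sketch short.
DOCUMENT = """
{
  "http://en.wikipedia.org/wiki/Declaration_of_Independence": {
    "http://purl.org/dc/terms/creator": [
      {"type": "uri", "value": "http://id.loc.gov/authorities/names/n79089957"}
    ]
  },
  "http://id.loc.gov/authorities/names/n79089957": {
    "http://xmlns.com/foaf/0.1/gender": [
      {"type": "literal", "value": "male"}
    ]
  }
}
"""


def to_statements(text):
    """Flatten subject -> predicate -> [objects] into a flat list.

    Each item is (subject, predicate, object value, object type).
    """
    statements = []
    for subject, predicates in json.loads(text).items():
        for predicate, objects in predicates.items():
            for obj in objects:
                statements.append((subject, predicate, obj["value"], obj["type"]))
    return statements


for statement in to_statements(DOCUMENT):
    print(statement)
```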

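RDF has even been serialized in HTML files by embedding triples into attributes. This is called RDFa, and a snippet of RDFa might look like this:

  <div prefix="dcterms: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/">
    <div about="http://en.wikipedia.org/wiki/Declaration_of_Independence" rel="dcterms:creator">
      <div about="http://id.loc.gov/authorities/names/n79089957" typeof="foaf:Person" property="foaf:gender" content="male"></div>
    </div>
  </div>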

Once the RDF is serialized and put on the Web, it is intended to be harvested by Internet spiders and robots. They cache the data locally, read it, and update their local triple stores. This data is then intended to be analyzed, indexed, and used to find or discover new relationships or knowledge.

The second way of publishing linked data is through a “SPARQL endpoint”. SPARQL is a query language very similar to the query language of relational databases (SQL). SPARQL endpoints are usually Web-accessible interfaces allowing the reader to search the underlying triple store. The result is usually a stream of XML. Admittedly, SPARQL is obtuse at the very least.

Just like the published RDF, the output of SPARQL queries can be serialized in many different forms. And just like relational databases, triple stores and SPARQL queries are not intended to be used directly by the reader. Instead, something more friendly (but ultimately less powerful and less flexible) is always intended.
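
For the curious, talking to a SPARQL endpoint is mostly a matter of building a URL. The Python sketch below assumes the endpoint follows the SPARQL Protocol convention of accepting the query text in a parameter named query on an HTTP GET request. The endpoint address is the one listed under Links, which may no longer be online, and the function name query_url is made up. The sketch only builds the address; it does not send the request.

```python
from urllib.parse import urlencode

# Endpoint taken from the Links section of this posting; it may be offline.
ENDPOINT = "http://infomotions.com/sandbox/bibframe/sparql/"


def query_url(endpoint, sparql):
    """Build a GET request URL carrying the query in the "query" parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


url = query_url(ENDPOINT, "SELECT DISTINCT ?p WHERE { ?s ?p ?o } ORDER BY ?p")
print(url)
```

The question marks, braces, and spaces of the query are percent-encoded, which is one more reason readers are not expected to write these requests by hand.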

So what does this have to do with libraries and specifically bibliographic description? The answer is not that complicated. The what of librarianship has not really changed over the millennia. Librarianship is still about processes of collection, organization, preservation, dissemination, and sometimes evaluation. On the other hand, with the evolution of technology and cultural expectations, the how’s of librarianship have changed dramatically. Considering the current environment, it is time to evolve, yet again. The next evolution is the employment of RDF and linked data as the means of bibliographic description. By doing so the data, information, and knowledge contained in libraries will be more accessible and more useful to the wider community. As time has gone on, the data and metadata of libraries have become less and less librarian-centric. By taking the leap to RDF and linked data, this will only become more true, and this is a good thing for both libraries and the people they serve.

BIBFRAME

Enter BIBFRAME, an ontology designed for libraries and their collections. It is not the only ontology intended to describe libraries and their collections. There are other examples as well, notably, bib.schema.org, FRBR for RDF, MODS and MADS for RDF, and to some extent, Dublin Core. Debates rage on mailing lists regarding the inherent advantages & disadvantages of each of these ontologies. For the most part, the debates seem to be between BIBFRAME, bib.schema.org, and FRBR for RDF. BIBFRAME is sponsored by the Library of Congress and supported by a company called Zepheira. At its very core are the ideas of a work and its instance. In other words, BIBFRAME boils the things of libraries down to two entities. Bib.schema.org is a subset of schema.org, an ontology endorsed by the major Internet search engines (Google, Bing, and Yahoo). And since schema.org is designed to enable the description of just about anything, the implementation of bib.schema.org is seen as a means of reaching the widest possible audience. On the other hand, bib.schema.org is not always seen as being as complete as BIBFRAME. The third contender is FRBR for RDF. Personally, the author has not seen very many examples of its use, but it purports to better serve the needs/desires of the reader through the concepts of WEMI (Work, Expression, Manifestation, and Item).

That said, it is this author’s opinion that the difference between the various ontologies is akin to debating the differences between vanilla and chocolate ice cream. It is a matter of opinion, and the flavors are not what is important, but rather it is the ice cream itself. Few people outside libraries really care which ontology is used. Besides, each ontology includes predicates for the things everybody expects: titles, authors, publishers, dates, notes, subjects/keywords, added entries, and locations. Moreover, in this time of transition, it is not feasible to come up with the perfect solution. Instead, this evolution is an iterative process. Give something a go. Try it for a limited period of time. Evaluate. And repeat. We also live in a world of digital data and information. This data and information is, by its very nature, mutable. There is no reason why one ontology over another needs to be debated ad nauseam. Databases (triple stores) support the function of find/replace with ease. If one ontology does not seem to be meeting the desired needs, then (simply) change to another one.††† In short, BIBFRAME may not be the “best” ontology, but right now, it is good enough.

Workflow

Now that the fundamentals have been outlined and elaborated upon, a workflow can be articulated. At the risk of mixing too many metaphors, here is a “recipe” for doing bibliographic description using BIBFRAME (or just about any other bibliographic ontology):

1. Answer the questions, “What is bibliographic description, and how does it help facilitate the goals of librarianship?”
2. Understand the concepts of RDF and linked data.
3. Embrace & understand the strengths & weaknesses of BIBFRAME as a model for bibliographic description.
4. Design or identify and then install a system for creating, storing, and editing your bibliographic data. This will be some sort of database application whether it be based on SQL, non-SQL, XML, or a triple store. It might even be your existing integrated library system.
5. Using the database system, create, store, import/edit your bibliographic descriptions. For example, you might simply use your existing integrated library system for these purposes, or you might transform your MARC data into BIBFRAME and pour the result into a triple store, like this:
   A. Dump MARC records
   B. Transform MARC into BIBFRAME
   C. Pour the result into a triple-store
   D. Sort the triples according to the frequency of literal values
   E. Find/replace the most frequently found literals with URIs††††
   F. Go to Step #D until tired
   G. Use the triple-store to create & maintain ongoing bibliographic description
   H. Go to Step #D
6. Expose your bibliographic description as linked data by writing a report against the database system. This might be as simple as configuring your triple store, or as complicated as converting MARC/AACR2 from your integrated library system to BIBFRAME.
7. Facilitate the discovery process, ideally through the use of linked data publishing and SPARQL, or directly against the integrated library system.
8. Go to Step #5 on a daily basis.
9. Go to Step #1 on an annual basis.
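
The sort-by-frequency step in the recipe above is easy to picture with a small Python sketch. Everything in it is hypothetical: the triples are invented stand-ins for the output of a MARC-to-BIBFRAME transformation, the predicates are informal labels, and the test for a literal is deliberately crude.

```python
from collections import Counter

# Hypothetical triples such as a MARC-to-BIBFRAME transformation might produce;
# the subjects, predicates, and literals are invented for illustration.
TRIPLES = [
    ("_:work1", "subject", "Philosophy"),
    ("_:work2", "subject", "Philosophy"),
    ("_:work3", "subject", "Ethics"),
    ("_:work1", "language", "eng"),
    ("_:work2", "language", "eng"),
    ("_:work3", "language", "eng"),
]


def is_literal(value):
    """Crude test: anything that is not an http(s) URI or blank node is a literal."""
    return not value.startswith(("http://", "https://", "_:"))


# Tabulate the literal objects and list them, most frequent first.
counts = Counter(o for s, p, o in TRIPLES if is_literal(o))
for literal, count in counts.most_common():
    print(count, literal)
```

The literals at the top of such a list are the best candidates for replacement with URIs, because each replacement improves the largest number of triples.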

If the profession continues to use its existing integrated library systems for maintaining bibliographic data (Step #4), then the hard problem to solve is transforming and exposing the bibliographic data as linked data in the form of the given ontology. If the profession designs a storage and maintenance system rooted in the given ontology to begin with, then the problem is accurately converting existing data into the ontology and then designing mechanisms for creating/editing the data. The latter option may be “better”, but the former option seems less painful and requires less retooling. This author advocates the “better” solution.

After a while, such a system may enable a library to meet the expressed needs/desires of its constituents, but it may present the library with a different set of problems. On one hand, the use of RDF as the root of a discovery system almost literally facilitates a “Web of knowledge”. But on the other hand, to what degree can it be used to do (more mundane) tasks such as circulation and acquisitions? One of the original purposes of bibliographic description was to create a catalog — an inventory list. Acquisitions adds to the list, and circulation modifies the list. To what degree can the triple store be used to facilitate these functions? If the answer is “none”, then there will need to be some sort of outside application interfacing with the triple store. If the answer is “a lot”, then the triple store will need to include an ontology to facilitate acquisitions and circulation.

Prototypical implementation

In the spirit of putting the money where the mouth is, the author has created the most prototypical of toy implementations. It is merely a triple store filled with a tiny set of automatically transformed MARC records and made publicly accessible via SPARQL. The triple store was built using a set of Perl modules called Redland. The system supports initialization of a triple store, the adding of items to the store via files saved on a local file system, rudimentary command-line search, a way to dump the contents of the triple store in the form of RDF/XML, and a SPARQL endpoint. [1] Thus, Step #4 from the recipe above has been satisfied.

To facilitate Step #5 a MARC to BIBFRAME transformation tool was employed [2]. The transformed MARC data was very small, and the resulting serialized RDF was valid. [3, 4] The RDF was imported into the triple store and resulted in the storage of 5,382 triples. Remember, there is no such thing as a record in the world of RDF! Using the SPARQL endpoint, it is now possible to query the triple store. [5] For example, the entire store can be dumped with this (dangerous) query:

# dump of everything
SELECT ?s ?p ?o 
WHERE { ?s ?p ?o }

To see what types of things are described one can list only the objects (classes) of the store:

# only the objects
SELECT DISTINCT ?o
WHERE { ?s a ?o }
ORDER BY ?o

To get a list of all the store’s properties (types of relationships), this query is in order:

# only the predicates
SELECT DISTINCT ?p
WHERE { ?s ?p ?o }
ORDER BY ?p
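
For readers new to SPARQL, it may help to see what the patterns between the curly braces are doing. The Python sketch below mimics the three queries above against a handful of invented triples. It is a toy, not a SPARQL engine: the function name match is made up, None plays the role of a variable such as ?s, and the string "a" stands in for the type predicate.

```python
# Hypothetical triples; "a" stands for the type predicate, as in the queries above.
TRIPLES = [
    ("_:w1", "a", "Work"),
    ("_:w1", "title", "Walden"),
    ("_:t1", "a", "Topic"),
    ("_:t1", "label", "Nature"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None plays the role of a variable."""
    return [t for t in TRIPLES
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]


# SELECT ?s ?p ?o WHERE { ?s ?p ?o }
everything = match()

# SELECT DISTINCT ?o WHERE { ?s a ?o } ORDER BY ?o
classes = sorted({t[2] for t in match(p="a")})

# SELECT DISTINCT ?p WHERE { ?s ?p ?o } ORDER BY ?p
predicates = sorted({t[1] for t in match()})

print(len(everything), classes, predicates)
```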

BIBFRAME denotes the existence of “Works”, and to get a list of all the works in the store, the following query can be executed:

# a list of all BIBFRAME Works
SELECT ?s 
WHERE { ?s a <http://bibframe.org/vocab/Work> }
ORDER BY ?s

This query will enumerate and tabulate all of the topics in the triple store, thus providing the reader with an overview of the breadth and depth of the collection in terms of subjects. The output is ordered by frequency:

# a breadth and depth of subject analysis
SELECT ( COUNT( ?l ) AS ?c ) ?l
WHERE {
  ?s a  . 
  ?s  ?l
}
GROUP BY ?l
ORDER BY DESC( ?c )

All of the information about a specific topic in this particular triple store can be listed in this manner:

# about a specific topic
SELECT ?p ?o 
WHERE {  ?p ?o }

The following query will create the simplest of title catalogs:

# simple title catalog
SELECT ?t ?w ?c ?l ?a
WHERE {
  ?w a            .
  ?w     ?wt .
  ?wt   ?t  .
  ?w       ?ci .
  ?ci        ?c  .
  ?w       ?s  .
  ?s         ?l  .
  ?s  ?a
}
ORDER BY ?t

The following query is akin to a phrase search. It looks for all the triples (not records) containing a specific key word (catholic):

# phrase search
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
  FILTER REGEX ( ?o, 'catholic', 'i' )
}
ORDER BY ?p

MARC data automatically transformed into BIBFRAME RDF will contain a preponderance of literal values when URIs are really desired. The following query will find all of the literals and sort them by the number of their individual occurrences:

# find all literals
SELECT ?p ?o ( COUNT ( ?o ) as ?c )
WHERE { ?s ?p ?o FILTER ( isLiteral ( ?o ) ) }
GROUP BY ?p ?o
ORDER BY DESC( ?c )

It behooves the cataloger to identify URIs for these literal values and to replace (or supplement) the literals in the triples accordingly (Step #5E in the recipe, above). This can be accomplished both programmatically as well as manually by first creating a list of appropriate URIs and then executing a set of INSERT or UPDATE commands against the triple store.
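
In spirit, the find/replace amounts to a lookup table. The Python sketch below is a stand-in for the INSERT or UPDATE commands, not a replacement for them: the triples are invented, the URIs are placeholders under example.org rather than real authority identifiers, and the function name entify is made up.

```python
# Hypothetical mapping from frequently found literals to URIs; the URIs are
# placeholders under example.org, not real authority identifiers.
AUTHORITIES = {
    "Philosophy": "http://example.org/subjects/philosophy",
    "eng": "http://example.org/languages/eng",
}

# Invented triples with literal objects, as an automatic transformation might emit.
TRIPLES = [
    ("_:work1", "subject", "Philosophy"),
    ("_:work1", "language", "eng"),
    ("_:work2", "subject", "Ethics"),
]


def entify(triples, authorities):
    """Return new triples with known literal objects swapped for URIs."""
    return [(s, p, authorities.get(o, o)) for s, p, o in triples]


for triple in entify(TRIPLES, AUTHORITIES):
    print(triple)
```

Literals without an entry in the table, such as "Ethics" here, are left alone and wait for the next pass through the loop.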

“Blank nodes” (nodes that have no URI of their own) are just about as bad as literal values. The following query will list all of the blank nodes in a triple store:

# find all blank nodes
SELECT ?s ?p ?o WHERE { ?s ?p ?o FILTER ( isBlank( ?s ) ) }

And the data associated with a particular blank node can be queried in this way:

# learn about a specific blank node
SELECT distinct ?p WHERE { _:r1456957120r7483r1 ?p ?o } ORDER BY ?p

In the case of blank nodes, the cataloger will then want to “mint” new URIs and perform an additional set of INSERT or UPDATE operations against the underlying triple store. This is a continuation of Step #5E.

These SPARQL queries applied against this prototypical implementation have tried to illustrate how RDF can fulfill the needs and requirements of bibliographic description. One can now begin to see how an RDF triple store employing a bibliographic ontology can be used to fulfill some of the fundamental goals of a library catalog.

Summary

This essay defined librarianship as a set of interlocking collections and services. Bibliographic description was outlined in an historical context, with the point being that the process of bibliographic description has evolved with technology and cultural expectations. The principles of RDF and linked data were then described, and the inherent advantages & disadvantages of leading bibliographic RDF ontologies were touched upon. The essay then asserted the need for faster evolution regarding bibliographic description and advocated the use of RDF and BIBFRAME for this purpose. Finally, the essay tried to demonstrate how RDF and BIBFRAME can be used to satisfy the functionality of the library catalog. It did this through the use of a triple store and a SPARQL endpoint. In the end, it is hoped the reader understands that there is no be-all end-all solution for bibliographic description, but the use of RDF technology is the wave of the future, and BIBFRAME is good enough when it comes to the ontology. Moving to the use of RDF for bibliographic description will be painful for the profession, but not moving to RDF will be detrimental.

Notes

† This presentation ought also to be available as a one-page handout in the form of a PDF document.

†† Moreover, collections and services go hand-in-hand because collections without services are useless, and services without collections are empty. As a Buddhist monk once said, “Collections without services is the sound of one hand clapping.” Librarianship requires a healthy balance of both.

††† That said, no matter what a person does, things always get lost in translation. This is true of human language just as much as it is true for the language (data/information) of computers. Yes, data & information will get lost when moving from one data model to another, but still I contend the fundamental and most useful elements will remain.

†††† This process (Step #5E) was coined by Roy Tennant and his colleagues at OCLC as “entification”.

Links

[1] toy implementation – http://infomotions.com/sandbox/bibframe/
[2] MARC to BIBFRAME – http://bibframe.org/tools/transform/start
[3] sample MARC data – http://infomotions.com/sandbox/bibframe/data/data.xml
[4] sample RDF data – http://infomotions.com/sandbox/bibframe/data/data.rdf
[5] SPARQL endpoint – http://infomotions.com/sandbox/bibframe/sparql/

XML 101

Eric Lease Morgan — Wed, 06 Jan 2016 18:05:55 +0000

This past Fall I taught “XML 101” online and to library school graduate students. This posting echoes the scripts of my video introductions, and I suppose this posting could also be used as a very gentle introduction to XML for librarians.

Introduction

I work at the University of Notre Dame, and my title is Digital Initiatives Librarian. I have been a librarian since 1987. I have been writing software since 1976, and I will be your instructor. Using materials and assignments created by the previous instructors, my goal is to facilitate your learning of XML.

XML is a way of transforming data into information. It is a method for marking up numbers and text, giving them context, and therefore a bit of meaning. XML includes syntactical characteristics as well as semantic characteristics. The syntactical characteristics are really rather simple. There are only five or six rules for creating well-formed XML, such as: 1) there must be one and only one root element, 2) element names are case-sensitive, 3) elements must be closed properly, 4) elements must be nested properly, 5) attributes must be quoted, and 6) there are a few special characters (&, <, and >) which must be escaped if they are to be used in their literal contexts. The semantics of XML are much more complicated, and they denote the intended meaning of the XML elements and attributes. The semantics of XML are embodied in things called DTDs and schemas.

Again, XML is used to transform data into information. It is used to give data context, but XML is also used to transmit this information in a computer-independent way from one place to another. XML is also a data structure in the same way MARC, JSON, SQL, and tab-delimited files are data structures. Once information is encapsulated as XML, it can be unambiguously transmitted from one computer to another where it can be put to use.

This course will elaborate upon these ideas. You will learn about the syntax and semantics of XML in general. You will then learn how to manipulate XML using XML-related technologies called XPath and XSLT. Finally, you will learn library-specific XML “languages” to learn how XML can be used in Library Land.

Well-formedness

In this, the second week of “XML 101 for librarians”, you will learn about well-formed XML and valid XML. Well-formed XML is XML that conforms to the five or six syntactical rules. (XML must have one and only one root element. Element names are case-sensitive. Elements must be closed. Elements must be nested correctly. Attributes must be quoted. And there are a few special characters that must be escaped, namely &, <, and >.) Valid XML is XML that is not only well-formed but also conforms to a named DTD or schema. Think of valid XML as semantically correct.
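
If you would like to see the rules of well-formedness enforced outside of an XML editor, the following Python sketch may help. It relies only on the XML parser that ships with Python, which checks well-formedness but not validity against a DTD or schema. The function name is_well_formed and the sample snippets are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One root, proper nesting, quoted attribute, escaped ampersand.
print(is_well_formed('<book id="1"><title>Walden &amp; more</title></book>'))

# Elements nested incorrectly.
print(is_well_formed("<book><title>Walden</book></title>"))

# Element names are case-sensitive, so </Title> does not close <title>.
print(is_well_formed("<book><title>Walden</Title></book>"))

# Attribute value is not quoted.
print(is_well_formed("<book id=1><title>Walden</title></book>"))

# More than one root element.
print(is_well_formed("<a/><b/>"))
```

Only the first snippet passes; each of the others breaks exactly one of the rules listed above.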

Jennifer Weintraub and Lisa McAulay, the previous instructors of this class, provide more than a few demonstrations of how to create well-formed as well as valid XML. Oxygen, the selected XML editor for this course, is both powerful and full-featured, but using it efficiently requires practice. That’s what the assignments are all about. The readings supplement the demonstrations.

DTD’s and namespaces

DTD’s, schemas, and namespaces put the “X” in XML. They make XML extensible. They allow you to define your own elements and attributes to create your own “language”.

DTD’s (document type definitions) and schemas are the semantics of XML. They define what elements exist, what order they appear in, what attributes they can contain, and just as importantly what the elements are intended to contain. DTD’s are older than schemas and not as robust. Schemas are XML documents themselves and go beyond DTD’s in that they provide the ability to define the types of data elements and attributes contain.

Namespaces allow you, the author, to incorporate multiple DTD and schema definitions into a single XML document. Namespaces provide a way for multiple elements of the same name to exist concurrently in a document. For example, two different DTD’s may contain an element called “title”, but one DTD refers to a title as in the title of a book, and the other refers to “title” as if it were an honorific.
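
Here is a small Python sketch of the two-titles example. The namespace URIs and the prefixes bib and per are invented for the illustration; in practice they would come from whatever DTDs or schemas you are combining.

```python
import xml.etree.ElementTree as ET

# The namespace URIs below are invented for the example.
DOCUMENT = """
<record xmlns:bib="http://example.org/bibliographic"
        xmlns:per="http://example.org/personal">
  <bib:title>Walden</bib:title>
  <per:title>Mr.</per:title>
</record>
"""

# Map the prefixes used in the searches below to their namespace URIs.
NAMESPACES = {
    "bib": "http://example.org/bibliographic",
    "per": "http://example.org/personal",
}

root = ET.fromstring(DOCUMENT)
book_title = root.find("bib:title", NAMESPACES).text
honorific = root.find("per:title", NAMESPACES).text
print(book_title, honorific)
```

Both elements are called title, but because each lives in its own namespace, there is no confusion over which is which.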

Schemas

Schemas are a more intelligent alternative to DTDs. While DTDs define the structure of XML documents, schemas do it with more exactness. While DTDs only allow you to define elements, the number of elements, the order of elements, attributes, and entities, schemas allow you to do these things and much more. For example, they allow you to define the types of content that go into elements or attributes. Strings (characters). Numbers. Lists of characters or numbers. Boolean (true/false) values. Dates. Times. Etc. Schemas are XML documents in and of themselves, and therefore they can be validated just like any other XML document with a pre-defined structure.

The reading and writing of XML schemas is very librarian-ish because it is about turning data into information. It is about structuring data so it makes sense, and it does this in an unambiguous and computer-independent fashion. It is too bad our MARC (bibliographic) standards are not as rigorous.

RelaxNG, Schematron, and digital libraries

This week introduces three things. The first is yet another technology for modeling your XML, and it is called RelaxNG. This third modeling technology is intended to be more human readable than schemas and more robust than DTDs. Frankly, I have not seen RelaxNG implemented very many times, but it behooves you to know it exists and how it compares to other modeling tools.

The second is Schematron. This tool too is used to validate XML, but instead of returning “ugly” computer-looking error messages, its errors are intended to be more human-readable and describe why things are the way they are instead of just saying “Wrong!”

Lastly, there is an introduction to digital libraries and trends in their current development. More and more, digital libraries are really and truly implementing the principles of traditional librarianship complete with collection, organization, preservation, and dissemination. At the same time, they are pushing the boundaries of the technology and stretching our definitions. Remember, it is not so much about the technology (the how of librarianship) that is important, but rather the why of libraries and librarianship. The how changes quickly. The why changes slowly, albeit sometimes too slowly.

XPath

This week is all about XPath, and it is used to select content from your XML files. It is akin to navigating a computer’s filesystem from the command line in order to learn what is located in different directories.

XPath is made up of expressions which return values of true, false, strings (characters), numbers, or nodes (subsets of XML files). XPath is used in conjunction with other XML technologies, most notably XSLT and XQuery. XSLT is used to transform XML files into other plain text files. XQuery is akin to the structured query language of relational databases.

You will not be able to do very much with XML other than read or write it, unless you understand XPath. An understanding of XPath is essential if you want to do truly interesting things with XML.
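
To give you a taste, here is a small Python sketch. Python's built-in XML library supports only a limited subset of XPath, but it is enough to show the idea of walking a path and narrowing a selection. The catalog, its element names, and its attribute names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up catalog; element and attribute names are for illustration only.
CATALOG = ET.fromstring("""
<catalog>
  <book year="1854"><title>Walden</title><author>Thoreau</author></book>
  <book year="1851"><title>Moby Dick</title><author>Melville</author></book>
</catalog>
""")

# ./book/title selects the title of every book, much like listing the
# files found in a set of directories.
titles = [t.text for t in CATALOG.findall("./book/title")]

# ./book[@year='1851']/author adds a condition in square brackets to
# narrow the selection to books with a particular attribute value.
authors = [a.text for a in CATALOG.findall("./book[@year='1851']/author")]

print(titles, authors)
```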

XSLT

This week you will be introduced to XSLT, a programming language used to transform XML into other plain text files.

XML is all about information, and it is not about use nor display. In order for XML to be actually useful (to be applied towards some sort of end), specific pieces of data need to be extracted from XML or the whole of the XML file needs to be converted into something else. The most common conversion (or “transformation”) is from some sort of XML into HTML for display in a Web browser. For example, bibliographic XML (MARCXML or MODS) may be transformed into a sort of “catalog card” for display, or a TEI file may be transformed into a set of Web pages, or an EAD file may be transformed into a guide intended for printing. Alternatively, you may want to transform the bibliographic data into a tab-delimited text file for a spreadsheet or an SQL file for a relational database. Along with other sets of information, an XML file may contain geographic coordinates, and you may want to extract just those coordinates to create a KML file, a sort of map file.

XSLT is a programming language but not like most programming languages you may know. Most programming languages are “procedural” (like Perl, PHP, or Python), meaning they execute their commands in a step-wise manner. “First do this, then do that, then do the other thing.” This can be contrasted with “declarative” programming languages where events occur or are encountered in a data file, and then some sort of execution happens. There are relatively few declarative programming languages, but LISP is/was one of them. Because of the declarative nature of XSLT, the apply-templates command is so important. The apply-templates command sort of tells the XSLT processor to go off and find more events.
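
One way to get a feel for apply-templates is to imitate it. The following is a rough Python imitation, not XSLT itself: each “template” is a function keyed by element name, and apply_templates goes off to find whatever children exist and hands each one to its matching template. All of the names here (TEMPLATES, apply_templates, default_template) are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

def apply_templates(node):
    """Process each child of node with whichever template matches its tag."""
    out = escape(node.text or "")
    for child in node:
        template = TEMPLATES.get(child.tag, default_template)
        out += template(child)
        out += escape(child.tail or "")
    return out

def default_template(node):
    """With no matching template, simply keep descending."""
    return apply_templates(node)

# One "template" per element name; each decides what to emit and
# then calls apply_templates to process whatever lies beneath it.
TEMPLATES = {
    "para": lambda n: "<p>" + apply_templates(n) + "</p>",
    "emph": lambda n: "<em>" + apply_templates(n) + "</em>",
}

doc = ET.fromstring("<doc><para>XML is <emph>not</emph> about display.</para></doc>")
html = apply_templates(doc)
print(html)
```

Notice there is no loop that says “first do the paragraph, then do the emphasis.” The shape of the input drives the order of execution, which is the point of the declarative style.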

Now that you are beginning to learn XSLT and combining it with XPath, you are beginning to do useful things with the XML you have been creating. This is where the real power is. This is where it gets really interesting.

TEI — Text Encoding Initiative

TEI is a granddaddy, when it comes to XML “languages”. It started out as a different form of mark-up, a mark-up called SGML, and SGML was originally a mark-up language designed at IBM for the purposes of creating, maintaining, and distributing internal documentation. Now-a-days, TEI is all but a hallmark of XML.

TEI is a mark-up language for any type of literature: poetry or prose. Like HTML, it is made up of head and body sections. The head is the place for administrative, bibliographic, and provenance metadata. The body is where the poetry or prose is placed, and there are elements for just about anything you can imagine: paragraphs, lines, headings, lists, figures, marginalia, comments, page breaks, etc. And if there is something you want to mark-up, but an element does not explicitly exist for it, then you can almost make up your own element/attribute combination to suit your needs.

TEI is quite easily the most well-documented XML vocabulary I’ve ever seen. The community is strong and sustainable, albeit small (if not tiny). The majority of the community is academic and very scholarly. Next to a few types of bibliographic XML (MARCXML, MODS, OAIDC, etc.), TEI is probably the most commonly used XML vocabulary in Library Land, with EAD being a close second. In libraries, TEI is mostly used for the purpose of marking-up transcriptions of various kinds: letters, runs of out-of-print newsletters, or parts of a library special collection. I know of no academic journals marked-up in TEI, no library manuals, nor any catalogs designed for printing and distribution.

TEI, more than any other type of XML designed for literature, is designed to support the computed critical analysis of text. But marking something up in TEI in a way that supports such analysis is extraordinarily expensive in terms of both time and expertise. Consequently, based on my experience, there are relatively very few such projects, but they do exist.

XSL-FO

As alluded to throughout this particular module, XSL-FO is not easy, but despite this fact, I sincerely believe it is an under-utilized tool.

FO stands for “Formatting Objects”, and it, in and of itself, is an XML vocabulary used to define page layout. It has elements defining the size of a printed page, margins, running headers & footers, fonts, font sizes, font styles, indenting, pagination, tables of contents, back-of-the-book indexes, etc. Almost all of these elements and their attributes use a syntax similar to the syntax of HTML’s cascading stylesheets.

Once an XML file is converted into an FO document, you are expected to feed the FO document to a FO processor, and the FO processor will convert the document into something intended for printing — usually a PDF document.

FO is important because not everything is designed nor intended to be digital. Digital everything is a misnomer. The graphic design of a printed medium is different from the graphic design of computer screens or smart phones. In my opinion, important XML files ought to be transformed into different formats for different mediums. Sometimes those mediums are screen oriented. Sometimes it is better to print something, and printed somethings last a whole lot longer. Sometimes it is important to do both.

FO is another good example of what XML is all about. XML is about data and information, not necessarily presentation. XSL transforms data/information into other things — things usually intended for reading by people.

EAD — Encoded Archival Description

Encoded Archival Description (or EAD) is the type of XML file used to enumerate, evaluate, and make accessible the contents of archival collections. Archival collections are often the raw and primary materials of new humanities scholarship. They are usually “the papers” of individuals or communities. They may consist of all sorts of things from letters, photographs, manuscripts, meeting notes, financial reports, audio cassette tapes, and now-a-days computers, hard drives, or CDs/DVDs. One thing, which is very important to understand, is that these things are “collections” and not intended to be used as individual items. MARC records are usually used as a data structure for bibliographically describing individual items — books. EAD files describe an entire set of items, and these descriptions are more colloquially called “finding aids”. They are intended to be read as intellectual works, and the finding aids transform collections into coherent wholes.

Like TEI files, EAD files are comprised of two sections: 1) a header and 2) a body. The header contains a whole lot or very little metadata of various types: bibliographic, administrative, provenance, etc. Some of this metadata is in the form of lists, and some of it is in the form of narratives. More than TEI files, EAD files are intended to be displayed on a computer screen or printed on paper. This is why you will find many XSL files transforming EAD into either HTML or FO (and then to PDF).

RDF

RDF is an acronym for Resource Description Framework. It is a data model intended to describe just about anything. The data model is based on an idea called triples, and as the name implies, the triples have three parts: 1) subjects, 2) predicates, and 3) objects.

Subjects are always URIs (think URLs), and they are the things described. Objects can be URIs or literals (words, phrases, or numbers), and objects are the descriptions. Predicates are also always URIs, and they denote the relationship between the subjects and the objects.
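
A small sketch may make the three-part model easier to see. Here each triple is a plain Python tuple. The example.org URIs are made up for illustration, while the Dublin Core and FOAF predicates are real, widely reused ones. The to_ntriples function is my own helper, and its output only approximates the N-Triples syntax (it does not escape special characters in literals).

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Subjects under example.org are hypothetical; the predicates are real.
triples = [
    Triple("http://example.org/book/walden",
           "http://purl.org/dc/terms/title",
           "Walden"),
    Triple("http://example.org/book/walden",
           "http://purl.org/dc/terms/creator",
           "http://example.org/person/thoreau"),
    Triple("http://example.org/person/thoreau",
           "http://xmlns.com/foaf/0.1/name",
           "Henry David Thoreau"),
]

def is_uri(value):
    """Crude test: treat anything starting with http(s):// as a URI."""
    return value.startswith(("http://", "https://"))

def to_ntriples(triple):
    """Write URIs in angle brackets and literals in double quotes."""
    obj = f"<{triple.obj}>" if is_uri(triple.obj) else f'"{triple.obj}"'
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."

for triple in triples:
    print(to_ntriples(triple))
```

Notice how the object of the second triple is the subject of the third. That sort of chaining, across many people's files, is what was supposed to let computers uncover new relationships.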

The idea behind RDF was this. Describe anything and everything in RDF. Reuse as many of the URIs used by other people as possible. Put the RDF on the Web. Allow Internet robots/spiders to harvest and cache the RDF. Allow other computer programs to ingest the RDF, analyse it for the similar uses of subjects, predicates, and objects, and in turn automatically uncover new knowledge and new relationships between things.

RDF is/was originally expressed as XML, but the wider community had two problems with RDF. First, there were no “killer” applications using RDF as input, and second, RDF expressed as XML was seen as too verbose and too confusing. Thus, the idea of RDF languished. More recently, RDF is being expressed in other forms such as JSON and Turtle and N3, but there are still no killer applications.

You will hear the term “linked data” in association with RDF, and linked data is the process of making RDF available on the Web.

RDF is important for libraries and “memory” or “cultural heritage” institutions, because the goal of RDF is very similar to the goals of libraries, archives, and museums.

MARC

The MARC standard has been the bibliographic bread & butter of Library Land since the late 1960’s. When it was first implemented it was an innovative and effective data structure used primarily for the production of catalog cards. With the increasing availability of computers, somebody got the “cool” idea of creating an online catalog. While logical, the idea did not mature with a balance of library and computing principles. To make a long story short, library principles prevailed and the result has been and continues to be painful for both the profession as well as the profession’s clientele.

MARCXML was intended to provide a pathway out of this morass, but since it was designed from the beginning to be “round tripable” with the original MARC standard, all of the short-comings of the original standard have come along for the ride. The Library Of Congress was aware of these short-comings, and consequently MODS was designed. Unlike MARC and MARCXML, MODS has no character limit and its field names are human-readable, not based on numeric codes. Given that MODS is a flavor of XML, all of this is a giant step forward.

Unfortunately, the library profession’s primary access tools — the online catalog and “discovery system” — still heavily rely on traditional MARC for input. Consequently, without a wholesale shift in library practice, the intellectual capital the profession so dearly wants to share is figuratively locked in the 1960’s.

Not a panacea

XML really is an excellent technology, and it is most certainly apropos for the work of cultural heritage institutions such as libraries, archives, and museums. This is true for many reasons:

it is computing platform independent
it requires a minimum of computer technology to read and write
to some degree, it is self-documenting, and
especially considering our profession, it is all about data, information, and knowledge

On the other hand, it does have a number of disadvantages, for example:

it is verbose — not necessarily succinct
while easy to read and write, it can be difficult to process
like all things computer program-esque, it imposes a set of syntactical rules, which people can sometimes find frustrating
its adoption as a standard has not been as ubiquitous as desired

To date you have learned how to read, write, and process XML and a number of its specific “flavors”, but you have by no means learned everything. Instead you have received a more than adequate introduction. Other XML topics of importance include:

evolutions in XSLT and XPath
XML-based databases
XQuery, a standardized method for querying sets of XML similar to the structured query language of relational databases
additional XML vocabularies, most notably RSS
a very functional way of making modern Web browsers display XML files
XML processing instructions as well as reserved attributes like lang

In short, XML is not a panacea, but it is an excellent technology for library work.

Summary

You have all but concluded a course on XML in libraries, and now is a good time for a summary.

First of all, XML is one of culture’s more recent attempts at formalizing knowledge. At its root (all puns intended) is data, such as a number like 1776. Through mark-up we might say this number is a year, thus turning the data into information. By putting the information into context, we might say that 1776 is when the Declaration of Independence was written and a new type of government was formed. Such generalizations fall into the realm of knowledge. To some degree, XML facilitates the transformation of data into knowledge. (Again, all puns intended.)

Second, understand that XML is also a data structure defined by the characteristics of well-formedness. By that I mean XML has one and only one root element. Elements must be opened and closed in a hierarchical manner. Attributes of elements must be quoted, and a few special characters must always be escaped. The X in XML stands for “extensible”, and through the use of DTDs and schemas, specific XML “flavors” can be specified.

With this under your belts you then experimented with at least a couple of XML flavors: TEI and EAD. The former is used to mark-up literature. The latter is used to describe archival collections. You then learned about the XML transformation process through the application of XSL and XPath, two rather difficult technologies to master. Lastly, you made strong efforts to apply the principles of XML to the principles of librarianship by marking up sets of documents or creating your own knowledge entity. It is hoped you have made a leap from mere technology to system. It is not about Oxygen nor graphic design. It is about the chemistry of disseminating data as unambiguously as possible for the purposes of increasing the sphere of knowledge. With these things understood, you are better equipped to practice librarianship in the current technological environment.
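
The rules of well-formedness are easy to test, because any XML parser will refuse a file that breaks them. The sketch below uses the parser in the Python standard library; the sample strings and the function name is_well_formed are my own. Note that it checks well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One good sample, and one sample breaking each rule in turn.
samples = {
    "good": '<letter year="1776"><p>Tom &amp; Jerry</p></letter>',
    "two roots": "<a/><b/>",
    "bad nesting": "<a><b></a></b>",
    "unquoted attribute": "<letter year=1776/>",
    "unescaped ampersand": "<p>Tom & Jerry</p>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, okay in results.items():
    print(name, okay)
```

Only the first sample passes. Each of the others violates exactly one of the rules listed above.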

Finally, remember, there is no such thing as a Dublin Core record.

Epilogue — Use and understanding

This course in XML was really only an introduction. You were expected to read, write, and transform XML. This process turns data into information. All of this is fine, but what about knowledge?

One of the original reasons texts were marked up was to facilitate analysis. Researchers wanted to extract meaning from texts. One way to do that is to do computational analysis against text. To facilitate computational analysis people thought it was necessary for essential characteristics of a text to be delimited. (It is/was thought computers could not really do natural language processing.) How many paragraphs exist? What are the names in a text? What about places? What sorts of quantitative data can be statistically examined? What main themes does the text include? All of these things can be marked-up in a text and then counted (analyzed).

Now that you have marked up sets of letters with persname elements, you can use XPath to not only find persname elements but count them as well. Which document contains the most persnames? What are the persnames in each document? Tabulate their frequency. Do this over a set of documents to look for trends across the corpus. This is only a beginning, but entirely possible given the work you have already done.
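
Here is a minimal sketch of that sort of counting, written in Python. The two “letters” are invented and far simpler than real TEI or EAD (there are no namespaces, for example), and the file names are hypothetical. The expression .//persname is the XPath part; the tabulation is done with a Counter.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two invented letters; real files would be read from disk instead.
documents = {
    "letter-01.xml": "<letter><p><persname>Jefferson</persname> wrote to "
                     "<persname>Adams</persname>.</p></letter>",
    "letter-02.xml": "<letter><p><persname>Adams</persname> replied; "
                     "<persname>Franklin</persname> and "
                     "<persname>Adams</persname> met.</p></letter>",
}

per_document = {}   # file name -> frequency of each name in that file
corpus = Counter()  # frequency of each name across all files

for name, xml_text in documents.items():
    root = ET.fromstring(xml_text)
    # .//persname finds persname elements at any depth
    names = [node.text for node in root.findall(".//persname")]
    per_document[name] = Counter(names)
    corpus.update(names)

# Which document contains the most persnames?
most_names = max(per_document, key=lambda d: sum(per_document[d].values()))

print(most_names)
print(corpus.most_common())
```

Run against a few hundred letters instead of two, the same handful of lines begins to show who is mentioned most often, and by whom.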

Libraries do not facilitate enough quantitative analysis against our content. Marking things up in XML is a good start, but let’s go to the next step. Let’s figure out how the profession can move its readership from discovery to analysis, towards use & understanding.

Mr. Serials continues

Eric Lease Morgan — Wed, 06 Jan 2016 16:42:37 +0000

The (ancient) Mr. Serials Process continues to support four mailing list archives, specifically, the archives of ACQNET, Colldv-l, Code4Lib, and NGC4Lib, and this posting simply makes the activity explicit.

Mr. Serials is/was a process I developed quite a number of years ago as a method for collecting, organizing, and archiving electronic journals (serials). The process worked well for a number of years, until electronic journals were no longer distributed via email. Now-a-days, Mr. Serials only collects the content of a few mailing lists. That’s okay. Things change. No big deal.

On the other hand, from a librarian’s and archivist’s point-of-view, it is important to collect mailing list content in its original form: email. Email uses the SMTP protocol. The communication sent back and forth, between email server and client, is well-structured albeit becoming verbose. Probably “the” standard for saving email on a file system is called mbox. Given an mbox file, it is possible to use any number of well-known applications to read/write mbox data. Heck, all you need is a text editor. Increasingly, email archives are not available from mailing list applications, and if they are, then they are available only to mailing list administrators and/or in a proprietary format. For example, if you host a mailing list on Google, can you download an archive of the mailing list in a form that is easily and universally readable? I think not.

Mr. Serials circumvents this problem. He subscribes to mailing lists, saves the incoming email to mbox files, and processes the mbox files to create searchable/browsable interfaces. The interfaces are not hugely aesthetically appealing, but they are more than functional, and the source files are readily available. Just ask.
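
To give a sense of how little is required to work with mbox files, here is a small sketch in Python. It is not the Mr. Serials code. It uses the mailbox module from the Python standard library to save two made-up messages to a temporary mbox file and then read the subject lines back out; the addresses and the file name are hypothetical.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

def make_message(sender, subject, body):
    """Build a simple plain-text email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "list@example.org"
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

# A throw-away location; a real archive would live somewhere permanent.
path = os.path.join(tempfile.mkdtemp(), "archive.mbox")

# "Subscribe and save": append incoming messages to the mbox file.
box = mailbox.mbox(path)
box.add(make_message("alice@example.org", "Serials pricing", "First posting."))
box.add(make_message("bob@example.org", "Re: Serials pricing", "A reply."))
box.flush()
box.close()

# "Process": read the archive back and list the subject lines.
subjects = [message["Subject"] for message in mailbox.mbox(path)]
print(subjects)
```

Open the resulting file in a text editor and you will see the messages one after the other, each introduced by a line beginning with the word From. That is all an mbox file is.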

Most recently both the ACQNET and Colldv-l mailing lists moved away from their hosting institutions to servers hosted by the American Library Association. This has not been the first time these lists have moved. It probably won’t be the last, but since Mr. Serials continues to subscribe to these lists, comprehensive archives persevere. Score a point for librarianship and the work of archives. Long live Mr. Serials.

Re-MARCable

Eric Lease Morgan — Tue, 17 Nov 2015 19:07:59 +0000

This blog posting contains: 1) questions/statements about MARC and posted by graduate library school students taking an online XML class I’m teaching this semester, and 2) my replies. Considering my previously published blog posting, you might say this posting is “re-MARCable”.

I’m having some trouble accessing the file named data.marc for the third question in this week’s assignment. It keeps opening in word and all I get is completely unreadable. Is there another way of going about finding the answer for that particular question?

Okay. I have to admit. I’ve been a bit obtuse about the MARC file format.

MARC is/was designed to contain ASCII characters, and therefore it ought to be human-readable. MARC does not contain binary characters and therefore ought to be readable in text editors. DO NOT open the .marc file in your word processor. Use your text editor to open it up. If you have line wrap turned off, then you ought to see one very long line of ugly text. If you turn on line wrap, then you will see many lines of… ugly text. Attached (hopefully) is a screen shot of many MARC records loaded into my text editor. And I rhetorically ask, “How many records are displayed, and how do you know?”

I’m trying to get y’all to answer a non-rhetorical question asked against yourself, “Considering the state of today’s computer technology, how viable is MARC? What are the advantages and disadvantages of MARC?”

I am taking Basic Cataloging and Classification this semester, but we did not discuss octets or have to look at an actual MARC file. Since this is supposed to be read by a machine, I don’t think this file format is for human consumption which is why it looks scary.

[Student], you continue to be a resource for the entire class. Thank you.

Everybody, yes, you will need to open the .marc file in your text editor. All of the files we are creating in this class ought to be readable in your text editor. True and really useful data files ought to be text files so they can be transferred from application to application. Binary files are sometimes more efficient, but not long-lasting. Here in Library Land we are in it for the long haul. Text files are where it is at. PDF is bad enough. Knowing how to manipulate things in a text editor is imperative when it comes to really using a computer. Imperative!!! Everything on the Web is in plain text.

In any event, open the .marc file in your text editor. On a Macintosh that is Text Edit. On Windows it is NotePad or WordPad. Granted all of these particular text editors are rather brain-dead, but they all function adequately. A better text editor for Macintosh is Text Wrangler, and for Windows is NotePad++. When you open the .marc file, it will look ugly. It will seem unreadable, but that is not the case at all. Instead, a person needs to know the “secret codes” of cataloging, as well as a bit of an obtuse data structure in order to make sense of the whole thing.

Okay. Octets. Such are 8-bit characters, as opposed to the 7-bit characters of ASCII encoding. The use of 8-bit characters enabled Library Land to integrate characters such as ñ, é, or å into its data. And while Library Land was ahead of the game in this regard, it did not embrace Unicode when it came along:

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. Developed in conjunction with the Universal Character Set standard and published as The Unicode Standard, the latest version of Unicode contains a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets. [1]

Nor did Library Land update its data when changes happened. Consequently, not only do folks outside Library Land need to know how to read and write MARC records (which they can’t), they also need to know and understand the weird character encodings which we use. In short, the data of Library Land is not very easily readable by the wider community, let alone very many people within our own community. Now that is irony. Don’t you think so!? Our data is literally and figuratively stuck in 1965, and we continue to put it there.
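
The difference between 7-bit characters, 8-bit octets, and Unicode can be seen with a single word. In this sketch Latin-1 stands in for “some 8-bit encoding”. That is an assumption made for the sake of illustration; MARC records have traditionally used a different encoding of their own (MARC-8), which Python does not support out of the box.

```python
text = "Señor"

latin1 = text.encode("latin-1")  # every character fits in one 8-bit octet
utf8 = text.encode("utf-8")      # Unicode as UTF-8: the ñ needs two octets

# 7-bit ASCII has no ñ at all, so the encoding fails.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# Read octets with the wrong encoding and you get garbage, not an error.
misread = utf8.decode("latin-1")

print(len(text), len(latin1), len(utf8), fits_in_ascii)
print(misread)
```

The last line is the important one. A program guessing the wrong encoding does not complain; it just quietly displays the wrong characters, and that is exactly what happens when the wider community tries to read our data.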

Professor, is this data.marc file supposed to be read only by a machine as [a fellow classmate] suggested?

Only readable by a computer? The answer is both no and yes.

Any data file intended to be shared between systems (sets of applications) ought to be saved as plain text in order to facilitate transparency and eliminate application monopolies/tyrannies. Considering the time when MARC was designed, it fulfilled these requirements. The characters were 7-bits long (ASCII), the MARC codes were few and far between, and its sequential nature allowed it to be shipped back and forth on things like tape or even a modem. (“Remember modems?”) Without the use of an intermediary computer program, it is entirely possible to read and write MARC records with a decent text editor. So, the answer is “No, MARC is not only readable by a machine.”

On the other hand, considering how much extra data (“information”) the profession has stuffed into the MARC data structure, it is really really hard to edit MARC records with a text editor. Library Land has mixed three things into a single whole: data, presentation, and data structure. This is really bad when it comes to computing. For example, a thing may have been published in 1542, but the cataloger is not certain of this date. Consequently, they will enter a data value of [1542]. Well, that is not a date (a number), but rather a string (a word). To make matters worse, the cataloger may think the date (year) of publication is within a particular decade but not be exactly sure, and the date may be entered as [154?]. Ack! Then let’s get tricky and add a copyright notation to a more recent but uncertain date: [c1986]. Does it never end? Then let’s talk about the names of people. The venerable Fred Kilgour, founder of OCLC, is denoted in cataloging rules as Kilgour, Fred. Well, I don’t think Kilgour, Fred ever backwards talked to make sure his ideas were sortable. Given the complexity of cataloging rules, which never simplify, it is really not feasible to read and write MARC records without an intermediate computer program. So, on the other hand, “Yes, an intermediary computer program is necessary.” But if this is true, then why don’t catalogers know how to read and write MARC records? The answer lies in what I said above. We have mixed three things into a single whole, and that is a really bad idea. We can’t expect catalogers to be computer programmers too.
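
To illustrate what mixing data with presentation costs, here is a sketch of the sort of code a programmer has to write just to get a year out of the date examples above. It handles only those three patterns plus a plain year; real cataloging data has many more variations. The function names and the shape of the result are my own invention.

```python
import re

# Optional bracket, optional "c", three digits, then a digit or "?".
PATTERN = re.compile(r"^\[?(c)?(\d{3})(\d|\?)\]?$")

def parse_cataloged_date(value):
    """Separate the year (data) from the cataloger's notation."""
    value = value.strip()
    match = PATTERN.match(value)
    if not match:
        return None
    copyright_flag, prefix, last = match.groups()
    uncertain_digit = last == "?"
    year = int(prefix + ("0" if uncertain_digit else last))
    return {
        "earliest": year,
        "latest": year + 9 if uncertain_digit else year,
        "bracketed": value.startswith("["),      # supplied by the cataloger
        "copyright": copyright_flag is not None,
    }

def display_name(heading):
    """Turn an inverted heading like 'Kilgour, Fred' into 'Fred Kilgour'."""
    last, _, first = heading.partition(", ")
    return f"{first} {last}" if first else last

for example in ["[1542]", "[154?]", "[c1986]", "1854"]:
    print(example, parse_cataloged_date(example))
print(display_name("Kilgour, Fred"))
```

Had the year, the degree of certainty, and the copyright status been recorded as three separate pieces of data in the first place, none of this untangling would be necessary.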

The bottom line is this. Library Land automated its processes but it never really went to the next level and used computers to enhance library collections and services. All Library Land has done is use computers to facilitate library practice; Library Land has not embraced the true functionality of computers such as their ability to evaluate data/information. We have simply done the same thing. We wrote catalog cards by hand. We then typed catalog cards. We then used a computer to create them.

One more thing, Library Land simply does not have enough computer programmer types. Libraries build collections. Cool. Libraries provide services against the collections. Wonderful. This worked well (more or less) when libraries were physical entities in a localized environment. Now-a-days, when libraries are a part of a global network, libraries need to speak the global language, and that global language is spoken through computers. Computers use relational databases to organize information. Computers use indexes to make the information findable. Computers use well-structured Unicode files (such as XML, JSON, and SQL files) to transmit information from one computer to another. People who work in libraries (librarians) need to know these sorts of technologies in order to work on a global scale, but realistically speaking, what percentage of librarians know how to do these things, let alone know what they are? Probably less than 10%. It needs to be closer to 33%. Where 33% of the people build collections, 33% of the people provide services, and 33% of the people glue the work of the first 66% into a coherent whole. What to do with the remaining 1%? Call them “administrators”.

[1] Unicode – https://en.wikipedia.org/wiki/Unicode

MARC, MARCXML, and MODS

Eric Lease Morgan — Wed, 11 Nov 2015 15:19:36 +0000

This is the briefest of comparisons between MARC, MARCXML, and MODS. It was written for a set of library school students learning XML.

MARC is an acronym for Machine Readable Cataloging. It was designed in the 1960’s, and its primary purpose was to ship bibliographic data on tape to libraries who wanted to print catalog cards. Consider the computing context of the time. There were no hard drives. RAM was beyond expensive. And the idea of a relational database had yet to be articulated. Consider the idea of a library’s access tool — the card catalog. Consider the best practice of catalog cards. “Generate no more than four or five cards per book. Otherwise, we will not be able to accommodate all of the cards in our drawers.” MARC worked well, and considering the time, it represented a well-designed serial data structure complete with multiple checksum redundancy.
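
The serial nature of the structure can be sketched in Python. What follows is an illustration under my own assumptions, not a real MARC library: it builds two deliberately minimal records (one field each) and then reads them back using nothing but the leader, the directory, and three delimiter characters. The function names are hypothetical, and real records carry many more fields, including repeated tags which this sketch would overwrite.

```python
FT, RT, SF = b"\x1e", b"\x1d", b"\x1f"  # field, record, subfield delimiters

def build_record(fields):
    """Assemble one minimal record from (tag, bytes) pairs."""
    directory, data = b"", b""
    for tag, value in fields:
        field = value + FT
        # Directory entry: 3-character tag, 4-digit length, 5-digit offset.
        directory += f"{tag}{len(field):04d}{len(data):05d}".encode("ascii")
        data += field
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(data) + 1     # plus the fields and record terminator
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FT + data + RT

def parse_record(record):
    """Return (declared length, {tag: field bytes}) for one raw record."""
    length = int(record[0:5].decode("ascii"))   # leader positions 0-4
    base = int(record[12:17].decode("ascii"))   # leader positions 12-16
    directory = record[24:base - 1].decode("ascii")
    fields = {}
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3]
        size = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        # Drop the trailing field terminator; repeated tags overwrite.
        fields[tag] = record[base + start:base + start + size - 1]
    return length, fields

# Two records shipped one after the other, as if on tape.
raw = (build_record([("245", b"10" + SF + b"aWalden")]) +
       build_record([("245", b"10" + SF + b"aEmma")]))

# Each record ends with a record terminator, so splitting on it counts them.
records = [chunk + RT for chunk in raw.split(RT) if chunk]
length, fields = parse_record(records[0])

print(len(records), length, fields)
```

Everything needed to find a field is stated up front as a number of characters, which is exactly what a program reading sequentially from tape needs. It is also why the widths of those numbers matter.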

Someone then got the “cool” idea to create an online catalog from MARC data. The idea was logical but grew without a balance of library and computing principles. To make a long story short, library principles sans any real understanding of computing principles prevailed. The result was a bloating of the MARC record to include all sorts of administrative data that never would have made it on to a catalog card, and this data was delimited in the MARC record with all sorts of syntactical “sugar” in the form of punctuation. Moreover, as bibliographic standards evolved, the previously created data was not updated, and sometimes people simply ignored the rules. The consequence has been disastrous, and even Google can’t systematically parse the bibliographic bread & butter of Library Land.* The folks in the archives community — with the advent of EAD — are so much better off.

Soon after XML was articulated the Library Of Congress specified MARCXML — a data structure designed to carry MARC forward. For the most part, it addressed many of the necessary issues, but since it insisted on making the data in a MARCXML file 100% transformable into a “traditional” MARC record, MARCXML falls short. For example, without knowing the “secret codes” of cataloging — the numeric field names — it is very difficult to determine what are the authors, titles, and subjects of a book.

The folks at the Library Of Congress understood these limitations almost from the beginning, and consequently they created an additional bibliographic standard called MODS — Metadata Object Description Schema. This XML-based metadata schema goes a long way in addressing both the computing times of the day and the needs for rich, full, and complete bibliographic data. Unfortunately, “traditional” MARC records are still the data structure ingested and understood by the profession’s online catalogs and “discovery systems”. Consequently, without a wholesale shift in practice, the profession’s intellectual content is figuratively stuck in the 1960’s.
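
The difference in readability shows up as soon as you try to extract a title. In this Python sketch the two records are abbreviated, hand-made examples, though the namespaces and element names are the real ones from MARCXML and MODS. The variable names are my own.

```python
import xml.etree.ElementTree as ET

# An abbreviated MARCXML record: fields are named with numeric codes.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Thoreau, Henry David</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Walden</subfield>
  </datafield>
</record>"""

# The same description in abbreviated MODS: fields are named with words.
MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Walden</title></titleInfo>
  <name type="personal"><namePart>Thoreau, Henry David</namePart></name>
</mods>"""

NS = {
    "marc": "http://www.loc.gov/MARC21/slim",
    "mods": "http://www.loc.gov/mods/v3",
}

# With MARCXML you have to know that 245 $a means "title".
marc_title = ET.fromstring(MARCXML).findtext(
    "marc:datafield[@tag='245']/marc:subfield[@code='a']", namespaces=NS)

# With MODS the path all but explains itself.
mods_title = ET.fromstring(MODS).findtext(
    "mods:titleInfo/mods:title", namespaces=NS)

print(marc_title, mods_title)
```

Both expressions return the same title, but only one of them can be written without a cataloging manual at hand.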

* Consider the hodgepodge of materials digitized by Google and accessible in the HathiTrust. A search for Walden by Henry David Thoreau returns a myriad of titles, all exactly the same.

Readings

MARC (http://www.loc.gov/marc/bibliographic/bdintro.html) – An introduction to the MARC standard
leader (http://www.loc.gov/marc/specifications/specrecstruc.html#leader) – All about the leader of a traditional MARC record
MARC Must Die (http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/) – An essay by Roy Tennant outlining why MARC is not a useful bibliographic format. Notice when it was written.
MARCXML (https://www.loc.gov/standards/marcxml/marcxml-design.html) – Here are the design considerations for MARCXML
MODS (http://www.loc.gov/standards/mods/userguide/) – This is an introduction to MODS

Exercise

This is much more of an exercise than it is an assignment. The goal of the activity is not to get correct answers but instead to provide a framework for the reader to practice critical thinking against some of the bibliographic standards of the library profession. To the best of your ability, and in the form of a written essay between 500 and 1000 words long, answer and address the following questions based on the contents of the given .zip file:

Measured in characters (octets), what is the maximum length of a MARC record? (Hint: It is defined in the leader of a MARC record.)
Given the maximum length of a MARC record (and therefore a MARCXML record), what are some of the limitations this imposes when it comes to full and complete bibliographic description?
Given the attached .zip file, how many bibliographic items are described in the file named data.marc? How many records are described in the file named data.xml? How many records are described in the file named data.mods? How did you determine the answers to the previous three questions? (Hint: Open and read the files in your favorite text and/or XML editor.)
What is the title of the book in the first record of data.marc? Who is the author of the second record in the file named data.xml? What are the subjects of the third record in the file named data.mods? How did you determine the answers to the previous three questions? Be honest.
Compare & contrast the various bibliographic data structures in the given .zip file. There are advantages and disadvantages to all three.

“Sum reflextions” on travel

Eric Lease Morgan — Sun, 25 Oct 2015 11:31:16 +0000

These are “sum reflextions” on travel; travel is a good thing, for many reasons.

I am blogging in front of the Pantheon. Amazing? Maybe. Maybe not. But the ability to travel, see these sorts of things, experience the different languages and cultures truly is amazing. All too often we live in our own little worlds, especially in the United States. I can’t blame us too much. The United States is geographically large. It borders only two other countries. One country speaks Spanish. The other speaks English and French. While the United States is the proverbial “melting pot”, there really isn’t very much cultural diversity in the United States, not compared to Europe. Moreover, the United States does not nearly have the history of Europe. For example, I am sitting in front of a building that was built before the “New World” was even considered as existing. It doesn’t help that the United States’ modern version of imperialism tends to make “United Statesians” feel as if they are the center of the world. I guess that, in some ways, it is not much different than Imperial Rome. “All roads lead to Rome.”

As you may or may not know, I have commenced upon a sort of leave of absence from my employer. In the past six weeks I have moved all of my belongings to a cabin in a remote part of Indiana, and I have moved myself to Chicago. From there I began a month-long adventure. It began in Tuscany where I painted and deepened my knowledge of Western art history. I spent a week in Venice where I did more painting, walked up to my knees in water because the streets flooded, and I experienced Giotto’s frescos in Padua. For the past week I experienced Rome and did my best to actively participate in a users group meeting called ADLUG, the remnants of a user’s group meeting surrounding one of the very first integrated library systems, Dobis Libis. I also painted and rode a bicycle along the Appian Way. I am now on my way to Avignon where I will take a cooking class and continue on an “artist’s education”.

Travel is not easy. It requires a lot of planning and coordination. “Where will I be when, and how will I get there? Once I’m there, what am I going to do, and how will I make sure things don’t go awry?” In this way, travel is not for the faint of heart, especially when venturing into territory where you do not know the language. It can be scary. Nor is travel inexpensive. One needs to maintain two households.

Travel is a kind of education that cannot be gotten through the reading of books, the watching of television, or discussion with other people. It is something that must be experienced first hand. Like sculpture, it must literally be experienced in time & space in order to be fully appreciated.

What does this have to do with librarianship? On one hand, nothing. On the other hand, everything. From my perspective, librarianship is about a number of processes applied against a number of things. These processes include collection, organization, preservation, dissemination, and sometimes evaluation. The things of librarianship are data, information, knowledge, and sometimes wisdom. Even today, with the advent of our globally networked computers, the activities of librarianship remain essentially unchanged when compared to the activities of more than a hundred years ago. Libraries still curate collections, organize the collections into useful sets, provide access to the collections, and endeavor to maintain all of these services for the long haul.

Like most people and travel, many librarians (and people who work in libraries) do not have a true appreciation for the work of their colleagues. Sure, everybody applauds everybody else’s work, but have they actually walked in those other people’s shoes? The problem is most acute between the traditional librarians and the people who write computer programs for libraries. Both sets of people have the same goals; they both want to apply the same processes to the same things, but their techniques for accomplishing those goals are dissimilar. One wants to take a train to get where they are going, and the other wants to fly. This must change lest the profession become even less relevant.

What is the solution? In a word, travel. People need to mix and mingle with the other culture. Call it cross-training. Have the computer programmer do some traditional cataloging for a few weeks. Have the cataloger learn how to design, implement, and maintain a relational database. Have the computer programmer sit at the reference desk for a while in order to learn about service. Have the reference librarian work with the computer programmer and learn how to index content and make it searchable. Have the computer programmer work in an archive or conservatory making books and saving content in gray cardboard boxes. Have the archivist hang out with the computer programmer and learn how content is backed up and restored.

How can all this happen? In my opinion, the most direct solution is advocacy from library administration. Without the blessing of library administration everybody will say, “I don’t have time for such ‘travel’.” Well, library work is never done, and time will need to be carved out and taken from the top, like retirement savings, in order for such trips abroad to come to fruition.

The waiters here at my cafe are getting restless. I have had my time here, and it is time to move on. I will come back, probably in the Spring, and I’ll stay longer. In the meantime, I will continue with my own personal education.

What is old is new again

Eric Lease Morgan — Thu, 22 Oct 2015 10:40:09 +0000

The “how’s” of librarianship are changing, but not the “what’s”.

(This is an outline for my presentation given at the ADLUG Annual Meeting in Rome (October 21, 2015). Included here are also the one-page handout and slides, both in the form of PDF documents.)

Linked Data

Linked Data is a method of describing objects, and these objects can be the objects in a library. In this way, Linked Data is a type of bibliographic description.

Linked Data is a manifestation of the Semantic Web. It is an interconnection of virtual sentences known as triples. Triples are rudimentary data structures, and as the name implies, they are made of three parts: 1) subjects, 2) predicates, and 3) objects. Subjects always take the form of a URI (think “URL”), and they point to things real or imaginary. Objects can take the form of a URI or a literal (think “word”, “phrase” or “number”). Predicates also take the form of a URI, and they establish relationships between subjects and objects. Sets of predicates are called ontologies or vocabularies and they present the languages of Linked Data.

Through the curation of sets of triples, and through the re-use of URIs, it is often possible to make assumed (implicit) information explicit and to bring new knowledge to light.
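
To make the idea concrete, here is a minimal sketch in Python of triples as three-part tuples. Every URI under example.org, the sample data, and the function name are made up for illustration; only the Dublin Core and FOAF predicate URIs are real vocabularies. Because the two small sets re-use the same URI for a person, merging them makes it possible to answer a question neither set answers alone.

```python
# Hypothetical subject URIs, for illustration only.
BOOK = "http://example.org/book/representative-men"
PERSON = "http://example.org/person/emerson"

# Real, widely used predicate URIs (Dublin Core terms and FOAF).
TITLE = "http://purl.org/dc/terms/title"
CREATOR = "http://purl.org/dc/terms/creator"
NAME = "http://xmlns.com/foaf/0.1/name"

# One set of triples describes a book; its creator is given only as a URI.
library_triples = [
    (BOOK, TITLE, "Representative Men"),
    (BOOK, CREATOR, PERSON),
]

# Another set, from a different source, describes the person behind that URI.
archive_triples = [
    (PERSON, NAME, "Ralph Waldo Emerson"),
]


def objects_of(subject, predicate, triples):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Merging the sets links them through the shared URI.
merged = library_triples + archive_triples
creators = objects_of(BOOK, CREATOR, merged)
creator_names = [name for c in creators for name in objects_of(c, NAME, merged)]
print(creator_names)
```

The book's set never states the author's name, and the person's set never mentions the book; the name of the book's creator becomes explicit only after the two are combined.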

There is an increasing number of applications enabling libraries to transform and convert their bibliographic data into Linked Data. One such application is called ALIADA.

When & if the intellectual content of libraries, archives, and museums is manifested as Linked Data, then new relationships between resources will be uncovered and discovered. Consequently, one of the purposes of cultural heritage institutions will be realized. Thus, Linked Data is a newer, more timely method of describing collections; what is old is new again.

Curation of digital objects

The curation of collections, especially in libraries, does not have to be limited to physical objects. Increasingly, the curation of digital objects represents a growth area.

With the advent of the Internet there exists an abundance of full-text digital objects just waiting to be harvested, collected, and cached. It is not good enough to link and point to such objects because links break and institutions (websites) dissolve.

Curating digital objects is not easy, and it requires the application of traditional library principles of preservation in order to be fulfilled. It also requires systematic organization and evaluation in order to be useful.

Done properly, there are many advantages to the curation of such digital collections: long-term access, analysis & evaluation, use & re-use, and relationship building. Examples include: the creation of institutional repositories, the creation of bibliographic indexes made up of similar open access journals, and the complete works of an author of interest.

In the recent past I have created “browsers” used to do “distant reading” against curated collections of materials from the HathiTrust, the EEBO-TCP, and JSTOR. Given a curated list of identifiers, each of the browsers locally caches the full text of each digital object, creates a “catalog” of the collection, does full text indexing against the whole collection, and generates a set of reports based on the principles of text mining. The result is a set of both HTML files and simple tab-delimited text files enabling the reader to get an overview of the collection, query the collection, and provide the means for closer reading.

How can these tools be used? A reader could first identify the complete works of a specific author from the HathiTrust, say, Ralph Waldo Emerson. They could then identify all of the journal articles in JSTOR written about Ralph Waldo Emerson. Finally the reader could use the HathiTrust and JSTOR browsers to curate the full text of all the identified content to verify previously established knowledge or discover new knowledge. On a broader level, a reader could articulate a research question such as “What are some of the characteristics of early American literature, and how might some of its authors be compared & contrasted?” or “What are some of the definitions of a ‘great’ man, and how have these definitions changed over time?”

The traditional principles of librarianship (collection, organization, preservation, and dissemination) are alive and well in this digital age. Such are the “whats” of librarianship. It is the “hows” of librarianship that need to evolve in order for the profession to remain relevant. What is old is new again.

Painting in Tuscany

Eric Lease Morgan — Tue, 13 Oct 2015 10:30:07 +0000

As you may or may not know, I have commenced upon a sort of leave of absence from my employer, and I spent the better part of the last two weeks painting in Tuscany.

Eight other students and I arrived in Arezzo (Italy) on Wednesday, October 1, and we were greeted by Yves Larocque of Walk The Arts. We then spent the next ten days on a farm/villa very close to Sinalunga (Italy) where we learned about color theory, how to mix colors, a bit of Western art history, and art theory. All the while we painted and painted and painted. I have taken a few art classes in my day and this was quite honestly the best one I’ve ever attended. It was thorough, individualized, comprehensive, and totally immersive. Painting in Tuscany was a wonderful way to commence a leave of absence. The process gave me a chance to totally get away, see things from a different vantage point, and begin an assessment.

What does this have to do with librarianship? I don’t know, yet. When I find out I’ll let you know.

My water collection predicts the future

Eric Lease Morgan — Tue, 29 Sep 2015 16:37:35 +0000

As many of you may or may not know, I collect water, and it seems as if my water collection predicts the future, sort of.

Since 1979 or so, I’ve been collecting water. [1] The purpose of the collection is/was to enable me to see and experience different parts of the world whenever I desired. As the collection grew and my computer skills developed, I frequently used the water collection as a kind of guinea pig for digital library projects. For example, my water collection was once manifested as a HyperCard Stack complete with the sound of running water in the background. For a while my water collection was maintained in a FileMaker database that generated sets of HTML. Quite a number of years ago I migrated everything to MySQL and embedded images of the water bottles in fields of the database. This particular implementation also exploited XML and XSLT to dynamically make the content available on the Web. (There was even some RDF output.) After that I included geographic coordinates into the database. This made it easy for me to create maps illustrating whence the water came. To date, there are about two hundred and fifty waters in my collection, but active collecting has subsided in the past few years.

But alas, this past year I migrated my co-located host to a virtual machine. In the process I moved all of my Web-based applications — dating back more than two decades — to a newer version of the LAMP stack, and in the process I lost only a single application — my water collection. I still have all the data, but the library used to integrate XSLT into my web server (AxKit) simply would not work with Apache 2.0, and I have not had the time to re-implement a suitable replacement.

Concurrently, I have been negotiating a two-semester long leave-of-absence from my employer. The “leave” has been granted and commenced a few weeks ago. The purpose of the leave is two-fold: 1) to develop my skills as a librarian, and 2) to broaden my experience as a person. The first part of my leave is to take a month-long vacation, and that vacation begins today. For the first week I will paint in Tuscany. For the second week I will drink coffee in Venice. During the third week I will give a keynote talk at ADLUG in Rome. [2] Finally, during the fourth week I will learn how to make croissants in Provence. After the vacation is over I will continue to teach “XML 101” to library school graduate students at San Jose State University. [3] I will also continue to work for the University of Notre Dame on a set of three text mining projects (EEBO, JSTOR, and HathiTrust). [4, 5, 6]

As I was getting ready for my “leave” I was rooting through my water collection, and I found four different waters, specifically from: 1) Florence, 2) Venice, 3) Rome, and 4) Nice. As I looked at the dates of when the water was collected, I realized I will be in those exact same four places, on those exact same four days, exactly thirty-three years after I originally collected them. My water collection predicted my future. My water collection is a sort of model of me and my professional career. My water collection has sent me a number of signs.

This “leave-of-absence” (which is not really a leave nor a sabbatical, but instead a temporary change to adjunct faculty status) is a whole lot like going to college for the first time. “Where in the world am I going? What in the world am I going to do? Who in the world will I meet?” It is both exciting and scary at the same time. It is an opportunity I would be foolish to pass up, but it is not as easy as you might imagine. That said, I guess I am presently an artist- and librarian-at-large. I think I need new, albeit temporary, business cards to proclaim my new title(s).

Wish me luck, and “On my mark. Get set. Go!”

[1] blog postings describing my water collection – ./../2009/09/water-1-of-3/index.html
[2] ADLUG – http://www.adlug.net/
[3] “XML 101” at SJSU – http://ischoolapps.sjsu.edu/facultypages/view.php?fac=morgane
[4] EEBO browser – https://github.com/ndlib/text-analysis-eebo
[5] JSTOR browser – https://github.com/ndlib/text-analysis-jstor
[6] HathiTrust browser – https://github.com/ndlib/text-analysis-htrc

Some automated analysis of Richard Baxter’s works

Eric Lease Morgan — Sat, 13 Jun 2015 21:19:00 +0000

baxter

This page describes a corpus named baxter. It is a programmatically generated report against the full text of all the writing of Richard Baxter (an English Puritan church leader, poet, and hymn-writer) as found in Early English Books Online. It was created using a (fledgling) tool called the EEBO Workset Browser.

General statistics

An analysis of the corpus’s metadata provides an overview of what and how many things it contains, when things were published, and the sizes of its items:

Number of items – 140
Publication date range – 1650 to 1697 (histogram : boxplot)
Sizes in pages – 1 to 1258 (histogram : boxplot)
Total number of pages – 33507
Average number of pages per item – 239

Possible correlations between numeric characteristics of records in the catalog can be illustrated through a matrix of scatter plots. As you would expect, there is almost always a correlation between pages and number of words. Do others exist? For more detail, browse the catalog.

Notes on word usage

By counting and tabulating the words in each item of the corpus, it is possible to measure additional characteristics:

Sizes of items in words – 858 to 532780 (histogram : boxplot)
Total number of words – 4986083
Average number of words per item – 35614
Total number of unique words – 70036
Most common words – god (65230) may (34500) us (32510) one (31437) church (30329) men (28306) would (27141) man (26623) hath (26209) yet (22972) many (21280) much (20667) chriſt (20084) love (19732) world (19611) make (18478) know (17281) therefore (17169) faith (16088) must (15544) though (15314) good (15283) doth (15163) muſt (15089) things (14996)
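
The counting and tabulating behind figures like these can be sketched in a few lines of Python. This is an illustration only, not the EEBO Workset Browser's own code; the stop word list is a tiny stand-in and the function name is hypothetical. The pattern keeps letters such as the long s (ſ), which is why spellings like “chriſt” and “muſt” are tallied separately from their modern forms.

```python
import re
from collections import Counter

# A tiny, assumed stop word list; a real list would be much longer.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "that", "is", "it", "for"}


def count_words(text, stopwords=STOPWORDS):
    """Lower-case the text, split it into runs of letters, and tally the words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(word for word in words if word not in stopwords)


sample = "God is love, and the love of God is in the world. Chriſt muſt be known."
frequencies = count_words(sample)
for word, count in frequencies.most_common(3):
    print(word, count)
```

The total number of unique words is then simply `len(frequencies)`, and the total number of words (less stop words) is `sum(frequencies.values())`.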

Perusing the list of all words in the corpus (and their frequencies) as well as all unique words can prove to be quite insightful. Are there one or more words in these lists connoting an idea of interest to you, and if so, then to what degree do these words occur in the corpus?

To begin to see how words of your choosing occur in specific items, search the collection.

Through the creation of locally defined “dictionaries” or “lexicons”, it is possible to count and tabulate how specific sets of words are used across a corpus. This particular corpus employs three such dictionaries — sets of: 1) “big” names, 2) “great” ideas, and 3) colors. Their frequencies are listed below:

Most common “big” names – james (567) augustine (148) aquinas (142) plato (128) aristotle (59) plutarch (55) smith (55) bacon (37) hobbes (35) virgil (30) aurelius (29) gilbert (25) plotinus (23) mill (22) epictetus (18) swift (17) apollonius (13) tacitus (12) galen (12) homer (12) gibbon (9) lucretius (7) sterne (7) hippocrates (6) herodotus (4) For more detail, see the list of “big” name frequencies.
Most common “great” ideas – god (65231) one (31438) man (26624) many (21281) love (19733) world (19612) good (15284) life (13768) law (12074) sin (9698) time (9296) nature (8487) duty (7094) death (6759) truth (6689) religion (6483) soul (5993) matter (5631) peace (5093) particular (4830) mind (4676) government (4538) knowledge (4099) evil (4045) cause (3536) For more detail, see the list of “great” idea frequencies.
Colors – white (302) black (93) red (88) green (54) purple (51) brown (46) orange (9) gray (6) yellow (3) blue (1) For more detail, see the list of color word frequencies.
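
A dictionary-based measure of this kind might be computed as sketched below. The color list is the one shown above; the two sample items, the function name, and the idea of ranking items by the ratio of lexicon hits to total words are assumptions made for illustration, not a description of how the report itself was generated.

```python
import re

# The ten color words listed in the report above.
COLORS = {"white", "black", "red", "green", "purple",
          "brown", "orange", "gray", "yellow", "blue"}


def lexicon_ratio(text, lexicon):
    """Return (hits, total words, hits / total words) for one item."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = sum(1 for word in words if word in lexicon)
    ratio = hits / len(words) if words else 0.0
    return hits, len(words), ratio


# Two made-up items standing in for the texts of a corpus.
items = {
    "item-a": "The white horse stood by the black gate.",
    "item-b": "Faith and duty are the law of life.",
}

# Rank items by how densely they use words from the lexicon.
most_colorful = max(items, key=lambda name: lexicon_ratio(items[name], COLORS)[2])
print(most_colorful, lexicon_ratio(items[most_colorful], COLORS))
```

Using a ratio rather than a raw count keeps a very long item from winning simply because it is long.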

The distribution of words (histograms and boxplots), the frequency of words (wordclouds), and how these frequencies “cluster” together can be illustrated:

Histograms – “big” names; “great” ideas; colors
Boxplots – “big” names; “great” ideas; colors
Wordclouds – most common words; “big” names; “great” ideas; colors
Cluster dendrograms – most common words; “big” names; “great” ideas; colors

Items of interest

Based on the information above, the following items (and their associated links) are of possible interest:

Shortest item (1 p.) – Short instructions for the sick: Especially who by contagion, or otherwise, are deprived of the presence of a faithfull pastor. / By Richard Baxter. (TEI : HTML : plain text)
Longest item (1258 p.) – A Christian directory, or, A summ of practical theologie and cases of conscience directing Christians how to use their knowledge and faith, how to improve all helps and means, and to perform all duties, how to overcome temptations, and to escape or mortifie every sin : in four parts … / by Richard Baxter. (TEI : HTML : plain text)
Oldest item (1650) – The saints everlasting rest, or, A treatise of the blessed state of the saints in their enjoyment of God in glory wherein is shewed its excellency and certainty, the misery of those that lose it, the way to attain it, and assurance of it, and how to live in the continual delightful forecasts of it and now published by Richard Baxter … (TEI : HTML : plain text)
Most recent (1697) – Mr. Richard Baxter’s last legacy in select admonitions and directions to all sober dissenters. (TEI : HTML : plain text)
Most thoughtful item – Short instructions for the sick: Especially who by contagion, or otherwise, are deprived of the presence of a faithfull pastor. / By Richard Baxter. (TEI : HTML : plain text)
Least thoughtful item – Dattodiad y qwestiwn mawr, beth sydd raid i ni ei wneuthur fel y byddom gadwedig. Athrawiaethau i fuchedd sanctaidd. / O waith y disinydd parchedig Mr. Richard Baxter. (TEI : HTML : plain text)
Biggest name dropper – R. Baxter’s sence of the subscribed articles of religion (TEI : HTML : plain text)
Fewest quotations – Additions to the poetical fragments of Rich. Baxter written for himself and communicated to such as are more for serious verse than smooth. (TEI : HTML : plain text)
Most colorful – The certainty of the worlds of spirits and, consequently, of the immortality of souls of the malice and misery of the devils and the damned : and of the blessedness of the justified, fully evinced by the unquestionable histories of apparitions, operations, witchcrafts, voices &c. / written, as an addition to many other treatises for the conviction of Sadduces and infidels, by Richard Baxter. (TEI : HTML : plain text)
Ugliest – Richard Baxter his account to his dearly beloved, the inhabitants of Kidderminster, of the causes of his being forbidden by the Bishop of Worcester to preach within his diocess with the Bishop of Worcester’s letter in answer thereunto : and some short animadversions upon the said bishops letter. (TEI : HTML : plain text)

Marrying close and distant reading: A THATCamp project

Eric Lease Morgan — Sun, 12 Apr 2015 16:47:07 +0000

The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. By combining both of these processes, the hope is that greater comprehension and understanding of a corpus can be gained than by using close or distant reading alone. (This text might also be republished at http://dh.crc.nd.edu/sandbox/thatcamp-2015/ as well as http://nd2015.thatcamp.org/2015/04/07/close-and-distant/.)

To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in the margins. While the glue was drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, and the results are made available here. Both the traditional reading and the text mining are aimed at answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?

Comparison

Feature	The Prince	Representative Men
Author	Niccolò di Bernardo dei Machiavelli (1469 – 1527)	Ralph Waldo Emerson (1803 – 1882)
Title	The Prince	Representative Men
Date	1532	1850
Fulltext	plain text | HTML | PDF | TEI/XML	plain text | HTML | PDF | TEI/XML
Length	31,179 words	59,600 words
Fog score	23.1	14.6
Flesch score	33.5	52.9
Kincaid score	19.7	11.5
Frequencies	unigrams, bigrams, trigrams, quadgrams, quintgrams	unigrams, bigrams, trigrams, quadgrams, quintgrams
Parts-of-speech	nouns, pronouns, adjectives, verbs, adverbs	nouns, pronouns, adjectives, verbs, adverbs
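
For readers unfamiliar with the three readability scores in the table, their commonly published formulas can be sketched as follows. This is a sketch under assumptions: the function names are made up, the counts of words, sentences, syllables, and “complex” words (three or more syllables) must be supplied by the caller, and the tool used to produce the table may tokenize differently or apply slightly different rules.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading ease: higher scores mean easier reading."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: roughly the years of schooling required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning fog index: also roughly the years of schooling required."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# A made-up passage: 100 words, 5 sentences, 150 syllables, 10 complex words.
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
print(round(gunning_fog(100, 5, 10), 1))
```

All three measures are driven by sentence length and word length, which is why a text with long sentences scores as harder on every one of them: a lower Flesch score and higher Fog and Kincaid scores.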

Search

Search for “man or men” in The Prince. Search for “man or men” in Representative Men.

Observations

I observe this project to be a qualified success.

First, I was able to print and bind my book, and while the glue is still drying, I’m confident the final results will be more than usable. The real tests of the bound book are to see if: 1) I actually read it, 2) I annotate it using my personal method, and 3) I am able to identify answers to my research questions, above.

bookmaking tools

almost done

Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long as the first. Readability score-wise, Machiavelli is almost certainly written for the more educated and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized into histograms or word clouds. The verbs ought to be lemmatized. The frequencies ought to be depicted as ratios relative to the lengths of the texts. Other measures could be created as well. For example, my Great Books Coefficient could be employed.
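
One way to depict frequencies as ratios is to divide each n-gram count by the total number of n-grams in the text, so that a shorter and a longer book can be compared on the same scale. The following is a minimal sketch in Python, not the Perl script used for the reports; the function names and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_frequencies(text, n=2):
    """Map each n-gram to its share of all n-grams in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    grams = ngrams(words, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


ratios = relative_frequencies("A great man is a great man", n=2)
for gram, ratio in sorted(ratios.items(), key=lambda pair: -pair[1]):
    print(" ".join(gram), round(ratio, 3))
```

With ratios in hand, a bigram that makes up one percent of one book can be compared directly with the same bigram in the other, regardless of the difference in length.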

How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And network diagrams illustrating what words are used “in the same breath” as the word man in both works are not very dissimilar:

“man” in The Prince

“man” in Representative Men

I think I’m going to have to read the books to find the answer. Really.

Code

Bunches o’ code was written to produce the reports:

concordance.cgi – the simple search engine
fathom.pl – used to compute the readability scores
file2pos.py – create a parts-of-speech file for later use
network.cgi – used to display words used “in the same breath” as a given word
ngrams.pl – compute ngrams
pos.py – count and tabulate parts-of-speech from a previously created file

You can download this entire project — code and all — from http://dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz or ./../wp-content/uploads/2015/04/thatcamp-2015.tar.gz.

Great Books Survey

Eric Lease Morgan — Thu, 01 Jan 2015 15:55:26 +0000

I am happy to say that the Great Books Survey is still going strong. Since October of 2010 it has been answered 24,749 times by 2,108 people from all over the globe. To date, the top five “greatest” books are Athenian Constitution by Aristotle, Hamlet by Shakespeare, Don Quixote by Cervantes, Odyssey by Homer, and the Divine Comedy by Dante. The least “greatest” books are Rhesus by Euripides, On Fistulae by Hippocrates, On Fractures by Hippocrates, On Ulcers by Hippocrates, and On Hemorrhoids by Hippocrates. “Too bad, Hippocrates.”

For more information about this Great Books of the Western World investigation, see the various blog postings.

Doing What I’m Not Supposed To Do

Eric Lease Morgan — Fri, 24 Oct 2014 18:09:37 +0000

I suppose I’m doing what I’m not supposed to do. One of those things is writing in books.

I’m attending a local digital humanities conference. One of the presenters described and demonstrated a program from MIT called Annotation Studio. Using this program a person can upload some text to a server, annotate the text, and share the annotations with a wider audience. Interesting!?

I then went for a walk to see an art show. It seems I had previously been to this art museum. The art was… art, but I did not find it beautiful. The themes were disturbing.

I then made it to the library where I tried to locate a copy of my one and only formally published book — WAIS And Gopher Servers. When I was here previously, I signed the book’s title page, and I came back to do the same thing. Alas, the book had been moved to remote storage.

I then proceeded to find another book in which I had written something. I was successful, and I signed the title page. Gasp! Considering the fact that no one had opened the book in years, and the pages were glued together I figured, “What the heck!”

Just as importantly, my contribution to the book, written in 1992, was a short story called “A day in the life of Mr. D”. It is an account of how computers would be used in the future. In it, a young boy uses a computer to annotate a piece of text, and he gets to see the text of previous annotators. What is old is new again.

P.S. I composed this blog posting using an iPad. Functional but tedious.

Publishing LOD with a bent toward archivists

Eric Lease Morgan — Sat, 16 Aug 2014 14:56:10 +0000

eye candy by Eric

This essay provides an overview of linked open data (LOD) with a bent towards archivists. It enumerates a few advantages the archival community has when it comes to linked data, as well as some distinct disadvantages. It demonstrates one way to expose EAD as linked data through the use of XSLT transformations and then through a rudimentary triple store/SPARQL endpoint combination. Enhancements to the linked data publication process are then discussed. The text of this essay in the form of a handout as well as a number of support files can also be found at http://infomotions.com/sandbox/lodlamday/.

Review of RDF

The ultimate goal of LOD is to facilitate the discovery of new information and knowledge. To accomplish this goal, people are expected to make metadata describing their content available on the Web in one or more forms of RDF (Resource Description Framework). RDF is not so much a file format as a data structure. It is a collection of “assertions” in the form of “triples” akin to rudimentary “sentences” where the first part of the sentence is a “subject”, the second part is a “predicate”, and the third part is an “object”. Both the subjects and predicates are required to be Uniform Resource Identifiers (URIs). (Think “URLs”.) The subject URI is intended to denote a person, place, or thing. The predicate URI is used to specify relationships between subjects and objects. When verbalizing RDF assertions, it is usually helpful to prefix predicate URIs with an “is a” or “has a” phrase. For example, “This book ‘has a’ title of ‘Huckleberry Finn’” or “This university ‘has a’ home page of URL”. The objects of RDF assertions are ideally more URIs, but they can also be “strings” or “literals”: words, phrases, numbers, dates, geospatial coordinates, etc. Finally, it is expected that the URIs of RDF assertions are shared across domains and RDF collections. By doing so, new assertions can be literally “linked” across the world of RDF in the hopes of establishing new relationships, and in turn new information and new knowledge are brought to light.

Simple foray into publishing linked open data

Manifesting RDF from archival materials by hand is not an easy process because nobody is going to manually type the hundreds of triples necessary to adequately describe any given item. Fortunately, it is common for the description of archival materials to be manifested in the form of EAD files. Being a form of XML, valid EAD files must be well-formed and conform to a specific DTD or schema. This makes it easy to use XSLT to transform EAD files into various (“serialized”) forms of RDF such as XML/RDF, turtle, or JSON-LD. A few years ago such a stylesheet was written by Pete Johnston for the Archives Hub as a part of the Hub’s LOCAH project. The stylesheet outputs XML/RDF and it was written specifically for Archives Hub EAD files. It has been slightly modified here and incorporated into a Perl script. The Perl script reads the EAD files in a given directory and transforms them into both XML/RDF and HTML. The XML/RDF is intended to be read by computers. The HTML is intended to be read by people. By simply using something like the Perl script, an archive can easily participate in LOD. The results of these efforts can be seen in the local RDF and HTML directories. Nobody is saying the result is perfect nor complete, but it is more than a head start, and all of this is possible because the content of archives is often times described using EAD.
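
The general shape of such a transformation can be sketched without XSLT. The following Python example is not the Archives Hub stylesheet nor the Perl script; it is a much smaller stand-in that reads a toy EAD fragment with the standard library and writes N-Triples, one of the simplest RDF serializations. The base URI, the sample finding aid, the function names, and the choice to map unittitle and unitdate to Dublin Core terms are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/findingaid/"  # hypothetical base for subject URIs
DCTERMS = "http://purl.org/dc/terms/"    # Dublin Core terms vocabulary

# An assumed, minimal mapping from EAD elements to Dublin Core predicates.
FIELDS = {
    "title": "archdesc/did/unittitle",
    "date": "archdesc/did/unitdate",
}

SAMPLE_EAD = """<ead>
  <eadheader><eadid>example-001</eadid></eadheader>
  <archdesc level="collection">
    <did>
      <unittitle>Papers of an Example Person</unittitle>
      <unitdate>1900-1950</unitdate>
    </did>
  </archdesc>
</ead>"""


def escape(literal):
    """Escape backslashes, double quotes, and newlines for an N-Triples literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def ead_to_ntriples(ead_xml):
    """Return a list of N-Triples lines describing one finding aid."""
    root = ET.fromstring(ead_xml)
    identifier = (root.findtext("eadheader/eadid") or "").strip()
    if not identifier:
        raise ValueError("the EAD file has no eadid to build a subject URI from")
    subject = f"<{BASE}{identifier}>"
    lines = []
    for term, path in FIELDS.items():
        value = (root.findtext(path) or "").strip()
        if value:
            lines.append(f'{subject} <{DCTERMS}{term}> "{escape(value)}" .')
    return lines


for line in ead_to_ntriples(SAMPLE_EAD):
    print(line)
```

Each output line is one complete assertion (subject, predicate, object), which is what makes the result easy to load into a triple store.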

Triple stores and SPARQL endpoints

By definition, linked data (RDF) is structured data, and structured data lends itself very well to relational database applications. In the realm of linked data, these database applications are called “triple stores”. Database applications excel at the organization of data, but they are also designed to facilitate search. In the realm of relational databases, the standard query language is called SQL, and there is a similar query language for triple stores. It is called SPARQL. The term “SPARQL endpoint” is used to denote a URL where SPARQL queries can be applied to a specific triple store.

4store is an open source triple store application which also supports SPARQL endpoints. Once compiled and installed, it is controlled and managed through a set of command-line applications. These applications support the sorts of things one expects with any other database application such as create database, import into database, search database, dump database, and destroy database. Two other commands turn on and turn off SPARQL endpoints.

For the purposes of LODLAM Training Day, a 4store triple store was created, filled with sample data, and made available as a SPARQL endpoint. If it has been turned on, then the following links ought to return useful information and demonstrate additional ways of publishing linked data:

status page – a sort of home page for a 4store triple store
test page – an HTML form enabling one to apply SPARQL queries to the triple store
SELECT DISTINCT ?o WHERE {?s a ?o} – a list of all the objects which are types of subjects. Useful for learning what types of things are described in any triple store.
SELECT DISTINCT ?p WHERE {?s ?p ?o} – a list of all the predicates used in the triple store. Useful for learning how things are described in any triple store. If the ontologies used in the original RDF are well-documented, then by following the URIs (URLs) and reading the documentation found there, a person can use the previous two standard queries to understand the contents of any triple store. In this way, linked data (RDF) is self-documenting.
SELECT ?title ?description ?url WHERE { ?s ?url . ?s ?title . ?s ?description } order by ?title – a rudimentary bibliography of all finding aids
SELECT DISTINCT ?subject WHERE { ?s ?uri . ?uri ?subject } ORDER BY ?subject – an alphabetical list of all the store’s subject headings
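
Each of the query links above is, in essence, an HTTP GET request whose query parameter carries a URL-encoded SPARQL statement. What follows is a minimal sketch of how such a link might be built in Perl. The endpoint address and the names escape and sparql_url are assumptions made for illustration; they are not part of the LODLAM Training Day set-up.

```perl
#!/usr/bin/perl

# sparql-url.pl - build a GET URL for a SPARQL endpoint; a sketch

use strict;
use warnings;

# hypothetical endpoint; substitute the address of a real one
use constant ENDPOINT => 'http://localhost:8000/sparql/';

# percent-encode everything except unreserved characters; assumes ASCII input
sub escape {
  my $string = shift;
  $string =~ s/([^A-Za-z0-9\-._~])/sprintf( '%%%02X', ord( $1 ) )/ge;
  return $string;
}

# combine an endpoint and a query into a single URL
sub sparql_url {
  my ( $endpoint, $query ) = @_;
  return $endpoint . '?query=' . escape( $query );
}

# list all the predicates used in the store
my $url = sparql_url( ENDPOINT, 'SELECT DISTINCT ?p WHERE {?s ?p ?o}' );
print "$url\n";
```

The resulting URL can be pasted into a Web browser or handed to any HTTP client. Whether a given endpoint answers, and in what format, depends on how the store has been configured.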

Advantages and disadvantages

The previous sections demonstrate the ease with which archival metadata can be published as linked data. These demonstrations are not the be-all and end-all of the linked data publication process. Additional techniques could be employed. Exploiting content negotiation in response to a given URI is an excellent example. Supporting alternative RDF serializations is another example. It behooves the archivist to provide enhanced views of the linked data, which are sometimes called “graphs”. The linked data can be combined with the linked data of other publishers to implement even more interesting services, views, and graphs. All of these things are advanced techniques requiring the skills of additional people (graphic designers, usability experts, computer programmers, systems administrators, allocators of time and money, project managers, etc.). Despite this, given the tools outlined above, it is not too difficult to publish linked data today. Such are the advantages.

On the other hand, there are at least two distinct disadvantages. The most significant derives from the inherent nature of archival material. Archival material is almost always rare or unique. Because it is rare and unique, there are few (if any) previously established URIs for the people and things described in archival collections. This is unlike the world of librarianship, where the materials of libraries are often owned by multiple institutions. Union catalogs share authority lists denoting people and institutions. Sharing URIs across domains is imperative for the idea of the Semantic Web to come to fruition. The archival community has no such collection of shared URIs. Maybe the community-wide implementation and exploitation of Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) can help resolve this problem. After all, it too is a form of XML which lends itself very well to XSLT transformation.

Second, and almost as importantly, the use of EAD is not really the best way to manifest archival metadata for linked data publication. EADs are finding aids. They are essentially narrative essays describing collections as a whole. They tell stories. The controlled vocabularies articulated in the header do not necessarily apply to each of the items in the container list. For good reasons, the items in the container list are minimally described. Consequently, the resulting RDF statements come across as rather thin and poorly linked to fuller descriptions. Moreover, different archivists put different emphases on different aspects of EAD description. This makes amalgamated collections of archival linked data difficult to navigate; the linked data requires cleaning and normalization. The solution to these problems might be to create and maintain archival collections in database applications, such as ArchivesSpace, and have linked data published from there. By doing so the linked data publication efforts of the archival community would be more standardized and somewhat centralized.

Summary

This essay has outlined the ease with which archival metadata in the form of EAD can be published as linked data. The result is far from perfect, but it is a huge step in the right direction. Publishing linked data is not an event, but rather an iterative process. There is always room for improvement. Starting today, publish your metadata as linked data.

Fun with Koha

Eric Lease Morgan — Sat, 19 Jul 2014 18:16:31 +0000

These are brief notes about my recent experiences with Koha.

Introduction

As you may or may not know, Koha is a grand daddy of library-related open source software, and it is an integrated library system to boot. Such are no small accomplishments. For reasons I will not elaborate upon, I’ve been playing with Koha for the past number of weeks, and in short, I want to say, “I’m impressed.” The community is large, international, congenial, and supportive. The community is divided into a number of sub-groups: developers, committers, commercial support employees, and, of course, librarians. I’ve even seen people from another open source library system (Evergreen) provide technical support and advice. For the most part, everything is on the ‘Net, well laid out, and transparent. There are some rather “organic” parts to the documentation akin to an “English garden”, but that is going to happen in any de-centralized environment. All in all, and without any patronizing intended, “Kudos to Koha!”

Installation

Looking through my collection of tarballs, I see I’ve installed Koha a number of times over the years, but this time it was challenging. Sparing you all the details, I needed to use a specific version of MySQL (version 5.5), and I had version 5.6. The installation failure was not really Koha’s fault. It is more the fault of MySQL because the client of MySQL version 5.6 outputs a warning message to STDOUT when a password is passed on the command line. This message confused the Koha database initialization process, thus making Koha unusable. After downgrading to version 5.5 the database initialization process was seamless.

My next step was to correctly configure Zebra, Koha’s default underlying indexer. Again, I had installed from source, and my Zebra libraries, etc. were saved in a directory different from the one named in the configuration files created by Koha’s installation process. After updating the value of modulePath to point to /usr/local/lib/idzebra-2.0/ in zebra-biblios-dom.cfg, zebra-authorities.cfg, zebra-biblios.cfg, and zebra-authorities-dom.cfg I could successfully index and search for content. I learned this from a mailing list posting.
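
The change itself amounts to rewriting one line in each of the four files. Here is a rough sketch of that rewrite, assuming the files use Zebra's "name: value" syntax. The sample configuration text (including the profilePath value) is made up for demonstration, and set_module_path is a hypothetical helper, not something supplied by Koha or Zebra.

```perl
#!/usr/bin/perl

# modulepath.pl - point a Zebra configuration at a different module directory; a sketch

use strict;
use warnings;

# the directory named in the posting
use constant MODULES => '/usr/local/lib/idzebra-2.0/';

# rewrite the value of modulePath, leaving every other line alone
sub set_module_path {
  my ( $config, $path ) = @_;
  $config =~ s/^([ \t]*modulePath[ \t]*:[ \t]*).*$/$1$path/m;
  return $config;
}

# made-up sample of a configuration file, for demonstration only
my $config = "profilePath: /etc/koha/zebradb\nmodulePath: /usr/lib/idzebra-2.0/modules\n";
print set_module_path( $config, MODULES );
```

In practice one would read each .cfg file, pass its content through the function, and write the result back, after making a backup copy.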

Koha “extras”

Koha comes (for free) with a number of “extras”. For example, the Zebra indexer can be deployed as both a Z39.50 server as well as an SRU server. Turning these things on was as simple as uncommenting a few lines in the koha-conf.xml file and opening a few ports in my firewall. Z39.50 is inherently unusable from a human point of view so I didn’t go into configuring it, but it does work. Through the use of XSL stylesheets, SRU can be much more usable. Luckily I have been here before. For example, a long time ago I used Zebra to index my Alex Catalogue as well as some content from the HathiTrust (MBooks). The hidden interface to the Catalogue sports faceted searching and used to support spelling corrections. The MBooks interface transforms MARCXML into simple HTML. Both of these interfaces are quite zippy. In order to get Zebra to recognize my XSL I needed to add an additional configuration directive to my koha-conf.xml file. Specifically, I needed to add a docpath element to my public server’s configuration. Once I re-learned this fact, implementing a rudimentary SRU interface to my Koha index was easy, and results are returned very fast. I’m impressed.

My big goal is to figure out ways Koha can expose its content to the wider ‘Net. To this end Koha comes with an OAI-PMH interface. It needs to be enabled, which can be done through the Koha Web-based backend under Home -> Koha Administration -> Global Preferences -> General Systems Preferences -> Web Services. Once enabled, OAI sets can be created through the Home -> Administration -> OAI sets configuration module. (Whew!) Once this is done Koha will respond to OAI-PMH requests. I then took it upon myself to transform the OAI output into linked data using a program called OAI2LOD. This worked seamlessly, and for a limited period of time you can browse my Koha’s cataloging data as linked data. The viability of the resulting linked data is questionable, but that is another blog posting.
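
OAI-PMH requests are ordinary HTTP GET requests built from a verb and a few arguments. Below is a small sketch that builds three of them. The host name is hypothetical, and the /cgi-bin/koha/oai.pl path is only my understanding of where a Koha OPAC usually exposes the service, so check it against your own installation.

```perl
#!/usr/bin/perl

# oai-urls.pl - build a few OAI-PMH request URLs; a sketch

use strict;
use warnings;

# hypothetical base URL; the path is an assumption
use constant BASE => 'http://koha.example.org/cgi-bin/koha/oai.pl';

# create a request from a verb and optional arguments; values are assumed to be URL-safe
sub oai_url {
  my ( $base, $verb, %arguments ) = @_;
  my $url = "$base?verb=$verb";
  foreach my $name ( sort keys %arguments ) { $url .= "&$name=$arguments{ $name }" }
  return $url;
}

# identify the repository, list its sets, and list its records as Dublin Core
print oai_url( BASE, 'Identify' ), "\n";
print oai_url( BASE, 'ListSets' ), "\n";
print oai_url( BASE, 'ListRecords', metadataPrefix => 'oai_dc' ), "\n";
```

Identify, ListSets, and ListRecords are verbs defined by the OAI-PMH protocol itself, and oai_dc is the metadata prefix every repository is required to support.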

Ideas and next steps

Library catalogs (OPACs, “discovery systems”, whatever you want to call them) are not simple applications/systems. They are a mixture of very specialized inventory lists, various types of people with various skills and authorities, indexing, and circulation, etc. Then we, as librarians, add things like messages of the day, record exporting, browsable lists, visualizations, etc. that complicate the whole thing. It is simply not possible to create a library catalog in the “Unix way”. The installation of Koha was not easy for me. There are expenses with open source software, and I all but melted down my server during the installation process. (Everything is now back to normal.) I’ve been advocating open source software for quite a while, and I understand the meaning of “free” in this context. I’m not complaining. Really.

Now that I’ve gotten this far, my next step is to investigate the feasibility of using a different indexer with Koha. Zebra is functional. It is fast. It is multi-faceted (all puns intended). But configuring it is not straightforward, and its community of support is tiny. I see from rooting around in the Koha source code that Solr has been explored. I have also heard through the grapevine that ElasticSearch has been explored. I will endeavor to explore these things myself and report on what I learn. Different indexers, with more flexible APIs, may make exposing Koha content as linked data more feasible as well.

Wish me luck.

Fun with ElasticSearch and MARC

Eric Lease Morgan — Sun, 22 Jun 2014 15:40:58 +0000

For a good time I have started to investigate how to index MARC data using ElasticSearch. This posting outlines some of my initial investigations and hacks.

ElasticSearch seems to be an increasingly popular indexer. Getting it up and running on my Linux host was… trivial. It comes with a full-fledged Perl interface. Nice! Since ElasticSearch takes JSON as input, I needed to serialize my MARC data accordingly, and MARC::File::JSON seems to do a fine job. With this in hand, I wrote three programs:

index.pl – create an index of MARC records
get.pl – retrieve a specific record from the index
search.pl – query the index

I have some work to do, obviously. First of all, do I really want to index MARC in its raw, communications format? I don’t think so, but that is where I’ll start. Second, the search script doesn’t really search. Instead it simply gets all the records. This is because I really don’t know how to search yet; I don’t really know how to query fields like “245 subfield a”.
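
One possible direction, offered only as a sketch, is to flatten each record before indexing so there is a named field to query. The example below assumes the serialized record follows the MARC-in-JSON shape (a list of fields, each keyed by tag, with a list of subfields); the sample record, the subfield helper, and the field name "title" are all made up for illustration. The match query is a standard Elasticsearch query type, but I have not tried this against the index described here.

```perl
#!/usr/bin/perl

# flatten.pl - pull 245$a out of a MARC-in-JSON structure and build a query; a sketch

use strict;
use warnings;
use JSON::PP;

# return the first value of a given field/subfield combination, or undef
sub subfield {
  my ( $record, $tag, $code ) = @_;
  foreach my $field ( @{ $record->{ fields } } ) {
    next unless ref( $field->{ $tag } ) eq 'HASH';
    foreach my $subfield ( @{ $field->{ $tag }->{ subfields } } ) {
      return $subfield->{ $code } if exists $subfield->{ $code };
    }
  }
  return undef;
}

# a made-up, abbreviated record in the assumed shape
my $record = {
  fields => [
    { '001' => 'pamphlet-001' },
    { '245' => { ind1 => '1', ind2 => '0', subfields => [ { a => 'A sample pamphlet' }, { c => 'by an unknown author' } ] } }
  ]
};

# flatten the record into a simpler document, and build a query against its title
my $document = { title => subfield( $record, '245', 'a' ) };
my $query    = { query => { match => { title => 'pamphlet' } } };

# output both as JSON
my $json = JSON::PP->new->canonical->pretty;
print $json->encode( $document ), $json->encode( $query );
```

A document like $document could be passed as the body when indexing, and a structure like $query could be passed as the body when searching.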

index.pl

#!/usr/bin/perl

# configure
use constant INDEX => 'pamphlets';
use constant MARC  => './pamphlets.marc';
use constant MAX   => 100;
use constant TYPE  => 'marc';

# require
use MARC::Batch;
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $batch = MARC::Batch->new( 'USMARC', MARC );
my $count = 0;
my $e     = Search::Elasticsearch->new;

# process each record in the batch
while ( my $record = $batch->next ) {

  # debug
  print $record->title, "\n";
  
  # serialize the record into json
  my $json = &MARC::File::JSON::encode( $record );
  
  # increment
  $count++;
  
  # index; do the work, storing the serialized record as the sole key of the body
  $e->index( index => INDEX,
             type  => TYPE,
             id    => $count,
             body  => { $json => undef }
  );
    
  # check; only do a few
  last if ( $count > MAX );
  
}

# done
exit;

get.pl

# configure 
use constant INDEX => 'pamphlets';
use constant TYPE  => 'marc';

# require
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# get; do the work
my $doc = $e->get( index   => INDEX,
                   type    => TYPE,
                   id      => $ARGV[ 0 ]
);

# reformat and output; the serialized record is the sole key of the stored document
my ( $json ) = keys %{ $doc->{ '_source' } };
my $record   = MARC::Record->new_from_json( $json );
print $record->as_formatted, "\n";
exit;

search.pl

# configure 
use constant INDEX => 'pamphlets';

# require
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# search; match_all takes no arguments, so every record is returned
my $results = $e->search(
  index => INDEX,
  body  => { query => { match_all => {} } }
);

# output; the serialized record is the sole key of each stored document
my $hits = $results->{ 'hits' }->{ 'hits' };
foreach my $hit ( @$hits ) {

  my ( $json ) = keys %{ $hit->{ '_source' } };
  my $record   = MARC::Record->new_from_json( $json );
  print $record->as_formatted, "\n\n";

}

# done
exit;

LiAM source code: Perl poetry

Eric Lease Morgan — Mon, 17 Feb 2014 04:40:33 +0000

#!/usr/bin/perl

# Liam Guidebook Source Code; Perl poetry, sort of

# Eric Lease Morgan
# February 16, 2014

# done
exit;

#!/usr/bin/perl

# marc2rdf.pl - make MARC records accessible via linked data

# Eric Lease Morgan
# December 5, 2013 - first cut;

# configure
use constant ROOT      => '/disk01/www/html/main/sandbox/liam';
use constant MARC      => ROOT . '/src/marc/';
use constant DATA      => ROOT . '/data/';
use constant PAGES     => ROOT . '/pages/';
use constant MARC2HTML => ROOT . '/etc/MARC21slim2HTML.xsl';
use constant MARC2MODS => ROOT . '/etc/MARC21slim2MODS3.xsl';
use constant MODS2RDF  => ROOT . '/etc/mods2rdf.xsl';
use constant MAXINDEX  => 100;

# require
use IO::File;
use MARC::Batch;
use MARC::File::XML;
use strict;
use XML::LibXML;
use XML::LibXSLT;

# initialize
my $parser = XML::LibXML->new;
my $xslt   = XML::LibXSLT->new;

# process each record in the MARC directory
my @files = glob MARC . "*.marc";
for ( 0 .. $#files ) {

  # re-initialize
  my $marc   = $files[ $_ ];
  my $handle = IO::File->new( $marc );
  binmode( STDOUT, ':utf8' );
  binmode( $handle, ':bytes' );
  my $batch  = MARC::Batch->new( 'USMARC', $handle );
  $batch->warnings_off;
  $batch->strict_off;
  my $index  = 0;

  # process each record in the batch
  while ( my $record = $batch->next ) {

    # get marcxml
    my $marcxml = $record->as_xml_record;
    my $_001    = $record->field( '001' )->as_string;
    $_001 =~ s/_//;
    $_001 =~ s/ +//;
    $_001 =~ s/-+//;
    print " marc: $marc\n";
    print " identifier: $_001\n";
    print " URI: http://infomotions.com/sandbox/liam/id/$_001\n";

    # re-initialize and sanity check
    my $output = PAGES . "$_001.html";
    if ( ! -e $output or -s $output == 0 ) {

      # transform marcxml into html
      print " HTML: $output\n";
      my $source     = $parser->parse_string( $marcxml ) or warn $!;
      my $style      = $parser->parse_file( MARC2HTML ) or warn $!;
      my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!;
      my $results    = $stylesheet->transform( $source ) or warn $!;
      my $html       = $stylesheet->output_string( $results );
      &save( $output, $html );

    }
    else { print " HTML: skipping\n" }

    # re-initialize and sanity check
    $output = DATA . "$_001.rdf";
    if ( ! -e $output or -s $output == 0 ) {

      # transform marcxml into mods
      my $source     = $parser->parse_string( $marcxml ) or warn $!;
      my $style      = $parser->parse_file( MARC2MODS ) or warn $!;
      my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!;
      my $results    = $stylesheet->transform( $source ) or warn $!;
      my $mods       = $stylesheet->output_string( $results );

      # transform mods into rdf
      print " RDF: $output\n";
      $source     = $parser->parse_string( $mods ) or warn $!;
      $style      = $parser->parse_file( MODS2RDF ) or warn $!;
      $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!;
      $results    = $stylesheet->transform( $source ) or warn $!;
      my $rdf     = $stylesheet->output_string( $results );
      &save( $output, $rdf );

    }
    else { print " RDF: skipping\n" }

    # prettify
    print "\n";

    # increment and check
    $index++;
    last if ( $index > MAXINDEX )

  }

}

# done
exit;


sub save {

  open F, ' > ' . shift or die $!;
  binmode( F, ':utf8' );
  print F shift;
  close F;
  return;

}

#!/usr/bin/perl

# ead2rdf.pl - make EAD files accessible via linked data

# Eric Lease Morgan
# December 6, 2013 - based on marc2linkedata.pl

# configure
use constant ROOT     => '/disk01/www/html/main/sandbox/liam';
use constant EAD      => ROOT . '/src/ead/';
use constant DATA     => ROOT . '/data/';
use constant PAGES    => ROOT . '/pages/';
use constant EAD2HTML => ROOT . '/etc/ead2html.xsl';
use constant EAD2RDF  => ROOT . '/etc/ead2rdf.xsl';
use constant SAXON    => 'java -jar /disk01/www/html/main/sandbox/liam/bin/saxon.jar -s:##SOURCE## -xsl:##XSL## -o:##OUTPUT##';

# require
use strict;
use XML::XPath;
use XML::LibXML;
use XML::LibXSLT;

# initialize
my $saxon  = '';
my $xsl    = '';
my $parser = XML::LibXML->new;
my $xslt   = XML::LibXSLT->new;

# process each record in the EAD directory
my @files = glob EAD . "*.xml";
for ( 0 .. $#files ) {

  # re-initialize
  my $ead = $files[ $_ ];
  print " EAD: $ead\n";

  # get the identifier
  my $xpath      = XML::XPath->new( filename => $ead );
  my $identifier = $xpath->findvalue( '/ead/eadheader/eadid' );
  $identifier =~ s/[^\w ]//g;
  print " identifier: $identifier\n";
  print " URI: http://infomotions.com/sandbox/liam/id/$identifier\n";

  # re-initialize and sanity check
  my $output = PAGES . "$identifier.html";
  if ( ! -e $output or -s $output == 0 ) {

    # transform ead into html
    print " HTML: $output\n";
    my $source     = $parser->parse_file( $ead ) or warn $!;
    my $style      = $parser->parse_file( EAD2HTML ) or warn $!;
    my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!;
    my $results    = $stylesheet->transform( $source ) or warn $!;
    my $html       = $stylesheet->output_string( $results );
    &save( $output, $html );

  }
  else { print " HTML: skipping\n" }

  # re-initialize and sanity check
  $output = DATA . "$identifier.rdf";
  if ( ! -e $output or -s $output == 0 ) {

    # create saxon command, and save rdf
    print " RDF: $output\n";
    $saxon = SAXON;
    $xsl   = EAD2RDF;
    $saxon =~ s/##SOURCE##/$ead/e;
    $saxon =~ s/##XSL##/$xsl/e;
    $saxon =~ s/##OUTPUT##/$output/e;
    system $saxon;

  }
  else { print " RDF: skipping\n" }

  # prettify
  print "\n";

}

# done
exit;


sub save {

  open F, ' > ' . shift or die $!;
  binmode( F, ':utf8' );
  print F shift;
  close F;
  return;

}

#!/usr/bin/perl

# store-make.pl - simply initialize an RDF triple store

# Eric Lease Morgan
#
# December 14, 2013 - after wrestling with wilson for most of the day

# configure
use constant ETC => '/disk01/www/html/main/sandbox/liam/etc/';

# require
use strict;
use RDF::Redland;

# sanity check
my $db = $ARGV[ 0 ];
if ( ! $db ) {

  print "Usage: $0 \n";
  exit;

}

# do the work; brain-dead
my $etc   = ETC;
my $store = RDF::Redland::Storage->new( 'hashes', $db, "new='yes', hash-type='bdb', dir='$etc'" );
die "Unable to create store ($!)" unless $store;
my $model = RDF::Redland::Model->new( $store, '' );
die "Unable to create model ($!)" unless $model;

# "save"
$store = undef;
$model = undef;

# done
exit;

#!/usr/bin/perl

# store-add.pl - add items to an RDF triple store

# Eric Lease Morgan
#
# December 14, 2013 - after wrestling with wilson for most of the day

# configure
use constant ETC => '/disk01/www/html/main/sandbox/liam/etc/';

# require
use strict;
use RDF::Redland;

# sanity check #1 - command line arguments
my $db   = $ARGV[ 0 ];
my $file = $ARGV[ 1 ];
if ( ! $db or ! $file ) {

  print "Usage: $0 \n";
  exit;

}

# sanity check #2 - store exists
die "Error: po2s file not found. Make a store?\n" if ( ! -e ETC . $db . '-po2s.db' );
die "Error: so2p file not found. Make a store?\n" if ( ! -e ETC . $db . '-so2p.db' );
die "Error: sp2o file not found. Make a store?\n" if ( ! -e ETC . $db . '-sp2o.db' );

# open the store
my $etc   = ETC;
my $store = RDF::Redland::Storage->new( 'hashes', $db, "new='no', hash-type='bdb', dir='$etc'" );
die "Error: Unable to open store ($!)" unless $store;
my $model = RDF::Redland::Model->new( $store, '' );
die "Error: Unable to create model ($!)" unless $model;

# sanity check #3 - file exists
die "Error: $file not found.\n" if ( ! -e $file );

# parse a file and add it to the store
my $uri    = RDF::Redland::URI->new( "file:$file" );
my $parser = RDF::Redland::Parser->new( 'rdfxml', 'application/rdf+xml' );
die "Error: Failed to find parser ($!)\n" if ( ! $parser );
my $stream = $parser->parse_as_stream( $uri, $uri );
my $count  = 0;
while ( ! $stream->end ) {

  $model->add_statement( $stream->current );
  $count++;
  $stream->next;

}

# echo the result
warn "Namespaces:\n";
my %namespaces = $parser->namespaces_seen;
while ( my ( $prefix, $uri ) = each %namespaces ) {

  warn " prefix: $prefix\n";
  warn ' uri: ' . $uri->as_string . "\n";
  warn "\n";

}
warn "Added $count statements\n";

# "save"
$store = undef;
$model = undef;

# done
exit;

#!/usr/bin/perl

# store-search.pl - query a triple store

# Eric Lease Morgan
# December 14, 2013 - after wrestling with wilson for most of the day

# configure
use constant ETC => '/disk01/www/html/main/sandbox/liam/etc/';
my %namespaces = (
  "crm"       => "http://erlangen-crm.org/current/",
  "dc"        => "http://purl.org/dc/elements/1.1/",
  "dcterms"   => "http://purl.org/dc/terms/",
  "event"     => "http://purl.org/NET/c4dm/event.owl#",
  "foaf"      => "http://xmlns.com/foaf/0.1/",
  "lode"      => "http://linkedevents.org/ontology/",
  "lvont"     => "http://lexvo.org/ontology#",
  "modsrdf"   => "http://simile.mit.edu/2006/01/ontologies/mods3#",
  "ore"       => "http://www.openarchives.org/ore/terms/",
  "owl"       => "http://www.w3.org/2002/07/owl#",
  "rdf"       => "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  "rdfs"      => "http://www.w3.org/2000/01/rdf-schema#",
  "role"      => "http://simile.mit.edu/2006/01/roles#",
  "skos"      => "http://www.w3.org/2004/02/skos/core#",
  "time"      => "http://www.w3.org/2006/time#",
  "timeline"  => "http://purl.org/NET/c4dm/timeline.owl#",
  "wgs84_pos" => "http://www.w3.org/2003/01/geo/wgs84_pos#"
);

# require
use strict;
use RDF::Redland;

# sanity check #1 - command line arguments
my $db    = $ARGV[ 0 ];
my $query = $ARGV[ 1 ];
if ( ! $db or ! $query ) {

  print "Usage: $0 \n";
  exit;

}

# sanity check #2 - store exists
die "Error: po2s file not found. Make a store?\n" if ( ! -e ETC . $db . '-po2s.db' );
die "Error: so2p file not found. Make a store?\n" if ( ! -e ETC . $db . '-so2p.db' );
die "Error: sp2o file not found. Make a store?\n" if ( ! -e ETC . $db . '-sp2o.db' );

# open the store
my $etc   = ETC;
my $store = RDF::Redland::Storage->new( 'hashes', $db, "new='no', hash-type='bdb', dir='$etc'" );
die "Error: Unable to open store ($!)" unless $store;
my $model = RDF::Redland::Model->new( $store, '' );
die "Error: Unable to create model ($!)" unless $model;

# search
#my $sparql = RDF::Redland::Query->new( "CONSTRUCT { ?a ?b ?c } WHERE { ?a ?b ?c }", undef, undef, "sparql" );
my $sparql  = RDF::Redland::Query->new( "PREFIX modsrdf: \nSELECT ?a ?b ?c WHERE { ?a modsrdf:$query ?c }", undef, undef, 'sparql' );
my $results = $model->query_execute( $sparql );
print $results->to_string;

# done
exit;

#!/usr/bin/perl

# store-dump.pl - output the content of store as RDF/XML

# Eric Lease Morgan
#
# December 14, 2013 - after wrestling with wilson for most of the day

# configure
use constant ETC => '/disk01/www/html/main/sandbox/liam/etc/';

# require
use strict;
use RDF::Redland;

# sanity check #1 - command line arguments
my $db  = $ARGV[ 0 ];
my $uri = $ARGV[ 1 ];
if ( ! $db ) {

  print "Usage: $0 \n";
  exit;

}

# sanity check #2 - store exists
die "Error: po2s file not found. Make a store?\n" if ( ! -e ETC . $db . '-po2s.db' );
die "Error: so2p file not found. Make a store?\n" if ( ! -e ETC . $db . '-so2p.db' );
die "Error: sp2o file not found. Make a store?\n" if ( ! -e ETC . $db . '-sp2o.db' );

# open the store
my $etc   = ETC;
my $store = RDF::Redland::Storage->new( 'hashes', $db, "new='no', hash-type='bdb', dir='$etc'" );
die "Error: Unable to open store ($!)" unless $store;
my $model = RDF::Redland::Model->new( $store, '' );
die "Error: Unable to create model ($!)" unless $model;

# do the work
my $serializer = RDF::Redland::Serializer->new;
print $serializer->serialize_model_to_string( RDF::Redland::URI->new, $model );

# done
exit;

#!/usr/bin/perl

# sparql.pl - a brain-dead, half-baked SPARQL endpoint

# Eric Lease Morgan
# December 15, 2013 - first investigations

# require
use CGI;
use CGI::Carp qw( fatalsToBrowser );
use RDF::Redland;
use strict;

# initialize
my $cgi   = CGI->new;
my $query = $cgi->param( 'query' );

if ( ! $query ) {

  print $cgi->header;
  print &home;

}

else {

  # open the store for business
  my $store = RDF::Redland::Storage->new( 'hashes', 'store', "new='no', hash-type='bdb', dir='/disk01/www/html/main/sandbox/liam/etc'" );
  my $model = RDF::Redland::Model->new( $store, '' );

  # search
  my $results = $model->query_execute( RDF::Redland::Query->new( $query, undef, undef, 'sparql' ) );

  # return the results
  print $cgi->header( -type => 'application/xml' );
  print $results->to_string;

}

# done
exit;


sub home {

  # create a list of namespaces
  my $namespaces = &namespaces;
  my $list       = '';
  foreach my $prefix ( sort keys %$namespaces ) {

    my $uri = $$namespaces{ $prefix };
    $list .= $cgi->li( "$prefix - " . $cgi->a( { href=> $uri, target => '_blank' }, $uri ) );

  }
  $list = $cgi->ol( $list );

  # return a home page
  return <<EOF
LiAM SPARQL Endpoint

LiAM SPARQL Endpoint

This is a brain-dead and half-baked SPARQL endpoint to a subset of LiAM linked data. Enter a query, but there is the disclaimer. Errors will probably happen because of SPARQL syntax errors. Remember, the interface is brain-dead. Your milage will vary.

Here are a few sample queries:

Find all triples with RDF Schema labels – PREFIX rdf: SELECT * WHERE { ?s rdf:label ?o }
Find all items with MODS subjects – PREFIX mods: SELECT * WHERE { ?s mods:subject ?o }
Find every unique predicate – SELECT DISTINCT ?p WHERE { ?s ?p ?o }
Find everything – SELECT * WHERE { ?s ?p ?o }
Find all classes – SELECT DISTINCT ?class WHERE { [] a ?class } ORDER BY ?class
Find all properties – SELECT DISTINCT ?property WHERE { [] ?property [] } ORDER BY ?property
Find URIs of all finding aids – PREFIX hub: SELECT ?uri WHERE { ?uri ?o hub:FindingAid }
Find URIs of all MARC records – PREFIX mods: SELECT ?uri WHERE { ?uri ?o mods:Record }
Find all URIs of all collections – PREFIX mods: PREFIX hub: SELECT ?uri WHERE { { ?uri ?o hub:FindingAid } UNION { ?uri ?o mods:Record } } ORDER BY ?uri

This is a list of ontologies (namespaces) used in the triple store as predicates:

$list

For more information about SPARQL, see:

SPARQL Query Language for RDF from the W3C
SPARQL from Wikipedia

Source code — sparql.pl — is available online.

Eric Lease Morgan
January 6, 2014

EOF

}


sub namespaces {

  my %namespaces = (
    "crm"       => "http://erlangen-crm.org/current/",
    "dc"        => "http://purl.org/dc/elements/1.1/",
    "dcterms"   => "http://purl.org/dc/terms/",
    "event"     => "http://purl.org/NET/c4dm/event.owl#",
    "foaf"      => "http://xmlns.com/foaf/0.1/",
    "lode"      => "http://linkedevents.org/ontology/",
    "lvont"     => "http://lexvo.org/ontology#",
    "modsrdf"   => "http://simile.mit.edu/2006/01/ontologies/mods3#",
    "ore"       => "http://www.openarchives.org/ore/terms/",
    "owl"       => "http://www.w3.org/2002/07/owl#",
    "rdf"       => "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs"      => "http://www.w3.org/2000/01/rdf-schema#",
    "role"      => "http://simile.mit.edu/2006/01/roles#",
    "skos"      => "http://www.w3.org/2004/02/skos/core#",
    "time"      => "http://www.w3.org/2006/time#",
    "timeline"  => "http://purl.org/NET/c4dm/timeline.owl#",
    "wgs84_pos" => "http://www.w3.org/2003/01/geo/wgs84_pos#"
  );
  return \%namespaces;

}

package Apache2::LiAM::Dereference;

# Dereference.pm - Redirect user-agents based on value of URI.

# Eric Lease Morgan
# December 7, 2013 - first investigations; based on Apache2::Alex::Dereference

# configure
use constant PAGES => 'http://infomotions.com/sandbox/liam/pages/';
use constant DATA  => 'http://infomotions.com/sandbox/liam/data/';

# require
use Apache2::Const -compile => qw( OK );
use CGI;
use strict;

# main
sub handler {

  # initialize
  my $r   = shift;
  my $cgi = CGI->new;
  my $id  = substr( $r->uri, length $r->location );

  # wants HTML
  if ( $cgi->Accept( 'text/html' )) {

    print $cgi->header( -status => '303 See Other', -Location => PAGES . $id . '.html', -Vary => 'Accept' )

  }

  # give them RDF
  else {

    print $cgi->header( -status => '303 See Other', -Location => DATA . $id . '.rdf', -Vary => 'Accept', "Content-Type" => 'application/rdf+xml' )

  }

  # done
  return Apache2::Const::OK;

}

1; # return true or die

LiAM SPARQL Endpoint

Eric Lease Morgan — Sun, 15 Dec 2013 16:30:11 +0000

I have implemented a brain-dead and half-baked SPARQL endpoint to a subset of LiAM linked data, but there is a disclaimer. Errors will probably happen because of SPARQL syntax errors. Your mileage will vary.

Here are a few sample queries:

Find all triples with RDF Schema labels – PREFIX rdf: SELECT * WHERE { ?s rdf:label ?o }
Find all items with MODS subjects – PREFIX mods: SELECT * WHERE { ?s mods:subject ?o }
Find every predicate of every triple – SELECT ?p WHERE { ?s ?p ?o }
Find everything – SELECT * WHERE { ?s ?p ?o }

Source code — sparql.pl — is online.

EAD2RDF

Eric Lease Morgan — Mon, 11 Nov 2013 01:30:33 +0000

I have played with an XSL stylesheet called EAD2RDF with good success.

Archivists use EAD as their “MARC” records. EAD has its strengths and weaknesses, just like any metadata standard, but EAD is a flavor of XML. As such it lends itself to XSLT processing. EAD2RDF is a stylesheet written by Pete Johnston. When an EAD file is run through it with an XSLT 2.0 processor, the output is an RDF/XML file. (I have made a resulting RDF/XML file available for you to peruse.) The result validates against the W3C RDF Validator, but the Validator won’t create a graph from it, probably because there are so many triples in the result.

I think archivists as well as computer technologists working in archives ought to take a closer look at EAD2RDF.

OAI2LOD Server

Eric Lease Morgan — Sun, 10 Nov 2013 17:39:39 +0000

At first glance, a software package called OAI2LOD Server seems to work pretty well, and on a temporary basis, I have made one of my OAI repositories available as Linked Data — http://infomotions.com:2020/

OAI2LOD Server is a software package written by Bernhard Haslhofer in 2008. Building, configuring, and running the server was all but painless. I think this has a great deal of potential, and I wonder why it has not been more widely exploited. For more information about the server, see “The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data”.

TriLUG, open source software, and satisfaction

Eric Lease Morgan — Fri, 09 Dec 2011 15:46:47 +0000

This is a posting about TriLUG, open source software, and the satisfaction of a job well done.

A long time ago, in a galaxy far far away, I lived in Raleigh (North Carolina), and a fledgling community was growing called the Triangle Linux User’s Group (TriLUG). I participated in a few of their meetings. While I was interested in open source software, I was not so interested in Linux. My interests were more along the lines of the application stack, not necessarily systems administration nor Internet networking.

I gave a presentation to the User’s Group on the combined use of PHP and MySQL, “Smart HTML pages with PHP”. Because of this I was recruited to write a Web-based membership application. Since flattery will get you everywhere with me, I was happy to do it. After a couple of weeks, the application was put into place and seemed to function correctly. That was a bit more than ten years ago, probably during the Spring of 2001.

The other day I got an automated email message from the User’s Group. The author of the message wanted to know if I wanted to continue my membership. I replied that it was not necessary since I had long since moved away to northern Indiana.

I then got to wondering whether or not the message I received had been sent by my application. It was a long shot, but I enquired anyway. Sure enough, I got a response from Jeff Schornick, a TriLUG board member, who told me “Yes, your application was the tool that had been used.” How satisfying! How wonderful to know that something I wrote more than ten years ago was still working.

Just as importantly, Jeff wanted to know about open source licensing. I had not explicitly licensed the software, something that I only learned was necessary from Dan Chudnov later. After a bit of back and forth, the original source code was supplemented with the GNU General Public License, packaged up, and distributed from a Git repository. Over the years the User’s Group had modified it to overcome a few usability issues, and they wanted to distribute the source code using the most legitimate means possible.

This experience was extremely enriching. I originally offered my skills, and they returned benefits to the community greater than the expense of my time. The community then came back to me because they wanted to express their appreciation and give credit where credit was due.

Open source software is not necessarily about computer technology. It is just as much, if not more, about people and the communities they form.

Use & understand: A DPLA beta-sprint proposal

Eric Lease Morgan — Thu, 01 Sep 2011 14:28:26 +0000

This essay describes, illustrates, and demonstrates how the Digital Public Library of America (DPLA) can build on the good work of others who support the creation and maintenance of collections and provide value-added services against texts — a concept we call “use & understand”.

This document is available in three formats: 1) HTML – for viewing on a desktop Web browser, 2) PDF – for printing, the suggested format, and 3) ePub – for reading on your portable device.

Eric Lease Morgan <emorgan@nd.edu>
University of Notre Dame

September 1, 2011

Executive summary
Introduction and assumptions
Find & get
Use & understand
Examples

Measure size
Measure difficulty
Side bar on quantitative bibliographic data
Measure concept
Plot on a timeline
Count word and phrase frequencies
Display in context
Display the proximity of a given word to other words
Display location of word in a text
Elaborate upon and visualize parts-of-speech analysis

Disclaimer
Software
Implementation how-to’s

Measurement services
Timeline services
Frequency, concordance, proximity, and locations in a text services
Parts-of-speech services
Priorities

Quick links

Word frequencies, concordances
Word/phrase locations
Proximity displays
Plato, Aristotle, and Shakespeare
Catholic Portal
Measuring size
Plot on a timeline
Lookup in Wikipedia and plot on a map
Parts-of-speech analysis
Measuring ideas

Summary
About the author

Executive summary

This Digital Public Library of America (DPLA) beta-sprint proposal “stands on the shoulders of giants” who have successfully implemented the processes of find & get — the traditional functions of libraries. We are sure the DPLA will implement the services of find & get very well. To supplement, enhance, and distinguish the DPLA from other digital libraries, we propose the implementation of “services against text” in an effort to support use & understand.

Globally networked computers combined with an abundance of full text, born-digital materials have made the search engines of Google, Yahoo, and Microsoft a reality. Advances in information retrieval have made relevancy ranking the norm as opposed to the exception. All of these things have made the problems of find & get less acute than they used to be. The problems of find & get will never be completely resolved, but they seem adequately addressed for the majority of people. Enter a few words into a search box. Click go. And select items of interest.

Use & understand is an evolutionary step in the processes and functions of a library. These processes and functions enable the reader to ask and answer questions of large and small sets of documents relatively easily. Through the use of various text mining techniques, the reader can grasp quickly the content of documents, extract some of their meaning, and evaluate them more thoroughly when compared to the traditional application of metadata. Some of these processes and functions include: word/phrase frequency lists, concordances, histograms illustrating the location of words/phrases in a text, network diagrams illustrating what authors say “in the same breath” when they mention a given word, plotting publication dates on a timeline, measuring the weight of a concept in a text, evaluating texts based on parts-of-speech, supplementing texts with Wikipedia articles, and plotting place names on a world map.

We do not advocate the use of these services as replacements for “close” reading. Instead we advocate them as tools to supplement learning, teaching, and scholarship – functions of any library.

Use & understand: A video introduction

Introduction and assumptions

Libraries are almost always a part of a larger organization, and their main functions can be divided into collection building, conservation & preservation, organization & classification, and public service. These functions are very much analogous to the elements of the DPLA articulated by John Palfrey: community, content, metadata, code, and tools & services.

This beta-sprint proposal is mostly about tools & services, but in order to provide the proposed tools & services, we make some assumptions about and build upon the good work of people working on community, content, metadata, and code. These assumptions follow.

First, the community the DPLA encompasses is just about everybody in the United States. It is not only about the K-12 population. It is not only about students, teachers, and scholars in academia. It is not only about life-long learners, the businessperson, or municipal employees. It is about all of these communities at once and at the same time because we believe all of these communities have more things in common than they have differences. The tools & services described in this proposal can be useful to anybody who is able to read.

Second, the content of the DPLA is not licensed, much of it is accessible in full-text, and freely available for downloading and manipulation. More specifically, this proposal assumes the collections of the DPLA include things like but not necessarily limited to: digitized versions of public domain works, the full-text of open access scholarly journals and/or trade magazines, scholarly and governmental data sets, theses & dissertations, a substantial portion of the existing United States government documents, the archives of selected mailing lists, and maybe even the archives of blog postings and Twitter feeds. Moreover, we assume the DPLA is not merely a metadata repository, but also makes immediately available plain text versions of much of its collection.

Third, this proposal does not assume very many things regarding metadata beyond the need for the most basic of bibliographic information such as unique identifiers, titles, authors, subject/keyword terms, and location codes such as URLs. It does not matter to this proposal how the bibliographic metadata is encoded (MARC, XML, linked data, etc.). On the other hand, this proposal will advocate for additional bibliographic metadata, specifically, metadata that is quantitative in nature. These additions are not necessary for the fulfillment of the proposal, but rather side benefits because of it.

Finally, this proposal assumes the code & infrastructure of the DPLA supports the traditional characteristics of a library. In other words, it is assumed the code & infrastructure of the DPLA provide the means for the creation of collections and the discovery of said items. As described later, this proposal is not centered on the processes of find & get. Instead this proposal assumes the services of find & get are already well-established. This proposal is designed to build on the good work of others who have already spent time and effort in this area. We hope to “stand on the shoulders of giants” in this regard.

Given these assumptions about community, content, metadata, and infrastructure, we will now describe how the DPLA can exploit the current technological environment to provide increasingly useful services to its clientele. Through the process we hope to demonstrate how libraries could evolve and continue to play a meaningful role in our society.

Find & get

While it comes across as trite, with the advent of ubiquitous and globally networked computers, the characteristics of data and information have fundamentally changed. More specifically, since things like books and journals (the traditional meat and potatoes of libraries) no longer need to be manifested in analog forms, their digital manifestations lend themselves to new functionality. For example, digital versions of books and journals can be duplicated exactly, and they are much less limited to distinct locations in space and time. Similarly, advances in information retrieval have made strict Boolean logic applied against relational databases less desirable to the reader than relevancy ranking algorithms and the application of term frequency/inverse document frequency models against indexes. Combined together these things have made the search engines of Google, Yahoo, and Microsoft a reality. Compared to twenty years ago, this has made the problem of find & get much less acute.

While the problem of find & get will never completely be resolved, many readers (not necessarily librarians) feel the problem is addressed simply enough. Enter a few words into a search box, click Go, and select items of interest. We don’t know about you, but we can find plenty of data & information. The problem now is what to do with it once it is identified.

We are sure any implementation of the DPLA will include superb functionality for find & get. In fact, our proposal assumes such functionality will exist. Some infrastructure will be created allowing for the identification of relevant content. At the very least this content will be described using metadata and/or the full-text will be mirrored locally. This metadata and/or full-text will be indexed and a search interface applied against it. Search results will probably be returned in any number of ordered lists: relevancy, date, author, title, etc. The interface may very well support functionality based on facets. The results of these searches will never be perfect, but in the eyes of most readers, the results will probably be good enough. This being the case, our proposal is intended to build on this good work and enable the reader to do things with content they identify. Thus we propose to build on the process of find & get to support a process we call use & understand.

Use & understand

The problem of find & get is always a means to an end, and very rarely the end itself. People want to do things with the content they find. We call these things “services against texts”, and they are denoted by action verbs including but not limited to:

* analyze * annotate * cite * compare & contrast * confirm * count & tabulate words, phrases, and ideas * delete * discuss * evaluate * find opposite * find similar * graph & visualize * learn from * plot on a map * plot on a timeline * purchase * rate * read * review * save * share * summarize * tag * trace idea * transform

We ask ourselves, “What services can be provisioned to make sense of all the content one finds on the Internet or in a library? How can the content of a digital work be ‘read’ in such a way that key facts and concepts become readily apparent? And can this process be applied to an entire corpus and/or a reader’s personal search results?” Thus, we see the problem of find & get evolving into the problem of use & understand.

In our opinion, the answers to these questions lie in the combination of traditional library principles with the application of computer science. Because libraries are expected to know the particular information needs of their constituents, libraries are uniquely positioned to address the problem of use & understand. What do people do with the data and information they find & get from libraries, or for that matter, any other place? In high school and college settings, students are expected to read literature and evaluate it. They are expected to compare & contrast it with similar pieces of literature, extract themes, and observe how authors use language. In a more academic setting scholars and researchers are expected to absorb massive amounts of non-fiction in order to keep abreast of developments in their fields. Each disciplinary corpus is whittled down by peer-review. It is reduced through specialization. Now-a-days the corpus is reduced even further through the recommendation processes of social networking. The resulting volume of content is still considered overwhelming by many. Use & understand is a next step in the information flow. It comes after find & get, and it is a process enabling the reader to better ask and answer questions of an entire collection, subcollection, or individual work. By applying digital humanities computing processes, specifically text mining and natural language processing, the process of use & understand can be supported by the DPLA. The examples in the following sections demonstrate and illustrate how this can be done.

Again, libraries are almost always a part of a larger organization, and there is an expectation libraries serve their constituents. Libraries do this in any number of ways, one of which is attempting to understand the “information needs” of the broader organization to provide both just-in-time as well as just-in-case collections and services. We are living, working, and learning in an environment of information abundance, not scarcity. Our production economy has all but migrated to a service economy. One of the fuels of service economies is data and information. As non-profit organizations, libraries are unable to compete when it comes to data provision. Consequently libraries may need to refocus and evolve. By combining their knowledge of the reader with the content of collections, libraries can fill a growing need. Because libraries are expected to understand the particular needs of their particular clientele, libraries are uniquely positioned to fill this niche. Not Google. Not Yahoo. Not Microsoft.

Examples

Measure size

One of the simplest and most rudimentary services against texts the DPLA could provide in order to promote use & understand is to measure the size of documents in terms of word counts in addition to page counts.

Knowing the size of a document is important to the reader because it helps them determine the time necessary to consume the document’s content and implies the document’s depth of elaboration. In general, shorter books require less time to read, and longer books go into greater detail. But page counts are too ambiguous a measure of length. For any given book, a large print edition will contain more pages than the same book in paperback form, which will be different again from its first edition hard cover manifestation.

Not only can much of the ambiguity of document lengths be eliminated if they were denoted with word counts, but if bibliographic descriptions were augmented with word counts then meaningful comparisons between texts could easily be brought to light.

Suppose the DPLA has a collection of one million full-text items. Suppose the number of words in each item were counted and saved in bibliographic records. Thus, search results could then be sorted by length. Once bibliographic records were supplemented with word counts it would be possible to calculate the average length of a book in the collection. Similarly, the range of lengths could be associated with a relative scale such as: tiny books, short books, average length books, long books, and tome-like books. Bibliographic displays could then be augmented with gauge-like graphics to illustrate lengths.

Such was done against the Alex Catalogue of Electronic Texts. There are (only) 14,000 full-text documents in the collection, but after counting all the words in all the documents it was determined that the average length of a document is about 150,000 words. A search was then done against the Catalogue for Charles Dickens’s A Christmas Carol, Oliver Twist and David Copperfield, and the lengths of the resulting documents were compared using gauge-like graphics, as illustrated below:

A Christmas Carol

Oliver Twist

David Copperfield

At least a couple of conclusions can be quickly drawn from this comparison. A Christmas Carol is much shorter than David Copperfield, and Oliver Twist is an average length document.

There will certainly be difficulties counting the number of words in documents. Things will need to be considered in order to increase accuracy, things like: whether or not the document in question has been processed with optical character recognition, whether or not things like chapter headers are included, whether or not back-of-the-book indexes are included, whether or not introductory materials are included. All of this also assumes a parsing program can be written which accurately extracts “words” from a document. The latter is, in fact, fodder for an entire computer science project.

Despite these inherent difficulties, denoting the number of words in a document and placing the result in bibliographic records can help foster use & understand. We believe counting the number of words in a document will result in a greater number of benefits when compared to costs.
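
The word-counting idea described in this section can be sketched in a few lines of Python. This is only a minimal illustration, not the code used against the Alex Catalogue. The function names, the regular expression used to define a “word”, the toy collection, and the cut-off ratios for the relative scale are all assumptions chosen for demonstration.

```python
import re
from statistics import mean

# Assumption: a "word" is a run of letters, optionally joined by apostrophes or hyphens
WORD = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

def count_words(text):
    """Return the number of word-like tokens in a plain-text document."""
    return len(WORD.findall(text))

def size_label(length, average):
    """Map a word count to a relative scale, given the collection average.

    The thresholds are hypothetical; a real collection would need to tune them.
    """
    ratio = length / average
    if ratio < 0.25:
        return "tiny"
    if ratio < 0.75:
        return "short"
    if ratio <= 1.25:
        return "average"
    if ratio <= 2.0:
        return "long"
    return "tome-like"

# A toy "collection" of three documents, keyed by identifier
collection = {
    "a": "It was the best of times.",
    "b": "Call me Ishmael. Some years ago, never mind how long precisely, I went to sea.",
    "c": "Don't stop.",
}
lengths = {key: count_words(text) for key, text in collection.items()}
average = mean(lengths.values())

# Sort the "search results" by length and label each one
for key, length in sorted(lengths.items(), key=lambda item: item[1]):
    print(key, length, size_label(length, average))
```

In practice the counts would be computed once and saved in the bibliographic records, so sorting and labeling search results would not require re-reading the full text.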

Measure difficulty

Measuring the inherent difficulty — readability score — of texts enables the reader to make judgements about those texts, and in turn, fosters use & understand. By including such measurements in the bibliographic records and search results, the DPLA will demonstrate ways it can “save the time of the reader”.

In the last century J. Peter Kincaid, Rudolf Flesch, and Robert Gunning worked both independently as well as collaboratively to create models of readability. Based on a set of factors (such as but not limited to: lengths of documents measured in words, the number of paragraphs in documents, the number of sentences in paragraphs, the number of words in sentences, the complexity of words, etc.) numeric values were calculated to determine the reading levels of documents. Using these models things like Dr. Seuss books are consistently determined to be easy to read while things like insurance policies are difficult. Given the full-text of a document in plain text form, it is almost trivial to compute any number of readability scores. The resulting values could be saved in bibliographic records, and these values could be communicated to the reader with the use of gauge-like graphics.

In a rudimentary way, the Alex Catalogue of Electronic Texts has implemented this idea. For each item in the Catalogue the Fog, Flesch, and Kincaid readability scores have been calculated and saved to the underlying MyLibrary database. Searches were done against the Catalogue for Charles Dickens’s David Copperfield, Henry David Thoreau’s Walden, and Immanuel Kant’s Fundamental Principles Of The Metaphysics Of Morals. The following graphics illustrate the readability scores of each. We believe the results are not surprising, but they are illustrative of this technique’s utility:

David Copperfield

Walden

Metaphysics of Morals

If readability scores were integrated into bibliographic search engines (“catalogs”), then it would be possible to limit search results by reading level or even sort search results by them. Imagine being able to search a library catalog for all items dealing with Neo-Platonism, asking for shorter items as opposed to longer items, and limiting things further by readability score.

Readability scores are not intended to be absolute. Instead they are intended to be used as guidelines. If the reader is a novice when it comes to a particular topic, and the reader is of high school age, that does not mean they are unable to read college level material. Instead, the readability scores would be used to set the expectations of the reader and help them make judgements before they begin reading a book.
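
To show how little is needed to compute such scores, here is a minimal sketch in Python. It uses the commonly published formulas for the Flesch reading ease, Flesch-Kincaid grade level, and Gunning fog measures. It is not the code behind the Alex Catalogue. The syllable counter is a rough vowel-group heuristic, and the function names and the two sample texts are assumptions made for illustration.

```python
import re

def count_syllables(word):
    """Roughly estimate syllables by counting vowel groups (a heuristic, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" (as in "time") as not forming its own syllable
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text):
    """Return Flesch reading ease, Flesch-Kincaid grade, and Gunning fog scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = [count_syllables(word) for word in words]
    # Gunning fog treats words of three or more syllables as "complex"
    complex_words = sum(1 for s in syllables if s >= 3)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)

    return {
        "flesch": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "kincaid": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "fog": 0.4 * (words_per_sentence + 100 * complex_words / len(words)),
    }

# Two made-up samples: one simple, one in the style of an insurance policy
easy = "I do not like them. I do not like green eggs and ham."
hard = ("The insurer shall indemnify the policyholder against liabilities "
        "arising from unforeseeable circumstances notwithstanding prior exclusions.")
print(readability(easy))
print(readability(hard))
```

A higher Flesch score means easier reading, while higher Kincaid and fog scores mean harder reading. Like the scores themselves, the syllable heuristic is a guideline, not an absolute.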

Side bar on quantitative bibliographic data

Bibliographic systems are notoriously qualitative in nature making the process of compare & contrast between bibliographic items very subjective. If there were more quantitative data associated with bibliographic records, then mathematical processes could be applied against collections as a whole, subsets of the collection, or even individual items.

Library catalogs are essentially inventory lists denoting what a library owns (or licenses). For the most part, catalogs are used to describe the physical nature of a library collection: authors, titles, publication dates, pagination and size, notes (such as “Includes index.”), and subject terms. Through things like controlled vocabularies and authority lists, the nature of a collection can be posited, and some interesting questions can be answered. Examples include: what is the average age of the items in the collection, what are the collection’s major subject areas, who are the predominant authors of the works in the collection. These are questions whose answers are manifested now-a-days through faceted browse interfaces, but they are questions of the collection as a whole or subsets of the collection, not individual works. They are questions librarians find interesting, not necessarily readers who want to evaluate the significance of a given work.

If the bibliographic systems were to contain quantitative data, then the bibliographic information systems would be more meaningful and more useful. Dates are a very good example. The dates (years) in a library catalog denote when the item in hand (a book) was published, not when the idea in the book was manifested. Consequently, if Plato’s Dialogs were published today, then its library catalog record would have a value of 2011. While such a thing is certainly true, it is misleading. Plato did not write the Dialogs this year. They were written more than 2,500 years ago. Given our current environment, why can’t a library catalog include this sort of information?

Suppose the reader wanted to read all the works of Henry David Thoreau. Suppose the library catalog had accurately denoted all the items in its collection by this author with the authority term, “Thoreau, Henry David”. Suppose the reader did an author search for “Thoreau, Henry David” and a list of twenty-five items was returned. Finally, suppose the reader wanted to begin by reading Thoreau’s oldest work first and progress to his latest. Using a library catalog, such a thing would not be possible because the dates in bibliographic records denote the date of publication, not the date of first conception or manifestation.

Suppose the reader wanted to plot on a timeline when Thoreau’s works were published, and the reader wanted to compare this with the complete works of Longfellow or Walt Whitman. Again, such a thing would not be possible because the dates in a library catalog denote publication dates, not when ideas were originally manifested. Why shouldn’t a library catalog enable the reader to easily create timelines?

To make things even more complicated, publication dates are regularly denoted as strings, not integers. Examples include: [1701], 186?, 19–, etc. These types of values are ambiguous. Their meaning and interpretation are bound to irregularly implemented “syntactical sugar”. Consequently, without heroic efforts, it is not easy to do any sort of compare & contrast evaluation when it comes to dates.
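
As a small illustration of the effort involved, the following Python sketch converts date strings like the ones above into pairs of integers denoting the earliest and latest possible years. The function name and the interpretation rules (brackets are dropped, and each missing digit widens the range) are our assumptions for demonstration, not an established cataloging standard.

```python
import re

def year_range(value):
    """Convert a catalog-style date string into an inclusive (earliest, latest) pair of years.

    Returns None when no year can be recognized.
    """
    # Drop the square brackets used for inferred dates, plus surrounding whitespace
    cleaned = value.strip().strip("[]").strip()
    # One to four digits, optionally followed by "?", "-", or en dash placeholders
    match = re.fullmatch(r"(\d{1,4})([?\-\u2013]*)", cleaned)
    if match is None:
        return None
    digits, placeholders = match.groups()
    if not placeholders or len(digits) == 4:
        year = int(digits)
        return (year, year)
    # Assumption: every digit short of four is unknown, so the range spans all its values
    missing = 4 - len(digits)
    low = int(digits) * 10 ** missing
    return (low, low + 10 ** missing - 1)

# Sort a handful of date strings by their earliest possible year
dates = ["186?", "[1701]", "19--"]
print(sorted(dates, key=lambda value: year_range(value)[0]))
```

Once dates are stored as integers, or as integer ranges, sorting and comparing them becomes an ordinary numeric operation.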

The DPLA has the incredible opportunity to make a fresh start when it comes to the definition of library catalogs. We know the DPLA will not want to reinvent the wheel. At the same time we believe the DPLA will want to exploit the current milieu, re-evaluate the possibilities of computer technology, and consequently refine and evolve the meaning of “catalog”. Traditional library catalogs were born in an era of relative information scarcity. Today we are dealing with problems of abundance. Library catalogs need to do many things differently in order to satisfy the needs/desires of the current reader. “Next-generation library catalogs” can do so much more than provide access to local collections. Facilitating ways to evaluate collections, sub-collections, or individual items through the use of quantitative analysis is just one example.

Measure concept

By turning a relevancy ranking algorithm on its head, it is possible to measure the existence of concepts in a given work. If this were done for many works, then new comparisons between works would be possible, again making it easy for the reader to compare & contrast items in a corpus or search results. Of all the services against texts examples in this proposal, we know this one is the most avant-garde.

Term frequency/inverse document frequency (TFIDF) is a model at the heart of many relevancy ranking algorithms. Mathematically stated, TFIDF equals:

( c / t ) * log( d / f )

where:

c = number of times the query terms appear in a document
t = total number of words in a document
d = total number of documents in a corpus
f = total number of documents containing the query terms

In other words, TFIDF calculates relevancy (“aboutness”) by multiplying the ratio of query term occurrences to document size by the logarithm of the ratio of the number of documents in the corpus to the number of documents containing the query terms. Thus, if three documents in a larger corpus each contain the word “music” three times, but one of them is 100 words long and the other two are 200 words long, then the first document is considered more relevant than the other two.

Written language — which is at the very heart of library content — is ambiguous, nuanced, and dynamic. Few, if any, concepts can be completely denoted by a single word or phrase. Instead, a single concept may be better described using a set of words or phrases. For example, music might be denoted thusly:

art, Bach, Baroque, beat, beauty, blues, composition, concert, dance, expression, guitar, harmony, instrumentation, key, keyboard, melody, Mozart, music, opera, percussion, performance, pitch, recording, rhythm, scale, score, song, sound, time, violin

If any document used some or all of these words with any degree of frequency, then it would probably be safe to say the document was about music. This “aboutness” could then be calculated by summing the TFIDF scores of all the music terms in a given document — a thing called the “document overlap measure”. Thus, one document might have a total music “aboutness” measure of 105 whereas another document might have a measure of 55.
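
The calculation can be sketched in Python as follows. This is a minimal illustration of the formula given above, not the code used for any of our experiments. The function names, the use of the natural logarithm, the simple tokenizer, the three-document corpus, and the shortened list of music terms are all assumptions.

```python
import math
import re

def tokenize(text):
    """Split text into lowercase words (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(term, document, corpus):
    """Compute ( c / t ) * log( d / f ) for one term in one tokenized document."""
    c = document.count(term)                      # times the term appears in the document
    t = len(document)                             # total number of words in the document
    d = len(corpus)                               # total number of documents in the corpus
    f = sum(1 for doc in corpus if term in doc)   # documents containing the term
    if c == 0 or t == 0 or f == 0:
        return 0.0
    return (c / t) * math.log(d / f)              # assumption: natural logarithm

def aboutness(theme, document, corpus):
    """Sum the TFIDF scores of every theme term found in the document."""
    return sum(tfidf(term, document, corpus) for term in theme)

# A toy corpus and a shortened "music" theme
corpus = [tokenize(text) for text in (
    "The concert opened with a Bach violin melody and closed with a Mozart opera song.",
    "Plant the seeds in spring and water the garden every morning.",
    "The violin has four strings.",
)]
music = ["bach", "concert", "melody", "mozart", "music", "opera", "song", "violin"]

scores = [aboutness(music, document, corpus) for document in corpus]
print(scores)
```

Note that, with this formula, a term appearing in every document of the corpus contributes nothing to the sum, because the logarithm of one is zero. The measure rewards terms that distinguish a document from the rest of the collection.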

We used a process very similar to the one outlined above in an effort to measure the “greatness” of the set of books called The Great Books Of The Western World. Each book in the set was evaluated in terms of its use of the 102 “great ideas” enumerated in the set’s introduction. We summed the computed TFIDF values of each great idea in each book, a value we call the Great Ideas Coefficient. Through this process we determined the “greatest” book in the set was Aristotle’s Politics because it alluded to the totality of “great ideas” more than the others. Furthermore, we determined that Shakespeare wrote seven of the top ten books when it comes to the idea of love. The following figure illustrates the result of these comparisons. The bars above the line represent books greater than the hypothetical average great book, and the bars below the line are less great than the others.

Measuring the “greatness” of The Great Books of the Western World

The DPLA could implement very similar services against texts in one or both of two ways. First, it could denote any number of themes (like music or “great ideas”) and calculate coefficients denoting the aboutness of those themes for every book in the collection. Readers could then limit their searches by these coefficients or sort their search results accordingly. Find all books with subjects equal to philosophy. Sort the result by the philosophy coefficient.

Second, and possibly better, the DPLA could enable readers to denote their own more specialized and personalized themes. These themes and their aboutness coefficients could then be applied, on-the-fly, to search results. For example, find all books with subject terms equal to gardening, and sort the result by the reader’s personal definition of biology.

As stated earlier, written language is ambiguous and nuanced, but at the same time it is, to some degree, predictable. If it were not predictable, then no one would be able to understand another. Because of this predictability, language, to some degree, can be quantified. Once quantified, it can be measured. Once measured it can be sorted and graphed, and thus new meanings can be expressed and evaluated. The coefficients described in this section, like the measurements of length and readability, are to be taken with a grain of salt, but they can help the reader use & understand library collections, sub-collections, and individual items.

Plot on a timeline

Plotting things on a timeline is an excellent way to put events into perspective, and when written works are described with dates, then they are amenable to visualizations.

The DPLA could put this idea into practice by applying it against search results. The reader could do a search in the “catalog”, and the resulting screen could have a link labeled something like “Plot on a timeline”. By clicking the link the dates of search results could be extracted from the underlying metadata, plotted on a timeline, and displayed. At the very least such a function would enable the reader to visualize when things were published and answer rudimentary questions such as: are there clusters of publications, do the publications span a large swath of time, did one particular author publish things on a regular basis?

The dates in traditional bibliographic metadata denote the publication of an item, as mentioned previously. Consequently the mapping of monographs may not be as useful as desired. On the other hand, the dates associated with things of a serial nature (blog postings, Twitter feeds, journal articles, etc.) are more akin to dates of conception. We imagine the DPLA systematically harvesting, preserving, and indexing freely available and open access serial literature. This content is much more amenable to plotting on a timeline as illustrated below:

Timeline illustrating when serial literature was published

The timeline was created by aggregating selected RSS feeds, parsing out the dates, and plotting them accordingly. Different colored items represent different feeds. Each item in the timeline is hot, providing the means to read the item’s abstract and optionally view the item’s full text.
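
The “parsing out the dates” step can be sketched with nothing more than the Python standard library. This is not the code that produced the timeline above. It assumes the dates have already been extracted from the feeds as RFC 822 strings (the format RSS uses for publication dates), and the function names and the plain-text rendering are assumptions made for illustration.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

def tally_by_month(dates):
    """Count RFC 822 date strings by (year, month)."""
    tally = Counter()
    for value in dates:
        stamp = parsedate_to_datetime(value)
        tally[(stamp.year, stamp.month)] += 1
    return tally

def text_timeline(tally):
    """Render the tally as rows of asterisks, oldest month first."""
    return [f"{year}-{month:02d} {'*' * count}"
            for (year, month), count in sorted(tally.items())]

# Sample dates, formatted like the publication dates found in an RSS feed
dates = [
    "Thu, 01 Sep 2011 14:28:26 +0000",
    "Mon, 12 Sep 2011 09:00:00 +0000",
    "Fri, 04 May 2018 16:02:36 +0000",
]
for row in text_timeline(tally_by_month(dates)):
    print(row)
```

A real service would hand the tallied dates to a graphical timeline widget instead of printing asterisks, but the harvesting and counting steps would look much the same.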

Plotting things on a timeline is another way the DPLA can build on the good work of find & get and help the reader use & understand.

Count word and phrase frequencies

Akin to traditional back-of-the-book indexes, word and phrase frequency tabulations are one of the simplest and most expedient ways of providing access to and overviews of a text. Like tables of contents and indexes, word and phrase frequencies increase a text’s utility and make texts easier to understand.

Back-of-the-book indexes are expensive to create and the product of an individual’s perspective. Moreover, back-of-the-book indexes are not created for fiction. Why not? Given the full-text of a work any number of back-of-the-book index-like displays could be created to enhance the reader’s experience. For example, by simply tabulating the occurrences of every word in a text (sans, maybe, stop words), and then displaying the resulting list alphabetically, the reader can have a more complete back-of-the-book index generated for them without the help of a subjective indexer. The same tabulation could be done again but instead of displaying the content alphabetically, the results could be ordered by frequency as in a word cloud. In either case each entry in the “index” could be associated with an integer denoting the number of times the word (or phrase) occurs in the text. The word (or phrase) could then be linked to a concordance (see below) in order to display how the word (or phrase) was used in context.
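
The tabulation just described takes only a few lines of Python. The sketch below is an illustration under stated assumptions: the function name, the tiny stop word list, the tokenizing regular expression, and the sample sentence are ours, and a production service would need a more careful definition of “word”.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop word list; real lists are much longer
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count every word in a text, ignoring case and skipping stop words."""
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(word for word in words if word not in stop_words)

text = "The pond is deep, and the woods are quiet. The pond freezes in winter."
frequencies = word_frequencies(text)

# Display alphabetically, like a back-of-the-book index
for word in sorted(frequencies):
    print(word, frequencies[word])

# Display by frequency, like the ordering behind a word cloud
for word, count in frequencies.most_common(10):
    print(count, word)
```

Each entry in either display could then be linked to a concordance showing the word in context.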

Take, for example, Henry David Thoreau’s Walden. This is a piece of non-fiction about a man who lives alone in the woods by a pond for just about two years. In the book’s introduction Ralph Waldo Emerson describes Thoreau as a man with a keen sense of physical space and an uncanny ability for measurement. The book itself describes one person’s vision of what it means to be human. Upon the creation and display of the 100 most frequently used two-word phrases (bigrams), these statements about the book are borne out. Notice the high frequency of quantitative references as well as references to men:

Compare Walden to James Joyce’s Ulysses, a fictional work describing a day in the life of Leopold Bloom as he walks through Dublin. Notice how almost every single bigram is associated with the name of a person.

Interesting? Some people may react to these illustrations and say, “So what? I already knew that.” To which we reply, “Yes, but what about those people who haven’t read these texts?” Imagine being able to tabulate the word frequencies against any given set of texts — a novel, a journal article, a piece of non-fiction, all of the works by a given author or in a given genre. The results are able to tell the reader things about the works. For example, they might alert the reader to the central importance of a person named Bloom. When Bloom is mentioned in the text, then maybe the reader ought to pay extra attention to what is being said. Frequency tabulations and word clouds can also alert the reader to what is not said in a text. Apparently religion is not an overarching theme in either of the above examples.

The 100 most frequent two-word phrases in Walden

The 100 most frequent two-word phrases in Ulysses

It is possible to tabulate word frequencies across texts. Again, using A Christmas Carol, Oliver Twist, and David Copperfield as examples, we discover the 6-word phrase “taken with a violent fit of” appears in both David Copperfield and A Christmas Carol. Moreover, the bigram “violent fit” appears in all three works. Specifically, characters in these three Dickens stories have violent fits of laughter, crying, trembling, and coughing. By concatenating the stories together and applying concordancing methods to them (described below) we see there are quite a number of violent things in the three stories:

  n such breathless haste and violent agitation, as seemed to betoken so
  ood-night, good-night!' The violent agitation of the girl, and the app
  sberne) entered the room in violent agitation. 'The man will be taken,
  o understand that, from the violent and sanguinary onset of Oliver Twi
  one and all, to entertain a violent and deeply-rooted antipathy to goi
  eep a little register of my violent attachments, with the date, durati
  cal laugh, which threatened violent consequences. 'But, my dear,' said
  in general, into a state of violent consternation. I came into the roo
  artly to keep pace with the violent current of her own thoughts: soon
  ts and wiles have brought a violent death upon the head of one worth m
   There were twenty score of violent deaths in one long minute of that
  id the woman, making a more violent effort than before; 'the mother, w
   as it were, by making some violent effort to save himself from fallin
  behind. This was rather too violent exercise to last long. When they w
   getting my chin by dint of violent exertion above the rusty nails on
  en who seem to have taken a violent fancy to him, whether he will or n
  peared, he was taken with a violent fit of trembling. Five minutes, te
  , when she was taken with a violent fit of laughter; and after two or
  he immediate precursor of a violent fit of crying. Under this impressi
  and immediately fell into a violent fit of coughing: which delighted T
  of such repose, fell into a violent flurry, tossing their wild arms ab
   and accompanying them with violent gesticulation, the boy actually th
  ght I really must have laid violent hands upon myself, when Miss Mills
   arm tied up, these men lay violent hands upon him -- by doing which,
   every aggravation that her violent hate -- I love her for it now -- c
   work himself into the most violent heats, and deliver the most wither
  terics were usually of that violent kind which the patient fights and
   me against the donkey in a violent manner, as if there were any affin
   to keep down by force some violent outbreak. 'Let me go, will you,--t
  hands with me - which was a violent proceeding for him, his usual cour
  en.' 'Well, sir, there were violent quarrels at first, I assure you,'
  revent the escape of such a violent roar, that the abused Mr. Chitling
  t gradually resolved into a violent run. After completely exhausting h
  , on which he ever showed a violent temper or swore an oath, was this
  ullen, rebellious spirit; a violent temper; and an untoward, intractab
  fe of Oliver Twist had this violent termination or no. CHAPTER III REL
  in, and seemed to presage a violent thunder-storm, when Mr. and Mrs. B
  f the theatre, are blind to violent transitions and abrupt impulses of
  ming into my house, in this violent way? Do you want to rob me, or to

These observations simply beg other questions. Is violence a common theme in Dickens’ works? What other adjectives are used to a greater or lesser degree in Dickens’ works? How does the use of these adjectives differ from that of other authors of the same time period or within the canon of English literature?

While works of fiction are the basis of most of the examples, there is no reason why similar processes couldn’t be applied to non-fiction as well. We also understand that the general reader will not be interested in these sorts of services against texts. Instead we see these sorts of services as more applicable to students in high school and college. We also see these sorts of services being applicable to the scholar or researcher who needs to “read” large numbers of journal articles. Finally, we do not advocate the use of these sorts of tools as a replacement for traditional “close” reading. These tools are supplements and additions to the reading process just as tables of contents and back-of-the-book indexes are today.

Display in context

Concordances — one of the oldest literary tools in existence — have got to be some of the more useful services against texts a library could provide because they systematically display words and concepts within the context of the larger written work making it very easy to compare & contrast usage. Originally implemented by Catholic priests as early as 1250 to study religious texts, concordances (sometimes called “key word in context” or KWIC indexes) trivialize the process of seeing how a concept is expressed in a work.
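A key-word-in-context display is simple to compute once the full text is at hand. The following Python sketch is one hedged way to do it; the function name and the width parameter are assumptions of ours, and it is not the Perl module used to create the examples in this proposal.

```python
import re

def concordance(text, query, width=40):
    """Return one line per match of `query`, with `width` characters of context on each side."""
    flat = " ".join(text.split())  # collapse newlines and runs of whitespace
    lines = []
    for match in re.finditer(re.escape(query), flat, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        # Right-justify the left context so every match lines up in the same column.
        left = flat[start:match.start()].rjust(width)
        right = flat[match.end():match.end() + width]
        lines.append(left + match.group(0) + right)
    return lines
```

Because the match is a plain substring search, a query such as “man is” will also find “woman is”; a real service might want to match on word boundaries instead.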

As an example of how concordances can be used to analyze texts, we asked ourselves, “How do Plato, Aristotle, and Shakespeare differ in their definition of man?” To answer this question we amassed all the works of the authors, searched each for the phrase “man is”, and displayed the results in a concordance-like fashion. From the results the reader can see how the definitions of Plato and Aristotle are very similar but much different from Shakespeare’s:

Plato’s definitions

  stice, he is met by the fact that man is a social being, and he tries to harmoni
  ption of Not-being to difference. Man is a rational animal, and is not -- as man
  ss them. Or, as others have said: Man is man because he has the gift of speech;
  wise man who happens to be a good man is more than human (daimonion) both in lif
  ied with the Protagorean saying, 'Man is the measure of all things;' and of this

Aristotle’s definitions

  ronounced by the judgement 'every man is unjust', the same must needs hold good
  ts are formed from a residue that man is the most naked in body of all animals a
  ated piece at draughts. Now, that man is more of a political animal than bees or
  hese vices later. The magnificent man is like an artist; for he can see what is
  lement in the essential nature of man is knowledge; the apprehension of animal a

Shakespeare’s definitions

   what I have said against it; for man is a giddy thing, and this is my conclusio
   of man to say what dream it was: man is but an ass, if he go about to expound t
  e a raven for a dove? The will of man is by his reason sway'd; And reason says y
  n you: let me ask you a question. Man is enemy to virginity; how may we barricad
  er, let us dine and never fret: A man is master of his liberty: Time is their ma

We do not advocate the use of concordances as the be-all and end-all of literary analysis but rather as a pointer to bigger questions. Think how much time and energy would have been required if the digitized texts of each of these authors were not available, and if computers could not be applied against them. Concordances, as well as the other services against texts outlined in this proposal, make it easier to ask questions of collections, sub-collections, and individual works. This ease-of-use empowers the reader to absorb, observe, and learn from texts in ways that were not possible previously. We do not advocate these sorts of services against texts as replacements for traditional reading processes, but rather we advocate them as alternative and supplemental tools for understanding the human condition or physical environment as manifested in written works.

Herein lies one of the main points of our proposal. By creatively exploiting the current environment where full-text abounds and computing horsepower is literally at everybody’s fingertips, libraries can assist the reader to “read” texts in new and different ways — ways that make it easier to absorb larger amounts of information and ways to understand it from new and additional perspectives. Concordances are just one example.

Display the proximity of a given word to other words

Visualizing the words frequently occurring near a given word is often descriptive and revealing. With the availability of full-text content, creating such visualizations is almost trivial and has the potential for greatly enhancing the reader’s experience. This enhanced reading process is all but impossible when the written word is solely accessible in analog forms, but in a digital form the process is almost easy.

For example, first take the word woodchuck as found in Henry David Thoreau’s Walden. Upon reading the book the reader learns of his literal distaste for the woodchuck. They eat his beans, and he wants to skin them. Compare the same author’s allusions to woodchucks in his work A Week on the Concord and Merrimack Rivers. In this work, when woodchucks are mentioned he also alludes to other small animals such as foxes, minks, muskrats, and squirrels. In other words, the connotations surrounding woodchucks differ between the two books, as illustrated by the following network diagrams:

“woodchuck” in Walden

“woodchuck” in Rivers

The given word — woodchuck — is in the center. The words connected to the given word are the words appearing most frequently near it. This same process is then applied to the connected words. Put another way, these network diagrams literally illustrate what an author says “in the same breath” when they use a given word. Such visualizations are simply not possible through the process of traditional reading without spending a whole lot of effort. The DPLA could implement the sort of functionality described in this section and make the reader’s experience richer. It demonstrates how libraries can go beyond access (a problem that is increasingly not a problem) and move towards use & understand.
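The data behind such a network diagram can be sketched as follows. This is a minimal illustration in Python under assumptions of our own: the window size, the number of neighbors kept, and the function names are all hypothetical, and the drawing itself is left to a visualization library.

```python
import re
from collections import Counter

def neighbors(words, target, window=5, top=3, stop_words=frozenset()):
    """Return the `top` words occurring most often within `window` words of `target`."""
    counts = Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        nearby = words[max(i - window, 0):i] + words[i + 1:i + 1 + window]
        counts.update(w for w in nearby if w != target and w not in stop_words)
    return [w for w, _ in counts.most_common(top)]

def proximity_network(text, target, window=5, top=3, stop_words=frozenset()):
    """Build a two-level network: the target, its neighbors, and their neighbors."""
    words = re.findall(r"[a-z']+", text.lower())
    network = {target: neighbors(words, target, window, top, stop_words)}
    for word in network[target]:
        # Apply the same process to each connected word.
        network.setdefault(word, neighbors(words, word, window, top, stop_words))
    return network
```

The result is a dictionary mapping each node to the nodes it connects to, which is the shape most graph-drawing tools expect as input.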

We do not advocate the use of this technology to replace traditional analysis, but rather to improve upon it. This technology, like all of the examples in the proposal, makes it easier to find interesting patterns for further investigation.

Display location of word in a text

Sometimes displaying where in a text, percentage-wise, a word or phrase exists can raise interesting questions, and by providing tools to do such visualizations the DPLA will foster the ability to more easily ask interesting questions.

For example, what comes to mind when you think of Daniel Defoe’s Robinson Crusoe? Do you think of a man shipwrecked on an island and the cannibal named Friday? Ask yourself, when in the story is the man shipwrecked and when does he meet Friday? Early in the story? In the middle? Towards the end? If you guessed early in the story, then you would be wrong because most of the story takes place on a boat, and only three-quarters of the way through the book does Friday appear, as illustrated by the following histogram:

We all know that Herman Melville’s book Moby Dick is about a sailor hunting a great white whale. Looking at a histogram of where the word “white” appears in the story, we see a preponderance of its occurrence forty percent of the way through the book. Why? Upon looking at the book more closely we see that one of the chapters is entitled “The Whiteness of the Whale”, and it is almost entirely about the word “white”. This chapter appears about forty percent of the way through the text. Who ever heard of an entire book chapter whose theme was a color?

“friday” in Crusoe

“white” in Moby Dick

In a Catholic pamphlet entitled Letters of an Irish Catholic Layman the word “catholic” is one of the more common and appears frequently in the text towards the beginning as well as the end.

“catholic” in Layman

“lake erie” in Layman

“niagara falls” in Layman

After listing the most common two-word phrases in the book we see that there are many references to places in upper New York state:

The 100 most frequently used two-word phrases in Letters of an Irish Catholic Layman

Looking more closely at the locations of “Lake Erie” and “Niagara Falls” in the text, we see that these things are referenced in the places where the word “catholic” is not mentioned.

Does the author go off on a tangent? Are there no Catholics in these areas? The answers to these questions, and the question of why, are left up to the reader, but the important point is the ability to quickly “read” the texts in ways that were not feasible when the books were solely in analog form. Displaying where in a text words or phrases occur literally illustrates new ways to view the content of libraries. These are examples of how the DPLA can build on find & get and increase use & understand.
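The counts behind histograms like these are easy to produce. The following Python sketch divides a text into equal slices and counts the matches in each; the function name and the default of ten slices are assumptions made for illustration, and the charting is left to whatever graphing tool is at hand.

```python
import re

def location_histogram(text, query, bins=10):
    """Count occurrences of `query` in each of `bins` equal slices of the text."""
    flat = " ".join(text.lower().split())
    counts = [0] * bins
    for match in re.finditer(re.escape(query.lower()), flat):
        # Position of the match as a fraction of the whole text, mapped to a slice.
        index = min(match.start() * bins // len(flat), bins - 1)
        counts[index] += 1
    return counts
```

With ten slices, a spike in the fourth or fifth position would correspond to a word clustering about forty percent of the way through a book.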

Elaborate upon and visualize parts-of-speech analysis

Written works can be characterized through parts-of-speech analysis. This analysis can be applied to the whole of a library collection, subsets of the collection, or individual works. The DPLA has the opportunity to increase the functionality of a library by enabling the reader to elaborate upon and visualize parts-of-speech analysis. Such a process will facilitate greater use of the collection and improve understanding of it.

Because the English language follows sets of loosely defined rules, it is possible to systematically classify the words and phrases of written works into parts-of-speech. These include but are not limited to: nouns, pronouns, verbs, adjectives, adverbs, prepositions, punctuation, etc. Once classified, these parts-of-speech can be tabulated and quantitative analysis can begin.

Our own forays into parts-of-speech analysis, where the relative percentage use of parts-of-speech was compared, proved fruitless. But the investigation inspired other questions whose answers may be more broadly applied. More specifically, students and scholars are oftentimes more interested in what an author says as opposed to how they say it. Such insights can be gleaned not so much from gross parts-of-speech measurements but rather from the words used to denote each part-of-speech. For example, the following table lists the 10 most frequently used pronouns and the number of times they occur in four works. Notice the differences:

Walden	Rivers	Northanger	Sense
I (1,809)	it (1,314)	her (1,554)	her (2,500)
it (1,507)	we (1,101)	I (1,240)	I (1,917)
my (725)	his (834)	she (1,089)	it (1,711)
he (698)	I (756)	it (1,081)	she (1,553)
his (666)	our (677)	you (906)	you (1,158)
they (614)	he (649)	he (539)	he (1,068)
their (452)	their (632)	his (524)	his (1,007)
we (447)	they (632)	they (379)	him (628)
its (351)	its (487)	my (342)	my (598)
who (340)	who (352)	him (278)	they (509)

While the lists are similar, they are characteristic of the works from which they came. The first — Walden — is about an individual who lives on a lake. Notice the prominence of the words “I” and “my”. The second — Rivers — is written by the same author as the first but is about brothers who canoe down a river. Notice the higher occurrence of the words “we” and “our”. The latter two works, both written by Jane Austen, are works with females as central characters. Notice how the words “her” and “she” appear in these lists but not in the former two. It looks as if there are patterns or trends to be measured here.
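A table like the one above can be approximated with the sketch below. Note the loud assumption: a fixed list of pronouns stands in for a real parts-of-speech tagger, so this illustrates the tabulation step only and will not reproduce the figures in the table. The function names are ours.

```python
import re
from collections import Counter

# Assumption: a closed list of pronouns stands in for a real part-of-speech
# tagger; the figures in the proposal came from a tagger, not from this list.
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your", "he", "him",
            "his", "she", "her", "it", "its", "they", "them", "their", "who"}

def top_pronouns(text, limit=10):
    """Return the most frequent pronouns in a text as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in PRONOUNS).most_common(limit)

def side_by_side(texts, limit=10):
    """Return column titles and rows of 'word (count)' cells, one column per work."""
    columns = {title: top_pronouns(text, limit) for title, text in texts.items()}
    rows = []
    for rank in range(limit):
        rows.append([f"{col[rank][0]} ({col[rank][1]:,})" if rank < len(col) else ""
                     for col in columns.values()])
    return list(columns), rows
```

Swapping the pronoun list for the output of a tagger would extend the same tabulation to nouns, verbs, adjectives, and so on.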

If the implementation of the DPLA were to enable the reader to do this sort of parts-of-speech analysis against search results, then the search results may prove to be more useful.

Nouns and pronouns play a special role in libraries because they are the foundation of controlled vocabularies, authority lists, and many other reference tools. Imagine being able to extract and tabulate all the nouns (things, names, and places) from a text. A word cloud-like display would convey a lot of meaning about the text. On the other hand, a simple alphabetical list of the results could very much function like a back-of-the-book index. Each noun or noun phrase could be associated with any number of functions such as but not limited to:

look-up in a controlled vocabulary list in order to find more
look-up in an authority list in order to find more
show in context of the given work (concordance)
elaborate upon using a dictionary, thesaurus, encyclopedia, etc.
plot on a map

We demonstrated the beginnings of the look-up functions in a Code4Lib Journal article called “Querying OCLC Web Services for Name, Subject, and ISBN”. The concordance functionality is described above. The elaboration service is commonplace in today’s ebook readers. Through an interface designed for mobile devices, we implemented a combination of the elaborate and plot on a map services as a prototype. In this implementation the reader is presented with a tiny collection of classic works. The reader is then given the opportunity to browse the names or places index. After the reader selects a specific name or place the application displays a descriptive paragraph of the selection, an image of the selection, and finally, hypertext links to a Wikipedia article or a Google Maps display.


Screen shots of services against texts on a mobile device

Given the amount of full text content that is expected to be in or linked from the DPLA’s collection, there is so much more potential functionality for the reader. The idea of a library being a storehouse of books and journals is rapidly becoming antiquated. Because content is so readily available on the ‘Net, there is a need for libraries to evolve beyond their stereotypical function. By combining a knowledge of what readers do with information with the possibilities for full text analysis, the DPLA will empower the reader to more easily ask and answer questions of texts. In turn, this will make it easier for the reader to use & understand what they are reading.

Disclaimer

People may believe the techniques described herein run contrary to the traditional processes of “close” reading. From our point of view, nothing could be further from the truth. We sincerely believe the techniques described in this proposal supplement and enhance the reading process.

We are living in an age where we feel like we are drowning in data and information. But according to Ann Blair this is not a new problem. In her book, Too Much to Know, Blair chronicles in great detail the ways scholars since the 3rd Century have dealt with information overload. While these ways seem obvious in today’s world, they were innovations in their time. They included but were not limited to: copying texts (St. Jerome in the 3rd Century), creating concordances (Hugh St. Cher in the 13th Century), and filing wooden “cards” in a “catalog” (Athanasius Kircher in the 17th Century).

St. Jerome

Hugh St. Cher

Athanasius Kircher

Think of all the apparatus associated with a printed book. Books have covers, and sometimes there are dust jackets complete with a description of the book and maybe the author. On the book’s spine is the title and publisher. Inside the book there are cover pages, title pages, tables of contents, prefaces & introductions, tables of figures, the chapters themselves complete with chapter headings at the top of every page, footnotes & references & endnotes, epilogues, and an index or two. These extras — tables of contents, chapter headings, indexes, etc. — did not appear in books with the invention of the codex. Instead their existence was established and evolved over time.

In scholarly detail, Blair documents how these extras — as well as standard reference works like dictionaries, encyclopedias, and catalogs — came into being. She asserts the creation of these things became necessary as the number and lengths of books grew. These tools made the process of understanding the content of books easier. They reinforced ideas, and made the process of returning to previously read information faster. According to Blair, not everybody thought these tools — especially reference works — were a good idea. To paraphrase, “People only need a few good books, and people should read them over and over again. Things like encyclopedias only make the mind weaker since people are not exercising their memories.” Despite these claims, reference tools and the apparatus of printed books continue to exist and our venerable “sphere of knowledge” continues to grow.

Nobody can claim understanding of a book if they read only the table of contents, flip through the pages, and glance at the index. Yes, they will have some understanding, but it will only be tertiary. We see the tools described in this proposal as akin to tables of contents and back-of-the-book indexes. They are tools to find, get, use, and understand the data, information, and knowledge a book contains. They are a natural evolution considering the existence of books in digital forms. The services against texts described in this proposal enhance and supplement the reading process. They make it easier to compare & contrast the content of single books or an entire corpus. They make it faster and easier to extract pertinent information. Like a back-of-the-book index, they make it easier to ask questions of a text and get answers quickly. The tools described in this proposal are not intended to be the end-all and be-all of textual analysis. Instead, they are intended to be pointers to interesting ideas, and it is left up to the reader to flesh out and confirm the ideas after closer reading.

Digital humanities investigations and specifically text mining computing techniques like the ones in this proposal can be viewed as modern-day processes for dealing with and taking advantage of information overload. Digital humanists use computers to evaluate all aspects of human expression. Writing. Music. Theater. Dance. Etc. Text mining is a particular slant on the digital humanities applying this evaluation process against sets of words. We are simply advocating these processes become integrated with library collections and services.

Software

This section lists the software used to create our Beta-Sprint Proposal examples. All of the software is open source or freely accessible. None of the software is one-of-a-kind because each piece could be replaced by something else providing similar functionality.

Alex Catalogue of Electronic Texts – This is a collection and full-text index of approximately 14,000 public domain documents from the areas of American and English literature as well as Western philosophy. This “digital library”, created and maintained by the author since 1994, is a personal “sandbox” and laboratory for the implementation of new ideas in librarianship.
Google Charts – Implemented through a Javascript API (application programmer interface), Google Charts enabled us to create the histograms in the “display location of word in a text” service. It also provided the gauge-like graphics for the “measure size” and “measure difficulty” services.
Google Maps – Another Javascript API, Google Maps was a part of the “plot on a map” service.
Lingua::Concordance – A Perl module, Lingua::Concordance was used to implement the “display in context” service. This module was written by the author.
Lingua::EN::Ngram – Another Perl module written by the author, Lingua::EN::Ngram was used to count and tabulate the words and n-length phrases in a given text. It plays a crucial role in the “count word and phrase frequencies” service.
Lingua::Fathom – This Perl module formed the basis of the “measure size” and “measure difficulty” services since its primary purpose is to calculate Fog, Flesch, and Kincaid readability scores.
Lingua::Stem::Snowball – This Perl module plays a role in the “measure concept” service. Given words as input, it outputs the words’ roots (or “stems”). These roots were then searched against the index of Alex Catalogue to determine the number of documents (f) containing the root. This value was then used to calculate TFIDF.
Lingua::TreeTagger – This is a Perl interface to a set of cross-platform binary applications whose purpose is to classify parts-of-speech. Lingua::TreeTagger was used to compare & contrast the ways pronouns were used in four classic works of literature.
MyLibrary – This is a digital library framework written in Perl. At its core are modules to manage library resources, librarians, and patron descriptions. Inter-relationships between resources, librarians, and patrons can be controlled through the creation and maintenance of facet/term combinations. MyLibrary was co-written by the author and implemented the concept of facets before faceted browse became popular. MyLibrary, in combination with Solr, forms the functional basis of the Alex Catalogue.
Protovis – This is the Javascript library used to visualize the “display the proximity of a given word to other words” service.
SIMILE Widgets Timeline – This is a Javascript library used to display timelines. It was used in the “plot on a timeline” service.
Solr – Solr is probably the most popular open source indexer in use by the library community, if not elsewhere. It is used to index the full-text of the Alex Catalogue. It was also used to determine the value of f in the “measure concept” service.
Stanford Named Entity Recognizer – This is the set of Java programs used to extract the names and places from a document. These names and places were then linked to Wikipedia or plotted on a map — the “elaborate upon and visualize parts-of-speech” service.

This short list of software can be used to create a myriad of enhanced library services and tools, but the specific pieces of software listed above are not so important in and of themselves. Instead, they represent types of software which already exist and are freely available for use by anybody. Services against texts facilitating use & understand can be implemented with a wide variety of software applications. The services against texts outlined in this proposal are not limited to the software listed in this section.

Implementation how-to’s

Putting into practice the services against texts described in this proposal would not be a trivial task, but the process is entirely feasible. This section outlines a number of implementation how-to’s.

Measurement services

The measurement services (size, readability, and concept) would ideally be done against texts as they were added to the collection. The actual calculation of the size and readability scores is not difficult. All that is needed is the full text of the documents and software to do the counting. (Measuring concepts necessitates additional work since TFIDF requires a knowledge of the collection as a whole; measuring concepts can only be done once the bulk of the collection has been built. Measuring concepts is also a computationally intensive process.)
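To make the dependence on the whole collection concrete, here is a hedged Python sketch of a TFIDF score. It assumes the common formulation (relative term frequency multiplied by the logarithm of the inverse document frequency); the parameter names are ours, and documents_with_term plays the role of the value f obtained from the index.

```python
import math

def tfidf(term_count, total_terms, total_documents, documents_with_term):
    """Score = (term_count / total_terms) * log(total_documents / documents_with_term).

    Assumption: this is the common textbook formulation, offered as a sketch.
    """
    if total_terms == 0 or documents_with_term == 0:
        return 0.0
    return (term_count / total_terms) * math.log(total_documents / documents_with_term)
```

The first two arguments come from the document alone, but the last two can only be known once the collection has been indexed, which is why this measurement has to wait until the bulk of the collection is built.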

Instead, the challenge includes denoting locations to store the metadata, deciding whether or not to index the metadata, and figuring out how to display the metadata to the reader. The measurements themselves will be integers or decimal numbers. If MARC were the container for the bibliographic data, then any one of a number of local notes could be used for storage. If a relational database were used, then additional fields could be used. If the DPLA wanted to enable the reader to limit or sort search results by any of the measurements, then the values will need to be indexed. We would be willing to guess the underlying indexer for the DPLA will be Solr, since it seems to be the current favorite. Indexing the measurements in Solr will be as easy as adding the necessary fields to a Solr configuration file, and adding the measurements to the fields as the balance of the bibliographic data is indexed. We would not suggest creating any visualizations of the measurements ahead of time, but rather on-the-fly and only as they were needed; the visualizations could probably be implemented using Javascript and embedded into the DPLA’s “catalog”.

Timeline services

Like the measurements, plotting the publication dates or dates of conception on a timeline can be implemented using Javascript and embedded into the DPLA’s “catalog”. For serial literature (blogs, open access journal articles, Twitter feeds, etc.) the addition of meaningful dates will have already been done. For more traditional library catalog materials (books), the addition of dates of conception will be labor intensive. Therefore such a thing might not be feasible. On the other hand, this might be a great opportunity to practice a bit of crowdsourcing. Consider making a game out of the process, and try to get people outside the DPLA to denote when Plato, Thoreau, Longfellow, and Whitman wrote their great works.

Frequency, concordance, proximity, and locations in a text services

Implementing the frequency, concordance, proximity, and locations in a text services requires no preprocessing. Instead these services can all be implemented on-the-fly by a program linked from the DPLA’s “catalog”. These services will require a single argument (a unique identifier) and some optional input parameters. Given a unique identifier, the program can look up basic bibliographic information from the catalog including the URL where the full-text resides, retrieve the full-text, and do the necessary processing. This URL could point to the local file system, or, if the network were deemed fast and reliable, the URL could point to the full-text in remote repositories such as the Internet Archive or the HathiTrust. These specific services against texts have been implemented in the Catholic Research Resources Alliance “Catholic Portal” application using “Analyze using text mining techniques” as the linked text. This is illustrated below:

Screen shot of the “Catholic Portal”

By the middle of September 2011 we expect the Hesburgh Libraries at the University of Notre Dame will have included very similar links in their catalog and “discovery system”. These links will provide access to frequency, concordance, and locations in a text services for sets of digitized Catholic pamphlets.
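The on-the-fly flow described in this section (identifier in, catalog look-up, full-text retrieval, processing) might be sketched as follows. Everything named here is hypothetical: the catalog is a plain dictionary standing in for a real catalog look-up, and the record keys “title” and “url” are assumptions made for illustration.

```python
from urllib.request import urlopen

def fetch_text(url):
    """Retrieve full text from a file:// or http(s):// URL."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def analyze(identifier, catalog, service, fetch=fetch_text, **options):
    """Look up a record, retrieve its full text, and run one service against it."""
    record = catalog[identifier]   # raises KeyError for unknown identifiers
    text = fetch(record["url"])    # local file system or remote repository
    return {"title": record["title"], "result": service(text, **options)}
```

Because the service is passed in as a function, the same entry point could serve frequency, concordance, proximity, and location requests, with the optional input parameters forwarded as keyword arguments.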

Parts-of-speech services

Based on our experience, the parts-of-speech services will require pre-processing. This is because the process of classifying words into categories of parts-of-speech is a time- and computing-intensive process. It does not seem feasible to extract the parts-of-speech from a document in real time.

To overcome this limitation, we classified our small sample of texts and saved the results in easily parsable text files. Our various scripts were then applied against these surrogates as opposed to the original documents. It should be noted that these surrogates were not only computationally expensive to create but also expensive in terms of disk space, consuming more than double the space of the originals.

We suggest one or two alternative strategies for the DPLA. First, determine what particular items from the DPLA’s collection may be the more popular. Once determined, have those items pre-processed, outputting the surrogate files. These pre-processed items can then be used for demonstration purposes and generate interest in the parts-of-speech services. Second, when readers want to use these services against items that have not been pre-processed, then have the readers select their items, supply an email address, process the content, and notify the readers when the surrogates have been created. This second approach is akin to the just-in-time approach to collection development as opposed to the just-in-case philosophy.

Priorities

Obviously, we think all of the services against texts outlined above are useful, but practically speaking, it is not feasible to implement all of them at once. Instead we advocate the following phased approach:

Word/phrase frequency, concordance, proximity, and locations in a text services – We suggest these services be implemented first, mostly because they can be written outside any “discovery system” hosted by the DPLA. Second, these services are the root of many of the other services, so it will be easier to build the others once these have been made available.
Measurements of size and readability – Calculating the values of size and readability on-the-fly is possible but is limiting in functionality. Pre-processing these values is relatively easy, and incorporating the result into the “discovery system” has many benefits. This is why we see these two services as the second highest priority.
Plot dates of publication on a timeline – Plotting dates will be easy enough if the content in question is of a serial nature and the dates represent “dates of conception”. But we are not sure content of a serial nature (blog postings, open access journal literature, Twitter feeds, etc.) will be included in the DPLA’s collection. Consequently, we suggest this service be implemented third.
Parts-of-speech analysis – Implementing services based on parts-of-speech will almost certainly require pre-processing and increase local storage requirements. While these costs are within the DPLA’s control, they are expenses that may inhibit implementation feasibility. That is why they are listed fourth in the priority order.
After crowdsourcing the content, plot dates of conception on a timeline – We think this is one of the easier and more interesting services, especially if the dates in question are “dates of conception” for books, but alas, this data is not readily available. After figuring out how to acquire dates of conception for traditional catalog-like material (through something like crowdsourcing), implementing this service may be very enlightening.
Measure ideas – This is probably the most avant-garde service described in the proposal. Its implementation can only be done after the bulk of the DPLA’s collection has been created. Furthermore, calculating TFIDF for a set of related keywords is computationally expensive. This can be a truly useful and innovative service, especially if the reader were able to create a personal concept for comparison. But because of the time and expense, we advocate this service be implemented last.

Quick links

This section lists most of the services outlined in the proposal as well as links to blog postings and example implementations.

Word frequencies, concordances

These URLs point to services generating word frequencies, concordances, histograms illustrating word locations, and network diagrams illustrating word proximities for Walden and Ulysses.

Word/phrase locations

Using the text mining techniques built into the “Catholic Portal” the reader can see where the words/phrases “catholic”, “lake erie”, and “niagara falls” are used in the text.

http://www.catholicresearch.net/concordances/?id=tormarc_lettersofirishca00iris

Proximity displays

Using network diagrams, the reader can see what words Thoreau uses “in the same breath” when he mentions the word “woodchuck”. These proximity displays are also incorporated into just about every item in the Alex Catalogue.

./../2011/01/visualizing-co-occurrences-with-protovis/index.html

Plato, Aristotle, and Shakespeare

This blog posting first tabulates the most frequently used words by the authors, as well as their definitions of “man” and a “good man”.

./../2010/06/the-next-next-generation-library-catalog/index.html

Catholic Portal

The “Portal” is a collection of rare, uncommon, and infrequently held materials brought together to facilitate Catholic studies. It includes some full text materials, and they are linked to text mining services.

http://www.catholicresearch.net/Record/tormarc_lettersofirishca00iris

Measuring size

In this blog posting a few works by Charles Dickens are compared & contrasted. The comparisons include size and word/phrase usage.

./../2010/12/text-mining-charles-dickens/index.html

Plot on a timeline

This blog posting describes how a timeline was created by plotting the publication dates of RSS feeds.

./../2010/12/mits-simile-timeline-widget/index.html

Lookup in Wikipedia and plot on a map

After extracting the names and places from a text, this service grabs Linked Data from DBpedia, displays a descriptive paragraph, and allows the reader to look the name or place up in Wikipedia and/or plot it on a world map. This service is specifically designed for mobile devices.

http://dh.crc.nd.edu/sandbox/ner/mobile.html

Parts-of-speech analysis

This blog posting elaborates on how various parts of speech were used in a number of selected classic works.

./../2011/02/forays-into-parts-of-speech/index.html

Measuring ideas

The “greatness” of the Great Books was evaluated in a number of blog postings, and the two listed here give a good overview of the methodology.

Summary

In our mind, the combination of digital humanities computing techniques, like all the services against texts outlined above, and the practices of librarianship would be a marriage made in heaven. By supplementing the DPLA’s collections with full text materials and then enhancing its systems to facilitate text mining and natural language processing, the DPLA can not only make it easier for readers to find data and information, but it can also make that data and information easier to use & understand.

We know the ideas outlined in this proposal are not typical library functions. But we also apprehend the need to take into account the changing nature of the information landscape. Digital content lends itself to a myriad of new possibilities. We are not saying analog forms of books and journals are antiquated or useless. No, far from it. Instead, we believe the library profession has figured out pretty well how to exploit and take advantage of that medium and its metadata. On the other hand, the possibilities for full text digital content are still mostly unexplored and represent a vast untapped potential. Building on and expanding on the education mission of libraries, services against texts may be a niche the profession, and the DPLA, can help fill. The services & tools described in this proposal are really only examples. Any number of additional services against texts could be implemented. We are only limited by our ability to think of action words denoting the things people want to do with texts once they find & get them. By augmenting a library’s traditional functions surrounding collection and services with the sorts of things described above, the role of libraries can expand and evolve to include use & understand.

About the author

Eric Lease Morgan considers himself to be a librarian first and a computer user second. His professional goal is to discover new ways to use computers to provide better library service. He has a BA in Philosophy from Bethany College in West Virginia (1982), and an MIS from Drexel University in Philadelphia (1987).

While he has been a practicing librarian for more than twenty years he has been writing software for more than thirty. He wrote his first library catalog in 1989, and it won him an award from Computers in Libraries Magazine. In a reaction to the “serials pricing crisis” he implemented the Mr. Serials Process to collect, organize, archive, index, and disseminate electronic journals. For these efforts he was awarded the Bowker/Ulrich’s Serials Librarianship Award in 2002. An advocate of open source software and open access publishing since before the phrases were coined, just about all of his software and publications are freely available online. One of his first pieces of open source software was a database-driven application called MyLibrary, a term which has become a part of the library vernacular.

As a member of the LITA/ALA Top Technology Trends panel for more than ten years, as well as the owner/moderator of a number of library-related mailing lists (Code4Lib, NGC4Lib, and Usability4Lib), Eric has his fingers on the pulse of the library profession. He coined the phrase “‘next-generation’ library catalog”. More recently, Eric has been applying text mining and other digital humanities computing techniques to his Alex Catalogue of Electronic Texts which he has been maintaining since 1994. Eric relishes all aspects of librarianship. He even makes and binds his own books. In his spare time, Eric plays blues guitar and Baroque recorder. He also enjoys folding origami, photography, growing roses, and fishing.

Raising awareness of open access publications

Eric Lease Morgan — Tue, 02 Aug 2011 15:51:44 +0000

I was asked the other day about ways to make people aware of open access journal publications, and this posting echoes much of my response.

Thanks again for taking the time this morning to discuss some of the ways open-access journals are using social media and other technology to distribute content and engage readers. I am on the board of [name deleted] recently transitioned to an open access format, and we are looking to maximize the capabilities of this new, free, and on-line format. To that end, any additional insights you might be able to share about effective social media applications for open-access sources, or other exemplary electronic journals you may be able to recommend, would be most helpful.

As you know, I have not been ignoring you as much as I have been out of town. Thank you for your patience.

I am only able to share my personal experiences here, and they are not intended to be standards of best practices. Yet, here are some ideas:

Exploit RSS – RSS is an XML technology used to syndicate content. It is the foundation of blogs. Do what you can to make sure your journal content is syndicated via RSS. This way people can “subscribe” to your journal and they will get alerts when new content becomes available.
Create a mailing list – On your journal’s site, allow people to submit their email addresses. Keep these email addresses in a list (database) and when new issues of your journal are created, send messages to the people in the list. Do not use the list for any other purpose.
Advertise – Identify mailing lists where discussions take place surrounding the topic of your journal. When your journal creates new issues, send a table of contents sort of message to the mailing lists.
Blog about your journal – If you or any of your colleagues who edit the journal blog, then write up things you find interesting in your journal in your blog. As long as your write-ups are sincere, people will not see this sort of thing as self-promotion.
Use Facebook & Twitter – Do you and your editorial colleagues use Facebook or Twitter? Maybe your journal can have a Facebook page and/or a Twitter account. In either case, post messages about your journal on social networks.
Exploit SEO – SEO is code for “search engine optimization” which itself is code for “make it easy for Google to crawl your site”. If Google can easily crawl your site, then your content will more likely appear in Google search results, and therefore you will get more exposure.
Be regular – Publishing serial publications (blogs, journal articles, etc.) is difficult, but I believe your readers will build up trust for you if you make content available on a consistent basis. Otherwise, I think your publication will lose credibility.
Make your content searchable – When people come to your website, make sure people can easily search & browse the back files. People will say, “I remember seeing an article on that topic at… I wonder if I can find it again?” Put another way, make sure your website is “usable”.
Allow for comments – While the articles you publish go through some sort of review, make it possible for the readership to comment as well. We no longer live in isolation, nor are we governed by the centralized elite. It is increasingly about the wisdom of the crowd.
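
As an illustration of the first suggestion, here is a minimal sketch of how a journal’s table of contents might be turned into an RSS 2.0 feed using only Python’s standard library. The build_rss function and the shape of the article dictionaries are assumptions made for the example. Journal publishing software will typically generate feeds like this for you.

```python
import xml.etree.ElementTree as ET

def build_rss(title, link, description, articles):
    """Return an RSS 2.0 document (as a string) describing the given articles.

    articles is a list of dictionaries with "title" and "link" keys;
    that shape is an assumption of this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link, and description are the required channel elements
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["link"]
    return ET.tostring(rss, encoding="unicode")
```

Saving the result at a stable URL on the journal’s site and linking to it from the home page is usually enough for feed readers to “subscribe”.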

The right software makes many of the tasks I outlined easier. I suggest you take a look at Open Journal Systems.

Good luck, and I commend you for going the open access route.

Poor man’s restoration

Eric Lease Morgan — Mon, 25 Jul 2011 15:05:28 +0000

This posting describes a poor man’s restoration process.

Yesterday, I spent about an hour and a half writing down a work/professional to-do list intended to span the next few months. I prioritized things, elaborated on things, and felt like I had the good beginnings of an implementable plan.

I put the fruits of my labors into my pocket and then went rowing around in my boat. After my swim and on the way back to the dock I realized my to-do list was still in my pocket. Sigh. After pulling it out and seeing the state it was in, I decided to try to salvage it. Opening it up was difficult. Naturally, the paper tore, but I laid it down as flat as I could. I went home to get a few pieces of paper to support and sandwich my soaked to-do list. For the next few hours, as the paper dried in the hot weather we are experiencing, I continually flipped and turned the to-do list so it would not stick to its supports.

Page #1

Page #2

This morning, after the list was as dry as it was going to be, I photographed both sides of it, did my best to color-correct the image, converted the whole thing into a PDF file, and printed the result. While it looks like heck, the time I spent salvaging my intellectual efforts was much shorter than the time I would have spent recreating the list. Like a blues, such recreations are never exactly the same as the originals. But it would have been a whole lot better if I hadn’t gone swimming with my to-do list in the first place.

I might not have done this restoration process in the “best” way, but that does not detract from the effort itself. I really do enjoy all aspects of library work.

My DPLA Beta-Sprint Proposal: The movie

Eric Lease Morgan — Fri, 22 Jul 2011 18:29:44 +0000

Please see my updated and more complete Digital Public Library of America Beta-Sprint Proposal. The following posting is/was a precursor.

The organizers of the Digital Public Library of America asked the Beta-Sprint Proposers to create a video outlining the progress of their work. Below is the script of my video as well as the video itself. Be gentle with me. Video editing is difficult.

Introduction

My name is Eric Morgan. I am a Digital Projects Librarian here at the University of Notre Dame, and I am going to outline, ever so briefly, my Digital Public Library of America Beta-Sprint Proposal. In a nutshell, the Proposal describes, illustrates, and demonstrates how the core functionality of a library can move away from “find & get” and towards “use & understand”.

Find & get

With the advent of ubiquitous and globally networked computers, the characteristics of data and information have fundamentally changed. More specifically, things like books and journals — the traditional meat and potatoes of libraries — no longer need to be manifested in analog forms, and their digital manifestations lend themselves to new functionality. For example, digital versions of books and journals can be duplicated exactly, and they are much less limited to distinct locations in space and time. This, in turn, has made things like the search engines of Google, Yahoo, and Microsoft a reality. Compared to twenty years ago, this has made the problem of find & get much less acute. While the problem of find & get will never completely be resolved, many people feel the problem is addressed simply enough. Enter a few words into a search box, click Go, and select items of interest.

Use & understand

The problem of find & get is always a means to an end, and not the end itself. People want to do things with the content they find. I call these things “services against texts” and they are denoted by action verbs such as analyze, annotate, cite, compare & contrast, confirm, delete, discuss, evaluate, find opposite, find similar, graph & visualize, learn from, plot on a map, purchase, rate, read, review, save, share, summarize, tag, trace idea, or transform. Thus, the problem of find & get is evolving into the problem of use & understand. I ask myself, “What services can be provisioned to make sense of all the content one finds on the Internet or in a library?” In my opinion, the answer lies in the combination of traditional library principles and the application of computer science. Because libraries are expected to know the particular information needs of their constituents, libraries are uniquely positioned to address the problem of use & understand. Not Google. Not Yahoo. Not Microsoft.

Examples

How do we go about doing this? We begin by exploiting the characteristics of the increasingly available full text content. Instead of denoting the length of a book by the number of pages it contains, we measure it by the number of words. Thus, we will be able to unambiguously compare & contrast the lengths of documents. By analyzing the lengths of paragraphs, the lengths of sentences, and the lengths of words in a document, we will be able to calculate readability scores, and we will be better able to compare & contrast the intended reading levels of a book or article. By tabulating the words or phrases in multiple documents and then comparing those tabulations with each other, libraries will make it easier for readers to learn about the similarities and differences between items in a corpus. Such a service will enable people to answer questions like, “How does the use of the phrase ‘good man’ differ between Plato, Aristotle, and Shakespeare?” If there were tools aware of the named people and places in a document, then a reader’s experience could be enriched with dynamic annotations and plots on a world map. Our ability to come up with ideas for additional services against texts is only limited by our imagination and our ability to understand the information needs of our clientele. My Beta Sprint Proposal demonstrates how many of these ideas can be implemented today and with the currently available technology.
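
To make the idea of measuring size and readability concrete, here is a minimal sketch. It uses the Automated Readability Index, one of several published readability formulas, because it depends only on the lengths of words and sentences. The choice of formula and the simple rules for splitting words and sentences are assumptions of this example, not necessarily what the Proposal’s own tools use.

```python
import re

def measure(text):
    """Return (number of words, number of sentences, number of letters and digits)."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # A crude rule: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(ch.isalnum() for word in words for ch in word)
    return len(words), len(sentences), characters

def automated_readability_index(text):
    """Approximate the grade level needed to read the text."""
    words, sentences, characters = measure(text)
    if words == 0 or sentences == 0:
        return 0.0
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
```

Longer words and longer sentences both push the score up, so two documents can be compared by their word counts and by their scores rather than by their page counts.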

Thank you

Thank you for the opportunity to share some of my ideas about the Digital Public Library of America, my Beta Sprint Proposal, and the role of libraries in the near future.

DPLA Beta Sprint Submission

Eric Lease Morgan — Mon, 20 Jun 2011 20:38:26 +0000

I decided to give it a whirl and participate in the DPLA Beta Sprint, and below is my submission:

DPLA Beta Sprint Submission

My DPLA Beta Sprint submission will describe and demonstrate how the digitized versions of library collections can be made more useful through the application of text mining and various other digital humanities computing techniques.

Full text content abounds, and full text indexing techniques have matured. While the problem of discovery will never be completely solved, it is much less acute than it was even a decade ago. Whether the library profession or academia believes it or not, most people do not feel as if they have a problem finding data, information, and knowledge. To them it is as easy as entering a few words or phrases into a search box and clicking Go.

It is now time to move beyond the problem of find and spend increased efforts trying to solve the problem of use. What does one do with all the information they find and acquire? How can it be put into the context of the reader? What actions can the reader apply against the content they find? How can it be compared & contrasted? What makes one piece of information — such as a book, an article, a chapter, or even a paragraph — more significant than another? How might the information at hand be used to solve problems or create new insights?

There is no single answer to these questions, but this submission will describe and demonstrate one set of possibilities. It will assume the existence of full text content of just about any type, such as books from the Internet Archive, open access journals, or blog postings. It will outline how these texts can be analyzed to find patterns, extract themes, and identify anomalies. It will describe how entire corpora or search results can be post-processed to not only refine the discovery process but also make sense of the results and enable the reader to quickly grasp the essence of textual documents. Since actions speak louder than words, this submission will also present a number of loosely joined applications demonstrating how this analysis can be implemented through Web browsers and/or portable computing devices such as tablet computers.

By exploiting the current environment — full text content coupled with ubiquitous computing horsepower — the DPLA can demonstrate to the wider community how libraries can remain relevant in the current century. This submission will describe and demonstrate a facet of that vision.

Next-generation library catalogs, or ‘Are we there yet?’

Eric Lease Morgan — Wed, 01 Jun 2011 14:39:40 +0000

Next-generation library catalogs are really indexes, not catalogs, and increasingly the popular name for such things is “discovery system”. Examples include VuFind, Primo combined with Primo Central, Blacklight, Summon, and to a lesser extent Koha, Evergreen, OLE, and XC. While this may be a well-accepted summary of the situation, I really do not think it goes far enough. Indexers address the problem of find, but in my opinion, find is not the problem to be solved. Everybody can find. Most people believe Google has all but solved that problem. Instead, the problem to solve is use. Just as much as people want to find information, they want to use it, to put it into context, and to understand it. With the advent of so much full text content, the problem of find is much easier to solve than it used to be. What is needed is a “next-generation” library catalog including tools and interfaces designed to make the use and understanding of information easier. Both the “Catholic Portal” and the discovery systems of the Hesburgh Libraries at the University of Notre Dame are beginning to implement some of these ideas. When it comes to “next-generation” library catalogs we might ask the question, “Are we there yet?”. I think the answer is, “No, not yet.”

This text was originally written for a presentation to the Rare Books and Manuscripts Section of the American Library Association during a preconference meeting, June 23, 2011. It is available in a number of formats including this blog posting, a one-page PDF document intended as a handout, and an ePub file.

Numbers of choices

There are currently a number of discovery systems from which a library can choose, and it is very important to note that they have more things in common than differences. VuFind, Primo combined with Primo Central, Summon, and Blacklight are all essentially indexer/search engine combinations. Even more, they all use the same “free” and open source software, Lucene, at their core. All of them take some sort of bibliographic data (MARC, EAD, metadata describing journal articles, etc.), stuff it into a data structure (made up of authors, titles, key words, and control numbers), index it in the way the information retrieval community has been advocating for at least the past twenty years, and finally, provide a way to query the index with either one-box-one-button or fielded interfaces. Everything else (facets, cover art, reviews, favorites, etc.) is window dressing. When and if any sort of OCLC/EBSCOHost combination manifests itself, I’m sure the underlying technology will be very similar.
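
The indexer/search engine pattern described above can be illustrated with a toy inverted index. This sketch is in no way how Lucene itself is implemented. The TinyIndex class and its methods are hypothetical, and the example only shows the general shape: fielded records go in, and both one-box and fielded queries come out.

```python
import re
from collections import defaultdict

class TinyIndex:
    """A toy inverted index mapping (field, word) pairs to record identifiers."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.fields = set()

    def add(self, record_id, record):
        """Index every word of every field of a bibliographic record."""
        for field, value in record.items():
            self.fields.add(field)
            for word in re.findall(r"\w+", value.lower()):
                self.postings[(field, word)].add(record_id)

    def search(self, query, field=None):
        """Return identifiers of records containing all of the query words.

        With field=None this behaves like a one-box-one-button search;
        with a field name it behaves like a fielded search.
        """
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        fields = [field] if field else sorted(self.fields)
        results = None
        for word in words:
            hits = set()
            for name in fields:
                hits |= self.postings.get((name, word), set())
            results = hits if results is None else results & hits
        return sorted(results)

index = TinyIndex()
index.add("1", {"title": "Walden", "author": "Henry David Thoreau"})
index.add("2", {"title": "Ulysses", "author": "James Joyce"})
index.add("3", {"title": "Civil Disobedience", "author": "Henry David Thoreau"})
```

Real indexers add relevancy ranking, stemming, and much more, but the underlying data structure is similar in spirit.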

Koha, Evergreen, and OLE (Open Library Environment) are more traditional integrated library systems. They automate traditional library processes. Acquisitions. Cataloging. Serials Control. Circulation. Etc. They are database applications, not indexers, designed to manage an inventory. Search — the “OPAC” — is one of these processes. The primary difference between these applications and the integrated library systems of the recent past is their distribution mechanism. Koha and Evergreen are open source software, and therefore as “free as a free kitten”. OLE is still in development, but will be distributed as open source. Everything else is/was licensed for a fee.

When talking about “next-generation” library catalogs and “discovery systems”, many people allude to the Extensible Catalog (XC), which is neither a catalog nor an index. More accurately, it is a system enabling and empowering the library community to manage and transform its bibliographic data on a massive scale. It offers ways for a library to harvest content from OAI-PMH data repositories (such as library catalogs), do extensive find/replace or enhancement operations against the harvested data, expose the result via OAI-PMH again, and finally, support the NCIP protocol so the circulation status of items found in an index can be determined. XC is middleware designed to provide functionality between an integrated library system and a discovery system.
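
For readers unfamiliar with OAI-PMH, harvesting boils down to a series of HTTP requests against a repository’s base URL. The sketch below only builds the URLs for the ListRecords verb and is not taken from XC. The function name and the example.org address are hypothetical, while the verb and argument names come from the OAI-PMH protocol.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build the URL for an OAI-PMH ListRecords request.

    A resumptionToken, returned by the repository when a result is
    incomplete, is an exclusive argument and is sent on its own with the verb.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# The first request of a harvest against a hypothetical repository.
first_request = list_records_url("http://example.org/oai")
```

A harvester repeats the request with each resumption token it receives until the repository stops returning one.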

Find is not the problem

With the availability of wide-spread full text indexing, the need to organize content according to a classification system, to catalog items, has diminished. This need is not negated, but it is not as necessary as it used to be. In the past, without the availability of wide-spread full text indexing, classification systems provided two functions: 1) to organize the collection into a coherent whole with sub-parts, and 2) to surrogate physical items enumerated in a list. The aggregate of metadata elements (whether they be titles, authors, contributors, key words, subject terms, etc.) acted as “dummies” for the physical item containing the information. They are/were pointers to the book, the journal article, the piece of sheet music, etc. With the advent of wide-spread full text indexing, these two functions are not needed as much as they were in the past. Through the use of statistical analysis and direct access to the thing itself, indexers/search engines make the organization and discovery of information easier and less expensive. Note, I did not say “better”, just simpler and with greater efficiency.

Because wide-spread full text indexing abounds, the problem of find is not as acute as it used to be. In my opinion, it is time to move away from the problem of find and towards the problem of use. What does a person do with the information once they find and acquire it? Does it make sense? Is it valid? Does it have a relationship to other things, and if so, then what is that relationship and how does it compare? If these relationships are explored, then what new knowledge might one uncover, or what existing problem might be solved? These are the questions of use. Find is a means to an end, not the end itself. Find is a library problem. Use is the problem everybody else wants to solve.

True, classification systems provide a means to discover relationships between information objects, but the predominant classification systems and processes employed today are pre-coordinated and maintained by institutions. As such they posit realities that may or may not match the cognitive perception of today’s readers. Moreover, they are manually applied to information objects. This makes the process literally slow and laborious. Compared to post-coordinated and automated techniques, the manual process of applying classification to information objects is deemed expensive and of diminishing practical use. Put another way, the application of classification systems against information objects today is like icing on a cake, leather trim in a car, or a cherry on an ice cream sundae. They make their associated things richer, but they are not essential to their core purpose. They are extra.

Text mining

Through the use of a process called text mining, it is possible to provide new services against individual items in a collection as well as to collections as a whole. Such services can make information more useful.

Broadly defined, text mining is an automated process for analyzing written works. Rooted in linguistics, it makes the assumption that language — specifically written language — adheres to sets of loosely defined norms, and these norms are manifested in combinations of words, phrases, sentences, lines of a poem, paragraphs, stanzas, chapters, works, corpora, etc. Additionally, linguistics (and therefore text mining) also assumes these manifestations embody human expressions, meanings, and truth. By systematically examining the manifestations of written language as if they were natural objects, the expressions, meanings, and truths of a work may be postulated. Such is the art and science of text mining.

The process of text mining begins with counting, specifically, counting the number of words (n) in a document. This results in a fact — a given document is n words long. By comparing n across a given corpus of documents, new facts can be derived, such as one document is longer than another, shorter than another, or close to an average length. Once words have been counted they can be tallied. The result is a list of words and their associated frequencies. Some words occur often. Others occur infrequently. The examination of such a list tells a reader something about the given document. The comparison of frequency lists between documents tells the reader even more. By comparing the lengths of documents, the frequency of words, and their existence in an entire corpus a reader can learn of the statistical significance of given words. Thus, the reader can begin to determine the “aboutness” of a given document. This rudimentary counting process forms the heart of most relevancy ranking algorithms of indexing applications and is called “term frequency inverse document frequency” or TFIDF.
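
The counting and tallying described above can be sketched in a few lines. TFIDF has many variants. The one below (relative term frequency multiplied by the natural logarithm of the inverse document frequency) is only one common formulation, chosen here for simplicity. The function names and the tiny corpus are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf(corpus):
    """Score every word of every document in a corpus.

    corpus maps a document name to its text; the result maps each
    document name to a dictionary of word -> TFIDF score.
    """
    tokens = {name: tokenize(text) for name, text in corpus.items()}
    # Document frequency: in how many documents does each word appear?
    document_frequency = Counter()
    for words in tokens.values():
        document_frequency.update(set(words))
    size = len(corpus)
    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)   # the tally of words
        length = len(words)       # n, the number of words in the document
        scores[name] = {
            word: (count / length) * math.log(size / document_frequency[word])
            for word, count in counts.items()
        }
    return scores

corpus = {
    "a": "the pond the woods",
    "b": "the city the streets",
}
scores = tfidf(corpus)
```

In this toy corpus the word “the” appears in every document and therefore scores zero, while “pond” appears in only one and scores higher, which is the sense in which the calculation hints at “aboutness”.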

Not only can words be tallied but they can be grouped into different parts-of-speech (POS): nouns, pronouns, verbs, adjectives, adverbs, prepositions, function (“stop”) words, etc. While it may be interesting to examine the proportional use of each POS, it may be more interesting to examine the individual words in each POS. Are the personal pronouns singular or plural? Are they feminine or masculine? Are the names of places centered around a particular geographic location? Do these places exist in the current time, a time in the past, or a time in the future? Compared to other documents, is there a relatively higher or lower use of color words, action verbs, names of famous people, or sets of words surrounding a particular theme? Knowing the answers to these questions can be quite informative. Just as these processes can be applied to words they can be applied to phrases, sentences, paragraphs, etc. The results can be charted, graphed, and visualized. They can be used to quickly characterize single documents or collections of documents.
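
A full parts-of-speech analysis requires a trained tagger, but the questions about personal pronouns can be approximated with a simple word list. The sketch below is deliberately crude. The CATEGORIES lexicon is hand-made for this example, it ignores ambiguous words such as “you”, and it counts “her” as a pronoun even when it is used as a possessive.

```python
import re
from collections import Counter

# A small, hand-made lexicon; an assumption of this sketch.
CATEGORIES = {
    "singular": {"i", "me", "he", "him", "she", "her", "it"},
    "plural": {"we", "us", "they", "them"},
    "feminine": {"she", "her"},
    "masculine": {"he", "him"},
}

def pronoun_profile(text):
    """Tally personal pronouns by number and by gender."""
    profile = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for label, members in CATEGORIES.items():
            if word in members:
                profile[label] += 1
    return profile

profile = pronoun_profile("She gave him the book, and they thanked her.")
```

Comparing such profiles across documents is one quick way to characterize them, although the results are indicators to be interpreted rather than facts.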

The results of text mining processes are not to be taken as representations of truth, any more than the application of Library of Congress Subject Headings completely denotes the aboutness of a text. Text mining builds on the inherent patterns of language, but language is fluid and ambiguous. Therefore the results of text mining lend themselves to interpretation. The results of text mining are intended to be indicators, guides, and points of reference, and all of these things are expected to be interpreted and then used to explain, describe, and predict. Nor is text mining intended to be a replacement for the more traditional process of close reading. The results of text mining are akin to a book’s table of contents and back-of-the-book index. They outline, enumerate, and summarize. Text mining does the same. It is a form of analysis and a way to deal with information overload.

Assuming the availability of increasing numbers of full text information objects, a library’s “discovery system” could easily incorporate text mining for the purposes of enhancing the traditional cataloging process as well as increasing the usefulness of found material. In my opinion, this is the essence of a true “next-generation” library catalog.

Two examples

An organization called the Catholic Research Resources Alliance (CRRA) brings together rare, uncommon, and infrequently held materials into a thing colloquially called the “Catholic Portal”. The content for the Portal comes from a variety of metadata formats (MARC, EAD, and Dublin Core) harvested from participating member institutions. Besides supporting the Web 2.0 features we have all come to expect, it also provides item level indexing of finding aids, direct access to digitized materials, and concordancing services. The inclusion of concordance features makes the Portal more than the usual discovery system.

For example, St. Michael’s College at the University of Toronto is a member of the CRRA. They have been working with the Internet Archive for a number of years, and consequently measurable portions of their collection have been digitized. After being given hundreds of Internet Archive unique identifiers, a program was written which mirrored digital content and bibliographic descriptions (MARC records) locally. The MARC records were ingested into the Portal (an implementation of VuFind), and search results were enhanced to include links to both the locally mirrored content as well as the original digital surrogate. In this way, the Portal is pretty much just like any other discovery system. But the bibliographic displays go further because they contain links to text mining interfaces.

The “Catholic Portal”

Through these interfaces, the reader can learn many things. For example, in a book called Letters Of An Irish Catholic Layman the word “catholic” is one of the most frequently used. Using the concordance, the reader can see that “Protestants and Roman Catholics are as wide as the poles asunder”, and “good Catholics are not alarmed, as they should be, at the perverseness with which wicked men labor to inspire the minds of all, but especially of youth, with notions contrary to Catholic doctrine”. This is no big surprise, but instead a confirmation. (No puns intended.) On the other hand, some of the statistically most significant two-word phrases are geographic identities (“upper canada”, “new york”, “lake erie”, and “niagara falls”). This is interesting because such things are not denoted in the bibliographic metadata. Moreover, a histogram plotting where in the document “niagara falls” occurs can be juxtaposed with a similar histogram for the word “catholic”. Why does the author talk about Catholics when they do not talk about upstate New York? Text mining makes it easier to bring these observations to light in a quick and easy-to-use manner.

Concordance highlighting geographic two-word phrases

Where the word “catholic” is located in the text

Where “niagara falls” is located in the text
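To make the two techniques a bit more concrete, here is a minimal sketch in Python of a concordance (keyword-in-context listing) and of a histogram showing where a phrase occurs in a text. It is not the code behind the Portal; the function names, the window width, the number of bins, and the toy sample text are all assumptions made for illustration.

```python
import re

def concordance(text, query, width=30):
    """Return each occurrence of query with `width` characters of context."""
    lines = []
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        lines.append(text[start:end].replace("\n", " "))
    return lines

def histogram(text, query, bins=10):
    """Count occurrences of query in each of `bins` equal slices of the text."""
    counts = [0] * bins
    length = max(len(text), 1)
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        counts[min(match.start() * bins // length, bins - 1)] += 1
    return counts

# Toy text, invented for illustration only
sample = (
    "Good Catholics are not alarmed. " * 3
    + "We travelled past Niagara Falls toward Lake Erie. " * 2
)

for line in concordance(sample, "niagara falls"):
    print(line)
print(histogram(sample, "catholic", bins=2))
print(histogram(sample, "niagara falls", bins=2))
```

Juxtaposing the two histograms is what surfaces the observation that one topic appears where the other does not.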

Some work being done in the Hesburgh Libraries at the University of Notre Dame is in the same vein. Specifically, the Libraries is scanning Catholic pamphlets, curating the resulting TIFF images, binding them together to make PDF documents, embedding the results of OCR (optical character recognition) into the PDFs, saving the PDFs on a Web server, linking to the PDFs from the catalog and discovery system, and finally, linking to text mining services from the catalog and discovery system. Consequently, once found, the reader will be able to download a digitized version of a pamphlet, print it, read it in the usual way, and analyze it for patterns and meanings in ways that may have been overlooked through the use of traditional analytic methods.

Are we there yet?

Are we there yet? Has the library profession solved the problem of “next-generation” library catalogs and discovery systems? In my opinion, the answer is, “No.” To date the profession continues to automate its existing processes without truly taking advantage of computer technology. The integrated library systems are more open than they used to be. Consequently control over the way they operate is being transferred from vendors to the library community. The OPACs of yesterday are being replaced with the discovery systems of today. They are easier to use and better meet readers’ desires. They are not perfect. They are not catalogs. But they do make the process of find more efficient.

On the other hand, our existing systems do not take advantage of the current environment. They do not exploit the wide array and inherent functionality of available full text literature. Think of the millions of books freely available from the Internet Archive, Google Books, the HathiTrust, and Project Gutenberg. Think of the thousands of open access journal titles. Think about all the government documents, technical reports, theses & dissertations, conference proceedings, blogs, wikis, mailing list archives, and even “tweets” freely available on the Web. Even without the content available through licensing, this content has the makings of a significant library of any type. The next step is to provide enhanced services against this content — services that go beyond discovery and access. Once done, the library profession moves away from being a warehouse to an online place where data and information can be put into context, used to address existing problems, and/or create new knowledge.

The problem of find has reached the point of diminishing returns. The problem of use is now the problem requiring a greater amount of the profession’s attention.

Fun with RSS and the RSS aggregator called Planet

Eric Lease Morgan — Wed, 25 May 2011 22:54:15 +0000

This posting outlines how I refined a number of my RSS feeds and then aggregated them into a coherent whole using Planet.

Many different RSS feeds

I have, more or less, been creating RSS (Really Simple Syndication) feeds since 2002. My first foray was not really with RSS but rather with RDF. At that time the functions of RSS and RDF were blurred. In any event, I used RDF as a way of syndicating randomly selected items from my water collection. I never really pushed the RDF, and nothing really became of it. See “Collecting water and putting it on the Web” for details.

In December of 2004 I started marking up my articles, presentations, and travelogues in TEI and saving the result in a database. The webified version of these efforts was something called Musings on Information and Librarianship. I described the database supporting the process in a specific entry called “My personal TEI publishing system”. A program, make-rss.pl, was used to make the feed.

Since then blogs have become popular, and almost by definition, blogs support RSS in a really big way. My RSS was functional, but by comparison, everybody else’s was exceptional. For many reasons I started drifting away from my personal publishing system in 2008 and started moving towards WordPress. This manifested itself in this blog — Mini-Musings.

To make things more complicated, I started blogging on other sites for specific purposes. About a year ago I started blogging for the “Catholic Portal”, and more recently I’ve been blogging about research data management/curation — Days in the Life of a Librarian — at the University of Notre Dame.

In September of 2009 I started implementing a reading list application. Print an article. Read it. Draw and scribble on it. (Read, “Annotate it.”) Scan it. Convert it into a PDF document. Do OCR against it. Save the result to a Web-accessible file system. Do data entry against a database to describe it. Index the metadata and extracted OCR. And finally, provide a searchable/browsable interface to the whole lot. The result is a fledgling system I call “What’s Eric Reading?” Since I wanted to share my wealth (after all, I am a librarian) I created an RSS feed against this system too.

I was on a roll. I went back to my water collection and created a full-fledged RSS feed against it as well. See the simple Perl script — water2rss.pl — to see how easy it is.

Ack! I now have six different active RSS feeds, not counting the feeds I can get from Flickr and YouTube:

That’s too many, even for an ego surfer like myself. What to do? How can I consolidate these things? How can I present my writings in a single interface? How can I make it easy to syndicate all of this content in a standards-compliant way?

Planet

The answer to my questions is/was Planet — “an awesome ‘river of news’ feed reader. It downloads news feeds published by web sites and aggregates their content together into a single combined feed, latest news first.”

A couple of years ago the Code4Lib community created an RSS “planet” called Planet Code4Lib — “Blogs and feeds of interest to the Code4Lib community, aggregated.” I think it is maintained by Jonathan Rochkind, but I’m not sure. It is pretty nice since it brings together the RSS feeds from quite a number of library “hackers”. Similarly, there is another planet called Planet Cataloging which does the same thing for library cataloging feeds. This one is maintained by Jennifer W. Baxmeyer and Kevin S. Clarke. The combined planets work very well together, except when individual blogs are in both aggregations. When this happens I end up reading the same blog postings twice. Not a big deal. You get what you pay for.

After a tiny bit of investigation, I decided to use Planet to aggregate and serve my RSS feeds. Installation and configuration were trivial. Download and unpack the distribution. Select an HTML template. Edit a configuration file denoting the location of RSS feeds and where the output will be saved. Run the program. Tweak the template. Repeat until satisfied. Run the program on a regular basis, preferably via cron. Done. My result is called Planet Eric Lease Morgan.

The graphic design may not be extraordinarily beautiful, but the content is not necessarily intended to be read via an HTML page. Instead the content is intended to be read from inside one’s favorite RSS reader. Planet not only aggregates content but syndicates it too. Very, very nice.
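The “river of news” idea itself is simple enough to sketch. The Python below is not Planet’s code, and it skips downloading and parsing entirely. It assumes the feeds have already been reduced to (title, pubDate) pairs, and the feed names and entries are hypothetical.

```python
from email.utils import parsedate_to_datetime

def river_of_news(feeds, limit=10):
    """Merge the items of many feeds into one list, latest news first."""
    items = []
    for feed_name, entries in feeds.items():
        for title, pub_date in entries:
            # RSS pubDate values use the RFC 822 date format
            items.append((parsedate_to_datetime(pub_date), feed_name, title))
    items.sort(key=lambda item: item[0], reverse=True)
    return items[:limit]

# Hypothetical, already-parsed feeds
feeds = {
    "feed-a": [
        ("Older posting", "Tue, 12 Apr 2011 01:41:10 +0000"),
        ("Newest posting", "Wed, 25 May 2011 22:54:15 +0000"),
    ],
    "feed-b": [("Middle posting", "Sun, 15 May 2011 14:01:40 +0000")],
}

for when, feed_name, title in river_of_news(feeds):
    print(when.date(), feed_name, title)
```

The combined list can then be written out twice: once through an HTML template and once as a new RSS feed.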

What I learned

I learned a number of things from this process. First I learned that standards evolve. “Duh!”

Second, my understanding of open source software and its benefits was reinforced. I would not have been able to do nearly as much if it weren’t for open source software.

Third, the process provided me with a means to reflect on the processes of librarianship. My particular processes for syndicating content needed to evolve in order to remain relevant. I had to go back and modify a number of my programs in order for everything to work correctly and validate. The library profession seemingly hates to do this. We have a mindset of “Mark it and park it.” We have a mindset of “I only want to touch a book or record once.” In the current environment, this is not healthy. Change is more the norm than not. The profession needs to embrace change, but then again, all institutions, almost by definition, abhor change. What’s a person to do?

Fourth, the process enabled me to come up with a new quip. The written word read transcends both space and time. Fun!?

Finally, here’s an idea for the progressive librarians in the crowd. Use the Planet software to aggregate RSS fitting your library’s collection development policy. Programmatically loop through the resulting links to copy/mirror the remote content locally. Curate the resulting collection. Index it. Integrate the subcollection and index into your wider collection of books, journals, etc. Repeat.

Book reviews for Web app development

Eric Lease Morgan — Sun, 15 May 2011 14:01:40 +0000

This is a set of tiny book reviews covering the topic of Web app development for the iPhone, iPad, and iPod Touch.

Unless you’ve been living under a rock for the past three or four years, you know of the increasing popularity of personal mobile computing devices. This has manifested itself through “smart phones” like the iPhone and “tablet computers” like the iPad and to some extent the iPod Touch. These devices, as well as other smart phones and tablet computers, get their network connections from the ether, their screens are smaller than the monitors of desktop computers, and they employ touch screens for input instead of keyboards and mice. All of these things significantly change the user’s experience and thus their expectations.

As a librarian I am interested in providing information services to my clientele. In this increasingly competitive environment where the provision of information services includes players like Google, Amazon, and Facebook, it behooves me to adapt to the wider environment of my clientele as opposed to the other way around. This means I need to learn how to provide information services through mobile computing devices. Google does it. I have to do it too.

Applications for mobile computing devices fall into two categories: 1) native applications, and 2) “Web apps”. The former are binary programs written in compiled languages like Objective-C (or quite possibly Java). These types of applications are operating system-specific, but they are also able to take full advantage of the underlying hardware. This means applications for things like iPhone or iPad can interoperate with the devices’ microphone, camera, speakers, geo-location functions, network connection, local storage, etc. Unfortunately, I don’t know any compiled languages to any great degree, and actually I have little desire to do so. After all, I’m a lazy Perl programmer, and I’ve been that way for almost twenty years.

The second class of applications are Web apps. In reality, these things are simply sets of HTML pages specifically designed for mobiles. These “applications” have the advantage of being operating system independent but are dead in the water without the existence of a robust network connection. These applications, in order to be interactive and meet user expectations, also need to take full advantage of CSS and Javascript, and when it comes to Javascript it becomes imperative to learn and understand how to do AJAX and AJAX-like data acquisition. If I want to provide information services through mobile devices, then the creation of Web apps seems much more feasible. I know how to create well-formed and valid HTML. I can employ the classic LAMP stack to do any hard-core computing. There are a growing number of CSS frameworks making it easy to implement the mobile interface. All I have to do is learn Javascript, and this is not nearly as difficult as it used to be with the emergence of Javascript debuggers and numerous Javascript libraries. For me, Web apps seem to be the way to go.

Over the past couple of years I went out and purchased the following books to help me learn how to create Web apps. Each of them is briefly described below, but first, here’s a word about WebKit. There are at least three HTML frameworks driving the majority of Web browsers these days: Gecko, which is the heart of Firefox; WebKit, which is the heart of Safari and Chrome; and whatever Microsoft uses as the heart of Internet Explorer. Since I do not own any devices that run the Android or the Windows operating systems, all of my development is limited to Gecko or WebKit based browsers. Luckily, WebKit seems to be increasing in popularity, and this makes it easier for me to rationalize my development in iPhone, iPad, and iPod Touch. The books reviewed below also lean in this direction.

Beginning iPhone And iPad Web Apps (2010, 488 pgs.) by Chris Apers and Daniel Paterson – This is one of my more recent purchases and I think I like this book the best. First and foremost, it is the most agnostic of all the books, even though some of the examples use WebKit. True to its title, it describes the use of HTML5, CSS, and Javascript to implement mobile interfaces. This includes whole chapters on the use of vector graphics and fonts, audio and video content, special effects with (WebKit-specific) CSS, touch and gesture events with Javascript, location-aware programming, and client-side data storage. Moreover, this book is the best of the bunch when it comes to describing how mobile interfaces are different from browser-based interfaces. Mobile interfaces are not just smaller versions of their older siblings! If you are going to buy one book, then buy this one. I think it will serve you for the longest period of time.
Building iPhone Apps With HTML, CSS, and Javascript (2010, 166 pgs.) by Jonathan Stark – Being shorter than the previous book, this one is not as thorough but still covers all the bases. On the other hand, unlike the previous title, it does describe how to use a Javascript library for mobile (JQTouch), and how to use PhoneGap to convert a Web app into a native application with many of the native application benefits. This book is a quick read and a good introduction.
Dashcode For Dummies (2011, 436 pgs.) by Jesse Feiler – Dashcode is a development environment originally designed to facilitate the creation of Macintosh OS X dashboard widgets. As you may or may not know, these widgets are self-contained HTML/Javascript/CSS files intended to support simple utility functions. Tell the time. Display the weather. Convert currencies. Render XML files. Etc. Dashcode evolved and now enables the developer to create Web apps for the Macintosh family of i-devices. I bought this book because I own these devices, and I thought the book might help me exploit their particular characteristics. It does not. Dashcode includes no internal links to the underlying hardware. This book describes how to use Dashcode very well, but Dashcode applications are not really the kind I want to create. I suppose I could use Dashcode to create the skin of my application but the overhead may be excessive and the result may be too device dependent.
Developing Hybrid Applications For The iPhone (2009, 195 pgs.) by Lee S. Barney – By introducing the idea of a “hybrid” application, this book picks up where the Dashcode book left off. It does this by describing two Javascript packages (QuickConnectiPhone and PhoneGap) allowing the developer to interact with the underlying hardware. I’ve read this book a couple of times, I’ve looked over it a few more, and in the end I am still challenged. I’m excited about accessing things like the hardware’s camera, GPS functionality, and file system, but after reading this book I’m still confused about how to actually do it. The content of this book is an advanced topic to be tackled after the basics have been mastered.
Safari And WebKit Development For iPhone OS 3.0 (2010, 383 pgs.) by Richard Wagner – This book is practical, and the one I relied upon the most, but only before I bought Beginning iPhone And iPad Web Apps. It gives an overview of WebKit, Javascript, and CSS. It advocates Web app frameworks like iUI, iWebKit, and UIUIKit. It describes how to design interfaces for the small screen of iPhone and iPod Touch. It has a chapter on the specific Javascript events supported by iPhone and iPod Touch. Like a couple of the other books, it describes how to use the HTML5 canvas to render graphics. I was excited to learn how to interact with the phone, maps, and SMS functions of the devices, but learned that this is done simply through specialized URLs. When the book talks about “offline applications” it is really talking about local database storage, another feature of HTML5. A couple of things I should have explored but haven’t yet include bookmarklets and data URLs. The book describes how to take advantage of these concepts. This book is really a second edition of a similar book with a different title but written by the same author in 2008. Its content is not as current as it could be, but the fundamentals are there.

Based on the things I’ve learned from these books, I’ve created several mobile interfaces. Each of them deserves its own blog posting so I will only outline them here:

iMobile – A rough mobile interface to much of the Infomotions domain. Written a little more than a year ago, it combines backend Perl scripts with the iUI Javascript framework to render content. Now that I look back on it, the hacks there are pretty impressive, if I do say so myself. Of particular interest is the image gallery which gets its content from OAI-PMH data stored on the server, and my water collection which reads an XML file of my own design and plots where the water was collected on a Google map. iMobile was created from the knowledge I gained from Safari And WebKit Development For iPhone OS 3.0.
DH@ND – The home page for a fledgling initiative called Digital Humanities at the University of Notre Dame. The purpose of the site is to support sets of tools enabling students and scholars to simultaneously do “close reading” and “distant reading”. It was built using the principles gleaned from the books above combined with a newer Javascript framework called JQueryMobile. There are only two things presently of note there. The first is Alex Lite for Mobile, a mobile interface to a tiny catalogue of classic novels. Browse the collection by author or title. Download and read selected books in ePub, PDF, or HTML formats. The second is Geo-location. After doing named-entity extraction against a limited number of classic novels, this interface displays a word cloud of place names. The user can then click on place names and have them plotted on a Google Map.

Remember, the sites listed above are designed for mobile, primarily driven by the WebKit engine. If you don’t use a mobile device to view the sites, then your mileage will vary.

Web app development is beyond a trend. It has all but become an expectation. Web app implementation requires an evolution in thinking about Web design as well as an additional skill set which includes advanced HTML, CSS, and Javascript. These are not your father’s websites. There are a number of books out there that can help you learn about these topics. Listed above are just a few of them.

Alex Lite (version 2.0)

Eric Lease Morgan — Tue, 12 Apr 2011 01:41:10 +0000

This posting describes Alex Lite (version 2.0) — a freely available, standards-compliant distribution of electronic texts and ebooks.

Alex Lite in a browser

Alex Lite on a mobile

A few years ago I created the first version of Alex Lite. Its primary purpose was to: 1) explore and demonstrate how to transform a particular flavor of XML (TEI) into a number of ebook formats, and 2) distribute the result on a CD-ROM. The process was successful. I learned a lot about XSLT, the primary tool for doing this sort of work.

Since then two new developments have occurred. First, a “standard” ebook format has emerged — ePub. Based on XHTML, this standard specifies packaging up numerous XML files into a specialized ZIP archive. Software is intended to uncompress the file and display the result. Second, mobile devices have become more prevalent. Think “smart phones” and iPads. These two things have been combined to generate an emerging ebook market. Consequently, I decided to see how easy it would be to transform my TEI files into ePub files, make them available on the Web as well as a CD-ROM, and finally implement a “Webapp” for using the whole thing.
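For the curious, the “specialized ZIP archive” can be sketched with nothing but Python’s standard library. This is not how Alex Lite builds its files (that is done with XSLT), and the result is only a skeleton, not a valid ebook. The function name and the content.opf file name are assumptions. What the sketch does show is the packaging convention: a mimetype file comes first and is stored uncompressed, and META-INF/container.xml points to the package file.

```python
import io
import zipfile

CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def epub_skeleton(files):
    """Package files (a dict of name -> text) into an ePub-style ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry must be first and must not be compressed
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER,
                         compress_type=zipfile.ZIP_DEFLATED)
        for name, content in files.items():
            archive.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()

# Placeholder content; a real ebook needs a complete package file and XHTML
data = epub_skeleton({"content.opf": "<package/>", "book.xhtml": "<html/>"})
print(zipfile.ZipFile(io.BytesIO(data)).namelist())
```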

Alex Lite (version 2.0) is the result. There you will find a rudimentary Web browser-based “catalogue” of electronic texts. Browsable by authors and titles (no search), a person can read as many as eighty classic writings in the forms of HTML, PDF, and ePub files. Using just about any mobile device, a person should be able to use a different interface to the collection with all of the functionality of the original. The only difference is the form factor, and thus the graphic design.

The entire Alex Lite distribution is designed to be given away and used as a stand-alone “library”. Download the .zip file. Uncompress it (about 116 MB). Optionally save the result on your Web server. Open the distribution’s index.html file with your browser or mobile. Done. Everything is included. Supporting files. HTML files. ePub files. PDFs. Since all the files have been run through validators, a CD of Alex Lite should be readable for quite some time. Give away copies to your friends and relatives. Alex Lite makes a great gift.

Computers and their networks are extremely fragile. If they were to break, then much of the world’s current information would suddenly become inaccessible. Creating copies of content, like Alex Lite, is a sort of insurance against this catastrophe. Marking up content in forms like TEI makes it relatively easy to migrate ideas forward. TEI is just the information, not display nor container. Using XSLT it is possible to create different containers and different displays. Having copies of content locally enables a person to control their own destiny. Linking to content only creates maintenance nightmares.

Alex Lite is a fun little hack. Share it with your friends, and use it to evolve your definition of a library.

Where in the world is the mail going?

Eric Lease Morgan — Thu, 24 Mar 2011 01:22:09 +0000

For a good time, I geo-located the subscribers from a number of mailing lists, and then plotted them on a Google map. In other words, I asked the question, “Where in the world is the mail going?” The answer was sort of surprising.

I moderate/manage three library-specific mailing lists: Usability4Lib, Code4Lib, and NGC4Lib. This means I constantly get email messages from the LISTSERV application alerting me to new subscriptions, unsubscriptions, bounced mail, etc. For the most part the whole thing is pretty hands-off, and all I have to do is manually unsubscribe people because their address changed. No big deal.

It is sort of fun to watch the subscription requests. They are usually from places within the United States but not always. I then got to wondering, “Exactly where are these people located?” Plotting the answer on a world map would make such things apparent. This process is called geo-location. For me it is easily done by combining a Perl module called Geo::IP with the Google Maps API. The process was not too difficult and was implemented in a program called domains2map.pl:

1. get a list of all the subscribers to a given mailing list
2. remove all information but the domain of the email addresses
3. get the latitude and longitude for a given domain (geo-locate the domain)
4. increment the number of times this domain occurs in the list
5. go to Step #3 for each item in the list
6. build a set of Javascript objects describing each domain
7. insert the objects into an HTML template
8. output the finished HTML
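The steps above can be sketched in a few lines of Python. This is not domains2map.pl, which is written in Perl and uses Geo::IP. Here the geo-location step is replaced by a hypothetical lookup table, and the addresses, coordinates, and one-line template are invented for illustration.

```python
import json
from collections import Counter

# Hypothetical lookup table standing in for a real geo-location service
COORDINATES = {
    "example.edu": (41.70, -86.24),
    "example.org": (37.39, -122.08),
}

def domains_to_markers(addresses, locate=COORDINATES.get):
    """Tally email domains and describe each locatable one as a marker."""
    domains = [a.rsplit("@", 1)[1].lower() for a in addresses if "@" in a]
    markers = []
    for domain, count in sorted(Counter(domains).items()):
        position = locate(domain)
        if position is None:
            continue  # skip domains that cannot be geo-located
        markers.append({"domain": domain, "lat": position[0],
                        "lng": position[1], "count": count})
    return markers

def render(markers, template="var markers = %s;"):
    """Insert the markers, serialized as Javascript objects, into a template."""
    return template % json.dumps(markers)

subscribers = ["a@example.edu", "b@Example.edu", "c@example.org", "d@unknown.invalid"]
print(render(domains_to_markers(subscribers)))
```

A real template would be a full HTML page that hands the markers to the mapping API.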

The results are illustrated below.

Usability4Lib – 600 subscribers
interactive map	pie chart

Code4Lib – 1,700 subscribers
interactive map	pie chart

NGC4Lib – 2,100 subscribers
interactive map	pie chart

It is interesting to note how many of the subscribers seem to be located in Mountain View (California). This is because many people use Gmail for their mailing list subscriptions. The mailing lists I moderate/manage are heavily based in the United States, western Europe, and Australia — for the most part, English-speaking countries. There is a large contingent of Usability4Lib subscribers located in Rochester (New York). Gee, I wonder why. Even though the number of subscribers to Code4Lib and NGC4Lib is similar, the Code4Libbers use Gmail more. NGC4Lib seems to have the most international subscription base.

In the interest of providing “access to the data behind the chart”, you can download the data sets: code4lib.txt, ngc4lib.txt, and usability4lib.txt. Fun with Perl, Google Maps, and mailing list subscriptions.

For something similar, take a gander at my water collection where I geo-located waters of the world.

Constant chatter at Code4Lib

Eric Lease Morgan — Sun, 20 Mar 2011 14:02:12 +0000

As illustrated by the chart, it seems as if the chatter was constant during the most recent Code4Lib conference.

For a good time and in the vein of text mining, I made an effort to collect as many tweets with the hash tag #c4l11 as possible, as well as the backchannel log files. (“Thanks, lbjay!”) I then parsed the collection into fields (keys, author identifiers, date stamps, and chats/tweets), and stuffed them into a database. I then created a rudimentary tab-delimited text file consisting of a key (representing a conference event), a start time, and an end time. Looping through this file I queried my database returning the number of chats and tweets associated with each time interval. Lastly, I graphed the result.
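The counting step can be sketched as follows. This is not the program I used, and there is no database here. The event keys, the times, and the in-memory list of timestamps are hypothetical stand-ins for the tab-delimited file and the database query.

```python
from datetime import datetime

def chats_per_event(events, timestamps):
    """Count the chats/tweets whose timestamps fall within each event."""
    counts = {}
    for key, start, end in events:
        counts[key] = sum(1 for stamp in timestamps if start <= stamp < end)
    return counts

# Hypothetical events: (key, start time, end time)
events = [
    ("keynote", datetime(2011, 2, 8, 9, 0), datetime(2011, 2, 8, 10, 0)),
    ("talk-1", datetime(2011, 2, 8, 10, 0), datetime(2011, 2, 8, 10, 20)),
]

# Hypothetical date stamps of individual chats and tweets
timestamps = [
    datetime(2011, 2, 8, 9, 5),
    datetime(2011, 2, 8, 9, 45),
    datetime(2011, 2, 8, 10, 10),
    datetime(2011, 2, 8, 11, 30),
]

print(chats_per_event(events, timestamps))
```

Because longer events collect more chats, dividing each count by the length of its event gives a fairer comparison.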

Constant chatter at Code4Lib, 2011

As you can see there are a number of spikes, most notably associated with keynote presentations and Lightning Talks. Do not be fooled, because each of these events is longer than the balance of the events in the conference. The chatter was rather constant throughout Code4Lib 2011.

When talking about the backchannel, many people say, “It is too distracting; there is too much stuff there.” I then ask myself, “How much is too much?” Using the graph as evidence, I can see there are about 300 chats per event. Each event is about 20-30 minutes long. That averages out to 10ish chats per minute or 1 item every 6 seconds. I now have a yardstick. When the chat volume is equal to or greater than 1 item every 6 seconds, then there is too much stuff for many people to follow.
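The arithmetic behind the yardstick is easy to check. The small Python sketch below uses only the figures quoted above (300 chats, 20 to 30 minutes); the function name is my own.

```python
def chat_rate(chats, minutes):
    """Return (chats per minute, seconds between chats)."""
    per_minute = chats / minutes
    return per_minute, 60 / per_minute

for minutes in (20, 30):
    per_minute, gap = chat_rate(300, minutes)
    print(f"{minutes} minutes: {per_minute:.0f} chats per minute, "
          f"one item every {gap:.0f} seconds")
```

The 30-minute case gives the figures quoted above; a 20-minute event is busier still.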

The next step will be to write a program allowing people to select time ranges from the chat/tweet collection, extract the associated data, and apply analysis tools against them. This includes things like concordances, lists of frequently used words and phrases, word clouds, etc.

Finally, just like traditional books, articles, microforms, and audio-visual materials, things like backchannel log files, tweets, blogs, and mailing list archives are forms of human expression. To what degree do these things fall into the purview of library collections? Why (or why not) should libraries actively collect and archive them? If it is within our purview, then what do libraries need to do differently in order to build such collections and take advantage of their fulltext nature?

How “great” are the Great Books?

Eric Lease Morgan — Wed, 16 Mar 2011 10:33:05 +0000

In this posting I present two quantitative methods for denoting the “greatness” of a text. Through this analysis I learned that Aristotle wrote the greatest book. Shakespeare wrote seven of the top ten books when it comes to love. And Aristophanes’s Peace is the most significant when it comes to war. Once calculated, this description – something I call the “Great Ideas Coefficient” – can be used as a benchmark to compare & contrast one text with another.

Research questions

In 1952 Robert Maynard Hutchins et al. compiled a set of books called the Great Books of the Western World. [1] Comprised of fifty-four volumes and more than a couple hundred individual works, it included writings from Homer to Darwin. The purpose of the set was to cultivate a person’s liberal arts education in the Western tradition. [2]

To create the set, a process of “syntopical reading” was first done. [3] (Syntopical reading is akin to the emerging idea of “distant reading” [4], and at the same time complementary to the more traditional “close reading”.) The result was an enumeration of 102 “Great Ideas” commonly debated throughout history. Through the syntopical reading process, through the enumeration of timeless themes, and after thorough discussion with fellow scholars, the set of Great Books was enumerated. As stated in the set’s introductory materials:

…but the great books possess them [the great ideas] for a considerable range of ideas, covering a variety of subject matters or disciplines; and among the great books the greatest are those with the greatest range of imaginative or intellectual content. [5]

Our research question is then, “How ‘great’ are the Great Books?” To what degree do they discuss the Great Ideas which apparently define their greatness? If such degrees can be measured, then which of the Great Books are greatest?

Great Ideas Coefficient defined

To measure the greatness of any text – something I call a Great Ideas Coefficient – I apply two methods of calculation. Both exploit the use of term frequency inverse document frequency (TFIDF).

TFIDF is a well-known method for calculating statistical relevance in the field of information retrieval (IR). [6] Query terms are supplied to a system and compared to the contents of an inverted index. Specifically, documents are returned from an IR system in a relevancy ranked order based on: 1) the ratio of query term occurrences and the size of the document multiplied by 2) the ratio of the number of documents in the corpus and the number of documents containing the query terms. Mathematically stated, TFIDF equals:

(c/t) * log(d/f)

where:

c = number of times the query terms appear in a document
t = total number of words in a document
d = total number of documents in a corpus
f = total number of documents containing the query terms

For example, suppose a corpus contains 100 documents. This is d. Suppose two of the documents contain a given query term (such as “love”). This is f. Suppose also the first document is 50 words long (t) and contains the word love once (c). Thus, using a base-10 logarithm, the first document has a TFIDF score of 0.034:

(1/50) * log(100/2) = 0.0340

Whereas, if the second document is 75 words long (t) and contains the word love twice (c), then the second document’s TFIDF score is 0.045:

(2/75) * log(100/2) = 0.0453

Thus, the second document is considered more relevant than the first, and by extension, the second document is probably more “about” love than the first. For our purposes relevance and “aboutness” are equated with “greatness”. Consequently, in this example, when it comes to the idea of love, the second document is “greater” than the first.

To calculate our first Coefficient I sum all 102 Great Idea TFIDF scores for a given document, a statistic called the “overlap score measure”. [7] By comparing the resulting sums I can compare the greatness of the texts as well as examine correlations between Great Ideas. Since items selected for inclusion in the Great Books also need to exemplify the “greatest range of imaginative or intellectual content”, I also produce a Coefficient based on a normalized mean for all 102 Great Ideas across the corpus.
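The arithmetic of the worked example can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name tfidf is hypothetical, and the base-10 logarithm is an assumption, chosen because it is the base that reproduces the figures in the example.

```python
import math

def tfidf(c, t, d, f):
    """Return (c/t) * log10(d/f).

    c = occurrences of the query term in the document
    t = total number of words in the document
    d = total number of documents in the corpus
    f = number of documents containing the query term
    A term found in no document, or an empty document, scores zero.
    """
    if f == 0 or t == 0:
        return 0.0
    return (c / t) * math.log10(d / f)

# The two documents from the example: 100 documents, 2 of which mention "love"
first = tfidf(c=1, t=50, d=100, f=2)
second = tfidf(c=2, t=75, d=100, f=2)

print(f"{first:.4f} {second:.4f}")
print("second is more about love:", second > first)
```

Guarding against f equal to zero matters in practice, because some of the Great Ideas may never appear in the corpus at all.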

Great Ideas Coefficient calculated

To calculate the Great Ideas Coefficient for each of the Great Books I used the following process:

1. Mirrored versions of Great Books – By searching and browsing the Internet 222 of the 260 Great Books were found and copied locally, giving us a constant (d) equal to 222.
2. Indexed the corpus – An inverted index was created. I used Solr for this. [8]
3. Calculated TFIDF for a given Great Idea – First the given Great Idea was stemmed and searched against the index, resulting in a value for f. Each Great Book was retrieved from the local mirror whereby the size of the work (t) was determined as well as the number of times the stem appeared in the work (c). TFIDF was then calculated.
4. Repeated Step #3 for each of the Great Ideas – Go to Step #3 for each of the Great Ideas.
5. Summed each of the TFIDF scores – The Great Idea TFIDF scores were added together giving us our first Great Ideas Coefficient for a given work.
6. Saved the result – Each of the individual scores as well as the Great Ideas Coefficient was saved to a database.
7. Returned to Step #3 for each of the Great Books – Go to Step #3 for each of the other works in the corpus.

The end result was a file in the form of a matrix with 222 rows and 104 columns. Each row represents a Great Book. The columns hold a local identifier, the 102 Great Idea TFIDF scores, and the book’s Great Ideas Coefficient. [9]
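The process above can be sketched against a toy corpus. This is a simplified illustration under loud assumptions: it uses plain Python instead of Solr, matches whole words instead of stems, uses a base-10 logarithm, and the three tiny "books", the three ideas, and every function name are hypothetical.

```python
import math
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def great_ideas_matrix(corpus, ideas):
    """Return {book: {idea: TFIDF score, ..., "coefficient": sum of scores}}.

    corpus maps a local identifier to the full text of a work;
    ideas is a list of single words (no stemming is done here).
    """
    tokens = {book: tokenize(text) for book, text in corpus.items()}
    d = len(corpus)
    # f: the number of works containing each idea
    f = {idea: sum(1 for words in tokens.values() if idea in words)
         for idea in ideas}
    matrix = {}
    for book, words in tokens.items():
        t = len(words)
        row = {}
        for idea in ideas:
            c = words.count(idea)
            if f[idea] == 0 or t == 0:
                row[idea] = 0.0
            else:
                row[idea] = (c / t) * math.log10(d / f[idea])
        # The first Great Ideas Coefficient: the sum of the idea scores
        row["coefficient"] = sum(row[idea] for idea in ideas)
        matrix[book] = row
    return matrix

# Hypothetical stand-ins for the mirrored texts
corpus = {
    "book-a": "love and war and love again",
    "book-b": "war is the father of all things",
    "book-c": "the state exists for the sake of the good life",
}
ideas = ["love", "war", "state"]

matrix = great_ideas_matrix(corpus, ideas)
ranked = sorted(matrix, key=lambda book: matrix[book]["coefficient"], reverse=True)
print(ranked)
```

Sorting by a single idea instead of the coefficient, for example matrix[book]["love"], gives the per-idea rankings.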

The Great Books analyzed

Sorting the matrix according to the Great Ideas Coefficient is trivial. Upon doing so I see that Kant’s Introduction To The Metaphysics Of Morals and Aristotle’s Politics are the first and second greatest books, respectively. When the matrix is sorted by the love column, I see Plato’s Symposium come out as number one, but Shakespeare claims seven of the top ten items with his collection of Sonnets being the first. When the matrix is sorted by the war column, then Aristophanes’s Peace is the greatest.

Unfortunately, denoting overall greatness in the previous manner is too simplistic because it does not fit the definition of greatness posited by Hutchins. The Great Books are expected to be great because they exemplify the “greatest range of imaginative or intellectual content”. In other words, the Great Books are great because they discuss and elaborate upon a wide spectrum of the Great Ideas, not just a few. Ironically, this does not seem to be the case. Most of the Great Books have many Great Idea scores equal to zero. In fact, at least two of the Great Ideas – cosmology and universal – have TFIDF scores equal to zero across the entire corpus, as illustrated by Figure 1. This being the case, I might say that none of the Great Books are truly great because none of them significantly discuss the totality of the Great Ideas.

Figure 1 – Box plot scores of Great Ideas

To take this into account and not allow the value of the Great Idea Coefficient to be overwhelmed by one or two Great Idea scores, I calculated the mean TFIDF score for each of the Great Ideas across the matrix. This vector represents an imaginary but “typical” Great Book. I then compared the Great Idea TFIDF scores for each of the Great Books with this central quantity to determine whether or not it is above or below the typical mean. After graphing the result I see that Aristotle’s Politics is still the greatest book with Hegel’s Philosophy Of History being number two, and Plato’s Republic being number three. Figure 2 graphically illustrates this finding, but in a compressed form. Not all works are listed in the figure.

Figure 2 – Individual books compared to the “typical” Great Book
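One way to picture the comparison with the “typical” Great Book is sketched below. The posting does not spell out exactly how each book was scored against the mean, so counting the number of ideas on which a book exceeds the mean is an assumption of this sketch, as are the function names and the made-up scores.

```python
def typical_book(matrix, ideas):
    """Return the mean score of each idea across all books."""
    n = len(matrix)
    return {idea: sum(row[idea] for row in matrix.values()) / n
            for idea in ideas}

def ideas_above_typical(matrix, ideas):
    """Count, for each book, how many idea scores exceed the typical score."""
    typical = typical_book(matrix, ideas)
    return {book: sum(1 for idea in ideas if row[idea] > typical[idea])
            for book, row in matrix.items()}

# Hypothetical TFIDF scores for three books and three ideas
matrix = {
    "book-a": {"love": 0.080, "war": 0.002, "state": 0.000},
    "book-b": {"love": 0.010, "war": 0.020, "state": 0.030},
    "book-c": {"love": 0.004, "war": 0.000, "state": 0.001},
}
ideas = ["love", "war", "state"]

sums = {book: sum(row.values()) for book, row in matrix.items()}
breadth = ideas_above_typical(matrix, ideas)

print(max(sums, key=sums.get))        # greatest by summed score
print(max(breadth, key=breadth.get))  # greatest by range of ideas
```

In this toy data one very high love score puts book-a first by sum, while book-b is above the mean on more ideas. That is the effect a mean-based comparison is meant to expose.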

Summary

How “great” are the Great Books? The answer depends on what qualities a person wants to measure. Aristotle’s Politics is great in many ways. Shakespeare is great when it comes to the idea of love. The calculation of the Great Ideas Coefficient is one way to compare & contrast texts in a corpus – “syntopical reading” in a digital age.

Notes

[1] Hutchins, Robert Maynard. 1952. Great books of the Western World. Chicago: Encyclopædia Britannica.

[2] Ibid. Volume 1, page xiv.

[3] Ibid. Volume 2, page xi.

[4] Moretti, Franco. 2005. Graphs, maps, trees: abstract models for a literary history. London: Verso, page 1.

[5] Hutchins, op. cit. Volume 3, page 1220.

[6] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An introduction to information retrieval. Cambridge: Cambridge University Press, page 109.

[7] Ibid.

[8] Solr – http://lucene.apache.org/solr/

[9] This file – the matrix of identifiers and scores – is available at http://bit.ly/cLmabY, but a more useful and interactive version is located at http://bit.ly/cNVKnE

Code4Lib Conference, 2011

Eric Lease Morgan — Sat, 12 Mar 2011 18:52:52 +0000

This posting documents my experience at the 2011 Code4Lib Conference, February 8-10 in Bloomington (Indiana). In a sentence, the Conference was well-organized, well-attended, and demonstrated the over-all health and vitality of this loosely structured community. At the same time I think the format of the Conference will need to evolve if it expects to significantly contribute to the library profession.

student center

computers

Code4Libbers

Day #1 (Tuesday, February 8)

The Conference officially started on Tuesday, February 8 after the previous day’s round of pre-conference activities. Brad Wheeler (Indiana University) gave the introductory remarks. He alluded to the “new normal”, and said significant change only happens when there are great leaders or financial meltdowns such as the one we are currently experiencing. In order to find stability in the current environment he advocated true dependencies and collaborations, and he outlined three tensions: 1) innovation versus solutions at scale, 2) local-ness and cloudiness, and 3) proprietary versus open. All of these things, he said, are false dichotomies. “There needs to be a balance and mixture of all these tensions.” Wheeler used his experience with Kuali as an example and described personal behavior, a light-weight organization, and local goals as the “glue” making Kuali work. Finally, he said the library community needs to go beyond “toy” projects and create something significant.

The keynote address, Critical collaborations: Programmers and catalogers? Really?, was given by Diane Hillmann (Metadata Management). In it she advocated greater collaboration between the catalogers and coders. “Catalogers and coders do not talk with each other. Both groups get to the nitty-gritty before there is an understanding of the problem.” She said change needs to happen, and it should start within our own institutions by learning new skills and having more cross-departmental meetings. Like Wheeler, she had her own set of tensions: 1) “cool” services versus the existing online public access catalog, and 2) legacy data versus prospective data. She said both communities have things to learn from each other. For example, catalogers need to learn to use data that is not created by catalogers, and catalogers need not always look for leadership from “on high”. I asked what the coders needed to learn, but I wasn’t sure what the answer was. She strongly advocated RDA (Resource Description and Access), and said, “It is ready.” I believe she was looking to the people in the audience as people who could create demonstration projects to show to the wider community.

Karen Coombs (OCLC) gave the next presentation, Visualizing library data. In it she demonstrated a number of ways library information can be graphed through the use of various mash-up technologies: 1) a map of holdings, 2) QR codes describing libraries, 3) author timelines, 4) topic timelines, 5) FAST headings in a tag cloud, 6) numbers of libraries, 7) tree relationships between terms, and 8) pie charts of classifications. “Use these things to convey information that is not a list of words”.

In Hey, Dilbert, where’s my data?, Thomas Barker (University of Pennsylvania) described how he is aggregating various library data sets into a single source for analysis. See http://code.google.com/p/metridoc/

Tim McGeary (Lehigh University) shared a Kuali update in Kuali OLE: Architecture of diverse and linked data. OLE (Open Library Environment) is the beginnings of an open source library management system. Coding began this month (February) with goals to build community, implement a “next-generation library catalog”, re-examine business operations, break away from print models of doing things, create an enterprise-level system, and reflect the changes in scholarly work. He outlined the structure of the system and noted three “buckets” for holding different types of content: 1) descriptive — physical holdings, 2) semantic — conceptual content, and 3) relational — financial information. They are scheduled to release their first bits of code by July.

Cary Gordon (The Cherry Hill Company) gave an overview of Drupal 7 functionality in Drupal 7 as a rapid application development tool. Of most interest to me was the Drupal credo, “Sacrifice the API. Preserve the data.” In the big scheme of things, this makes a lot of sense to me.

After lunch first up was Josh Bishoff (University of Illinois) with Enhancing the mobile experience: mobile library services at Illinois. The most important take-away was the difference between a mobile user experience and a desktop user experience. They are not the same. “This is not a software problem but rather an information architecture problem.”

Scott Hanrath (University of Kansas) described his participation in the development of Anthologize in One week, one tool: Ultra-rapid open source development among strangers. He enumerated the group’s three criteria for success: 1) usefulness, 2) low walls & high ceilings, and 3) feasibility. He also attributed the project’s success to extraordinary outreach efforts: marketing, good graphic design, blurbs, logos, etc.

cabin

graveyard

church

VuFind beyond MARC: Discovering everything else by Demian Katz (Villanova University) described how VuFind supports the indexing of non-MARC metadata through the use of “record drivers”. Acquire metadata. Map it to Solr fields. Index it while denoting it as a special metadata type. Search. Branch according to metadata type. Display. He used Dublin Core OAI-PMH metadata as an example.

The last formal presentation of the day was entitled Letting in the light: Using Solr as an external search component by Jay Luker and Benoit Thiell (Astrophysics Data System). ADS is a bibliographic information system for astronomers. It uses a pre-print server originally developed at CERN. They desired to keep as much of the functionality of the original server as possible but enhance it with Solr indexing. They described how they hacked the two systems to allow the searching and retrieving of millions of records at a time. Of all the presentations at the Conference, this one was the most computer science-like.

The balance of the day was given over to breakout sessions, lightning talks, a reception in the art museum, and craft beer drinking in the hospitality suite. Later that evening I retired to my room and hacked on Twitter feeds. “What do library programmers do for a good time?”

Day #2 (Wednesday, February 9)

The next day began with a presentation by my colleagues at Notre Dame, Rick Johnson and Dan Brubaker Horst. In A Community-based approach to developing a digital exhibit at Notre Dame using the Hydra Framework, they described how they are building and maintaining a digital library framework based on a myriad of tools: Fedora, Active Fedora, Solr, Hydrangea, Ruby, Blacklight. They gave examples of ingesting EAD files. They are working on an ebook management application. Currently they are building a digitized version of city plans.

I think the most inspiring presentation was by Margaret Heller (Dominican University) and Nell Taylor (Chicago Underground) called Chicago Underground Library’s community-based cataloging system. Taylor began and described a library of gray literature. Poems. Comics. All manner of self publications were being collected and loosely cataloged in order to increase the awareness of the materials and record their existence. The people doing the work have little or no cataloging experience. They decided amongst themselves what metadata they were going to use. They wanted to focus on locations and personal characteristics of the authors/publishers of the material. The whole thing reminded me of the times I suggested cataloging local band posters because somebody will find everything interesting at least once.

Gabriel Farrell (Drexel University) described the use of a non-relational database called CouchDB in Beyond sacrilege: A CouchApp catalog. With a REST-ful interface, complete with change log replication and different views, CouchApp seems to be cool as well as “kewl”.

Matt Zumwalt (MediaShelf) in Opinionated metadata: Bringing a bit of sanity to the world of XML metadata described OM, which looked like a programmatic way of working with XML in Ruby, but I thought his advice on how to write good code was more interesting. “Start with people’s stories, not the schema. Allow the vocabulary to reflect the team. And talk to the other team members.”

Ben Anderson (eXtensible Catalog) in Enhancing the performance and extensibility of XC’s metadata services toolkit outlined the development path and improvements to the Metadata Services Toolkit (MST). He had a goal of making the MST faster and more robust, and he did much of this by taking greater advantage of MySQL as opposed to processing various things in Solr.

wires

power supply

water cooler

In Ask Anything! a.k.a. the ‘Human Search Engine’, moderated by Dan Chudnov (Library of Congress), a number of people stood up, asked the group a question, and waited for an answer. The technique worked pretty well and enabled many people to identify many others who: 1) had similar problems, or 2) offered solutions. For better or for worse, I asked the group if they had any experience with issues of data curation, and I was “rewarded” for my effort with the responsibility to facilitate a birds-of-a-feather session later in the day.

Standing in for Mike Grave, Tim Shearer (University of North Carolina at Chapel Hill) presented GIS on the cheap. Using different content from different sources, Grave is geo-tagging digital objects by assigning them latitudes and longitudes. Once this is done, his Web interfaces read the tagging and place the objects on a map. He is using a JavaScript library called OpenLayers for the implementation.

In Let’s get small: A Microservices approach to library websites by Sean Hannan (Johns Hopkins University) we learned how a myriad of tools and libraries are being used by Hannan to build websites. While the number of tools and libraries seemed overwhelming I was impressed at the system’s completeness. He was practicing the Unix Way when it comes to website maintenance.

When a person mentions the word “archives” at a computer conference, one of the next words people increasingly mention is “forensics”, and Mark Matienzo (Yale University) in Fiwalk with me: Building emergent pre-ingest workflows for digital archival records using open source forensic software described how he uses forensic techniques to read, organize, and preserve digital media, specifically hard drives. He advocated a specific workflow for doing his work, a process for analyzing the disk’s content with a program called Gumshoe, and Advanced Forensic Format 4 (AFF4) for doing forensics against file formats. Ultimately he hopes to write an application binding the whole process together.

I paid a lot of attention to David Lacy (Villanova University) when he presented (Yet another) home-grown digital library system, built upon open source XML technologies and metadata standards because the work he has done directly affects a system I am working on colloquially called the “Catholic Portal”. In his system Lacy described a digital library system complete with METS files, a build process, an XML database, and an OAI-PMH server. Content is digitized, described, and ingested into VuFind. I feel embarrassed that I had not investigated this more thoroughly before.

Break-out (birds-of-a-feather) sessions were up next and I facilitated one on data curation. Between ten and twelve of us participated, and in a nutshell we outlined a whole host of activities and issues surrounding the process of data management. After listing them all and listening to the things discussed more thoroughly by the group I was able to prioritize. (“Librarians love lists.”)

At the top was, “We won’t get it right the first time”, and I certainly agree. Data management and data curation are the new kids on the block and consequently represent new challenges. At the same time, our profession seems obsessed with creating processes and implementations, and not with evaluating those processes as needed. In our increasingly dynamic environment, such a way of thinking is not feasible. We will have to practice. We will have to show our ignorance. We will have to experiment. We will have to take risks. We will have to innovate. All of these things assume imperfection from the get go. At the same time the issues surrounding data management have a whole lot in common with issues surrounding just about any other medium. The real challenge is the application of our traditional skills to the current environment.

A close second in the priorities was the perceived need for cross-institutional teams: groups of people including the office of research, libraries, computing centers, legal counsel, and of course researchers who generate data. Everybody has something to offer. Everybody has parts of the puzzle. But no one has all the pieces, all the experience, nor all the resources. Successful data management projects (defined in any number of ways) require skills from across the academe.

Other items of note on the list included issues surrounding: human subjects, embargoing, institutional repositories versus disciplinary repositories, a host of ontologies, format migration, storage and back-up versus preservation and curation, “big data” and “little data”, entrenching one’s self in the research process, and unfunded mandates.

text mining

As a part of the second day’s Lightning Talks I shared a bit about text mining. I demonstrated how the sizes of texts, measured in words, could be things we denote in our catalogs thus enabling people to filter results in an additional way. I demonstrated something similar with Fog, Flesch, and Kincaid scores. I illustrated these ideas with graphs. I alluded to the “colorfulness” of texts by comparing & contrasting Thoreau with Austen. I demonstrated the idea of “in the same breath” implemented through network diagrams. And finally, I tried to describe how all of these techniques could be used in our “next generation library catalogs” or “discovery systems”. The associated video, here, was scraped from the high quality work done by Indiana University. “Thanks guys!”

At the end of the day we were given the opportunity to visit the University’s data center. It sounded a lot like a busman’s holiday to me so I signed up for the 6 o’clock show. I got on the little bus with a few other guys. One was from Australia. Another was from Florida. They were both wondering whether or not the weather was cold. It being around 10° Fahrenheit I had to admit it was. The University is proud of their data center. It can withstand tornado-strength forces. It is built into the side of a hill. It is only half full, if that, which is another way of saying, “They have a lot of room to expand.” We saw the production area. We saw the research area. I was hoping to see lots of blinking lights and colorful, twisty cables, but the lights were few and the cables were all blue. We saw Big Red. I wanted to see where the network came in. “It is over there, in that room”. Holding up my hands I asked, “How big is the pipe?” “Not very large,” was the reply, “and the fiber optic cable is only the size of a piece of hair.” I thought the whole thing was incongruous. All this infrastructure and it literally hangs on the end of a thread. One of the few people I saw employed by the data center made a comment while I was taking photographs. “Those are the nicest packaged cables you will ever see.” She was very proud of her handiwork, and I was happy to take a few pictures of them.

Big Red

generator

wires

Day #3 (Thursday, February 10)

The last day of the conference began with a presentation by Jason Casden and Joyce Chapman (North Carolina State University Libraries) with Building an open source staff-facing tablet app for library assessment. In it they first described how patron statistics were collected. Lots of paper. Lots of tallies. Lots of data entry. Little overall coordination. To resolve this problem they created a tablet-based tool allowing the statistics collector to roam through the library, quickly tally how many people were located where and doing what, and update a centralized database rather quickly. Their implementation was an intelligent use of modern technology. Kudos.

Ian Mulvany (Mendeley) was a bit of an entrepreneur when he presented Mendeley’s API and university libraries: Three examples to create value on behalf of Jan Reichelt. His tool, Mendeley, is intended to solve real problems for scholars: making them more efficient as writers, and more efficient as discoverers. To do this he provides a service where PDF files are saved centrally, analyzed for content, and enhanced through crowd sourcing. Using Mendeley’s API things such as reading lists, automatic repository deposit, or “library dashboard” applications could be written. As of this writing Mendeley is sponsoring a contest with cash prizes to see who can create the most interesting application from their API. Frankly, the sort of application described by Reichelt is the sort of application I think the library community should have created a few years ago.

In Practical relevancy testing, Naomi Dushay (Stanford University) advocated doing usability testing against the full LAMP stack. To do this she uses a program called Cucumber to design usability tests, run them, look at the results, adjust software configurations, and repeat.

Kevin Clarke (NESCent) in Sharing between data repositories first compared & contrasted two repository systems: Dryad and TreeBase. Both have their respective advantages & disadvantages. As a librarian he understands why it is a good idea to have the same content in both systems. To this end he outlined and described how such a goal could be accomplished using a file packaging format called BagIt.

The final presentation of the conference was given by Eric Hellman (Gluejar, Inc) and called Why (Code4) libraries exist. In it he posited that more than half of the books sold in the near future will be in ebook format. If this happens, then, he asked, will libraries become obsolete? His answer was seemingly both no and yes. “Libraries need to change in order to continue to exist, but who will drive this change? Funding agencies? Start-up companies? Publishers? OCLC? ILS vendors?” None of these things, he said. Instead, it may be the coders but we (the Code4Lib community) have a number of limitations. We are dispersed, poorly paid, self-trained, and too practical. In short, none of the groups he outlined entirely have what it takes to keep libraries alive. On the other hand, he said, maybe libraries are not really about books. Instead, maybe, they are about space, people, and community. In the end Hellman said, “We need to teach, train, and enable people to use information.”

conference center

bell

hidden flywheel

Summary

All in all the presentations were pretty much what I expected and pretty much what was intended. Everybody was experiencing some sort of computing problem in their workplace. Everybody used different variations of the LAMP stack (plus an indexer) to solve their problems. The presenters shared their experience with these solutions. Each presentation was like variations of a 12-bar blues. A basic framework is assumed, and the individual uses the framework to create beauty. If you like the idea of the blues framework, then you would have liked the Code4Lib presentations. I like the blues.

In the past eight months I’ve attended at least four professional conferences: Digital Humanities 2010 (July), ECDL 2010 (September), Data Curation 2010 (December), and Code4Lib 2011 (February). Each one had about 300 people in attendance. Each one had something to do with digital libraries. Two were more academic in nature. Two were more practical. All four were communities unto themselves; at each conference there were people of the in-crowd, newcomers, and folks in between. Many, but definitely not most, of the people I saw were a part of the other conferences but none of them were at all four. All of the conferences shared a set of common behavioral norms and at the same time owned a set of inside jokes. We need to be careful and not go around thinking our particular conference or community is the best. Each has something to offer the others. I sincerely do not think there is a “best” conference.

The Code4Lib community has a lot to offer the wider library profession. If the use of computers in libraries is only going to grow (which is an understatement), then a larger number of people who practice librarianship will need/want to benefit from Code4Lib’s experience. Yet the existing Code4Lib community is reluctant to change the format of the conference to accommodate a greater number of people. Granted, larger numbers of attendees make it more difficult to find venues and enable a single shared conference experience, and they necessitate increased governance and bureaucracy. Such are the challenges of a larger group. I think the Code4Lib community is growing and experiencing growing pains. The mailing list increases by at least one or two new subscribers every week. The regional Code4Lib meetings continue. The journal is doing just fine. Code4Lib is a lot like the balance of the library profession. Practical. Accustomed to working on a shoestring. Service oriented. Without evolving in some way, the knowledge of Code4Libbers is not going to have a substantial effect on the wider library community. This makes me sad.

Next year’s conference — Code4Lib 2012 — will be held in Seattle (Washington). See you there?

wires

self-portrait

Forays into parts-of-speech

Eric Lease Morgan — Sun, 06 Feb 2011 00:33:06 +0000

This posting is the first of my text mining essays focusing on parts-of-speech. Based on the most rudimentary investigations, outlined below, it seems as if there is not much utility in the classification and description of texts in terms of their percentage use of parts-of-speech.

Background

For the past year or so I have spent a lot of my time counting words. Many of my friends and colleagues look at me strangely when I say this. I have to admit, it does sound sort of weird. On the other hand, the process has enabled me to easily compare & contrast entire canons in terms of length and readability, locate statistically significant words & phrases in individual works, and visualize both with charts & graphs. Through the process I have developed two Perl modules (Lingua::EN::Ngram and Lingua::Concordance), and I have integrated them into my Alex Catalogue of Electronic Texts. Many people are still skeptical about the utility of these endeavors, and my implementations do not seem to be compelling enough to sway their opinions. Oh well, such is life.

My ultimate goal is to figure out ways to exploit the current environment and provide better library service. The current environment is rich with full text. It abounds. I ask myself, “How can I take advantage of this full text to make the work of students, teachers, and scholars both easier and more productive?” My current answer surrounds the creation of tools that take advantage of the full text — making it easier for people to “read” larger quantities of information, find patterns in it, and through the process create new knowledge.

Much of my work has been based on rudimentary statistics with little regard to linguistics. Through the use of computers I strive to easily find patterns of meaning across works — an aspect of linguistics. I think such a thing is possible because the use of language assumes systems and patterns. If it didn’t then communication between ourselves would be impossible. Computers are all about systems and patterns. They are very good at counting and recording data. By using computers to count and record characteristics of texts, I think it is possible to find patterns that humans overlook or don’t figure as significant. I would really like to take advantage of core reference works which are full of meaning — dictionaries, thesauri, almanacs, biographies, bibliographies, gazetteers, encyclopedias, etc. — but the ambiguous nature of written language makes the automatic application of such tools challenging. By classifying individual words as parts-of-speech (POS), some of this ambiguity can be reduced. This posting is my first foray into this line of reasoning, and only time will tell if it is fruitful.

Comparing parts-of-speech across texts

My first experiment compares & contrasts POS usage across texts. “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?”, I asked myself. “Do some works contain a greater number of nouns, verbs, and adjectives than others?” If so, then maybe this would be one way to differentiate works, and make it easier for the student to both select a work for reading as well as understand its content.

POS tagging

To answer these questions, I need to first identify the POS in a document. In the English language there are eight generally accepted POS: 1) nouns, 2) pronouns, 3) verbs, 4) adverbs, 5) adjectives, 6) prepositions, 7) conjunctions, and 8) interjections. Since I am a “lazy Perl programmer”, I sought a POS tagger and in the end settled on one called Lingua::TreeTagger, a full-featured wrapper around a command line driven application called Tree Tagger. Using a process called the Hidden Markov Model, TreeTagger systematically goes through a document and guesses the POS for a given word. According to the research, it can do this with 96% accuracy because it has accurately modeled the systems and patterns of the English language alluded to above. For example, it knows that sentences begin with capital letters and end with punctuation marks. It knows that capitalized words in the middle of sentences are the names of things and the names of things are nouns. It knows that most adverbs end in “ly”. It knows that adjectives often precede nouns. Similarly, it knows the word “the” also precedes nouns. In short, it has done its best to model the syntactical nature of a number of languages and it uses these models to denote the POS in a document.

For example, below is the first sentence from Abraham Lincoln’s Gettysburg Address:

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Using Lingua::TreeTagger it is trivial to convert the sentence into the following XML snippet where each element contains two attributes (a lemma of the word in question and its POS) and the word itself:

Four score and seven years ago our fathers brought forth on this continent , a new nation , conceived in Liberty , and dedicated to the proposition that all men are created equal .

Each POS is represented by a different code. TreeTagger uses as many as 58 codes. Some of the less obscure are: CD for cardinal number, CC for conjunction, NN for noun, NNS for plural noun, JJ for adjective, VBP for the verb to be in the third-person plural, etc.

Using a slightly different version of the same trivial code, Lingua::TreeTagger can output a delimited stream where each line represents a record and the delimited values are words, lemmas, and POS. The first ten records from the sentence above are displayed below:

Word	Lemma	POS
Four	Four	CD
score	score	NN
and	and	CC
seven	seven	CD
years	year	NNS
ago	ago	RB
our	our	PP$
fathers	father	NNS
brought	bring	VVD
forth	forth	RB

In the end I wrote a simple program — tag.pl — taking a file name as input and streaming to standard output the tagged text in delimited form. Executing the code and saving the output to a file is simple:

$ bin/tag.pl corpus/walden.txt > pos/walden.pos

Consequently, I now have a way to quickly and easily denote the POS for each word in a given plain text file.

Counting and summarizing

Now that the POS of a given document are identified, the next step is to count and summarize them. Counting is something at which computers excel, and I wrote another program — summarize.pl — to do the work. The program’s input takes the following form:

summarize.pl

The first command line argument denotes what POS will be output. “All” denotes the POS defined by Tree Tagger. “Simple” denotes Tree Tagger POS mapped to the eight generally accepted POS of the English language. The use of “nouns”, “pronouns”, “verbs”, “adverbs”, and “adjectives” tells the script to output the tokens (words) or lemmas in each of these classes.

The second command line argument tells the script whether to tally tokens (words) or lemmas when counting specific items.

The last argument is the file to read, and it is expected to be in the form of tag.pl’s output.

Using summarize.pl to count the simple POS in Lincoln’s Address, the following output is generated:

$ summarize.pl simple t address.pos
noun	41
pronoun	29
adjective	21
verb	51
adverb	31
determiner	35
preposition	39
conjunction	11
interjection	0
symbol	2
punctuation	39
other	11

In other words, of the 272 words found in the Gettysburg Address 41 are nouns, 29 are pronouns, 21 are adjectives, etc.

Using a different form of the script, a list of all the pronouns in the Address, sorted by the number of occurrences, can be generated:

$ summarize.pl pronouns t address.pos
we	10
it	5
they	3
who	3
us	3
our	2
what	2
their	1

In other words, the word “we” — a particular pronoun — was used 10 times in the Address.

Consequently, I now have a tool enabling me to count the POS in a document.
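The counting step can be sketched in a few lines. What follows is a minimal illustration in Python, not the summarize.pl script itself. The mapping from Tree Tagger codes to simple classes is an assumption covering only the codes in the ten-record sample above, and the function names are hypothetical.

```python
from collections import Counter

# Assumed mapping from a few Tree Tagger codes to "simple" classes. It covers
# only the codes seen in the ten-record sample; everything else is "other".
SIMPLE_POS = {
    "NN": "noun", "NNS": "noun",
    "PP": "pronoun", "PP$": "pronoun",
    "VVD": "verb",
    "RB": "adverb",
    "JJ": "adjective",
    "CC": "conjunction",
}

def read_records(lines):
    """Yield (word, lemma, pos) tuples from tab-delimited lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield tuple(fields)

def summarize_simple(records):
    """Tally records by simple part-of-speech class."""
    return Counter(SIMPLE_POS.get(pos, "other") for _, _, pos in records)

def summarize_class(records, wanted, use_lemmas=False):
    """Tally the tokens (or lemmas) that belong to one simple class."""
    tally = Counter()
    for word, lemma, pos in records:
        if SIMPLE_POS.get(pos, "other") == wanted:
            tally[(lemma if use_lemmas else word).lower()] += 1
    return tally

# The first ten records of the Gettysburg Address, as listed above.
sample = [
    "Four\tFour\tCD", "score\tscore\tNN", "and\tand\tCC",
    "seven\tseven\tCD", "years\tyear\tNNS", "ago\tago\tRB",
    "our\tour\tPP$", "fathers\tfather\tNNS", "brought\tbring\tVVD",
    "forth\tforth\tRB",
]
records = list(read_records(sample))

for pos, count in summarize_simple(records).most_common():
    print(pos, count, sep="\t")
print(summarize_class(records, "noun", use_lemmas=True).most_common())
```

Tallying lemmas instead of tokens folds “years” into “year” and “fathers” into “father”, which is the distinction the second command line argument controls.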

Preliminary analysis

I now have the tools necessary to answer one of my initial questions, “Do some works contain a greater number of nouns, verbs, and adjectives than others?” To answer this I collected nine sets of documents for analysis:

Henry David Thoreau’s Excursions (73,734 words; Flesch readability score: 57)
Henry David Thoreau’s Walden (106,486 words; Flesch readability score: 55)
Henry David Thoreau’s A Week on the Concord and Merrimack Rivers (117,670 words; Flesch readability score: 56)
Jane Austen’s Sense and Sensibility (119,625 words; Flesch readability score: 54)
Jane Austen’s Northanger Abbey (76,497 words; Flesch readability score: 58)
Jane Austen’s Emma (156,509 words; Flesch readability score: 60)
all of the works of Plato (1,162,460 words; Flesch readability score: 54)
all of the works of Aristotle (950,078 words; Flesch readability score: 50)
all of the works of Shakespeare (856,594 words; Flesch readability score: 72)

Using tag.pl I created POS files for each set of documents. I then used summarize.pl to output counts of the simple POS from each POS file. For example, after creating a POS file for Walden, I summarized the results and learned that it contains 23,272 nouns, 10,068 pronouns, 8,118 adjectives, etc.:

$ summarize.pl simple t walden.pos
noun	23272
pronoun	10068
adjective	8118
verb	17695
adverb	8289
determiner	13494
preposition	16557
conjunction	5921
interjection	37
symbol	997
punctuation	14377
other	2632

I then copied this information into a spreadsheet and calculated the relative percentage of each POS discovering that 19% of the words in Walden are nouns, 8% are pronouns, 7% are adjectives, etc. See the table below:

POS	%
noun	19
pronoun	8
adjective	7
verb	15
adverb	7
determiner	11
preposition	14
conjunction	5
interjection	0
symbol	1
punctuation	12
other	2
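
The spreadsheet step is simple arithmetic and can also be scripted. The sketch below, written in Python with a hypothetical function name, divides each Walden count by the total of all twelve classes (punctuation and symbols included) and rounds to the nearest whole number, which reproduces the percentages in the table above.

```python
# Simple POS counts for Walden, copied from the summarize.pl output above.
walden = {
    "noun": 23272, "pronoun": 10068, "adjective": 8118, "verb": 17695,
    "adverb": 8289, "determiner": 13494, "preposition": 16557,
    "conjunction": 5921, "interjection": 37, "symbol": 997,
    "punctuation": 14377, "other": 2632,
}

def relative_percentages(counts):
    """Express each count as a whole-number percentage of the grand total."""
    total = sum(counts.values())
    return {pos: round(100 * n / total) for pos, n in counts.items()}

percentages = relative_percentages(walden)
for pos, pct in percentages.items():
    print(pos, pct, sep="\t")
```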

I repeated this process for each of the nine sets of documents and tabulated them here:

POS	Excursions	Rivers	Walden	Sense	Northanger	Emma	Aristotle	Shakespeare	Plato	Average
noun	20	20	19	17	17	17	19	25	18	19
verb	14	14	15	16	16	17	15	14	15	15
punctuation	13	13	12	15	15	15	11	16	13	14
preposition	13	13	14	13	13	12	15	9	14	13
determiner	12	12	11	7	8	7	13	6	11	10
pronoun	7	7	8	12	11	11	5	11	7	9
adverb	6	6	7	8	8	8	6	6	6	7
adjective	7	7	7	5	6	6	7	5	6	6
conjunction	5	5	5	3	3	3	5	3	6	4
other	2	2	2	3	3	3	3	3	3	3
symbol	1	1	1	1	1	0	1	2	1	1
interjection	0	0	0	0	0	0	0	0	0	0
Percentage and average of parts-of-speech usage in 9 works or corpora

The result was very surprising to me. Despite the wide range of document sizes, and despite the wide range of genres, the relative percentages of POS are very similar across all of the documents. The last column in the table represents the average percentage of each POS use. Notice how each individual POS value differs very little from the average.
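
One way to put a number on “differs very little” is to compute, for each POS, the average across the nine columns and the work that strays farthest from it. The following Python sketch uses the percentages from the table above; the function name is hypothetical, and the measure (largest absolute deviation from the mean) is just one reasonable choice, not part of the original analysis.

```python
from statistics import mean

works = ["Excursions", "Rivers", "Walden", "Sense", "Northanger",
         "Emma", "Aristotle", "Shakespeare", "Plato"]

# Percentages copied from the table above, in the same column order.
table = {
    "noun":         [20, 20, 19, 17, 17, 17, 19, 25, 18],
    "verb":         [14, 14, 15, 16, 16, 17, 15, 14, 15],
    "punctuation":  [13, 13, 12, 15, 15, 15, 11, 16, 13],
    "preposition":  [13, 13, 14, 13, 13, 12, 15, 9, 14],
    "determiner":   [12, 12, 11, 7, 8, 7, 13, 6, 11],
    "pronoun":      [7, 7, 8, 12, 11, 11, 5, 11, 7],
    "adverb":       [6, 6, 7, 8, 8, 8, 6, 6, 6],
    "adjective":    [7, 7, 7, 5, 6, 6, 7, 5, 6],
    "conjunction":  [5, 5, 5, 3, 3, 3, 5, 3, 6],
    "other":        [2, 2, 2, 3, 3, 3, 3, 3, 3],
    "symbol":       [1, 1, 1, 1, 1, 0, 1, 2, 1],
    "interjection": [0, 0, 0, 0, 0, 0, 0, 0, 0],
}

def farthest(pos):
    """Return the work farthest from the average for a POS, and by how much."""
    values = table[pos]
    average = mean(values)
    i = max(range(len(values)), key=lambda k: abs(values[k] - average))
    return works[i], abs(values[i] - average)

for pos, values in table.items():
    work, deviation = farthest(pos)
    print(f"{pos}\t{round(mean(values))}\t{work}\t{deviation:.1f}")
```

Rounding each mean reproduces the Average column, and the deviations show which columns depart most from it, such as the nouns in Shakespeare.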

This analysis can be illustrated in a couple of ways. First, below are nine pie charts. Each slice of each pie represents a different POS. Notice how all the dark blue slices (nouns) are very similar in size. Notice how all the red slices (verbs), again, are very similar. The only noticeable exception is in Shakespeare where there is a greater number of nouns and pronouns (dark green).

Thoreau’s Excursions	Thoreau’s Walden	Thoreau’s Rivers
Austen’s Sense	Austen’s Northanger	Austen’s Emma
all of Plato	all of Aristotle	all of Shakespeare

The similarity across all the documents can be further illustrated with a line graph:

Across the X axis is each POS. Up and down the Y axis is the percentage of usage. Notice how the values for each POS in each document are closely clustered. Each set of documents uses relatively the same number of nouns, pronouns, verbs, adjectives, adverbs, etc.

Maybe such a relationship between POS is one of the patterns of well-written documents? Maybe it is representative of works standing the test of time? I don’t know, but I doubt I am the first person to make such an observation.

Conclusion

My initial questions were, “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?” and “Do some works contain a greater number of nouns, verbs, and adjectives than others?” Based on this foray and rudimentary analysis the answers are, “No, there are not significant differences, and no, works do not contain different numbers of nouns, verbs, adjectives, etc.”

Of course, such a conclusion is faulty without further calculations. I will quite likely commit an error of induction if I base my conclusions on a sample of only nine items. While it would require a greater amount of effort on my part, it is not beyond possibility for me to calculate the average POS usage for every item in my Alex Catalogue. I know there will be some differences — especially considering the items having gone through optical character recognition — but I do not know the degree of difference. Such an investigation is left for a later time.

Instead, I plan to pursue a different line of investigation. The current work examined how texts were constructed, but in actuality I am more interested in the meanings works express. I am interested in what they say more than how they say it. Such meanings may be gleaned not so much from gross POS measurements but rather the words used to denote each POS. For example, the following table lists the 10 most frequently used pronouns and the number of times they occur in four works. Notice the differences:

Walden	Rivers	Northanger	Sense
I (1,809)	it (1,314)	her (1,554)	her (2,500)
it (1,507)	we (1,101)	I (1,240)	I (1,917)
my (725)	his (834)	she (1,089)	it (1,711)
he (698)	I (756)	it (1,081)	she (1,553)
his (666)	our (677)	you (906)	you (1,158)
they (614)	he (649)	he (539)	he (1,068)
their (452)	their (632)	his (524)	his (1,007)
we (447)	they (632)	they (379)	him (628)
its (351)	its (487)	my (342)	my (598)
who (340)	who (352)	him (278)	they (509)

While the lists are similar, they are characteristic of the works from which they came. The first, Walden, is about an individual who lives on a lake. Notice the prominence of the words “I” and “my”. The second, Rivers, is written by the same author as the first but is about brothers who canoe down a river. Notice the higher occurrence of the words “we” and “our”. The latter two works, both written by Jane Austen, are works with females as central characters. Notice how the words “her” and “she” appear in these lists but not in the former two. (Compare these lists of pronouns with the list from Lincoln’s Address and even more interesting things appear.) It looks as if there are patterns or trends to be measured here.

‘More later.

Visualizing co-occurrences with Protovis

Eric Lease Morgan — Mon, 10 Jan 2011 00:34:21 +0000

This posting describes how I am beginning to visualize co-occurrences with a Javascript library called Protovis. Alternatively, I am trying to answer the question, “What did Henry David Thoreau say in the same breath when he used the word ‘walden’?”

“In the same breath”

Network diagrams are great ways to illustrate relationships. In such diagrams nodes represent some sort of entity, and lines connecting nodes represent some sort of relationship. Nodes clustered together and sharing many lines denote some kind of similarity. Conversely, nodes whose lines are long and not interconnected represent entities outside the norm or at a distance. Network diagrams are a way of visualizing complex relationships.

Are you familiar with the phrase “in the same breath”? It is usually used to denote the relationship between two or more ideas. “He mentioned both ‘love’ and ‘war’ in the same breath.” This is exactly one of the things I want to do with texts. Concordances provide this sort of functionality. Given a word or phrase, a concordance will find the query in a corpus and display the words on either side of it. As KWIC (key word in context) indexes, concordances make it easier to read how words or phrases are used in relationship with their surrounding words. The use of network diagrams seems like a good way to see, to visualize, how words or phrases are used within the context of surrounding words.

Protovis is a Javascript charting library developed by the Stanford Visualization Group. Using Protovis a developer can create all sorts of traditional graphs (histograms, box plots, line charts, pie charts, scatter plots) through a relatively easy-to-learn API (application programming interface). One of the graphs Protovis supports is an interactive simulation of network diagrams called “force-directed layouts”. After experiencing some of the work done by a few of my colleagues (“Thank you Michael Clark and Ed Summers”), I wondered whether or not network diagrams could be used to visualize co-occurrences in texts. After discovering Protovis, I decided to try to implement something along these lines.

Implementation

The implementation of the visualization requires the recursive creation of a term matrix. Given a word (or regular expression), find the query in a text (or corpus). Identify and count the d most frequently used words within b number of characters. Repeat this process d times with each co-occurrence. For example, suppose the text is Walden by Henry David Thoreau, the query is “spring”, d is 5, and b is 50. The implementation finds all the occurrences of the word “spring”, gets the text 50 characters on either side of it, finds the 5 most commonly used words in those characters, and repeats the process for each of those words. The result is the following matrix:

spring	day	morning	first	winter
day	days	night	every	today
morning	spring	say	day	early
first	spring	last	yet	though
winter	summer	pond	like	snow

Thus, the most common co-occurrences for the word “spring” are “day”, “morning”, “first”, and “winter”. Each of these co-occurrences is recursively used to find more co-occurrences. In this example, the word “spring” co-occurs with times of day and seasons. These words then co-occur with more times of day and more seasons. Similarities and patterns begin to emerge. Depending on the complexity of a writer’s sentence structure, the value of b (“breath”) may need to be increased or decreased. As the value of d (“detail”) is increased or decreased, so is the number of co-occurrences returned.
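
The process can be sketched as follows. This is a minimal Python illustration of the idea, not the Perl behind the visualizations. The function names, the tiny stop word list, and the decision to drop words cut in half at the edge of a “breath” are all assumptions.

```python
import re
from collections import Counter

# A deliberately tiny, assumed stop word list; a real one would be far longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "it", "that"}

def cooccurrences(text, query, breath=50, detail=5):
    """Return the `detail` most frequent words within `breath` characters of `query`."""
    tally = Counter()
    for m in re.finditer(r"\b(?:" + query + r")\b", text, flags=re.IGNORECASE):
        lo = max(0, m.start() - breath)
        hi = min(len(text), m.end() + breath)
        left, right = text[lo:m.start()], text[m.end():hi]
        # Drop a word cut in half at the outer edge of either window.
        if lo > 0 and not text[lo - 1].isspace():
            left = re.sub(r"^\S+", "", left)
        if hi < len(text) and not text[hi].isspace():
            right = re.sub(r"\S+$", "", right)
        for word in re.findall(r"[a-z]+", (left + " " + right).lower()):
            if word in STOPWORDS or re.fullmatch(query, word, flags=re.IGNORECASE):
                continue
            tally[word] += 1
    return [word for word, _ in tally.most_common(detail)]

def term_matrix(text, query, breath=50, detail=5):
    """Build one row for the query, then one row for each of its co-occurrences."""
    first = cooccurrences(text, query, breath, detail)
    matrix = [[query] + first]
    for term in first:
        matrix.append([term] + cooccurrences(text, re.escape(term), breath, detail))
    return matrix

text = "spring pond. spring pond. spring morning."
for row in term_matrix(text, "spring", breath=10, detail=2):
    print("\t".join(row))
```

Measuring the “breath” in characters rather than words keeps the window independent of tokenization, at the cost of having to discard truncated words at the edges.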

Once this matrix is constructed, Protovis requires it to be converted into a simple JSON (Javascript Object Notation) data structure. In this example, “spring” points to “day”, “morning”, “first”, and “winter”. “Day” points to “days”, “night”, “every”, and “today”. Etc. As terms point to multiples of other terms, a network diagram is manifested, and the magic of Protovis is put to work. See the following illustration:

“spring” in Walden
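
The conversion from term matrix to JSON might look something like the Python sketch below. The field names (“nodeName”, “source”, “target”, “value”) are assumptions modeled on the sample data that accompanies Protovis’s force-directed example; they are not taken from the scripts used here, so check them against the Protovis documentation before relying on them.

```python
import json

def matrix_to_network(matrix):
    """Convert rows of [term, co-occurrence, ...] into lists of nodes and links."""
    index = {}            # term -> position in the nodes list
    nodes, links = [], []

    def node_id(term):
        if term not in index:
            index[term] = len(nodes)
            nodes.append({"nodeName": term})
        return index[term]

    for row in matrix:
        source = node_id(row[0])
        for term in row[1:]:
            links.append({"source": source, "target": node_id(term), "value": 1})
    return {"nodes": nodes, "links": links}

# The "spring" matrix from Walden, as tabulated above.
matrix = [
    ["spring", "day", "morning", "first", "winter"],
    ["day", "days", "night", "every", "today"],
    ["morning", "spring", "say", "day", "early"],
    ["first", "spring", "last", "yet", "though"],
    ["winter", "summer", "pond", "like", "snow"],
]
network = matrix_to_network(matrix)
print(json.dumps(network, indent=2))
```

Because each term becomes a node only once, a word that appears in several rows (such as “spring” or “day”) collects several links, and that sharing is what pulls related terms together in the diagram.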

It is interesting enough to see the co-occurrences of any given word in a text, but it is even more interesting to compare the co-occurrences between texts. Below are a number of visualizations from Thoreau’s Walden. Notice how the word “walden” frequently co-occurs with the words “pond”, “water”, and “woods”. This makes a lot of sense because Walden Pond is a pond located in the woods. Notice how the word “fish” is associated with “pond”, “fish”, and “fishing”. Pretty smart, huh?

“walden” in Walden

“fish” in Walden

“woodchuck” in Walden

“woods” in Walden

Compare these same words with the co-occurrences in a different work by Thoreau, A Week on the Concord and Merrimack Rivers. Given the same inputs the outputs are significantly different. For example, notice the difference in co-occurrences given the word “woodchuck”.

“walden” in Rivers

“fish” in Rivers

“woodchuck” in Rivers

“woods” in Rivers

Give it a try

Give it a try for yourself. I have written three CGI scripts implementing the things outlined above:

In each implementation you are given the opportunity to input your own queries, define the “size of the breath”, and the “level of detail”. The result is an interactive network diagram visualizing the most frequent co-occurrences of a given term.

The root of the Perl source code is located at http://infomotions.com/sandbox/network-diagrams/.

Implications for librarianship

The visualization of co-occurrences obviously has implications for text mining and the digital humanities, but it also has implications for the field of librarianship.

Given the current environment where data and information abound in digital form, libraries have found themselves in an increasingly competitive environment. What are libraries to do? Lest they become marginalized, librarians cannot rest on their “public good” laurels. Merely providing access to information is not good enough. Everybody feels as if they have plenty of access to information. What is needed are methods and tools for making better use of the data and information they acquire. Implementing text mining and visualization interfaces is one way to accomplish that goal within the context of online library services. Do a search in the “online catalog”. Create a subset of interesting content. Click a button to read the content from a distance. Provide ways to analyze and summarize the content thus saving the time of the reader.

We librarians have to do something differently. Think like an entrepreneur. Take account of your resources. Examine the environment. Innovate and repeat.

MIT’s SIMILE timeline widget

Eric Lease Morgan — Tue, 21 Dec 2010 01:04:04 +0000

For a good time, I took a stab at learning how to implement an MIT SIMILE timeline widget. This posting describes what I learned.

Background

The MIT SIMILE Widgets are a set of cool Javascript tools. There are tools for implementing “exhibits”, time plots, “cover flow” displays a la iTunes, a couple of other things, and interactive timelines. I have always had a fondness for timelines since college when I created one to help me study for my comprehensive examinations. Combining this interest with the rise of digital humanities and my belief that library data is too textual in nature, I decided to learn how to use the timeline widget. Maybe this tool can be used in Library Land?

Screen shot of local timeline implementation

Implementation

The family of SIMILE Widgets Web pages includes a number of sample timelines. By playing with the examples you can see the potential of the tool. Going through the Getting Started guide was completely necessary since the Widget documentation has been written, re-written, and moved to other platforms numerous times. Needless to say, I found the instructions difficult to use. In a nutshell, using the Timeline Widget requires the developer to:

load the libraries
create and modify a timeline object
create a data file
load the data file
render the timeline

Taking hints from “timelines in the wild”, I decided to plot my writings, dating from 1989 to the present. Luckily, just about all of them are available via RSS (Really Simple Syndication), and they include:

Consequently, after writing my implementation’s framework, the bulk of the work was spent converting RSS files into an XML file the widget could understand. In the end I:

created an HTML file complete with the widget framework
downloaded the totality of RSS entries from all my RSS feeds
wrote a slightly different XSL file for each RSS feed
wrote a rudimentary shell script to loop through each XSL/RSS combination and create a data file
put the whole thing on the Web

You can see the fruits of these labors on a page called Eric Lease Morgan’s Writings Timeline, and you can download the source code — timeline-2010-12-20.tar.gz. From there a person can scroll backwards and forwards in time, click on events, read an abstract of the writing, and hyperlink to the full text. The items from the Water Collection work in the same way but also include a thumbnail image of the water. Fun!?
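
The heart of the work, turning RSS items into a data file, can be illustrated without XSL. The sketch below swaps in Python’s standard library for the XSL/shell combination described above, so it is an alternative and not the code in the tarball. The element and attribute names (data, event, start, title, link) and the date-time-format attribute are written from memory of the widget’s XML event format and should be treated as assumptions to verify against the Getting Started guide.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def rss_to_timeline(rss_text):
    """Turn RSS 2.0 items into a <data> document of <event> elements."""
    # Assumption: the widget reads ISO 8601 dates when told to via this attribute.
    data = ET.Element("data", {"date-time-format": "iso8601"})
    for item in ET.fromstring(rss_text).iter("item"):
        published = item.findtext("pubDate")
        if not published:
            continue  # an event without a date cannot be plotted
        event = ET.SubElement(data, "event")
        # RSS dates are RFC 822 style; normalize them to a single format.
        event.set("start", parsedate_to_datetime(published).isoformat())
        event.set("title", item.findtext("title", default=""))
        event.set("link", item.findtext("link", default=""))
        event.text = item.findtext("description", default="")
    return ET.tostring(data, encoding="unicode")

# A made-up, one-item feed for illustration only.
sample = """<rss version="2.0"><channel><title>Sample</title>
<item>
  <title>Timeline widget</title>
  <link>http://example.org/timeline</link>
  <pubDate>Tue, 21 Dec 2010 01:04:04 +0000</pubDate>
  <description>Learning how to implement a timeline.</description>
</item>
</channel></rss>"""

print(rss_to_timeline(sample))
```

Normalizing every date to one unambiguous format in the data file is a cheap guard against Javascript interpreters parsing the same date string differently.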

Take-aways

I have a number of take-aways. First, my implementation is far from perfect. For example, the dates from the Water Collection are not correctly formatted in the data file. Consequently, different Javascript interpreters render the dates differently. Specifically, the Water Collection links do not show up in Safari, but they do show up in Firefox. Second, the timeline is quite cluttered in some places. There has got to be a way to address this. Third, timelines are a great way to visualize events. From the implementation you can readily see how often I was writing and on what topics. The presentation makes so much more sense compared to a simple list sorted by date, title, or subject terms.

Library “discovery systems” could benefit from the implementation of timelines. Do a search. Get back a list of results. Plot them on a timeline. Allow the learner, teacher, or scholar to visualize — literally see — how the results of their query compare to one another. The ability to visualize information hinges on the ability to quantify information characteristics. In this case, the quantification is a set of dates. Alas, dates in our information systems are poorly recorded. It seems as if we — the library profession — have made it difficult for ourselves to participate in the current information environment.

Illustrating IDCC 2010

Eric Lease Morgan — Thu, 09 Dec 2010 01:52:48 +0000

This posting illustrates the “tweets” assigned to the hash tag #idcc10.

I more or less just got back from the 6th International Data Curation Conference that took place in Chicago (Illinois). Somewhere along the line I got the idea of applying digital humanities computing techniques against the conference’s Twitter feed — hash tag #idcc10. After installing a Perl module implementing the Twitter API (Net::Twitter::Lite), I wrote a quick hack, fed the results to Wordle, and got the following word cloud:

What sorts of conclusions can you make based on the content of the graphic?

The output is static and rudimentary. What I’d really like to do is illustrate the tweets over time. Get the oldest tweets. Illustrate the result. Get the newer tweets. Update the illustration. Repeat for all the tweets. Done. In the end I see some sort of moving graphic where significant words are represented by bubbles. The bubbles grow in size depending on the number of times the words are used. Each bubble is attached to other bubbles with a line representing associations. The color of the bubbles might represent parts of speech. Using this technique a person could watch the ebb and flow of the virtual conversation.

For a good time, you can also download the Perl script used to create the textual output. Called twitter.pl, it is only forty-three lines long and many of those lines are comments.

Ruler & Compass by Andrew Sutton

Eric Lease Morgan — Mon, 06 Dec 2010 01:44:48 +0000

I recently and most thoroughly enjoyed reading and learning from a book called Ruler & Compass by Andrew Sutton.

The other day, while perusing the bookstore for a basic statistics book, I came across Ruler & Compass by Andrew Sutton. Having always been intrigued by geometry and the use of only a straight edge and compass to describe a Platonic cosmos, I purchased this very short book, a ruler, and a compass with little hesitation. I then rushed home to draw points, lines, and circles for the purposes of constructing angles, perpendiculars, bisected angles, tangents, all sorts of regular polygons, and combinations of all the above to create beautiful geometric patterns. I was doing mathematics, but not a single number was to be seen. Yes, I did create ratios but not with integers, and instead with the inherent lengths of lines. Fascinating!


triangle

square	pentagon

hexagon	ellipse	“golden” ratio

Geometry is not a lot unlike both music and computer programming. All three supply the craftsman with a set of basic tools. Points. Lines. Circles. Tones. Durations. Keys. If-then statements. Variables. Outputs. Given these “things” a person is empowered to combine, compound, synthesize, analyze, create, express, and describe. They are mediums for both the artist and the scientist. Using them effectively requires thinking as well as “thinquing”. All three are arscient processes.

Anybody could benefit by reading Sutton’s book and spending a few lovely hours practicing the geometric constructions contained therein. I especially recommend this activity to my fellow librarians. The process is not only intellectually stimulating but invigorating. Librarianship is not all about service or collections. It is also about combining and reconstituting core principles: collection, organization, preservation, and dissemination. There is an analogy waiting to be seen here. Reading and doing the exercises in Ruler & Compass will make this plainly visible.

Text mining Charles Dickens

Eric Lease Morgan — Sat, 04 Dec 2010 13:03:30 +0000

This posting outlines how a person can do a bit of text mining against three works by Charles Dickens using a set of two Perl modules — Lingua::EN::Ngram and Lingua::Concordance.

Lingua::EN::Ngram

I recently wrote a Perl module called Lingua::EN::Ngram. Its primary purpose is to count all the ngrams (two-word phrases, three-word phrases, n-word phrases, etc.) in a given text. For two-word phrases (bigrams) it will order the output according to a statistical probability (t-score). Given a number of texts, it will count the ngrams common across the corpus. As of version 0.02 it supports non-ASCII characters, making it possible to correctly read and parse a greater number of Romance languages, meaning it correctly interprets characters with diacritics. Lingua::EN::Ngram is available from CPAN.
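
For readers who do not use Perl, the three ideas (counting ngrams, scoring bigrams, and finding ngrams common to several texts) can be sketched in Python. This is not the module’s code or its API. The function names are hypothetical, and the t-score shown is the common collocation approximation, (observed - expected) / sqrt(observed), which may differ from the module’s exact calculation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words; \\w also matches letters with diacritics."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_tscores(tokens):
    """Score each bigram with t = (observed - expected) / sqrt(observed)."""
    total = len(tokens)
    words = Counter(tokens)
    scores = {}
    for bigram, observed in ngrams(tokens, 2).items():
        first, second = bigram.split(" ")
        expected = words[first] * words[second] / total
        scores[bigram] = (observed - expected) / math.sqrt(observed)
    return scores

def common_ngrams(texts, n):
    """Return the ngrams of length n found in every one of the texts."""
    return set.intersection(*(set(ngrams(tokenize(t), n)) for t in texts))

tokens = tokenize("a violent fit of laughter and a violent fit of coughing")
scores = bigram_tscores(tokens)
for bigram in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(f"{scores[bigram]:.2f}\t{bigram}")
```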

Lingua::Concordance

Concordances are just about the oldest of textual analysis tools. Originally developed in the Late Middle Ages to analyze the Bible, they are essentially KWIC (keyword in context) indexes used to search and display ngrams within the greater context of a work. Given a text (such as a book or journal article) and a query (regular expression), Lingua::Concordance can display the occurrences of the query in the text as well as map their locations across the entire text. In a previous blog posting I used Lingua::Concordance to compare & contrast the use of the phrase “good man” in the works of Aristotle, Plato, and Shakespeare. Lingua::Concordance too is available from CPAN.
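
Both functions of a concordance, displaying matches in context and mapping where they fall in a text, are easy to sketch. The Python below is an illustration of the technique only. It is not Lingua::Concordance, and the function names and defaults are assumptions.

```python
import re

def concordance(text, query, radius=30):
    """Return one keyword-in-context line per match of the regular expression."""
    flat = re.sub(r"\s+", " ", text)   # collapse line breaks so each hit is one line
    lines = []
    for m in re.finditer(query, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - radius):m.start()]
        right = flat[m.end():m.end() + radius]
        # Right-justify the left context so the matches line up in a column.
        lines.append(f"{left:>{radius}}{m.group()}{right}")
    return lines

def positions(text, query, bins=10):
    """Count the matches falling in each of `bins` equal slices of the text."""
    counts = [0] * bins
    for m in re.finditer(query, text, flags=re.IGNORECASE):
        counts[min(bins - 1, m.start() * bins // len(text))] += 1
    return counts

sample = "He fell into a violent fit of coughing, and then a violent fit of laughter."
for line in concordance(sample, r"violent", radius=20):
    print(line)
print(positions(sample, r"violent", bins=4))
```

With bins set to 10, the first number returned by positions is the count of matches in the first ten percent of a text, which is the sort of figure a bar chart of a word’s locations is built from.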

Charles Dickens

In keeping with the season, I wondered about Charles Dickens’s A Christmas Carol. How often is the word “Christmas” used in the work and where? In terms of size, how does A Christmas Carol compare to some of Dickens’s other works? Are there sets of commonly used words or phrases between those texts?

Answering the first question was relatively easy. The word “Christmas” occurs eighty-six (86) times, and twenty-two (22) of those occurrences are in the first ten percent (10%) of the story. The following bar chart illustrates these facts:

The length of books (or just about any text) measured in pages is ambiguous, at best. A much more meaningful measure is number of words. The following table lists the sizes, in words, of three Dickens stories:

story	size in words
A Christmas Carol	28,207
Oliver Twist	156,955
David Copperfield	355,203

For some reason I thought A Christmas Carol was much longer.

A long time ago I calculated the average size (in words) of the books in my Alex Catalogue. Once I figured this out, I discovered I could describe items in the collection based on relative sizes. The following “dial” charts bring the point home. Each one of the books is significantly different in size:

A Christmas Carol

Oliver Twist

David Copperfield

If a person were pressed for time, then which story would they be able to read?

After looking for common ngrams between texts, I discovered that “taken with a violent fit of” appears in both David Copperfield and A Christmas Carol. Interesting!? Moreover, the phrase “violent fit” appears in all three works. Specifically, characters in these three Dickens stories have violent fits of laughter, crying, trembling, and coughing. By concatenating the stories together and applying concordancing methods I see there are quite a number of violent things in the three stories:

  n such breathless haste and violent agitation, as seemed to betoken so
  ood-night, good-night!' The violent agitation of the girl, and the app
  sberne) entered the room in violent agitation. 'The man will be taken,
  o understand that, from the violent and sanguinary onset of Oliver Twi
  one and all, to entertain a violent and deeply-rooted antipathy to goi
  eep a little register of my violent attachments, with the date, durati
  cal laugh, which threatened violent consequences. 'But, my dear,' said
  in general, into a state of violent consternation. I came into the roo
  artly to keep pace with the violent current of her own thoughts: soon 
  ts and wiles have brought a violent death upon the head of one worth m
   There were twenty score of violent deaths in one long minute of that 
  id the woman, making a more violent effort than before; 'the mother, w
   as it were, by making some violent effort to save himself from fallin
  behind. This was rather too violent exercise to last long. When they w
   getting my chin by dint of violent exertion above the rusty nails on 
  en who seem to have taken a violent fancy to him, whether he will or n
  peared, he was taken with a violent fit of trembling. Five minutes, te
  , when she was taken with a violent fit of laughter; and after two or 
  he immediate precursor of a violent fit of crying. Under this impressi
  and immediately fell into a violent fit of coughing: which delighted T
  of such repose, fell into a violent flurry, tossing their wild arms ab
   and accompanying them with violent gesticulation, the boy actually th
  ght I really must have laid violent hands upon myself, when Miss Mills
   arm tied up, these men lay violent hands upon him -- by doing which, 
   every aggravation that her violent hate -- I love her for it now -- c
   work himself into the most violent heats, and deliver the most wither
  terics were usually of that violent kind which the patient fights and 
   me against the donkey in a violent manner, as if there were any affin
   to keep down by force some violent outbreak. 'Let me go, will you,--t
  hands with me - which was a violent proceeding for him, his usual cour
  en.' 'Well, sir, there were violent quarrels at first, I assure you,' 
  revent the escape of such a violent roar, that the abused Mr. Chitling
  t gradually resolved into a violent run. After completely exhausting h
  , on which he ever showed a violent temper or swore an oath, was this 
  ullen, rebellious spirit; a violent temper; and an untoward, intractab
  fe of Oliver Twist had this violent termination or no. CHAPTER III REL
  in, and seemed to presage a violent thunder-storm, when Mr. and Mrs. B
  f the theatre, are blind to violent transitions and abrupt impulses of
  ming into my house, in this violent way? Do you want to rob me, or to

These observations simply beg other questions. Is violence a common theme in Dickens’s works? What other adjectives are used to a greater or lesser degree in Dickens’s works? How does the use of these adjectives differ from other authors of the same time period or within the canon of English literature?

Summary

Given the combination of the Internet, copious amounts of freely available full text, and ubiquitous as well as powerful desktop computing, it is now possible to analyze texts in ways that were not feasible twenty years ago. While the application of computing techniques against texts dates back to at least Father Busa’s concordance work in the 1960s, it has only been in the last decade that digital humanities has come into its own. The application of digital humanities to library work offers great opportunities for the profession. Their goals are similar and their tools are complementary. From my point of view, their combination is a marriage made in heaven.

A .zip file of the texts and scripts used to do the analysis is available for you to download and experiment with yourself. Enjoy.

AngelFund4Code4Lib

Eric Lease Morgan — Thu, 02 Dec 2010 13:01:28 +0000

The second annual AngelFund4Code4Lib — a $1,500 stipend to attend Code4Lib 2011 — is now accepting applications.

These are difficult financial times, but we don’t want this to dissuade people from attending Code4Lib. [1] Consequently a few of us have gotten together, pooled our resources, and made AngelFund4Code4Lib available. Applying for the stipend is easy. In 500 words or less, write what you hope to learn at the conference and email it to angelfund4code4lib@infomotions.com. We will then evaluate the submissions and select the awardee. In exchange for the financial resources, and in keeping with the idea of giving back to the community, the awardee will be expected to write a travelogue describing their take-aways and post it to the Code4Lib mailing list.

The deadline for submission is 5 o’clock (Pacific Time), Thursday, December 17. The awardee will be announced no later than Friday, January 7.

Submit your application. We look forward to helping you out.

If you would like to become an “angel” too, then drop us a line. We’re open to possibilities.

P.S. Check out the additional Code4Lib scholarships. [2]

[1] Code4Lib 2011 – http://code4lib.org/conference/2011/
[2] additional scholarships – http://bit.ly/dLGnnx

Eric Lease Morgan,
Michael J. Giarlo, and
Eric Hellman

Crowd sourcing the Great Books

Eric Lease Morgan — Sat, 06 Nov 2010 16:16:24 +0000

This posting describes how crowd sourcing techniques are being used to determine the “greatness” of the Great Books.

The Great Books of the Western World is a set of books authored by “dead white men” — Homer to Dostoevsky, Plato to Hegel, and Ptolemy to Darwin. [1] In 1952 each item in the set was selected because the set’s editors thought the selections significantly discussed any number of their 102 Great Ideas (art, cause, fate, government, judgement, law, medicine, physics, religion, slavery, truth, wisdom, etc.). By reading the books, comparing them with one another, and discussing them with fellow readers, a person was expected to foster their on-going liberal arts education. Think of it as “life long learning” for the 1950s.

I have devised and implemented a mathematical model for denoting the “greatness” of any book. The model is based on term frequency inverse document frequency (TFIDF). It is far from complete, nor has it been verified. In an effort to address the latter, I have created the Great Books Survey. Specifically, I am asking people to vote on which books they consider greater. If the end result is similar to the output of my model, then the model may be said to represent reality.

The survey itself is an implementation of the Condorcet method. (“Thanks Andreas.”) First, I randomly select one of the Great Ideas. I then randomly select two of the Great Books. Finally, I ask the poll-taker to choose the “greater” of the two books based on the given Great Idea. For example, the randomly selected Great Idea may be war, and the randomly selected Great Books may be Shakespeare’s Hamlet and Plato’s Republic. I then ask, “Which book is ‘greater’ in terms of war?” The answer is recorded and an additional question is generated. The survey is never-ending. After hundreds of thousands of votes are garnered I hope to learn which books are the greatest because they got the greatest number of votes.
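The question-and-tally loop described above can be sketched in a few lines of Python. This is only an illustration, not the survey's actual code: the lists of ideas and books, the function names, and the simulated random voter are all assumptions made for the example, and the tally is a simple count of wins rather than a full Condorcet winner computation.

```python
import random
from collections import Counter

# Hypothetical sample data; the real survey draws on 102 ideas and many more books.
IDEAS = ["war", "love", "truth"]
BOOKS = ["Hamlet", "Republic", "Don Quixote", "Faust"]

def make_question(rng):
    """Pick one idea and two distinct books at random."""
    idea = rng.choice(IDEAS)
    first, second = rng.sample(BOOKS, 2)
    return idea, first, second

def record_vote(tally, idea, winner, loser):
    """Record one pairwise preference for the given idea."""
    tally[(idea, winner, loser)] += 1

def wins_by_book(tally):
    """Count how many votes each book has won across all ideas."""
    wins = Counter()
    for (_idea, winner, _loser), count in tally.items():
        wins[winner] += count
    return wins

rng = random.Random(0)
tally = Counter()
for _ in range(1000):
    idea, first, second = make_question(rng)
    # Simulate a voter who answers at random (an assumption for this demo).
    winner, loser = (first, second) if rng.random() < 0.5 else (second, first)
    record_vote(tally, idea, winner, loser)

for book, count in wins_by_book(tally).most_common():
    print(book, count)
```

Because each vote is stored with its idea, the same tally could also be filtered to rank books for a single idea.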

Because the survey results are saved in an underlying database, it is trivial to produce immediate feedback. For example, I can instantly return which books have been voted greatest for the given idea, how the two given books compare to the given idea, a list of “your” greatest books, and a list of all books ordered by greatness. For a good time, I am also geo-locating voters’ IP addresses and placing them on a world map. (“C’mon Antarctica. You’re not trying!”)

The survey was originally announced on Tuesday, November 2 on the Code4Lib mailing list, Twitter, and Facebook. To date it has been answered 1,247 times by 125 people. Not nearly enough. So far, the top five books are:

Augustine’s City Of God And Christian Doctrine
Cervantes’s Don Quixote
Shakespeare’s Midsummer Night’s Dream
Chaucer’s Canterbury Tales And Other Poems
Goethe’s Faust

There are a number of challenging aspects regarding the validity of the survey. For example, many people feel unqualified to answer some of the randomly generated questions because they have not read the books. My suggestion is, “Answer the question anyway,” because given enough votes, randomly answered questions will cancel themselves out. Second, the definition of “greatness” is ambiguous. It is not intended to be equated with popularity but rather the “imaginative or intellectual content” the book exemplifies. [2] Put in terms of a liberal arts education, greatness is the degree to which a book discusses, defines, describes, or alludes to the given idea more than the other. Third, people have suggested I keep track of how many times people answer with “I don’t know and/or neither”. This is a good idea, but I haven’t implemented it yet.

Please answer the survey 10 or more times. It will take you less than 60 seconds if you don’t think about it too hard and go with your gut reactions. There are no such things as wrong answers. Answer the survey about 100 times, and you may get an idea of what types of “great books” interest you most.

Vote early. Vote often.

[1] Hutchins, Robert Maynard. 1952. Great books of the Western World. Chicago: Encyclopedia Britannica.

[2] Ibid. Volume 3, page 1220.

Great Books data set

Eric Lease Morgan — Sat, 06 Nov 2010 13:42:20 +0000

This posting makes the Great Books data set freely available.

As described previously, I want to answer the question, “How ‘great’ are the Great Books?” In this case I am essentially equating “greatness” with statistical relevance. Specifically, I am using the Great Books of the Western World’s list of “great ideas” as search terms and using them to query the Great Books to compute a numeric value for each idea based on term frequency inverse document frequency (TFIDF). I then sum each of the great idea values for a given book to come up with a total score — the “Great Ideas Coefficient”. The book with the largest Coefficient is then considered the “greatest” book. Along the way and just for fun, I have also kept track of the length of each book (in words) as well as two scores denoting each book’s reading level, and one score denoting each book’s readability.
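As a rough sketch of the idea, the following Python computes a TFIDF score for each idea in each text and sums the scores into a single coefficient per book. The exact weighting used to build the data set is not spelled out here, so the formula (relative term frequency times the natural log of inverse document frequency), the function names, and the three tiny “books” are assumptions made purely for illustration.

```python
import math
import re

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def great_ideas_coefficient(corpus, ideas):
    """Return {title: sum of TFIDF scores over the given ideas}."""
    tokens = {title: tokenize(text) for title, text in corpus.items()}
    n_docs = len(tokens)
    scores = {title: 0.0 for title in tokens}
    for idea in ideas:
        # Document frequency: how many texts mention the idea at all.
        df = sum(1 for words in tokens.values() if idea in words)
        if df == 0:
            continue
        idf = math.log(n_docs / df)
        for title, words in tokens.items():
            tf = words.count(idea) / len(words) if words else 0.0
            scores[title] += tf * idf
    return scores

# Made-up miniature corpus standing in for the full texts.
corpus = {
    "A": "love and war and love",
    "B": "war is fate",
    "C": "truth and wisdom",
}
ideas = ["love", "war", "truth", "fate"]
scores = great_ideas_coefficient(corpus, ideas)
for title in sorted(scores, key=scores.get, reverse=True):
    print(title, round(scores[title], 4))
```

Note that an idea appearing in every text contributes nothing under this formula, since its inverse document frequency is log(1) = 0.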

The result is a canonical XML file named great-books.xml. This file, primarily intended for computer-to-computer transfer, contains all the data outlined above. Since most data analysis applications (like databases, spreadsheets, or statistical packages) do not deal directly with XML, the data was transformed into a comma-separated value (CSV) file — great-books.csv. But even this file, a matrix of 220 rows and 104 columns, can be a bit unwieldy for the uninitiated. Consequently, the CSV file has been combined with a JavaScript library (called DataTables) and embedded into an HTML file for general purpose use — great-books.htm.

The HTML file enables you to sort the matrix by column values. Shift click on columns to do sub-sorts. Limit the set by entering queries into the search box. For example:

sort by the last column (coefficient) and notice how Kant has written the “greatest” book
sort by the column labeled “love” and notice that Shakespeare has written seven (7) of the top ten (10) “greatest books” about love
sort by the column labeled “war” and notice that something authored by the United States is ranked #2 but also has very poor readability scores
sort by things like “angel” or “god”, then ask yourself, “Am I surprised at what I find?”

Even more interesting questions may be asked of the data set. For example, is there a correlation between greatness and readability? If a work has a high love score, then is it likely to have a high (or low) score in one or more of the other columns? What is the greatness of the “typical” Great Book? Is this best represented as the average of the Great Ideas Coefficient or would it be better stated as the value of the mean of all the Great Ideas? In the case of the latter, which books are greater than most, which books are typical, and which books are below typical? This sort of analysis, as well as the “kewl” Web-based implementation, is left up to the gentle reader.

Now ask yourself, “Can all of these sorts of techniques be applied to the principles and practices of librarianship, and if so, then how?”

ECDL 2010: A Travelogue

Eric Lease Morgan — Sun, 10 Oct 2010 13:57:55 +0000

This posting outlines my experiences at the European Conference on Digital Libraries (ECDL), September 7-9, 2010 in Glasgow (Scotland). From my perspective, many of the presentations were about information retrieval and metadata, and the advances in these fields felt incremental at best. This does not mean I did not learn anything, but it does reinforce my belief that find is no longer the current problem to be solved.

University of Glasgow

vaulted ceiling

Adam Smith

Day #1 (Tuesday, September 7)

After the usual logistic introductions, the Conference was kicked off with a keynote address by Susan Dumais (Microsoft) entitled The Web changes everything: Understanding and supporting people in dynamic information environments. She began, “Change is the hallmark of digital libraries… digital libraries are dynamic”, and she wanted to talk about how to deal with this change. “Traditional search & browse interfaces only see a particular slice of digital libraries. An example includes the Wikipedia article about Bill Gates.” She enumerated at least two change metrics: the number of changes and the time between changes. She then went about taking snapshots of websites, measuring the changes, and ultimately dividing the observations into at least three “speeds”: fast, medium, and slow. In general the quickly changing sites (fast) had a hub & spoke architecture. The medium change speed represented popular sites such as mail and Web applications. The slowly changing sites were generally entry pages or sites accessed via search. “Search engines need to be aware of what people seek and what changes over time. Search engines need to take change into account.” She then demonstrated an Internet Explorer plug-in (DiffIE) which highlights the changes in a website over time. She advocated weighing search engine results based on observed changes in a website’s content.

Visualization was the theme of Sascha Tönnies’s (L3S Research) Uncovering hidden qualities — Benefits of quality measures for automatic generated metadata. She described the use of tag clouds with changes in color and size. She experimented with “growbag” graphs which looked a lot like network graphs. She also explored the use of concentric circle diagrams (CCD), and based on her observations people identified with them very well. “In general, people liked the CCD graph the best because the radius intuitively represented a distance from the central idea.”

What appeared to me as the interpretation of metadata schemes through the use of triples, Panorea Gaitanou (Ionian University) described a way to query many cultural heritage institution collections in Query transformation in a CIDOC CRM Based cultural metadata integration environment. He called the approach MDL (Metadata Description Language). Lots of mapping and lots of XPath.

Michael Zarro (Drexel University) evaluated user comments written against the Library of Congress Flickr Commons Project in User-contributed descriptive metadata for libraries and cultural institutions. As a result, he was able to group the comments into at least four types. The first, personal/historical, were exemplified by things like, “I was there, and that was my grandfather’s house.” The second, links out, pointed to elaborations such as articles on Wikipedia. The third, corrections/translations, were amendments or clarifications. The last, links in, were pointers to Flickr groups. The second type of annotations, links out, were the most popular.

thistle

rose

purple flower

Developing services to support research data management and sharing was a panel discussion surrounding the topic of data curation. My take-away from Sara Jones’s (DDC) remarks was, “There are no incentives for sharing research data”, and when given the opportunity for sharing, data owners react by saying things like, “I’m giving my baby away… I don’t know the best practices… What are my roles and responsibilities?” Veerle Van den Eynden (United Kingdom Data Archive) outlined how she puts together infrastructure, policy, and support (such as workshops) to create successful data archives. “infrastructure + support + policy = data sharing” She enumerated time, attitudes and privacy/confidentiality as the bigger challenges. Robin Rice (EDINA) outlined services similar to Van den Eynden’s but was particularly interested in social science data and its re-use. There is a much longer tradition of sharing social science data and it is definitely not intended to be a dark archive. He enumerated a similar but different set of barriers to sharing: ownership, freedom from errors, fear of scooping, poor documentation, and lack of rewards. Rob Grim (Tilburg University) was the final panelist. He said, “We want to link publications with data sets as in Economists Online, and we want to provide a number of additional services against the data.” He described a data sharing incentive: “I will only give you my data if you provide me with sets of services against it such as who is using it as well as where it is being cited.” Grim described the social issues surrounding data sharing as the most important. He compared & contrasted sharing with preservation, and re-use with archiving. “Not only is it important to have the data but it is also important to have the tools that created the data.”

From what I could gather, Claudio Gennaro (IST-CNR) in An Approach to content-based image retrieval based on the Lucene search engine library converted the binary content of images into strings, indexed the strings with Lucene, and then used Lucene’s “find more like this one” features to… find more like this one.

Stina Westman (Aalto University) gave a paper called Evaluation constructs for visual video summaries. She said, “I want to summarize video and measure things like quality, continuity, and usefulness for users.” To do this she enumerated a number of summarizing types: 1) storyboard, 2) scene clips, 3) fast forward technologies, and 4) user-controlled fast forwarding. After measuring satisfaction, scene clips provided the best recognition but storyboards were more enjoyable. The clips and fast forward technologies were perceived as the best video surrogates. “Summaries’ usefulness are directly proportional to the effort to use them and the coverage of the summary… There is little difference between summary types… There is little correlation between the type of performance and satisfaction.”

Frank Shipman (Texas A&M University) in his Visual expression for organizing and accessing music collections in MusicWiz asked himself, “Can we provide access to music collections without explicit metadata; can we use implicit metadata instead?” The implementation of his investigation was an application called MusicWiz which is divided into a user interface and an inference engine. It consists of six modules: 1) artist, 2) metadata, 3) audio signal, 4) lyrics, 5) a workspace expression, and 6) similarity. In the end Shipman found “benefits and weaknesses to organizing personal music collections based on context-independent metadata… Participants found the visual expression facilitated their interpretation of mood… [but] the lack of traditional metadata made it more difficult to locate songs…”

distillers

barrels

whiskey

Day #2 (Wednesday, September 8)

Liina Munari (European Commission) gave the second day’s keynote address called Digital libraries: European perspectives and initiatives. In it she presented a review of the Europeana digital library funding and future directions. My biggest take-away was the following quote: “Orphan works are the 20th Century black hole.”

Stephan Strodl (Vienna University of Technology) described a system called Hoppla facilitating back-up and providing automatic migration services. Based on OAIS, it gets its input from email, a hard disk, or the Web. It provides data management access, preservation, and storage management. The system outsources the experience of others to implement these services. It seemingly offers suggestions on how to get the work done, but it does not actually do the back-ups. The title of his paper was Automating logical preservation for small institutions with Hoppla.

Alejandro Bia (Miguel Hernández University) in Estimating digitization costs in digital libraries using DiCoMo advocated making a single estimate for digitizing, and then making the estimate work. “Most of the cost in digitization is the human labor. Other things are known costs.” Based on past experience Bia graphed a curve of digitization costs and applied the curve to estimates. Factors that go into the curve include: skill of the labor, familiarity with the material, complexity of the task, the desired quality of the resulting OCR, and the legibility of the original document. The whole process reminded me of Medieval scriptoriums.

city hall

lion

stair case

Andrew McHugh (University of Glasgow) presented In pursuit of an expressive vocabulary for preserved New Media art. He is trying to preserve (conserve) New Media art by advocating the creation of medium-independent descriptions written by the artist so the art can be migrated forward. He enumerated a number of characteristics of the art to be described: functions, version, materials & dependencies, context, stakeholders, and properties.

In An Analysis of the evolving coverage of computer science sub-fields in the DBLP digital library Florian Reitz (University of Trier) presented an overview of the Digital Bibliography & Library Project (DBLP) — a repository of computer science conference presentations and journal articles. The (incomplete) collection was evaluated, and in short he saw the strengths and coverage of the collection change over time. In a phrase, he did a bit of traditional collection analysis against a non-traditional library.

A second presentation, Analysis of computer science communities based on DBLP, was then given on the topic of the DBLP, this time by Maria Biryukov (University of Luxembourg). She first tried to classify computer science conferences into sets of subfields in an effort to rank which conferences were “better”. One way this was done was through an analysis of who participated, the number of citations, the number of conference presentations, etc. She then tracked where a person presented and was able to see flows and patterns of publishing. Her conclusion — “Authors publish all over the place.”

In Citation graph based ranking in Invenio by Ludmila Marian (European Organization for Nuclear Research) the question was asked, “In a database of citations consisting of millions of documents, how can good precision be achieved if users only supply approximately 2-word queries?” The answer, she says, may lie in citation analysis. She weighted papers based on the number and locations of citations in a manner similar to Google PageRank, but in the end she realized the imperfection of the process since older publications seemed to unnaturally float to the top.

Day #3 (Thursday, September 9)

Sandra Toze (Dalhousie University) wanted to know how digital libraries support group work. In her Examining group work: Implications for the digital library as sharium she described the creation of an extensive lab for group work. Computers. Video cameras. Whiteboards. Etc. Students used her lab and worked in the manner she expected: doing administrative tasks, communicating, solving problems, and generating artifacts. She noticed that the “sharium” was a valid environment for doing work, but she also noticed that only individuals did information seeking while other tasks were done by the group as a whole. I found this latter fact particularly interesting.

In an effort to build and maintain reading lists Gabriella Kazai (Microsoft) presented Architecture for a collaborative research environment based on reading list sharing. The heart of the presentation was a demonstration of ScholarLynk as well as Research Desktop — tools to implement “living lists” of links to knowledge sources. I went away wondering whether or not such tools save people time and increase knowledge.

The last presentation I attended was by George Lucchese (Texas A&M University) called CritSpace: A Workplace for critical engagement within cultural heritage digital libraries where he described an image processing tool intended to be used by humanities scholars. The tool does image processing, provides a workspace, and allows researchers to annotate their content.

Bothwell Castle

Stirling Castle

Doune Castle

Observations and summary

It has been just over one month since I was in Glasgow attending the Conference, and much of the “glow” (all onomatopoeias intended) has worn off. The time spent was productive. For example, I was able to meet up with James McNulty (Open University) who spent time at Notre Dame with me. I attended eighteen presentations which were deemed innovative and scholarly by way of extensive review. I discussed digital library issues with numerous people and made an even greater number of new acquaintances. Throughout the process I did some very pleasant sight seeing both with conference attendees and on my own. At the same time I do not feel as if my knowledge of digital libraries was significantly increased. Yes, attendance was intellectually stimulating, as demonstrated by the number of to-do list items written in my notebook during the presentations, but the topics of discussion seemed worn out and not significant. Interesting but only exemplifying subtle changes from previous research.

My attendance was also a mission. More specifically, I wanted to compare & contrast the work going on here with the work being done at the 2010 Digital Humanities conference. In the end, I believe the two groups are not working together but rather, as one attendee put it, “talking past one another.” Both groups — ECDL and Digital Humanities — have something in common — libraries and librarianship. But on one side are computer scientists, and on the other side are humanists. The first want to implement algorithms and apply them to many processes. If such a thing gets out of hand, then the result is akin to a person owning a hammer and everything looking like a nail. The second group is ultimately interested in describing the human condition and addressing questions about values. This second process is exceedingly difficult, if not impossible, to measure. Consequently any sort of evaluation is left up to a great deal of subjectivity. Many people would think these two processes are contradictory and/or conflicting. In my opinion, they are anything but in conflict. Rather, these two processes are complementary. One fills the deficiencies of the other. One is more systematic where the other is more judgmental. One relates to us as people, and the other attempts to make observations devoid of human messiness. In reality, despite the existence of these “two cultures”, I see the work of the scientists and the work of the humanists to be equally necessary in order for me to make sense of the world around me. It is nice to know libraries and librarianship seem to represent a middle ground in this regard. Not ironically, that is one of the most important reasons I explicitly chose my profession. I desired to practice both art and science — arscience. It is just too bad that these two groups do not work more closely together. There seems to be too much desire for specialization instead. (Sigh.)

Because of a conflict in acronyms, the ECDL conference has all but been renamed to Theory and Practice of Digital Libraries (TPDL), and next year’s meeting will take place in Berlin. Despite the fact that this was my third or fourth time attending ECDL, I doubt I will attend next year. I do not think information retrieval and metadata standards are as important as they have been. Don’t get me wrong. I didn’t say they were unimportant, just not as important as they used to be. Consequently, I think I will be spending more of my time investigating the digital humanities where content has already been found and described, and is now being evaluated and put to use.

River Clyde

River Teith

Dan Marmion

Eric Lease Morgan — Sun, 03 Oct 2010 20:09:30 +0000

Dan Marmion recruited and hired me to work at the University of Notre Dame during the Summer of 2001. The immediate goal was to implement a “database-driven website”, which I did with the help of the Digital Access and Information Architecture Department staff and MyLibrary.

About eighteen months after I started working at the University I felt settled in. It was at that time when I realized I had accomplished all the goals I had previously set out for myself. I had a family. I had stuff. I had the sort of job I had always aspired to have in a place where I aspired to have it. I woke up one morning and asked myself, “Now what?”

After a few months of cogitation I articulated a new goal: to raise a happy, healthy, well-educated child. (I only have one.) By now my daughter is almost eighteen years old. She is responsible and socially well-adjusted. She stands up straight and tall. She has a pretty smile. By this time next year I sincerely believe she will be going to college with her tuition paid for by Notre Dame. Many of the things that have been accomplished in the past nine years and many of the things to come are results of Dan hiring me.

Dan Marmion died Wednesday, September 22, 2010 from brain cancer. “Dan, thank you for the means and the opportunities. You are sorely missed.”

Great Books data dictionary

Eric Lease Morgan — Fri, 24 Sep 2010 11:13:27 +0000

This is a sort of Great Books data dictionary in that it describes the structure and content of two data files containing information about the Great Books of the Western World.

The data set is manifested in two files. The canonical file is great-books.xml. This XML file consists of a root element (great-books) and many sub-elements (books). The meat of the file resides in these sub-elements. Specifically, with the exception of the id attribute, all the book attributes enumerate integers denoting calculated values. The attributes words, fog, and kincaid denote the length of the work, two grade levels, and a readability score, respectively. The balance of the attributes are “great ideas” as calculated through a variation of Term Frequency Inverse Document Frequency (TFIDF) culminating in a value called the Great Ideas Coefficient. Finally, each book element includes sub-elements denoting who wrote the work (author), the work’s name (title), the location of the file that was used as the basis of the calculations (local_url), and the location of the original text (original_url).

The second file (great-books.csv) is a derivative of the first file. This comma-separated file is intended to be read by something like R or Excel for more direct manipulation. It includes all the information from great-books.xml with the exception of the author, title, and URLs.

Given either one of these two files the developer or statistician is expected to evaluate or re-purpose the results of the calculations. For example, given one or the other of these files the following questions could be answered:

What is the “greatest” book and who wrote it?
What is the average “great book” score?
Are there clusters of great ideas?
Which authors wrote extensively on what great ideas?
Is there a correlation between greatness and length and readability?
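As a starting point for questions like these, here is a minimal Python sketch that reads CSV data, finds the row with the largest coefficient, averages the coefficient, and computes a Pearson correlation between length and greatness. The three rows are made up for illustration and are not taken from great-books.csv, and the column names (id, words, kincaid, love, war, coefficient) are assumptions about how the real header might look.

```python
import csv
import io
import math

# Made-up rows for illustration only; not actual values from the data set.
SAMPLE = """id,words,kincaid,love,war,coefficient
book-a,1000,8,3,1,4
book-b,2000,10,1,5,6
book-c,3000,12,2,6,8
"""

def load_rows(handle):
    """Read CSV rows, converting every column except id to a number."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({k: (v if k == "id" else float(v)) for k, v in row.items()})
    return rows

def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rows = load_rows(io.StringIO(SAMPLE))
greatest = max(rows, key=lambda r: r["coefficient"])
average = sum(r["coefficient"] for r in rows) / len(rows)
r_words = pearson([r["words"] for r in rows], [r["coefficient"] for r in rows])

print("greatest:", greatest["id"])
print("average coefficient:", average)
print("length vs. greatness:", round(r_words, 3))
```

To run this against the real file, replace io.StringIO(SAMPLE) with open("great-books.csv", newline="") and adjust the column names to match the actual header.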

The really adventurous developer will convert the XML file into JSON and then create a cool (or “kewl”) Web interface allowing anybody with a browser to do their own evaluation and presentation. This is an exercise left up to the reader.
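For the first half of that exercise, a conversion along these lines might do. It is only a sketch: the sample record is invented, and the assumption that each record is an element named book directly under great-books is inferred from the description above rather than checked against the real file.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample record; the element name "book" is an assumption.
SAMPLE_XML = """<great-books>
  <book id="example-1" words="1000" fog="10" kincaid="8" love="3" war="1">
    <author>Example Author</author>
    <title>Example Title</title>
    <local_url>http://example.org/local.txt</local_url>
    <original_url>http://example.org/original.txt</original_url>
  </book>
</great-books>"""

def xml_to_records(xml_text):
    """Flatten each book element into a dictionary of attributes and sub-elements."""
    root = ET.fromstring(xml_text)
    records = []
    for book in root.iter("book"):
        record = {}
        for name, value in book.attrib.items():
            # Every attribute except id is described as an integer.
            record[name] = value if name == "id" else int(value)
        for child in book:
            record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records

records = xml_to_records(SAMPLE_XML)
print(json.dumps(records, indent=2))
```

The resulting list of flat objects is the shape most browser-side table and charting libraries expect.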

Twitter, Facebook, Delicious, and Alex

Eric Lease Morgan — Sat, 18 Sep 2010 23:20:20 +0000

I spent time last evening and this afternoon integrating Twitter, Facebook, and Delicious into my Alex Catalogue. The process was (almost) trivial:

create Twitter, Facebook, and Delicious accounts
select and configure the Twitter button I desired to use
acquire the Delicious javascript for bookmarking
place the results of Steps #2 and #3 into my HTML
rebuild my pages
install and configure the Twitter application for Facebook

Because of this process I am able to “tweet” from Alex, its search results, any of the etexts in the collection, as well as any results from the use of the concordances. These tweets then get echoed to Facebook.

(I tried to link directly to Facebook using their Like Button, but the process was cumbersome. Iframes. Weird, Facebook-specific Javascript. Pulling too much content from the header of my pages. Considering the Twitter application for Facebook, the whole thing was not worth the trouble.)

I find it challenging to write meaningful 140 character comments on the Alex Catalogue, especially since the URLs take up such a large number of the characters. Still, I hope to regularly find interesting things in the collection and share them with the wider audience. To see the fruits of my labors to date, see my Twitter feed — http://twitter.com/ericleasemorgan.

Only time will tell whether or not this “social networking” thing proves to be beneficial to my library — all puns intended.

Where in the world are windmills, my man Friday, and love?

Eric Lease Morgan — Sun, 12 Sep 2010 22:32:19 +0000

This posting describes how a Perl module named Lingua::Concordance allows the developer to illustrate where in the continuum of a text words or phrases appear and how often.

Windmills, my man Friday, and love

When it comes to Western literature and windmills, we often think of Don Quixote. When it comes to “my man Friday” we think of Robinson Crusoe. And when it comes to love we may very well think of Romeo and Juliet. But I ask myself, “How often do these words and phrases appear in the texts, and where?” Using digital humanities computing techniques I can literally illustrate the answers to these questions.

Lingua::Concordance

Lingua::Concordance is a Perl module (available locally and via CPAN) implementing a simple key word in context (KWIC) index. Given a text and a query as input, a concordance will return a list of all the snippets containing the query along with a few words on either side. Such a tool enables a person to see how their query is used in a literary work.
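To make the idea of a KWIC index concrete, here is a small Python stand-in. It is not Lingua::Concordance and does not mirror that module's interface; the function name kwic, its width parameter, and the sample sentence are all assumptions made for the illustration.

```python
import re

def kwic(text, query, width=30):
    """Return snippets showing each match of query with `width` characters of context."""
    flat = " ".join(text.split())  # collapse newlines and runs of whitespace
    snippets = []
    for match in re.finditer(re.escape(query), flat, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(flat))
        snippets.append(flat[start:end])
    return snippets

# Invented sample text, loosely in the spirit of the novel.
sample = (
    "They came in sight of thirty or forty windmills. "
    "Those are not giants but windmills."
)
for line in kwic(sample, "windmill", width=20):
    print(line)
```

Because the query is matched as a plain substring, a search for windmill also finds windmills, which is the same behavior visible in the concordance output later in this posting.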

Given the fact that a literary work can be measured in words, and given the fact that the number of times a particular word or phrase occurs in a text can be counted, it is possible to illustrate the locations of the words and phrases using a bar chart. One axis represents a percentage of the text, and the other axis represents the number of times the words or phrases occur in that percentage. Such graphing techniques are increasingly called visualization — a new spin on the old adage “A picture is worth a thousand words.”
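The bucketing behind such a chart can be sketched in Python as follows. This is an assumption-laden illustration rather than the module's method: positions are measured in characters instead of words to keep the code short, and the function names and sample text are invented.

```python
import re

def positions_by_decile(text, query):
    """Count matches of query in each tenth of the text, measured in characters."""
    counts = [0] * 10
    length = len(text)
    if length == 0:
        return counts
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        # Map the match offset to a bucket from 0 (first 10%) to 9 (last 10%).
        bucket = min(match.start() * 10 // length, 9)
        counts[bucket] += 1
    return counts

def bar_chart(counts, scale=1):
    """Render the counts as one text bar per 10% slice of the work."""
    lines = []
    for i, count in enumerate(counts):
        lines.append(f"{(i + 1) * 10:3d} ({count:2d}) {'#' * (count * scale)}")
    return "\n".join(lines)

# Invented text: three early mentions, one late mention.
sample = "windmill " * 3 + "filler " * 50 + "windmill " + "filler " * 20
counts = positions_by_decile(sample, "windmill")
print(bar_chart(counts))
```

The scale argument simply stretches the bars so that small counts remain visible.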

In a script named concordance.pl I answered such questions. Specifically, I used it to figure out where in Don Quixote windmills are mentioned. As you can see below, they are mentioned only 14 times in the entire novel, and the vast majority of those mentions occur in the first 10% of the book.

  $ ./concordance.pl ./don.txt 'windmill'
  Snippets from ./don.txt containing windmill:
	* DREAMT-OF ADVENTURE OF THE WINDMILLS, WITH OTHER OCCURRENCES WORTHY TO
	* d over by the sails of the windmill, Sancho tossed in the blanket, the
	* thing is ignoble; the very windmills are the ugliest and shabbiest of 
	* liest and shabbiest of the windmill kind. To anyone who knew the count
	* ers say it was that of the windmills; but what I have ascertained on t
	* DREAMT-OF ADVENTURE OF THE WINDMILLS, WITH OTHER OCCURRENCES WORTHY TO
	* e in sight of thirty forty windmills that there are on plain, and as s
	* e there are not giants but windmills, and what seem to be their arms a
	* t most certainly they were windmills and not giants he was going to at
	*  about, for they were only windmills? and no one could have made any m
	* his will be worse than the windmills," said Sancho. "Look, senor; thos
	* ar by the adventure of the windmills that your worship took to be Bria
	*  was seen when he said the windmills were giants, and the monks' mules
	*  with which the one of the windmills, and the awful one of the fulling
  
  A graph illustrating in what percentage of ./don.txt windmill is located:
	 10 (11) #############################
	 20 ( 0) 
	 30 ( 0) 
	 40 ( 0) 
	 50 ( 0) 
	 60 ( 2) #####
	 70 ( 1) ##
	 80 ( 0) 
	 90 ( 0) 
	100 ( 0)
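A bar chart like the one above can be approximated by dividing the text into ten equal slices and counting the matches that begin in each. The sketch below is Python and only an approximation of the idea; the names map_positions and chart, the scale parameter, and the sample string are assumptions, and the scaling of the bars is not necessarily the one concordance.pl uses.

```python
import re

def map_positions(text, query, buckets=10):
    """Count occurrences of `query` in each of `buckets` equal slices of `text`."""
    counts = [0] * buckets
    length = len(text)
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        # integer arithmetic converts a character offset into a slice index
        index = min(match.start() * buckets // length, buckets - 1)
        counts[index] += 1
    return counts

def chart(counts, scale=1):
    """Render the counts as a text bar chart, one row per slice of the text."""
    step = 100 // len(counts)
    rows = []
    for i, count in enumerate(counts, start=1):
        rows.append(f"{i * step:3d} ({count:2d}) {'#' * (count * scale)}")
    return "\n".join(rows)

# hypothetical sample: matches clustered at the beginning, one at the very end
sample = "windmill " * 3 + "filler " * 20 + "windmill"
print(chart(map_positions(sample, "windmill")))
```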

If windmills are mentioned so few times, then why do they figure so prominently in people’s minds when they think of Don Quixote? To what degree have people read Don Quixote in its entirety? Are windmills as persistent a theme throughout the book as many people may think?

What about “my man Friday”? Where does he occur in Robinson Crusoe? Using the concordance features of the Alex Catalogue of Electronic Texts we can see that a search for the word Friday returns 185 snippets. Mapping those snippets to percentages of the text results in the following bar chart:

Friday in Robinson Crusoe

Obviously the word Friday appears towards the end of the novel, and as anybody who has read the novel knows, it is a long time until Robinson Crusoe actually gets stranded on the island and meets “my man Friday”. A concordance helps people understand this fact.

What about love in Romeo and Juliet? How often does the word occur and where? Again, a search for the word love returns quite a number of snippets (175 to be exact), and they are distributed throughout the text as illustrated below:

love in Romeo and Juliet

“Maybe love is a constant theme of this particular play,” I state sarcastically, and “Is there less love later in the play?”

Digital humanities and librarianship

Given the current environment, where full text literature abounds, digital humanities and librarianship are a match made in heaven. Our library “discovery systems” are essentially indexes. They enable people to find data and information in our collections. Yet find is not an end in itself. In fact, it is only an activity at the very beginning of the learning process. Once content is found it is then read in an attempt at understanding. Counting words and phrases, placing them in the context of an entire work or corpus, and illustrating the result is one way this understanding can be accomplished more quickly. Remember, “Save the time of the reader.”

Integrating digital humanities computing techniques, like concordances, into library “discovery systems” represents a growth opportunity for the library profession. If we don’t do this on our own, then somebody else will, and we will end up paying money for the service. Climb the learning curve now, or pay exorbitant fees later. The choice is ours.

Ngrams, concordances, and librarianship

Eric Lease Morgan — Mon, 30 Aug 2010 05:08:47 +0000

This posting describes how the extraction of ngrams and the implementation of concordances are integrated into the Alex Catalogue of Electronic Texts. Given the increasing availability of full-text content in libraries, the techniques described here could easily be incorporated into traditional library “discovery systems” and/or catalogs, if and only if the library profession were to shift its definition of what it means to practice librarianship.

Lingua::EN::Bigram

During the past couple of weeks, in fits of creativity, one of the things I spent some of my time on was a Perl module named Lingua::EN::Bigram. At version 0.03, it now supports not only bigrams, trigrams, and quadgrams (two-, three-, and four-word phrases, respectively), but also ngrams — multi-word phrases of an arbitrary length.
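To make the idea of an ngram concrete, here is a minimal sketch in Python. It is not the Lingua::EN::Bigram interface; the function name ngrams, the simple tokenizer, and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return a Counter of every n-word phrase in `text`."""
    # lowercase the text and keep runs of letters and apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    phrases = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(phrases)

# hypothetical sample sentence
sample = "a quarter of a mile and then a quarter of a mile more"
for phrase, count in ngrams(sample, 5).most_common(3):
    print(f"{phrase} ({count})")
```

Sliding a window of n words across the text, one word at a time, is what makes an arbitrary phrase length possible; bigrams, trigrams, and quadgrams are simply the cases where n is 2, 3, or 4.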

Given this enhanced functionality, and through the use of a script called ngrams.pl, I learned that the 10 most frequently used 5-word phrases and the number of times they occur in Henry David Thoreau’s Walden seem to surround spatial references:

a quarter of a mile (6)
i have no doubt that (6)
as if it were a (6)
the other side of the (5)
the surface of the earth (4)
the greater part of the (4)
in the midst of a (4)
in the middle of the (4)
in the course of the (3)
two acres and a half (3)

Whereas the same process applied to Thoreau’s A Week on the Concord and Merrimack Rivers returns lengths and references to flowing water, mostly:

a quarter of a mile (8)
on the bank of the (7)
the surface of the water (6)
the middle of the stream (6)
as if it were the (5)
as if it were a (4)
is for the most part (4)
for the most part we (4)
the mouth of this river (4)
in the middle of the (4)

While not always as clear cut as the examples outlined above, the extraction and counting of ngrams usually supports the process of “distant reading” — a phrase coined by Franco Moretti in Graphs, Maps, Trees: Abstract Models for Literary History (2007) to denote the counting, graphing, and mapping of literary texts. With so much emphasis on reading in libraries, I ask myself, “Ought the extraction of ngrams be applied to library applications?”

Concordances

Concordances are literary tools used to evaluate texts. Dating back to as early as the 12th or 13th century, they were first used to study religious materials. Concordances take many forms, but they usually list all the words in a text, the number of times each occurs, and most importantly, the places where each word occurs within the context of its surrounding text — a key-word in context (KWIC) index. Done by hand, the creation of concordances is tedious and time consuming, and therefore very expensive. Computers make the work of creating a concordance almost trivial.

Each of the full text items in the Alex Catalogue of Electronic Texts (close to 14,000 of them) is accompanied by a concordance. The concordances support the following functions:

list all the words in the text starting with a given letter and the number of times each occurs
list the most frequently used words in the text and the number of times each occurs
list the most frequently used ngrams in a text and the number of times each occurs
display individual items from the lists above in a KWIC format
enable the student or scholar to search the text for arbitrary words or phrases (regular expressions) and have them displayed in a KWIC format
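The first two functions in the list above amount to counting words and then filtering or sorting the counts. A minimal Python sketch follows; it is not the code behind the Alex Catalogue, and the function names, the tokenizer, and the sample sentence are assumptions.

```python
import re
from collections import Counter

def word_counts(text):
    """Count every word in `text`, ignoring case."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def starting_with(counts, letter):
    """Alphabetical list of (word, count) pairs for words beginning with `letter`."""
    return sorted((w, c) for w, c in counts.items() if w.startswith(letter.lower()))

# hypothetical sample sentence
counts = word_counts("Zounds! The zebra and the zealot saw the zebra.")
print(starting_with(counts, "z"))   # words beginning with z
print(counts.most_common(2))        # the most frequently used words
```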

Such functionality allows people to answer many questions quickly and easily, such as:

Does Mark Twain’s Adventures of Huckleberry Finn contain many words beginning with the letter z, and if so, how many times and in what context?
To what extent does Aristotle’s Metaphysics use the word “good”, and maybe just as importantly, how is the word “evil” used in the same context?
In Jack London’s Call of the Wild the phrase “man in the red sweater” is one of the more frequently used. Who was this man and what role does he play in the story?
Compared to Shakespeare, to what extent does Plato discuss love, and how do the authors’ expositions differ?

The counting of words, the enumeration of ngrams, and the use of concordances are not intended to short-circuit traditional literary studies. Instead, they are intended to supplement and enhance the process. Traditional literary investigations, while deep and nuanced, are not scalable. A person is not able to read, compare & contrast, and then comprehend the essence of all of Shakespeare, all of Plato, and all of Charles Dickens through “close reading”. An individual simply does not have enough time. In the words of Gregory Crane, “What do you do with a million books?” Distant reading, akin to the processes outlined above, makes it easier to compare & contrast large corpora, discover patterns, and illustrate trends. Moreover, such processes are reproducible, less prone to subjective interpretation, and not limited to any particular domain. The counting, graphing, and mapping of literary texts makes a lot of sense.

The home page for the concordances is complete with a number of sample texts. Alternatively, you can search the Alex Catalogue and find an item on your own.

Library “discovery systems” and/or catalogs

The amount of full text content available to libraries has never been greater than it is today. Millions of books have been collectively digitized through Project Gutenberg, the Open Content Alliance, and the Google Books Project. There are thousands of open access journals with thousands upon thousands of freely available scholarly articles. There are an ever-growing number of institutional repositories both subject-based as well as institutional-based. These too are rich with full text content. None of this even considers the myriad of grey literature sites like blogs and mailing list archives.

Library “discovery systems” and/or catalogs are designed to organize and provide access to the materials outlined above, but they need to do more. First of all, the majority of the profession’s acquisitions processes assume collections need to be paid for. With the increasing availability of truly free content on the Web, greater emphasis needs to be placed on harvesting content as opposed to purchasing or licensing it. Libraries are expected to build collections designed to stand the test of time. Brokering access to content through licensing agreements — one of the current trends in librarianship — will only last as long as the money lasts. Licensing content makes libraries look like cost centers and negates the definition of “collections”.

Second, library “discovery systems” and/or catalogs assume an environment of scarcity. They assume the amount of accessible, relevant data and information needed by students, teachers, and researchers is relatively small. Thus, a great deal of the profession’s efforts go into enabling people to find their particular needle in one particular haystack. In reality, current indexing technology makes the process of finding relevant materials trivial, almost intelligent. Implemented correctly, indexers return more content than most people need, and consequently they continue to drink from the proverbial fire hose.

Let’s turn these lemons into lemonade. Let’s redirect some of the time and money spent on purchasing licenses towards the creation of full text collections by systematic harvesting. Let’s figure out how to apply “distant reading” techniques to the resulting collections thus making them, literally, more useful and more understandable. These redirections represent a subtle change in the current direction of librarianship. At the same time, they retain the core principles of the profession, namely: collection, organization, preservation, and dissemination. Such a shift will result in increased expertise on our part, the ability to better control our own destiny, and a contribution to the overall advancement of our profession.

What can we do to make these things come to fruition?

Lingua::EN::Bigram (version 0.03)

Eric Lease Morgan — Tue, 24 Aug 2010 02:37:39 +0000

I uploaded version 0.03 of Lingua::EN::Bigram to CPAN today, and it now supports not just bigrams, trigrams, and quadgrams, but ngrams — phrases of an arbitrary length.

In order to test it out, I quickly gathered together some of my more recent essays, concatenated them together, and applied Lingua::EN::Bigram against the result. Below is a list of the top 10 most common bigrams, trigrams, and quadgrams:

  bigrams                 trigrams                  quadgrams
  52  great ideas         36  the number of         25  the number of times
  43  open source         36  open source software  13  the total number of
  38  source software     32  as well as            10  at the same time
  29  great books         28  number of times       10  number of words in
  24  digital humanities  27  the use of            10  when it comes to
  23  good man            25  the great books       10  total number of documents
  22  full text           23  a set of              10  open source software is
  22  search results      20  eric lease morgan      9  number of times a
  20  lease morgan        20  a number of            9  as well as the
  20  eric lease          19  total number of        9  through the use of

Not surprising since I have been writing about the Great Books, digital humanities, indexing, and open source software. Re-affirming.

Lingua::EN::Bigram is available locally as well as from CPAN.

Lingua::EN::Bigram (version 0.02)

Eric Lease Morgan — Mon, 23 Aug 2010 00:02:45 +0000

I have written and uploaded to CPAN version 0.02 of my Perl module Lingua::EN::Bigram. From the README file:

This module is designed to: 1) pull out all of the two-, three-, and four-word phrases in a given text, and 2) list these phrases according to their frequency. Using this module it is possible to create lists of the most common phrases in a text as well as order them by their probable occurrence, thus implying significance. This process is useful for the purposes of textual analysis and “distant reading”.

Using this module I wrote a script called n-grams.pl. Feed it a plain text file, and it will return the top 10 most significant bigrams (as calculated by T-Score) as well as the top 10 most common trigrams and quadgrams. For example, here is the output of n-grams.pl when Henry David Thoreau’s Walden is input:

  Bi-grams (T-Score, count, bigram)
  4.54348783312048  22  one day  
  4.35133234596553  19  new england  
  3.705427371426    14  walden pond  
  3.66575742655033  14  one another  
  3.57857056272537  13  many years  
  3.55592136768501  13  every day  
  3.46339791276118  12  fair haven  
  3.46101939872834  12  years ago  
  3.38519781332654  12  every man  
  3.29818626191729  11  let us  
  
  Tri-grams (count, trigram)
  41  in the woods
  40  i did not
  28  i do not
  28  of the pond
  27  as well as
  27  it is a
  26  part of the
  25  that it was
  25  as if it
  25  out of the
  
  Quad-grams (count, quadgram)
  20  for the most part
  16  from time to time
  15  as if it were
  14  in the midst of
  11  at the same time
   9  the surface of the
   9  i think that i
   8  in the middle of
   8  worth the while to
   7  as if they were

The whole thing gets more interesting when you compare that output to another of Thoreau’s works — A Week on the Concord and Merrimack Rivers:

  Bi-grams (T-Score, count, bi-gram)
  4.62683939320543  22  one another  
  4.57637831535376  21  new england  
  4.08356124174142  17  let us  
  3.86858364314677  15  new hampshire  
  3.43311180449584  12  one hundred  
  3.31196701774012  11  common sense  
  3.25007069543896  11  can never  
  3.15955504269006  10  years ago  
  3.14821552996352  10  human life  
  3.13793008615654  10  told us  
  
  Tri-grams (count, tri-gram)
  41  as well as
  38  of the river
  34  it is a
  30  there is a
  30  one of the
  28  it is the
  27  as if it
  26  it is not
  26  if it were
  24  it was a
  
  Quad-grams (count, quad-gram)
  21  for the most part
  20  as if it were
  17  from time to time
   9  on the bank of
   8  the bank of the
   8  in the midst of
   8  a quarter of a
   8  the middle of the
   8  quarter of a mile
   7  at the same time

Ask yourself, “Are there similarities between the outputs? How about differences? Do you notice any patterns or anomalies? What sorts of new discoveries might be made if n-grams.pl were applied to the entire corpus of Thoreau’s works? How might the output be different if a second author’s works were introduced?” Such questions are the core of digital humanities research. With the increasing availability of full text content in library collections, such are the questions the library profession can help answer if the profession were to expand its definition of “service”.

Search and retrieve are not the pressing problems to be solved. People can find more data and information than they know what to do with. Instead, the pressing problems surround use and understanding. Lingua::EN::Bigram is an example of how these newer and more pressing problems can be addressed. The module is available for downloading (locally as well as from CPAN). Also for your perusal is n-grams.pl.
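For readers curious about the T-Score column in the output above, one common formulation compares the number of times a bigram is observed with the number of times it would be expected if its two words were independent, and divides the difference by the square root of the observed count. The Python sketch below uses that formulation; whether it matches the calculation inside Lingua::EN::Bigram exactly is an assumption, and the function name, tokenizer, and sample text are hypothetical. It also does not remove stop words.

```python
import math
import re
from collections import Counter

def tscores(text):
    """Map each bigram to (observed - expected) / sqrt(observed)."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    word_count = Counter(words)
    bigram_count = Counter(zip(words, words[1:]))
    scores = {}
    for (first, second), observed in bigram_count.items():
        # expected count if the two words were independent of each other
        expected = word_count[first] * word_count[second] / total
        scores[f"{first} {second}"] = (observed - expected) / math.sqrt(observed)
    return scores

# hypothetical sample text
sample = "new england is cold and new england is old and the pond is new"
ranked = sorted(tscores(sample).items(), key=lambda item: item[1], reverse=True)
for bigram, score in ranked[:3]:
    print(f"{score:.3f}  {bigram}")
```

A high score means the two words appear together more often than their individual frequencies alone would suggest, which is why the measure is read as a hint of significance and not as a plain count.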

Cool URIs

Eric Lease Morgan — Sun, 22 Aug 2010 18:07:42 +0000

I have started implementing “cool” URIs against the Alex Catalogue of Electronic Texts.

As outlined in Cool URIs for the Semantic Web, “The best resource identifiers… are designed with simplicity, stability and manageability in mind…” To that end I have taken to creating generic URIs redirecting user-agents to URLs based on content negotiation — 303 URI forwarding. These URIs also provide a means to request specific types of pages. The shapes of these URIs follow, where “key” is a foreign key in my underlying (MyLibrary) database:

http://infomotions.com/etexts/id/key – generic; redirection based on content negotiation
http://infomotions.com/etexts/page/key – HTML; the text itself
http://infomotions.com/etexts/data/key – RDF; data about the text
http://infomotions.com/etexts/concordance/key – concordance; a means for textual analysis

For example, the following URIs return different versions/interfaces of Henry David Thoreau’s Walden:

This whole thing makes my life easier. No need to remember complicated URLs. All I have to remember is the shape of my URI and the foreign key. Through the process this also makes the URLs easier to type, shorten, distribute, and display.

The downside of this implementation is the need for an always-on intermediary application doing the actual work. The application, implemented as a mod_perl module, is called Apache2::Alex::Dereference and is available for your perusal. Another downside is the need for better, more robust RDF, but that’s for later.
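The redirection logic can be sketched briefly. The following is a Python illustration of the idea, not the Apache2::Alex::Dereference code; the function name dereference, the placeholder key, and the decision to look only for application/rdf+xml in the Accept header are assumptions. A fuller implementation would also honor the quality values in the header.

```python
BASE = "http://infomotions.com/etexts"

def dereference(key, accept=""):
    """Return an HTTP status and Location for a generic /id/key request."""
    # a client asking for RDF gets data about the text; everyone else gets HTML
    wants_rdf = "application/rdf+xml" in accept.lower()
    target = "data" if wants_rdf else "page"
    return 303, f"{BASE}/{target}/{key}"

# "some-key" is a placeholder, not a real key in the database
print(dereference("some-key", "text/html"))
print(dereference("some-key", "application/rdf+xml"))
```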

rsync, a really cool utility

Eric Lease Morgan — Thu, 19 Aug 2010 01:42:18 +0000

Without direct physical access to my co-located host, backing up and preserving Infomotions’ 150 GB website is challenging, but through the use of rsync things are a whole lot easier. rsync is a really cool utility, and thanks go to Francis Kayiwa who recommended it to me in the first place. “Thank you!”

Here is my rather brain-dead back-up utility:

# rsync.sh - brain-dead backup of wilson

# change directories to the local store, and stop if it is not there
cd /Users/eric/wilson || exit 1

# get rid of any weird Mac OS X filenames
find ./ -name '.DS_Store' -exec rm -rf {} \;

# do the work for one remote file system...
rsync --exclude-from=/Users/eric/etc/rsync-exclude.txt \
    -avz wilson:/disk01/ \
    ./disk01/

# ...and then another
rsync --exclude-from=/Users/eric/etc/rsync-exclude.txt \
    -avz wilson:/home/eric/ \
    ./home/eric/

After I run this code my local Apple Macintosh Time Capsule automatically copies my content to yet a third spinning disk. I feel much better about my data now that I have started using rsync.

WiLSWorld, 2010

Eric Lease Morgan — Fri, 06 Aug 2010 17:04:39 +0000

I had the recent honor, privilege, and pleasure of attending WiLSWorld (July 21-22, 2010 in Madison, Wisconsin), and this posting outlines my experiences there. In a sentence, I was pleased to see the increasing understanding of “discovery” interfaces defined as indexes as opposed to databases, and it is now my hope we — as a profession — can move beyond search & find towards use & understand.

Wednesday, July 21

With an audience of about 150 librarians of all types from across Wisconsin, the conference began with a keynote speech by Tim Spalding (LibraryThing) entitled “Social cataloging and the future”. The heart of his presentation was a thing he called the Ladder of Social Cataloging which has six “rungs”: 1) personal cataloging, 2) sharing, 3) implicit social cataloging, 4) social networking, 5) explicitly social cataloging, and 6) collaboration. Much of what followed were demonstrations of how each of these things is manifested in LibraryThing. There were a number of meaty quotes sprinkled throughout the talk:

…We [LibraryThing] are probably not the biggest book club anymore… Reviews are less about buying books and more about sharing minds… Tagging is not about something for everybody else, but rather about something for yourself… LibraryThing was about my attempt to discuss the things I wanted to discuss in graduate school… We have “flash mobs” cataloging peoples’ books such as the collections of Thomas Jefferson, John Adams, Ernest Hemingway, etc… Traditional subject headings are not manifested in degrees; all LCSH are equally valid… Library data can be combined but separate from patron data.

I was duly impressed with this presentation. It really brought home the power of crowd sourcing and how it can be harnessed in a library setting. Very nice.

Peter Gilbert (Lawrence University) then gave a presentation called “Resource discovery: I know it when I see it”. In his words, “The current problem to solve is to remove all of the silos: books, articles, digitized content, guides to subjects, etc.” The solution, in his opinion, is to implement “discovery systems” similar to Blacklight, eXtensible Catalog, Primo & Primo Central, Summon, VUFind, etc. I couldn’t have said it better myself. He gave a brief overview of each system.

Ken Varnum (University of Michigan Library) described a website redesign process in “Opening what’s closed: Using open source tools to tear down vendor silos”. As he said, “The problem we tried to solve in our website redesign was the overwhelming number of branch library websites. All different. Almost schizophrenic.” The solution grew out of a different premise for websites. “Information not location.” He went on to describe a rather typical redesign process complete with focus group interviews, usability studies, and advisory groups, but there were a couple of very interesting tidbits. First, inserting the names and faces of librarians in search results has proved popular with students. Second, I admired the “participatory design” process he employed. Print a design. Allow patrons to use pencils to add, remove, or comment on aspects of the layout. I also think the addition of a professional graphic designer helped their process.

I then attended Peter Gorman’s (University of Wisconsin-Madison) “Migration of digital content to Fedora”. Gorman had the desire to amalgamate institutional content, books, multimedia and finding aids (EAD files) into a single application… yet another “discovery system” description. His solution was to store content in Fedora, index the content, and provide services against the index. Again, a presenter after my own heart. Better than anyone had done previously, Gorman described Fedora’s content model complete with identifiers (keys), a set of properties (relationships, audit trails, etc.), and data streams (JPEG, XML, TIFF, etc.). His description was clear and very easy to digest. The highlight was a description of Fedora “behaviors”. These are things people are intended to do with data streams. Examples include enlarging a thumbnail image or transforming an online finding aid into something designed for printing. These “behaviors” are very much akin to — if not exactly like — the “services against texts” I have been advocating for a few years.

Thursday, July 22

The next day I gave a presentation called “Electronic texts and the evolving definition of librarianship”. This was an extended version of my presentation at ALA given a few weeks ago. To paraphrase, “As we move from databases towards indexes to facilitate search, the problems surrounding find are not as acute. Given the increasing availability of digitized full text content, library systems have the opportunity to employ ‘digital humanities computing techniques’ against collections and enable people to do ‘distant reading’.” I then demonstrated how the simple counting of words and phrases, the use of concordances, and the application of TFIDF can facilitate rudimentary comparing & contrasting of corpora. Giving this presentation was an enjoyable experience because it provided me the chance to verbalize and demonstrate much of my current “great books” research.

Later in the morning I helped facilitate a discussion on the process a library could go through to implement the ideas outlined in my presentation, but the vast majority of people attended the presentation by Keith Mountin (Apple Computer, Inc.) called “The iPad and its application in libraries”.

Conclusion

Madison was just as nice as I remember. Youthful. Liberal. Progressive. Thanks go to Deb Shapiro and Mark Beatty. They invited me to sit with them on the capitol lawn and listen to the local orchestra play Beatles music. The whole thing was very refreshing.

The trip back from the conference was a hellacious experience in air travel, but it did give me the chance to have an extended chat with Tim Spalding in the airport. We discussed statistics and statistical measures that can be applied to content we are generating. Many of the things he is doing with metadata I may be able to do with full text. The converse is true as well. Moreover, by combining our datasets we may find that the sum is greater than the parts — all puns intended. Both Tim and I agreed this is something we should both work towards. Afterwards I ate macaroni & cheese with a soft pretzel and a beer. It seemed apropos for Wisconsin.

This was my second or third time attending WiLSWorld. Like the previous meetings, the good folks at WiLS — specifically Tom Zilner, Mark Beatty, and Shirley Schenning — put together a conference providing librarians from across Wisconsin with a set of relatively inexpensive professional development opportunities. Timely presentations. Plenty of time for informal discussions. All in a setting conducive to getting away and thinking a bit outside the box. “Thank you.”

Digital Humanities 2010: A Travelogue

Eric Lease Morgan — Sun, 25 Jul 2010 16:52:10 +0000

I was fortunate enough to be able to attend a conference called Digital Humanities 2010 (London, England) between July 4th and 10th. This posting documents my experiences and take-aways. In a sentence, the conference provided a set of much needed intellectual stimulation and challenges as well as validated the soundness of my current research surrounding the Great Books.

Pre-conference activities

All day Monday, July 5, I participated in a workshop called Text mining in the digital humanities facilitated by Marco Büchler, et al. of the University of Leipzig. A definition of “e-humanities” was given, “The application of computer science to do qualitative evaluation of texts without the use of things like TEI.” I learned that graphing texts illustrates concepts quickly — “A picture is worth a thousand words.” Also, I learned I should consider creating co-occurrence graphs — pictures illustrating what words co-occur with a given word. Finally, according to the Law of Least Effort, the strongest content words in a text are usually the ones that do not occur most frequently, nor the ones occurring the least, but rather the words occurring somewhere in between. A useful quote includes, “Text mining allows one to search even without knowing any search terms.” Much of this workshop’s content came from the eAQUA Project.

On Tuesday I attended the first half of a THATCamp led by Dan Cohen (George Mason University) where I learned THATCamps are expected to be: 1) fun, 2) productive, and 3) collegial. The whole thing came off as a “bar camp” for scholarly conferences. As a part of the ‘Camp I elected to participate in the Developer’s Challenge and submitted an entry called “How ‘great’ is this article?”. My hack compared texts from the English Women’s Journal to the Great Books Coefficient in order to determine “greatness”. My entry did not win. Instead the prize went to Patrick Juola with honorable mentions going to Loretta Auvil, Marco Büchler, and Thomas Eckart.

Wednesday morning I learned more about text mining in a workshop called Introduction to text analysis using JiTR and Voyeur led by Stéfan Sinclair (McMaster University) and Geoffrey Rockwell (University of Alberta). The purpose of the workshop was “to learn how to integrate text analysis into a scholar’s/researcher’s workflow.” More specifically, we learned how to use a tool called Voyeur, an evolution of TAPoR. The “kewlest” thing I learned was the definition of word density, (U / W) × 1000, where U is the total number of unique words in a text and W is the total number of words in a text. The closer the result is to 1000 the richer and more dense a text is. In general, denser documents are more difficult to read. (For a good time, I wrote density.pl — a program to compute density given an arbitrary plain text file.)
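The density calculation is simple enough to sketch. What follows is a Python illustration of the formula, not density.pl itself; the function name, the tokenizer, and the sample sentence are assumptions, and a different way of splitting a text into words will yield a somewhat different score.

```python
import re

def density(text):
    """Return (U / W) * 1000 for a text, or 0.0 when the text has no words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    # U is the number of unique words; W is the total number of words
    return len(set(words)) / len(words) * 1000

# hypothetical sample: five unique words out of six
print(density("the cat sat on the mat"))
```

A text in which no word is ever repeated scores exactly 1000, and the score falls as repetition increases.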

In keeping with the broad definition of humanities, I was “seduced” in the afternoon by listening to recordings from a website called CHARM (Center for History and Analysis of Recorded Music). The presentation described and presented digitized classical music from the very beginnings of recorded music. All apropos since the BBC was located just across the street from King’s College where the conference took place. When this was over we retired to the deck for tea and cake. There I learned the significant recording time differences between 10″ and 12″ 78 rpm records. As with many mediums, the recording artist needed to make accommodations accordingly.

Plenty of presentations

The conference officially began Wednesday evening and ended Saturday afternoon. According to my notes, I attended as many as eighteen sessions. (Wow!?) Listed below are summaries of most of the ones I attended:

Charles Henry (Council on Library and Information Resources) and Hold up a mirror – In this keynote presentation Henry compared & contrasted manifestations (oral, written, and digital) of Homer, Beowulf, and a 9-volume set of religious ceremonies compiled in the 18th century. He then asked the question, “How can machines be used to capture the interior of the working mind?” Or, in my own words, “How can computers be used to explore the human condition?” The digital versions of the items listed above were used as example answers, and a purpose of the conference was to address this question in other ways. He said, “There are many types of performance, preservation, and interpretation.”
Patrick Juola (Duquesne University) and Distant reading and mapping genre space via conjecture-based distance measures – Juola began by answering the question, “What do you do with a million books?”, and enumerated a number of things: 1) search, 2) summarize, 3) sample, and 4) visualize. These sorts of processes against texts are increasingly called “distant reading” and are contrasted with the more traditional “close reading”. He then went on to describe his “Conjecturator” — a system where assertions are randomly generated and then evaluated. He demonstrated this technique against a set of Victorian novels. His presentation was not dissimilar to the presentation he gave at the digital humanities conference in Chicago the previous year.
Jan Rybicki (Pedagogical University) and Deeper delta across genres and language: Do we really need the most frequent words? – In short Rybicki said, “Doing simple frequency counts [to do authorship analysis] does not work very well for all languages, and we are evaluating ‘deeper deltas'” — an allusion to the work of J.F. Burrows and D.L. Hoover. Specifically, using a “moving window” of stop words he looked for similarities in authorship between a number of texts and believed his technique has proved to be more or less successful.
David Holmes (College of New Jersey) and The Diary of a public man: A Case study in traditional and non-traditional author attribution – Soon after the Civil War a book called The Diary Of A Public Man was written by an anonymous author. Using stylometric techniques, Holmes asserts the work really was written as a diary and was authored by William Hurlbert.
David Hoover (New York University) and Teasing out authorship and style with t-tests and zeta – Hoover used T-tests and Zeta tests to validate whether or not a particular author finished a particular novel from the 1800s. Using these techniques he was able to illustrate writing styles and how they changed dramatically between one chapter in the book and another chapter. He asserted that such analysis would have been extremely difficult through rudimentary casual reading.
Martin Holmes (University of Victoria) and Using the universal similarity metric to map correspondences between witnesses – Holmes described how he was comparing the similarity between texts through the use of a compression algorithm. Compress texts. Compare their resulting lengths. The closer the lengths, the greater the similarity. The process works for a variety of file types and languages, and when there is no syntactical knowledge.
Dirk Roorda (Data Archiving and Networked Services) and The Ecology of longevity: The Relevance of evolutionary theory for digital preservation – Roorda drew parallels between biology and preservation. For example, biological systems use and retain biological characteristics. Preservation systems re-use and thus preserve content. Biological systems make copies and evolve. Preservation can be about migrating formats forward thus creating different forms. Biological systems employ sexual selections. “Look how attractive I am.” Repositories or digital items displaying “seals of approval” function similarly. Finally, he went on to describe how these principles could be integrated in a preservation system where fees are charged for storing content and providing access to it. He emphasized such systems would not necessarily be designed to handle intellectual property rights.
Lewis Ulman (Ohio State University) & Melanie Schlosser (Ohio State University) and The Specimen case and the garden: Preserving complex digital objects, sustaining digital projects – Ulman and Schlosser described a dichotomy manifesting itself in digital libraries. On one hand there is a practical need for digital library systems to be similar between each other because “boutique” systems are very expensive to curate and maintain. At the same time specialized digital library applications are needed because they represent the frontiers of research. How to accommodate both, that was their question. “No one group (librarians, information technologists, faculty) will be able to do preservation alone. They need to work together. Specifically, they need to connect, support, and curate.”
George Buchanan (City University) and Digital libraries of scholarly editions – Similar to Ulman/Schlosser above, Buchanan said, “It is difficult to provide library services against scholarly editions because each edition is just too much different from the next to create a [single] system.” He advocated the Greenstone digital library system.
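
The compress-and-compare idea from Martin Holmes's session can be sketched in a few lines of Python. This is only an illustration and not Holmes's implementation: it uses the normalized compression distance, a commonly used approximation of the universal similarity metric, and it assumes zlib as the compressor. The function names are hypothetical. Scores near 0 suggest similar texts, and scores near 1 suggest dissimilar ones.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the length of the zlib-compressed form of the data."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts.

    Assumption: zlib stands in for whatever compressor is actually used.
    """
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    # If the texts share a lot, compressing them together costs little more
    # than compressing the larger one alone.
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because the measure only looks at bytes, it needs no knowledge of language or syntax, which matches the point made in the session.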

Joe Raben (Queens College of the City University of New York) and Humanities computing in an age of social change – In this presentation, given after being honored with the community’s Busa Award, Raben first outlined the history of the digital humanities. It included the work done by Father Busa who collaborated with IBM in the 1960s to create a concordance against some of Thomas Aquinas’s work. It included a description of a few seminal meetings and the formulation of the Computing in the Humanities journal. He alluded to “machine readable texts” — a term which is no longer in vogue but reminded me of “machine readable cataloging” (MARC) and how the library profession has not moved on. He advocated for a humanities wiki where ideas and objects could be shared. It sounded a lot like the arts-humanities.net website. He discussed the good work of a Dante project hosted at Princeton University, and I was dismayed because Notre Dame’s significant collection of Dante materials has not played a role in this particular digital library. A humanist through and through, he said, “Computers are increasingly controlling our lives and the humanities have not affected how we live in the same way.” To this I say, computers represent close trends compared to the more ingrained values of the human condition. The former are quick to change, the latter change oh so very slowly yet they are more pervasive. Compared to computer technology, I believe the humanists have had more long-lasting effects on the human condition.
Lynne Siemens (University of Victoria) and A Tale of two cities: Implications of the similarities in collaborative approaches within the digital libraries and digital humanities communities – Siemens reported on the results of a survey in an effort to determine how and why digital librarians and digital humanists collaborate. “There are cultural differences between librarians and academics, but teams [including both] are necessary. The solution is to assume the differences rather than the similarities. Everybody brings something to the team.”
Fenella France (Library of Congress) and Challenges of linking digital heritage scientific data with scholarly research: From navigation to politics – France described some of the digital scanning processes of the Library of Congress, and some of the consequences. For example, their technique allowed archivists to discover how Thomas Jefferson wrote, crossed out, and then replaced the word “subjects” with “citizens” in a draft of the Declaration of Independence. A couple of interesting quotes included, “We get into the optical archeology of the documents”, and “Digitization is access, not preservation.”
Joshua Sternfeld (National Endowment for the Humanities) and Thinking archivally: Search and metadata as building blocks for a new digital historiography – Sternfeld advocated for different sets of digital library evaluation. “There is a need for more types of reviews against digital resource materials. We need a method for doing: selection, search, and reliability… The idea of provenance — the order of document creation — needs to be implemented in the digital realm.”
Wendell Piez (Mulberry Technologies, Inc.) and Towards hermeneutic markup: An Architectural outline – Hermeneutic markup consists of annotations against a text that are purely about interpretation. “We don’t really have the ability to do hermeneutic markup… Existing schemas are fine, but every once in a while exceptions need to be made and such things break the standard.” Numerous times Piez alluded to the “overlap problem” — the inability to demarcate something crossing the essentially strict hierarchical nature of XML elements. Textual highlighting is a good example. Piez gave a few examples of how the overlap problem might be resolved and how hermeneutic markup may be achieved.
Jane Hunter (University of Queensland) and The Open Annotation collaboration: A Data model to support sharing and interoperability of scholarly annotations – Working with a number of other researchers, Hunter said, “The problem is that there is an extraordinarily wide variety of tools, lack of consistency, no standards, and no sharable interoperability when it comes to Web-based annotation.” Their goal is to create a data model to enable such functionality. While the model is not complete, it is being based on RDF, SANE, and OATS. See www.openannotation.org.
Susan Brown (University of Alberta and University of Guelph) and How do you visualize a million links? – Brown described a number of ways she is exploring visualization techniques. Examples included link graphs, tag clouds, bread board searches, cityscapes, and something based on “six degrees of separation”.
Lewis Lancaster (University of California, Berkeley) and From text to image to analysis: Visualization of Chinese Buddhist canon – Lancaster has been doing research against a (huge) set of Korean glyphs for quite a number of years. Just like other writing techniques, the glyphs change over time. Through the use of digital humanities computing techniques, he has been able to discover much more quickly patterns and bigrams that he was not able to discover previously. “We must present our ideas as images because language is too complex and takes too much time to ingest.”

Take-aways

In the spirit of British fast food, I have a number of take-aways. First and foremost, I learned that my current digital humanities research into the Great Books is right on target. It asks questions of the human condition and tries to answer them through the use of computing techniques. This alone was worth the total cost of my attendance.

Second, as a relative outsider to the community, I perceived a pervasive us versus them mentality being described. Us digital humanists and those traditional humanists. Us digital humanists and those computer programmers and systems administrators. Us digital humanists and those librarians and archivists. Us digital humanists and those academic bureaucrats. If you consider yourself a digital humanist, then please don’t take this observation the wrong way. I believe communities inherently do this as a matter of fact. It is a process used to define one’s self. The heart of much of this particular differentiation seems to be yet another example of C.P. Snow’s The Two Cultures. As a humanist myself, I identify with the perception. I think the processes of art and science complement each other, not contradict nor conflict. A balance of both is needed in order to adequately create a cosmos out of the apparent chaos of our existence — a concept I call arscience.

Third, I had ample opportunities to enjoy myself as a tourist. The day I arrived I played frisbee disc golf with a few “cool dudes” at Lloyd Park in Croydon. On the Monday I went to the National Theater and saw Welcome to Thebes — a depressing tragedy where everybody dies. On the Tuesday I took in Windsor Castle. Another day I carried my Culver Citizen newspaper to have its photograph taken in front of Big Ben. Throughout my time there I experienced interesting food, a myriad of languages & cultures, and the almost overwhelming size of London. Embarrassingly, I had forgotten how large the city really is.

Finally, I actually enjoyed reading the formally published conference abstracts — all three pounds and 400 pages of it. It was thorough, complete, and even included an author index. More importantly, I discovered more than a few quotes supporting an idea for library systems that I have been calling “services against texts”:

The challenge is to provide the researcher with a means to perceiving or specifying subsets of data, extracting the relevent information, building the nodes and edges, and then providing the means to navigate the vast number of nodes and edges. (Susan Brown in “How do you visualize a million links” on page 106)

However, current DL [digital library] systems lack critical features: they have too simple a model of documents, and lack scholarly apparatus. (George Buchanan in “Digital libraries of scholarly editions” on page 108.)

This approach takes us to the what F. Moretti (2005) has termed ‘distant reading,’ a method that stresses summarizing large bodies of text rather than focusing on a few texts in detail. (Ian Gregory in “GIS, texts and images: New approaches to landscape appreciation in the Lake District” on page 159).

And the best quote is:

In smart digital libraries, a text should not only be an object but a service: not a static entity but an interactive method. The text should be computationally exploitable so that it can be sampled and used, not simply reproduced in its entirety… the reformulation of the dictionary not as an object, but a service. (Toma Tasovac in “Reimagining the dictionary, or why lexicography needs digital humanities” on page 254)

In conclusion, I feel blessed with the ability to attend the conference. I learned a lot, and I will recommend it to any librarian or humanist.

How “great” is this article?

Eric Lease Morgan — Fri, 09 Jul 2010 07:33:34 +0000

During Digital Humanities 2010 I participated in the THATCamp London Developers’ Challenge and tried to answer the question, “How ‘great’ is this article?” This posting outlines the functionality of my submission, links to a screen capture demonstrating it, and provides access to the source code.

Given any text file — say an article from the English Women’s Journal — my submission tries to answer the question, “How ‘great’ is this article?” It does this by:

1. returning the most common words in a text
2. returning the most common bigrams in a text
3. calculating a few readability scores
4. comparing the texts to a standardized set of “great ideas”
5. supporting a concordance for browsing

Functions #1, #2, #3, and #5 are relatively straight-forward and well-understood. Function #4 needs some explanation.

In the 1950s a set of books was published called the Great Books. The set is based on a set of 102 “great ideas” (such as art, love, honor, truth, justice, wisdom, science, etc.). By summing the TFIDF scores of each of these ideas for each of the books, a “great ideas coefficient” can be computed. Through this process we find that Shakespeare wrote seven of the top ten books when it comes to love. Kant wrote the “greatest book”. The American State’s Articles of Confederation ranks the highest when it comes to war. This “coefficient” can then be used as a standard — an index — for comparing other documents. This is exactly what this program does. (See the screen capture for a demonstration.)
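
The summing step can be sketched as follows. This is a minimal sketch and not the source code of my submission: the exact TFIDF variant is not spelled out above, so the formula used here (term frequency multiplied by the log of the inverse document frequency) is an assumption, and the function names, the three-idea list, and the toy corpus are all hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(term, doc_tokens, corpus_tokens):
    """One common TFIDF variant (an assumption): tf * log(N / df)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus_tokens) / df)


def great_ideas_coefficient(doc_tokens, corpus_tokens, ideas):
    """Sum the TFIDF scores of every idea for a single document."""
    return sum(tfidf(idea, doc_tokens, corpus_tokens) for idea in ideas)


# Hypothetical stand-ins for the 102 ideas and the full corpus.
ideas = ["love", "war", "justice"]
corpus = {
    "sonnets": "love is not love which alters when it alteration finds",
    "treatise": "justice is the first virtue of a state and war is its failure",
    "almanac": "rain is expected and the harvest is late",
}
tokens = {title: tokenize(text) for title, text in corpus.items()}
scores = {
    title: great_ideas_coefficient(doc, list(tokens.values()), ideas)
    for title, doc in tokens.items()
}
for title in sorted(scores, key=scores.get, reverse=True):
    print(title, round(scores[title], 3))
```

Sorting the documents by this score gives a ranking, and a new document can be scored against the same corpus to see where it falls.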

The program can be improved in a number of ways:

it could be Web-based
it could process non-text files
it could graphically illustrate a text’s “greatness”
it could hyperlink returned words directly to the concordance

Thanks to Gerhard Brey and the folks of the Nineteenth Century Serials Editions for providing the data. Very interesting.

ALA 2010

Eric Lease Morgan — Wed, 30 Jun 2010 19:42:05 +0000

This is the briefest of travelogues describing my experience at the 2010 ALA Annual Meeting in Washington (DC).

Pat Lawton and I gave a presentation at the ~~White House~~ Four Points Hotel on the “Catholic Portal”. Essentially it was a status report. We shared the podium with Jon Miller (University of Southern California) who described the International Mission Photography Archive — an extensive collection of photographs taken by missionaries from many denominations.

I then took the opportunity to visit my mother in Pennsylvania, but the significant point is the way I got out of town. I had lost my maps, and my iPad came to the rescue. The Google Maps application was very, very useful.

On Monday I shared a podium with John Blyberg (Darien Library) and Tim Spalding (LibraryThing) as a part of a Next-Generation Library Catalog Special Interest Group presentation. John provided an overview of the latest and greatest features of SOPAC. He emphasized a lot of user-centered design. Tim described library content and services as not (really) being a part of the Web. In many ways I agree with him. I outlined how a few digital humanities computing techniques could be incorporated into library collections and services in a presentation I called “The Next Next-Generation Library Catalog”. That afternoon I participated in a VUFind users-group meeting, and I learned that I am pretty much on target in regards to the features of this “discovery system”. Afterwards a number of us from the Catholic Research Resources Alliance (CRRA) listened to folks from Crivella West describe their vision of librarianship. The presentation was very interesting because they described how they have taken many collections of content and mined them for answers to questions. This is digital humanities to the extreme. Their software — the Knowledge Kiosk — is being used to analyze the content of John Henry Newman at the Newman Institute.

Tuesday morning was spent more with the CRRA. We ratified next year’s strategic plan. In the afternoon I visited a few of my friends at the Library of Congress (LOC). There I learned a bit about how the LOC may be storing and archiving Twitter feeds. Interesting.

Text mining against NGC4Lib

Eric Lease Morgan — Fri, 25 Jun 2010 15:23:51 +0000

I “own” a mailing list called NGC4Lib. Its purpose is to provide a forum for the discussion of all things “next generation library catalog”. As of this writing, there are about 2,000 subscribers.

Lately I have been asking myself, “What sorts of things get discussed on the list and who participates in the discussion?” I thought I’d try to answer this question with a bit of text mining. This analysis only covers the current year to date, 2010.

Author names

Even though there are as many as 2,000 subscribers, only a tiny few actually post comments. The following pie and line charts illustrate the point without naming any names. As you can see, eleven (11) people contribute 50% of the postings.

11 people post 50% of the messages

The line chart illustrates the same point differently; a few people post a lot. We definitely have a long tail going on here.

They definitely represent a long tail
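
For the curious, the arithmetic behind a statement like “eleven people contribute 50% of the postings” can be sketched as follows. This is an illustration only and not the script I used: the function name is hypothetical, and the list of authors is made-up data standing in for the From headers of the archive.

```python
from collections import Counter


def posters_for_share(authors, share=0.5):
    """Return how many of the most active authors account for the given share of postings."""
    counts = Counter(authors)
    total = sum(counts.values())
    running = 0
    # Walk down the authors from most to least active, accumulating postings.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= share * total:
            return rank
    return 0


# Hypothetical data: one entry per posting, named by its author.
authors = ["a"] * 5 + ["b"] * 3 + ["c", "d", "e", "f"]
print(posters_for_share(authors))
```

Plotting the sorted counts from the same Counter is what produces the long-tail shape.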

Subject lines

The most frequently used individual subject line words more or less reflect traditional library cataloging practices. MARC. MODS. Cataloging. OCLC. But also notice how the word “impasse” is included. This may reflect something about the list.

The subject words look “traditional”

I’m not quite sure what to make of the most commonly used subject word bigrams.

Don’t know what to make of these bigrams

Body words

The most frequently used individual words in the body of the postings tell a nice story. Library. Information. Data. HTTP. But notice what is not there — books. I also don’t see things like collections, acquisitions, public, services, nor value or evaluation. Hmm…

These tell a nice story

The most frequently used bigrams in the body of the messages tell an even more interesting story because they are dominated by the names of people and things.

Names of people and things

The phrases “information services” and “technical services” do not necessarily fit my description. Using a concordance to see how these words were being used, I discovered they were overwhelmingly a part of one or more persons’ email signatures or job descriptions. Not what I was hoping for. (Sigh.)

Conclusions

Based on these observations, as well as my personal experience, I believe the NGC4Lib mailing list needs more balance. It needs more balance in a couple of ways:

There are too few people who post the majority of the content. The opinions of eleven people do not, IMHO, represent the ideas and beliefs of more than 2,000. I am hoping these few people understand this and will moderate themselves accordingly.
The discussion is too much focused, IMHO, on traditional library cataloging. There is so much more to the catalog than metadata. We need to be asking questions about what it contains, how that stuff is selected and how it gets in there, what the stuff is used for, and how all of this fits into the broader, worldwide information environment. We need to be discussing issues of collection and dissemination, not just organization. Put another way, I wish I had not used the word “catalog” in the name of the list because I think the word brings along too many connotations and preconceived ideas.

As the owner of the list, what will I do? Frankly, I don’t know. Your thoughts and comments are welcome.

The Next Next-Generation Library Catalog

Eric Lease Morgan — Thu, 24 Jun 2010 16:24:52 +0000

With the advent of the Internet and wide-scale availability of full-text content, people are overwhelmed with the amount of accessible data and information. Library catalogs can only go so far when it comes to delimiting what is relevant and what is not. Even when the most exact searches return 100s of hits, what is a person to do? Services against texts — digital humanities computing techniques — represent a possible answer. Whether the content is represented by novels, works of literature, or scholarly journal articles the methods of the digital humanities can provide ways to compare & contrast, analyze, and make more useful any type of content. This essay elaborates on these ideas and describes how they can be integrated into the “next, next-generation library catalog”.

(Because this essay is the foundation for a presentation at the 2010 ALA Annual Meeting, this presentation is also available as a one-page handout designed for printing as well as a bloated set of slides.)

Find is not the problem

Find is not the problem to be solved. At most, find is a means to an end and not the end itself. Instead, the problem to solve surrounds use. The profession needs to implement automated ways to make it easier for users to do things against content.

The library profession spends an inordinate amount of time and effort creating catalogs — essentially inventory lists of things a library owns (or licenses). The profession then puts a layer on top of this inventory list — complete with authority lists, controlled vocabularies, and ever-cryptic administrative data — to facilitate discovery. When poorly implemented, this discovery layer is seen by the library user as an impediment to their real goal. Read a book or article. Verify a fact. Learn a procedure. Compare & contrast one idea with another idea. Etc.

In just the past few years the library profession has learned that indexers (as opposed to databases) are the tools to facilitate find. This is true for two reasons. First, indexers reduce the need for users to know how the underlying data is structured. Second, indexers employ statistical analysis to rank their output by relevance. Databases are great for creating and maintaining content. Indexers are great for search. Both are needed in equal measures in order to implement the sort of information retrieval systems people have come to expect. For example, many of the profession’s current crop of “discovery” systems (VUFind, Blacklight, Summon, Primo, etc.) all use an open source indexer called Lucene to drive search.

This being the case, we can more or less call the problem of find solved. True, software is never done, and things can always be improved, but improvements in the realm of search will only be incremental.

Instead of focusing on find, the profession needs to focus on the next steps in the process. After a person does a search and gets back a list of results, what do they want to do? First, they will want to peruse the items in the list. After identifying items of interest, they will want to acquire them. Once the selected items are in hand users may want to print, but at the very least they will want to read. During the course of this reading the user may be doing any number of things. Ranking. Reviewing. Annotating. Summarizing. Evaluating. Looking for a specific fact. Extracting the essence of the author’s message. Comparing & contrasting the text to other texts. Looking for sets of themes. Tracing ideas both inside and outside the texts. In other words, find and acquire are just a means to greater ends. Find and acquire are library goals, not the goals of users.

People want to perform actions against the content they acquire. They want to use the content. They want to do stuff with it. By expanding our definition of “information literacy” to include things beyond metadata and bibliography, and by combining it with the power of computers, librarianship can further “save the time of the reader” and thus remain relevant in the current information environment. Focusing on the use and evaluation of information represents a growth opportunity for librarianship.

It starts with counting

The availability of full text content in the form of plain text files combined with the power of computing empowers one to do statistical analysis against corpora. Put another way, computers are great at counting words, and once sets of words are counted there are many things one can do with the results, such as but not limited to:

measuring length
measuring readability, “greatness”, or any other index
measuring frequency of unigrams, n-grams, parts-of-speech, etc.
charting & graphing analysis (word clouds, scatter plots, histograms, etc.)
analyzing measurements and looking for patterns
drawing conclusions and making hypotheses

For example, suppose you did the perfect search and identified all of the works of Plato, Aristotle, and Shakespeare. Then, if you had the full text, you could compute a simple table such as Table 1.

Author	Works	Words	Average	Grade	Flesch
Plato	25	1,162,46	46,499	12-15	54
Aristotle	19	950,078	50,004	13-17	50
Shakespeare	36	856,594	23,794	7-10	72

The table lists who wrote how many works. It lists the number of words in each set of works and the average number of words per work. Finally, based on things like sentence length, it estimates grade and reading levels for the works. Given such information, a library “catalog” could help the patron answer questions such as:

Which author has the most works?
Which author has the shortest works?
Which author is the most verbose?
Is the author of most works also the author who is the most verbose?
In general, which set of works requires the higher grade level?
Does the estimated grade/reading level of each author’s work coincide with one’s expectations?
Are there any authors whose works are more or less similar in reading level?
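
The grade and Flesch columns of a table like Table 1 can be estimated with the standard Flesch reading ease and Flesch-Kincaid grade level formulas. The sketch below is one way to do it and not necessarily how the numbers above were produced: the vowel-run syllable counter and the simple sentence splitter are rough assumptions, and the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return (Flesch reading ease, Flesch-Kincaid grade level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return flesch, grade
```

Higher Flesch scores mean easier reading, which is consistent with Shakespeare having both the highest Flesch score and the lowest grade range in the table.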

Given the full text, a trivial program can then be written to count the number of words existing in a corpus as well as the number of times each word occurs, as shown in Table 2.

Plato	Aristotle	Shakespeare
will	one	thou
one	will	will
socrates	must	thy
may	also	shall
good	things	lord
said	man	thee
man	may	sir
say	animals	king
true	thing	good
shall	two	now
like	time	come
can	can	well
must	another	enter
another	part	love
men	first	let
now	either	hath
also	like	man
things	good	like
first	case	one
let	nature	upon
nature	motion	know
many	since	say
state	others	make
knowledge	now	may
two	way	yet

Table 2, sans a set of stop words, lists the most frequently used words in the complete works of Plato, Aristotle, and Shakespeare. The patron can then ask and answer questions like:

Are there words in one column that appear frequently in all columns?
Are there words that appear in only one column?
Are the rankings of the words similar between columns?
To what degree are the words in each column a part of larger groups such as: nouns, verbs, adjectives, etc.?
Are there many synonyms or antonyms shared inside or between the columns?

Notice how the words “one”, “good” and “man” appear in all three columns. Does that represent some sort of shared quality between the works?
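
A trivial word-counting program of the kind behind Table 2 might look like the following. It is a sketch, not the program actually used: the tiny stop word list is a hypothetical placeholder for a real one, and the function names are my own.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in", "that", "it"}


def most_common_words(text, n=25, stop_words=STOP_WORDS):
    """Return the n most frequent words in a text, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)


def shared_words(*columns):
    """Return the words appearing in every column of a table like Table 2."""
    return set.intersection(*(set(column) for column in columns))
```

Feeding the three columns of Table 2 to `shared_words` is one way to answer the first question in the list above.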

If one word contains some meaning, then do two words contain twice as much meaning? Here is a list of the most common two-word phrases (bigrams) in each author corpus, Table 3.

Plato	Aristotle	Shakespeare
let us	one another	king henry
one another	something else	thou art
young socrates	let us	thou hast
just now	takes place	king richard
first place	one thing	mark antony
every one	without qualification	prince henry
like manner	middle term	let us
every man	first figure	king lear
quite true	b belongs	thou shalt
two kinds	take place	duke vincentio
human life	essential nature	dost thou
one thing	every one	sir toby
will make	practical wisdom	art thou
human nature	will belong	henry v
human mind	general rule	richard iii
quite right	anything else	toby belch
modern times	one might	scene ii
young men	first principle	act iv
can hardly	good man	iv scene
will never	two things	exeunt king
will tell	two kinds	don pedro
dare say	first place	mistress quickly
will say	like manner	act iii
false opinion	one kind	thou dost
one else	scientific knowledge	sir john

Notice how the names of people appear frequently in Shakespeare’s works, but very few names appear in the lists of Plato and Aristotle. Notice how the word “thou” appears a lot in Shakespeare’s works. Ask yourself the meaning of the word “thou”, and decide whether or not to update the stop word list. Notice how the common phrases of Plato and Aristotle are akin to ideas, not tangible things. Examples include: human nature, practical wisdom, first principle, false opinion, etc. Is there a pattern here?
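
Counting bigrams is only a small step beyond counting words. The sketch below pairs each word with its neighbor and drops any pair containing a stop word. That filtering rule is an assumption about how a list like Table 3 could be produced, and the function name is hypothetical.

```python
import re
from collections import Counter


def most_common_bigrams(text, n=25, stop_words=frozenset()):
    """Return the n most frequent two-word phrases in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    # Pair every word with the word that follows it. Note that this simple
    # approach lets pairs span sentence boundaries.
    pairs = zip(words, words[1:])
    kept = (
        f"{first} {second}"
        for first, second in pairs
        if first not in stop_words and second not in stop_words
    )
    return Counter(kept).most_common(n)
```

Adding “thou” to `stop_words` and re-running the count is a quick way to see how much the Shakespeare column depends on that one word.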

If “a picture is worth a thousand words”, then there are about six thousand words represented by Figures 1 through 6.

Words used by Plato	Phrases used by Plato
Words used by Aristotle	Phrases used by Aristotle
Words used by Shakespeare	Phrases used by Shakespeare

Word clouds — “tag clouds” — are an increasingly popular way to illustrate the frequency of words or phrases in a corpus. Because a few of the phrases in a couple of the corpora were considered outliers, phrases such as “let us”, “one another”, and “something else” are not depicted.

Even without the use of statistics, it appears the use of the phrase “good man” by each author might be interestingly compared & contrasted. A concordance is an excellent tool for such a purpose, and below are a few of the more meaty uses of “good man” by each author.

List 1 – “good man” as used by Plato

  ngth or mere cleverness. To the good man, education is of all things the most pr
   Nothing evil can happen to the good man either in life or death, and his own de
  but one reply: 'The rule of one good man is better than the rule of all the rest
   SOCRATES: A just and pious and good man is the friend of the gods; is he not? P
  ry wise man who happens to be a good man is more than human (daimonion) both in

List 2 – “good man” as used by Aristotle

  ons that shame is felt, and the good man will never voluntarily do bad actions. 
  reatest of goods. Therefore the good man should be a lover of self (for he will 
  hat is best for itself, and the good man obeys his reason. It is true of the goo
  theme If, as I said before, the good man has a right to rule because he is bette
  d prove that in some states the good man and the good citizen are the same, and

List 3 – “good man” as used by Shakespeare

  r to that. SHYLOCK Antonio is a good man. BASSANIO Have you heard any imputation
  p out, the rest I'll whistle. A good man's fortune may grow out at heels: Give y
  t it, Thou canst not hit it, my good man. BOYET An I cannot, cannot, cannot, An 
  hy, look where he comes; and my good man too: he's as far from jealousy as I am 
   mean, that married her, alack, good man! And therefore banish'd -- is a creatur

What sorts of judgements might the patron be able to make based on the snippets listed above? Are Plato, Aristotle, and Shakespeare all defining the meaning of a “good man”? If so, then what are some of the definitions? Are there qualitative similarities and/or differences between the definitions?
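
Snippets like the ones in Lists 1 through 3 are keyword-in-context lines, and a bare-bones concordance that produces them can be sketched as follows. This is not the concordance used to create the lists: the function name and the default width are assumptions, and the case-insensitive matching is only reliable for plain English text.

```python
def concordance(text, phrase, width=32):
    """Return keyword-in-context lines for every occurrence of a phrase."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    lower, target = flat.lower(), phrase.lower()
    lines = []
    start = lower.find(target)
    while start != -1:
        # Pad the left context so the phrase lines up in the same column.
        left = flat[max(0, start - width):start].rjust(width)
        right = flat[start:start + len(target) + width]
        lines.append(left + right)
        start = lower.find(target, start + 1)
    return lines
```

Because the phrase always begins in the same column, the eye can scan down the list and compare how each author uses it.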

Sometimes being as blunt as asking a direct question, like “What is a man?”, can be useful. Lists 4 through 6 try to answer it.

List 4 – “man is” as used by Plato

  stice, he is met by the fact that man is a social being, and he tries to harmoni
  ption of Not-being to difference. Man is a rational animal, and is not -- as man
  ss them. Or, as others have said: Man is man because he has the gift of speech; 
  wise man who happens to be a good man is more than human (daimonion) both in lif
  ied with the Protagorean saying, 'Man is the measure of all things;' and of this

List 5 – “man is” as used by Aristotle

  ronounced by the judgement 'every man is unjust', the same must needs hold good 
  ts are formed from a residue that man is the most naked in body of all animals a
  ated piece at draughts. Now, that man is more of a political animal than bees or
  hese vices later. The magnificent man is like an artist; for he can see what is 
  lement in the essential nature of man is knowledge; the apprehension of animal a

List 6 – “man is” as used by Shakespeare

   what I have said against it; for man is a giddy thing, and this is my conclusio
   of man to say what dream it was: man is but an ass, if he go about to expound t
  e a raven for a dove? The will of man is by his reason sway'd; And reason says y
  n you: let me ask you a question. Man is enemy to virginity; how may we barricad
  er, let us dine and never fret: A man is master of his liberty: Time is their ma

In the 1950s Mortimer Adler and a set of colleagues created a set of works they called The Great Books of the Western World. This 60-volume set included all the works of Plato, Aristotle, and Shakespeare as well as some of the works of Augustine, Aquinas, Milton, Kepler, Galileo, Newton, Melville, Kant, James, and Freud. Prior to the set’s creation, Adler and colleagues enumerated 102 “greatest ideas” including concepts such as: angel, art, beauty, honor, justice, science, truth, wisdom, war, etc. Each book in the series was selected for inclusion by the committee because of the way it elaborated on the meaning of the “great ideas”.

Given the full text of each of the Great Books as well as a set of keywords (the “great ideas”), it is relatively simple to calculate a relevancy ranking score for each item in a corpus. Love is one of the “great ideas”, and it just so happens it is used most significantly by Shakespeare compared to the use of the other authors in the set. If Shakespeare has the highest “love quotient”, then what does Shakespeare have to say about love? List 7 is a brute force answer to such a question.

List 7 – “love is” as used by Shakespeare

  y attempted? Love is a familiar; Love is a devil: there is no evil angel but Lov
  er. VALENTINE Why? SPEED Because Love is blind. O, that you had mine eyes; or yo
   that. DUKE This very night; for Love is like a child, That longs for every thin
  n can express how much. ROSALIND Love is merely a madness, and, I tell you, dese
  of true minds Admit impediments. Love is not love Which alters when it alteratio

Do these definitions coincide with expectations? Maybe further reading is necessary.

Digital humanities, library science, and “catalogs”

The previous section is just about the most gentle introduction to digital humanities computing possible, but it can also be an introduction to a new breed of library science and library catalogs.

It began by assuming the existence of full text content in plain text form — an increasingly reasonable assumption. After denoting a subset of content, it compared & contrasted the sizes and reading levels of the content. By counting individual words and phrases, patterns were discovered in the texts and a particular idea was loosely followed — specifically, the definition of a good man. Finally, the works of a particular author were compared to the works of a larger whole to learn how the author defined a particular “great idea”.

The fundamental tools used in this analysis were a set of rudimentary Perl modules: Lingua::EN::Fathom for calculating the total number of words in a document as well as a document’s reading level, Lingua::EN::Bigram for listing the most frequently occurring words and phrases, and Lingua::Concordance for listing sentence snippets. The Perl programs built on top of these modules are relatively short and include: fathom.pl, words.pl, bigrams.pl and concordance.pl. (If you really wanted to, you could download the full text versions of Plato, Aristotle, and Shakespeare’s works used in this analysis.) While the programs themselves are really toys, the potential they represent is not. It would not be too difficult to integrate their functionality into a library “catalog”. Assume the existence of a significant amount of full text content in a library collection. Do a search against the collection. Create a subset of content. Click a few buttons to implement statistical analysis against the result. Enable the user to “browse” the content and follow a line of thought.
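The counting and concordance steps named above can be sketched in a few lines. What follows is only an illustration in Python, not the Perl modules themselves; the function names (tokenize, top_bigrams, concordance), the 30-character snippet radius, and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def top_bigrams(text, n=5):
    """Return the n most frequent two-word phrases as (phrase, count) pairs."""
    words = tokenize(text)
    pairs = zip(words, words[1:])
    return Counter(" ".join(pair) for pair in pairs).most_common(n)


def concordance(text, phrase, radius=30):
    """Return one snippet per occurrence of phrase, with radius characters of context."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    snippets = []
    for match in re.finditer(re.escape(phrase), flat, flags=re.IGNORECASE):
        start = max(match.start() - radius, 0)
        end = min(match.end() + radius, len(flat))
        snippets.append(flat[start:end])
    return snippets


# A tiny sample stitched together from two Shakespeare snippets (illustrative only).
sample = (
    "SHYLOCK Antonio is a good man. BASSANIO Have you heard any imputation "
    "to the contrary? A good man's fortune may grow out at heels."
)

for snippet in concordance(sample, "good man"):
    print(snippet)
print(top_bigrams(sample, n=3))
```

Run against the full text of an author's works, the same two functions would produce lists much like the ones shown in the previous section, although the exact snippet boundaries would depend on the radius chosen.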

The process outlined in the previous section is not intended to replace rigorous reading, but rather to supplement it. It enables a person to identify trends quickly and easily. It enables a person to read at “Web scale”. Again, find is not the problem to be solved. People can find more information than they require. Instead, people need to use and analyze the content they find. This content can be anything from novels to textbooks, scholarly journal articles to blog postings, data sets to collections of images, etc. The process outlined above is an example of services against texts, a way to “Save the time of the reader” and empower them to make better and more informed decisions. The fundamental processes of librarianship (collection, preservation, organization, and dissemination) need to be expanded to fit the current digital environment. The services described above are examples of how processes can be expanded.

The next “next generation library catalog” is not about find; it is about use. Integrating digital humanities computing techniques into library collections and services is just one example of how this can be done.

Measuring the Great Books

Eric Lease Morgan — Tue, 15 Jun 2010 16:48:56 +0000

This posting describes how I am assigning quantitative characteristics to texts in an effort to answer the question, “How ‘great’ are the Great Books?” In the end I make a plea for library science.

Background

With the advent of copious amounts of freely available plain text on the ‘Net comes the ability to “read” entire corpora with a computer and apply statistical processes against the result. In an effort to explore the feasibility of this idea, I am spending time answering the question, “How ‘great’ are the Great Books?”

More specifically, I want to assign quantitative characteristics to each of the “books” in the Great Books set, look for patterns in the result, and see whether or not I can draw any conclusions about the corpus. If such processes are proven effective, then the same processes may be applicable to other corpora such as collections of scholarly journal articles, blog postings, mailing list archives, etc. If I get this far, then I hope to integrate these processes into traditional library collections and services in an effort to support their continued relevancy.

On my mark. Get set. Go.

Assigning quantitative characteristics to texts

The Great Books set posits 102 “great ideas”: basic, foundational themes running through the heart of Western civilization. Each of the books in the set was selected for inclusion by the way it expressed the essence of these great ideas. The ideas are grand and ambiguous. They include words such as angel, art, beauty, courage, desire, eternity, god, government, honor, idea, physics, religion, science, space, time, wisdom, etc. (See Appendix B of “How ‘great’ are the Great Books?” for the complete list.)

In a previous posting, “Great Ideas Coefficient”, I outlined the measure I propose to use to determine the books’ “greatness”, essentially a sum of all TFIDF (term frequency / inverse document frequency) scores as calculated against the list of great ideas. TFIDF is defined as:

( c / t ) * log( d / f )

where:

c = number of times a given word appears in a document
t = total number of words in a document
d = total number of documents in a corpus
f = total number of documents containing a given word
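As a quick worked example, the formula can be written as a small Python function. The source does not say which logarithm base is used, so the natural logarithm is an assumption here, and the counts passed in are hypothetical numbers chosen only to show the arithmetic.

```python
import math


def tfidf(c, t, d, f):
    """Return ( c / t ) * log( d / f ), or 0.0 when the document is empty
    or the word appears in no document at all."""
    if t == 0 or f == 0:
        return 0.0
    return (c / t) * math.log(d / f)  # natural log: an assumption


# Hypothetical counts: a word appearing 25 times in a 10,000-word book,
# and found in 100 of the 223 documents in the corpus.
score = tfidf(c=25, t=10_000, d=223, f=100)
print(round(score, 6))
```

Note that a word found in every document (f equal to d) scores zero no matter how often it appears, because log( d / d ) is zero.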

Thus, the problem boils down to 1) determining the values for c, t, d, and f for a given great idea, 2) summing the resulting TFIDF scores, 3) saving the results, and 4) repeating the process for each book in the corpus. Here, more exactly, is how I am initially doing such a thing:

1. Build corpus – In a previous posting, “Collecting the Great Books”, I described how I first collected 223 of the roughly 250 Great Books.
2. Index corpus – Calculating the TFIDF values of c and t is trivial because any number of computer programs do such a thing quickly and readily. In our case, the value of d is a constant: 223. On the other hand, trivial methods for determining the number of documents containing a given word (f) are not scalable as the size of a corpus increases. Because an index is essentially a list of words combined with the pointers to where the words can be found, an index proves to be a useful tool for determining the value of f. Index a corpus. Search the index for a word. Get back the number of hits and use it as the value for f. Lucene is currently the gold standard when it comes to open source indexers. Solr, an enhanced and Web Services-based interface to Lucene, is the indexer used in this process. The structure of the local index is rudimentary: id, author, title, URL, and full text. Each of the metadata values is pulled out of a previously created index file, great-books.xml, while the full text is read from the file system. The whole lot is then stuffed into Solr. A program called index.pl does this work. Another program called search.pl was created simply for testing the validity of the index.
3. Count words and determine readability – A Perl module called Lingua::EN::Fathom does a nice job of counting the number of words in a file, thus providing me with a value for t. Along the way it also calculates a number of “readability” scores, values used to determine the education level a person needs to understand a given text. While I had “opened the patient” I figured it would be a good idea to take note of this information. Given the length of a book as well as its readability scores, I enable myself to answer questions such as, “Are longer books more difficult to read?” Later on, given my Great Ideas Coefficient, I will be able to answer questions such as “Is the length of a book a determining factor in ‘greatness’?” or “Are ‘great’ books more difficult to read?”
4. Calculate TFIDF – This is the fuzziest and most difficult part of the measurement process. Using Lingua::EN::Fathom again I find all of the unique words in a document, stem them with Lingua::Stem::Snowball, and calculate the number of times each stem occurs. This gives me a value for c. I then loop through each great idea, stem it, and search the index for the stem, thus returning a value for f. For each idea I now have values for c, t, d, and f, enabling me to calculate TFIDF: ( c / t ) * log( d / f ).
5. Calculate the Great Ideas Coefficient – This is trivial. Keep a running sum of all the great idea TFIDF scores.
6. Go to Step #4 – Repeat this process for each of the 102 great ideas.
7. Save – After all the various scores (number of words, readability scores, TFIDF scores, and Great Ideas Coefficient) have been calculated I save each to my pseudo database file called great-ideas.xml. Each is stored as an attribute associated with a book’s unique identifier. Later I will use the contents of this file as the basis of my statistical analysis.
8. Go to Step #3 – Repeat this process for each book in the corpus, in this case 223 times.

Of course I didn’t do all of this by hand, and the program I wrote to do the work is called measure.pl.
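The loop described above can be sketched in Python. This is not measure.pl, and it makes loud simplifications: a plain dictionary of strings stands in for the corpus, a pass over that dictionary stands in for the Solr search that supplies f, no stemming is done, and the natural logarithm is assumed. The function name great_ideas_coefficient, the three tiny documents, and the short idea list are all hypothetical.

```python
import math
import re
from collections import Counter


def words_of(text):
    """Lowercase a text and return its words (no stemming, unlike the real process)."""
    return re.findall(r"[a-z]+", text.lower())


def great_ideas_coefficient(corpus, ideas):
    """Map each document id to the sum of its TFIDF scores over the given ideas."""
    counts = {doc_id: Counter(words_of(text)) for doc_id, text in corpus.items()}
    d = len(corpus)  # total number of documents
    # f for each idea: how many documents contain it (stand-in for an index search)
    doc_freq = {
        idea: sum(1 for word_counts in counts.values() if word_counts[idea] > 0)
        for idea in ideas
    }
    coefficients = {}
    for doc_id, word_counts in counts.items():
        t = sum(word_counts.values())  # total words in this document
        total = 0.0
        for idea in ideas:
            c, f = word_counts[idea], doc_freq[idea]
            if t and f:
                total += (c / t) * math.log(d / f)
        coefficients[doc_id] = total  # running sum of TFIDF scores
    return coefficients


# Three hypothetical one-line "books" and a short list of ideas.
corpus = {
    "a": "love is blind and love is a devil",
    "b": "man is a political animal",
    "c": "the measure of all things",
}
ideas = ["love", "man", "animal"]
result = great_ideas_coefficient(corpus, ideas)
for doc_id, coefficient in sorted(result.items()):
    print(doc_id, round(coefficient, 4))
```

Counting document frequencies in memory like this is exactly the "trivial method" that does not scale, which is why an index is used for the real corpus.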

The result is my pseudo database file, great-books.xml. This is my data set. It keeps track of all my information in a human-readable, application- and operating system-independent manner. Very nice. If there is only one file you download from this blog posting, then it should be this file. Using it you will be able to create your own corpus and do your own analysis.

The process outlined above is far from perfect. First, there are a few false negatives. For example, the great idea “universe” returned a TFIDF value of zero (0) for every document. Obviously this is incorrect, and I think the error has something to do with the stemming and/or indexing subprocesses. Second, the word “being”, as calculated by TFIDF, is far and away the “greatest” idea. I believe this is true because the word “being” is… being counted as both a noun and a verb. This points to a different problem: the ambiguity of the English language. While I know all of these issues will skew the final results, I do not think they negate the possibility of meaningful statistical investigation. At the same time it will be necessary to refine the measurement process to reduce the number of “errors”.

Measurement, the humanities, and library science

Measurement is one of the fundamental qualities of science. The work of Archimedes is the prototypical example. Kepler and Galileo took the process to another level. Newton brought it to full flower. Since Newton the use of measurement (the assignment of mathematical values) applied against observations of the natural world and human interactions has given rise to the physical and social sciences. Unlike studies in the humanities, science is repeatable and independently verifiable. It is objective. Such is not a value judgment, merely a statement of fact. While the sciences seem cold, hard, and dry, the humanities are subjective, appeal to our spirit, give us a sense of purpose, and tend to synthesize our experiences into a meaningful whole. Both the scientific and humanistic thinking processes are necessary for us to make sense of the world around us. I call these combined processes “arscience”.

The library profession could benefit from the greater application of measurement. In my opinion, too many of the profession’s day-to-day as well as strategic decisions are based on anecdotal evidence and gut feelings. Instead of basing our actions on data, actions are based on tradition. “This is the way we have always done it.” This is medieval, and consequently, change comes very slowly. I sincerely believe libraries are not going away any time soon, but I do think the profession would remain relevant longer if librarians were to do two things: 1) truly exploit the use of computers, and 2) base a greater number of their decisions on data (measurement) as opposed to opinion. Let’s call this library science.

Collecting the Great Books

Eric Lease Morgan — Sun, 13 Jun 2010 23:17:11 +0000

In an effort to answer the question, “How ‘great’ are the Great Books?”, I need to mirror the full texts of the Great Books. This posting describes the initial process I am using to do such a thing, but the important thing to note is that this process is more about librarianship than it is about software.

Background

The Great Books is/was a 60-volume set of content intended to further a person’s liberal arts education. About 250 “books” in all, it consists of works by Homer, Aristotle, Augustine, Chaucer, Cervantes, Locke, Gibbon, Goethe, Marx, James, Freud, etc. There are a few places on the ‘Net where the complete list of authors/titles can be read. One such place is a previous blog posting of mine. My goal is to use digital humanities computing techniques to statistically describe the works and use these descriptions to supplement a person’s understanding of the texts. I then hope to apply these same techniques to other corpora. To accomplish this goal I first need to acquire full text versions of the Great Books. This posting describes how I am initially going about it.

Mirroring and caching the Great Books

All of the books of the Great Books were written by “old dead white men”. It is safe to assume the texts have been translated into a myriad of languages, including English, and it is safe to assume the majority exist in the public domain. Moreover, with the advent of the Web and various digitizing projects, it is safe to assume quality information gets copied forward and will be available for downloading. All of this has proven to be true. Through the use of Google and a relatively small number of repositories (Project Gutenberg, Alex Catalogue of Electronic Texts, Internet Classics Archive, Christian Classics Ethereal Library, Internet Archive, etc.), I have been able to locate and mirror 223 of the roughly 250 Great Books. Here’s how:

1. Bookmark texts – Trawl the Web for the Great Books and use Delicious to bookmark links to plain text versions translated into English. Firefox combined with the Delicious extension has proven to be very helpful in this regard. My bookmarks should be located at http://delicious.com/ericmorgan/gb.
2. Save and edit bookmarks file – Delicious gives you the option to save your bookmarks file locally. The result is a bogus HTML file intended to be imported into Web browsers. It contains the metadata used to describe your bookmarks such as title, notes, and URLs. After exporting my bookmarks to the local file system, I contorted the bogus HTML into rudimentary XML so I could systematically read it for subsequent processing.
3. Extract URLs – Using a 7-line program called bookmarks2urls.pl, I loop through the edited bookmarks file and output all the URLs.
4. Mirror content – Because I want/need to retain a pristine version of the original texts, I feed the URLs to wget and copy the texts to a local directory. This use of wget is combined with the output of Step #3 through a brain-dead shell script called mirror.sh.
5. Create corpus – The mirrored files are poorly named; using just the mirror it is difficult to know what “great book” hides inside files named annals.mb.txt, pg2600.txt, or whatever. Moreover, no metadata is associated with the collection. Consequently I wrote a program, build-corpus.pl, that loops through my edited bookmarks file, extracts the necessary metadata (author, title, and URL), downloads the remote texts, saves them locally with a human-readable filename, creates a rudimentary XHTML page listing each title, and creates an XML file containing all of the metadata generated to date.
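The URL extraction and file naming in the steps above might look something like the following Python sketch. It is not bookmarks2urls.pl or build-corpus.pl. The structure of the edited bookmarks file is not given in this posting, so the XML layout below (a bookmark element with author, title, and url attributes), the example.org URLs, the author and title values, and the naming scheme are all hypothetical. The downloading itself is left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the edited bookmarks file.
BOOKMARKS = """<bookmarks>
  <bookmark author="Tolstoy" title="War and Peace" url="http://example.org/pg2600.txt"/>
  <bookmark author="Tacitus" title="The Annals" url="http://example.org/annals.mb.txt"/>
</bookmarks>"""


def slug(value):
    """Reduce a name to lowercase letters and digits separated by hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def read_bookmarks(xml_text):
    """Return one record per bookmark: author, title, URL, and a readable filename."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.iter("bookmark"):
        author, title, url = node.get("author"), node.get("title"), node.get("url")
        records.append({
            "author": author,
            "title": title,
            "url": url,
            "filename": f"{slug(author)}-{slug(title)}.txt",
        })
    return records


records = read_bookmarks(BOOKMARKS)
for record in records:
    print(record["url"], "->", record["filename"])
```

Keeping the author, title, URL, and local filename together in one record is what makes it possible to write the metadata file at the end of the process.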

The results of this 5-step process include:

The most important file, by far, is the metadata file. It is intended to be a sort of application- and operating system-independent database. Given this file, anybody ought to be able to duplicate the analysis I propose to do later. If there is only one file you download from this blog posting, it should be the metadata file — great-books.xml.

The collection process is not perfect. I was unable to find many of the works of Archimedes, Copernicus, Kepler, Newton, Galileo, or Freud. For all but Freud, I attribute this to the lack of translations, but I suppose I could stoop to the use of poorly OCR’ed texts from Google Books. I attribute the unavailability of Freud to copyright issues. There’s no getting around that one. A few times I located HTML versions of desired texts, but HTML will ultimately skew my analysis. Consequently I used a terminal-based program called lynx to convert and locally save the remote HTML to a plain text file. I then included that file in my corpus. Alas, there are always ways to refine collections. Like software, they are never done.

Summary — Collection development, acquisitions, and cataloging

The process outlined above is really about librarianship and not software. Specifically, it is about collection development, acquisitions, and cataloging. I first needed to articulate a development policy. While this posting did not explicitly describe the policy, it did outline why I wanted to create the collection as well as a few of each item’s necessary qualities. The process above implemented a way to actually get the content: acquisitions. Finally, I described (“cataloged”) my content, albeit in a very rudimentary form.

It is an understatement to say the Internet has changed the way data, information, and knowledge are collected, preserved, organized, and disseminated. By extension, librarianship needs to change in order to remain relevant with the times. Our profession spends much of its time trying to refine old processes. It is like trying to figure out how to improve the workings of a radio when people have moved on to the use of televisions instead. While traditional library processes are still important, they are not as important as they used to be.

The processes outlined above illustrate one possible way librarianship can change the how’s of its work while retaining its what’s.

Inaugural Code4Lib “Midwest” Regional Meeting

Eric Lease Morgan — Sat, 12 Jun 2010 20:17:46 +0000

I believe the Inaugural Code4Lib “Midwest” Regional Meeting (June 11 & 12, 2010 at the University of Notre Dame) was a qualified success.

About twenty-six people attended. (At least that was the number of people who went to lunch.) They came from Michigan, Ohio, Iowa, Indiana, and Illinois. Julia Bauder won the prize for traveling the farthest, from Grinnell, Iowa.

Day #1

We began with Lightning Talks:

ePub files by Michael Kreyche
FRBR and MARC data by Kelley McGrath
Great Books by myself
jQuery and the OPAC by Ken Irwin
Notre Dame and the Big Ten by Michael Witt
Solr & Drupal by Rob Casson
Subject headings via a Web Service by Michael Kreyche
Taverna by Rick Johnson and Banu Lakshminarayanan
VUFind on a hard disk by Julia Bauder

We dined in the University’s South Dining Hall, and toured a bit of the campus on the way back taking in the “giant marble”, the Architecture Library, and the Dome.

In the afternoon we broke up into smaller groups and discussed things including institutional repositories, mobile devices & interfaces, ePub files, and FRBR. In the evening we enjoyed varieties of North Carolina barbecue, and then retreated to the campus bar (Legend’s) for a few beers.

I’m sorry to say the Code4Lib Challenge was not successful. We hackers were either too engrossed to notice whether or not anybody came to the event, or nobody showed up to challenge us. Maybe next time.

Day #2

There were fewer participants on Day #2. We spent the time listening to Ken elaborate on the uses and benefits of jQuery. I hacked at something I’m calling “The Great Books Survey”.

The event was successful in that it provided plenty of opportunity to discuss shared problems and solutions. Personally, I learned I need to explore statistical correlations, regressions, multivariate analysis, and principal component analysis to a greater degree.

A good time was had by all, and it is quite possible the next “Midwest” Regional Meeting will be hosted by the good folks in Chicago.

For more detail about Code4Lib “Midwest”, see the wiki: http://wiki.code4lib.org/index.php/Midwest.

How “great” are the Great Books?

Eric Lease Morgan — Fri, 11 Jun 2010 01:08:17 +0000

In 1952 a set of books called the Great Books of the Western World was published. It was supposed to represent the best of Western literature and enable the reader to further their liberal arts education. Sixty volumes in all, it included works by Plato, Aristotle, Shakespeare, Milton, Galileo, Kepler, Melville, Darwin, etc. (See Appendix A.) These great books were selected based on the way they discussed a set of 102 “great ideas” such as art, astronomy, beauty, evil, evolution, mind, nature, poetry, revolution, science, will, wisdom, etc. (See Appendix B.) How “great” are these books, and how “great” are the ideas expressed in them?

Given full text versions of these books it would be almost trivial to use the “great ideas” as input and apply relevancy ranking algorithms against the texts thus creating a sort of score — a “Great Ideas Coefficient”. Term Frequency/Inverse Document Frequency is a well-established algorithm for computing just this sort of thing:

relevancy = ( c / t ) * log( d / f )

where:

c = number of times a given word appears in a document
t = total number of words in a document
d = total number of documents in a corpus
f = total number of documents containing a given word

Thus, to calculate our Great Ideas Coefficient we would sum the relevancy score for each “great idea” for each “great book”. Plato’s Republic might have a cumulative score of 525 while Aristotle’s On The History Of Animals might have a cumulative score of 251. Books with a larger Coefficient could be considered greater. Given such a score a person could measure a book’s “greatness”. We could then compare the score to the scores of other books. Which book is the “greatest”? We could compare the score to other measurable things such as a book’s length or date to see if there were correlations. Are “great books” longer or shorter than others? Do longer books contain more “great ideas”? Are there other books that were not included in the set that maybe should have been included? Instead of summing each relevancy score, maybe the “great ideas” can be grouped into gross categories such as humanities or sciences, and we can sum those scores instead. Thus we may be able to say one set of books is “great” when it comes to expressing the human condition and these others are better at describing the natural world. We could ask ourselves which books represent the best mixture of art and science because their humanities scores are almost equal to their sciences scores. Expanding the scope beyond general education we could create an alternative set of “great ideas”, say for biology or mathematics or literature, and apply the same techniques to other content such as full text scholarly journal literature.
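The idea of grouping the “great ideas” into gross categories can be sketched as follows. Everything specific in this Python example is an assumption made for illustration: the assignment of ideas to “humanities” and “sciences”, the per-idea scores for one imaginary book, and the balance measure (the smaller category total divided by the larger).

```python
def category_scores(idea_scores, categories):
    """Sum per-idea relevancy scores into one total per category."""
    totals = {name: 0.0 for name in categories}
    for name, ideas in categories.items():
        for idea in ideas:
            totals[name] += idea_scores.get(idea, 0.0)
    return totals


def balance(totals, first="humanities", second="sciences"):
    """Return the smaller total divided by the larger (1.0 means evenly mixed)."""
    low, high = sorted((totals[first], totals[second]))
    return low / high if high else 0.0


# Hypothetical grouping of a few of the ideas listed in Appendix B.
categories = {
    "humanities": ["art", "beauty", "poetry"],
    "sciences": ["astronomy", "mathematics", "physics"],
}

# Hypothetical relevancy scores for a single book.
idea_scores = {"art": 0.5, "beauty": 0.25, "astronomy": 0.25, "physics": 0.125}

totals = category_scores(idea_scores, categories)
print(totals, balance(totals))
```

A book whose balance is close to 1.0 would be a candidate for the “best mixture of art and science”; how the ideas ought to be divided between categories is a judgment call this sketch does not settle.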

The initial goal of this study is to examine the “greatness” of the Great Books, but the ultimate goal is to learn whether or not this quantitative process can be applied to other bodies of literature and ultimately assist the student/scholar in their studies/research.

Wish me luck.

Appendix A – Authors and titles in the Great Books series

Aeschylus – Prometheus Bound; Seven Against Thebes; The Oresteia; The Persians; The Suppliant Maidens
American State Papers – Articles of Confederation; Declaration of Independence; The Constitution of the United States of America
Apollonius – On Conic Sections
Aquinas – Summa Theologica
Archimedes – Book of Lemmas; Measurement of a Circle; On Conoids and Spheroids; On Floating Bodies; On Spirals; On the Equilibrium of Planes; On the Sphere and Cylinder; The Method Treating of Mechanical Problems; The Quadrature of the Parabola; The Sand-Reckoner
Aristophanes – Ecclesiazousae; Lysistrata; Peace; Plutus; The Acharnians; The Birds; The Clouds; The Frogs; The Knights; The Wasps; Thesmophoriazusae
Aristotle – Categories; History of Animals; Metaphysics; Meteorology; Minor biological works; Nicomachean Ethics; On Generation and Corruption; On Interpretation; On Sophistical Refutations; On the Gait of Animals; On the Generation of Animals; On the Motion of Animals; On the Parts of Animals; On the Soul; Physics; Poetics; Politics; Posterior Analytics; Prior Analytics; Rhetoric; The Athenian Constitution; Topics
Augustine – On Christian Doctrine; The City of God; The Confessions
Aurelius – The Meditations
Bacon – Advancement of Learning; New Atlantis; Novum Organum
Berkeley – The Principles of Human Knowledge
Boswell – The Life of Samuel Johnson, LL.D.
Cervantes – The History of Don Quixote de la Mancha
Chaucer – Troilus and Criseyde; The Canterbury Tales
Copernicus – On the Revolutions of Heavenly Spheres
Dante – The Divine Comedy
Darwin – The Descent of Man and Selection in Relation to Sex; The Origin of Species by Means of Natural Selection
Descartes – Discourse on the Method; Meditations on First Philosophy; Objections Against the Meditations and Replies; Rules for the Direction of the Mind; The Geometry
Dostoevsky – The Brothers Karamazov
Epictetus – The Discourses
Euclid – The Thirteen Books of Euclid’s Elements
Euripides – Alcestis; Andromache; Bacchantes; Cyclops; Electra; Hecuba; Helen; Heracleidae; Heracles Mad; Hippolytus; Ion; Iphigeneia at Aulis; Iphigeneia in Tauris; Medea; Orestes; Phoenician Women; Rhesus; The Suppliants; Trojan Women
Faraday – Experimental Researches in Electricity
Fielding – The History of Tom Jones, a Foundling
Fourier – Analytical Theory of Heat
Freud – A General Introduction to Psycho-Analysis; Beyond the Pleasure Principle; Civilization and Its Discontents; Group Psychology and the Analysis of the Ego; Inhibitions, Symptoms, and Anxiety; Instincts and Their Vicissitudes; New Introductory Lectures on Psycho-Analysis; Observations on “Wild” Psycho-Analysis; On Narcissism; Repression; Selected Papers on Hysteria; The Ego and the Id; The Future Prospects of Psycho-Analytic Therapy; The Interpretation of Dreams; The Origin and Development of Psycho-Analysis; The Sexual Enlightenment of Children; The Unconscious; Thoughts for the Times on War and Death
Galen – On the Natural Faculties
Galileo – Dialogues Concerning the Two New Sciences
Gibbon – The Decline and Fall of the Roman Empire
Gilbert – On the Loadstone and Magnetic Bodies
Goethe – Faust
Hamilton – The Federalist
Harvey – On the Circulation of Blood; On the Generation of Animals; On the Motion of the Heart and Blood in Animals
Hegel – The Philosophy of History; The Philosophy of Right
Herodotus – The History
Hippocrates – Works
Hobbes – Leviathan
Homer – The Iliad; The Odyssey
Hume – An Enquiry Concerning Human Understanding
James – The Principles of Psychology
Kant – Excerpts from The Metaphysics of Morals; Fundamental Principles of the Metaphysic of Morals; General Introduction to the Metaphysic of Morals; Preface and Introduction to the Metaphysical Elements of Ethics with a note on Conscience; The Critique of Judgement; The Critique of Practical Reason; The Critique of Pure Reason; The Science of Right
Kepler – Epitome of Copernican Astronomy; The Harmonies of the World
Lavoisier – Elements of Chemistry
Locke – A Letter Concerning Toleration; An Essay Concerning Human Understanding; Concerning Civil Government, Second Essay
Lucretius – On the Nature of Things
Machiavelli – The Prince
Marx – Capital
Marx and Engels – Manifesto of the Communist Party
Melville – Moby Dick; or, The Whale
Mill – Considerations on Representative Government; On Liberty; Utilitarianism
Milton – Areopagitica; English Minor Poems; Paradise Lost; Samson Agonistes
Montaigne – Essays
Montesquieu – The Spirit of the Laws
Newton – Mathematical Principles of Natural Philosophy; Optics
Huygens – Treatise on Light
Nicomachus – Introduction to Arithmetic
Pascal – Pensées; Scientific and mathematical essays; The Provincial Letters
Plato – Apology; Charmides; Cratylus; Critias; Crito; Euthydemus; Euthyphro; Gorgias; Ion; Laches; Laws; Lysis; Meno; Parmenides; Phaedo; Phaedrus; Philebus; Protagoras; Sophist; Statesman; Symposium; The Republic; The Seventh Letter; Theaetetus; Timaeus
Plotinus – The Six Enneads
Plutarch – The Lives of the Noble Grecians and Romans
Ptolemy – The Almagest
Rabelais – Gargantua and Pantagruel
Rousseau – A Discourse on Political Economy; A Discourse on the Origin of Inequality; The Social Contract
Shakespeare – A Midsummer-Night’s Dream; All’s Well That Ends Well; Antony and Cleopatra; As You Like It; Coriolanus; Cymbeline; Julius Caesar; King Lear; Love’s Labour’s Lost; Macbeth; Measure For Measure; Much Ado About Nothing; Othello, the Moor of Venice; Pericles, Prince of Tyre; Romeo and Juliet; Sonnets; The Comedy of Errors; The Famous History of the Life of King Henry the Eighth; The First Part of King Henry the Fourth; The First Part of King Henry the Sixth; The Life and Death of King John; The Life of King Henry the Fifth; The Merchant of Venice; The Merry Wives of Windsor; The Second Part of King Henry the Fourth; The Second Part of King Henry the Sixth; The Taming of the Shrew; The Tempest; The Third Part of King Henry the Sixth; The Tragedy of Hamlet, Prince of Denmark; The Tragedy of King Richard the Second; The Tragedy of Richard the Third; The Two Gentlemen of Verona; The Winter’s Tale; Timon of Athens; Titus Andronicus; Troilus and Cressida; Twelfth Night; or, What You Will
Smith – An Inquiry into the Nature and Causes of the Wealth of Nations
Sophocles – Ajax; Electra; Philoctetes; The Oedipus Cycle; The Trachiniae
Spinoza – Ethics
Sterne – The Life and Opinions of Tristram Shandy, Gentleman
Swift – Gulliver’s Travels
Tacitus – The Annals; The Histories
Thucydides – The History of the Peloponnesian War
Tolstoy – War and Peace
Virgil – The Aeneid; The Eclogues; The Georgics

Appendix B – The “great” ideas

angel • animal • aristocracy • art • astronomy • beauty • being • cause • chance • change • citizen • constitution • courage • custom & convention • definition • democracy • desire • dialectic • duty • education • element • emotion • eternity • evolution • experience • family • fate • form • god • good & evil • government • habit • happiness • history • honor • hypothesis • idea • immortality • induction • infinity • judgment • justice • knowledge • labor • language • law • liberty • life & death • logic • love • man • mathematics • matter • mechanics • medicine • memory & imagination • metaphysics • mind • monarchy • nature • necessity & contingency • oligarchy • one & many • opinion • opposition • philosophy • physics • pleasure & pain • poetry • principle • progress • prophecy • prudence • punishment • quality • quantity • reasoning • relation • religion • revolution • rhetoric • same & other • science • sense • sign & symbol • sin • slavery • soul • space • state • temperance • theology • time • truth • tyranny • universal & particular • virtue & vice • war & peace • wealth • will • wisdom • world

Not really reading

Eric Lease Morgan — Thu, 10 Jun 2010 03:35:16 +0000

Using a number of rudimentary digital humanities computing techniques, I tried to practice what I preach and extract the essence from a set of journal articles. I feel like the process met with some success, but I was not really reading.

The problem

A set of twenty-one (21) essays on the future of academic librarianship was recently brought to my attention:

Leaders Look Toward the Future – This site compiled by Camila A. Alire and G. Edward Evans offers 21 essays on the future of academic librarianship written by individuals who represent a cross-section of the field from the largest institutions to specialized libraries.

Since I was too lazy to print and read all of the articles mentioned above, I used this as an opportunity to test out some of my “services against text” ideas.

The solution

Specifically, I used a few rudimentary digital humanities computing techniques to glean highlights from the corpus. Here’s how:

First, I converted all of the PDF files to plain text files using a program called pdftotext, a part of xpdf. I then concatenated the whole lot together, thus creating my corpus. This process is left up to you, the reader, as an exercise because I don’t have copyright chutzpah.
Next, I used Wordle to create a word cloud. Not a whole lot of new news here, but look how big the word “information” is compared to the word “collections”.
Using a program of my own design, I then created a textual version of the word cloud listing the top fifty most frequently used words and the number of times they appeared in the corpus. Again, not a whole lot of new news. The articles are obviously about academic libraries, but notice how the word “electronic” is listed and not the word “book”.
Things got interesting when I created a list of the most significant two-word phrases (bi-grams). Most of the things are nouns, but I was struck by “will continue” and “libraries will” so I applied a concordance application to these phrases and got lists of snippets. Some of the more interesting ones include: libraries will be “under the gun” financially, libraries will be successful only if they adapt, libraries will continue to be strapped for staffing, libraries will continue to have a role to play, will continue their major role in helping, will continue to be important, will continue to shift toward digital information, will continue to seek new opportunities.
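The programs used for these steps are not included in this posting, so what follows is only a minimal sketch of the three techniques: counting words, counting two-word phrases, and building a concordance. The sample text, the subroutine names (top, concordance), and the snippet width are all assumptions made for illustration. Note that this sketch ranks bi-grams by raw frequency, which is cruder than measuring their "significance", and it lets bi-grams run across sentence boundaries.

```perl
use strict;
use warnings;

# list the N most frequent keys of a hash of counts (hypothetical helper)
sub top {
    my ( $counts, $n ) = @_;
    my @keys = sort { $counts->{ $b } <=> $counts->{ $a } || $a cmp $b } keys %$counts;
    $n = @keys if $n > @keys;
    return @keys[ 0 .. $n - 1 ];
}

# concordance: return a snippet of text around each occurrence of a phrase
sub concordance {
    my ( $text, $phrase, $width ) = @_;
    my @snippets;
    while ( $text =~ /\Q$phrase\E/gi ) {
        my $start = $-[ 0 ] - $width;
        $start = 0 if $start < 0;
        push @snippets, substr( $text, $start, ( $+[ 0 ] - $start ) + $width );
    }
    return @snippets;
}

# hypothetical sample text; in practice, read the concatenated corpus from a file
my $corpus = 'Libraries will continue to adapt. Academic libraries will '
           . 'continue to have a role to play, and libraries will change.';

# tokenize: lower-case the text and keep runs of letters
my @words = lc( $corpus ) =~ /[a-z]+/g;

# count single words and adjacent pairs of words (bi-grams)
my ( %unigrams, %bigrams );
$unigrams{ $_ }++ for @words;
$bigrams{ join ' ', @words[ $_, $_ + 1 ] }++ for 0 .. $#words - 1;

# textual "word cloud", most frequent bi-grams, and snippets for one phrase
printf "%-20s %d\n", $_, $unigrams{ $_ } for top( \%unigrams, 5 );
printf "%-20s %d\n", $_, $bigrams{ $_ }  for top( \%bigrams, 5 );
print "$_\n" for concordance( $corpus, 'libraries will', 15 );
```

A real run would also want a stop word list; otherwise words like "the" and "of" dominate the counts.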

Yes, there may very well be some subtle facts I missed by not reading the full texts, but I think I got a sense of what the articles discussed. It would be interesting to sit a number of people down, have them read the articles, and then have them list out a few salient sentences. To what degree would their result be the same or different from mine?

I was able to write the programs from scratch, do the analysis, and write the post in about two hours, total. It would have taken me that long to read the articles. Just think what a number of librarians could do, and how much time could be saved if this system were expanded to support just about any plain text data.

Cyberinfrastructure Days at the University of Notre Dame

Eric Lease Morgan — Sun, 23 May 2010 14:46:08 +0000

On Thursday and Friday, April 29 and 30, 2010 I attended a Cyberinfrastructure Days event at the University of Notre Dame. Through this process my personal definition of “cyberinfrastructure” was updated, and my basic understanding of “digital humanities computing” was confirmed. This posting documents the experience.

Day #1 – Thursday, April 29

The first day was devoted to cyberinfrastructure and the humanities.

After all of the necessary introductory remarks, John Unsworth (University of Illinois – Urbana-Champaign) gave the opening keynote presentation entitled “Reading at library scale: New methods, attention, prosthetics, evidence, and argument”. In his talk he posited the impossibility of reading everything currently available. There is just too much content. Given some of the computing techniques at our disposal, he advocated additional ways to “read” material, but cautioned the audience that there needs to be: 1) an attention to prosthetics, 2) an appreciation for evidence and statistical significance, and 3) a sense of argument so the skeptic may be able to test the method. To me this sounded a whole lot like applying scientific methods to the process of literary criticism. Unsworth briefly described MONK and elaborated how part-of-speech tagging had been done against the corpus. He also described how Dunning’s Log-Likelihood statistic can be applied to texts in order to determine what a person does (and doesn’t) include in their writings.
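To make the log-likelihood idea a bit more concrete, here is a minimal sketch. It assumes the simplified two-term form of the statistic that is often used to compare a word's frequency in two corpora; it is not necessarily the computation Unsworth or MONK used. The subroutine name and the counts are hypothetical.

```perl
use strict;
use warnings;

# log-likelihood (G2) for a word seen $o1 times in a corpus of $n1 words
# and $o2 times in a second corpus of $n2 words; assumes $n1 + $n2 > 0
sub log_likelihood {
    my ( $o1, $o2, $n1, $n2 ) = @_;

    # expected counts if the word were used at the same rate in both corpora
    my $e1 = $n1 * ( $o1 + $o2 ) / ( $n1 + $n2 );
    my $e2 = $n2 * ( $o1 + $o2 ) / ( $n1 + $n2 );

    # sum observed * ln( observed / expected ), skipping zero counts
    my $sum = 0;
    $sum += $o1 * log( $o1 / $e1 ) if $o1 > 0;
    $sum += $o2 * log( $o2 / $e2 ) if $o2 > 0;
    return 2 * $sum;
}

# hypothetical counts: a word used 50 times by one author and 5 times by another
printf "G2 = %.2f\n", log_likelihood( 50, 5, 1000, 1000 );
```

The larger the score, the less likely the difference in usage is due to chance. A score of zero means the word is used at the same rate in both sets of writings.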

Stéfan Sinclair (McMaster University) followed with “Challenges and opportunities of Web-based analytic tools for the humanities”. He gave a brief history of the digital humanities in terms of computing. Mainframes and concordances. Personal computers and even more concordances. Webbed interfaces and locally hosted texts. He described digital humanities as something that has evolved in cycles since at least 1967. He argued that the new tools will be Web apps, things that can be embedded into Web pages and used against just about any text. His Voyeur Tools were an example. Like Unsworth, he advocated the use of digital humanities computing techniques because they can supplement the analysis of texts. “These tools allow you to see things that are not evident.” Sinclair will be presenting a tutorial at the annual digital humanities conference this July. I hope to attend.

In a bit of a change of pace, Russ Hobby (Internet2) elaborated on the nuts & bolts of cyberinfrastructure in “Cyberinfrastructure components and use”. In this presentation I learned that many scientists are interested in the… science, and they don’t really care about the technology supporting it. They have an instrument in the field. It is collecting and generating data. They want to analyze that data. They are not so interested in how it gets transported from one place to another, how it is stored, or in what format. As I knew, they are interested in looking for patterns in the data in order to describe and predict events in the natural world. “Cyberinfrastructure is like a car. ‘Car, take me there.’” Cyberinfrastructure is about controls, security systems, storage sets, computation, visualization, support & training, collaboration tools, publishing, communication, finding, networking, etc. “We are not there to answer the question, but more to ask them.”

In the afternoon I listened to Richard Whaling (University of Chicago) present on “Humanities computing at scale”. Given from the point of view of a computer scientist, this presentation was akin to Hobby’s. On one hand there are people who do analysis, and on the other there are people who create the analysis tools. Whaling is more like the latter. I thought his discussion on the format of texts was most interesting. “XML is good for various types of rendering, but not necessarily so good for analysis. XML does not necessarily go deep enough with the encoding because the encoding is too expensive; XML is not scalable. Nor is SQL. Indexing is the way to go.” This perspective jibes with my own experience. Encoding texts in XML (TEI) is so very tedious, and the tools to do any analysis against the result are few and far between. Creating the perfect relational database (SQL) is like seeking the Holy Grail, and SQL is not designed to do full text searching nor “relevancy ranking”. Indexing texts and doing retrieval against the result has proven to be much more fruitful for me, but such an approach is an example of “Bag of Words” computing, and thus words (concepts) often get placed out of context. Despite that, I think the indexing approach holds the most promise. Check out Perseus under Philologic and Digital South Asia Library to see some of Whaling’s handiwork.

Chris Clarke (University of Notre Dame), in “Technology horizons for teaching and learning”, enumerated ways the University of Notre Dame is putting into practice many of the things described in the most recent Horizon Report. Examples included the use of ebooks, augmented reality, gesture-based computing, and visual data analysis. I thought the presentation was a great way to bring the forward-thinking report down to Earth and place it into a local context. Very nice.

William Donaruma (also from the University of Notre Dame) described the process he was going through to create 3-D movies in a presentation called “Choreography in a virtual space”. Multiple (very expensive) cameras. Dry ice. Specific positioning of the dancers. Special glasses. All of these things played into the creation of an illusion of three dimensions in a two-dimensional space. I will not call it three-dimensional until I can walk around the object in question. The definition of three-dimensional needs to be qualified.

The final presentation of the day took place after dinner. The talk, “The Transformation of modern science” was given virtually by Edward Seidel (National Science Foundation). Articulate. Systematic. Thorough. Insightful. These are the sorts of words I use to describe Seidel’s talk. Presented remotely through a desktop camera and displayed on a screen to the audience, we were given a history of science and a description of how it has changed from single-man operations to large-group collaborations. We were shown the volume of information created previously and compared it to the volume of information generated now. All of this led up to the most salient message — “All future National Science Foundation grant proposals must include a data curation plan.” Seidel mentioned libraries, librarians, and librarianship quite a number of times during the talk. Naturally my ears perked up. My profession is about the collection, preservation, organization, and dissemination of data, information, and knowledge. The type of content to which these processes are applied — books, journal articles, multi-media recordings, etc — is irrelevant. Given a collection policy, it can all be important. The data generated by scientists and their machines is no exception. Is our profession up to the challenge, or are we too much wedded to printed, bibliographic materials? It is time for librarians to aggressively step up to the plate, or else. Here is an opportunity being laid at our feet. Let’s pick it up!

Day #2 – Friday, April 30

The second day centered more around the sciences as opposed to the humanities.

The day began with a presentation by Tony Hey (Microsoft Research) called “The Fourth Paradigm: Data-intensive scientific discovery”. Hey described cyberinfrastructure as the new name for e-science. He then echoed much of the content of Seidel’s message from the previous evening and described the evolution of science in a set of paradigms: 1) theoretical, 2) experimental, 3) computational, and 4) data-intensive. He elaborated on the infrastructure components necessary for data-intensive science: 1) acquisition, 2) collaboration & visualization, 3) analysis & mining, 4) dissemination & sharing, 5) archiving & preservation. (Gosh, that sounds a whole lot like my definition of librarianship!) He saw Microsoft’s role as one of providing the necessary tools to facilitate e-science (or cyberinfrastructure) and thus the Fourth Paradigm. Hey’s presentation sounded a lot like open access advocacy. More Association of Research Libraries library directors as well as university administrators need to hear what he has to say.

Boleslaw Szymanski (Rensselaer Polytechnic Institute) described how better science could be done in a presentation called “Robust asynchronous optimization for volunteer computing grids”. As Hobby and Whaling did (above), Szymanski separated the work of the scientist from the work of cyberinfrastructure. “Scientists do not want to be bothered with the computer science of their work.” He then went on to describe a distributed computing technique for studying the galaxy, MilkyWay@home. He advocated cloud computing as a form of asynchronous computing.

The third presentation of the day was entitled “Cyberinfrastructure for small and medium laboratories” by Ian Foster (University of Chicago). The heart of this presentation was advocacy for software as a service (SaaS) computing for scientific laboratories.

Ashok Srivastava (NASA) was the first up in the second session with “Using Web 2.0 and collaborative tools at NASA”. He spoke to one of the basic principles of good science when he said, “Reproducibility is a key aspect of science, and with access to the data this reproducibility is possible.” I’m not quite sure my fellow librarians and humanists understand the importance of such a statement. Unlike work in the humanities, which is often built on subjective and intuitive interpretation, good science relies on the ability of many to come to the same conclusion based on the same evidence. Open access data makes such a thing possible. Much more of Srivastava’s presentation was about DASHlink, “a virtual laboratory for scientists and engineers to disseminate results and collaborate on research problems in health management technologies for aeronautics systems.”

“Scientific workflows and bioinformatics applications” by Ewa Deelman (University of Southern California) was up next. She echoed many of the things I heard from library pundits a few years ago when it came to institutional repositories. In short, “Workflows are what are needed in order for e-science to really work… Instead of moving the data to the computation, you have to move the computation to the data.” This is akin to two ideas. First, like Hey’s idea of providing tools to facilitate cyberinfrastructure, Deelman advocates integrating the cyberinfrastructure tools into the work of scientists. Second, e-science is more than mere infrastructure. It also approaches the “services against text” idea which I have been advocating for a few years.

Jeffrey Layton (Dell, Inc.) rounded out the session with a presentation called “I/O pattern characterization of HPC applications”. In it he described how he used the output of strace commands (which can be quite voluminous) to evaluate storage input/output patterns. “Storage is cheap, but it is only one of a bigger set of problems in the system.”

By this time I was full, my iPad had arrived in the mail, and I went home.

Observations

It just so happens I was given the responsibility of inviting a number of the humanists to the event, specifically: John Unsworth, Stéfan Sinclair, and Richard Whaling. That was an honor, and I appreciate the opportunity. “Thank you.”

I learned a number of things, and a few other things were reinforced. First, the word “cyberinfrastructure” is the newly minted term for “e-science”. Many of the presenters used these two words interchangeably. Second, while my experience with the digital humanities is still in its infancy, I am definitely on the right track. Concordances certainly don’t seem to be going out of style any time soon, and my use of indexes is a movement in the right direction. Third, the cyberinfrastructure people see themselves as support to the work of scientists. This is similar to the work of librarians who see themselves supporting their larger communities. Personally, I think this needs to be qualified since I believe it is possible for me to expand the venerable sphere of knowledge too. Providing library (or cyberinfrastructure) services does not preclude me from advancing our understanding of the human condition and/or describing the natural world. Lastly, open source software and open access publishing were common underlying themes but rarely explicitly stated. I wonder whether or not the idea of “open” is a four letter word.

About Infomotions Image Gallery: Flickr as cloud computing

Eric Lease Morgan — Sat, 22 May 2010 21:19:34 +0000

This posting describes the whys and wherefores behind the Infomotions Image Gallery.

Photography

I was introduced to photography during library school, specifically, when I took a multi-media class. We were given film and movie cameras, told to use the equipment, and through the process learn about the medium. I took many pictures of very tall smoke stacks and classical-looking buildings. I also made a stop-action movie where I step-by-step folded an origami octopus and underwater sea diver while a computer played the Beatles’ “Octopus’s Garden” in the background. I’d love to resurrect that 16mm film.

I was introduced to digital photography around 1995 when Steve Cisler (Apple Computer) gave me a QuickTake camera as a part of a payment for writing a book about Macintosh-based HTTP servers. That camera was pretty much fun. If I remember correctly, it took 8-bit images and could store about twenty-four of them at a time. The equipment worked perfectly until my wife accidentally dropped it into a pond. I still have the camera, somewhere, but it only works if it is plugged into an electrical socket. Since then I’ve owned a few other digital cameras and one or two digital movie cameras. They have all been more than simple point-and-shoot devices, but at the same time, they have always had more features than I’ve ever really exploited.

Over the years I mostly used the cameras to document the places I’ve visited. I continue to photograph buildings. I like to take macro shots of flowers. Venuses are always appealing. Pictures of food are interesting. In the self-portraits one is expected to notice the background, not necessarily the subject of the image. I believe I’m pretty good at composition. When it comes to color I’m only inspired when the sun is shining bright, and that makes some of my shots overexposed. I’ve never been very good at photographing people. I guess that is why I prefer to take pictures of statues. All things library and books are a good time. I wish I could take better advantage of focal lengths in order to blur the background but maintain a sharp focus in the foreground. The tool requires practice. I don’t like to doctor the photographs with effects. I don’t believe the result represents reality. Finally, I often ask myself an aesthetic question, “If I was looking through the camera to take the picture, then did I really see what was on the other side?” After all, my perception was filtered through an external piece of equipment. I guess I could ask the same question of all my perceptions since I always wear glasses.

The Infomotions Image Gallery is simply a collection of my photography, sans personal family photos. It is just another example of how I am trying to apply the principles of librarianship to the content I create. Photographs are taken. Individual items are selected, and the collection is curated. Given the available resources, metadata is applied to each item, and the whole is organized into sets. Every year the newly created images are archived to multiple mediums for preservation purposes. (I really ought to make an effort to print more of the images.) Finally, an interface is implemented allowing people to access the collection.

Enjoy.

Flickr as cloud computing

This section describes how the Gallery is currently implemented.

About ten years ago I began to truly manage my photo collection using Apple’s iPhoto. At just about the same time I purchased an iPhoto add-on called BetterHTMLExport. Using a macro language, this add-on enabled me to export sets of images to index and detail pages complete with titles, dates, and basic numeric metadata such as exposure, f-stop, etc. The process worked, but the software grew long in the tooth, was sold to another company, and was always a bit cumbersome. Moreover, maintaining the metadata was tedious, inhibiting my desire to keep it up to date. Too much editing here, exporting there, and uploading to a third place. To make matters worse, people expect to comment on the photos, put them into their own sets, and watch some sort of slide show. Enter Flickr and a jQuery plug-in called ColorBox.

After learning how to use iPhoto’s ability to publish content to Flickr, and after taking a closer look at Flickr’s application programmer interface (API), I decided to use Flickr to host my images. The idea was to: 1) maintain the content on my local file system, 2) upload the images and metadata to Flickr, and 3) programmatically create an interface to the content on my website. The result was a more streamlined process and a set of Perl scripts implementing a cleaner user interface. I was entering the realm of cloud computing. The workflow is described below:

Take photographs – This process is outlined in the previous section.
Import photographs – Import everything, but weed right away. I’m pretty brutal in this regard. I don’t keep duplicate or very similar shots. No (or very, very few) out-of-focus or poorly composed shots are kept either.
Add titles – Each photo gets some sort of title. Sometimes they are descriptive. Sometimes they are rather generic. After all, how many titles can different pictures of roses have? If I were really thorough I would give narrative descriptions to each photo.
Make sets – Group the imported photos into a set and then give a title to the set. Again, I ought to add narrative descriptions, but I don’t. Too lazy.
Add tags – Using iPhoto’s keywords functionality, I make an effort to “tag” each photograph. Tags are rather generic: flower, venus, church, me, food, etc.
Publish to Flickr – I then use iPhoto’s sharing feature to upload each newly created set to Flickr. This works very well and saves me the time and hassle of converting images. This same functionality works in reverse. If I use Flickr’s online editing functions, changes are reflected on my local file system after a refresh process is done. Very nice.
Re-publish to Infomotions – Using a system of Perl scripts I wrote called flickr2gallery I then create sets of browsable pages from the content saved on Flickr.

Using this process I can focus more on my content and less on my presentation. It makes it easier for me to focus on the images and their metadata and less on how the content will be displayed. Graphic design is not necessarily my forte.

Flickr2gallery is a suite of Perl scripts and plain text files:

tags2gallery.pl – Used to create pages of images based on photos’ tags.
sets2gallery.pl – Used to create pages of image sets as well as the image “database”.
make-home.pl – Used to create the Image Gallery home page.
flickr2gallery.sh – A shell script calling each of the three scripts above and thus (re-)building the entire Image Gallery subsite. Currently, the process takes about sixty seconds.
images.db – A tab-delimited list of each photograph’s local home page, title, and Flickr thumbnail.
Images.pm – A really-rudimentary Perl module containing a single subroutine used to return a list of HTML img elements filled with links to random images.
random-images.pl – Designed to be used as a server-side include, calls Images.pm to display sets of random images from images.db.
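The scripts themselves are not reproduced in this posting, but the idea behind Images.pm and images.db can be sketched. The following is a minimal, hypothetical version, assuming each record holds a page URL, a title, and a thumbnail URL separated by tabs, in that order. The subroutine names (random_images, escape), the sample records, and the exact HTML emitted are assumptions, not the author's code.

```perl
use strict;
use warnings;
use List::Util qw( shuffle );

# escape the characters that are special in HTML attribute values
sub escape {
    my $s = shift;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# given a reference to a list of tab-delimited records and a number,
# return that many randomly chosen thumbnails as linked HTML img elements
sub random_images {
    my ( $records, $n ) = @_;
    $n = @$records if $n > @$records;
    my @picked = ( shuffle @$records )[ 0 .. $n - 1 ];
    my @html;
    foreach my $record ( @picked ) {
        chomp $record;
        my ( $page, $title, $thumb ) = split /\t/, $record;
        push @html, sprintf '<a href="%s"><img src="%s" alt="%s" /></a>',
                            map { escape( $_ ) } $page, $thumb, $title;
    }
    return @html;
}

# hypothetical records; in practice these lines would be read from images.db
my @db = (
    "http://example.org/gallery/roses.html\tRed roses\thttp://example.org/thumbs/roses_s.jpg",
    "http://example.org/gallery/venus.html\tVenus de Milo\thttp://example.org/thumbs/venus_s.jpg",
    "http://example.org/gallery/church.html\tCountry church\thttp://example.org/thumbs/church_s.jpg",
);

print "$_\n" for random_images( \@db, 2 );
```

A plain text "database" like this is easy to rebuild from scratch each time the gallery is regenerated, which seems to be the appeal of the approach.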

I know the Flickr API has been around for quite a while, and I know I’m a Johnny Come Lately when it comes to learning how to use it, but that does not mean it can’t be outlined here. The API provides a whole lot of functionality. Reading and writing of image content and metadata. Reading and writing information about users, groups, and places. Using the REST-like interface the programmer constructs a command in the form of a URL. The URL is sent to Flickr via HTTP. Responses are returned in easy-to-read XML.

A good example is the way I create my pages of images with a given tag. First I denote a constant which is the root of a Flickr tag search. Next, I define the location of the Infomotions pages on Flickr. Then, after getting a list of all of my tags, I search Flickr for images using each tag as a query. These results are then looped through, parsed, and built into a set of image links. Finally, the links are incorporated into a template and saved to a local file. The heart of the process is listed below:

  use LWP::UserAgent;
  use HTTP::Request;
  use XML::XPath;
  use CGI;

  # root of a Flickr tag search; supply your own API key and user id
  use constant S => 'http://api.flickr.com/services/rest/?'
                  . 'method=flickr.photos.search&'
                  . 'api_key=YOURKEY&user_id=YOURID&tags=';

  # location of the Infomotions pages on Flickr
  use constant F => 'http://www.flickr.com/photos/infomotions/';

  my $ua = LWP::UserAgent->new;

  # get list of all tags here

  # find photos with this tag
  my $request  = HTTP::Request->new( GET => S . $tag );
  my $response = $ua->request( $request );
  die $response->status_line unless $response->is_success;

  # process each photo
  my $parser = XML::XPath->new( xml => $response->content );
  my $nodes  = $parser->find( '//photo' );
  my $cgi    = CGI->new;
  my $images = '';
  foreach my $node ( $nodes->get_nodelist ) {

    # parse
    my $id     = $node->getAttribute( 'id' );
    my $title  = $node->getAttribute( 'title' );
    my $farm   = $node->getAttribute( 'farm' );
    my $server = $node->getAttribute( 'server' );
    my $secret = $node->getAttribute( 'secret' );

    # build image links: square thumbnail, full-size image, and Flickr page
    my $thumb  = "http://farm$farm.static.flickr.com/$server/$id" .
                 '_' . $secret . '_s.jpg';
    my $full   = "http://farm$farm.static.flickr.com/$server/$id" .
                 '_' . $secret . '.jpg';
    my $flickr = F . "$id/";

    # build list of images; the title is assumed to carry the link back
    # to Flickr, since $flickr is otherwise unused in this listing
    $images .= $cgi->a({ href  => $full,
                         rel   => 'slideshow',
                         title => "<a href='$flickr'>Details on Flickr</a>"
                       },
                       $cgi->img({ alt => $title, src => $thumb,
                                   border => 0, hspace => 1, vspace => 1 }));

  }

  # save image links to file here

Notice the rel attribute (slideshow) in each of the images’ anchor elements. These attributes are used as selectors in a jQuery plug-in called ColorBox. The head of each generated HTML file contains a call to ColorBox that binds the plug-in to those anchors.
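The call itself is not reproduced in this posting. As a rough sketch only, a Perl subroutine writing such a head might look like the following. The subroutine name, the file paths, and the choice of the slideshow option are assumptions; only the rel value (slideshow) comes from the listing above.

```perl
use strict;
use warnings;

# return an HTML head that loads jQuery and ColorBox, and binds ColorBox
# to every anchor whose rel attribute is "slideshow"; paths are hypothetical
sub colorbox_head {
    my $title = shift;

    # a single-quoted here-document keeps Perl from interpolating $( and friends
    my $head = <<'HTML';
<head>
<title>##TITLE##</title>
<link rel="stylesheet" type="text/css" href="/css/colorbox.css" />
<script type="text/javascript" src="/js/jquery.js"></script>
<script type="text/javascript" src="/js/jquery.colorbox.js"></script>
<script type="text/javascript">
  $(document).ready(function() {
    $("a[rel='slideshow']").colorbox({ slideshow: true });
  });
</script>
</head>
HTML

    $head =~ s/##TITLE##/$title/;
    return $head;
}

print colorbox_head( 'Images tagged flower' );
```

Because the selector matches on the rel attribute, every anchor emitted by the tag script is picked up without any further markup.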

Using this plug-in I am able to implement a simple slideshow when the user clicks on any image. Each slideshow display consists of simple navigation and title. In my case the title is really a link back to Flickr where the user will be able to view more detail about the image, comment, etc.

Summary and conclusion

I am an amateur photographer, and the fruits of this hobby are online here for sharing. If you use them, then please give credit where credit is due.

The use of Flickr as a “cloud” to host my images is very useful. It enables me to mirror my content in more than one location as well as provide access in multiple ways. When the Library of Congress announced they were going to put some of their image content on Flickr I was a bit taken aback, but after learning how the Flickr API can be exploited I think there are many opportunities for libraries and other organizations to do the same thing. Using the generic Flickr interface is one way to provide access, but enhanced and customized access can be implemented through the API. Lots of food for thought. Now to apply the same process to my movies by exploiting YouTube.

Shiny new website

Eric Lease Morgan — Fri, 21 May 2010 01:21:04 +0000

Infomotions has a shiny new website, and the process to create it was not too difficult.

The problem

A relatively long time ago (in a galaxy far far away), I implemented an Infomotions website look & feel. Tabbed interface across the top. Local navigation down the left-hand side. Content in the middle. Footer along the bottom. Typical. Everything was rather square. And even though I used pretty standard HTML and CSS, its implementation was not really conducive to Internet Explorer. My bad.

Moreover, people’s expectations have increased dramatically since I first implemented my site’s look & feel. Curved lines. Pop-up windows. Interactive AJAX-like user experiences. My site was definitely not Web 2.0 in nature. Static. Not like a desktop application.

Finally, as time went on my site’s look & feel was not as consistently applied as I had hoped. Things were askew and the whole thing needed refreshing.

The solution

My ultimate solution is rooted in jQuery and its canned themes.

As you may or may not know, jQuery is a well-supported Javascript library supporting all sorts of cool things like drag ‘n drop, sliders, many animations, not to mention a myriad of ways to manipulate the Document Object Model (DOM) of HTML pages. An extensible framework, jQuery is also the foundation for many plug-in modules.

Just as importantly, jQuery supports a host of themes — CSS files implementing various looks & feels. These themes are very standards compliant and work well on all browsers. I was particularly enamored with the tabbed menu with rounded corners. (Under the hood, these rounded corners are implemented by a browser framework called Webkit. Let’s keep our eye on that one.) After learning how to implement the tabbed interface without the use of Javascript, I was finally on my way. As Dan Brubakerhorst said to me, “It is nothing but styling.”

None of the Infomotions subsites is driven by hand-coded HTML. Everything comes from some sort of script. The Alex Catalogue is a database-driven website with mod_perl modules. The water collection is supported by a database plus XSLT transformations of XML on the fly. The blog is WordPress. My “musings” are sets of TEI files converted in bulk into HTML. While it took a bit of tweaking in each of these subsites, the process was relatively painless. Insert the necessary divs denoting the menu bar, left-hand navigation, and content into my frameworks. Push the button. Enjoy. If I want to implement a different color scheme or typography, then I simply change a single CSS file for the entire site. In retrospect, the most difficult thing for me to convert was my blog. I had to design my own theme. Not too hard, but definitely a learning curve.

A feature I feel pretty strongly about is printing. The Web is one medium. Content on paper is another medium. They are not the same. In general, websites have more of a landscape orientation. Printed mediums more or less have portrait orientations. In the printed medium there is no need for global navigation, local navigation, nor hyperlinks. Silly. Margins need to be accounted for. Pages need to be signed, dated, and branded. Consequently, I wrote a single print-based CSS file governing the entire site. Pages print quite nicely. So nicely I may very well print every single page from my website and bind the whole thing into a book. Call it preservation.

In many ways I consider myself to be an artist, and the processes of librarianship are my mediums. Graphic design is not my forte, but I feel pretty good about my current implementation. Now I need to get back to the collection, organization, preservation, and dissemination of data, information, and knowledge.

Counting words

Eric Lease Morgan — Sat, 10 Apr 2010 22:33:07 +0000

When I talk about “services against text” I usually get blank stares from people. When I think about it more, many of the services I enumerate are based on the counting of words. Consequently, I spent some time doing just that — counting words.

I wanted to analyze the content of a couple of the mailing lists I own/moderate, specifically Code4Lib and NGC4Lib. Who are the most frequent posters? What words are used most often in the subject lines, and what words are used most often in the body of the messages? Using a hack I wrote (mine-mail.pl) I was able to generate simple tables of data:

I then fed these tables to Wordle to create cool looking images. I also fed these tables to a second hack (dat2cloud.pl) to create not-even-close-to-valid HTML files in the form of hyperlinked tag clouds. Below are the fruits of these efforts:

image of names	tag cloud of names
image of subjects	tag cloud of subjects
image of words	tag cloud of words

The next step is to plot the simple tables on a Cartesian plane. In other words, graph the data. Wish me luck.
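
The counting described above can be sketched in a few lines. This is not mine-mail.pl (which is Perl and is not reproduced here) but a minimal Python illustration that assumes the list archive is a plain mbox file. The function names and the tiny stop word list are hypothetical, and only posters and subject lines are tallied.

```python
import mailbox
import re
from collections import Counter

# Hypothetical stop word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "re", "for", "on"}

def tokenize(text):
    """Lowercase the text and return its words, minus the stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def count_mbox(path):
    """Tally posters and subject-line words for every message in an mbox file."""
    posters, subjects = Counter(), Counter()
    for message in mailbox.mbox(path):
        posters[str(message.get("From", "unknown"))] += 1
        subjects.update(tokenize(str(message.get("Subject", ""))))
    return posters, subjects

def as_table(counter, n=10):
    """Render the n most frequent items as tab-delimited lines."""
    return "\n".join(f"{item}\t{count}" for item, count in counter.most_common(n))
```

Calling count_mbox on an archive and passing either result to as_table produces the sort of simple two-column table that a tool like Wordle or a plotting program can consume.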

Great Ideas Coefficient

Eric Lease Morgan — Sat, 27 Mar 2010 11:58:07 +0000

This posting outlines a concept I call the Great Ideas Coefficient — an additional type of metadata used to denote the qualities of a text.

Great Ideas Coefficient

In the 1950s a man named Mortimer Adler and colleagues brought together what they thought were the most significant written works of Western civilization. They called this collection the Great Books of the Western World. Before they created the collection they outlined what they thought were the 100 most significant ideas of Western civilization. These are “great ideas” such as but not limited to beauty, courage, education, law, liberty, nature, sin, truth, and wisdom. Interesting.

Suppose you were able to weigh the value of a book based on these “great ideas”. Suppose you had a number of texts and you wanted to rank or list them according to the number of times they mentioned the “great ideas”. Such a thing can be done through the application of TFIDF. Here’s how:

1. create a list of the “great ideas”
2. calculate the TFIDF score for each idea in a given book
3. sum the scores for each idea
4. assign the score to the book
5. go to Step #2 for each book in a corpus
6. sort the corpus based on the total scores
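
The steps above can be sketched as follows. This is a minimal illustration, not the code used to produce the scores in this posting. It assumes one common TFIDF variant (term frequency divided by document length, multiplied by the natural log of the number of documents over the number of documents containing the term), and the function names and the shape of the corpus (a dictionary of titles and plain text) are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def great_ideas_coefficients(corpus, ideas):
    """Return {title: coefficient}; each coefficient sums the TFIDF scores of the ideas."""
    counts = {title: Counter(tokenize(text)) for title, text in corpus.items()}
    n_docs = len(corpus)
    scores = {}
    for title, tf in counts.items():
        total_words = sum(tf.values()) or 1
        score = 0.0
        for idea in ideas:
            # Document frequency: how many books mention this idea at all.
            df = sum(1 for c in counts.values() if c[idea] > 0)
            if df == 0:
                continue
            score += (tf[idea] / total_words) * math.log(n_docs / df)
        scores[title] = score
    return scores

def rank(corpus, ideas):
    """Sort the corpus by coefficient, highest first."""
    scores = great_ideas_coefficients(corpus, ideas)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that under this variant an idea mentioned in every book contributes nothing, since the log of one is zero. Other weightings are possible, and the choice will change the rankings.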

Once the scores are calculated, they can be graphed, and once they are graphed they can be illustrated.

An example of this technique is shown above. For each item in a list of works by Aristotle a Great Ideas Coefficient has been calculated and assigned. The list was then ordered by the score. The score was then plotted graphically. Finally, all the graphs were joined together as an animated GIF image to show the range of scores in the list. Luckily the process seems to work because Aristotle’s Metaphysics ranks at the top with the highest Great Ideas Coefficient, and his History of Animals ranks the lowest. ‘Seems to make sense.

The concept behind the Great Ideas Coefficient is not limited to “great ideas”. Any set of words or phrases could be used. For example, one could create a list of “big names” (Plato, Shakespeare, Galileo, etc.) and calculate a Big Names Coefficient. Alternatively, a person could create a list of other words or phrases for any topic or genre to weigh a set of texts against biology, mathematics, literature, etc.

Find is not the problem that needs to be solved nowadays. The problem of use and understanding is more pressing. People can find plenty of information. They need (want) assistance in putting the information into context. “Books are for use.” The application of something like the Great Ideas Coefficient may be just one example.

My first ePub file

Eric Lease Morgan — Mon, 22 Mar 2010 01:51:27 +0000

I made available my first ePub file today.

Screen shot

EPub is the current de facto standard file format for ebook readers. After a bit of reading, I found the format is not too difficult since all the files are plain-text XML files or images. The various metadata files are ePub-specific XML. The content is XHTML. The graphics can be in any number of formats. The whole lot is compressed into a single file using the zip “standard”, and suffixed with a .epub extension.

Since much of my content has been previously saved as TEI files, the process of converting my content into ePub is straightforward. Use XPath to extract metadata. Use XSLT to transform the TEI to XHTML. Zip up the whole thing and make it available on the Web. I have found the difficult part to be the images. It is hard to figure out where one’s images are saved and then incorporate them into the ePub file. I will have to be a bit more standard with my image locations in the future and/or I will need to do a bit of a retrospective conversion process. (I probably will go the second route. Crazy.)
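
The “zip up the whole thing” step can be sketched like this. It is a minimal Python illustration and not my ePub creation script. It assumes the package file lives at OEBPS/content.opf (a common but arbitrary location), that the caller supplies that file along with the XHTML content, and the function name make_epub is hypothetical. The one detail worth noting is that the ePub container format expects the mimetype file to be the first entry in the archive and to be stored uncompressed.

```python
import zipfile

# Points reading systems at the package file; the OEBPS path is an assumption.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def make_epub(path, files):
    """Write an ePub archive; files maps archive names to their text or bytes."""
    with zipfile.ZipFile(path, "w") as epub:
        # The mimetype entry goes first and is stored, not compressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML,
                      compress_type=zipfile.ZIP_DEFLATED)
        # Everything else (package file, XHTML, images) is compressed.
        for name, data in files.items():
            epub.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

A file produced this way is only as valid as the package file and content handed to it, so running the result through a validator is still a good idea.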

Loading my ePub into Firefox’s EPUBReader worked just fine. The whole thing rendered pretty well in Stanza too. More importantly, it validated against a Java-based tool called epubcheck. Whew!

While I cogitate how to convert my content, you can download my first ePub file as well as the beginnings of my ePub creation script.

Enjoy?

P.S. I think the Apple iPad is going to have a significant impact on digital reading in the very near future. I’m preparing.

Alex Catalogue Widget

Eric Lease Morgan — Tue, 16 Mar 2010 03:43:34 +0000

I created my first Apple Macintosh Widget today — Alex Catalogue Widget.

The tool is pretty rudimentary. Enter something into the field. Press return or click the Search button. See search results against the Alex Catalogue of Electronic Texts displayed in your browser. The development process reminded me of hacking in HyperCard. Draw things on the screen — buttons, fields, etc. — and associate actions (events) with each of them.

Download it and try it for yourself.

Michael Hart in Roanoke (Indiana)

Eric Lease Morgan — Sun, 07 Mar 2010 21:54:18 +0000

On Saturday, February 27, Paul Turner and I made our way to Roanoke (Indiana) to listen to Michael Hart tell stories about electronic texts and Project Gutenberg. This posting describes our experience.

Roanoke and the library

To celebrate its 100th birthday, the Roanoke Public Library invited Michael Hart of Project Gutenberg fame to share his experience regarding electronic texts in a presentation called “Books & eBooks: Past, Present & Future Libraries”. The presentation was scheduled to start around 3 o’clock, but Paul Turner and I got there more than an hour early. We wanted to have time to visit the Library before it closed at 2 o’clock. The town of Roanoke (Indiana) — a bit southwest of Fort Wayne — was tiny by just about anybody’s standard. It sported a single blinking red light, a grade school, a few churches, one block of shops, and a couple of eating establishments. According to the man in the bar, the town got started because of the locks that had been built around town.

The Library was pretty small too, but it burst with pride. About 1,800 square feet in size, it was overflowing with books and videos. There were a couple of comfy chairs for adults, a small table, a set of four computers to do Internet things, and at least a few clocks on the wall. They were very proud of the fact that they had become an Evergreen library as a part of the Evergreen Indiana initiative. “Now it is possible to see what is owned in other, nearby libraries, and borrow things from them as well,” said the Library’s Board Director.

Michael Hart

The presentation itself was not held in the Library but in a nearby church. About fifty (50) people attended. We sat in the pews and contemplated the symbolism of the stained glass windows and wondered how the various hardware placed around the altar was going to be incorporated into the presentation.

Full of smiles and joviality, Michael Hart appeared in a tailless tuxedo, cummerbund, and top hat. “I am now going to pull a library out of my hat,” he proclaimed, and proceeded to withdraw a memory chip. “This chip contains tens of thousands of books, and now I’m going to pull a million books out of my pocket,” and he proceeded to display a USB drive. Before the year 2020 he sees us capable of carrying around a billion books on some sort of portable device. Such was the essence of his presentation — computer technology enables the distribution and acquisition of “books” in ways never before possible. Through this technology he wants to change the world. “I consider myself to be like Johnny Appleseed, and I’m spreading the word,” at which time I raised my hand and told him Johnny Appleseed (John Chapman) was buried just up the road in Fort Wayne.

Mr. Hart displayed and described a lot of antique hardware. A hard drive that must have weighed fifty (50) pounds. Calculators. Portable computers. Etc. He illustrated how storage mediums were getting smaller and smaller while being able to save more and more data. He was interested in the packaging of data and displayed a memory chip a person can buy from Walmart containing “all of the hit songs from the 50’s and 60’s”. (I wonder how the copyright issues around that one had been addressed.) “The same thing,” he said, “could be done for books but there is something wrong with the economics and the publishing industry.”

Roanoke (Indiana)

public library

He outlined how Project Gutenberg works. First, a book is identified as a possible candidate for the collection. Second, the legalities of making the book available are explored. Next, a suitable edition of the book is located. Fourth, the book’s content is transcribed or scanned. Finally, hundreds of people proof-read the result and ultimately make it available. Hart advocated getting the book out sooner rather than later. “It does not have to be perfect, and we can always fix the errors later.”

He described how the first Project Gutenberg item came into existence. In a very round-about and haphazard way, he enrolled in college. Early on he gravitated towards the computer room because it was air conditioned. Through observation he learned how to use the computer, and to do his part in making the expense of the computer worthwhile, he typed out the United States Declaration of Independence on July 4th, 1971.

“Typing the books is fun,” he said. “It provides a means for reading in ways you had never read them before. It is much more rewarding than scanning.” As a person who recently learned how to bind books and as a person who enjoys writing in books, I asked Mr. Hart to compare & contrast ebooks, electronic texts, and codexes. “The things Project Gutenberg creates are electronic texts, not ebooks. They are small, portable, easily copyable, and readable by any device. If you can’t read a plain text document on your computer, then you have much bigger problems. Moreover, there is an enormous cost-benefit compared to printed books. Electronic texts are cheap.” Unfortunately, he never really answered the question. Maybe I should have phrased it differently and asked him, the way Paul did, to compare the experience of reading physical books and electronic texts. “I don’t care if it looks like a book. Electronic texts allow me to do more reading.”

“Two people invented open source. Me and Richard Stallman,” he said. Well, I don’t think this is exactly true. Rather, Richard Stallman invented the concept of GNU software, and Michael Hart may have invented the concept of open access publishing. But the subtle differences between open source software and open access publishing are lost on most people. In both cases the content is “free”. I guess I’m too close to the situation. I too see open source software distribution and open access publishing having more things in common than differences.

church

stained glass

“I knew Project Gutenberg was going to be a success when I was talking on the telephone with a representative of the Common Knowledge project and heard a loud crash on the other end of the line. It turns out the representative’s son and friends had broken an Adirondack chair while clamoring to read an electronic text.” In any case, he was fanatically passionate about giving away electronic texts. He cited the World eBook Fair, and came to the presentation with plenty of CDs for distribution.

In the end I had my picture taken with Mr. Hart. We then all retired to the basement for punch and cake where we sang Happy Birthday to Michael. Two birthdays celebrated at the same time.

Reflection

Michael and Eric

Many people are drawn to the library profession as a matter of principle. Service to others. Academic freedom. Preservation of the historical record. I must admit that I am very much the same way. I was drawn to librarianship for two reasons. First, as a person with a BA in philosophy, I saw libraries as places full of ideas, literally. Second, I saw the profession as a growth industry because computers could be used to disseminate the content of books. In many ways my gut feelings were accurate, but at the same time they were misguided because much of librarianship surrounds workflows, processes that are only a couple of steps away from factory work, and the curation of physical items. To me, just like Mr. Hart, the physical item is not as important as what it manifests. It is not about the book. Rather, it is what is inside the book. We librarians have tied our identities to the physical book in such a way as to be limiting. We have pegged ourselves, portrayed a short-sighted vision, and consequently painted ourselves into a corner. Is the carpenter a hammer expert? Is the surgeon a scalpel technician? No, they are builders and healers, respectively. Why must librarianship be identified with books?

I have benefited from Mr. Hart’s work. My Alex Catalogue of Electronic Texts contains many Project Gutenberg texts. Unlike the books from the Internet Archive, the texts are much more amenable to digital humanities computing techniques because they have been transcribed by humans and not scanned by computers. At the same time, the Project Gutenberg texts are not formatted as well for printing or screen display as PDF versions of the same. This is why the use of electronic texts and ebooks is not an either/or situation but rather a both/and, especially when it comes to analysis. Read a well-printed book. Identify item of interest. Locate item in electronic version of book. Do analysis. Return to printed book. The process could work just as well the other way around. Ask a question of the electronic text. Get one or more answers. Examine them in the context of the printed word. Both/and, not either/or.

The company was great, and the presentation was inspiring. I applaud Michael Hart for his vision and seemingly undying enthusiasm. His talk made me feel like I really am on the right track, but change takes time. The free distribution of data and information — whether the meaning of free be denoted as liberty or gratis — is the right thing to do for society in general. We all benefit, and therefore the individual benefits as well. The political “realities” of the situation are more like choices and not Platonic truths. They represent immediate objectives as opposed to long-term strategic goals. I guess this is what you get when you mix the corporeal and ideal natures of humanity.

Who would have known that a trip to Roanoke would turn out to be a reflection of what it means to be human.

Preservationists have the most challenging job

Eric Lease Morgan — Sun, 03 Jan 2010 22:11:51 +0000

In the field of librarianship, I think the preservationists have the most challenging job because it is fraught with the greatest number of unknowns.

Twenty-eight (28) CDs

mangled book

As I am writing this posting, I am in the middle of an annual process — archiving the data I created from the previous year. This is something I have been doing since 1986. It began by putting my writings on 3.5 inch “floppy” disks. After a few years, CDs became more feasible, and I have been using them ever since. The first few CDs contain multiple years’ worth of content. This year I will require 14 CDs, and considering the fact that I create duplicates of every CD, this year I will burn 28. It goes without saying, this process takes a long time.

Now, I’m not quite as prolific a writer as 28 CDs sound, but the type of content I archive is large and diverse. It begins with my email which I have been systematically collecting since 1997. (“Can you say, ‘Mr. Serials’?”) No, I do not have all of my email, just the email I think is important; email of a significant nature where I actually say something, or somebody actually says something to me. It includes some attachments in the form of PDF documents and image files. It includes inquiries I get regarding my work and postings to mailing lists that are longer rather than shorter. By the way, I only send plain text email messages because MIME encodings — the process used to include other than plain text content — add an extra layer of complexity when it comes to reading and parsing email (mbox) archives. How can I be sure future digital archeologists will be able to compute against such stuff? Likewise, nothing gets tape archived (“tarred”), and nothing gets compressed (“zipped”), all for the same reason — an extra layer of complexity. Since I am the “owner” of the Code4Lib, NGC4Lib, and Usability4Lib mailing lists, and since I used to be the official archivist for ACQNET, I systematically collect, organize, archive, index, and provide access to these mailing lists using Mr. Serials. Burning the raw (mbox) email files of these lists as well as their browsable HTML counterparts is a part of my annual email preservation process.

The process continues with the various types of other writings. Each presentation I give has its own folder complete with invitation, logistics, bio & abstract, as well as three versions of my presentation: 1) a plain-text version, 2) a one-page handout in the form of a PDF file, and 3) a Word document. (Ick!) If I’m lucky I will remember to archive the TEI version of my remarks which is always longer than one page and lives in the Musings section of Infomotions. Other types of writings include the plain text versions of blog postings, various versions of essays for publication, etc. At the very least, everything is saved as plain text. Not Word. Not PDF. Not anything that is platform or software-title specific. Otherwise I can’t guarantee it will be readable into the next decade. I figure that if someone can’t read a plain text file, then they have much bigger problems.

Then there is the software. I write lots of software over the period of one year. At least a couple dozen programs. Some of them are simple hacks. Some of them are “studies”, experiments, or investigations. Some of them are extensive intermediaries between relational databases and people using Web browsers. While many of these programs come to me in bursts of creative energy, I would not have the ability to recreate them if they were lost and gone to Big Byte Heaven. When it comes to computers, your data is your most important asset. Not the hardware. Not the software. The data — the content you create. This is the content you can not get back again. This is the content that is unique. This is the content that needs to be backed up and saved against future calamity.

Because some of my data is saved in relational databases, the annual preservation process includes raw database dumps. Again, these are plain text files but in the form of SQL statements. Thank God for mysqldump. It gives me the opportunity to restore my Musings, my blog, my Alex Catalogue, my water collection, and now my Highlights & Annotations. (More on that later.)
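
A raw dump of this sort can be scripted. The sketch below is a minimal Python illustration and not the process I actually run. The database, file, and user names are hypothetical, and it assumes mysqldump is on the path and that the password comes from an option file rather than the command line.

```python
import subprocess

def dump_command(database, outfile, user):
    """Build the mysqldump command that writes one database to a plain SQL file."""
    return [
        "mysqldump",
        f"--user={user}",
        f"--result-file={outfile}",
        database,
    ]

def dump_database(database, outfile, user):
    """Run the dump; raises CalledProcessError if mysqldump fails."""
    subprocess.run(dump_command(database, outfile, user), check=True)
```

Calling dump_database("musings", "musings.sql", "eric"), for example, would leave a file of SQL statements ready to be burned alongside everything else. Restoring is a matter of feeding that file back to the mysql client.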

Biblioteca Valenciana

All of the content above fits on a single CD. Easily. Again, I’m not that prolific of a writer.

The hard part is the multimedia. As a part of an Apple Library of Tomorrow grant awarded to me by Steve Cisler, I was given an Apple QuickTake camera in 1994 or so. It could store about 24 pictures in 256 colors. It broke when my wife accidentally dropped it into a pond. It still works, if you have the necessary Macintosh hardware and it is plugged in. Presently, I use a 5 megapixel camera. I take the pictures at the highest resolution. I take movies as well. The pictures get edited. The movies get edited as well. This content currently makes up the bulk of the CDs. Six for the movies saved in the Apple movie (.mov) format. One DVD for actual use. Three for the full-scale JPEG images. Three for the iPhoto CDs. While I feel confident the JPEG files will be readable into the future, I’m not so sure about the .mov files, let alone the DVD. I might feel better about some sort of MPEG format, but it seems to be continually changing. Similarly, I suppose I ought to be saving the JPEG files as PNG files. At least that way more of the metadata may be traveling along with the images. For even better preservation, I ought to be putting the movies on video tape. (There is no compression or encryption there). I ought to be printing the photographs on glossy paper and binding the whole lot into books.

This year I started saving my music. I’ve been recording myself playing guitar since 1984. It began with audio cassette tapes. I have about 30 of them labeled and stored away in plastic boxes. I’ve made a couple attempts to digitize them, but the process is very laborious. It is easier to record yourself digitally in the first place and save the resulting files. This year I rooted through my archives and found a number of recordings. Tests of new recording gear and software. Experiments in production techniques. Background music to home videos. Saved as AIFF files, I hope they will be readable in the future.

Once everything gets burnt to CDs, one copy becomes my working copy. The other copy goes to a CD case not to be touched. Soon I will need a new case.

Finally, everything is not digital. In fact, I print a lot. Print that thought-provoking email message. Print that essay. Print this blog posting. Print the code to that computer program. Sign and date the print out. Put it into the archival box. The number of boxes I’m accumulating is now up to about 10.

What can I say. I enjoy all aspects of librarianship.

Preservation

My world of (digital) preservation is minuscule compared to the work of academic preservationists, archivists, and curators. If it takes this much effort to systematically collect, organize, and archive one person’s content, then think how much effort would be required to apply the process against the intellectual output of an entire college or university!

U of MN Archive

Even if so much people-power were available, this is no insurance against the future. How do we go about preserving digital content? What formats should the content be manifested in? What hardware will be needed to read the media where the data is saved? What software will be necessary to read the data? Too many questions. Too many unknowns. Too many things that are unpredictable. Right now, there seem to be only two solutions, and the real solution is probably a combination of the two. First, make sincere efforts to copy non-proprietary formats of content to physical media — a storage artifact that can be read by the widest variety of computer hardware. Plan on migrating the content as well as the physical media forward as technology changes. Think of this process as a type of insurance. Second, make as many copies of the content as possible in as many formats as possible. Print it. Microfilm it. Put it on tape and spinning disks. Make it available on the Web. While the folks at LOCKSS may not have thought the expression would be used in this manner, it is still true — “Lots of copies keep stuff safe.”

I sincerely believe we are in the process of creating a Digital Dark Age. “No, you can not read or access that content. It was created during the late 20th and early 21st centuries. It was a time of prolific exploration, few standards, and many legal barriers.” Something needs to happen differently.

Maybe it doesn’t really matter. Maybe the content that is needed is the content that always lives on “spinning disks” and gets automatically migrated forward. Computers make it easier to create lots of junk. It certainly doesn’t all need to be preserved. On the other hand, those letters from the American Civil War were not necessarily considered important at the time. Many of them were written by unknown people. Yet, these letters are important to us today. Not because of who wrote them, but because they reflect the thinking of the time. They provide pieces of a puzzle that can verify facts or provide alternative perspectives. After years and years, information can grow in importance, and consequently, today, we run the risk of throwing away stuff that is of importance tomorrow.

Preservationists have the hardest job in the field of librarianship. More power to them.

How to make a book (#2 of 3)

Eric Lease Morgan — Fri, 01 Jan 2010 16:20:22 +0000

This is the second of a three-part series on how to make a book.

The first posting described and illustrated how to use a thermo-binding machine to make a book. This posting describes and illustrates how to “weave” a book together — folding and cutting (or tearing). The process requires no tools. No glue. No sewing. Just paper. Ingenious. The third posting will be about traditional bookmaking.

Attribution

Like so many things in my life, I learned how to do this by reading a… book, but alas, I have misplaced this particular book and I am unable to provide you with a title/citation. (Pretty bad for a librarian!) In any event, the author of the book explained her love of bookmaking. She described her husband as an engineer who thought all of the traditional cutting, gluing, and sewing were unnecessary. She challenged him to create something better. The result was the technique described below. While what he created was not necessarily “better”, it surely showed ingenuity.

The process

Here is the process outlined, but you can also see how it is done on YouTube:

1. Begin with 12 pieces of paper – I use normal printer paper, but the larger 11.5 x 14 inch pieces of paper make for very nicely sized books.
2. Fold pairs of paper length-wise – In the end, you will have 6 pairs of paper half as big as the originals.
3. Draw a line down the center of 3 pairs – Demarcate where you will create “slots” for your book by drawing a line half the size of the inner crease of 3 pairs of paper.
4. Draw a line along the outside of 3 pairs – Demarcate where you will create “tabs” for your books by drawing two lines from one quarter along the crease towards the outside of the 3 pairs of paper.
5. Cut along the lines – Actually create the slots and tabs of your books by cutting along the lines drawn in Steps #3 and #4. Instead of using scissors, you can tear along the creases. (No tools!)
6. Create mini-books – Take one pair of paper cut as a tab and insert the tab into the slot of another pair. Do this for all 3 of the slot-tab pairs. The result will be 3 mini-books simply “woven” together.
7. Weave together the mini-books – Finally, find the slot of one of your mini-books and insert a tab from another mini-book. Do the same with the remaining mini-book.

The result of your labors should be a fully-functional book complete with 48 pages. I use them for temporary projects — notebooks. Yeah, the cover is not very strong. During the use of your book, put the whole thing in a manila or leather folder. Lastly, I know the process is difficult to understand without pictures. Watch the video.

Good and best open source software

Eric Lease Morgan — Mon, 28 Dec 2009 17:29:30 +0000

What qualities and characteristics make for a “good” piece of open source software? And once that question is answered, then what pieces of library-related open source software can be considered “best”?

I do not believe there is any single, most important characteristic of open source software that qualifies it to be denoted as “best”. Instead, a number of characteristics need to be considered. For example, a program might do one thing and do it well, but if it is a bear to install then that counts against it. Similarly, some software might work wonders but it is built on a proprietary infrastructure such as a closed source compiler. Can that software really be considered “open”?

For my own education and cogitation, I have begun to list questions to help me address what I think is the “best” library-related open source software. Your comments would be greatly appreciated. I have listed the questions in (more or less) priority order:

Does the software work as advertised? – If the program says it can do one thing, but never does, then this may be a non-starter. On the other hand, accomplishing a particular goal is sometimes relative. In most cases the software might perform excellently, but in others it performs less so. It is unrealistic to expect any software to be all things to all people.
To what degree is the software supported? – Support, can mean many things. Most obviously, users of the software want to know whether or not there are one or more people behind the software who can answer questions about it. Where is the developer and how can I get in touch with them? Are they approachable? If the developer is not available, then can support be purchased? Do I get what I pay for when I make this purchase? How expensive is it? Is their website easy to use? Support can also allude to software updates. “Software is never done. If it were, then it would be called hardware.” For example, my favorite XSL processor (xsltproc) and some of its friends work great but recommending it to friends comes with hesitation because I wonder about ongoing maintenance and upgrades to the newer versions of the API. Support also means user community. While open source is about “free” software, it relies on communities for sustainability. Do such communities exist? Are there searchable mailing lists with browsable archives? Are there wikis, virtual and real meetings, and/or IRC channels, etc?
Is the documentation thorough? – Is there a man page? A POD? Something that can be printed and annotated? Is there an introduction? FAQ? Glossary of terms? Is there a different guide/section for different types of readers such as systems administrators, programmers, implementors, and/or users? Is the documentation well-written? While I have used plenty of pieces of software and never read the manual, documentation is essencial if the software is expected to be exploited to the highest degree. Few thing in life are truly intuitive. Software is certainly not one of them. Documentation is a form of writing, and writing is something that literally transcends space and time. It is an alternative to having a person giving you instructions.
What are the licence terms? – Personally I place a higher value on the viral nature of a GNU-like license, but BSD-like licenses enable commercial enterprise to a greater degree, and whether I like it or not commercial enterprises are all but necessary in the world I live in. (After all, it enabled the creation of favorite personal computer’s operating system.) At the same time, if the licensing is not GNU-like or BSD-like, then the software is not really open source anyway. Right?
To what degree is the software easy to install? – Since installing software is usually not a process that needs to be repeated, a difficult installation can be overlooked. On the other hand, if tweaking kernels, installing a huge number of dependencies, requiring a second piece of obscure software that is not supported is required, then all this counts against an open source software distribution.
To what degree is the software implemented using the “standard” LAMP stack? – LAMP is an acronym for Linux, Apache, MySQL, and Perl (or PHP, or Python, or just about any other computer language), and the LAMP stack is/was the basis for many pieces of open source applications. The combination is well-supported, well-documented, and easily transportable to different hardware platforms. If the software application is built on LAMP, then the application has a lot going for it.
Is the distribution in question an application/system or a library/module? – It is possible to divide software into two groups: 1) software that is designed to build other software (libraries/modules), and 2) software that is an end-in-itself (applications/systems). The former is akin to a tool in a toolbox used to build applications. The latter is something intended for an end user. The former requires a computer programmer to truly exploit. The latter usually does not require as much specific expertise. Both the module and the application have their place. Each has its own advantages and disadvantages. Depending on the implementor’s environment one might be better suited than the other.
To what degree does the software satisfy some sort of real library need? – This question is specific to my particular audience, and is dependent on a definition of librarianship. Collection. Preservation. Organization. Dissemination. Books? Catalogs? Circulation? Reading and information literacy? Physical place fostering community? Etc. For example, librarians love to create lists, and in a digital environment lists are well managed through the use of relational databases. Therefore, does MySQL qualify as a piece of library-related software? Similarly, as Roy Tennant was told one time, “Librarians like to search. Everybody else likes to find.” Does this mean indexers like Solr/Lucene ought to qualify? Maybe the question ought to be rephrased. “To what degree does the software satisfy your or your institution’s needs?”

What sorts of things have I left out? Is there anything here that can be measured, or is everything left to subjective judgement? Just as importantly, can we as a community answer these questions against a list of specific software distributions to come up with the “best” of class?

More questions than answers.

Valencia and Madrid: A Travelogue

Eric Lease Morgan — Sat, 05 Dec 2009 15:34:12 +0000

I recently had the opportunity to visit Valencia and Madrid (Spain) to share some of my ideas about librarianship. This posting describes some of the things I saw and learned along the way.

La Capilla de San Francisco de Borja

Capilla del Santo Cáliz

LIS-EPI Meeting

In Valencia I was honored to give the opening remarks at the 4th International LIS-EPI Meeting. Hosted by the Universidad Politécnica de Valencia and organized by Fernanda Mancebo as well as Antonia Ferrer, the Meeting provided an opportunity for librarians to come together and share their experiences in relation to computer technology. My presentation, “A few possibilities for librarianship by 2015” outlined a few near-term futures for the profession. From the introduction:

The library profession is at a cross roads. Computer technology coupled with the Internet have changed the way content is created, maintained, evaluated, and distributed. While the core principles of librarianship (collection, organization, preservation, and dissemination) are still very much apropos to the current milieu, the exact tasks of the profession are not as necessary as they once were. What is a librarian to do? In my opinion, there are three choices: 1) creating services against content as opposed to simply providing access to it, 2) curating collections that are unique to our local institutions, or 3) providing sets of services that are a combination of #1 and #2.

And from the conclusion:

If libraries are representing a smaller and smaller role in the existing information universe, then two choices present themselves. First, the profession can accept this fact, extend it out to its logical conclusion, and see that libraries will eventually play an insignificant role in society. Libraries will not be libraries at all but more like purchasing agents and middle men. Alternatively, we can embrace the changes in our environment, learn how to take advantage of them, exploit them, and change the direction of the profession. This second choice requires a period of transition and change. It requires resources spent against innovation and experimentation with the understanding that innovation and experimentation more often generate failures as opposed to successes. The second option carries with it greater risk but also greater rewards.

toro	robot sculpture

Josef Hergert

Providing a similar but different vision from my own, Josef Hergert (University of Applied Sciences HTW Chur) described how librarianship ought to be embracing Web 2.0 techniques in a presentation called “Learning and Working in Time of Web 2.0: Reconstructing Information and Knowledge”. To say Hergert was advocating information literacy would be to over-simplify his remarks, yet if you broaden the definition of information literacy to include the use of blogs, wikis, social bookmarking sites — Web 2.0 technologies — then the phrase information literacy is right on target. A number of notable quotes included:

We are experiencing many changes in the environment: non-commercial sharing of content, legislative overkill, and “pirate parties”… The definition of “authorship” is changing.
The teaching of information literacy courses will help overcome some of the problems.
The process of learning is changing because of the Internet… We are now experiencing a greater degree of informal learning as opposed to formal learning… We need as librarians to figure out how to exploit the environment to support learning both formal and informal.
The current environment is more than paper, but also about a network of people, and the librarian can help create these networks with [Web 2.0 tools].
Provide not only the book but the environment and tools to do the work.

As an aside, I have been using networked computer technologies for more than twenty years. Throughout that time a number of truisms have become apparent. “If you don’t want it copied, then don’t put it on the ‘Net; give back to the ‘Net”, “On the Internet nobody knows that you are a dog”, and “It is like trying to drink from a fire hose” are just a few. Hergert used the newest one, “If it is not on the Internet, then it doesn’t exist.” For better or for worse, I think this is true. Convenience is a very powerful elixir. The ease of acquiring networked data and information is so great compared to the time and energy needed to get data and information in analog format that people will get what is simply “good enough”. In order to remain relevant, libraries must put their (full text) content on the ‘Net or be seen as an impediment to learning as opposed to learning’s facilitator.

While I would have enjoyed learning what the other Meeting presenters had to say, it was unrealistic for me to attend the balance of the conference. The translators were going back to Switzerland, and I would not have been able to understand what the presenters were saying. In this regard I sort of felt like the Ugly American, but I have come to realize that the use of English is a purely practical matter. It has nothing to do with a desire to understand American culture.

Biblioteca Valenciana

The next day a few others and I had the extraordinary opportunity to get an inside tour of the Biblioteca Valenciana (Valencia Library). Starting out as a monastery, it was transformed into quite a number of other things, such as a prison, before it became a library. We got to go into the archives, see some of their treasures, and learn about the library’s history. They were very proud of their Don Quixote collection, and we saw their oldest book, a treatise on the Black Death which included receipts for treatments.

Biblioteca Nacional de España

In Madrid I visited the Biblioteca Nacional de España (National Library of Spain) and went to their museum. It was free, and I saw an exhibition of original Copernicus, Galileo, Brahe, Kepler, and Newton editions embodying Western scientific progress. Very impressive, and very well done, especially considering the admission fee.

Biblioteca Nacional

statue

International Institute

Finally, I shared the presentation from the LIS-EPI Meeting at the International Institute. While I advocated changes in the ways our profession does its work, the attendees at both venues wondered how to go about these changes. “We are expected to provide a certain set of services to our patrons here and now. What do we do to learn these new skills?” My answer was grounded in applied research & development. Time must be spent experimenting and “playing” with the new technologies. This should be considered an investment in the profession and its personnel, an investment that will pay off later in new skills and greater flexibility. We work in academia. It behooves us to work academically. This includes explorations into applying our knowledge in new and different ways.

Acknowledgements

Many thanks go to many people for making this professional adventure possible. I am indebted to Monica Pareja from the United States Embassy in Madrid. She kept me out of trouble. I thank Fernanda Mancebo and Antonia Ferrer who invited me to the Meeting. Last and certainly not least, I thank my family for allowing me to go to Spain in the first place since the event happened over the Thanksgiving holiday. “Thank you, one and all.”

alley

fountain

Colloquium on Digital Humanities and Computer Science: A Travelogue

Eric Lease Morgan — Sat, 05 Dec 2009 02:52:30 +0000

On November 14-16, 2009 I attended the 4th Annual Chicago Colloquium on Digital Humanities and Computer Science at the Illinois Institute of Technology in Chicago. This posting outlines my experiences there, but in a phrase, I found the event to be very stimulating. In my opinion, libraries ought to be embracing the techniques described here and integrating them into their collections and services.

IIT	Paul Galvin Library

Day #0 – A pre-conference workshop

Upon arrival I made my way directly to a pre-conference workshop entitled “Machine Learning, Sequence Alignment, and Topic Modeling at ARTFL” presented by Mark Olsen and Clovis Gladstone. In the workshop they described at least two applications they were using to discover common phrases between texts. The first was called Philomine and the second was called Text::Pair. Both work similarly, but Philomine needs to be integrated with Philologic, and Text::Pair is a stand-alone Perl module. Using these tools, n-grams are extracted from texts, indexed to the file system, and made available for searching. By entering phrases into a local search engine, hits are returned that include the phrases and the works where the phrases were found. I believe Text::Pair could be successfully integrated into my Alex Catalogue.
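
The general idea of finding common phrases through n-grams can be sketched in a few lines of Python. This is only an illustration of the technique as I understood it, not the way Philomine or Text::Pair actually work. The function names, the choice of three-word n-grams, and the two tiny sample texts are all my own assumptions.

```python
import re
from collections import defaultdict


def ngrams(text, n=3):
    """Return the word n-grams (as tuples) found in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(texts, n=3):
    """Map each n-gram to the set of works containing it."""
    index = defaultdict(set)
    for title, text in texts.items():
        for gram in ngrams(text, n):
            index[gram].add(title)
    return index


def shared_phrases(index):
    """Keep only the n-grams occurring in more than one work."""
    return {
        " ".join(gram): sorted(titles)
        for gram, titles in index.items()
        if len(titles) > 1
    }


# hypothetical sample texts, made up for this illustration
texts = {
    "work_a": "It was the best of times, it was the worst of times.",
    "work_b": "Some said it was the best of times for reading.",
}
index = build_index(texts)
shared = shared_phrases(index)
for phrase, works in sorted(shared.items()):
    print(phrase, works)
```

A real system would save the index to disk and put a search interface in front of it, but the core of the process is probably not much more than this.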

orange, green, and gray

orange and green

Day #1

The Colloquium formally began the next day with an introduction by Russell Betts (Illinois Institute of Technology). His most notable quote was, “We have infinite computer power at our fingertips, and without much thought you can create an infinite amount of nonsense.” Too true.

Marco Büchler (University of Leipzig) demonstrated textual reuse techniques in a presentation called “Citation Detection and Textual Reuse on Ancient Greek Texts”. More specifically, he used textual reuse to highlight differences between texts, graph ancient history, and explore computer science algorithms. Try www.eaqua.net for more.

Patrick Juola’s (Duquesne University) “conjecturator” was the heart of the next presentation called “Mapping Genre Spaces via Random Conjectures”. In short, Juola generated thousands and thousands of “facts” in the form of [subject1] uses [subject2] more or less than [subject3]. He then tested each of these facts for truth against a corpus. Ironically, he was doing much of what Betts alluded to in the introduction: creating nonsense. On the other hand, the approach was innovative.
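
To make the idea concrete, here is a minimal sketch of generating random conjectures and testing them. It is my own guess at the spirit of the approach, not Juola’s software. I assume a conjecture takes the form “genre A uses word W more than genre B” and that “uses more” means a higher relative word frequency. The corpus, the vocabulary, and all the function names are made up.

```python
import random
from collections import Counter


def relative_frequency(text, word):
    """Return how often a word occurs, relative to the text's length."""
    words = text.lower().split()
    return Counter(words)[word] / len(words) if words else 0.0


def random_conjecture(corpus, vocabulary, rng):
    """Build one (genre, word, other genre) conjecture at random."""
    first, second = rng.sample(sorted(corpus), 2)
    return first, rng.choice(vocabulary), second


def is_true(corpus, conjecture):
    """Test whether the first genre uses the word more than the second."""
    first, word, second = conjecture
    return (relative_frequency(corpus[first], word)
            > relative_frequency(corpus[second], word))


# hypothetical two-genre corpus, made up for this illustration
corpus = {
    "poetry": "the rose and the thorn and the rose again",
    "prose": "the clerk counted the coins and left",
}
vocabulary = ["rose", "coins", "and"]

rng = random.Random(0)  # seeded so the run is repeatable
conjectures = [random_conjecture(corpus, vocabulary, rng) for _ in range(5)]
results = [(c, is_true(corpus, c)) for c in conjectures]
for (first, word, second), truth in results:
    print(f"{first} uses '{word}' more than {second}: {truth}")
```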

By exploiting a parts-of-speech (POS) parser, Devin Griffiths (Rutgers University) sought the use of analogies as described in “On the Origin of Theories: The Semantic Analysis of Analogy in Scientific Corpus”. Assuming that an analogy can be defined as a noun-verb-noun-conjunction-noun-verb-noun phrase, Griffiths looked for analogies in Darwin’s Origin of Species, graphed the number of analogies against locations in the text, and made conclusions accordingly. He asserted that the use of analogy was very important during the Victorian Age, and he tried to demonstrate this assertion through a digital humanities approach.
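
Once a text has been tagged, looking for such a pattern is a matter of sliding a window across the tags. The sketch below assumes the tagging has already been done by some POS parser and that the output is a list of (word, tag) pairs. The tag names (NOUN, VERB, CONJ) and the sample sentence are my own assumptions, not anything Griffiths showed.

```python
# the noun-verb-noun-conjunction-noun-verb-noun pattern, using assumed tag names
PATTERN = ["NOUN", "VERB", "NOUN", "CONJ", "NOUN", "VERB", "NOUN"]


def find_analogies(tagged):
    """Return (position, phrase) pairs where the tag pattern occurs."""
    tags = [tag for _, tag in tagged]
    width = len(PATTERN)
    hits = []
    for i in range(len(tags) - width + 1):
        if tags[i:i + width] == PATTERN:
            phrase = " ".join(word for word, _ in tagged[i:i + width])
            hits.append((i, phrase))
    return hits


# hypothetical, hand-tagged sentence standing in for real parser output
tagged = [
    ("birds", "NOUN"), ("build", "VERB"), ("nests", "NOUN"), ("as", "CONJ"),
    ("men", "NOUN"), ("build", "VERB"), ("houses", "NOUN"), ("quickly", "ADV"),
]
hits = find_analogies(tagged)
print(hits)
```

The positions returned by such a function are what could be graphed against locations in the text.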

The use of LSIDs (large screen information displays) was discussed by Geoffrey Rockwell (McMaster University). While I did not take a whole lot of notes from this presentation, I did get a couple of ideas: 1) figure out a way for a person to “step into” a book, or 2) display a graphic representation of a text on a planetarium ceiling. Hmm…

Kurt Fendt (MIT) described a number of ways timelines could be used in the humanities in his presentation called “New Insights: Dynamic Timelines in Digital Humanities”. Through the process I became aware of the SIMILE timeline application/widget. Very nice.

I learned of the existence of a number of digital humanities grants as described by Michael Hall (NEH). They include start-up grants as well as grants on advanced topics. See: neh.gov/odh/.

The first keynote speech, “Humanities as Information Sciences”, was given by Vasant Honavar (Iowa State University) in the afternoon. Honavar began with a brief history of thinking and philosophy, which he believes led to computer science. “The heart of information processing is taking one string and transforming it into another.” (Again, think of the introductory remarks.) He advocated the creation of symbols, feeding them into a processor, and coming up with solutions out the other end. Language, he posited, is an information-rich artifact and therefore something that can be analyzed with computing techniques. I liked how he compared science with the humanities. Science observes physical objects, and the humanities observe human creations. Honavar was a bit arscient, and therefore someone to be admired.

subway tunnel

skyscraper predecessor

Day #2

In “Computational Phonostylistics: Computing the Sounds of Poetry” Marc Plamondon (Nipissing University) described how he was counting phonemes in both Tennyson’s and Browning’s poetry to validate whether or not Tennyson’s poetry is “musical” or plosive sounding and Browning’s poetry is “harsh” or fricative. To do this he assumed one set of characters is soft and another set is hard. He then counted the number of times each of these sets of characters occurred in each of the respective poets’ works. The result was a graph illustrating the “musicality” or “harshness” of the poetry. One of the more interesting quotes from Plamondon’s presentation was, “I am interested in quantifying aesthetics.”
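
The counting itself is simple, and a rough sketch follows. The two sets of letters below are placeholders I made up for the sake of illustration. They are not Plamondon’s sets, and real work would count phonemes, not letters.

```python
# hypothetical character sets; the actual sets used in the talk are not known to me
SOFT = set("lmnrwy")
HARD = set("bdgkpt")


def sound_profile(text):
    """Count the soft and hard characters among the letters of a text."""
    letters = [c for c in text.lower() if c.isalpha()]
    return {
        "soft": sum(1 for c in letters if c in SOFT),
        "hard": sum(1 for c in letters if c in HARD),
        "total": len(letters),
    }


mellow = sound_profile("mellow")
bad_dog = sound_profile("bad dog")
print(mellow)
print(bad_dog)
```

Dividing each count by the total gives a proportion that can be compared across poems of different lengths, and those proportions are the sorts of numbers one would graph.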

In C.W. Forstal’s (SUNY Buffalo) presentation “Features from Frequency: Authorship and Stylistic Analysis Using Repetitive Sound” we learned how he too is counting sound n-grams to denote style. He applied the technique to D.H. Lawrence as well as to the Iliad and Odyssey, and to his mind the technique works well.

The second keynote presentation was given by Stephen Wolfram (Wolfram Research) via teleconference. It was called “What Can Be Made Computable in the Humanities?” He began by describing Mathematica as a tool he used to explore the world around him. All of this assumes that the world consists of patterns, and these patterns can be described through the use of numbers. He elaborated through something he called the Principle of Computational Equivalency: once you get to a certain threshold, systems create a level of complexity. Such a principle puts pressure on having the simplest descriptive model possible. (Such things are standard scientific/philosophic principles. Nothing new here.) Looking for patterns was the name of his game, and one such game was applied to music. Discover the patterns in a type of music. Feed the patterns to a computer. Have the computer generate the music. Most of the time the output works pretty well. He called this WolframTones.

He went on to describe WolframAlpha as an attempt to make the world’s knowledge computable. Essentially a front-end to Mathematica, WolframAlpha is a vast collection of content associated with numbers: people and their birth dates, the agricultural output of countries, the price of gold over time, temperatures from across the world, etc. Queries are accepted into the system. Searches are done against its content. Results are returned in the form of best-guess answers complete with graphs and charts. WolframAlpha exposes mathematical processing to the general public in ways that have not been done previously. Wolfram described two particular challenges in the creation of WolframAlpha. One was the collection of content. Unlike Google, Wolfram Research does not necessarily crawl the Internet. Rather it selectively collects the content of a “reference library” and integrates it into the system. Second, and more challenging, has been the design of the user interface. People do not enter structured queries, but structured output is expected. Interpreting people’s input is a difficult task in and of itself. From my point of view, he is probably learning more about human thought processes than the natural world.

red girder sculpture

gray sculpture

Some thoughts

This meeting was worth every single penny, especially considering the fact that there was absolutely no registration fee. Free, except for my travel costs, hotel, and the price of the banquet. Unbelievable!

Just as importantly, the presentations given at this meeting demonstrate the maturity of the digital humanities. These things are not just toys but practical tools for evaluating (mostly) texts. Given the increasing amount of full text available in library collections, I see very little reason why these sorts of digital humanities applications could not be incorporated into library collections and services. Collect full text content. Index it. Provide access to the index. Get back a set of search results. Select one or more items. Read them. Select one or more items again, and then select an option such as graph analogies, graph phonemes, or list common phrases between texts. People need to do more than read the texts. People need to use the texts, to analyze them, to compare & contrast them with other texts. The tools described in this conference demonstrate that such things are more than possible. All that has to be done is to integrate them into our current (library) systems.

So many opportunities. So little time.

Alex Catalogue collection policy

Eric Lease Morgan — Sun, 04 Oct 2009 13:12:08 +0000

This page lists the guidelines for including texts in the Alex Catalogue of Electronic Texts. Originally written in 1994, much of it is still valid today.

Purpose

The primary purpose of the Catalogue is to provide me with the means for demonstrating a concept I call arscience through American and English literature as well as Western philosophy. The secondary purpose of the Catalogue is to provide value-added access to some of the world’s great literature in turn providing the means for enhancing education. Consequently, the items in the collection must satisfy either of these two goals.

Qualities

Listed in priority order, texts in the collection must have the following qualities:

Only texts in the public domain or freely distributed texts will be collected.
Only texts that can be classified as American literature, English literature, or Western philosophy will be included.
Only texts that are considered “great” literature will be included. Great literature is broadly defined as literature withstanding the test of time and found in authoritative reference works like the Oxford Companions or the Norton Anthologies.
Only complete works will be collected unless a particular work was never completed in the first place. In other words, partially digitized texts will not be included in the Catalogue.
Whenever possible, collections of short stories or poetry will be included as they were originally published. If the items from the originally published collections have been broken up into individual stories or poems, then those items will be included individually.
The texts in the collection must be written in or translated into English. Otherwise I will not be able to evaluate the texts’ quality nor will the indexing and content-searching work correctly.

File formats

Because of technical limitations and the potential long-term integrity of the Catalogue, texts in the collection, listed in order of preference, should have the following formats:

Plain text files are preferred over HTML files.
HTML files are preferred over compressed files.
Compressed files are preferred over “word processor” files.
Word processed files are the least preferable file format.
Texts in unalterable file formats, such as Adobe Acrobat, will not be included.

In all cases, texts that have not been divided into parts are preferred over texts that have been divided. If a particular item is deemed especially valuable and the item has been divided into parts, then efforts will be made to concatenate the individual parts and incorporate the result into the collection. The items in the collection are not necessarily intended to be read online.

Alex, the movie!

Eric Lease Morgan — Sun, 04 Oct 2009 12:58:48 +0000

Created circa 1998, this movie describes the purpose and scope of the Alex Catalogue of Electronic Texts. While coming off rather pompous, the gist of what gets said is still valid and correct. Heck, the links even work. “Thanks Berkeley!”

Collecting water and putting it on the Web (Part III of III)

Eric Lease Morgan — Thu, 03 Sep 2009 14:25:29 +0000

This is Part III of an essay about my water collection, specifically a summary, opportunities for future study, and links to the source code. Part I described the collection’s whys and hows. Part II described the process of putting it on the Web.

Summary, future possibilities, and source code

There is no doubt about it. My water collection is eccentric, but throughout my lifetime I have encountered four other people who also collect water. At least I am not alone.

Putting the collection on the Web is a great study in current technology. It includes relational database design. Doing input/output against the database through a programming language. Exploiting the “extensible” in XML by creating my own mark-up language. Using XSLT to transform the XML for various purposes: display as well as simple transformation. Literally putting the water collection on the map. Undoubtedly technology will change, but the technology of my water collection is a representative reflection of the current use of computers to make things available on the Web.

I have made all the software that is a part of this system available here:

SQL file sans any data – good for study of simple relational database
SQL file complete with data – see how image data is saved in the database
PHP scripts – used to do input/output against the database
waters.xml – a database dump, sans images, in the form of an XML file
waters.xsl – the XSLT used to display the browser interface
waters2markers.xsl – transform water.xml into Google Maps XML file
map.pl – implementation of Google Maps API

My water also embodies characteristics of librarianship. Collection. Acquisition. Preservation. Organization. Dissemination. The only difference is that the content is not bibliographic in nature.

There are many ways access to the collection could be improved. It would be nice to sort by date. It would be nice to index the content and make the collection searchable. I have given thought to transforming the WaterML into FO (Formatting Objects) and feeding the FO to a PDF processor like FOP. This could give me a printed version of the collection complete with high resolution images. I could transform the WaterML into an XML file usable by Google Earth, providing another way to view the collection. All of these things are “left up to the reader for further study”. Software is never done, nor are library collections.

River Lune

Roman Bath

Ogle Lake

Finally, again, why do I do this? Why do I collect the water? Why have I spent so much time creating a system for providing access to the collection? Ironically, I am unable to answer succinctly. It has something to do with creativity. It has something to do with “arscience”. It has something to do with my passion for the library profession and my ability to manifest it through computers. It has something to do with the medium of my art. It has something to do with my desire to share and expand the sphere of knowledge. “Idea. To be an idea. To be an idea and an example to others… Idea”. I really don’t understand it through and through.

Read all the posts in this series:

Visit the water collection.

Collecting water and putting it on the Web (Part II of III)

Eric Lease Morgan — Thu, 03 Sep 2009 14:25:16 +0000

This is Part II of an essay about my water collection, specifically the process of putting it on the Web. Part I describes the whys and hows of the collection. Part III is a summary, provides opportunities for future study, and links to the source code.

Making the water available on the Web

As a librarian, I am interested in providing access to my collection(s). As a librarian who has the ability to exploit the use of computers, I am especially interested in putting my collection(s) on the Web. Unfortunately, the process is not as easy as the actual collection process, and there have been a number of implementations along the way. When I was really into HyperCard I created a “stack” complete with pictures of my water, short descriptions, and an automatic slide show feature that played the sound of running water in the background. (If somebody asks, I will dig up this dinosaur and make it available.) Later I created a Filemaker Pro database of the collection, but that wasn’t as cool as the HyperCard implementation.

Mississippi River

The current implementation is more modern. It takes advantage of quite a number of technologies, including:

a relational database
a set of PHP scripts that do input/output against the database
an image processor to create thumbnail images
an XSL processor to generate a browsable Web presence
the Google Maps API to display content on a world map

The use of each of these technologies is described in the following sections.

Relational database

ER diagram

Since 2002 I have been adding and maintaining newly acquired waters in a relational, MySQL, database. (Someday I hope to get the waters out of those cardboard boxes and add them to the database too. Someday.) The database itself is rather simple. Four tables: one for the waters, one for the collectors, a join table denoting who collected what, and a metadata table consisting of a single record describing the collection as a whole. The entity-relationship diagram illustrates the structure of the database in greater detail.
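
For readers who want something to experiment with, the four tables can be approximated as follows. This is a sketch and not the actual schema. It uses Python and SQLite instead of MySQL so it can be run without a database server. The names waters, collectors, items_for_collectors, and the columns of the first two tables come from the report-writing code later in this posting, but the column types, the keys, and everything about the metadata table are my assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE waters (
    water_id    INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    year        INTEGER,
    month       INTEGER,
    day         INTEGER,
    description TEXT,
    image       BLOB              -- stands in for MySQL's mediumblob
);
CREATE TABLE collectors (
    collector_id INTEGER PRIMARY KEY,
    first_name   TEXT,
    last_name    TEXT
);
CREATE TABLE items_for_collectors (  -- join table: who collected what
    water_id     INTEGER REFERENCES waters(water_id),
    collector_id INTEGER REFERENCES collectors(collector_id),
    PRIMARY KEY (water_id, collector_id)
);
CREATE TABLE metadata (              -- hypothetical name and columns
    title       TEXT,
    description TEXT
);
"""

# build the tables in a throw-away, in-memory database
connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)

tables = sorted(
    row[0]
    for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
print(tables)
```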

Probably the most interesting technical characteristic of the database is the image field of type mediumblob in the waters table. When it comes to digital libraries and database design, one of the perennial choices to make is where to save your content. Saving it outside your database makes your database smaller and less complicated but forces you to maintain links to your file system or the Internet where the actual content resides. This can be an ongoing maintenance nightmare and can side-step the preservation issues. On the other hand, inserting your content inside the database allows you to keep your content all in one place while “marrying” it to your database application. Putting the content in the database also allows you to do raw database dumps, making the content more portable and easier to back up. I’ve designed digital library systems both ways. Each has its own strengths and weaknesses. This is one of the rarer times I’ve put the content into the database itself. Never have I solely relied on maintaining links to off-site content. Too risky. Instead I’ve more often mirrored content locally and maintained two links in the database: one to the local cache and another to the canonical website.

PHP scripts for database input/output

Sets of PHP scripts are used to create, maintain, and report against the waters database. Creating and maintaining database records is tedious but not difficult as long as you keep in mind that there are really only four things you need to do with any database: 1) create records, 2) find records, 3) edit records, and 4) delete records. All that is required is to implement each of these processes against each of the fields in each of the tables. Since PHP was designed for the Web, each of these processes is implemented as a Web page only accessible to myself. The following screen shots illustrate the appearance and functionality of the database maintenance process.

Admin home

Admin waters

Edit water

High-level menus on the right. Sub-menus and data-entry forms in the middle. Simple. One of the nice things about writing applications for oneself is the fact that you don’t have to worry about usability, just functionality.
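
The four things one does with any database (create, find, edit, and delete) can be sketched against a single table like this. The sketch is in Python with SQLite, not the PHP and MySQL I actually used, and the function names are made up for the illustration. Only the collectors table and its column names are taken from my system.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE collectors ("
    "collector_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)


def create(first_name, last_name):
    """Create a record and return its new identifier."""
    cursor = connection.execute(
        "INSERT INTO collectors (first_name, last_name) VALUES (?, ?)",
        (first_name, last_name),
    )
    return cursor.lastrowid


def find(collector_id):
    """Find a record; returns None when there is no such record."""
    return connection.execute(
        "SELECT first_name, last_name FROM collectors WHERE collector_id = ?",
        (collector_id,),
    ).fetchone()


def edit(collector_id, first_name, last_name):
    """Edit the names stored in an existing record."""
    connection.execute(
        "UPDATE collectors SET first_name = ?, last_name = ? "
        "WHERE collector_id = ?",
        (first_name, last_name, collector_id),
    )


def delete(collector_id):
    """Delete a record."""
    connection.execute(
        "DELETE FROM collectors WHERE collector_id = ?", (collector_id,)
    )


new_id = create("Eric", "Morgan")
print(find(new_id))
```

Repeat the same four functions for each table, wrap each in a Web form, and the maintenance interface is more or less complete.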

The really exciting stuff happens when the reports are written against the database. Both of them are XML files. The first is essentially a database dump (water.xml) complete with the collection’s over-arching metadata record, each of the waters and their metadata, and a list of collectors. The heart of the report-writing process includes:

1. finding all of the records in the database
2. converting and saving each water’s image as a thumbnail
3. initializing the water record
4. finding all of the water’s collectors
5. adding each collector to the record
6. going to Step #5 for each collector
7. finishing the record
8. going to Step #2 for each water
9. saving the resulting XML to the file system

There are two hard parts about this process. The first, “MOGRIFY”, is a shelled out hack to the operating system using an ImageMagick utility to convert the content of the image field into a thumbnail image. Without this utility saving the image from the database to the file system would be problematic. Second, the SELECT statement used to find all the collectors associated with a particular water is a bit tricky. Not really too difficult, just a typical SQL join process. Good for learning relational database design. Below is a code snippet illustrating the heart of this report-writing process:

  # process every found row
  while ($r = mysql_fetch_array($rows)) {
  
    # get, define, save, and convert the image -- needs error checking
    $image     = stripslashes($r['image']);
    $leafname  = explode (' ' ,$r['name']);
    $leafname  = $leafname[0] . '-' . $r['water_id'] . '.jpg';
    $original  = ORIGINALS  . '/' . $leafname;
    $thumbnail = THUMBNAILS . '/' . $leafname;
    writeReport($original, $image);
    copy($original, $thumbnail);
    system(MOGRIFY . $thumbnail);
          
    # initialize and build a water record
    $report .= '';
    $report .= "" . 
               prepareString($r['name']) . '';
    $report .= '';
    $report .= "$r[year]";
    $report .= "$r[month]";
    $report .= "$r[day]";
    $report .= '';
    
    # find all the COLLECTORS associated with this water, and...
    $sql = "SELECT c.*
            FROM waters AS w, collectors AS c, items_for_collectors AS i
            WHERE w.water_id   = i.water_id
            AND c.collector_id = i.collector_id
            AND w.water_id     = $r[water_id]
            ORDER BY c.last_name, c.first_name";
    $all_collectors = mysql_db_query ($gDatabase, $sql);
    checkResults();
    
    # ...process each one of them
    $report .= "";
    while ($c = mysql_fetch_array($all_collectors)) {
    
      $report .= "
                 $c[first_name]
                 $c[last_name]";
      
    }
    $report .= '';
    
    # finish the record
    $report .= '' . stripslashes($r['description']) . 
               '';
  
  }

The result is the following “WaterML” XML content — a complete description of a water, in this case water from Copenhagen:

  
    Canal
      surrounding Kastellet, Copenhagen, Denmark
    
    
      2007
      8
      31
    
    
      
        Eric
        Morgan
    
    
    I had the opportunity to participate in the
      Ticer Digital Library School in Tilburg, The Netherlands.
      While I was there I also had the opportunity to visit the
      folks at 
      Index Data, a company
      that writes and supports open source software for libraries.
      After my visit I toured around Copenhagen very quickly. I
      made it to the castle (Kastellet), but my camera had run out
      of batteries. The entire Tilburg, Copenhagen, Amsterdam
      adventure was quite informative.
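The collectors join in the snippet above can be tried in isolation. What follows is only a minimal sketch, not the original code: it assumes Python's built-in sqlite3 module in place of MySQL, reuses the table and column names from the SELECT statement, and binds the water id as a parameter instead of interpolating it into the SQL string. The function names and the second water and collector in the demo data are invented for illustration.

```python
import sqlite3


def collectors_for_water(conn, water_id):
    """Return (first_name, last_name) for every collector linked to one water."""
    sql = """
        SELECT c.first_name, c.last_name
        FROM waters AS w
        JOIN items_for_collectors AS i ON w.water_id = i.water_id
        JOIN collectors AS c ON c.collector_id = i.collector_id
        WHERE w.water_id = ?
        ORDER BY c.last_name, c.first_name
    """
    return conn.execute(sql, (water_id,)).fetchall()


def demo_database():
    """Build a tiny in-memory database; the rows for water 2 are made up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE waters (water_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE collectors (collector_id INTEGER PRIMARY KEY,
                                 first_name TEXT, last_name TEXT);
        CREATE TABLE items_for_collectors (water_id INTEGER, collector_id INTEGER);
        INSERT INTO waters VALUES (1, 'Canal'), (2, 'Example Lake');
        INSERT INTO collectors VALUES (1, 'Eric', 'Morgan'), (2, 'Ann', 'Example');
        INSERT INTO items_for_collectors VALUES (1, 1), (2, 1), (2, 2);
    """)
    return conn
```

Calling collectors_for_water(demo_database(), 2) returns both collectors of the made-up second water, sorted by last name and then first name, which is what the linking table items_for_collectors exists to make possible.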

When I first created this version of the water collection, RSS was just coming online. Consequently I wrote an RSS feed for the water, but then I got realistic. How many people want to get an RSS feed of my water? Crazy?!

XSL processing

Now that the XML file has been created and the images are saved to the file system, the next step is to make a browser-based interface. This is done through an XSLT style sheet and an XSL processor called Apache2::TomKit.

Apache2::TomKit is probably the most eclectic component of my online water collection application. Designed to be a replacement for another XSL processor called AxKit, Apache2::TomKit enables the developer to create CGI-like applications, complete with HTTP GET parameters, in the form of XML/XSLT combinations. Specify the location of your XML files. Denote what XSLT files to use. Configure what XSLT processor to use. (I use LibXSLT.) Define an optional cache location. Done. The result is on-the-fly XSL transformations that work just like CGI scripts. The hard part is writing the XSLT.

The logic of my XSLT style sheet — waters.xsl — goes like this:

Get input – There are two: cmd and id. Cmd is used to denote the desired display function. Id is used to denote which water to display.
Initialize output – This is pretty standard stuff. Display XHTML head elements and start the body.
Branch – Depending on the value of cmd, display the home page, a collectors page, all the images, all the waters, or a specific water.
Display the content – This is done with the thorough use of XPath expressions.
Done – Complete the XHTML with a standard footer.

Of all the XSLT style sheets I’ve written in my career, waters.xsl is definitely the most declarative in nature. This is probably because the waters.xml file is really data driven as opposed to mixed content. The XSLT file is very elegant but challenging for the typical Perl or PHP hacker to quickly grasp.

Once the integration of the XML file, the XSLT style sheet, and Apache2::TomKit was complete, I was able to design URLs such as the following:

index.xml?cmd=getwaters – list all waters
index.xml?cmd=getcollectors – list all collectors
index.xml?cmd=getimages – dump all water thumbnail images
index.xml?cmd=getwater&id=79 – display a specific water
index.xml?cmd=getcollector&id=20 – display a specific collector and their waters

Okay. So it’s not very REST-ful; the URLs are not very “cool”. Sue me. I originally designed this in 2002.
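The branching on cmd and id described above is done in XSLT, but the logic is easy to see in a few lines of Python. This is a rough sketch only: the route function, the two sets of command names grouped this way, and the decision to fall back to the home page when cmd is missing or a needed id is absent are all assumptions made for illustration, not a description of waters.xsl.

```python
from urllib.parse import urlparse, parse_qs

# Commands taken from the URLs listed above, grouped by whether they need an id.
LIST_COMMANDS = {"getwaters", "getcollectors", "getimages"}
ITEM_COMMANDS = {"getwater", "getcollector"}


def route(url):
    """Map a URL to a (page, id) pair, defaulting to the home page."""
    params = parse_qs(urlparse(url).query)
    cmd = params.get("cmd", [""])[0]
    item_id = params.get("id", [None])[0]

    if cmd in LIST_COMMANDS:
        # List pages ignore any id that happens to be supplied.
        return (cmd, None)
    if cmd in ITEM_COMMANDS and item_id is not None:
        return (cmd, item_id)
    # Assumption: anything else (no cmd, unknown cmd, missing id) shows the home page.
    return ("home", None)
```

For example, route("index.xml?cmd=getwater&id=79") yields the getwater page for id 79, while a bare route("index.xml") yields the home page.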

Waters and Google Maps

In 2006 I used my water collection to create my first mash-up. It combined latitudes and longitudes with the Google Maps API.

Inserting maps into your Web pages via the Google API is a three-step process: 1) create an XML file containing latitudes and longitudes, 2) insert a call to the Google Maps JavaScript into the head of your HTML, and 3) call the JavaScript from within the body of your HTML.

For me, all I had to do was: 1) create new fields in my database for latitudes and longitudes, 2) go through each record in the database doing latitude and longitude data-entry, 3) write a WaterML file, 4) write an XSLT file transforming the WaterML into an XML file expected of Google Maps, 5) write a CGI script that takes latitudes and longitudes as input, 6) display a map, and 7) create links from my browser-based interface to the maps.

It may sound like a lot of steps, but it is all very logical, and taken bit by bit is relatively easy. Consequently, I am able to display a world map complete with pointers to all of my water. Conversely, I am able to display a water record and link its location to a map. The following two screen dumps illustrate the idea, and I try to get as close to the actual collection point as possible:

World map

Single water

Read all the posts in this series:

Visit the water collection.

Collecting water and putting it on the Web (Part I of III)

Eric Lease Morgan — Thu, 03 Sep 2009 11:23:29 +0000

This is Part I of an essay about my water collection, specifically the whys and hows of it. Part II describes the process of putting the collection on the Web. Part III is a summary, provides opportunities for future study, and links to the source code.

I collect water

It may sound strange, but I have been collecting water since 1978, and to date I believe I have around 200 bottles containing water from all over the world. Most of the water I’ve collected myself, but much of it has also been collected by friends and relatives.

The collection began the summer after I graduated from high school. One of my best friends, Marlin Miller, decided to take me to Ocean City (Maryland) since I had never seen the ocean. We arrived around 2:30 in the morning, and my first impression was the sound. I didn’t see the ocean. I just heard it, and it was loud. The next day I purchased a partially melted glass bottle for 59¢ and put some water, sand, and air inside. I was going to keep some of the ocean so I could experience it anytime I desired. (Actually, I believe my first water is/was from the Pacific Ocean, collected by a girl named Cindy Bleacher. She visited there in the late Spring of ’78, and I asked her to bring some back so I could see it too. She did.) That is how the collection got started.

Cape Cod Bay

Robins Bay

Gulf of Mexico

The impetus behind the collection was reinforced in college at Bethany College (Bethany, WV). As a philosophy major I learned about the history of Western ideas. That included Heraclitus who believed the only constant was change, and water was the essential element of the universe. These ideas were elaborated upon by other philosophers who thought there was not one essential element, but four: earth, water, air, and fire. I felt like I was on to something, and whenever I heard of somebody going abroad I asked them to bring me back some water. Burton Thurston, a Bethany professor, went to the Middle East on a diplomatic mission. He brought back Nile River water and water from the Red Sea. I could almost see Moses floating in his basket and escaping from the Egyptians.

The collection grew significantly in the Fall of 1982 because I went to Europe. During college many of my friends studied abroad. They didn’t do as much studying as they did traveling. They were seeing and experiencing all of the things I was learning about through books. Great art. Great architecture. Cities whose histories go back millennia. Foreign languages, cultures, and foods. I wanted to see those things too. I wanted to make real the things I learned about in college. I saved my money from my summer peach picking job. My father cashed in a life insurance policy he had taken out on me when I was three weeks old. Living like a turtle with its house on its back, I did the back-packing thing across Europe for a mere six weeks. Along the way I collected water from the Seine at Notre Dame (Paris), the Thames (London), the Eiger Mountain (near Interlaken, Switzerland) where I almost died, the Aegean Sea (Ios, Greece), and many other places. My Mediterranean Sea water from Nice is the prettiest. Because of all the algae, the water from Venice is/was the most biologically active.

Over the subsequent years the collection has grown at a slower but regular pace. Atlantic Ocean (Myrtle Beach, South Carolina) on a day of playing hooky from work. A pond at Versailles while on my honeymoon. Holy water from the River Ganges (India). Water from Loch Ness. I’m going to grow a monster from DNA contained therein. I used to have some of a glacier from the Canadian Rockies, but it melted. I have water from Three Mile Island (Pennsylvania). It glows in the dark. Amazon River water from Peru. Water from the Missouri River where Lewis & Clark decided it began. Etc.

Many of these waters I haven’t seen in years. Moves from one home to another have relegated them to cardboard boxes that have never been unpacked. Most assuredly some of the bottles have broken and some of the water has evaporated. Such is the life of a water collection.

Lake Huron

Trg Bana Jelacica

Jimmy Carter Water

Why do I collect water? I’m not quite sure. The whole body of water is the second largest thing I know. The first being the sky. Yet the natural bodies of water around the globe are finite. It would be possible to collect water from everywhere, but very difficult. Maybe I like the challenge. Collecting water is cheap, and every place has it. Water makes a great souvenir, and the collection process helps strengthen my memories. When other people collect water for me it builds between us a special relationship — a bond. That feels good.

What do I do with the water? Nothing. It just sits around my house occupying space. In my office and in the cardboard boxes in the basement. I would like to display it, but overall the bottles aren’t very pretty, and they gather dust easily. I sometimes ponder the idea of re-bottling the water into tiny vials and selling it at very expensive prices, but in the process the air would escape, and the item would lose its value. Other times I imagine pouring the water into a tub and taking a bath in it. How many people could say they bathed in the Nile River, Amazon River, Pacific Ocean, Atlantic Ocean, etc. all at the same time?

How water is collected

The actual process of collecting water is almost trivial. Here’s how:

Travel someplace new and different – The world is your oyster.
Identify a body of water – This should be endemic to the locality such as an ocean, sea, lake, pond, river, stream, or even a public fountain. Natural bodies of water are preferable. Processed water is not.
Find a bottle – In earlier years this was difficult, and I usually purchased a bottle of wine with my meal, kept the bottle and cork, and used the combination as my container. Now-a-days it is easier to root around in a trash can for a used water bottle. They’re ubiquitous, and they too are often endemic to the locality.
Collect the water – Just fill the bottle with mostly water but some of what the water is flowing over as well. The air comes along for the ride.
Take a photograph – Hold the bottle at arm’s length and take a picture of it. What you are really doing here is two-fold. Documenting the appearance of the bottle but also documenting the authenticity of the place. The picture’s background supports the fact that the water really came from where the collector says.
Label the bottle – On a small piece of paper write the name of the body of water, where it came from, who collected it, and when. Anything else is extra.
Save – Keep the water around for posterity, but getting it home is sometimes a challenge. Since 9/11 it has been difficult to get the water through airport security and/or customs. I have recently found myself checking my bags and incurring a handling fee just to bring my water home. Collecting water is not as cheap as it used to be.

Who can collect water for me? Not just anybody. I have to know you. Don’t take it personally, but remember, part of the goal is relationship building. Moreover, getting water from strangers would jeopardize the collection’s authenticity. Is this really the water they say it is? Call it a weird part of the “collection development policy”.

Pacific Ocean

Rock Run

Salton Sea

Read all the posts in this series:

Visit the water collection.

Web-scale discovery services

Eric Lease Morgan — Thu, 27 Aug 2009 14:25:32 +0000

Last week (Tuesday, August 18) Marshall Breeding and I participated in a webcast sponsored by Serials Solutions and Library Journal on the topic of “‘Web-scale’ discovery services”.

Our presentations complemented one another in that we both described the current library technology environment and described how the creation of amalgamated indexes of book and journal article content has the potential to improve access to library materials.

Dodie Ownes summarized the event in an article for Library Journal. From there you can also gain access to an archive of the one-hour webcast. (Free registration required.) I have made my written remarks available on the Hesburgh Libraries website as well as mirrored them locally. From the remarks:

It is quite possible the do-it-yourself creation and maintenance of an index to local book holdings, institutional repository content, and articles/etexts is not feasible. This may be true for any number of reasons. You may not have the full complement of resources to allocate, whether that be time, money, people, or skills. You and your library may have a set of priorities forcing the do-it-yourself approach lower on the to-do list. You might find yourself stuck in never-ending legal negotiations for content from “closed” access providers. You might liken the process of normalizing myriads of data formats into a single index to Hercules cleaning the Augean stables.

technical expertise
money

people with vision
energy

If this be the case, then the purchasing (read, “licensing”) of a single index service might be the next best thing — Plan B.

I sincerely believe the creation of these “Web-scale” indexes is a step in the right direction, but I believe just as strongly that the problem to be solved now-a-days does not revolve around search and discovery, but rather use and context.

“Thank you Serials Solutions and Library Journal for the opportunity to share some of my ideas.”

How to make a book (#1 of 3)

Eric Lease Morgan — Sun, 23 Aug 2009 21:13:55 +0000

This is a series of posts where I will describe and illustrate how to make books. In this first post I will show you how to make a book with a thermo-binding machine. In the second post I will demonstrate how to make a book by simply tearing and folding paper. In the third installment, I will make a traditional book with a traditional cover and binding. The book — or more formally, the codex — is a pretty useful format for containing information.

Fellowes TB 250 thermo-binding machine

The number of full text books found on the Web is increasing at a dramatic pace. A very large number of these books are in the public domain and freely available for downloading. While computers make it easy to pick through smaller parts of books, it is difficult to read and understand them without printing. Once they are printed you are then empowered to write in the margins, annotate them as you see fit, and share them with your friends. On the other hand, reams of unbound paper are difficult to handle. What to do?

Enter a binding machine, specifically a thermo-binding machine like the Fellowes TB 250. This handy-dandy gizmo allows you to print bunches o’ stuff, encase it in inexpensive covers, and bind it into books. Below is an outline of the binding process and a video demonstration is also available online:

Buy the hardware – The machine costs less than $100 and is available from any number of places on the Web. Be sure to purchase covers in a variety of sizes.
Print and gather your papers – Be sure to “jog” your paper nice and neatly.
Turn the machine on – This makes the heating element hot.
Place the paper into the cover – The inside of each cover’s spine is a ribbon of glue. Make sure the paper is touching the glue.
Place the book into the binder – This melts the glue.
Remove the book, and press the glue – The larger the book the more important it is to push the adhesive into the pages.
Go to Step #5, at least once – This makes the pages more secure in the cover.
Remove, and let cool – The glue is hot. Let it set.
Enjoy your book – This is the fun part. Read and scribble in your book to your heart’s content.

Binding and the Alex Catalogue

The Alex Catalogue of Electronic Texts is a collection of full text books brought together for the purposes of furthering a person’s liberal arts education. While it supports tools for finding, analyzing, and comparing texts, the items are intended to be read in book form as well. Consider printing and binding the PDF or fully transcribed versions of the texts. Your learning will be much more thorough, and you will be able to do more “active” reading.

Binding and libraries

Binding machines are cheap, and they facilitate a person’s learning by enabling users to organize their content. Maybe providing a binding service for library patrons is apropos? Make it easy for people to print things they find in a library. Make it easy for them to use some sort of binding machine. Enable them to take more control over the stuff of their learning, teaching, and research. It certainly sounds like a good idea to me. After all, in this day and age, libraries aren’t so much about providing access to information as they are about making information more useful. Binding, or books on demand, is just one example.

Book review of Larry McMurtry’s Books

Eric Lease Morgan — Sun, 23 Aug 2009 14:01:15 +0000

I read with interest Larry McMurtry’s Books: A Memoir (Simon & Schuster, 2008), but from my point of view, I would be lying if I said I thought the book had very much to offer.

The book’s 259 pages are divided into 109 chapters. I was able to read the whole thing in six or seven sittings. It is an easy read, but only because the book doesn’t say very much. I found the stories rarely engaging and never very deep. They were full of obscure book titles and the names of “famous” book dealers.

Much of this should not be a surprise, since the book is about one person’s fascination with books as objects, not books as containers of information and knowledge. From page 38 of my edition:

Most young dealers of the Silicon Chip Era regard a reference library as merely a waste of space. Old-timers on the West Coast, such as Peter Howard of Serendipity Books in Berkeley or Lou and Ben Weinstein of the (recently closed) Heritage Books Shop in Los Angeles, seem to retain a fondness of reference books that goes beyond the practical. Everything there is to know about a given volume may be only a click away, but there are still a few of us who’d rather have the book than the click. A bookman’s love of books is a love of books, not merely the information in them.

Herein lies the root of my real problem with the book: it shares with the reader one person’s chronology of a love of books and book selling. It describes various used bookstores and gives you an idea of what it is like to be a book dealer. Unfortunately, I believe McMurtry misses the point about books. They are essentially a means to an end. A tool. A medium for the exchange of ideas. The ideas they contain and the way they contain them are the important thing. There are advantages & disadvantages to the book as a technology, and these advantages & disadvantages ought not be revered or exaggerated to dismiss the use of books or computers.

I also think McMurtry’s perception of libraries, which seems to be commonly held in and outside my profession, points to one of librarianship’s pressing issues. From page 221:

But they [computers] don’t really do what books do, and why should they usurp the chief function of a public library, which is to provide readers access to books? Books can accommodate the proximity of computers but it doesn’t seem to work the other way around. Computers now literally drive out books from the place they should, by definition, be books’ own home: the library.

Is the chief function of a public library to provide readers access to books? Are libraries defined as the “home” of books? Such a perception may have been more or less true in an environment where data, information, and knowledge were physically manifested, but in an environment where the access to information is increasingly digital the book as a thing is not as important. Books are not central to the problems to be solved.

Can computers do what books do? Yes and no. Computers can provide access to information. They make it easier to “slice and dice” their content. They make it easier to disseminate content. They make information more findable. The information therein is trivial to duplicate. On the other hand, books require very little technology. They are relatively independent of other technologies, and therefore they are much more portable. Books are easy to annotate. Just write on the text or scribble in the margin. A person can browse the contents of a book much faster than the contents of electronic text. Moreover, books are owned by their keepers, not licensed, which is increasingly the case with digitized material. There are advantages & disadvantages to both computers and books. One is not necessarily better than the other. Each has their place.

As a librarian, I had trouble with the perspectives of Larry McMurtry’s Books: A Memoir. It may be illustrative of the perspectives of book dealers, book sellers, etc., but I think the perspective misses the point. It is not so much about the book as much as it is about what the book contains and how those contents can be used. In this day and age, access to data and information abounds. This is a place where libraries increasingly have little to offer because libraries have historically played the role of middleman. Producers of information can provide direct access to their content much more efficiently than libraries. Consequently a different path for libraries needs to be explored. What does that path look like? Well, I certainly have ideas about that one, but that is a different essay.

Browsing the Alex Catalogue

Eric Lease Morgan — Sat, 22 Aug 2009 01:51:44 +0000

The Alex Catalogue is browsable by author names, subject tags, and titles. Just select a browsable list, then a letter, and finally an item.

Browsability is an important feature of any library catalog. It gives you an opportunity to see what the collection contains without entering a query. It is also possible to use browsability to identify similar names, terms, or titles. “Oh look, I hadn’t thought of that idea, and look at the alternative spellings I can use.”

Creating the browsable list is rather trivial. Since all of the underlying content is saved in a relational database, it is rather easy to loop through the fields of “controlled” vocabulary terms and “authority” lists to identify matching etext titles. These lists include:

The latter is probably the most interesting since it gives you an idea of the most common words and two-word phrases used in the corpus. For example, look at the list of words starting with the letter “k” and all the ways the word “kant” has been extracted from the collection.

Indexing and searching the Alex Catalogue

Eric Lease Morgan — Tue, 18 Aug 2009 01:23:59 +0000

The Alex Catalogue of Electronic Texts uses state-of-the-art software to index both the metadata and full text of its content. While the interface accepts complex Boolean queries, it is easier to enter a single word, a number of words, or a phrase. The underlying software will interpret what you enter and do much of the hard query syntax work for you.

Indexing

The Catalogue consists of a number of different types of content harvested from different repositories. Most of the content is in the form of electronic texts (“etexts” as opposed to “ebooks”). Think Project Gutenberg, but also items from a defunct gopher archive from Virginia Tech, and more recently digitized materials from the Internet Archive. All of these items benefit from metadata and full text indexing. In other words, things like title words, author names, and computer-generated subject tags are made searchable as well as the full texts of the items.

The collection is supplemented with additional materials such as open access journal titles, open access journal article titles, some content from the HathiTrust, as well as photographs taken by myself. Presently the full text of these secondary items is not included, just metadata: titles, authors, notes, and subjects. Search results return pointers to the full texts.

Regardless of content type, all metadata and full text is managed in an underlying MyLibrary database. To make the content searchable reports are written against the database and fed to Solr/Lucene for indexing. The Solr/Lucene data structure is rather simple consisting only of a number of Dublin Core-like fields, a default search field, and three facets (creator, subject/tag, and sub-collection). From a 30,000 foot view, this is the process used to index the content of the Catalogue:

extract metadata and full text records from the database
map each record’s fields to the Solr/Lucene data structure
insert each record into Solr/Lucene; index the record
go to Step #1 until all records have been indexed
optimize the index for faster retrieval
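The extract-and-map steps above can be sketched without a running Solr server. The code below is an illustration under stated assumptions, not the Catalogue's Perl code: the record keys (id, title, author, tags, collection, fulltext), the Solr field names, and both function names are hypothetical, and the output is simply a series of JSON arrays of documents of the kind Solr's JSON update handler accepts. Actually sending the batches and optimizing the index are left out because they require a live server.

```python
import json

# Hypothetical mapping from database columns to Dublin Core-like Solr fields.
FIELD_MAP = {
    "title": "title",
    "author": "creator",
    "tags": "subject",
    "collection": "collection",
}


def to_solr_doc(record):
    """Map one database record to a flat, Solr-style document."""
    doc = {"id": record["id"]}
    for source, target in FIELD_MAP.items():
        if record.get(source):
            doc[target] = record[source]
    # Build the default search field from the metadata plus the full text.
    parts = [
        record.get("title", ""),
        record.get("author", ""),
        " ".join(record.get("tags", [])),
        record.get("fulltext", ""),
    ]
    doc["text"] = " ".join(part for part in parts if part)
    return doc


def build_update_batches(records, batch_size=100):
    """Yield JSON arrays of documents, batch_size records at a time."""
    batch = []
    for record in records:
        batch.append(to_solr_doc(record))
        if len(batch) == batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:
        # Emit whatever is left over after the last full batch.
        yield json.dumps(batch)
```

Batching is a design choice made here for illustration; the steps above describe inserting one record at a time, which corresponds to a batch_size of 1.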

Solr/Lucene works pretty well, and interfacing with it was made much simpler through the use of a set of Perl modules called WebService::Solr. On the other hand, there are many ways the index could be improved such as implementing facilities for sorting and adding weights to various fields. An indexer’s work is never done.

Searching

Because of people’s expectations, searching the index is a bit more complicated and not as straight-forward, but only because the interface is trying to do you some favors.

Solr/Lucene supports single-word, multiple-word, and phrase searches through the use of single or double quote marks. If multi-word queries are entered without Boolean operators, then a Boolean AND is assumed.

Since people often enter multiple-word queries, and it is difficult to know whether or not they really want to do a phrase search, the Alex Catalogue converts ambiguous multiple-word queries into more robust Boolean queries. For example, a search for “william shakespeare” (sans the quote marks) will get converted into “(william AND shakespeare) OR ‘william shakespeare’” (again, sans the double quote marks) on behalf of the user. This is considered a feature of the Catalogue.
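The query conversion just described can be sketched in a few lines. This is a guess at the logic rather than the Catalogue's actual code: the expand_query name is hypothetical, and so is the test for what counts as "ambiguous" (more than one word, no quote marks, and no upper-case AND, OR, or NOT). The phrase is wrapped in double quote marks here, which is also an assumption about what the query parser expects.

```python
import re

# Upper-case Boolean operators signal that the user already wrote a Boolean query.
OPERATORS = re.compile(r"\b(AND|OR|NOT)\b")


def expand_query(query):
    """Turn an ambiguous multi-word query into (a AND b) OR "a b"."""
    query = query.strip()
    terms = query.split()
    ambiguous = (
        len(terms) > 1
        and '"' not in query
        and "'" not in query
        and not OPERATORS.search(query)
    )
    if not ambiguous:
        # Single words, quoted phrases, and explicit Boolean queries pass through.
        return query
    return '({}) OR "{}"'.format(" AND ".join(terms), " ".join(terms))
```

With this sketch, expand_query("william shakespeare") gives (william AND shakespeare) OR "william shakespeare", while a single word such as kant is returned untouched.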

To some degree Solr/Lucene stems query terms, and consequently searches for “book” and “books” return the same number of hits.

Search results are returned in a relevance ranked order. Some time in the future there will be the option of sorting results by date, author, title, and/or a couple of other criteria. Unlike other catalogs, Alex only has a single display — search results. There is no intermediary detailed display; the Catalogue only displays search results or the full text of the item.

In the hopes of making it easier for the user to refine their search, the results page allows the user to automatically turn queries into subject, author, or title searches. It takes advantage of a thesaurus (WordNet) to suggest alternative queries. The system returns “facets” (author names, subject tags, or material types) allowing the user to limit their query with additional terms and narrow search results. The process is not perfect and there are always ways of improving the interface. Usability is never done either.

Summary

Do not try to out think the Alex Catalogue. Enter a word or two. Refine your query using the links on the resulting page. Read & enjoy the discovered texts. Repeat.

Microsoft Surface at Ball State

Eric Lease Morgan — Fri, 14 Aug 2009 19:17:21 +0000

A number of colleagues from the University of Notre Dame and I visited folks from Ball State University and Ohio State University to see, touch, and discuss all things Microsoft Surface.

There were plenty of demonstrations surrounding music, photos, and page turners. The folks of Ball State were finishing up applications for the dedication of the new “information commons”. These applications included an exhibit of orchid photos and an interactive map. Move the scroll bar. Get a different map based on time. Tap locations. See pictures of buildings. What was really interesting about the latter was the way it pulled photographs from the library’s digital repository through sets of Web services. A very nice piece of work. Innovative and interesting. They really took advantage of the technology as well as figured out ways to reuse and repurpose library content. They are truly practicing digital librarianship.

The information commons was nothing to sneeze at either. Plenty of television cameras, video screens, and multi-national news feeds. Just right for a school with a focus on broadcasting.

Ball State University. Hmm…

Automatic metadata generation

Eric Lease Morgan — Fri, 31 Jul 2009 02:22:02 +0000

I have been having a great deal of success extracting keywords and two-word phrases from documents and assigning them as “subject headings” to electronic texts: automatic metadata generation. In many cases, but not all, the set of assigned keywords I’ve created is just as good as, if not better than, the controlled vocabulary terms assigned by librarians.

The problem

The Alex Catalogue is a collection of roughly 14,000 electronic texts. The vast majority come from Project Gutenberg. Some come from the Internet Archive. The smallest number come from a defunct etext collection of Virginia Tech. All of the documents are intended to surround the themes of American and English literature and Western philosophy.

With the exception of the non-fiction works from the Internet Archive, none of the electronic texts were associated with subject-related metadata. With the exception of author names (which are yet to be “well-controlled”), it has been difficult to learn the “aboutness” of each of the documents. Such a thing is desirable for two reasons: 1) to enable the reader to evaluate the relevance of a document, and 2) to provide a browsable interface to the collection. Without some sort of tags, subject headings, or application of clustering techniques, browsability is all but impossible. My goal was to solve this problem in an automated manner.

The solution

A couple of years ago I used tools such as Lingua::EN::Summarize and Open Text Summarizer to extract keywords and summaries from the etexts and assign them as subject terms. The process worked, but not extraordinarily well. I then learned about Term Frequency Inverse Document Frequency (TFIDF) to calculate “relevance”, and T-Score to calculate the probability of two words appearing side-by-side — bi-grams or two-word phrases. Applying these techniques to the etexts of the Alex Catalogue I have been able to create and add meaningful subject “tags” to each of my documents which then paves the way to browsability. Here is the algorithm I used to implement the solution:

Collect documents – This was done through various harvesting techniques. Etexts are saved to the local file system and what metadata does exist gets saved to a database.
Index the collection – Each of the documents is full-text indexed. Not only does this facilitate Steps #3 and #4, below, it makes the collection searchable.
Calculate a relevancy score (TFIDF) for each word – With the exception of parsing each etext into a set of “words”, counting the number of words in a document and the frequency of each word is easy. Determining the total number of documents in the collection is trivial. Searching the index for each word and getting back the number of documents in which it appears is the work of the indexer. With these four values (number of words in a document, frequency of a word in a document, the number of total documents, and the number of documents where the word appears) TFIDF can be calculated for each word.
Calculate a relevancy score for each bi-gram – Instead of extracting words from an etext, bi-grams (two-word phrases) are extracted and TFIDF is calculated for each of them, just like Step #3.
Save – If the score for a word or bi-gram is greater than an arbitrarily denoted lower bound, and if the word or bi-gram is not a stop word, then assign the word or bi-gram to the etext. This step was the most time-consuming. It required many dry runs of the algorithm to determine an optimal lower bound as well as a set of stop words. The lower the bound, the greater the number of words and phrases returned, but as the number of words and phrases increases their apparent usefulness decreases. The words become too common among the controlled vocabulary. At the other end of the scale, a stop word list needed to be created to remove meaningless words and phrases. The stop word problem was complicated in Project Gutenberg texts because of the “fine print” and legalese in most of the documents, and by the OCRed (optical character recognized) text from the Internet Archive. Words like “thofe” where the “f” was really an “s” needed to be removed.
Go to Step #3 for each document in the collection.
Done.
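The scoring and saving steps (Steps #3 and #5) can be sketched in Perl. This is only a minimal illustration, not the code used against the Alex Catalogue. The tfidf and keywords subroutines, the toy corpus, the in-memory %documents hash (standing in for the indexer), and the specific formula (term frequency multiplied by the natural log of total documents over matching documents) are all assumptions; the posting names the four input values but not the exact calculation.

```perl
use strict;
use warnings;

# tfidf: ( occurrences of word / words in document ) * log( total documents / documents containing word )
# (assumed formula; one common way to combine the four values)
sub tfidf {
    my ( $frequency, $length, $total, $containing ) = @_;
    return 0 unless $length && $containing;
    return ( $frequency / $length ) * log( $total / $containing );
}

# toy corpus; a hypothetical stand-in for the harvested etexts
my %corpus = (
    twain   => 'tom and huck went to the river and tom laughed',
    thoreau => 'the pond and the pine trees stood by the pond',
    mill    => 'the science of mankind and the philosophy of science',
);

# for each word, count the number of documents containing it (the indexer's job)
my %documents;
foreach my $id ( keys %corpus ) {
    my %seen = map { $_ => 1 } split ' ', $corpus{ $id };
    $documents{ $_ }++ foreach keys %seen;
}

# keywords: return a reference to a hash of word => score for words
# scoring above a lower bound and not found in a stop word list
sub keywords {
    my ( $text, $bound, $stopwords ) = @_;
    my @words = split ' ', $text;
    my %frequency;
    $frequency{ $_ }++ foreach @words;
    my %scores;
    foreach my $word ( keys %frequency ) {
        next if $$stopwords{ $word };
        my $score = tfidf( $frequency{ $word }, scalar( @words ), scalar( keys %corpus ), $documents{ $word } );
        $scores{ $word } = $score if $score > $bound;
    }
    return \%scores;
}

# list the keywords of one document, highest score first
my %stopwords = map { $_ => 1 } qw( the and of to by );
my $scores    = keywords( $corpus{ twain }, 0.1, \%stopwords );
foreach ( sort { $$scores{ $b } <=> $$scores{ $a } or $a cmp $b } keys %$scores ) {
    printf "%.3f\t%s\n", $$scores{ $_ }, $_;
}
```

Words appearing in every document, such as “and”, score zero regardless of the stop word list, which is why raising or lowering the bound changes the number of terms returned. The same approach applies to bi-grams by counting two-word phrases instead of single words.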

The results

Through this process I discovered a number of things.

First, with regard to fictional works, the words or phrases returned are often proper nouns, and these were usually the names of characters from the work. An excellent example is Mark Twain’s Adventures of Huckleberry Finn whose currently assigned terms include: huck, tom, joe, injun joe, aunt polly, tom sawyer, muff potter, and injun joe’s.

Second, with regard to works of non-fiction, the words and phrases returned are also nouns, and these are objects referred to often in the etext. A good example is John Stuart Mill’s Auguste Comte and Positivism where the assigned words are: comte, phaenomena, metaphysical, science, mankind, social, scientific, philosophy, and sciences.

Third, automatically generated keywords and phrases were often just as useful as the librarian-assigned Library of Congress Subject Headings. Many of the items harvested from the Internet Archive were complete with MARC records. Some of those records included subject headings. During Step #5 (above), I spent time observing the output and comparing it to previously assigned terms. Take for example a work called Universalism in America: A History by Richard Eddy. Its assigned headings included:

Universalism United States History
Unitarian Universalist churches United States

My automatically generated terms/phrases are:

universalist
ballou
hosea ballou
boston
universalist church
sermon
convention
first universalist
universalist quarterly
doctrine
universalist society
restorationist controversy
thomas whittemore
delivered
abner kneeland
sermon delivered
church
universalist meeting
universalist magazine
universal salvation
america
hosea ballon
vers alism
edward turner
general convention
universalism

Granted, the generated list is not perfect. For example, Hosea Ballou is mentioned twice, and the second mention (“hosea ballon”) was probably caused by an OCR error. On the other hand, how was a person to know that Hosea Ballou was even a part of the etext if it weren’t for this process? The same goes for the other people: Thomas Whittemore, Abner Kneeland, and Edward Turner. In defense of controlled vocabulary, the terms “church”, “sermon”, “doctrine”, and “america” could all be assumed from the (rather) hierarchical nature of LCSH, but unless a person understands the nature of LCSH such a thing is not obvious.

As a librarian I understand the power of a controlled vocabulary, but since I am not limited to three to five subject headings per entry, and because controlled vocabularies are often very specific, I have retained the LCSH in each record whenever possible. The more the merrier.

Next steps

Now that the collection has richer metadata, the next steps will be to exploit it. Some of those next steps include:

Normalize the data – Each of the subjects is currently saved in a single database field. They need to be normalized across the database to enable database joins and make it easier to generate reports.
Create a browsable interface – Write a set of static Web pages linking keywords and phrases to etexts. This will make it easier to see at a glance the type of content in the collection.
Re-index – Trivial. Send all the data and metadata back to the indexer ultimately improving the precision/recall ratio.
Enhance search experience – Extract the keywords and phrases from search results and display them to the user. Make them linkable to easily “find more like this one.” Extract the same keywords and phrases and use them to implement the increasingly popular browsable facets feature.
Enhance linked data – Generate a report against the database to create (better) RDF files complete with more meaningful (subject) tags. Link these tags to external vocabularies such as WordNet through the use of linked data, thus contributing to the Semantic Web and enabling others to benefit from my labors. (Infomotions Man says, “Give back to the ’Net.”)

Fun! Combining traditional librarianship with computer applications; not automating existing workflows as much as exploiting the inherent functions of a computer. Using mathematics to solve large-scale problems. Making it easier to do learning and research. It is not the what of librarianship that needs to change as much as the how.

Alex on Google

Eric Lease Morgan — Fri, 24 Jul 2009 11:41:06 +0000

I don’t exactly know how or why Google sometimes creates nice little screen shots of Web home pages, but it created one for my Alex Catalogue of Electronic Texts. I’ve seen them for other sites on the Web, and some of them even contain search boxes.

I wish I could get Google to make one of these for a greater number of my sites, and I wish I could get the Google Search Appliance to do the same. It is a nifty feature, to say the least.

Top Tech Trends for ALA Annual, Summer 2009

Eric Lease Morgan — Mon, 20 Jul 2009 11:32:22 +0000

This is a list of Top Tech Trends for the ALA Annual Meeting, Summer 2009.*

Green computing

The amount of computing that gets done on our planet has a measurable carbon footprint, and many of us, myself included, do not know exactly how much heat our computers put off and how much energy they consume. With help from some folks from the University of Notre Dame’s Center for Research Computing, I learned my laptop computer spikes at 30 watts on boot, slows down to 20 watts during normal use, idles at 2 watts during sleep, and zooms up to 34 watts when the screen saver kicks in. Just think how much energy and heat your computer consumes and generates while waiting for the nightly update from your systems department. But realistically, it is our servers that make the biggest impact, and while reducing energy consumption is one way to be more green, another is to figure out ways to harness the heat the computers generate. One trend is to put computers in places that need to be heated up, like greenhouses in the winter. Another idea is to put them in places where cool air is exhausted, like building ventilation ducts. What can you do? Turn your computer off when it is not in use since the computer electronics and such are not as sensitive to power on, power off cycles as they used to be.

“Digital Humanities”

There seems to be a growing number of humanities scholars who understand that computers can be applied to their research. See the Digital Humanities Manifesto as an example. With the advent of all the electronic texts being made available, it is not possible to read each and every text individually. In an effort to analyze large corpora more quickly, people can create word clouds against these documents to summarize them. They can extract the statistically significant words and phrases to determine their “aboutness”. They can easily compute Fog, Flesch, and Flesch-Kincaid scores denoting the complexity of documents. (“Remember, ‘Why Johnny can’t read’?”) These people understand that humanities scholarship is not necessarily done in isolation, and the codex is not necessarily the medium of the day. They understand the advantages of open access publishing. For our profession, it is difficult to overstate the number of opportunities this trend affords librarianship. Anybody can find information. What people need now are tools to make information easier to analyze and use.
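As an illustration of how simple such scores are to compute, here is a minimal Perl sketch of the Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The readability and syllables subroutines are hypothetical names, and the syllable counter is a crude assumption (it counts groups of consecutive vowels), so the results are estimates rather than authoritative scores.

```perl
use strict;
use warnings;

# syllables: rough estimate made by counting groups of consecutive vowels
# (a heuristic, not a dictionary lookup; silent letters are miscounted)
sub syllables {
    my $word  = lc shift;
    my $count = () = $word =~ /[aeiouy]+/g;
    return $count || 1;
}

# readability: return ( Flesch Reading Ease, Flesch-Kincaid Grade Level ) for a text
sub readability {
    my $text      = shift;
    my @sentences = grep { /\w/ } split /[.!?]+/, $text;
    my @words     = $text =~ /[A-Za-z']+/g;
    return unless @sentences && @words;
    my $syllables = 0;
    $syllables += syllables( $_ ) foreach @words;
    my $wps = scalar( @words ) / scalar( @sentences );    # words per sentence
    my $spw = $syllables / scalar( @words );              # syllables per word
    return (
        206.835 - 1.015 * $wps - 84.6 * $spw,    # Flesch Reading Ease
        0.39 * $wps + 11.8 * $spw - 15.59,       # Flesch-Kincaid Grade Level
    );
}

my ( $ease, $grade ) = readability( 'The cat sat on the mat. It was a good cat.' );
printf "ease: %.1f, grade: %.1f\n", $ease, $grade;
```

Higher Reading Ease values denote easier texts, while the Grade Level approximates years of schooling needed to read the text.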

Tweeting with Twitter

Microblogging (think Twitter) is definitely hot. In some situations it can be a really useful application of computer technology. Frankly, I think the fascination will wear off and its functionality will become similar to the use of cellphone photographs at news-breaking events. Tweet, tweet, tweet.

Discovery interfaces and mega-indexes

If I were to pick the hottest trend in library technology, it would be the fledgling implementation of large, all-encompassing indexes of journal and book content, integrating mega-indexes into the “discovery” interface. This is exemplified by Serials Solutions’ Summon, hinted at by an OCLC/EBSCO collaboration, and thought about by other library vendors. Google Scholar comes close but could benefit by adding more complete bibliographic data of books. OAIster worked for OAI-accessible content but needed to be indexed with a less proprietary tool. The folks at Index Data created something similar and included additional content, but the idea never seemed to catch on. Federated (broadcast) search tried and has yet to fulfill the promise. The driver behind this idea is the knowledge that many data silos don’t meet the needs of our users. Instead people want one box, one button, and one data set. Combine journal bibliographic data with book bibliographic data into a single index (not database). Sort search results by relevance. Provide a set of time-saving services against the result. In order for this technique to work each data set must be normalized into a single data structure and indexed (probably with an open source indexer called Lucene). In other words, there will be a large set of core elements such as title, author, note, subject, etc. All bibliographic data from all sets will be mapped to these fields and what doesn’t fall neatly into any one of them will be mapped to free text fields. Not perfect, not 100 percent, but hugely functional, and it meets users’ expectations. To see how this can be done with the volumes and volumes of medically-related open access content see the good work done by OpenPHI and their HealthLibrarian.
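The normalization step can be sketched in Perl. This is only an illustration under stated assumptions: the %mappings table, the normalize subroutine, the article field names, and the sample records are all invented for the example, and a real implementation would hand the resulting structures to an indexer instead of printing them.

```perl
use strict;
use warnings;

# hypothetical mappings from source-specific field names to core elements
my %mappings = (
    book    => { '245a' => 'title', '100a' => 'author', '650a' => 'subject', '500a' => 'note' },
    article => { atitle => 'title', au => 'author', keyword => 'subject' },
);

# normalize: map a record (hash reference) to the core elements;
# values that do not fall neatly into any element go to a free text field
sub normalize {
    my ( $type, $record ) = @_;
    my %core = ( type => $type, text => [] );
    foreach my $field ( sort keys %$record ) {
        my $element = $mappings{ $type }{ $field };
        if ( $element ) { push @{ $core{ $element } }, $$record{ $field } }
        else            { push @{ $core{ text } },     $$record{ $field } }
    }
    return \%core;
}

# two invented records, one per data set, reduced to the same structure
my $book    = normalize( 'book',    { '245a' => 'Walden', '100a' => 'Thoreau, Henry David', '300a' => '357 p.' } );
my $article = normalize( 'article', { atitle => 'On text mining', au => 'Doe, Jane', issn => '1234-5678' } );
print join( '; ', @{ $$book{ title } }, @{ $$article{ title } } ), "\n";
```

Because both records end up with the same keys, a single index can search and sort them together, which is the point of the “one box, one button, and one data set” approach.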

* This posting was originally “published” as a part of litablog.org, and it is duplicated here because many copies keep stuff safe.

Mass Digitization Mini-Symposium: A Reverse Travelogue

Eric Lease Morgan — Wed, 01 Jul 2009 17:23:48 +0000

The Professional Development Committee of the Hesburgh Libraries at the University of Notre Dame sponsored a “mini-symposium” on the topic of mass digitization on Thursday, May 21, 2009. This text documents some of what the speakers had to say. Given the increasingly wide availability of free full text information provided through mass digitization, the forum offered an opportunity for participants to learn how such a thing might affect learning, teaching, and scholarship.*

Setting the Stage

Presenters and organizers

After introductions by Leslie Morgan, I gave a talk called “Mass digitization in 15 minutes” where I described some of the types of library services and digital humanities processes that could be applied to digitized literature. “What might libraries be like if 51% or more of our collections were available in full text?”

Maura Marx

The Symposium really got underway with the remarks of Maura Marx (Executive Director of the Open Knowledge Commons) in a talk called “Mass Digitization and Access to Books Online.” She began by giving an overview of mass digitization (such as the efforts of the Google Books Project and the Internet Archive) and compared it with large-scale digitization efforts. “None of this is new,” she said, and gave examples including Project Gutenberg, the Library of Congress Digital Library, and the Million Books Project. Because the Open Knowledge Commons is an outgrowth of the Open Content Alliance, she was able to describe in detail the mechanical digitizing process of the Internet Archive with its costs approaching 10¢/page. Along the way she advocated the HathiTrust as a preservation and sharing method, and she described it as a type of “radical collaboration.” “Why is mass digitization so important?” She went on to list and elaborate upon six reasons: 1) search, 2) access, 3) enhanced scholarship, 4) new scholarship, 5) public good, and 6) the democratization of information.

The second half of Ms. Marx’s presentation outlined three key issues regarding the Google Books Settlement. First, the settlement will give Google a sort of “most favored nation” status because it prevents Google from getting sued in the future, but it does not protect other possible digitizers the same way. Second, it circumvents, through contract law, the problem of orphan works; the settlement sidesteps many of the issues regarding copyright. Third, the settlement is akin to a class action suit, but in reality the majority of people affected by the suit are unknown since they fall into the class of orphan works holders. To paraphrase, “How can a group of unknown authors and publishers pull together a class action suit?”

She closed her presentation with a more thorough description of the Open Knowledge Commons’ agenda, which includes: 1) the production of digitized materials, 2) the preservation of said materials, and 3) the building of tools to make the materials increasingly useful. Throughout her presentation I was repeatedly struck by the idea of the public good the Open Knowledge Commons was trying to create. At the same time, her ideas were not so naive as to ignore the new business models that are coming into play and the necessity for libraries to consider new ways to provide library services. “We are a part of a cyber infrastructure where the key word is ‘shared.’ We are not alone.”

Gary Charbonneau

Gary Charbonneau (Systems Librarian, Indiana University – Bloomington) was next and gave his presentation called “The Google Books Project at Indiana University”.

Indiana University, in conjunction with a number of other CIC (Committee on Institutional Cooperation) libraries, has begun working with Google on the Google Books Project. Like many previous Google Book Partners, Charbonneau was not authorized to share many details regarding the Project; he was only authorized “to paint a picture” with the metaphoric “broad brush.” He described the digitization process as rather straightforward: 1) pull books from a candidate list, 2) charge them out to Google, 3) put the books on a truck, 4) wait for them to return in a few weeks or so, and 5) charge the books back into the library. In return for this work they get: 1) attribution, 2) access to snippets, and 3) sets of digital files which are in the public domain. About 95% of the works are still under copyright and none of the books come from their rare book library, the Lilly Library.

Charbonneau thought the real value of the Google Book search was the deep indexing, something mentioned by Marx as well.

Again, not 100% of the library’s collection is being digitized, but there are plans to get closer to that goal. For example, they are considering plans to digitize their “Collections of Distinction” as well as some of their government documents. Like Marx, he advocated the HathiTrust but he also suspected commercial content might make its way into its archives.

One of the more interesting things Charbonneau mentioned was in regard to URLs. Specifically, there are currently no plans to insert the URLs of digitized materials into the 856 $u field of MARC records denoting the location of items. Instead they plan to use an API (application programming interface) to display the location of files on the fly.

Indiana University hopes to complete their participation in the Google Books Project by 2013.

Sian Meikle

The final presentation of the day was given by Sian Meikle (Digital Services Librarian, University of Toronto Libraries) whose comments were quite simply entitled “Mass Digitization.”

The massive (no pun intended) University of Toronto library system consisting of a whopping 18 million volumes spread out over 45 libraries on three campuses began working with the Internet Archive to digitize books in the Fall of 2004. With their machines (the “scribes”) they are able to scan about 500 pages/hour and, considering the average book is about 300 pages long, they are scanning at a rate of about 100,000 books/year. Like Indiana and the Google Books Project, not all books are being digitized. For example, they can’t be too large, too small, brittle, tightly bound, etc. Of all the public domain materials, only 9% or so do not get scanned. Unlike the output of the Google Book Project, the deliverables from their scanning process include images of the texts, a PDF file of the text, an OCRed version of the text, a “flip book” version of the text, and a number of XML files complete with various types of metadata.

Considering Meikle’s experience with mass digitized materials, she was able to make a number of observations and distinctions. For example, we — the library profession — need to understand the difference between “born digital” materials and digitized materials. Because of formatting, technology, errors in OCR, etc, the different manifestations have different strengths and weaknesses. Some things are more easily searched. Some things are displayed better on screens. Some things are designed for paper and binding. Another distinction is access. According to some of her calculations, materials that are in electronic form get “used” more than their printed form. In this case “used” means borrowed or downloaded. Sometimes the ratio is as high as 300-to-1. There are three hundred downloads to one borrow. Furthermore, she has found that proportionately, English language items are not used as heavily as materials in other languages. One possible explanation is that material in other languages can be harder to locate in print. Yet another difference is the type of reading one format offers over another; compare and contrast “intentional reading” with “functional reading.” Books on computers make it easy to find facts and snippets. Books on paper tend to lend themselves better to the understanding of bigger ideas.

Lastly, Meikle alluded to ways the digitized content will be made available to users. Specifically, she imagines it will become a part of an initiative called the Scholar’s Portal — a single index of journal article literature, full text books, and bibliographic metadata. In my mind, such an idea is the heart of the “next generation” library catalog.

Summary and Conclusion

The symposium was attended by approximately 125 people. Most were from the Hesburgh Libraries of the University of Notre Dame. Some were from regional libraries. There were a few University faculty in attendance. The event was a success in that it raised the awareness of what mass digitization is all about, and it fostered communication during the breaks as well as after the event was over.

The opportunities for librarianship and scholarship in general are almost boundless considering the availability of full text content. The opportunities are even greater when the content is free of licensing restrictions. While the idea of complete collections totally free of restrictions is a fantasy, the idea of significant amounts of freely available full text content is easily within our grasp. During the final question and answer period, someone asked, “What skills and resources are necessary to do this work?” The answer was agreed upon by the speakers, “What is needed? An understanding that the perfect answer is not necessary prior to implementation.” There were general nods of agreement from the audience.

Now is a good time to consider the possibilities of mass digitization and to be prepared to deal with them before they become the norm as opposed to the exception. This symposium, generously sponsored by the Hesburgh Libraries Professional Development Committee, as well as library administration, provided the opportunity to consider these issues. “Thank you!”

Notes

* This posting was originally “published” as a part of the Hesburgh Libraries of the University of Notre Dame website, and it is duplicated here because “Lots of copies keep stuff safe.”

Lingua::EN::Bigram (version 0.01)

Eric Lease Morgan — Tue, 23 Jun 2009 13:41:36 +0000

Below is the POD (Plain O’ Documentation) file describing a Perl module I wrote called Lingua::EN::Bigram.

The purpose of the module is to: 1) extract all of the two-word phrases from a given text, and 2) rank each phrase according to its probability of occurrence. Very nice for doing textual analysis. For example, by applying this module to Mark Twain’s Adventures of Tom Sawyer it becomes evident that the significant two-word phrases are names of characters in the story. On the other hand, Ralph Waldo Emerson’s Essays: First Series returns action statements (instructions). And Henry David Thoreau’s Walden returns “walden pond” and descriptions of pine trees. Interesting.

The code is available here or on CPAN.

NAME

Lingua::EN::Bigram – Calculate significant two-word phrases based on frequency and/or T-Score

SYNOPSIS

  use Lingua::EN::Bigram;
  $bigram = Lingua::EN::Bigram->new;
  $bigram->text( 'All men by nature desire to know. An indication of this...' );
  $tscore = $bigram->tscore;
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {
  
    print "$$tscore{ $_ }\t" . "$_\n";
  
  }

DESCRIPTION

This module is designed to: 1) pull out all of the two-word phrases (collocations or “bigrams”) in a given text, and 2) list these phrases according to their frequency and/or T-Score. Using this module it is possible to create a list of the most common two-word phrases in a text as well as order them by their probable occurrence, thus implying significance.

METHODS

new

Create a new, empty bigram object:

  # initialize
  $bigram = Lingua::EN::Bigram->new;

text

Set or get the text to be analyzed:

  # set the attribute
  $bigram->text( 'All good things must come to an end...' );
  
  # get the attribute
  $text = $bigram->text;

words

Return a list of all the tokens in a text. Each token will be a word or punctuation mark:

  # get words
  @words = $bigram->words;

word_count

Return a reference to a hash whose keys are a token and whose values are the number of times the token occurs in the text:

  # get word count
  $word_count = $bigram->word_count;
  
  # list the words according to frequency
  foreach ( sort { $$word_count{ $b } <=> $$word_count{ $a } } keys %$word_count ) {
  
    print $$word_count{ $_ }, "\t$_\n";
  
  }

bigrams

Return a list of all bigrams in the text. Each item will be a pair of tokens and the tokens may consist of words or punctuation marks:

  # get bigrams
  @bigrams = $bigram->bigrams;

bigram_count

Return a reference to a hash whose keys are a bigram and whose values are the frequency of the bigram in the text:

  # get bigram count
  $bigram_count = $bigram->bigram_count;
  
  # list the bigrams according to frequency
  foreach ( sort { $$bigram_count{ $b } <=> $$bigram_count{ $a } } keys %$bigram_count ) {
  
    print $$bigram_count{ $_ }, "\t$_\n";
  
  }

tscore

Return a reference to a hash whose keys are a bigram and whose values are a T-Score, a probabilistic calculation determining the significance of a bigram occurring in the text:

  # get t-score
  $tscore = $bigram->tscore;
  
  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {
  
    print "$$tscore{ $_ }\t" . "$_\n";
  
  }
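To make the calculation more concrete, here is a minimal, self-contained Perl sketch of a T-Score computed from word and bigram counts. It is not the module's source code. The tscores subroutine is a hypothetical name, and the formula used (observed bigram count, minus the count expected if the two words were independent, divided by the square root of the observed count) is a commonly cited approximation; whether the module computes the value in exactly this way is an assumption.

```perl
use strict;
use warnings;

# tscores: given a list of tokens, return a reference to a hash of bigram => t-score
sub tscores {
    my @tokens = @_;
    my $n      = scalar @tokens;
    my ( %words, %bigrams );

    # count each token and each pair of adjacent tokens
    $words{ $_ }++ foreach @tokens;
    $bigrams{ "$tokens[ $_ ] $tokens[ $_ + 1 ]" }++ foreach 0 .. $#tokens - 1;

    my %tscore;
    foreach my $bigram ( keys %bigrams ) {
        my ( $first, $second ) = split ' ', $bigram;
        # the count expected if the two words occurred independently of each other
        my $expected = $words{ $first } * $words{ $second } / $n;
        $tscore{ $bigram } = ( $bigrams{ $bigram } - $expected ) / sqrt( $bigrams{ $bigram } );
    }
    return \%tscore;
}

# list bigrams according to t-score
my @tokens = split ' ', lc 'Tom Sawyer saw Aunt Polly and Aunt Polly saw Tom Sawyer';
my $tscore = tscores( @tokens );
foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } or $a cmp $b } keys %$tscore ) {
    printf "%.3f\t%s\n", $$tscore{ $_ }, $_;
}
```

In this toy sentence the repeated names (“tom sawyer” and “aunt polly”) rise to the top, ahead of pairs that occur only once.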

DISCUSSION

Given the increasing availability of full text materials, this module is intended to help “digital humanists” apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the word_count method and allow the user to search for those words in a concordance. The bigram_count method simply returns the frequency of a given bigram, but the tscore method can order them in a more finely tuned manner.

Consider using T-Score-weighted bigrams as classification terms to supplement the “aboutness” of texts. Concatenate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.

Each bigram includes punctuation. This is intentional. Developers may want to remove bigrams containing such values from the output. Similarly, no effort has been made to remove commonly used words (stop words) from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution comes with a script (bin/bigrams.pl) demonstrating how to remove punctuation and stop words from the displayed output.
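The filtering described above might look something like the following. This is a minimal sketch, not the contents of bin/bigrams.pl. It assumes each bigram is a string of two tokens separated by a space, and both the meaningful subroutine and the abbreviated stop word list are hypothetical.

```perl
use strict;
use warnings;

# hypothetical, abbreviated stop word list
my %stopwords = map { $_ => 1 } qw( a an and of the to in is it );

# meaningful: true if no token of a bigram contains punctuation or is a stop word
# (note that contractions such as "don't" are rejected too, because of the apostrophe)
sub meaningful {
    my @tokens = split ' ', lc shift;
    foreach ( @tokens ) {
        return 0 if /[[:punct:]]/ or $stopwords{ $_ };
    }
    return 1;
}

# keep only the bigrams worth displaying
my @bigrams = ( 'walden pond', 'of the', 'pond ,', 'pine trees', '. the' );
print join( "\n", grep { meaningful( $_ ) } @bigrams ), "\n";
```

The same test can be applied to the keys of the hashes returned by the bigram_count and tscore methods before they are sorted and printed.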

Finally, this is not the only module supporting bigram extraction. See also Text::NSP which supports n-gram extraction.

TODO

There are probably a number of ways the module can be improved:

the constructor method could take a scalar as input, thus reducing the need for the text method
the distribution’s license should probably be changed to the Perl Artistic License
the addition of alternative T-Score calculations would be nice
it would be nice to support n-grams
make sure the module works with character sets beyond ASCII

ACKNOWLEDGEMENTS

T-Score is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. Page 109.

AUTHOR

Eric Lease Morgan

Lingua::Concordance (version 0.01)

Eric Lease Morgan — Wed, 10 Jun 2009 17:05:37 +0000

Below is a man page describing a Perl module I recently wrote called Lingua::Concordance (version 0.01).

Given the increasing availability of full text books and journals, I think it behooves the library profession to aggressively explore the possibilities of providing services against text as a means of making the proverbial fire hose of information more useful. Providing concordance-like functions against texts is just one example.

The distribution is available from this blog as well as CPAN.

NAME

Lingua::Concordance – Keyword-in-context (KWIC) search interface

SYNOPSIS

  use Lingua::Concordance;
  $concordance = Lingua::Concordance->new;
  $concordance->text( 'A long time ago, in a galaxy far far away...' );
  $concordance->query( 'far' );
  foreach ( $concordance->lines ) { print "$_\n" }

DESCRIPTION

Given a scalar (such as the content of a plain text electronic book or journal article) and a regular expression, this module implements a simple keyword-in-context (KWIC) search interface — a concordance. Its purpose is to return lists of lines from a text containing the given expression. See the Discussion section, below, for more detail.

METHODS

new

Create a new, empty concordance object:

  $concordance = Lingua::Concordance->new;

text

Set or get the value of the concordance’s text attribute where the input is expected to be a scalar containing some large amount of content, like an electronic book or journal article:

  # set text attribute
  $concordance->text( 'Call me Ishmael. Some years ago- never mind how long...' );

  # get the text attribute
  $text = $concordance->text;

Note: The scalar passed to this method gets internally normalized, specifically, all carriage returns are changed to spaces, and multiple spaces are changed to single spaces.

query

Set or get the value of the concordance’s query attribute. The input is expected to be a regular expression but a simple word or phrase will work just fine:

  # set query attribute
  $concordance->query( 'Ishmael' );

  # get query attribute
  $query = $concordance->query;

See the Discussion section, below, for ways to make the most of this method through the use of powerful regular expressions. This is where the fun is.

radius

Set or get the length of each line returned from the lines method, below. Each line will be padded on the left and the right of the query with the number of characters necessary to equal the value of radius. This makes it easier to sort the lines:

  # set radius attribute
  $concordance->radius( $integer );

  # get radius attribute
  $integer = $concordance->radius;

For terminal-based applications it is usually not reasonable to set this value to greater than 30. Web-based applications can use arbitrarily large numbers. The internally set default value is 20.

sort

Set or get the type of line sorting:

  # set sort attribute
  $concordance->sort( 'left' );

  # get sort attribute
  $sort = $concordance->sort;

Valid values include:

none – the default value; sorts lines in the order they appear in the text — no sorting
left – sorts lines by the (ordinal) word to the left of the query, as defined by the ordinal method, below
right – sorts lines by the (ordinal) word to the right of the query, as defined by the ordinal method, below
match – sorts lines by the value of the query (mostly)

This is good for looking for patterns in texts, such as collocations (phrases, bi-grams, and n-grams). Again, see the Discussion section for hints.

ordinal

Set or get the number of words to the left or right of the query to be used for sorting purposes. The internally set default value is 1:

  # set ordinal attribute
  $concordance->ordinal( 2 );

  # get ordinal attribute
  $integer = $concordance->ordinal;

Used in combination with the sort method, above, this is good for looking for textual patterns. See the Discussion section for more information.

lines

Return a list of lines from the text matching the query. Our raison d’être:

  @lines = $concordance->lines;
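For readers curious about what happens behind the scenes, here is a minimal, self-contained Perl sketch of a keyword-in-context search. It is not the module's source code. The kwic subroutine is a hypothetical name, case-insensitive matching is an assumption, and the radius is interpreted here as the number of characters shown on each side of the match.

```perl
use strict;
use warnings;

# kwic: return lines of context surrounding each match of a query in a text
sub kwic {
    my ( $text, $query, $radius ) = @_;
    $radius = 20 unless defined $radius;

    # normalize whitespace, as the text method is described as doing
    $text =~ s/\s+/ /g;

    my @lines;
    while ( $text =~ /($query)/gi ) {
        # @- and @+ hold the start and end offsets of the captured match
        my $start = $-[ 1 ] - $radius;
        my $left  = $start < 0
                  ? ( ' ' x ( -$start ) ) . substr( $text, 0, $-[ 1 ] )
                  : substr( $text, $start, $radius );
        my $right = substr( $text, $+[ 1 ], $radius );

        # pad on the right so every line has the same length
        push @lines, $left . $1 . $right . ( ' ' x ( $radius - length( $right ) ) );
    }
    return @lines;
}

my $text = 'A long time ago, in a galaxy far far away...';
print "$_\n" foreach kwic( $text, 'far', 10 );
```

Padding each line to a fixed width keeps the matches vertically aligned, which is what makes patterns to the left and right of the query easy to see and sort.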

DISCUSSION

[Elaborate upon a number of things here such as but not limited to: 1) the history of concordances and concordance systems, 2) the usefulness of concordances in the study of linguistics, 3) how to exploit regular expressions to get the most out of a text and find interesting snippets, and 4) how the module might be implemented in scripts and programs.]

BUGS

The internal _by_match subroutine, the one used to sort results by the matching regular expression, does not work exactly as expected. Instead of sorting by the matching regular expression, it sorts by the string exactly to the right of the matched regular expression. Consequently, for queries such as ‘human’, it correctly matches and sorts on human, humanity, and humans, but matches such as Humanity do not necessarily come before humanity.

TODO

Write Discussion section.
Implement error checking.
Fix the _by_match bug.
Enable all of the configuration methods (text, query, radius, sort, and ordinal) to be specified in the constructor.
Require the text and query attributes to be specified as a part of the constructor, maybe.
Remove line-feed characters while normalizing text to accommodate Windows-based text streams, maybe.
Write an example CGI script, to accompany the distribution’s terminal-based script, demonstrating how the module can be implemented in a Web interface.
Write a full-featured terminal-based script enhancing the one found in the distribution.

ACKNOWLEDGEMENTS

The module implements, almost verbatim, the concordance programs and subroutines described in Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. pgs: 169-185. “Thanks Roger. I couldn’t have done it without your book!”

EAD2MARC

Eric Lease Morgan — Fri, 05 Jun 2009 16:28:45 +0000

This posting simply shares three hacks I’ve written to enable me to convert EAD files to MARC records, and ultimately add them to my “discovery” layer, VuFind, for the Catholic Portal:

ead2marcxml.sh – Using xsltproc and a modified version of Terry Reese’s XSL stylesheet, converts all the EAD/.xml files in the current directory into MARCXML files. “Thanks Terry!”
marcxml2marc.sh – Using yaz-marcdump, convert all .marcxml files in the current directory into “real” MARC records.
add-001.pl – A hack to add 001 fields to MARC records. Sometimes necessary since the EAD files do not always have unique identifiers.

The distribution is available in the archives, and distributed under the GNU General Public License.

Now, off to go fishing.

Text mining: Books and Perl modules

Eric Lease Morgan — Thu, 04 Jun 2009 02:14:55 +0000

This posting simply lists some of the books I’ve read and Perl modules I’ve explored in regard to the field of text mining.

Through my explorations of term frequency/inverse document frequency (TFIDF) I became aware of a relatively new field of study called text mining. In many ways, text mining is similar to data mining, only applied to unstructured texts instead of database rows and columns. Think plain text books such as items from Project Gutenberg or the Open Content Alliance. Text mining is a process including automatic classification, clustering (similar to but distinct from classification), indexing and searching, entity extraction (names, places, organizations, dates, etc.), statistically significant keyword and phrase extraction, parts of speech tagging, and summarization.

As a librarian, I found the whole thing extremely fascinating; consequently, I read more.

Books

I have found the following four books helpful. They have enabled me to learn about the principles of text mining.

Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. – Of all the books listed here, this one includes the most Perl programming examples, and it is not as scholarly as the balance of the list. Much of the book surrounds the description of regular expressions against texts. Its strongest suit is the creation of terminal-based concordance scripts. Very nice. Lots of fun. The concordances return very interesting results. The book does describe clustering techniques too, but on the overall topic of automatic metadata generation the book is not very strong.
Konchady, M. (2006). Text mining application programming. Charles River Media programming series. Boston, Mass: Charles River Media. – This book is a readable survey of text mining covering parts of speech (POS) tagging, information extraction, search engines, clustering, classification, summarization, and question/answer processing. Many models for each aspect of text mining are described, compared, and contrasted. To put the author’s knowledge into practice, the book comes with a CD containing a Perl library for text mining, sample applications, and CGI scripts. This library is freely available on the Web.
Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. – Of the four books listed here, this one is probably the most dense. I found its Perl scripts used to parse text more useful than the ones in Bilisoly, but this one included no concordance applications. I also found the description of n-grams, the extraction of multi-word phrases, to be very interesting. I suspect the model they describe can be extended to n number of words. This book also discusses parts of speech (POS) processing but it is the only one that describes how to really parse language. Think semantics, lexicons, discourse, and dialog. After the first couple of chapters the Perl examples disappear and give way to exclusively Prolog examples.
Weiss, S. M. (2005). Text mining: Predictive methods for analyzing unstructured information. New York: Springer. – The complexity of this book lies between Konchady and Nugues; it includes a greater number of mathematical models than Konchady, but it is easier to read than Nugues. Broad topics include textual documents as numeric vectors, using text for prediction, information retrieval, clustering & classification, and looking for information in documents. Each chapter includes a section called “Historical and Bibliographical Remarks” which has proved to be very interesting reading.

When it comes to the process of text mining I found each of these books useful in its own right. Each provided me with ways of reading texts, parsing texts, counting words, counting phrases, and, through the application of statistical analysis, creating lists and readable summaries denoting the “aboutness” of given documents.
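Counting words and counting phrases are the same operation applied at different window sizes. The following is a minimal sketch of that idea, not code taken from any of the books above; the ngrams subroutine and the sample sentence are hypothetical, and the tokenizer is deliberately naive.

```perl
use strict;
use warnings;

# hypothetical helper: count the n-grams of a given size in a text
sub ngrams {

  my ( $text, $n ) = @_;
  my %counts = ();

  # tokenize: lower-case the text and keep runs of letters and apostrophes
  my @words = lc( $text ) =~ /[a-z']+/g;

  # slide a window of $n words across the list of tokens
  for ( my $i = 0; $i + $n <= @words; $i++ ) {
    $counts{ join ' ', @words[ $i .. $i + $n - 1 ] }++;
  }

  return \%counts;

}

my $text    = 'The medium is not the message. The medium is the container.';
my $words   = ngrams( $text, 1 );
my $bigrams = ngrams( $text, 2 );

# list each two-word phrase occurring more than once
foreach ( sort keys %$bigrams ) { print "$$bigrams{ $_ }\t$_\n" if $$bigrams{ $_ } > 1 }
```

Setting the window to three or more words extends the same loop to longer phrases. A real application would probably remove stop words and handle punctuation more carefully.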

Perl modules

As a Perl hacker I am interested in writing scripts putting into practice some of the things I learn. Listed here are a number of modules that have gotten me further along in regard to text mining:

Lingua::EN::Fathom – This library outputs interesting statistics regarding a given document: number of words and the number of times each occurs, number of sentences, complexity of words, number of paragraphs, etc. Of greatest interest are numbers (Fog, Flesch, and Flesch-Kincaid) denoting the readability of the text. Quick. Easy. Useful.
Lingua::EN::Keywords – Given a text, this library outputs a list of what it thinks are the most significant individual words in a document, sans stop words. Not fancy.
Lingua::EN::NamedEntity – Given a text, I believe this library comes pre-trained to extract names, places, and organizations from texts. It returns a Perl data structure listing the probabilities of a word or phrase being any particular entity. It may need to be re-trained to work for your corpus.
Lingua::EN::Semtags::Engine – Given text this module will return words and phrases in a relevancy ranked order. Initially, I have had some problems using this module because it seems to take a long time to return. On the other hand, it looks promising since it returns both individual words as well as phrases.
Lingua::EN::Summarize – Given a text this library returns sentences it thinks encapsulates the essence of the document. The result is readable — grammatically correct. The process it uses to accomplish its task is self-proclaimed as unscientific.
Lingua::EN::Tagger – This library marks up a document in pseudo XML with tags denoting parts of speech in a given document. To do this work it also can extract words, noun phrases, and sentences from a text. Zippy. Probability-based. Developers are expected to parse the tagged output and do analysis against it, such as count the number of times particular parts of speech occur.
Lingua::StopWords – Returns a simple list of stop words. Easy, but I can’t figure out how customizable it is. “One person’s stop word list is another person’s research topic.”
Net::Dict – A network interface to DICT (dictionary) servers. While the DICT protocol is a bit long in the tooth, and not quite as cool as Web interfaces to things like Google or Wikipedia, this module does provide a handy way to look up definitions, a complementary functionality to WordNet.
Text::Aspell – A Perl interface to GNU Aspell which is great for spell-checking applications.
TextMine – This is a set of modules written by Manu Konchady, the author of Text Mining Application Programming. It includes submodules named Cluster, Entity, Index, Pos, Quanda (Q & A), Summary, Tokens, and WordNet. While this set of modules is the most comprehensive I’ve seen, and while they are probably the most theoretically based, interfacing with things like WordNet to be thorough, my initial experience has been a bit frustrating since scripts written against the libraries do not return very quickly. Maybe I’m feeding them documents that are too large, and if so, then the libraries are not necessarily scalable.
WordNet – There are a bevy of modules providing functionality against WordNet — a “lexical database of English… Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.” Any truly thorough text mining application of English will take advantage of WordNet.

Text mining and librarianship

Given the volume of “born digital” material being created, it is not possible to apply traditional library methods against it. The hand-crafted, heavy human touch process is not scalable. Given the amounts of mass digitized text being generated from the Google Books Project and the Open Content Alliance, new opportunities for literary analysis make themselves evident. Again, the traditional library processes can not fill the bill in these regards.

Text mining techniques offer possible solutions to these problems. Count words. Count phrases. Compare these words, phrases, and counts to other texts. Determine their statistical significance. Assign them to documents in the form of subject headings, keywords, author names, and other added entries in our metadata formats. Given large numbers of books, articles, and other “wordy” documents, learn how to “save the time of the reader” by summarizing these documents and ranking them in some sort of order in addition to alphabetical or date. Compare and contrast full text works by learning what words and types of words are used in documents. Are the words religious in nature? Mathematic and scientific? Poetic? Such things will provide additional means for understanding and interpreting everything from scholarly journal articles to works of classic fiction and philosophy. These techniques are not intended to replace existing methods of understanding and organization, but rather to supplement and build upon them. This is an evolutionary process.

If libraries and librarians desire to remain relevant in the evolving information environment, then they will need to do the good work they do differently. The problem to be solved now-a-days is less about access and more about use. Text mining is one way of making the content of libraries more useful.

Internet Archive content in “discovery” systems

Eric Lease Morgan — Tue, 02 Jun 2009 12:59:08 +0000

This quick posting describes how Internet Archive content, specifically content from the Open Content Alliance, can be quickly and easily incorporated into local library “discovery” systems. VuFind is used here as the particular example:

Get keys – The first step is to get a set of keys describing the content you desire. This can be acquired through the Internet Archive’s advanced search interface.
Convert keys – The next step is to convert the keys into sets of URLs pointing to the content you want to download. Fortunately, all the URLs have a similar shape: http://www.archive.org/download/KEY/KEY.pdf, http://www.archive.org/download/KEY/KEY_meta.mrc, or http://www.archive.org/download/KEY/KEY_djvu.txt.
Download – Feed the resulting URLs to your favorite spidering/mirroring application. I use wget.
Update – Enhance the downloaded MARC records with 856$u values denoting the location of your local PDF copy as well as the original (canonical) version.
Index – Add the resulting MARC records to your “discovery” system.
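The “convert keys” step amounts to simple string substitution. The following is a minimal sketch of it, not the keys2urls.pl script from the distribution; the key2urls subroutine is hypothetical, and the literal KEY stands in for a real identifier returned by the advanced search.

```perl
use strict;
use warnings;

# hypothetical helper: given an Internet Archive key, return the URLs
# of the item's PDF file and its MARC record
sub key2urls {

  my $key  = shift;
  my $root = 'http://www.archive.org/download';
  return ( "$root/$key/$key.pdf", "$root/$key/${key}_meta.mrc" );

}

# placeholder list; real keys would be read from the output of a search
my @keys = ( 'KEY' );
foreach my $key ( @keys ) { print "$_\n" foreach key2urls( $key ) }
```

The resulting list, one URL per line, is in a form that a mirroring tool such as wget can read from a file.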

Linked here is a small distribution of shell and Perl scripts that do this work for me and incorporate the content into VuFind. Here is how they can be used:

  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' \
  -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart

Cool next steps would be to use text mining techniques against the downloaded plain text versions of the documents to create summaries, extract named entities, and identify possible subjects. These items could then be inserted into the MARC records to enhance retrieval. Ideally the full text would be indexed, but alas, MARC does not accommodate that. “MARC must die.”

TFIDF In Libraries: Part III of III (For thinkers)

Eric Lease Morgan — Sun, 31 May 2009 20:30:39 +0000

This is the third of the three-part series on the topic of TFIDF in libraries. In Part I the why’s and wherefore’s of TFIDF were outlined. In Part II TFIDF subroutines and programs written in Perl were used to demonstrate how search results can be sorted by relevance and automatic classification can be done. In this last part a few more subroutines and a couple more programs are presented which: 1) weigh search results given an underlying set of themes, and 2) determine similarity between files in a corpus. A distribution including the library of subroutines, Perl scripts, and sample data is available online.

Big Names and Great Ideas

As an intellectual humanist, I have always been interested in “great” ideas. In fact, one of the reasons I became a librarian was because of the profundity of ideas physically located in libraries. Manifested in books, libraries are chock full of ideas. Truth. Beauty. Love. Courage. Art. Science. Justice. Etc. At the same time, it is important to understand that books are not the source of ideas, nor are they the true source of data, information, knowledge, or wisdom. Instead, people are the real sources of these things. Consequently, I have also always been interested in “big names” too. Plato. Aristotle. Shakespeare. Milton. Newton. Copernicus. And so on.

As a librarian and a liberal artist (all puns intended) I recognize many of these “big names” and “great ideas” are represented in a set of books called the Great Books of the Western World. I then ask myself, “Is there some way I can use my skills as a librarian to help support other people’s understanding and perception of the human condition?” The simple answer is to collect, organize, preserve, and disseminate the things (books) manifesting great ideas and big names. This is a lot of what my Alex Catalogue of Electronic Texts is all about. On the other hand, a better answer to my question is to apply and exploit the tools and processes of librarianship to ultimately “save the time of the reader”. This is where the use of computers, computer technology, and TFIDF come into play.

Part II of this series demonstrated how to weigh search results based on the relevancy ranked score of a search term. But what if you were keenly interested in “big names” and “great ideas” as they related to a search term? What if you wanted to know about librarianship and how it related to some of these themes? What if you wanted to learn about the essence of sculpture and how it may (or may not) represent some of the core concepts of Western civilization? To answer such questions a person would have to search for terms like sculpture or three-dimensional works of art in addition to all the words representing the “big names” and “great ideas”. Such a process would be laborious to enter by hand, but trivial with the use of a computer.

Here’s a potential solution. Create a list of “big names” and “great ideas” by copying them from a place such as the Great Books of the Western World. Save the list much like you would save a stop word list. Allow a person to do a search. Calculate the relevancy ranking score for each search result. Loop through the list of names and ideas searching for each of them. Calculate their relevancy. Sum the weight of search terms with the weight of name/ideas terms. Return the weighted list. The result will be a relevancy ranked list reflecting not only the value of the search term but also the values of the names/ideas. This second set of values I call the Great Ideas Coefficient.

To implement this idea, the following subroutine, called great_ideas, was created. Given an index, a list of files, and a set of ideas, it loops through each file calculating the TFIDF score for each name/idea:

  sub great_ideas {
  
    my $index = shift;
    my $files = shift;
    my $ideas = shift;
    
    my %coefficients = ();
    
    # process each file
    foreach my $file ( @$files ) {
    
      my $words = $$index{ $file };
      my $coefficient = 0;
      
      # get t, the total number of words in the file, for tfidf
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }
      
      # process each big idea
      foreach my $idea ( keys %$ideas ) {
      
        # get n, the number of times the idea occurs in the file
        my $n = $$words{ $idea } || 0;
        
        # calculate; sum all tfidf scores for all ideas; scalar() forces
        # a count of indexed documents rather than a list of file names
        $coefficient = $coefficient + &tfidf( $n, $t, scalar( keys %$index ), scalar @$files );
      
      }
      
      # assign the coefficient to the file
      $coefficients{ $file } = $coefficient;
    
    }
    
    return \%coefficients;
  
  }
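The subroutine depends on tfidf, which was listed in Part II and is not repeated here. For readers without that posting at hand, the following is a minimal sketch of a conventional formulation. It is an assumption, not necessarily the code from the series: the name tfidf_sketch is hypothetical, n is the number of times a term occurs in a document, t is the total number of words in the document, d is the number of documents in the corpus, and h is the number of documents containing the term.

```perl
use strict;
use warnings;

# a conventional TFIDF calculation; an assumption, not the series' own code
sub tfidf_sketch {

  my ( $n, $t, $d, $h ) = @_;

  # guard against absent terms, empty documents, and terms found nowhere
  return 0 unless $n and $t and $h;

  # term frequency multiplied by inverse document frequency
  return ( $n / $t ) * log( $d / $h );

}

# a term occurring 3 times in a 100-word document, found in 1 of 10 documents
printf "%.4f\n", tfidf_sketch( 3, 100, 10, 1 );
```

Under this formulation a term found in every document scores zero, since log( 1 ) is 0. Some implementations special-case that situation and fall back to the plain term frequency.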

A Perl script, ideas.pl, was then written taking advantage of the great_ideas subroutine. As described above, it applies the query to an index, calculates TFIDF for the search terms as well as the names/ideas, sums the results, and lists the results accordingly:

  # define
  use constant STOPWORDS => 'stopwords.inc';
  use constant IDEAS     => 'ideas.inc';
  
  # use/require
  use strict;
  require 'subroutines.pl';
  
  # get the input
  my $q = lc( $ARGV[ 0 ] );

  # index, sans stopwords
  my %index = ();
  foreach my $file ( &corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }
  
  # search
  my ( $hits, @files ) = &search( \%index, $q );
  print "Your search found $hits hit(s)\n";
  
  # rank
  my $ranks = &rank( \%index, [ @files ], $q );
  
  # calculate great idea coefficients
  my $coefficients = &great_ideas( \%index, [ @files ], &slurp_words( IDEAS ) );
  
  # combine ranks and coefficients
  my %scores = ();
  foreach ( keys %$ranks ) { $scores{ $_ } = $$ranks{ $_ } + $$coefficients{ $_ } }
  
  # sort by score and display
  foreach ( sort { $scores{ $b } <=> $scores{ $a } } keys %scores ) {
  
    print "\t", $scores{ $_ }, "\t", $_, "\n"
  
  }

Using the query tool described in Part II, a search for books returns the following results:

  $ ./search.pl books
  Your search found 3 hit(s)
    0.00206045818083232   librarianship.txt
    0.000300606222548807  mississippi.txt
    5.91505974210339e-05  hegel.txt

Using the new program, ideas.pl, the same set of results is returned but in a different order, an order reflecting the existence of “big names” and “great ideas” in the texts:

  $ ./ideas.pl books
  Your search found 3 hit(s)
    0.101886904057731   hegel.txt
    0.0420767249559441  librarianship.txt
    0.0279062776599476  mississippi.txt

When it comes to books and “great” ideas, maybe I’d rather read hegel.txt as opposed to librarianship.txt. Hmmm…

Think of the great_ideas subroutine as embodying the opposite functionality of a stop word list. Instead of excluding the words in a given list from search results, use the words to skew search results in a particular direction.

The beauty of the great_ideas subroutine is that anybody can create their own set of “big names” or “great ideas”. They could be from any topic. Biology. Mathematics. A particular subset of literature. Just as different sets of stop words are used in different domains, so can the application of a Great Ideas Coefficient.

Similarity between documents

TFIDF can be applied to the problem of finding more documents like this one.

The problem of finding more documents like this one is perennial. The problem is addressed in the field of traditional librarianship through the application of controlled vocabulary terms, author/title authority lists, the collocation of physical materials through the use of classification numbers, and bibliographic instruction as well as information literacy classes.

In the field of information retrieval, the problem is addressed through the application of mathematics. More specifically but simply stated, by plotting the TFIDF scores of two or more terms from a set of documents on a Cartesian plane it is possible to calculate the similarity between said documents by comparing the angle between the resulting vectors, a measure called “cosine similarity”. By extending the process to any number of documents and any number of dimensions it is relatively easy to find more documents like this one.

Suppose we have two documents: A and B. Suppose each document contains many words, but those words are only science and art. Furthermore, suppose document A contains the word science 9 times and the word art 10 times. Given these values, we can plot the relationship between science and art on a graph, below. Document B can be plotted similarly supposing science occurs 6 times and the word art occurs 14 times. The resulting lines, beginning at the graph’s origin (O) to their end-points (A and B), are called “vectors” and they represent our documents on a Cartesian plane:

  s    |
  c  9 |         * A 
  i    |        *     
  e    |       *       
  n  6 |      *      * B
  c    |     *     *
  e    |    *    *
       |   *   *
       |  *  *   
       | * * 
       O-----------------------
                10   14
                
                  art
                
  Documents A and B represented as vectors

If the lines OA and OB were on top of each other and had the same length, then the documents would be considered equal, that is, exactly similar. In other words, the smaller the angle AOB is as well as the smaller the difference between the lengths of lines OA and OB the more likely the given documents are the same. Conversely, the greater the angle of AOB and the greater the difference of the lengths of lines OA and OB the more unlike the two documents. Note that cosine similarity, defined next, measures only the angle; the lengths of the vectors drop out of the calculation.

This comparison is literally expressed as the inner (dot) product of the vectors divided by the product of the Euclidean magnitudes of the vectors. Mathematically, it is stated in the following form and is called “cosine similarity”:

( ( A.B ) / ( ||A|| * ||B|| ) )

Cosine similarity will return a value between 0 and 1 when, as with TFIDF scores, no vector component is negative. The closer the result is to 1 the more similar the vectors (documents) are.

Most cosine similarity applications apply the comparison to every word in a document. Consequently each vector has a large number of dimensions making calculations time consuming. For the purposes of this series, I am only interested in the “big names” and “great ideas”, and since The Great Books of the Western World includes about 150 such terms, the application of cosine similarity is simplified.
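The formula can be checked by hand against documents A and B from the graph above. The following is a minimal sketch; the cosine subroutine is a hypothetical stand-alone helper, separate from the subroutines of the distribution, and the vectors hold the raw word counts rather than TFIDF scores.

```perl
use strict;
use warnings;

# cosine similarity of two equally sized vectors
sub cosine {

  my ( $u, $v ) = @_;
  my ( $dot, $uu, $vv ) = ( 0, 0, 0 );

  # accumulate the dot product and both squared magnitudes in one pass
  for my $i ( 0 .. $#$u ) {
    $dot += $$u[ $i ] * $$v[ $i ];
    $uu  += $$u[ $i ] ** 2;
    $vv  += $$v[ $i ] ** 2;
  }

  # a vector of zero length has no direction; call it dissimilar
  return 0 unless $uu and $vv;
  return $dot / ( sqrt( $uu ) * sqrt( $vv ) );

}

# documents A and B as ( art, science ) counts
my @A = ( 10, 9 );
my @B = ( 14, 6 );
printf "%.4f\n", cosine( \@A, \@B );   # 0.9467
printf "%.4f\n", cosine( \@A, \@A );   # 1.0000
```

Worked by hand, the dot product is 10 * 14 + 9 * 6 = 194, and the magnitudes are the square roots of 181 and 232, giving 194 / 204.92, or about 0.9467. A document compared with itself scores 1.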

To implement cosine similarity in Perl three additional subroutines needed to be written. One calculates the inner (dot) product of two vectors. Another calculates the Euclidean length of a vector. These two subroutines are listed below:

  sub dot {
  
    # dot product = (a1*b1 + a2*b2 ... ) where a and b are equally sized arrays (vectors)
    my $a = shift;
    my $b = shift;
    my $d = 0;
    for ( my $i = 0; $i <= $#$a; $i++ ) { $d = $d + ( $$a[ $i ] * $$b[ $i ] ) }
    return $d;
  
  }

  sub euclidian {
  
    # Euclidian length = sqrt( a1^2 + a2^2 ... ) where a is an array (vector)
    my $a = shift;
    my $e = 0;
    for ( my $i = 0; $i <= $#$a; $i++ ) { $e = $e + ( $$a[ $i ] * $$a[ $i ] ) }
    return sqrt( $e );
  
  }

The subroutine that does the actual comparison is listed below. Given a reference to an array of two books, stop words, and ideas, it indexes each book sans stop words, searches each book for a great idea, uses the resulting TFIDF score to build the vectors, and computes similarity:

  sub compare {
  
    my $books     = shift;
    my $stopwords = shift;
    my $ideas     = shift;
    
    my %index = ();
    my @a     = ();
    my @b     = ();
    
    # index
    foreach my $book ( @$books ) { $index{ $book } = &index( $book, $stopwords ) }
    
    # process each idea
    foreach my $idea ( sort( keys( %$ideas ))) {
    
      # search
      my ( $hits, @files ) = &search( \%index, $idea );
      
      # rank
      my $ranks = &rank( \%index, [ @files ], $idea );
      
      # build vectors, a & b
      my $index = 0;
      foreach my $file ( @$books ) {
      
        if    ( $index == 0 ) { push @a, $$ranks{ $file }}
        elsif ( $index == 1 ) { push @b, $$ranks{ $file }}
        $index++;
        
        }
      
      }
      
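      # compare; scores closer to 1 approach similarity. The quotient is
      # already the cosine of the angle between the vectors; do not wrap it
      # in cos(), which would map identical documents to cos(1), roughly 0.54
      return ( &dot( [ @a ], [ @b ] ) / ( &euclidian( [ @a ] ) * &euclidian( [ @b ] )));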
  
  }

Finally, a script, compare.pl, was written gluing the whole thing together. Its heart is listed here:

  # compare each document...
  for ( my $a = 0; $a <= $#corpus; $a++ ) {
  
    print "\td", $a + 1;
    
    # ...to every other document
    for ( my $b = 0; $b <= $#corpus; $b++ ) {
    
      # avoid redundant comparisons
      if ( $b <= $a ) { print "\t - " }
      
      # process next two documents
      else {
                      
        # (re-)initialize
        my @books = sort( $corpus[ $a ], $corpus[ $b ] );
        
        # do the work; scores closer to 1000 approach similarity
        print "\t", int(( &compare( [ @books ], $stopwords, $ideas )) * 1000 );
      
      }
    
    }
    
    # next line
    print "\n";
  
  }

In a nutshell, compare.pl loops through each document in a corpus and compares it to every other document in the corpus while skipping duplicate comparisons. Remember, only the dimensions representing “big names” and “great ideas” are calculated. Finally, it displays a similarity score for each pair of documents. Scores are multiplied by 1000 to make them easier to read. Given the sample data from the distribution, the following matrix is produced:

  $ ./compare.pl 
    Comparison: scores closer to 1000 approach similarity
    
        d1   d2   d3   d4   d5   d6
    
    d1   -  922  896  858  857  948
    d2   -   -   887  969  944  971
    d3   -   -    -   951  954  964
    d4   -   -    -    -   768  905
    d5   -   -    -    -    -   933
    d6   -   -    -    -    -    - 
    
    d1 = aristotle.txt
    d2 = hegel.txt
    d3 = kant.txt
    d4 = librarianship.txt
    d5 = mississippi.txt
    d6 = plato.txt

From the matrix it is obvious that documents d2 (hegel.txt) and d6 (plato.txt) are the most similar since their score is the closest to 1000. This means the vectors representing these documents are closer to congruency than the other documents. Notice how all the documents are very close to 1000. This makes sense since all of the documents come from the Alex Catalogue and the Alex Catalogue documents are selected because of their “great idea-ness”. The documents should be similar. Notice which documents are the least similar: d4 (librarianship.txt) and d5 (mississippi.txt). The first is a history of librarianship. The second is a novel called Life on the Mississippi. Intuitively, we would expect this to be true; neither of these documents is on the topic of “great ideas”.

(Argg! Something is incorrect with my trigonometry. When I duplicate a document and run compare.pl the resulting cosine similarity value between the exact same documents is 540, not 1000. What am I doing wrong?)
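One possible explanation, offered as a sketch rather than a confirmed diagnosis: the dot product divided by the product of the magnitudes is already the cosine of the angle, so passing that quotient through cos() a second time treats a similarity of 1 as an angle of 1 radian. The scale subroutine below is hypothetical and only mimics the multiply-by-1000 step of compare.pl.

```perl
use strict;
use warnings;

# scale a score the way compare.pl does
sub scale { return int( shift() * 1000 ) }

# identical documents have a cosine similarity of exactly 1
my $similarity = 1;

print "cosine taken once:  ", scale( $similarity ), "\n";          # 1000
print "cosine taken twice: ", scale( cos( $similarity ) ), "\n";   # 540
```

The second figure matches the 540 reported for identical documents. Because cos() decreases steadily between 0 and 1, a doubled cosine would also reverse the ordering of the scores; if that is what happened here, the pairs scoring lowest in the matrix would in fact be the most alike.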

Summary

This last part in the series demonstrated ways term frequency/inverse document frequency (TFIDF) can be applied to over-arching (or underlying) themes in a corpus of documents, specifically the “big names” and “great ideas” of Western civilization. It also demonstrated how TFIDF scores can be used to create vectors representing documents. These vectors can then be compared for similarity, and, by extension, the documents they represent can be compared for similarity.

The purpose of the entire series was to bring to light and take the magic out of a typical relevancy ranking algorithm. A distribution including all the source code and sample documents is available online. Use the distribution as a learning tool for your own explorations.

As alluded to previously, TFIDF is like any good folk song. It has many variations and applications. TFIDF is also like milled grain because it is a fundamental ingredient to many recipes. Some of these recipes are for bread, but some of them are for pies or just thickener. Librarians and libraries need to incorporate more mathematical methods into their processes. There needs to be a stronger marriage between the social characteristics of librarianship and the logic of mathematics. (Think arscience.) The application of TFIDF in libraries is just one example.

The decline of books

Eric Lease Morgan — Fri, 08 May 2009 13:41:04 +0000

[This posting is in response to a tiny thread on the NGC4Lib mailing list about the decline of books. –ELM]

Yes, books are on the decline, but in order to keep this trend in perspective it is important to not confuse the medium with the message. The issue is not necessarily about books as much as it is about the stuff inside the books.

Books (codexes) are a particular type of technology. Print words and pictures on leaves of paper. Number the pages. Add an outline of the book’s contents, a table of contents. Make the book somewhat searchable by adding an index. Wrap the whole thing between a couple of boards. The result is a thing that is portable, durable, long-lasting, and relatively free-standing as well as independent of other technology. But all of this is really a transport medium, a container for the content.

Consider the content of books. Upon close examination it is a recorded manifestation of humanity. Books, just like the Web, are a reflection of humankind because just about anything you can think of can be manifested in printed form. Birth. Growth. Love. Marriage. Aging. Death. Poetry. Prose. Mathematics. Astronomy. Business. Instructions. Facts. Directories. Gardening. Theses and dissertations. News. White papers. Plans. History. Descriptions. Dreams. Weather. Stock quotes. The price of gold. Things for sale. Stories both real and fictional. Etc. Etc. Etc.

Consider the length of time humankind has been recording things in written form. Maybe five thousand years. What were the mediums used? Stone and clay tablets. Papyrus scrolls. Vellum. Paper. To what extent did people bemoan the death of clay tablets? To what extent did they bemoan the movement from scrolls to codexes? Probably the cultures who valued verbal traditions as opposed to written traditions (think of the American Indians) had more to complain about than the migration from one written form to another. The medium is not as important as the message.

Different types of content lend themselves to different mediums. Music can be communicated via the written score, but music is really intended to be experienced through hearing. Sculpture is, by definition, a three-dimensional medium, yet we take photographs of it, a two-dimensional medium. Poetry and prose lend themselves very well to the written word, but they can be seen as forms of storytelling, and while there are many advantages to stories being written down, there are disadvantages as well. No sound effects. Where to put the emphasis on phrases? Hand gestures to communicate subtle distinctions are lost. It is for all of these reasons that libraries (and museums and archives) also collect the mediums that better represent this content. Paintings. Sound recordings. Artifacts. CDs and DVDs.

The containers of information will continue to change, but I assert that the content will not. The content will continue to be a reflection of humankind. It will represent all of the things that it means to be men, women, and children. It will continue to be an exposition of our collective thoughts, feelings, beliefs, and experiences.

Libraries and other “cultural heritage institutions” do not have and never did have a monopoly on recorded content, but now, more than ever, and as we have moved away from an industrial-based economy to a more service-based economy whose communication channels are electronic and global, the delivery of recorded content, in whatever form, is more profitable. Consequently there is more competition. Libraries need to get a grip on what they are all about. If it is about the medium — books, CDs, articles — then the future is grim. If it is about content and making that content useful to their clientele, then the opportunities are wide open. Shifting a person’s focus from the how to the what is challenging. Looking at the forest from the trees is sometimes overwhelming. Anybody can get information these days. We are still drinking from the proverbial fire hose. The problem to be solved is less about discovery and more about use. It is about placing content in context. Providing a means to understanding it, manipulating it, and using it to solve the problems revolving around what it means to be human.

We are a set of educated people. If we put our collective minds to the problem, then I sincerely believe libraries can and will remain relevant. In fact, that is why I instituted this [the NGC4Lib] mailing list.

Code4Lib Software Award: Loose ends

Eric Lease Morgan — Mon, 27 Apr 2009 12:42:44 +0000

Loose ends make me feel uncomfortable, and one of the loose ends in my professional life is the Code4Lib Software Award.

Code4Lib began as a mailing list in 2003 and has grown to about 1,200 subscribers from all over the world. New people subscribe to the list almost daily. Its Web presence started up in 2005. Our conferences have been stimulating, informative, and productive for all three years of their existence. Our latest venture (the journal) records, documents, and shares the practical experience of our community. Underlying all of this is an IRC channel where library-related computer problems can be answered in real-time. Heck, there even exist three or four Code4Lib “franchises”. In sum, by exploiting both traditional and less traditional mediums the Code4Lib Community has grown and matured quickly over the past five years. In doing so it has provided valuable and long-lasting services to itself as well as the greater library profession.

It is for the reasons outlined above that I believe our community is ripe for an award. Good things happen in Code4Lib. These things begin with individuals, and I believe the good code written by these individuals ought to be formally recognized. Unfortunately, ever since I put forward the idea, I have heard more negative things than positive. To paraphrase, “It would be seen as an endorsement, and we don’t endorse… It would turn out to be just a popularity contest… There are so many characteristics of good software that any decision would seem arbitrary.”

Apparently the place for an award is not as obvious to others as it is to me. Apparently our community is not as ready for an award as I thought we were. That is why, for the time being, I am withdrawing my offer to sponsor one. Considering who I am, I simply don’t have the political wherewithal to make the award a reality, but I do predict there will be an award at some time, just not right now. The idea needs to ferment for a while longer.

TFIDF In Libraries: Part II of III (For programmers)

Eric Lease Morgan — Tue, 21 Apr 2009 02:42:39 +0000

This is the second of a three-part series called TFIDF In Libraries, where relevancy ranking techniques are explored through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techniques will be explored to the end of filtering search results or addressing the perennial task of “finding more documents like this one.” In the end it is hoped to demonstrate that relevancy ranking is neither magic nor mysterious but rather the process of applying statistical techniques to textual objects.

TFIDF, again

As described in Part I, term frequency/inverse document frequency (TFIDF) is a process of counting words in a document as well as throughout a corpus of documents to the end of sorting documents in statistically relevant ways.

Term frequency (TF) is essentially a percentage denoting the number of times a word appears in a document. It is mathematically expressed as C / T, where C is the number of times a word appears in a document and T is the total number of words in the same document.

Inverse document frequency (IDF) takes into account that many words occur many times in many documents. Stop words and the word “human” in the MEDLINE database are very good examples. IDF is mathematically expressed as D / DF, where D is the total number of documents in a corpus and DF is the number of documents in which a particular word is found. As D / DF increases so does the significance of the given word.

Given these two factors, TFIDF is literally the product of TF and IDF:

TFIDF = ( C / T ) * ( D / DF )

This is the basic form that has been used to denote relevance ranking for more than forty years, and please take note that it requires no advanced mathematical knowledge, just basic arithmetic.

Like any good recipe or folk song, TFIDF has many variations. Google, for example, adds additional factors into their weighting scheme based on the popularity of documents. Other possibilities could include factors denoting the characteristics of the person using the texts. In order to accommodate the wide variety of document sizes, the natural log of IDF will be employed throughout the balance of this demonstration. Therefore, for the purposes used here, TFIDF will be defined thus:

TFIDF = ( C / T ) * log( D / DF )
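To make the arithmetic concrete, here is a minimal sketch of the formula as a stand-alone Perl function. The name tfidf_log and the sample numbers are invented for illustration; they are not taken from the scripts described in this posting.

```perl
use strict;
use warnings;

# TFIDF = ( C / T ) * log( D / DF ); Perl's log is the natural logarithm.
# The function name and the sample values below are illustrative only.
sub tfidf_log {
    my ( $c, $t, $d, $df ) = @_;
    return ( $c / $t ) * log( $d / $df );
}

# a word occurring 5 times in a 100-word document, found in 2 of 10 documents
printf "%.4f\n", tfidf_log( 5, 100, 10, 2 );

# a word found in every one of the 10 documents
printf "%.4f\n", tfidf_log( 5, 100, 10, 10 );
```

The first call prints 0.0805. The second prints 0.0000 because log( 1 ) is zero, so with this form of the formula a word found in every document carries no weight at all, no matter how often it occurs.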

Simple Perl subroutines

In order to put theory into practice, I wrote a number of Perl subroutines implementing various aspects of relevancy ranking techniques. I then wrote a number of scripts exploiting the subroutines, essentially wrapping them in a user interface.

Two of the routines are trivial and will not be explained in any greater detail than below:

corpus – Returns an array of all the .txt files in the current directory, and is used to denote the library of content to be analyzed.
slurp_words – Returns a reference to a hash of all the words in a file, specifically for the purposes of implementing a stop word list.
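Since neither routine is listed, the following is only a guess at what they might look like, not the code from the distribution. It assumes the stop word file is plain text with words separated by whitespace or line breaks.

```perl
use strict;
use warnings;

# return a sorted list of the .txt files in the current directory
sub corpus {
    my @files = glob( '*.txt' );
    @files = sort { $a cmp $b } @files;
    return @files;
}

# return a reference to a hash whose keys are the (lowercased) words in a
# file; the whitespace-delimited file format is an assumption
sub slurp_words {
    my $file  = shift;
    my %words = ();
    open my $fh, '<', $file or die "Can't open $file: $!";
    while ( my $line = <$fh> ) {
        foreach my $word ( split ' ', lc $line ) { $words{ $word } = 1 }
    }
    close $fh;
    return \%words;
}
```

Returning a hash reference rather than a list makes the later stop word test a single hash lookup per word.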

Two more of the routines are used to support indexing and searching the corpus. Again, since neither is the focus of this posting, each will only be outlined:

index – Given a file name and a list of stop words, this routine returns a reference to a hash containing all of the words in the file (sans stop words) as well as the number of times each word occurs. Strictly speaking, this hash is not an index but it serves our given purpose adequately.
search – Given an “index” and a query, this routine returns the number of times the query was found in the index as well as an array of files listing where the term was found. Search is limited. It only supports single-term queries, and there are no fields for limiting.
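The two routines outlined above could be sketched as follows. This is a hypothetical reconstruction: the names build_index and search_index are stand-ins (chosen partly to avoid clashing with Perl's built-in index function), and the tokenizing rule of splitting on anything that is not a letter is an assumption.

```perl
use strict;
use warnings;

# count the words in a file, skipping stop words; returns a hash reference
# (tokenizing on non-letters is an assumption, not the author's rule)
sub build_index {
    my ( $file, $stopwords ) = @_;
    my %count = ();
    open my $fh, '<', $file or die "Can't open $file: $!";
    while ( my $line = <$fh> ) {
        foreach my $word ( split /[^a-z]+/, lc $line ) {
            next if $word eq '' or $$stopwords{ $word };
            $count{ $word }++;
        }
    }
    close $fh;
    return \%count;
}

# given a hash of file name => word counts and a single-term query, return
# the number of files containing the term followed by the files themselves
sub search_index {
    my ( $index, $query ) = @_;
    my @files = grep { exists $$index{ $_ }{ $query } } sort keys %$index;
    return ( scalar @files, @files );
}
```

Returning the hit count first lets a caller write my ( $hits, @files ) = search_index( ... ), which is the calling convention the scripts in this posting rely on.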

The heart of the library of subroutines is used to calculate TFIDF, rank search results, and classify documents. Of course the TFIDF calculation is absolutely necessary, but ironically, it is the most straightforward routine in the collection. Given values for C, T, D, and DF it returns a decimal, usually between 0 and 1. When a word appears in every document (D equals DF), log( D / DF ) is zero, so the routine falls back to plain TF. Trivial:

  # calculate tfidf
  sub tfidf {
  
    my $n = shift;  # C
    my $t = shift;  # T
    my $d = shift;  # D
    my $h = shift;  # DF
    
    my $tfidf = 0;
    
    if ( $d == $h ) { $tfidf = ( $n / $t ) }
    else { $tfidf = ( $n / $t ) * log( $d / $h ) }
    
    return $tfidf;
    
  }

Many readers will probably be most interested in the rank routine. Given an index, a list of files, and a query, this code calculates TFIDF for each file and returns the results as a reference to a hash. It does this by repeatedly calculating the values for C, T, D, and DF for each of the files and calling tfidf:

  # assign a rank to a given file for a given query
  sub rank {
  
    my $index = shift;
    my $files = shift;
    my $query = shift;
    
    my %ranks = ();
    
    foreach my $file ( @$files ) {
    
      # calculate n
      my $words = $$index{ $file };
      my $n = $$words{ $query };
      
      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }
      
      # assign tfidf to file; keys is forced into scalar context so D is a count
      $ranks{ $file } = &tfidf( $n, $t, scalar( keys %$index ), scalar @$files );
    
    }
    
    return \%ranks;

  }

The classify routine is an added bonus. Given the index, a file, and the corpus of files, this function calculates TFIDF for each word in the file and returns a reference to a hash containing each word and its TFIDF value. In other words, instead of calculating TFIDF for a given query in a subset of documents, it calculates TFIDF for each word in an entire corpus. This proves useful in regards to automatic classification. Like rank, it repeatedly determines values for C, T, D, and DF and calls tfidf:

  # rank each word in a given document compared to a corpus
  sub classify {
  
    my $index  = shift;
    my $file   = shift;
    my $corpus = shift;
    
    my %tags = ();
    
    foreach my $words ( $$index{ $file } ) {
    
      # calculate t
      my $t = 0;
      foreach my $word ( keys %$words ) { $t = $t + $$words{ $word } }
      
      foreach my $word ( keys %$words ) {
      
        # get n
        my $n = $$words{ $word };
        
        # calculate h
        my ( $h, @files ) = &search( $index, $word );
        
        # assign tfidf to word
        $tags{ $word } = &tfidf( $n, $t, scalar @$corpus, $h );
      
      }
    
    }
    
    return \%tags;
  
  }

Search.pl

Two simple Perl scripts are presented below, taking advantage of the routines described above. The first is search.pl. Given a single term as input this script indexes the .txt files in the current directory, searches them for the term, assigns TFIDF to each of the results, and displays the results in a relevancy ranked order. The essential aspects of the script are listed here:

  # define
  use constant STOPWORDS => 'stopwords.inc';
  
  # include
  require 'subroutines.pl';
    
  # get the query
  my $q = lc( $ARGV[ 0 ] );

  # index
  my %index = ();
  foreach my $file ( &corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }
  
  # search
  my ( $hits, @files ) = &search( \%index, $q );
  print "Your search found $hits hit(s)\n";
  
  # rank
  my $ranks = &rank( \%index, [ @files ], $q );
  
  # sort by rank and display
  foreach my $file ( sort { $$ranks{ $b } <=> $$ranks{ $a } } keys %$ranks ) {
  
    print "\t", $$ranks{ $file }, "\t", $file, "\n"
  
  }
  
  # done
  print "\n";
  exit;

Output from the script looks something like this:

  $ ./search.pl knowledge
  Your search found 6 hit(s)
    0.0193061840120664    plato.txt
    0.00558586078987563   kant.txt
    0.00299602568022012   aristotle.txt
    0.0010031177985631    librarianship.txt
    0.00059150597421034   hegel.txt
    0.000150303111274403  mississippi.txt

From these results you can see that the document named plato.txt is the most relevant because it has the highest score. In fact, it is roughly three and a half times more relevant than the second hit, kant.txt. For extra credit, ask yourself, “At what point do the scores become useless, or when do the scores tell you there is nothing of significance here?”

Classify.pl

As alluded to in Part I of this series, TFIDF can be turned on its head to do automatic classification. Weigh each term in a corpus of documents, and list the most significant words for a given document. Classify.pl does this by denoting a lower bounds for TFIDF scores, indexing an entire corpus, weighing each term, and outputting all the terms whose scores are greater than the lower bounds. If no terms are greater than the lower bounds, then it lists the top N scores as defined by a configuration. The essential aspects of classify.pl are listed below:

  # define
  use constant STOPWORDS    => 'stopwords.inc';
  use constant LOWERBOUNDS  => .02;
  use constant NUMBEROFTAGS => 5;
  
  # require
  require 'subroutines.pl';
  
  # initialize
  my @corpus = &corpus;
  
  # index
  my %index = ();
  foreach my $file (@corpus ) { $index{ $file } = &index( $file, &slurp_words( STOPWORDS ) ) }
  
  # classify each document
  foreach my $file ( @corpus ) {
  
    print $file, "\n";
    
    # list tags greater than a given score
    my $tags  = &classify( \%index, $file, [ @corpus ] );
    my $found = 0;
    foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {
    
      if ( $$tags{ $tag } > LOWERBOUNDS ) {
      
        print "\t", $$tags{ $tag }, "\t$tag\n";
        $found = 1;
      
      }
      
      else { last }
      
    }
      
    # accommodate tags with low scores
    if ( ! $found ) {
    
      my $n = 0;
      foreach my $tag ( sort { $$tags{ $b } <=> $$tags{ $a } } keys %$tags ) {
      
        print "\t", $$tags{ $tag }, "\t$tag\n";
        $n++;
        last if ( $n == NUMBEROFTAGS );
      
      }
  
    }
    
    print "\n";
  
  }
  
  # done
  exit;

For example, sample, yet truncated, output from classify.pl looks like this:

  aristotle.txt
    0.0180678691531642  being
    0.0112840859266579  substances
    0.0110363803118312  number
    0.0106083766432284  matter
    0.0098440843778661  sense
  
  mississippi.txt
    0.00499714142455761  mississippi
    0.00429324597184886  boat
    0.00418922035591656  orleans
    0.00374087743616293  day
    0.00333830388445574  river

Thus, assuming a lower TFIDF bounds of 0.02, the words being, substances, number, matter, and sense are the most significant in the document named aristotle.txt. But since none of the words in mississippi.txt have a score that high, the top five words are returned instead. For more extra credit, think of ways classify.pl can be improved by answering, “How can the output be mapped to controlled vocabulary terms or expanded through the use of some other thesaurus?”

Summary

The Perl subroutines and scripts described here implement TFIDF to do rudimentary ranking of search results and automatic classification. They are not designed to be production applications, just example tools for the purposes of learning. Turning the ideas implemented in these scripts into production applications has been the fodder for many people’s careers and entire branches of computer science.

You can download the scripts, subroutines, and sample data in order to learn more. You are encouraged to remove the .txt files from the distribution and replace them with your own data. I think your search results and automatic classification output will confirm in your mind that TFIDF is well worth the time and effort of the library community. Given the amounts of full text books and journal articles freely available on the Internet, it behooves the library profession to learn to exploit these concepts because our traditional practices simply: 1) do not scale, or 2) do not meet our users’ expectations. Furthermore, farming these sorts of solutions out to vendors is irresponsible.

Ralph Waldo Emerson’s Essays

Eric Lease Morgan — Sun, 19 Apr 2009 22:32:47 +0000

It was with great anticipation that I read Ralph Waldo Emerson’s Essays (both the First Series as well as the Second Series), but my expectations were not met. In a sentence I thought Emerson used too many words to say things that could have been expressed more succinctly.

The Essays themselves are a set of unsystematic short pieces of literature describing what one man thinks of various classic themes, such as but not limited to: history, intellect, art, experience, gifts, nature, etc. The genre itself — the literary essay or “attempts” — was apparently first popularized by Montaigne and mimicked by other “great” authors in the Western tradition including Bacon, Rousseau, and Thoreau. Considering this, maybe the poetic and circuitous nature of Emerson’s “attempts” should not be considered a fault.

Art

Because it was evident that later essays did not necessarily build on previous ones, I jumped around from chapter to chapter as whimsy dictated. Probably one of the first I read was “Art” where he describes the subject as the product of men detached from society.

It is the habit of certain minds to give an all-excluding fulness to the objects, the thought, the world, they alight upon, and to make that for the time the deputy of the world. These are the artists, the orators, the leaders of society. The power to detach and to magnify by detaching, is the essence of rhetoric in the hands of the orator and the poet.

But at the same time he seems to contradict himself earlier when he says:

No man can quite emancipate himself from the age and country, or produce a model in which the education, the religion, the politics, usages, and arts, of his times shall have no share. Though he were never so original, never so wilful and fantastic, he cannot wipe out of his work every trace of the thoughts amidst which it grew.

How can something be the product of a thing detached from society when it is not possible to become detached in the first place?

Intellect

I, myself, being a person of mind more than heart, was keenly interested in the essay entitled “Intellect” where Emerson describes it as something:

…void of affection, and sees an object as it stands in the light of science, cool and disengaged… Intellect pierces the form, overleaps the wall, detects intrinsic likeness between remote things, and reduces all things into a few principles.

At the same time, intellect is not necessarily genius, since genius also requires spontaneity:

…but the power of picture or expression, in the most enriched and flowing nature, implies a mixture of will, a certain control over the spontaneous states, without which no production is possible. It is a conversion of all nature into the rhetoric of thought under the eye of judgement, with the strenuous exercise of choice. And yet the imaginative vocabulary seems to be spontaneous also. It does not flow from experience only or mainly, but from a richer source. Not by any conscious imitation of particular forms are the grand strokes of the painter executed, but by repairing to the fountain-head of all forms in his mind.

The Poet

Emerson apparently carried around his journal wherever he went. He made a living writing and giving talks. Considering this, and considering the nature of his writing, I purposely left his essay entitled “The Poet” until last. Not surprisingly, he had a lot to say on the subject, and I found this to be the highlight of my readings:

The poet is the person in whom these powers [the reproduction of senses] are in balance, the man without impediment, who sees and handles that which others dream of, traverses the whole scale of experience, and is representative of man, in virtue of being the largest power to receive and to impart… The poet is the sayer, the namer, and represents beauty… The poet does not wait for the hero or the sage, but as they act and think primarily, so he writes primarily what will and must be spoken, reckoning the others, though primaries also, yet, in respect to him, secondaries and servants.

I found it encouraging that science was mentioned a few times during his discourse on the poet, since I believe a better understanding of one’s environment comes from the ability to think both artistically as well as scientifically, an idea I call arscience:

…science always goes abreast with the just elevation of the man, keeping step with religion and metaphysics; or, the state of science is an index of our self-knowledge… All the facts of the animal economy, — sex, nutriment, gestation, birth, growth — are symbols of passage of the world into the soul of man, to suffer there a change, and reappear a new and higher fact. He uses forms according to the life, and not according to the form. This is true science.

Back to the beginning

I think Emerson must have been a bit frustrated (or belittling himself in order to be perceived as more believable) with a search for truth when he says, “I look in vain for the poet whom I describe.” But later on he summarizes much of what the Essays describe when he says, “Art is the path of the creator to his work,” and he then goes on to say what I said at the beginning of this review:

The poet pours out verses in every solitude. Most of the things he says are conventional, no doubt; but by and by he says something which is original and beautiful. That charms him.

I was hoping to find more inspiration regarding the definition of Unitarianism throughout the book, but alas, the term was only mentioned a couple of times. Instead, I learned more indirectly that Emerson affected my thinking in more subtle ways. I have incorporated much of his thought into my own without knowing it. Funny how one’s education manifests itself.

Word cloud

Use this word cloud of the combined Essays to get an idea of what they are “about”:

nature men life world good shall soul great thought like love power know let mind truth make society persons day old character heart genius god come beauty law being history fact true makes work virtue better art laws self form right eye best action poet friend think feel eyes beautiful words human spirit little light facts speak person state natural intellect sense live force use seen thou long water people house certain individual end comes whilst divine property experience look forms hour read place present fine wise moral works air poor need earth hand common word thy conversation young stand

And since a picture is worth a thousand words, here is a simple graph illustrating how the 100 most frequently used words in the Essays (sans stop words) compare to one another:

TFIDF In Libraries: Part I of III (For Librarians)

Eric Lease Morgan — Mon, 13 Apr 2009 23:57:38 +0000

This is the first of a three-part series called TFIDF In Libraries, where “relevancy ranking” will be introduced. In this part, term frequency/inverse document frequency (TFIDF) — a common mathematical method of weighing texts for automatic classification and sorting search results — will be described. Part II will illustrate an automatic classification system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of pre-defined “Big Names” and/or “Big Ideas” — an idea apparently called “champion lists”.

The problem, straight Boolean logic

To many of us the phrase “relevancy ranked search results” is a mystery. What does it mean to be “relevant”? How can anybody determine relevance for me? Well, a better phrase might have been “statistically significant search results”. Taking such an approach — the application of statistical analysis against texts — does have its information retrieval advantages over straight Boolean logic. Take for example, the following three documents consisting of a number of words, Table #1:

Document #1	Document #2	Document #3
Word	Word	Word
airplane	book	building
blue	car	car
chair	chair	carpet
computer	justice	ceiling
forest	milton	chair
justice	newton	cleaning
love	pond	justice
might	rose	libraries
perl	shakespeare	newton
rose	slavery	perl
shoe	thesis	rose
thesis	truck	science

A search for “rose” against the corpus will return three hits, but which one should I start reading? The newest document? The document by a particular author or in a particular format? Even if the corpus contained 2,000,000 documents and a search for “rose” returned a mere 100 the problem would remain. Which ones should I spend my valuable time accessing? Yes, I could limit my search in any number of ways, but unless I am doing a known item search it is quite likely the search results will return more than I can use, and information literacy skills will only go so far. Ranked search results (a list of hits based on term weighting) have proven to be an effective way of addressing this problem. All it requires is the application of basic arithmetic against the documents being searched.

Simple counting

We can begin by counting the number of times each of the words appears in each of the documents, Table #2:

Document #1		Document #2		Document #3
Word	C	Word	C	Word	C
airplane	5	book	3	building	6
blue	1	car	7	car	1
chair	7	chair	4	carpet	3
computer	3	justice	2	ceiling	4
forest	2	milton	6	chair	6
justice	7	newton	3	cleaning	4
love	2	pond	2	justice	8
might	2	rose	5	libraries	2
perl	5	shakespeare	4	newton	2
rose	6	slavery	2	perl	5
shoe	4	thesis	2	rose	7
thesis	2	truck	1	science	1
Totals (T)	46		41		49

Given this simple counting method, searches for “rose” can be sorted by its “term frequency” (TF) — the quotient of the number of times a word appears in each document (C), and the total number of words in the document (T) — TF = C / T. In the first case, rose has a TF value of 0.13. In the second case TF is 0.12, and in the third case it is 0.14. Thus, by this rudimentary analysis, Document #3 is most significant in terms of the word “rose”, and Document #2 is the least. Document #3 has the highest percentage of content containing the word “rose”.
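The three TF values for “rose” can be reproduced with a few lines of Perl. This is only an illustrative sketch; the counts are copied from Table #2 and the variable names are my own.

```perl
use strict;
use warnings;

# for each document: [ C, T ] for the word "rose", copied from Table #2
my %rose = (
    'Document #1' => [ 6, 46 ],
    'Document #2' => [ 5, 41 ],
    'Document #3' => [ 7, 49 ],
);

# TF = C / T
sub tf {
    my ( $c, $t ) = @_;
    return $c / $t;
}

# calculate TF for each document and sort from highest to lowest
my %tf = ();
foreach my $doc ( keys %rose ) { $tf{ $doc } = tf( @{ $rose{ $doc } } ) }
my @ranked = sort { $tf{ $b } <=> $tf{ $a } } keys %tf;

foreach my $doc ( @ranked ) { printf "%.3f\t%s\n", $tf{ $doc }, $doc }
```

The script lists Document #3 (0.143) first, followed by Document #1 (0.130) and Document #2 (0.122), matching the order described above.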

Accounting for common words

Unfortunately, this simple analysis needs to be offset by considering terms that occur frequently across the entire corpus. Good examples are stop words or the word “human” in MEDLINE. Such words are nearly meaningless because they appear so often. Consider Table #3, which includes the number of documents in which each word is found (DF), and the quotient of the total number of documents (D or, in this case, 3) and DF: IDF = D / DF. Words with higher scores are more significant across the entire corpus. Search terms whose IDF (“inverse document frequency”) scores approach 1 are close to useless because they exist in just about every document:

Document #1			Document #2			Document #3
Word	DF	IDF	Word	DF	IDF	Word	DF	IDF
airplane	1	3.0	book	1	3.0	building	1	3.0
blue	1	3.0	car	2	1.5	car	2	1.5
chair	3	1.0	chair	3	1.0	carpet	1	3.0
computer	1	3.0	justice	3	1.0	ceiling	1	3.0
forest	1	3.0	milton	1	3.0	chair	3	1.0
justice	3	1.0	newton	2	1.5	cleaning	1	3.0
love	1	3.0	pond	1	3.0	justice	3	1.0
might	1	3.0	rose	3	1.0	libraries	1	3.0
perl	2	1.5	shakespeare	1	3.0	newton	2	1.5
rose	3	1.0	slavery	1	3.0	perl	2	1.5
shoe	1	3.0	thesis	2	1.5	rose	3	1.0
thesis	2	1.5	truck	1	3.0	science	1	3.0
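The DF and IDF columns of Table #3 can be derived directly from the word lists in Table #1. The following is a minimal sketch of that derivation; the data structure and variable names are assumptions made for the sake of illustration.

```perl
use strict;
use warnings;

# the three word lists from Table #1
my %documents = (
    1 => [ qw( airplane blue chair computer forest justice love might perl rose shoe thesis ) ],
    2 => [ qw( book car chair justice milton newton pond rose shakespeare slavery thesis truck ) ],
    3 => [ qw( building car carpet ceiling chair cleaning justice libraries newton perl rose science ) ],
);

# DF: the number of documents containing each word
my %df = ();
foreach my $words ( values %documents ) {
    my %seen = ();
    foreach my $word ( @$words ) { $df{ $word }++ unless $seen{ $word }++ }
}

# IDF = D / DF, where D is the total number of documents
my $d   = scalar( keys %documents );
my %idf = ();
foreach my $word ( keys %df ) { $idf{ $word } = $d / $df{ $word } }

# display a few sample words: word, DF, IDF
foreach my $word ( qw( airplane car rose ) ) {
    printf "%s\t%d\t%.1f\n", $word, $df{ $word }, $idf{ $word };
}
```

For the three sample words the script reports 1 and 3.0 for airplane, 2 and 1.5 for car, and 3 and 1.0 for rose, the same values found in the table.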

Term frequency/inverse document frequency (TFIDF)

By taking into account these two factors, term frequency (TF) and inverse document frequency (IDF), it is possible to assign “weights” to search results and therefore order them statistically. Put another way, a search result’s score (“ranking”) is the product of TF and IDF:

TFIDF = TF * IDF where:

TF = C / T where C = number of times a given word appears in a document and T = total number of words in a document

IDF = D / DF where D = total number of documents in a corpus, and DF = total number of documents containing a given word

Table #4 is a combination of all the previous tables with the addition of the TFIDF score for each term:

Document #1
Word	C	T	TF	D	DF	IDF	TFIDF
airplane	5	46	0.109	3	1	3.0	0.326
blue	1	46	0.022	3	1	3.0	0.065
chair	7	46	0.152	3	3	1.0	0.152
computer	3	46	0.065	3	1	3.0	0.196
forest	2	46	0.043	3	1	3.0	0.130
justice	7	46	0.152	3	3	1.0	0.152
love	2	46	0.043	3	1	3.0	0.130
might	2	46	0.043	3	1	3.0	0.130
perl	5	46	0.109	3	2	1.5	0.163
rose	6	46	0.130	3	3	1.0	0.130
shoe	4	46	0.087	3	1	3.0	0.261
thesis	2	46	0.043	3	2	1.5	0.065
Document #2
Word	C	T	TF	D	DF	IDF	TFIDF
book	3	41	0.073	3	1	3.0	0.220
car	7	41	0.171	3	2	1.5	0.256
chair	4	41	0.098	3	3	1.0	0.098
justice	2	41	0.049	3	3	1.0	0.049
milton	6	41	0.146	3	1	3.0	0.439
newton	3	41	0.073	3	2	1.5	0.110
pond	2	41	0.049	3	1	3.0	0.146
rose	5	41	0.122	3	3	1.0	0.122
shakespeare	4	41	0.098	3	1	3.0	0.293
slavery	2	41	0.049	3	1	3.0	0.146
thesis	2	41	0.049	3	2	1.5	0.073
truck	1	41	0.024	3	1	3.0	0.073
Document #3
Word	C	T	TF	D	DF	IDF	TFIDF
building	6	49	0.122	3	1	3.0	0.367
car	1	49	0.020	3	2	1.5	0.031
carpet	3	49	0.061	3	1	3.0	0.184
ceiling	4	49	0.082	3	1	3.0	0.245
chair	6	49	0.122	3	3	1.0	0.122
cleaning	4	49	0.082	3	1	3.0	0.245
justice	8	49	0.163	3	3	1.0	0.163
libraries	2	49	0.041	3	1	3.0	0.122
newton	2	49	0.041	3	2	1.5	0.061
perl	5	49	0.102	3	2	1.5	0.153
rose	7	49	0.143	3	3	1.0	0.143
science	1	49	0.020	3	1	3.0	0.061

Given TFIDF, a search for “rose” still returns three documents ordered by Documents #3, #1, and #2. A search for “newton” returns only two items ordered by Documents #2 (0.110) and #3 (0.061). In the latter case, Document #2 is almost twice as “relevant” as Document #3. TFIDF scores can be summed to take into account Boolean unions (or) or intersections (and).
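To illustrate both the ranking and the idea of summing scores, here is a small sketch using the counts from Table #2. The subroutine names scores and report are hypothetical, and summing is only one simple way to combine terms.

```perl
use strict;
use warnings;

# document totals (T) and word counts (C) for two terms, from Table #2
my %totals = ( 'Document #1' => 46, 'Document #2' => 41, 'Document #3' => 49 );
my %counts = (
    'Document #1' => { rose => 6 },
    'Document #2' => { rose => 5, newton => 3 },
    'Document #3' => { rose => 7, newton => 2 },
);

# score every document containing the term with ( C / T ) * ( D / DF )
sub scores {
    my $term   = shift;
    my @hits   = grep { exists $counts{ $_ }{ $term } } keys %counts;
    my %scores = ();
    foreach my $doc ( @hits ) {
        my $tf  = $counts{ $doc }{ $term } / $totals{ $doc };
        my $idf = scalar( keys %counts ) / scalar( @hits );
        $scores{ $doc } = $tf * $idf;
    }
    return \%scores;
}

# print documents from highest to lowest score
sub report {
    my ( $label, $scores ) = @_;
    print "$label\n";
    foreach my $doc ( sort { $$scores{ $b } <=> $$scores{ $a } } keys %$scores ) {
        printf "\t%.3f\t%s\n", $$scores{ $doc }, $doc;
    }
}

my $rose   = scores( 'rose' );
my $newton = scores( 'newton' );
report( 'rose',   $rose );
report( 'newton', $newton );

# a Boolean union (or): sum each document's scores across both terms
my %union = %$rose;
foreach my $doc ( keys %$newton ) { $union{ $doc } += $$newton{ $doc } }
report( 'rose OR newton', \%union );
```

For the union, Document #2 rises to the top (0.232), followed by Document #3 (0.204) and Document #1 (0.130). An intersection could be handled the same way by summing only over the documents found for every term.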

Automatic classification

TFIDF can also be applied a priori to indexing/searching to create browsable lists; hence, automatic classification. Consider Table #5 where each word is listed in a sorted TFIDF order:

Document #1		Document #2		Document #3
Word	TFIDF	Word	TFIDF	Word	TFIDF
airplane	0.326	milton	0.439	building	0.367
shoe	0.261	shakespeare	0.293	ceiling	0.245
computer	0.196	car	0.256	cleaning	0.245
perl	0.163	book	0.220	carpet	0.184
chair	0.152	pond	0.146	justice	0.163
justice	0.152	slavery	0.146	perl	0.153
forest	0.130	rose	0.122	rose	0.143
love	0.130	newton	0.110	chair	0.122
might	0.130	chair	0.098	libraries	0.122
rose	0.130	thesis	0.073	newton	0.061
blue	0.065	truck	0.073	science	0.061
thesis	0.065	justice	0.049	car	0.031

Given such a list it would be possible to take the first three terms from each document and call them the most significant subject “tags”. Thus, Document #1 is about airplanes, shoes, and computers. Document #2 is about Milton, Shakespeare, and cars. Document #3 is about buildings, ceilings, and cleaning.

Probably a better way to assign “aboutness” to each document is to first denote a TFIDF lower bounds and then assign terms with greater than that score to each document. Assuming a lower bounds of 0.2, Document #1 is about airplanes and shoes. Document #2 is about Milton, Shakespeare, cars, and books. Document #3 is about buildings, ceilings, and cleaning.
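The lower bounds approach can be sketched in a few lines of Perl. The scores below are copied from the Document #2 column of Table #5; the subroutine name tags, its parameters, and the fallback to the top-scoring words when nothing clears the bounds are assumptions made for illustration.

```perl
use strict;
use warnings;

# TFIDF scores for Document #2, copied from Table #5
my %document2 = (
    milton => 0.439, shakespeare => 0.293, car   => 0.256, book    => 0.220,
    pond   => 0.146, slavery     => 0.146, rose  => 0.122, newton  => 0.110,
    chair  => 0.098, thesis      => 0.073, truck => 0.073, justice => 0.049,
);

# return the words scoring above a lower bounds; if none do, fall back to
# the top-scoring words (both thresholds are illustrative parameters)
sub tags {
    my ( $scores, $lowerbounds, $top ) = @_;
    my @sorted = sort { $$scores{ $b } <=> $$scores{ $a } or $a cmp $b } keys %$scores;
    my @tags   = grep { $$scores{ $_ } > $lowerbounds } @sorted;
    if ( ! @tags ) {
        my $last = $top < @sorted ? $top - 1 : $#sorted;
        @tags = @sorted[ 0 .. $last ];
    }
    return @tags;
}

print join( ', ', tags( \%document2, 0.2, 3 ) ), "\n";
print join( ', ', tags( \%document2, 0.5, 3 ) ), "\n";
```

With a lower bounds of 0.2 the script lists milton, shakespeare, car, and book. With a lower bounds of 0.5 nothing qualifies, so it falls back to the top three: milton, shakespeare, and car.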

Discussion and conclusion

Since the beginning, librarianship has focused on the semantics of words in order to create a cosmos from an apparent chaos. “What is this work about? Read the descriptive information regarding a work (author, title, publisher, date, notes, etc.) to work out in your mind its importance.” Unfortunately, this approach leaves much up to interpretation. One person says this document is about horses, and the next person says it is about husbandry.

The mathematical approach is more objective and much more scalable. While not perfect, there is much less interpretation required with TFIDF. It is just about mathematics. Moreover, it is language independent; it is possible to weigh terms and provide relevance ranking without knowing the meaning of a single word in the index.

In actuality, the whole thing is not an either/or sort of question, but instead a both/and sort of question. Human interpretation provides an added value, definitely. At the same time the application of mathematics (“Can you say ‘science’?”) proves to be quite useful too. The approaches complement each other; they are arscient. Much of how we have used computers in libraries has simply been to automate existing processes. We have still to learn how to truly take advantage of a computer’s functionality. It can remember things a whole lot better than we can. It can add a whole lot faster than we can. Because of this it is almost trivial to calculate ( C / T ) * ( D / DF ) over an entire corpus of 2,000,000 MARC records or even 1,000,000 full text documents.

None of these ideas are new. It is possible to read articles describing these techniques going back about 40 years. Why has our profession not used them to our advantage? Why is it taking us so long? If you have an answer, then enter it in the comment box below.

This first posting has focused on the fundamentals of TFIDF. Part II will describe a Perl program implementing relevancy ranking and automatic classification against sets of given text files. Part III will explore the idea of using TFIDF to enable users to find documents alluding to “great ideas” or “great people”.

A day at CIL 2009

Eric Lease Morgan — Sat, 04 Apr 2009 00:40:41 +0000

This documents my day-long experiences at the Computers in Libraries annual conference, March 31, 2009. In a sentence, the meeting was well-attended and covered a wide range of technology issues.

Washington Monument

The day began with an interview-style keynote address featuring Paul Holdengraber (New York Public Library) interviewed by Erik Boekesteijn (Library Concept Center). As the Director of Public Programs at the Public Library, Holdengraber’s self-defined task is to “levitate the library and make the lions on the front steps roar.” Well-educated, articulate, creative, innovative, humorous, and cosmopolitan, he facilitates sets of programs in the library’s reading room called “Live from the New York Public Library” where he interviews people in an effort to make the library — a cultural heritage institution — less like a mausoleum for the Old Masters and more like a place where great ideas flow freely. A couple of notable quotes included “My mother always told me to be porous because you have two ears and only one mouth” and “I want to take the books from the closed stacks and make people desire them.” Holdengraber’s enthusiasm for his job is contagious. Very engaging as well as interesting.

During the first of the concurrent sessions I gave a presentation called “Open source software: Controlling your computing environment” where I first outlined a number of definitions and core principles of open source software. I then tried to draw a number of parallels between open source software and librarianship. Finally, I described how open source software can be applied in libraries. During the presentation I listed four skills a library needs to master in order to take advantage of open source software (namely, relational databases, XML, indexing, and some sort of programming language), but in retrospect I believe basic systems administration skills are the things really required since the majority of open source software is simply installed, configured, and used. Few people feel the need to modify its functionality and therefore the aforementioned skills are not critical, only desirable.

Lincoln Memorial

In “Designing the Digital Experience” by David King (Topeka & Shawnee County Public Library) attendees were presented with ways websites can be created in a way that digitally supplements the physical presence of a library. He outlined the structural approaches to Web design such as the ones promoted by Jesse James Garrett, David Armano and 37Signals. He then compared & contrasted these approaches to the “community path” approaches which endeavor to create a memorable experience. Such things can be done, King says, through conversations, invitations, participation, creating a sense of familiarity, and the telling of stories. It is interesting to note that these techniques are not dependent on Web 2.0 widgets, but can certainly be implemented through their use. Throughout the presentation he brought all of his ideas home through the use of examples from the websites of Harley-Davidson, Starbucks, American Girl, and Webkinz. Not ironically, Holdengraber was doing the same thing for the Public Library except in the real world, not through a website.

In a session after lunch called “Go Where The Client Is” Natalie Collins (NRC-CISTI) described how she and a few co-workers converted library catalog data containing institutional repository information as well as SWETS bibliographic data into NLM XML and made it available for indexing by Google Scholar. In the end, she discovered that this approach was much more useful to her constituents when compared to the cool (“kewl”) Web Services-based implementation they had created previously. Holly Hibner (Salem-South Lyon District Library) compared & contrasted the use of tablet PCs with iPods for use during roaming reference services. My two take-aways from this presentation were cool (“kewl”) services called drop.io and LinkBunch, websites making it easier to convert data from one format into another and bundle lists of links together into a single URL, respectively.

Jefferson Memorial

The last session for me that day was one on open source software implementations of “next generation” library catalogs, specifically Evergreen. Karen Collier and Andrea Neiman (both of Kent County Public Library) outlined their implementation process of Evergreen in rural Michigan. Apparently it began with the re-upping of their contract for their computer hardware. Such a thing would cost more than they expected. This led to more investigations which ultimately resulted in the selection of Evergreen. “Open source seemed like a logical conclusion.” They appear to be very happy with their decision. Karen Schneider (Equinox Software) gave a five-minute “lightning talk” on the who and what of Equinox and Evergreen. Straight to the point. Very nice. Ruth Dukelow (Michigan Library Consortium) described how participating libraries have been brought on board with Evergreen, and she outlined the reasons why Evergreen fit the bill: it supported MLCat compliance, it offered an affordable hosted integrated library system, it provided access to high quality MARC records, and it offered a functional system to non-technical staff.

I enjoyed my time there in Washington, DC at the conference. Thanks go to Ellyssa Kroski, Steven Cohen, and Jane Dysart for inviting me, and allowing me to share some of my ideas. The attendees at the conference were not as technical as you might find at Access, Code4Lib, and certainly not JCDL nor ECDL. This is not a bad thing. The people were genuinely interested in the things presented, but I did overhear one person say, “This is completely over my head.” The highlight for me took place during the last session where one of the speakers was singing the praises of open source software for all the same reasons I had been expressing over the past twelve years. “It is so much like the principles of librarianship,” she said. That made my day.

Quick Trip to Purdue

Eric Lease Morgan — Thu, 02 Apr 2009 02:00:33 +0000

Last Friday, March 27, I was invited by Michael Witt (Interdisciplinary Research Librarian) at Purdue University to give a presentation to the library faculty on the topic of “next generation” library catalogs. During the presentation I made an effort to have the participants ask and answer questions such as “What is the catalog?”, “What is it expected to contain?”, “What functions is it expected to perform and for whom?”, and most importantly, “What problems is it expected to solve?”

I then described how most of the current “next generation” library catalog thingees are very similar. Acquire metadata records. Optionally store them in a database. Index them (with Lucene). Provide services against the index (search and browse). I then brought the idea home by describing in more detail how things like VuFind, Primo, Koha, Evergreen, etc. all use this model. I then made an attempt to describe how our “next generation” library catalogs could go so much further by providing services against the texts as well as services against the index. “Discovery is not the problem that needs to be solved.”
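The acquire/store/index/serve model can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a plain dictionary stands in for the database, a hand-rolled inverted index stands in for Lucene, and the three records and all of the function names are invented for the example. None of the systems named above work exactly this way.

```python
from collections import defaultdict

# Step 1: acquire metadata records (made-up examples).
records = [
    {"id": 1, "title": "Walden", "author": "Henry David Thoreau"},
    {"id": 2, "title": "Civil Disobedience", "author": "Henry David Thoreau"},
    {"id": 3, "title": "Leaves of Grass", "author": "Walt Whitman"},
]

# Step 2: optionally store them; a dict keyed on id stands in for a database.
store = {record["id"]: record for record in records}


def build_index(store):
    """Step 3: index the records; map each word to the ids containing it."""
    index = defaultdict(set)
    for identifier, record in store.items():
        for field in ("title", "author"):
            for word in record[field].lower().split():
                index[word].add(identifier)
    return index


def search(index, query):
    """Step 4a: search; return ids of records containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


def browse(store, field):
    """Step 4b: browse; list the distinct values of one field."""
    return sorted({record[field] for record in store.values()})


index = build_index(store)
print(search(index, "thoreau"))
print(browse(store, "author"))
```

Notice that everything here operates on the metadata. Services against the texts themselves would need the full text of each item, which is exactly what this model leaves out.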

Afterwards a number of us went to lunch where we compared & contrasted libraries. It is a shame the Purdue University, University of Indiana, and University of Notre Dame libraries do not work more closely together. Our strengths complement each other in so many ways.

“Michael, thanks for the opportunity!”

Something I saw on the way back home.

Library Technology Conference, 2009: A Travelogue

Eric Lease Morgan — Thu, 02 Apr 2009 01:31:40 +0000

This posting documents my experiences at the Library Technology Conference at Macalester College (St. Paul, Minnesota) on March 18-19, 2009. In a sentence, this well-organized regional conference provided professionals from near-by states an opportunity to listen, share, and discuss ideas concerning the use of computers in libraries.

Wallace Library

Dayton Center

Day #1, Wednesday

The Conference, sponsored by Macalester College — a small, well-respected liberal arts college in St. Paul — began with a keynote presentation by Stacey Greenwell (University of Kentucky) called “Applying the information commons concept in your library”. In her remarks the contagiously energetic Ms. Greenwell described how she and her colleagues implemented the “Hub”, an “active learning place” set in the library. After significant amounts of planning, focus group interviews, committee work, and on-going cooperation with the campus computing center, the Hub opened in March of 2007. The whole thing is designed to be a fun, collaborative learning commons equipped with computer technology and supported by librarian and computer consultant expertise. Some of the real winners in her implementation include the use of white boards, putting every piece of furniture on wheels, including “video walls” (displaying items from special collections, student art, basketball games, etc.), and hosting parties where as many as 800 students attend. Greenwell’s enthusiasm was inspiring.

Most of the Conference was made up of sets of concurrent sessions, and the first one I attended was given by Jason Roy and Shane Nackerund (both of the University of Minnesota) called “What’s cooking in the lab?” Roy began by describing both a top-down and bottom-up approach to the curation and maintenance of special collections content. Technically, their current implementation includes a usual cast of characters (DSpace, finding aids managed with DLXS, sets of images, and staff), but sometime in the near future he plans on implementing a more streamlined approach consisting of Fedora for the storage of content with sets of Web Services on top to provide access. It was also interesting to note their support for user-contributed content. Users supply images. Users tag content. Images and tags are used to supplement more curated content.

Nackerund demonstrated a number of tools he has been working on to provide enhanced library services. One was the Assignment Calculator — a tool to outline what steps need to be done to complete library-related, classroom-related tasks. He has helped implement a mobile library home page by exploiting Web Service interfaces to the underlying systems. While the Web Service APIs are proprietary, they are a step in the right direction for further exploitation. He has implemented sets of course pages — as opposed to subject guides — too. “I am in this class, what library resources should I be using?” (The creation of course guides seems to be a trend.) Finally, he is creating a recommender service of which the core is the creation of “affinity strings” — a set of codes used to denote the characteristics of an individual as opposed to specific identifiers. Of all the things from the Conference, the idea of affinity strings struck me the hardest. Very nice work, and documented in a Code4Lib Journal article to boot.

In the afternoon I gave a presentation called “Technology Trends and Libraries: So many opportunities”. In it I described why mobile computing, content “born digital”, the Semantic Web, search as more important than browse, and the wisdom of crowds represent significant future directions for librarianship. I also described the importance of not losing sight of the forest for the trees. Collection, organization, preservation, and dissemination of library content and services are still the core of the profession, and we simply need to figure out new ways to do the work we have traditionally done. “Libraries are not warehouses of data and information as much as they are gateways to learning and knowledge. We must learn to build on the past and evolve, instead of clinging to it like a comfortable sweater.”

Later in the afternoon Marian Rengal and Eric Celeste (both of the Minnesota Digital Library) described the status of the Minnesota Digital Library in a presentation called “Where we are”. Using ContentDM as the software foundation of their implementation, the library includes many images supported by “mostly volunteers just trying to do the right thing for Minnesota.” What was really interesting about their implementation is the way they have employed a building block approach. PMWiki to collaborate. The Flickr API to share. Pachyderm to create learning objects. One of the most notable quotes from the presentation was “Institutions need to let go of their content to a greater degree; let them have a life of their own.” I think this is something that needs to be heard by many of us in cultural heritage institutions. If we make our content freely available, then we will be facilitating the use of the content in unimagined ways. Such is a good thing.

St. Paul Cathedral

Balboa facade

Day #2, Thursday

The next day was filled with concurrent sessions. I first attended one by Alec Sonsteby (Concordia College) entitled “VuFind: the MnPALS Experience” where I learned how MnPALS — a library consortium — brought up VuFind as their “discovery” interface. They launched VuFind in August of 2008, and they seem pretty much satisfied with the results.

During the second round of sessions I led a discussion/workshop regarding “next generation” library catalogs. In it we asked and tried to answer questions such as “What is the catalog?”, “What does it contain?”, “What functions is it expected to fulfill and for whom?”, and most importantly, “What is the problem it is expected to solve?” I then described how many of the current crop of implementations function very similarly. Dump metadata records. Often store them in a database. Index them (with Lucene). Provide services against the index (search and browse). I then tried to outline how “next generation” library catalogs could do more, namely provide services against the texts as well as the index.

The last session I attended was about ERMs — Electronic Resource Management systems. Don Zhou (William Mitchell College of Law) described how he implemented Innovative Interfaces’ ERM. “The hard part was getting the data in.” Dani Roach and Carolyn DeLuca (both of University of St. Thomas) described how they implemented a Serials Solutions… solution. “You need to be adaptive; we decided to do things one way and then went another… It is complex, not difficult, just complex. There have to be many tools to do ERM.” Finally, Galadriel Chilton (University of Wisconsin – La Crosse) described an open source implementation written in Microsoft Access, but “it does not do electronic journals.”

In the afternoon Eric C. was gracious enough to tour me around the Twin Cities. We saw the Cathedral of Saint Paul, the Mississippi River, and a facade by Balboa. But the most impressive thing I saw was the University of Minnesota’s “cave” — an onsite storage facility for the University’s libraries. All the books they want to withdraw go here where they are sorted by size, placed into cardboard boxes assigned to a bar code, and put into rooms 100 yards long and three stories high. The facility is manned by two people, and in ten years they have only lost two books out of the 1.3 million. The place is so huge you can literally drive a tractor-trailer truck into the place. Very impressive, and I got a personal tour. “Thanks Eric!”

Eric and Eric

St. Anthony Falls

Summary

I sincerely enjoyed the opportunity to attend this conference. Whenever I give talks I feel the need to write up a one-page handout. That process forces me to articulate my ideas in writing. When I give the presentation it is not all about me, but rather learning about the environments of my peers. It is an education all around. This particular regional conference was the right size, about 250. Many of the attendees knew each other. They caught up and learned things along the way. “Good job Ron Joslin!” The only thing I missed was a photograph of Mary Tyler Moore. Maybe next time.

Code4Lib Open Source Software Award

Eric Lease Morgan — Fri, 06 Mar 2009 00:13:28 +0000

As a community, let’s establish the Code4Lib Open Source Software Award.

Lots of good work gets produced by the Code4Lib community, and I believe it is time to acknowledge these efforts in some tangible manner. Our profession is full of awards for leadership, particular aspects of librarianship, scholarship, etc. Why not an award for the creation of software? After all, the use of computers and computer software is an essential part of our day-to-day work. Let’s grant an award for something we value — good, quality, open source software.

While I think the idea of an award is a laudable one, I have more questions than answers about the process of implementing it. Is such a thing sustainable, and if so, then how? Who is eligible for the award? Only individuals? Teams? Corporate entities? How are awardees selected? Nomination? Vote? A combination of the two? What qualities should the software exemplify? Something that solves a problem for many people? Something with a high “cool factor”? Great documentation? Easy to install? Well-supported with a large user base? Developed within the past year?

As a straw man for discussion, I suggest something like the following:

Regarding selection, I suggest there be a committee who solicits nominations and selects the awardee(s). As the years go by an individual from the committee drops off and the/an awardee becomes a member.
Regarding who is eligible, I suggest it be individuals, teams, or corporate entities. Awardees must be willing to serve on the next year’s nominating committee.
Regarding what is eligible, I suggest the software be open source, directly library-related, and developed within the past two years.
Regarding the timing, I suggest this be an annual award given at each Code4Lib conference.

These are just suggestions to get us started. What do you think? Consider sharing your thoughts as comments below, in channel, or on the Code4Lib mailing list.

Code4Lib Conference, Providence (Rhode Island) 2009

Eric Lease Morgan — Wed, 04 Mar 2009 01:16:15 +0000

This posting documents my experience at the Code4Lib Conference in Providence, Rhode Island, February 23-26, 2009. To summarize my experiences, I went away with a better understanding of linked data, it is an honor to be a part of this growing and maturing community, and finally, this conference is yet another example of how a number of opportunities for libraries exist if only you think more about the whats of librarianship as opposed to the hows.

Day #0 (Monday, February 23) – Pre-conferences

On the first day I facilitated a half-day pre-conference workshop, one of many, called XML In Libraries. Designed as a full-day event, this workshop was not one of my better efforts. (“I sincerely apologize.”) Everybody brought their own computer, but some of them could not get on the ‘Net. The first half of the workshop should be trimmed down significantly since many of the attendees knew what was being explained. Finally, the hands-on part of the workshop with JEdit was less than successful because it refused to work for me and many of the participants. Lessons learned, and things to keep in mind for next time.

For the better part of the afternoon, I sat in on the WorldCat Grid Services pre-conference where we were given an overview of SRU from Ralph Levan. There was then a discussion on how the Grid Services could be put into use.

During the last part of the pre-conference afternoon I attended the linked data session. Loosely structured and by far the best attended event, I garnered an overview of what linked data services are and what are some of the best practices for implementing them. I had a very nice chat with Ross Singer who helped me bring some of these concepts home to my Alex Catalogue. Ironically, the Catalogue is well on its way to being exposed via a linked data model since I have previously written sets of RDF/XML files against its underlying content. The key seems to be to link together as many HTTP-based URIs as possible while providing content-negotiation services in order to disseminate your information in the most readable/usable formats possible.
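For what it is worth, the content-negotiation half of that recipe might look something like the following Python sketch. Given the Accept header a client sends and the list of formats a server can produce for a URI, it picks the format to return. The function name is my own invention, the parsing is deliberately simplified (it honors q-values and a bare */* wildcard, but not partial wildcards such as text/*), and a real server would hand this job to its Web framework.

```python
def negotiate(accept_header, available):
    """Pick the available media type the client prefers most, or None."""
    preferences = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for parameter in parts[1:]:
            if parameter.startswith("q="):
                try:
                    quality = float(parameter[2:])
                except ValueError:
                    quality = 0.0
        # Sort by highest quality first, then by order of appearance.
        preferences.append((-quality, position, media_type))
    for negative_quality, _, media_type in sorted(preferences):
        if negative_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            return available[0]  # anything will do; use the server's default
    return None


# Formats a hypothetical catalogue could offer for each URI.
available = ["text/html", "application/rdf+xml"]

print(negotiate("application/rdf+xml", available))
print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8", available))
```

A Semantic Web client asking for RDF/XML gets the machine-readable description, and a browser asking for HTML gets the human-readable page, all from the same URI.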

Day #1 (Tuesday, February 24)

Code4Lib is a single-track conference, and its 300 or so attendees gathered in a refurbished Masonic Lodge — in the shadows of the Rhode Island State House — for the first day of the conference.

Roy Tennant played Master of Ceremonies for Day #1 and opened the event with an outline of what he sees as the values of the Code4Lib community: egalitarianism, participation, democracy, anarchy, informality, and playfulness. From my point of view, that sums things up pretty well. In an introduction for first-timers, Mark Matienzo (aka anarchivist) described the community as “a bit clique-ish”, a place where there are a lot of inside jokes (think bacon, neck beards, and ++), and a venue where “social capital” is highly valued. Many of these things can most definitely be seen “in channel” by participating in the IRC #code4lib chat room.

In his keynote address, A Bookless Future For Libraries, Stefano Mazzocchi encouraged the audience to think of the “iPod for books” as an ecosystem necessity, not a possibility. He did this by first chronicling the evolution of information technology (speech to cave drawing to clay tablets to fiber to printing to electronic publishing). He outlined the characteristics of electronic publishing: dense, widely available, network accessible, distributed business models, no batteries, lots of equipment, next to zero marginal costs, and poor resolution. He advocated the Semantic Web (a common theme throughout the conference), and used Freebase as a real-world example. One of the most intriguing pieces of information I took away from this presentation was the idea of making games out of data entry in order to get people to contribute content. For example, make it fun to guess whether or not a person was alive, dead, male, or female. Based on the aggregate responses of the “crowd” it is possible to make pretty reasonable guesses as to the truth of facts.
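The aggregation step can be as simple as a majority vote. The Python sketch below is my own illustration, not anything shown in the keynote: the function name and the 60% agreement threshold are assumptions, and a real system would surely weigh contributors and sample sizes more carefully.

```python
from collections import Counter


def crowd_guess(responses, threshold=0.6):
    """Return the most common answer if enough of the crowd agrees, else None."""
    if not responses:
        return None
    answer, votes = Counter(responses).most_common(1)[0]
    # Only accept the answer when its share of the votes meets the threshold.
    return answer if votes / len(responses) >= threshold else None


print(crowd_guess(["dead", "dead", "alive", "dead", "dead"]))
print(crowd_guess(["male", "female"]))
```

When four out of five players agree, the guess is accepted. When the crowd is split down the middle, the function declines to answer, which is the honest thing to do.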

Next, Andres Soderback described his implementation of the Semantic Web world in Why Libraries Should Embrace Linked Data. More specifically, he said library catalogs should be: open, linkable, provide links, be a part of the network, not an end in themselves, and hackable. He went on to say that “APIs suck” because they are: specific, take too much control, not hackable enough, and not really “Web-able”. Not incidentally, he had previously exposed his entire library catalog — the National Library of Sweden — as a set of linked data, but it broke after the short-lived lcsh.info site by Ed Summers had been taken down.

Ross Singer described an implementation and extension to the Atom Publishing Protocol in his Like A Can Opener For Your Data Silo: Simple Access Through AtomPub and Jangle. I believe the core of his presentation can be best described through an illustration where an Atom client speaks to Jangle through Atom/RSS, Jangle communicates with (ILS-) specific applications through “connectors”, and the results are returned back to the client:

                   +--------+       +-----------+ 
  +--------+       |        | <---> | connector |
  | client | <---> | Jangle |       +-----------+ 
  +--------+       |        | <---> | connector |  
                   +--------+       +-----------+
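The same shape can be sketched in Python. To be clear, this is not Jangle's code or its API. The class names are hypothetical, a dictionary stands in for a real ILS, and the broker returns a plain dictionary where the real thing would return an Atom feed. The sketch only shows why connectors are a nice idea: the client learns one interface, and each back-end system gets its own small adapter.

```python
class Connector:
    """Translates a generic request into calls on one back-end system."""

    def fetch(self, resource):
        raise NotImplementedError


class DictConnector(Connector):
    """Stand-in for an ILS-specific connector; it simply reads from a dict."""

    def __init__(self, data):
        self.data = data

    def fetch(self, resource):
        return self.data.get(resource, [])


class Broker:
    """Plays the middle box in the diagram: one interface, many connectors."""

    def __init__(self):
        self.connectors = {}

    def register(self, name, connector):
        self.connectors[name] = connector

    def request(self, name, resource):
        entries = self.connectors[name].fetch(resource)
        # A real implementation would serialize this as an Atom feed.
        return {"service": name, "resource": resource, "entries": entries}


broker = Broker()
broker.register("catalog", DictConnector({"items": ["Walden", "Leaves of Grass"]}))
broker.register("patrons", DictConnector({"actors": ["patron-0001"]}))

print(broker.request("catalog", "items"))
```

Swapping one integrated library system for another then means writing a new connector, not rewriting every client.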

I was particularly impressed with Glen Newton’s LuSql: (Quickly And Easily) Getting Your Data From Your DBMS Into Lucene because it described a Java-based command-line interface for querying SQL databases and feeding the results to the community’s currently favorite indexer — Lucene. Very nice.

Terence Ingram’s presentation RESTafarian-ism At The NLA can be summarized in the phrase “use REST in moderation” because too many REST-ful services linked together are difficult to debug and troubleshoot, and they fall prey to over-engineering.

Based on the number of comments in previous blog postings, Birkin James Diana’s presentation The Dashboard Initiative was a hit. It described sets of simple configurable “widgets” used to report trends against particular library systems and services.

In Open Up Your Repository With A SWORD Ed Summers and Mike Giarlo described a protocol developed through the funding of the good folks at JISC used to deposit materials into an (institutional) repository through the use of the AtomPub protocol.

In an effort to view editorial changes over time against sets of EAD files, Mark Matienzo tried to apply version control software techniques against his finding aids. He described these efforts in How Anarchivist Got His Groove Back 2: DVCS, Archival Description, And Workflow but it seems as if he wasn’t as successful as he had hoped because of the hierarchical nature of his source (XML) data.

Godmar Back in LibX 2.0 described how he was enhancing the LibX API to allow for greater functionality by enhancing its ability to interact with an increased number of external services such as the ones from Amazon.com. Personally, I wonder how well content providers will accept the idea of having content inserted into “their” pages by the LibX extension.

The last formal presentation of the day, djatoka For djummies, was given by Kevin Clark and John Fereira. In it they described the features, functions, advantages, and disadvantages of a specific JPEG2000 image server. Interesting technology that could be exploited more if there were a 100% open source solution.

Day #1 then gave way to about a dozen five-minute “lightning talks”. In this session I shared the state of the Alex Catalogue in Alex4: Yet Another Implementation, and in retrospect I realize I didn’t say a single word about technology but only things about functionality. Hmmm…

Day #2 (Wednesday, February 25)

On the second day of the conference I had the honor of introducing the keynote speaker, Sebastian Hammer. Having known him for at least a few years, I described him as the co-author of the venerable open source Yaz and Zebra software — the same Z39.50 software that drives quite a number of such implementations across Library Land. I also alluded to the time I visited him and his co-workers at Index Data in Copenhagen where we talked shop and shared a very nice lunch in their dot-com-like flat. I thought there were a number of meaty quotes from his presentation. “If you have something to say, then say it in code… I like to write code but have fun along the way… We are focusing our efforts on creating tools instead of applications… We try to create tools to enable libraries to do the work that they do. We think this is fun… APIs are glorified loyalty schemes… We need to surrender our data freely… Standardization is hard and boring but essential… Hackers must become advocates within our organizations.” Throughout his talk he advocated local libraries that: preserve cultural heritage, converge authoritative information, support learning & research, and are pillars of democracy.

Timothy McGeary gave an update on the OLE Project in A New Frontier – The Open Library Environment (OLE). He stressed that the Project is not about the integrated library system but bigger: special collections, video collections, institutional repositories, etc. Moreover, he emphasized that all these things are expected to be built around a Service Oriented Architecture and there is a push to use existing tools for traditional library functions such as the purchasing department for acquisitions or identity management systems for patron files. Throughout his presentation he stressed that this project is all about putting into action a “community source process”.

In Blacklight As A Unified Discovery Platform Bess Sadler described Blacklight as “yet another ‘next-generation’ library catalog”. This seemingly off-hand comment should not be taken as such because the system implements many of the up-and-coming ideas our fledgling “discovery” tools espouse.

Joshua Ferraro walked us through the steps for creating open bibliographic (MARC) data using a free, browser-based cataloging service in a presentation called A New Platform for Open Data – Introducing ±biblios.net Web Services. Are these sort of services, freely provided by the likes of LibLime and the Open Library, the sorts of services that make OCLC reluctant to freely distribute “their” sets of MARC records?

Building on LibLime’s work, Chris Catalfo described and demonstrated a plug-in for creating Dublin Core metadata records using ±biblios.net Web Services in Extending ±biblios, The Open Source Web Based Metadata Editor.

Jodi Schneider and William Denton gave the best presentation I’ve ever heard on FRBR in their What We Talk About When We Talk About FRBR. More specifically, they described “strong” FRBR-ization complete with Works, Manifestations, Expressions, and Items owned by Persons, Families, and Corporate Bodies and having subjects grouped into Concepts, Objects, and Events. Very thorough and easy to understand. schneider++ & denton++ # for a job well-done

In Complete Faceting Toke Eskildsen described his institution’s implementation called Summa from the State and University Library of Denmark.

Erik Hatcher outlined a number of ways Solr can be optimized for better performance in The Rising Sun: Making The Most Of Solr Power. Solr certainly seems to be on its way to becoming the norm for indexing in the Code4Lib community.

A citation parsing application was described by Chris Shoemaker in FreeCite – An Open Source Free-Text Citation Parser. His technique did not seem to be based so much on punctuation (syntax) as much as word groupings. I think we have something to learn from his technique.

Richard Wallis advocated the use of a Javascript library to update and insert added functionality to OPAC screens in his Great Facets, Like Your Relevance, But Can I Have Links To Amazon And Google Book Search? His tool — Juice — shares OPAC-specific information.

The Semantic Web came full-circle through Sean Hannan’s Freebasing For Fun And Enhancement. One of the take-aways I got from this conference is to learn more ways Freebase can be used (exploited) in my everyday work.

During the Lightning Talks I very briefly outlined an idea that has been brewing in my head for a few years, specifically, the idea of an Annual Code4Lib Open Source Software Award. I don’t exactly know how such a thing would get established or be made sustainable, but I do think our community is ripe for such recognition. Good work is done by our people, and I believe it needs to be tangibly acknowledged. I am willing to commit to making this a reality by this time next year at Code4Lib Conference 2010.

Summary

I did not have the luxury of staying the last day of the Conference. I’m sure I missed some significant presentations. Yet, the things I did see were impressive. They demonstrated ingenuity, creativity, and at the same time, practicality — the desire to solve real-world, present-day problems. These things require the use of both sides of a person’s brain. Systematic thinking and intuition; an attention to detail but the ability to see the big picture at the same time. In other words, arscience.

code4lib++

Henry David Thoreau’s Walden

Eric Lease Morgan — Mon, 09 Feb 2009 04:09:17 +0000

As I sit here beside my fire at the cabin, I reflect on the experiences documented by Henry David Thoreau in his book entitled Walden.

Being human

On one level, the book is about a man who goes off to live in a small cabin by a pond named Walden. It describes how he built his home, tended his garden, and walked through the woods. On another level, it is a collection of self-observations and reflections on what it means to be human. “I went to the woods because I wished to live deliberately, to front only the essential facts of life, and see if I could not learn what it has to teach, and not, when I came to die, discover that I had not lived… I wanted to live deep and suck out all the marrow of life, to live so sturdily and Spartan-like as to put to rout all that was not life, to cut a broad swath and shave close, to drive life into a corner, and reduce it to its lowest terms, and, if it proved to be mean, why then to get the whole and genuine meanness of it, and publish its meanness to the world.”

Selected chapters

The book doesn’t really have a beginning, a middle, and an end. There is no hero, no protagonist, no conflict, and no climax. Instead, the book is made up of little stories amassed over the period of one and a half years while living alone. The chapter called “Economy” is an outline of the necessities of life such as clothing, shelter, and food. It cost him $28 to build his cabin, and he grew much of his own food. “Yet men have come to such a pass that they frequently starve not for want of necessities, but for want of luxuries.”

I also enjoyed the chapter called “The Bean-Field”. “I have come to love my rows, my beans, though so many more than I wanted.” Apparently he had as many as seven miles of beans, if they were all strung in a row. Even over two acres of ground, I find that hard to believe. He mentions woodchucks often in the chapter as well as throughout the book, and he dislikes them because they eat his crop. I always thought woodchucks — ground hogs — were particularly interesting since they were abundant around the property where I grew up. In relation to economy, Thoreau spent just less than $14 on gardening expenses, and after selling his crop made a profit of almost $9. “Daily the beans saw me come to their rescue armed with a hoe, and thin their ranks of the enemies, filling up the trenches with weedy dead.”

The chapter called “Sounds” is full of them or allusions to them: voice, rattle, whistle, scream, shout, ring, announce, hissing, bells, sung, lowing, serenaded, music chanted, cluck, buzzing, screech, wailing, trilled, sighs, hymns, threnodies, gurgling, hooting, baying, trump, bellowing, crow, bark, laughing, cackle, creaking, and snapped. Almost a cacophony, but at the same time a possible symphony. It depends on your perspective.

While he lived alone, he was seemingly never lonely. In fact, he seemed to attract visitors or sought them out himself. Consider the wood chopper who was extra skilled at his job. Reflect on the Irish family who lived “rudely”. Compare and contrast the well-to-do professional with manners to the man who lived in a hollow log. (I wonder whether or not that second man really existed.)

Thoreau’s descriptions of the pond itself were arscient. [1] He describes its color, its depth, and overall size. He ponders where it got its name, its relation to surrounding ponds, and where its water comes from and goes. He fishes in it regularly, and walks upon its ice in the winter. He describes how men harvest its ice and how the pond keeps most of the effort. He appreciates the appearance of the pond as he observes it during different times of year as well as from different vantage points. In my mind, it is a good thing to observe anything and just about everything from many points of view, both literally and figuratively.

Conclusion

The concluding chapter has a number of meaty thoughts. “I left the woods for as good a reason as I went there. Perhaps it seemed to me that I had several more lives to live, and could not spare any more time for that one… I learned this, at least, by my experiment: that if one advances confidently in the direction of his dreams, and endeavors to live the life which he has imagined, he will meet with a success unexpected in common hours… If a man does not keep pace with his companions, perhaps it is because he hears a different drummer. Let him step to the music which he hears, however measured or far away… However mean your life is, meet it and live it; do not shun it and call it hard names… Love your life, poor as it is… Rather than love, than money, than fame, give me truth.”

Word cloud

As a service against the text, and as a means to learning about it more quickly, I give you the following word cloud (think “concordance”) complete with links to the places in the text where the words can be found:

life pond most house day though water many time never about woods without much yet long see before first new ice well down little off know own old nor good part winter far way being last after heard live great world again nature shore morning think work once same walden thought feet spring earth here perhaps night side sun things surface few thus find found summer must true got also years village enough myself half poor seen air better put read till small within wood cannot fire ground deep end bottom left nothing went away place almost least

Note

[1] Arscience — art-science — is a term I use to describe a way of thinking incorporating both artistic and scientific elements. Arscient thinking is poetic, intuitive, free-flowing, and at the same time it is systematic, structured, and repeatable. To my mind, a person requires both in order to create a cosmos from the apparent chaos of our surroundings.

Eric Lease Morgan’s Top Tech Trends for ALA Mid-Winter, 2009

Eric Lease Morgan — Mon, 09 Feb 2009 04:03:00 +0000

This is a list of “top technology trends” written for ALA Mid-Winter, 2009. They are presented in no particular order. [This text was originally published on the LITA Blog, but it is duplicated here because “lots of copies keep stuff safe.” –ELM]

Indexing with Solr/Lucene works well – Lucene seems to have become the gold standard when it comes to open source indexer/search engine platforms. Solr, a Web Services interface to Lucene, is increasingly the preferred way to read & write Lucene indexes. Librarians love to create lists. Books. Journals. Articles. Movies. Authoritative names and subjects. Websites. Etc. All of these lists beg for organization. Thus, (relational) databases. But lists need to be short, easily sortable, and/or searchable in order to be useful as finding aids. Indexers make things searchable, not databases. The library profession needs to get its head around the creation of indexes. The Solr/Lucene combination is a good place to start (er, catch up).

Linked data is a new name for the Semantic Web – The Semantic Web is about creating conceptual relationships between things found on the Internet. Believe it or not, the idea is akin to the ultimate purpose of a traditional library card catalog. Have an item in hand. Give it a unique identifier. Systematically describe it. Put all the descriptions in one place and allow people to navigate the space. By following the tracings it is possible to move from one manifestation of an idea to another, ultimately providing the means to the discovery, combination, and creation of new ideas. The Semantic Web is almost exactly the same thing except the “cards” are manifested using RDF/XML on computers through the Internet. From the beginning RDF has gotten a bad name. “Too difficult to implement, and besides the Semantic Web is a thing of science fiction.” Recently the term “linked data” has been used to denote the same process of creating conceptual relationships between things on the ‘Net. It is the Semantic Web by a different name. There is still hope.

Blogging is peaking – There is no doubt about it. The Blogosphere is here to stay, yet people have discovered that it is not very easy to maintain a blog for the long haul. The technology has made it easier to compose and distribute one’s ideas, much to the chagrin of newspaper publishers. On the other hand, the really hard work is coming up with meaningful things to say on a regular basis. People have figured this out, and consequently many blogs have gone by the wayside. In fact, I’d be willing to bet that the number of new blogs is decreasing, and the number of postings to existing blogs is decreasing as well. Blogging was “kewl”. It is still cool, but it is also hard work. Blogging is peaking. And by the way, I dislike those blogs which are only partially syndicated. They allow you to read the first 256 characters or so of an entry, and then encourage you to go to their home site to read the whole story, whereby you are bombarded with loads of advertising.

Word/tag clouds abound – It seems very fashionable to create word/tag clouds now-a-days. When you get right down to it, word/tag clouds are a whole lot like concordances — one of the first types of indexes. Each word (or tag) in a document is itemized and counted. Stop words are removed, and the results are sorted either alphabetically or numerically by count. This process — especially if it were applied to significant phrases — could be a very effective and visual way to describe the “aboutness” of a file (electronic book, article, mailing list archive, etc.). An advanced feature is to hyperlink each word, tag, or phrase to specific locations in the file. Given a set of files on similar themes, it might be interesting to create word/tag clouds against them in order to compare and contrast. Hmmm…
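The counting process described above can be sketched in a few lines of Perl. This is only a minimal illustration, not a finished tool: the stop word list is a tiny, hypothetical one, the subroutine names (count_words and sort_by_count) are made up for the example, and the sample sentence is borrowed from Walden.

```perl
#!/usr/bin/perl

# cloud.pl - count the words in a text; a sketch of a word/tag cloud

use strict;
use warnings;

# a tiny, hypothetical stop word list; a real one would be much longer
my %stopwords = map { $_ => 1 } qw( a an and as at i in is it of on or the to );

# given a string, return a hash reference of words and their counts
sub count_words {

  my $text = shift;
  my %count;
  foreach my $word ( lc( $text ) =~ /([a-z']+)/g ) {

    next if ( $stopwords{ $word } );
    $count{ $word }++;

  }
  return \%count;

}

# given counts, return the words sorted by count (descending), then alphabetically
sub sort_by_count {

  my $count = shift;
  return sort { $count->{ $b } <=> $count->{ $a } or $a cmp $b } keys %$count;

}

# demonstrate with a sentence fragment borrowed from Walden
my $text  = 'I wanted to live deep and suck out all the marrow of life, to live so sturdily';
my $count = count_words( $text );
foreach my $word ( sort_by_count( $count ) ) { print "$word (" . $count->{ $word } . ")\n" }
```

Counting significant phrases instead of single words, or hyperlinking each word back to its locations in the file, would build on the same hash of counts.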

“Next Generation” library catalogs seem to be defined – From my perspective, the profession has stopped asking questions about the definition of “next generation” library catalogs. I base this statement on two things. First, the number of postings and discussion on a mailing list called NGC4Lib has dwindled. There are fewer questions and even less discussion. Second, the applications touting themselves, more or less, as “next generation” library catalog systems all have similar architectures. Ingest content from various sources. Normalize it into an internal data structure. Store the normalized data. Index the normalized data. Provide access to the index as well as services against the index such as tag, review, and Did You Mean? All of this is nice, but it really isn’t very “next generation”. Instead it is slightly more of the same. An index allows people to find, but people are still drinking from the proverbial fire hose. Anybody can find. In my opinion, the current definition of “next generation” does not go far enough. Library catalogs need to provide an increased number of services against the content, not just services against the index. Compare & contrast. Do morphology against. Create word cloud from. Translate. Transform. Buy. Review. Discuss. Share. Preserve. Duplicate. Trace idea, citation, and/or author forwards & backwards. It is time to go beyond novel ways to search lists.

SRU is becoming more viable – SRU (Search/Retrieve via URL) is a Web Services-based protocol for searching databases/indexes. Send a specifically shaped URL to a remote HTTP server. Get back a specifically shaped response. SRU has been joined with a no-longer competing standard called OpenSearch in the form of an Abstract Protocol Definition, and the whole is on its way to becoming an OASIS standard. Just as importantly, an increasing number of the APIs supporting the external-facing OCLC Grid Services (WorldCat, Identities, Registries, Terminologies, Metadata Crosswalk) use SRU as the query interface. SRU has many advantages, but some of those advantages are also disadvantages. For example, its query language (CQL) is expressive, especially compared to OpenSearch or Google, but at the same time, it is not easy to implement. Second, the nature of SRU responses can range from rudimentary and simple to obtuse and complicated. Moreover, the response is always in XML. These factors sometimes make transforming the response for human consumption difficult to implement. Despite all these things, I think SRU is a step in the right direction.

The pendulum of data ownership is swinging – I believe it was Francis Bacon who said, “Knowledge is power”. In my epistemological cosmology, knowledge is based on information, and information is based on data. (Going the other way, knowledge leads to wisdom, but that is another essay.) Therefore, he who owns or has access to the data will ultimately have more power. Google increasingly has more data than just about anybody. They have a lot of power. OCLC increasingly “owns” the bibliographic data created by its membership. Ironically, this data, in the case of both Google and OCLC, is not freely available, even when the data was created for the benefit of the wider whole. I see this movement as akin to the movement of a pendulum swinging one way and then the other. On my more pessimistic days I view it as a battle. On my calmer days I see it as a natural tendency, a give and take. Many librarians I know are in the profession, not for the money, but to support some sort of cause. Intellectual freedom. The right to read. Diversity. Preservation of the historical record. If I have a cause, then it is about the free and equal access to information. This is why I advocate open access publishing, open source software, and Net Neutrality. When data and information are “owned” and “sold”, an environment of information haves and have-nots manifests itself. Ultimately, this leads to individual gain but not necessarily the improvement of the human condition as a whole.

The Digital Dark Age continues – We, as a society, are continuing to create a Digital Dark Age. Considering all of the aspects of librarianship, the folks who deal with preservation, conservation, and archives have the toughest row to hoe. It is ironic. On one hand there is more data and information available than just about anybody knows what to do with. On the other hand, much of this data and information will not be readable, let alone available, in the foreseeable future. Somebody is going to want to do research on the use of blogs and email. What libraries are archiving this data? We are writing reports and summaries in binary and proprietary formats. Such things are akin to music distributed on 8-track tapes. Where are the gizmos enabling us to read these formats? We increasingly license our most desired content (scholarly journal articles) and in the end we don’t own anything. With the advent of Project Gutenberg, Google Books, and the Open Content Alliance the numbers of freely available electronic books rival the collections of many academic libraries. Who is collecting these things? Do we really want to put all of our eggs into one basket and trust these entities to keep them for the long haul? The HathiTrust understands this phenomenon, and “Lots of copies keep stuff safe.” Good. In the current environment of networked information, we need to re-articulate the definition of “collection”.

Finally, regarding change. It manifests itself along a continuum. At one end is evolution. Slow. Many false starts. Incremental. At the other end is revolution. Fast. Violent. Decisive. Institutions and their behaviors change slowly. Otherwise they wouldn’t be the same institutions. Librarianship is an institution. Its behavior changes slowly. This is to be expected.

YAAC: Yet Another Alex Catalogue

Eric Lease Morgan — Tue, 03 Feb 2009 01:30:53 +0000

I have implemented another version of my Alex Catalogue of Electronic Texts. More specifically, I have dropped the use of one indexer and replaced it with Solr/Lucene. See http://infomotions.com/alex/. This particular implementation does not have all the features of the previous one. No spell check. No thesaurus. No query suggestions. On the other hand, it does support paging, and since it runs under mod_perl, it is quite responsive.

As always I am working on the next version, and you can see where I’m going at http://infomotions.com/sandbox/alex4/. Like the implementation above, this one runs under mod_perl and supports paging. Unlike the implementation above, it also supports query suggestions, a thesaurus, and faceted browsing. It also sports the means to view metadata details. Content-wise, it includes images, journal titles, journal articles, and some content from the HathiTrust.

It would be great if I were to get some feedback regarding these implementations. Are they easy to use?

ISBN numbers

Eric Lease Morgan — Mon, 02 Feb 2009 05:04:31 +0000

I’m beginning to think about ISBN numbers and the Alex Catalogue of Electronic Texts. For example, I can add ISBN numbers to Alex, link them to my (fledgling) LibraryThing collection, and display lists of recently added items here:

Interesting, but I think the list will change over time, as new things get added to my collection. It would be nice to link to a specific item. Hmm…

On the other hand, I could exploit ISBN numbers and OpenLibrary using a WordPress plug-in called OpenBook Book Data by John Miedema. It displays cover art and a link to OpenLibrary as well as WorldCat.

Again, very interesting. For more details, see the “OpenBook WordPress Plugin: Open Source Access to Bibliographic Data” in Code4Lib Journal.

A while ago I wrote a CGI script that took ISBN numbers as input and fed them to xISBN and/or ThingISBN to suggest alternative titles. I called it Send It To Me.

Then of course there is the direct link to Amazon.com.

I suppose it is nice to have choice.

Fun with WebService::Solr, Part III of III

Eric Lease Morgan — Fri, 23 Jan 2009 01:46:26 +0000

This is the last of a three-part series providing an overview of a set of Perl modules called WebService::Solr. In Part I, WebService::Solr was introduced with two trivial scripts. Part II put forth two command line driven scripts to index and search content harvested via OAI. Part III illustrates how to implement a Search/Retrieve via URL (SRU) search interface against an index created by WebService::Solr.

Search/Retrieve via URL

Search/Retrieve via URL (SRU) is a REST-like Web Service-based protocol designed to query remote indexes. The protocol essentially consists of three functions or “operations”. The first, explain, provides a mechanism to auto-discover the type of content and capabilities of an SRU server. The second, scan, provides a mechanism to browse an index’s content much like perusing the back-of-a-book index. The third, searchRetrieve, provides the means for sending a query to the index and getting back a response. Many of the librarians in the crowd will recognize SRU as the venerable Z39.50 protocol redesigned for the Web.
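To make the searchRetrieve operation a bit more concrete, here is a minimal sketch of how a client might shape such a URL. The parameter names (version, operation, query, startRecord, and maximumRecords) come from the SRU specification, but the server address (http://example.org/sru) is hypothetical, the two subroutines are made up for the illustration, and the escaping routine assumes plain ASCII queries.

```perl
#!/usr/bin/perl

# sru-url.pl - build an SRU searchRetrieve URL; a sketch

use strict;
use warnings;

# percent-encode everything except unreserved characters (assumes ASCII input)
sub escape {

  my $s = shift;
  $s =~ s/([^A-Za-z0-9\-_.~])/sprintf( '%%%02X', ord( $1 ))/ge;
  return $s;

}

# given a base URL, a CQL query, a start record, and a maximum, return a URL
sub search_retrieve_url {

  my ( $base, $query, $start, $max ) = @_;
  my @parameters = ( [ version        => '1.1' ],
                     [ operation      => 'searchRetrieve' ],
                     [ query          => $query ],
                     [ startRecord    => $start ],
                     [ maximumRecords => $max ] );
  return $base . '?' . join( '&', map { $_->[ 0 ] . '=' . escape( $_->[ 1 ] ) } @parameters );

}

# http://example.org/sru is a hypothetical server
print search_retrieve_url( 'http://example.org/sru', 'dc.title = "walden pond"', 1, 10 ), "\n";
```

The explain and scan operations are requested in much the same way, by changing the value of the operation parameter and supplying the parameters each operation expects.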

During the past year, time has been spent joining the SRU community with the OpenSearch community to form a single, more unified set of Search Web Service protocols. OpenSearch has very similar goals to SRU (to provide standardized interfaces for searching indexes), but the techniques between it and SRU are different. Where OpenSearch’s query language is simple, SRU’s is expressive. Where OpenSearch returns an RSS-like data stream, SRU includes the ability to return just about any XML format. OpenSearch may be easier to implement, but SRU is suited for a wider range of applications. To bring SRU and OpenSearch together, and to celebrate similarities as opposed to differences, an OASIS Abstract Protocol Definition has been drafted defining how the searching of Web-based databases and indexes can be done in a standardized way.

SRU is an increasingly important protocol for the library community because a growing number of the WorldCat Grid Services are implemented using SRU. The Grid supports indexes such as lists of library holdings (WorldCat), name and subject authority files (Identities), as well as names of libraries (the Registry). By sending SRU queries to these services and mashing up the results with the output of other APIs, all sorts of library and bibliographic applications can be created.

Integrating WebService::Solr into SRU

Personally, I have been creating SRU interfaces to many of my indexes for about four years. I have created these interfaces against mailing list archives, OAI-harvested content, and MARC records. The underlying content has been indexed with swish-e, Plucene, KinoSearch, and now Lucene through WebService::Solr.

Ironic or not, I use yet another set of Perl modules — available on CPAN and called SRU — written by Brian Cassidy to implement my SRU servers. The form of my implementations is rather simple. Get the input. Determine what operation is requested. Branch accordingly. Do the necessary processing. Return a response.

The heart of my SRU implementation is a subroutine called search. It is within this subroutine where indexer-specific hacking takes place. For example and considering WebService::Solr:

sub search {

  # initialize
  my $query   = shift;
  my $request = shift;
  my @results;
  
  # set up Solr
  my $solr = WebService::Solr->new( SOLR );
    
  # calculate start record and number of records
  my $start_record = 0;
  if ( $request->startRecord ) { $start_record = $request->startRecord - 1 }
  my $maximum_records = MAX; $maximum_records = $request->maximumRecords 
     unless ( ! $request->maximumRecords );

  # search
  my $response   = $solr->search( $query, {
                                  'start' => $start_record,
                                  'rows'  => $maximum_records });
  my @hits       = $response->docs;
  my $total_hits = $response->pager->total_entries;
  
  # if there are hits, then build a record for each one
  if ( $total_hits ) {
  
    foreach my $doc ( @hits ) {
                
      # slurp
      my $id          = $doc->value_for(  'id' );
      my $name        = &escape_entities( $doc->value_for(  'title' ));
      my $publisher   = &escape_entities( $doc->value_for(  'publisher' ));
      my $description = &escape_entities( $doc->value_for(  'description' ));
      my @creator     = $doc->values_for( 'creator' );
      my $contributor = &escape_entities( $doc->value_for(  'contributor' ));
      my $url         = &escape_entities( $doc->value_for(  'url' ));
      my @subjects    = $doc->values_for( 'subject' );
      my $source      = &escape_entities( $doc->value_for(  'source' ));
      my $format      = &escape_entities( $doc->value_for(  'format' ));
      my $type        = &escape_entities( $doc->value_for(  'type' ));
      my $relation    = &escape_entities( $doc->value_for(  'relation' ));
      my $repository  = &escape_entities( $doc->value_for(  'repository' ));

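      # full results, but included entities; hmmm...
      # (the element names here are assumed: a record element containing
      # Dublin Core elements matching the fields slurped above)
      my $record  = '<record xmlns:dc="http://purl.org/dc/elements/1.1/">';
      $record .= '<dc:title>' .  $name . '</dc:title>';
      $record .= '<dc:publisher>' .  $publisher . '</dc:publisher>';
      $record .= '<dc:identifier>' .  $url . '</dc:identifier>';
      $record .= '<dc:description>' .  $description . '</dc:description>';
      $record .= '<dc:source>' . $source . '</dc:source>';
      $record .= '<dc:format>' .  $format . '</dc:format>';
      $record .= '<dc:type>' .  $type . '</dc:type>';
      $record .= '<dc:contributor>' .   $contributor . '</dc:contributor>';
      $record .= '<dc:relation>' .   $relation . '</dc:relation>';
      foreach ( @creator ) { $record .= '<dc:creator>' .  $_ . '</dc:creator>' }
      foreach ( @subjects ) { $record .= '<dc:subject>' . $_ . '</dc:subject>' }
      $record .= '</record>';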
      push @results, $record;
            
    }
    
  }
  
  # done; return it
  return ( $total_hits, @results );
  
}

The subroutine is not unlike the search script outlined in Part II of this series. First the query, SRU::Request object, results, and Solr object are initialized. A pointer to the first desired hit as well as the maximum number of records to return are calculated. The search is done, and the total number of search results is saved for future reference. If the search was a success, then each of the hits is looped through and stuffed into an XML element named record and scoped with a Dublin Core name space. Finally, the total number of records as well as the records themselves are returned to the main module where they are added to an SRU::Response object and returned to the SRU client.
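The paging arithmetic is the easiest part to get wrong because SRU counts records from one while Solr counts from zero. Below is a small, stand-alone sketch of that calculation. The subroutine name (paging) and the default value of MAX are assumptions made for the sake of illustration; they are not taken from the server itself.

```perl
#!/usr/bin/perl

# paging.pl - map SRU paging parameters to Solr paging parameters; a sketch

use strict;
use warnings;

use constant MAX => 10;  # hypothetical default page size

# given SRU's startRecord and maximumRecords, return Solr's start and rows
sub paging {

  my ( $start_record, $maximum_records ) = @_;

  # SRU counts from 1 but Solr counts from 0
  my $start = $start_record ? $start_record - 1 : 0;

  # fall back to the default when no maximum is given
  my $rows = $maximum_records ? $maximum_records : MAX;

  return ( start => $start, rows => $rows );

}

# the second page of results, ten records at a time
my %options = paging( 11, 10 );
print "start: $options{start}; rows: $options{rows}\n";
```

The resulting hash has the same shape as the options passed to the search method in the subroutine above.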

This particular implementation is pretty rudimentary, and it does not really exploit the underlying functionality of Solr/Lucene. For example, it does not support facets, spell check, suggestions, etc. On the other hand, it does support paging, and since it is implemented under mod_perl it is just about as fast as it can get on my hardware.

Give the implementation a whirl. The underlying index includes about 20,000 records of various electronic books (from the Alex Catalogue of Electronic Texts, Project Gutenberg, and the HathiTrust), photographs (from my own adventures), journal titles, and journal articles (both from the Directory of Open Access Journals).

Summary

It is difficult for me to overstate the number of possibilities for librarianship considering the current information environment. Data and information abound! Learning has not stopped. It is sexy to be in the information business. All of the core principles of librarianship are at play in this environment. Collection. Preservation. Organization. Dissemination. The application of relational databases combined with indexers provides the means to put into practice these core principles in today’s world.

The Solr/Lucene combination is an excellent example, and WebService::Solr is just one way to get there. Again, I don’t expect every librarian to know and understand all of the things outlined in this series of essays. On the other hand, I do think it is necessary for the library community as a whole to understand this technology in the same way they understand bibliography, conservation, cataloging, and reference. Library schools need to teach it, and librarians need to explore it.

Source code

Finally, plain text versions of this series’ postings, the necessary Solr schema.xml files, as well as all the source code are available for downloading. Spend about an hour putzing around. I’m sure you will come out the other end learning something.

Fun with WebService::Solr, Part II of III

Eric Lease Morgan — Mon, 12 Jan 2009 23:50:41 +0000

In this posting (Part II), I will demonstrate how to use WebService::Solr to create and search a more substantial index, specifically an index of metadata describing the content of the Directory of Open Access Journals. Part I of this series introduced Lucene, Solr, and WebService::Solr with two trivial examples. Part III will describe how to create an SRU interface using WebService::Solr.

Directory of Open Access Journals

The Directory of Open Access Journals (DOAJ) is a list of freely available scholarly journals. As of this writing the Directory contains approximately 3,900 titles organized into eighteen broad categories such as Arts and Architecture, Law and Political Science, and General Science. Based on my cursory examination, a large percentage of the titles are in the area of medicine.

Not only is it great that such a directory exists, but it is even greater that the Directory’s metadata — the data describing the titles in the Directory — is available for harvesting via OAI-PMH. While the metadata is rather sparse, it is more than adequate for creating rudimentary MARC records for importing into library catalogs, or better yet, incorporating into some other Web service. (No puns intended.)

In my opinion, the Directory is especially underutilized. For example, not only are the Directory’s journal titles available for download, but so is the metadata of about 25,000 journal articles. Given these two things (metadata describing titles as well as articles) it would be entirely possible to seed a locally maintained index of scholarly journal content and incorporate that into library “holdings”. But alas, that is another posting and another story.

Indexing the DOAJ

It is almost trivial to create a search engine against DOAJ content when you know how to implement an OAI-PMH harvester and indexer. First, you need to know the OAI-PMH root URL for the Directory, and it happens to be http://www.doaj.org/oai. Second, you need to peruse the OAI-PMH output sent by the Directory and map it to fields you will be indexing. In the case of this demonstration, the fields are id, title, publisher, subject, and URL. Consequently, I updated the schema from the first demonstration to look like this:
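As a rough sketch only, and assuming the stock text and string field types found in Solr's example schema.xml, field definitions for such an index might look something like the following. The field names are the ones used by the scripts in this posting, but the attribute choices are assumptions made for illustration.

```xml
<!-- a sketch of the field definitions; attribute values are assumptions -->
<fields>
  <field name="id"            type="string" indexed="true" stored="true" required="true" />
  <field name="title"         type="text"   indexed="true" stored="true" />
  <field name="publisher"     type="text"   indexed="true" stored="true" />
  <field name="subject"       type="text"   indexed="true" stored="true" multiValued="true" />
  <field name="url"           type="string" indexed="false" stored="true" />
  <field name="facet_subject" type="string" indexed="true" stored="true" multiValued="true" />
</fields>
<uniqueKey>id</uniqueKey>
```

The subject and facet_subject fields are marked as multi-valued because a journal can be associated with more than one subject.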

The astute reader will notice the addition of a field named facet_subject. This field, denoted as a string and therefore not parsed by the indexer, is destined to be a browsable facet in the search engine. By including this sort of field in the index it is possible to return results like, “Your search identified 100 items, and 25 of them are associated with the subject Philosophy.” A very nice feature. Think of it as the explicit exploitation of controlled vocabulary terms for search results. Facets turn the use of controlled vocabularies inside out. The library community has something to learn here.

Once the schema was updated, I wrote the following script to index the journal title content from the Directory:

#!/usr/bin/perl

# index-doaj.pl - get doaj content and index it

# Eric Lease Morgan 
# January  12, 2009 - version 1.0


# define
use constant OAIURL => 'http://www.doaj.org/oai';
use constant PREFIX => 'oai_dc';
use constant SOLR   => 'http://localhost:210/solr';

# require
use Net::OAI::Harvester;
use strict;
use WebService::Solr;

# initialize oai and solr
my $harvester = Net::OAI::Harvester->new( baseURL => OAIURL );
my $solr      = WebService::Solr->new( SOLR );

# get all records and loop through them
my $records = $harvester->listAllRecords( metadataPrefix => PREFIX );
my $id      = 0;
while ( my $record = $records->next ) {

  # increment
  $id++;
  last if ( $id > 100 );  # comment this out to get everything

  # extract the desired metadata
  my $metadata     = $record->metadata;
  my $identifier   = $record->header->identifier;
  my $title        = $metadata->title      ? &strip( $metadata->title )     : '';
  my $url          = $metadata->identifier ? $metadata->identifier          : '';
  my $publisher    = $metadata->publisher  ? &strip( $metadata->publisher ) : '';
  my @all_subjects = $metadata->subject    ? $metadata->subject             : ();

  # normalize subjects
  my @subjects = ();
  foreach ( @all_subjects ) {

    s/DoajSubjectTerm: //;  # remove DOAJ label
    next if ( /LCC: / );    # don't want call numbers
    push @subjects, $_;

  }

  # echo
  print "      record: $id\n";
  print "  identifier: $identifier\n";
  print "       title: $title\n";
  print "   publisher: $publisher\n";
  foreach ( @subjects ) { print "     subject: $_\n" }
  print "         url: $url\n";
  print "\n";

  # create solr/lucene document
  my $solr_id        = WebService::Solr::Field->new( id        => $identifier );
  my $solr_title     = WebService::Solr::Field->new( title     => $title );
  my $solr_publisher = WebService::Solr::Field->new( publisher => $publisher );
  my $solr_url       = WebService::Solr::Field->new( url       => $url );

  # fill up a document
  my $doc = WebService::Solr::Document->new;
  $doc->add_fields(( $solr_id, $solr_title, $solr_publisher, $solr_url ));
  foreach ( @subjects ) {

    $doc->add_fields(( WebService::Solr::Field->new( subject => &strip( $_ ))));
    $doc->add_fields(( WebService::Solr::Field->new( facet_subject => &strip( $_ ))));

  }

  # save; no need for commit because it comes for free
  $solr->add( $doc );

}

# done
exit;


sub strip {

  # strip non-ascii characters; bogus since the OAI output is supposed to be UTF-8
  # see: http://www.perlmonks.org/?node_id=613773
  my $s =  shift;
  $s    =~ s/[^[:ascii:]]+//g;
  return $s;

}

The script is very much like the trivial example from Part I. It first defines a few constants. It then initializes both an OAI-PMH harvester as well as a Solr object. It then loops through each record of the harvested content extracting the desired data. The subject data, in particular, is normalized. The data is then inserted into WebService::Solr::Field objects which in turn are inserted into WebService::Solr::Document objects and added to the underlying Lucene index.
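The normalization of the subject data is worth isolating because it determines what ends up in the facets. Here is a minimal, stand-alone sketch of that step. The subroutine name (normalize) is made up for the example, and the sample values are hypothetical, only shaped like the labels the script above looks for.

```perl
#!/usr/bin/perl

# normalize.pl - normalize DOAJ subject terms; a sketch

use strict;
use warnings;

# given a list of raw subjects, return the terms sans labels and call numbers
sub normalize {

  my @subjects;
  foreach my $subject ( @_ ) {

    # don't want call numbers
    next if ( $subject =~ /LCC: / );

    # remove the DOAJ label, working on a copy so the input is left alone
    ( my $term = $subject ) =~ s/^DoajSubjectTerm: //;
    push @subjects, $term;

  }
  return @subjects;

}

# hypothetical values shaped like the Directory's output
my @raw = ( 'DoajSubjectTerm: Library and Information Science', 'LCC: Z', 'libraries' );
print "$_\n" foreach ( normalize( @raw ) );
```

Unlike the loop in the indexing script, this version copies each value before editing it, so the original list of subjects is not modified as a side effect.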

Searching the index

Searching the index is less trivial than the example in Part I because of the facets, below:

#!/usr/bin/perl

# search-doaj.pl - query a solr/lucene index of DOAJ content

# Eric Lease Morgan 
# January 12, 2009 - version 1.0


# define
use constant SOLR => 'http://localhost:210/solr';
use constant ROWS => 100;
use constant MIN  => 5;

# require
use strict;
use WebService::Solr;

# initialize
my $solr = WebService::Solr->new( SOLR );

# sanity check
my $query = $ARGV[ 0 ];
if ( ! $query ) {

  print "Usage: $0 <query>\n";
  exit;

}

# search; get no more than ROWS records and subject facets occurring MIN times
my $response  = $solr->search( $query, { 'rows'           => ROWS,
                                         'facet'          => 'true', 
                                         'facet.field'    => 'facet_subject', 
                                         'facet.mincount' => MIN });

# get the number of hits, and start display
my $hit_count = $response->pager->total_entries;
print "Your search ($query) found $hit_count document(s).\n\n";

# extract subject facets, and display
my %subjects = &get_facets( $response->facet_counts->{ facet_fields }->{ facet_subject } );
if ( $hit_count ) {

  print "  Subject facets: ";
  foreach ( sort( keys( %subjects ))) { print "$_ (" . $subjects{ $_ } . "); " }
  print "\n\n";
  
}

# display each hit
my $index = 0;
foreach my $doc ( $response->docs ) {

  # slurp
  my $id        = $doc->value_for( 'id' );
  my $title     = $doc->value_for( 'title' );
  my $publisher = $doc->value_for( 'publisher' );
  my $url       = $doc->value_for( 'url' );
  my @subjects  = $doc->values_for( 'subject' );

  # increment
  $index++;

  #echo
  print "     record: $index\n";
  print "         id: $id\n";
  print "      title: $title\n";
  print "  publisher: $publisher\n";
  foreach ( @subjects ) { print "    subject: $_\n" }
  print "        url: $url\n";
  print "\n";

}

# done 
exit;


sub get_facets {

  # convert array of facet/hit-count pairs into a hash
  my $array_ref = shift || [];
  my %facet;
  for ( my $i = 0; $i < $#{ $array_ref }; $i += 2 ) {

    # even positions hold facet names; odd positions hold their counts
    my $k = $array_ref->[ $i ];
    my $v = $array_ref->[ $i + 1 ];
    next if ( ! $v );
    $facet{ $k } = $v;

  }

  return %facet;

}

The script needs a bit of explaining. Like before, a few constants are defined. A Solr object is initialized, and the existence of a query string is verified. The search method makes use of a few options, specifically, options to return no more than ROWS search results as well as subject facets occurring at least MIN times. The whole thing is stuffed into a WebService::Solr::Response object, which is, for better or for worse, a JSON data structure. Using the pager method against the response object, the number of hits is returned, assigned to a scalar, and displayed.

The trickiest part of the script is the extraction of the facets done by the get_facets subroutine. In WebService::Solr, facet names and their values are returned in an array reference. get_facets converts this array reference into a hash, which is then displayed. Finally, each WebService::Solr::Document object in the response is looped through and echoed. Notice how the subject field is handled. It contains multiple values which are retrieved through the values_for method which returns an array, not a scalar. Below is sample output for the search “library”:

Your search (library) found 84 document(s).

  Subject facets: Computer Science (7); Library and Information
Science (68); Medicine (General) (7); information science (19);
information technology (8); librarianship (16); libraries (6);
library and information science (14); library science (5);

     record: 1
         id: oai:doaj.org:0029-2540
      title: North Carolina Libraries
  publisher: North Carolina Library Association
    subject: libraries
    subject: librarianship
    subject: media centers
    subject: academic libraries
    subject: Library and Information Science
        url: http://www.nclaonline.org/NCL/

     record: 2
         id: oai:doaj.org:1311-8803
      title: Bibliosphere
  publisher: NBU Library
    subject: Bulgarian libraries
    subject: librarianship
    subject: Library and Information Science
        url: http://www.bibliosphere.eu/ 

     record: 3
         id: ...

In a hypertext environment, each of the titles in the returned records would be linked with its associated URL. Each of the subject facets listed at the beginning of the output would be hyperlinked to subsequent searches combining the original query plus the faceted term, such as “library AND subject:"Computer Science"”. An even more elaborate search interface would allow the user to page through search results and/or modify the value of MIN to increase or decrease the number of relevant facets displayed.
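The facet handling and the follow-up searches described above can be tried without a running Solr instance. What follows is only a minimal sketch, not part of the original script. The sample data is modeled on the output above, and the subroutine names (pairs_to_hash and refine_query) as well as the choice of the subject field for the refined query are assumptions made for illustration:

```perl
#!/usr/bin/perl

# facet-links.pl - turn facet/hit-count pairs into refined queries (a sketch)

use strict;
use warnings;

# convert a flat list of facet/hit-count pairs into a hash
sub pairs_to_hash {

  my $pairs = shift || [];
  my %facet;
  for ( my $i = 0; $i < $#{ $pairs }; $i += 2 ) {

    my ( $name, $count ) = @{ $pairs }[ $i, $i + 1 ];
    next if ( ! $count );
    $facet{ $name } = $count;

  }

  return %facet;

}

# combine the original query with a facet term; the field name is an assumption
sub refine_query {

  my ( $query, $term ) = @_;
  $term =~ s/"/\\"/g;
  return qq($query AND subject:"$term");

}

# sample data, shaped like the array reference returned for a facet field
my $pairs    = [ 'Computer Science', 7, 'libraries', 6, 'unused', 0 ];
my %subjects = pairs_to_hash( $pairs );

# list each facet along with the search it would be linked to
foreach my $subject ( sort( keys( %subjects ))) {

  print "$subject (" . $subjects{ $subject } . "): " . refine_query( 'library', $subject ) . "\n";

}
```

In a Web interface the refined query would be URL-encoded and placed in a link instead of being printed.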

Making lists searchable

Librarians love lists. We create lists of books. Lists of authors of books. Lists of journals. Lists of journal articles. Recently we have become enamored with lists of Internet resources. We pay other people for lists, and we call these people bibliographic index vendors. OCLC’s bread and butter is a list of library holdings. Librarians love lists.

Lists aren’t very useful unless they are: 1) short, 2) easily sortable, or 3) searchable. For the most part, the profession has mastered the short, sortable list, but we are challenged when it comes to searching our lists. We insist on using database applications for this, even when we don’t know how to design a (relational) database. Our searching mentality is stuck in the age of mediated online search services such as DIALOG and BRS. The profession has not come to grips with the advances in information retrieval. Keyword searching, as opposed to field searching, has its merits. Tools like Lucene, KinoSearch, Zebra, swish-e, and a host of predecessors like Harvest, WAIS, and Veronica all facilitate(d) indexing/searching.

As well as organizing information — the creation of lists — the profession needs to learn how to create its own indexes and make them searchable. While I do not advocate every librarian know how to exploit things like WebService::Solr, I do advocate the use of these technologies to a much greater degree. Without them the library profession will always be a follower in the field of information technology as opposed to a leader.

Summary

This posting, Part II of III, illustrated how to index and search content from an OAI-PMH data repository. It also advocated the increased use of indexers/search engines by the library profession. In the next and last part of this series WebService::Solr will be used as a part of a Search/Retrieve via URL (SRU) interface.

Acknowledgements

Special thanks go to Brian Cassidy and Kirk Beers who wrote WebService::Solr. Additional thanks go to Ed Summers and Thomas Berger who wrote Net::OAI::Harvester. I am simply standing on the shoulders of giants.

Mr. Serials is dead. Long live Mr. Serials

Eric Lease Morgan — Mon, 12 Jan 2009 03:50:17 +0000

This posting describes the current state of the Mr. Serials Process.

Background

Round about 1994 when I was employed by the North Carolina State University Libraries, Susan Nutter, the Director, asked me to participate in an ARL Collection Analysis Project (CAP). The goal of the Project was to articulate a mission/vision statement for the Libraries’ fledgling Collection Development Department. “It will be a professional development opportunity”, she told me. I don’t think she knows how much of an opportunity it really was.

Through the CAP I, along with a number of others (Margaret Hunt, John Abbott, Caroline Argentati, and Orion Pozo) became acutely aware of the “serials pricing crisis”. Academic writes article. Article gets peer-reviewed. Publisher agrees to distribute article in exchange for copyright. Article gets published in journal. Library subscribes to journal at an ever-increasing price. Academic reads journal. Repeat.

The whole “crisis” made me frustrated (angry), and others were frustrated too. Why did prices need to be increasing so dramatically? Why couldn’t the Academe coordinate peer-review? Why couldn’t the Internet be used as a distribution medium? Some people tried to answer some of these questions differently than the norm, and the result was the creation of electronic journals distributed via email such as the venerable Bryn Mawr Classical Review, Psycoloquy, Postmodern Culture, and PACS Review.

Given this environment, I sought to be a part of the solution instead of perpetuating the problem. I created the Mr. Serials Process, a set of applications/scripts that collected, archived, indexed, and re-distributed sets of electronic journals. I figured I could demonstrate to the library and academic communities that if everybody does their part, then there would be less of a need for commercial publishers, entities who were exploiting the system and more interested in profit than the advancement of knowledge. Mr. Serials was “born” around 1994 and documented in an article from Serials Review. Mr. Serials, now 14 years old, would be considered a child by most people’s standards. Yet, fourteen years is a long time in Internet years.

Mr. Serials is dead

For all intents and purposes, Mr. Serials is dead because his process was based on the distribution of electronic serials via email. His death was long and drawn out. The final nail driven into his coffin came when ACQNET, one of the original “journals” he collected, moved from Appalachian State University to iBiblio a few months ago. After the move Mr. Serials was no longer considered the official archivist of the content, and his era had passed.

This is not a big deal. Change happens. Processes evolve. Besides, Mr. Serials created a legacy for himself, a set of early electronic serial literature exemplifying the beginnings of networked scholarly communication which includes more than thirty titles archived at serials.infomotions.com.

Long live Mr. Serials

At the same time, Mr. Serials is alive and well. Maybe, like many people his age, he is going through an adolescence.

In the mid-1990s electronic journals were distributed via email. As such the Mr. Serials Process used procmail to filter incoming mail. He then used a Hypercard program to create configuration files denoting the locations of bibliographic data in journal titles. Next, he used a Perl program that read the configuration files, automatically extracted the bibliographic information from each issue, removed the email header, and saved the resulting journal article in a specified location. Initially, the whole collection was made available via a Gopher server and indexed with WAIS. Later, the collection was made available via an HTTP server and other indexing technologies were used, but many of them are broken.
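To give a flavor of that kind of processing, here is a minimal sketch of separating an email header from its body and pulling a value out of the header. It is not the original Mr. Serials code. The sample message, the subroutine name (split_message), and the choice of the Subject line are all assumptions made for illustration:

```perl
#!/usr/bin/perl

# strip-header.pl - separate an email message's header from its body (a sketch)

use strict;
use warnings;

# split a message at the first blank line; returns ( header, body )
sub split_message {

  my $message = shift;
  my ( $header, $body ) = split( /\r?\n\r?\n/, $message, 2 );
  return ( $header, defined( $body ) ? $body : '' );

}

# a made-up message for illustration
my $message = "From: editor\@example.org\nSubject: Issue 1\n\nThis is the first article.\n";
my ( $header, $body ) = split_message( $message );

# pull a value out of the header, and echo the body without the header
my ( $subject ) = $header =~ /^Subject:\s*(.*)$/m;
print "subject: $subject\n";
print $body;
```

The original Process relied on per-journal configuration files to locate bibliographic data, so a real implementation would be driven by those files rather than a hard-coded header name.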

Somewhere along the line, some of the “journals” became mailing lists, and the Process was modified to take advantage of an archiving program called Hypermail. Like the original Process, the archived materials are accessible via a Web server and indexed with some sort of search engine technology. (There have been so many.) With the movement of ACQNET, the original “journals” have all gone away, but Mr. Serials has picked up a few mailing lists along the way, notably colldv-l, Code4Lib, and NGC4Lib. Consequently, Mr. Serials is not really dead, just transformed.

A lot of the credit goes to procmail, Hypermail, Web servers, and indexers. Procmail reads incoming mail and processes it accordingly. File it here. File it there. Delete it. Send it off to another process. Hypermail makes pretty email archives which are more or less configurable. It allows one to keep email messages in their original RFC 822 (mbox) format and reuse them for many purposes. We all know what HTTP servers do. Indexers complement the Hypermail process by providing searchable interfaces to the collection. The indexer used against colldv-l, Code4Lib, and NGC4Lib is called KinoSearch and is implemented through an SRU interface.

Mr. Serials is a modern day library process. It has a set of collection development goals. It acquires content. It organizes content. It archives and preserves content. It redisseminates content. The content it currently collects may not be extraordinarily scholarly, but someday somebody is going to want it. It is a special collection. Much of its success is a testament to open source software. All the tools it uses are open source. In fact most of them were distributed as open source even before the phrase was coined.

Long live Mr. Serials.

Fun with WebService::Solr, Part I of III

Eric Lease Morgan — Mon, 05 Jan 2009 23:23:17 +0000

This posting (Part I) is an introduction to a Perl module called WebService::Solr. In it you will learn a bit of what Solr is, how it interacts with Lucene (an indexer), and how to write two trivial Perl scripts: 1) an indexer, and 2) a search engine. Part II of this series will introduce less trivial scripts — programs to index and search content from the Directory of Open Access Journals (DOAJ). Part III will demonstrate how to use WebService::Solr to implement an SRU interface against the index of DOAJ content. After reading each Part you should have a good overview of what WebService::Solr can do, but more importantly, you should have a better understanding of the role indexers/search engines play in the world of information retrieval.

Solr, Lucene, and WebService::Solr

I must admit, I’m coming to the Solr party at least one year late. As you may or may not know, Solr is a Java-based, Web Services interface to the venerable Lucene, the current gold standard when it comes to indexers/search engines. In such an environment, Lucene (also a Java-based system) is used to first create inverted indexes from texts or numbers, and second, provide a means for searching the index. Instead of writing applications reading and writing Lucene indexes directly, you can send Solr HTTP requests which are parsed and passed on to Lucene. For example, one could feed Solr sets of metadata describing, say, books, and provide a way to search the metadata to identify items of interest. (“What a novel idea!”) Using such a Web Services technique the programmer is free to use the programming/scripting language of their choice. No need to know Java, although Java-based programs would definitely be faster and more efficient.

For better or for worse, my programming language of choice is Perl, and upon perusing CPAN I discovered WebService::Solr — a module making it easy to interface with Solr (and therefore Lucene). After playing with WebService::Solr for a few days I became pretty impressed, thus, this posting.

Installing and configuring Solr

Installing Solr is relatively easy. Download the distribution. Save it in a convenient location on your file system. Unpack/uncompress it. Change directories to the example directory, and fire up Solr by typing java -jar start.jar at the command line. Since the distribution includes Jetty (a pint-sized HTTP server), and as long as you have not made any configuration changes, you should now be able to connect to your locally hosted Solr administrative interface through your favorite Web browser. Try, http://localhost:8983/solr/

When it comes to configuring Solr, the most important files are found in the conf directory, specifically, solrconfig.xml and schema.xml. I haven’t tweaked the former. The latter denotes the types and names of fields that will ultimately be in your index. Describing in detail the ins and outs of solrconfig.xml and schema.xml is beyond the scope of this posting, but for our purposes here, it is important to note two things. First I modified schema.xml to include the following Dublin Core-like fields:

Second, I edited a Jetty configuration file (jetty.xml) so it listens on port 210 instead of the default port, 8983. “Remember Z39.50?”

There is a whole lot more to configuring Solr than what is outlined above. To really get a handle on the indexing process the Solr documentation is required reading.

Installing WebService::Solr

Written by Brian Cassidy and Kirk Beers, WebService::Solr is a set of Perl modules used to interface with Solr. Create various WebService::Solr objects (such as fields, documents, requests, and responses), and apply methods against them to create, modify, find, add, delete, query, and optimize aspects of your underlying Lucene index.

Since WebService::Solr requires a large number of supporting modules, installing WebService::Solr is best done using CPAN. From the CPAN command line, enter install WebService::Solr. It worked perfectly for me.

Indexing content

My first WebService::Solr script, an indexer, is a trivial example, below:

 #!/usr/bin/perl
 
 # trivial-index.pl - index a couple of documents
 
 # define
 use constant SOLR => 'http://localhost:210/solr';
 use constant DATA => ( 'Hello, World!', 'It is nice to meet you.' );
 
 # require
 use strict;
 use WebService::Solr;
 
 # initialize
 my $solr = WebService::Solr->new( SOLR );
 
 # process each data item
 my $index = 0;
 foreach ( DATA ) {
 
   # increment
   $index++;
     
   # populate solr fields
   my $id  = WebService::Solr::Field->new( id  => $index );
   my $title = WebService::Solr::Field->new( title => $_ );
 
   # fill a document with the fields
   my $doc = WebService::Solr::Document->new;
   $doc->add_fields(( $id, $title ));
 
   # save
   $solr->add( $doc );
   $solr->commit;
 
 }
 
 # done
 exit;

To elaborate, the script first defines the (HTTP) location of our Solr instance as well as an array of data containing two elements. It then includes/requires the necessary Perl modules. One keeps our programming technique honest, and the other is our raison d’être. Third, a WebService::Solr object is created. Fourth, a counter is initialized, and a loop is instantiated reading each data element. Inside the loop the counter is incremented and local WebService::Solr::Field objects are created using the values of the counter and the current data element. The next step is to instantiate a WebService::Solr::Document object and fill it up with the Field objects. Finally, the Document is added to the index, and the update is committed.

If everything went according to plan, the Lucene index should now contain two documents. The first with an id equal to 1 and a title equal to “Hello, World!”. The second with an id equal to 2 and a title equal to “It is nice to meet you.” To verify this you should be able to use the following script to search your index:

  #!/usr/bin/perl
  
  # trivial-search.pl - query a lucene index through solr
  
  # define
  use constant SOLR => 'http://localhost:210/solr';
  
  # require
  use strict;
  use WebService::Solr;
  
  # initialize
  my $solr = WebService::Solr->new( SOLR );
  
  # sanity check
  my $query = $ARGV[ 0 ];
  if ( ! $query ) {
  
    print "Usage: $0 <query>\n";
    exit;
    
  }
  
  # search & get hits
  my $response = $solr->search( $query );
  my @hits = $response->docs;
  
  # display
  print "Your search ($query) found " . ( $#hits + 1 ) . " document(s).\n\n";
  foreach my $doc ( @hits ) {
  
    # slurp
    my $id    = $doc->value_for( 'id' );
    my $title = $doc->value_for( 'title' );
    
    # echo
    print "     id: $id\n";
    print "  title: $title\n";
    print "\n";
      
  }

Queries such as hello, “hello OR meet”, or “title: world” will return results. Because the field named text includes the content of the title field, as per our definition, queries without field specifications default to the text field. Nice. The power of an index.

Here is how the script works. It first denotes the location of Solr. It then includes/requires the necessary modules. Next, it creates a WebService::Solr object. Fourth, it makes sure there is a query on the command line. Fifth, it queries Solr creating a WebService::Solr::Response object, and this object is queried for an array of hits. Finally, the hits are looped through, creating and displaying the contents of each WebService::Solr::Document object (hit) found.

Summary

This posting provided an overview of Lucene, Solr, and a set of Perl modules called WebService::Solr. It also introduced the use of the modules to index content and search it. Part II will provide a more in-depth introduction to the use of WebService::Solr and Solr in general.

Visit to Ball State University

Eric Lease Morgan — Wed, 17 Dec 2008 17:17:43 +0000

I took time yesterday to visit a few colleagues at Ball State University.

Ball State, the movie!

Over the past few months the names of some fellow librarians at Ball State University repeatedly crossed my path. The first was Jonathan Brinley who is/was a co-editor on Code4Lib Journal. The second was Kelley McGrath who was mentioned to me as a top-notch cataloger. The third was Todd Vandenbark who was investigating the use of MyLibrary. Finally, a former Notre Damer-er, Marcy Simons, recently started working at Ball State. Because Ball State is relatively close, I decided to take the opportunity to visit these good folks during this rather slow part of the academic year.

Compare & contrast

After I arrived we made our way to lunch. We compared and contrasted our libraries. For example, they had many — about say 200 — public workstations. The library was hustling and bustling. About 18,000 students go to Ball State and seemingly many of them go home on the weekends. Ball State was built with money from the canning jar industry, but upon a visit to the archives no canning jars could be seen. I didn’t really expect any.

Shop talk

Over lunch we talked a lot about FRBR and the possibilities of creating work-level records from the myriad of existing item-level (MARC) records. Since the work-related content is often times encoded as free text in some sort of 500 field, I wonder how feasible the process would be. Ironically, an article, “Identifying FRBR Work-Level Data in MARC Bibliographic Records for Manifestations of Moving Images” by Kelley had been published the day before in Code4Lib. Boy, it certainly is a small world.

I always enjoy “busman’s holidays” and visiting other libraries. I find we oftentimes have more things in common than differences.

A Day with OLE

Eric Lease Morgan — Sat, 13 Dec 2008 13:20:27 +0000

This posting documents my experience at an Open Library Environment (OLE) project workshop that took place at the University of Chicago, December 11, 2008. In a sentence, the workshop provided an opportunity to describe and flowchart a number of back-end library processes in an effort to help design an integrated library system.

What is OLE

full-scale gargoyle

As you may or may not know, the Open Library Environment is a Mellon-funded initiative in cooperation with a growing number of academic libraries to explore the possibilities of building an integrated library system. Since this initiative is more about library back-end and business processes (acquisitions, cataloging, circulation, reserves, ILL, etc.), it is complementary to the eXtensible Catalog (XC) project which is more about creating a “discovery” layer against and on top of existing integrated library systems’ public access interfaces.

Why OLE?

Why do this sort of work? There are a few reasons. First, vendor consolidation makes the choices of commercial solutions few. Not a good idea; we don’t like monopolies. Second, existing applications do not play well with other (campus) applications. Better integration is needed. Third, existing library systems are designed for print materials, but with the advent of greater and greater amounts of electronic materials the pace of change has been inadequate and too slow.

OLE is an effort to help drive and increase change in Library Land, and this becomes even more apparent when you consider all of the library initiatives Mellon is supporting: Portico (preservation), JSTOR and ArtSTOR (collections), XC (discovery), OLE (business processes/technical services).

The day’s events

The workshop took place at the Regenstein Library (University of Chicago). There were approximately thirty or forty attendees from universities such as Grinnell, Indiana, Notre Dame, Minnesota, Illinois, Iowa, and of course, Chicago.

After being given a short introduction/review of what OLE is and why, we were broken into four groups (cataloging/authorities, circulation/reserves/ILL, acquisitions, and serials/ERM), and we were first asked to enumerate the processes of our respective library activities. We were then asked to classify these activities into four categories: core process, shifting/changing process, processes that could be stopped, and processes that we wanted but don’t have. All of us, being librarians, were not terribly surprised by the enumerations and classifications. The important thing was to articulate them, record them, and compare them with similar outputs from other workshops.

After lunch (where I saw the gargoyle and made a few purchases at the Seminary Co-op Bookstore) we returned to our groups to draw flowcharts of any of our respective processes. The selected processes included checking in a journal issue, checking in an electronic resource, keeping up and maintaining a file of borrowers, acquiring a firm order book, cataloging a rare book, and cataloging a digital version of a rare book. This whole flowcharting process was amusing since the workflows of each participant’s library needed to be amalgamated into a single process. “We do it this way, and you do it that way.” Obviously there is more than one way to skin a cat. In the end the flowcharts were discussed, photographed, and packaged up to ship back to the OLE home planet.

What do you really want?

The final, wrap-up event of the day was a sharing and articulation of what we really wanted in an integrated library system. “If there were one thing you could change, then what would it be?” Based on my notes, the most popular requests were:

make the system interoperable with sets of APIs (4 votes)
allow the system to accommodate multiple metadata formats (3 votes)
include a robust reporting mechanism; give me the ARL Generate Statistics Button (2 votes)
implement a staff interface allowing work to be done without editing records (2 votes)
implement consortial borrowing across targets (2 votes)
separate the discovery processes from the business processes (2 votes)

Other wish list items I thought were particularly interesting included: integrating the collections process into the system, making sure the application was operating system independent, and implementing Semantic Web features.

Summary

I’m glad I had the opportunity to attend. It gave me a chance to get a better understanding of what OLE is all about, and I saw it as a professional development session where I learned more about where things are going. The day’s events were well-structured, well-organized, and manageable given the time constraints. I only regret there was too little “blue skying” by attendees. Much of the time was spent outlining how our work is done now. I hope any future implementation explores new ways of doing things in order to take better advantage of the changing environment as opposed to simply automating existing processes.

ASIS&T Bulletin on open source software

Eric Lease Morgan — Fri, 12 Dec 2008 13:37:20 +0000

The following is a verbatim duplication of an introduction I wrote for a special issue of the ASIS&T Bulletin on open source software in libraries. I appreciate the opportunity to bring the issue together because I sincerely believe open source software provides a way for libraries to have more control over their computing environment. This is especially important for a profession that is about learning, teaching, scholarship, data, information, and knowledge. Special thanks goes to Irene L. Travis who brought the opportunity to my attention. Thank you.

Open Source Software in Libraries

It is a privilege and an honor to be the guest editor for this special issue of the Bulletin of the American Society for Information Science and Technology on open source software. In it you will find a number of articles describing open source software and how it has been used in libraries. Open source software or free and open source software is defined and viewed in a variety of ways, and the definition will be refined and enriched by our authors. However, very briefly, for those readers unfamiliar with it, open source software is software that is distributed under one of a number of licensing arrangements that (1) require that the software’s source code be made available and accessible as part of the package and (2) permit the acquirer of the software to modify the code freely to fit their own needs provided that, (3) if they distribute the software modifications they create, they do so under an open source license. If these basic elements are met, there is no requirement that the resulting software be distributed at no cost or non-commercially, although much widely used open source software such as the web browser Firefox is also distributed without charge.

In This Issue

The articles begin with Scot Colford’s “Explaining Free and Open Source Software,” in which he describes how the process of using open source software is a lot like baking a cake. He goes on to outline how open source software is all around us in our daily computing lives.

Karen Schneider’s “Thick of the Fray” lists some of the more popular open source software projects in libraries and describes how these sorts of projects would not have been nearly as feasible in an era without the Internet.

Marshall Breeding’s “The Viability of Open Source ILS” provides a balanced comparison between open source software integrated library systems and closed source software integrated library systems. It is a survey of the current landscape.

Bob Molyneux’s “Evergreen in Context” is a case study of one particular integrated library system, and it is a good example of the open source adage “scratching an itch.”

In “The Development and Usage of the Greenstone Digital Library Software,” Ian Witten provides an additional case study but this time of a digital library application. It is a good example of how many different types of applications are necessary to provide library service in a networked environment.

Finally, Thomas Krichel expands the idea of open source software to include open data and open libraries. In “From Open Source to Open Libraries,” you will learn that many of the principles of librarianship are embodied in the principles of open source software. In a number of ways, librarianship and open source software go hand-in-hand.

What Is Open Source Software About?

Open source software is about quite a number of things. It is about taking more complete control over one’s computer infrastructure. In a profession that is a lot about information, this sort of control is increasingly necessary. Put another way, open source software is about “free.” Not free as in gratis, but free as in liberty. Open source software is about community – the type of community that is only possible in a globally networked computer environment. There is no way any single vendor of software will be able to gather together and support all the programmers that a well-managed open source software project can support. Open source software is about opportunity and flexibility. In our ever-dynamic environment, these characteristics are increasingly important.

Open source software is not a panacea for libraries, and while it does not require an army of programmers to support it, it does require additional skills. Just as all libraries – to some degree or another – require collection managers, catalogers and reference librarians, future-thinking libraries require people who are knowledgeable about computers. This background includes knowledge of relational databases, indexers, data formats such as XML and scripting languages to glue them together and put them on the web. These tools are not library-specific, and all are available as open source.

Through reading the articles in this issue and discussing them with your colleagues, you should become more informed regarding the topic of open source software. Thank you for your attention and enjoy.

Fun with the Internet Archive

Eric Lease Morgan — Wed, 10 Dec 2008 13:02:51 +0000

I’ve been having some fun with Internet Archive content.

The process

More specifically, I have created a tiny system for copying scanned materials locally, enhancing them with word clouds, indexing them, and providing access to the whole thing. Here is how it works:

Identify materials of interest from the Archive and copy their URLs to a text file.
Feed the text file to a wget script (wget.sh) which copies the plain text, PDF, XML metadata, and GIF cover art locally.
Create a rudimentary word cloud (cloud.pl) against each full text version of a document in an effort to supplement the MARC metadata.
Index each item using the MARC metadata and full text (index.pl). Each index entry also includes the links to the word cloud, GIF image, PDF file, and MARC data.
Provide a simple one-box, one-button interface to the index (search.pl & search.cgi). Search results appear much like the Internet Archive’s but also include the word cloud.
Go to Step #1; rinse, shampoo, and repeat.
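The word cloud step boils down to counting words. What follows is a minimal sketch of the idea and not the cloud.pl script itself. The stop word list, the three-letter minimum, the sample text, and the subroutine name (word_counts) are all assumptions made for illustration:

```perl
#!/usr/bin/perl

# cloud-sketch.pl - count the most frequent words in a text (a sketch)

use strict;
use warnings;

# a tiny, hypothetical stop word list; a real one would be much longer
my %stopwords = map { $_ => 1 } qw( the and for that with this from are was );

# return a hash of word => count for words of three or more letters
sub word_counts {

  my $text = lc( shift );
  my %count;
  foreach my $word ( $text =~ /[a-z]{3,}/g ) {

    next if ( $stopwords{ $word } );
    $count{ $word }++;

  }

  return %count;

}

# sample text; in practice this would be read from a full text file
my $text  = 'Librarians love lists. Lists of books. Lists of authors of books.';
my %count = word_counts( $text );

# display the words, most frequent first, ties broken alphabetically
foreach my $word ( sort { $count{ $b } <=> $count{ $a } || $a cmp $b } keys( %count )) {

  print "$word (" . $count{ $word } . ")\n";

}
```

To turn the counts into a cloud, each count could be mapped to a font size when the words are written out as HTML.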

The demonstration

Attached are all the scripts I’ve written for the as-of-yet-unnamed process, and you can try the demonstration at http://dewey.library.nd.edu/hacks/ia/search.cgi, but remember, there are only about two dozen items presently in the index.

The possibilities

There are many ways the system can be improved, and they can be divided into two types: 1) services against the index, and 2) services against the items. Services against the index include things like paging search results, making the interface “smarter”, adding things like faceted browse, implementing an advanced search, etc.

Services against the items interest me more. Given the full text it might be possible to do things like: compare & contrast documents, cite documents, convert documents into many formats, trace ideas forward & backward, do morphology against words, add or subtract from “my” collection, search “my” collection, share, annotate, rank & review, summarize, create relationships between documents, etc. These sorts of features I believe to be a future direction for the library profession. It is about more than just getting the documents; it is also about doing things with them once they are acquired. The creation of the word clouds is a step in that direction. It assists in comparing & contrasting documents.

The Internet Archive makes many of these things possible because they freely distribute their content — including the full text.

InternetArchive++

Snow blowing and librarianship

Eric Lease Morgan — Sun, 07 Dec 2008 18:37:46 +0000

I don’t exactly know why, but I enjoy snow blowing.

snow blower

I think it began when I was in college. My freshman year I stayed on during January, earning money from Building & Grounds. For much of the time they simply said, “Go shovel some snow.” It was quiet, peaceful, and solitary. It was physical labor. It was a good time to think, and the setting was inspirational.

A couple of years later, in order to fulfill a graduation requirement, I needed to design and complete a “social practicum”. I decided to shovel snow for my neighbors. Upon asking them for permission, I got a lot of strange looks. “Why would you want to shovel my snow?”, they’d ask. I’d say, “Because I am more able to do it than you. I’m just being helpful and providing a social service.” Surprisingly, many people did not take me up on my offer, but a few did.

I now live and work in northern Indiana only forty-five minutes from Lake Michigan where “lake effect” snow is common. I own a big, bad snowblower. It gives me a sense of power, and even though it disturbs the quiet, I enjoy the process of cleaning my driveway and sidewalk. I enjoy trying to figure out the most efficient way to get the job done. I enjoy it so much I even snow blow around the block.

Snow blowing and librarianship

What does this have to do with librarianship? In reality, not a whole lot. On the other hand, one of the aspects of librarianship, especially librarianship in public libraries, is community service — providing means for improving society. My clearing of snow for my neighbors is done in a similar vein, and it works for me. I can do something for my fellow man and have fun at the same time. Weird?

P.S. Mowing the grass gives me the same sort of feelings.

Tarzan of the Apes

Eric Lease Morgan — Mon, 01 Dec 2008 13:34:02 +0000

This is a simple word cloud of Edgar Rice Burroughs’ Tarzan of the Apes:

tarzan little clayton great jungle before d’arnot jane back about cabin mr toward porter professor saw again time philander eyes strange know first here though never old turned many after black forest left hand own thought day knew beneath body head see young life long found most girl lay village face tribe wild away tree until ape down must seen far within door white few much esmeralda savage above once dead mighty ground stood side last trees apes cried thing among moment took hands new off without almost beast huge alone close just tut canler nor way knife small

I found this story to have quite a number of similarities with James Fenimore Cooper’s The Last of the Mohicans. The central character in both was superhuman. Both include some sort of wilderness. In The Last of the Mohicans it was the forest. In Tarzan it was the jungle. In both cases the wilderness was inhabited by savages: Indians, apes, or pirates. Both included damsels in distress who were treated in a rather Victorian manner and were sought after by an unwanted lover. Both included characters with little common sense: David and Professor Porter.

I found Tarzan much more readable and story-like compared to The Last of the Mohicans. It can really be divided into two parts. The first half is character development. Who is Tarzan, and how did he become who he is? The second half is a love story, more or less, where Tarzan pursues his love. I found it rather distasteful that Tarzan was a man of “breeding”. I don’t think people are to be bred like animals.

WorldCat Hackathon

Eric Lease Morgan — Sun, 09 Nov 2008 14:29:36 +0000

I attended the first-ever WorldCat Hackathon on Friday and Saturday (November 7 & 8), and we attendees explored ways to take advantage of various public application programmer interfaces (APIs) supported by OCLC.

Web Services

The WorldCat Hackathon was an opportunity for people to get together, learn about a number of OCLC-supported APIs, and take time to explore how they can be used. These APIs are a direct outgrowth of something that started at least 6 years ago with an investigation of how OCLC’s data can be exposed through Web Service computing techniques. To date OCLC’s services fall into the following categories, and they are described in greater detail as a part of the OCLC Grid Services Web page:

WorldCat Search API – Search and display content from WorldCat — a collection of mostly books owned by libraries
Registry Services – Search and display names, addresses, and information about libraries
Identifier Services – Given unique keys, find similar items found in WorldCat
WorldCat Identities – Search and display information about authors from a name authority list
Terminology Services – Search and display subject authority information
Metadata Crosswalk Service – Convert one metadata format (MARC, MARCXML, XML/DC, MODS, etc.) into another. (For details of how this works, see “Toward element-level interoperability in bibliographic metadata” in Issue #2 of the Code4Lib Journal).

The Hacks

The event was attended by approximately fifty (50) people. The prize for coming the furthest went to someone from France. A number of OCLC employees attended. Most people were from academic libraries, and most people were from the surrounding states. About three-quarters of the attendees were “hackers”, and the balance were there to learn.

Taking place in the Science, Industry and Business Library (New York Public Library), the event began with an overview of each of the Web Services and the briefest outline of how they might be used. We then quickly broke into smaller groups to “hack” away. The groups fell into a number of categories: Drupal, VUFind, Find More Like This One/Miscellaneous, and language-specific hacks. We reconvened after lunch on the second day sharing what we had done as well as what we had learned. Some of the hacks included:

Term Finder – Enter a term. Query the Terminology Services. Get back a list of broader and narrower terms. Select items from results. Repeat. Using such a service a person can navigate a controlled vocabulary space to select the most appropriate subject heading.
Name Finder – Enter a first name and a last name. Get back a list of WorldCat Identities matching the queries. Display the subject terms associated with the works of this author. Select subject terms, and the results are displayed in Term Finder.
Send It To Me – Enter an ISBN. Determine whether or not the item is held locally. If so, then allow the user to borrow the item. If not, then allow the user to find other items like that item, purchase it, and/or facilitate an interlibrary loan request. All three of these services were written by me. The first two were written during the Hackathon. The last was written more than a year ago. All three could be used on their own or incorporated into a search results page.
Find More Like This One in VUFind – Written by Scott Mattheson (Yale University Library), this prototype was in the form of a number of screen shots. It allows the user to first do a search in VUFind. If desired items are checked out, then it will search for other local copies.
Google Map Libraries – Greg McClellan (Brandeis University) combined the WorldCat Search API, Registry Services, and Google Maps to display the locations of nearby libraries that reportedly own a particular item.
Recommend Tags – Chad Fennell (University of Minnesota Libraries) overrode a Drupal tagging function to work with MeSH controlled vocabulary terms. In other words, as items in Drupal are being tagged, this hack leads the person doing data entry to use MeSH headings.
Enhancing Metadata – Piotr Adamzyk (Metropolitan Museum of Art) has access to both bibliographic and image materials. Through the use of Yahoo Pipes technology he was able to read metadata from an OAI repository, map it to metadata found in WorldCat, and ultimately supplement the metadata describing the content of his collections.
Pseudo-Metasearch in VUFind – Andrew Nagy (Villanova University) demonstrated how a search could be first done in VUFind, and have subsequent searches done against WorldCat by simply clicking on a tabbed interface.
Find More Like This One – Mark Matienzo (NYPL Labs) created an interface taking an OCLC number as input. Given this, it returned subject headings in an effort to return other items. It was at this point Ralph LeVan (OCLC) said, “Why does everybody use subject headings to find similar items? Why not map your query to Dewey numbers and find items expected to be placed right next to the given item on the shelf?” Good food for thought.
xISBN Bookmarklette – Liu Xiaoming (OCLC) demonstrated a Web browser tool. Enter your institution’s name. Get back a browser bookmarklette. Drag bookmarklette to your toolbar. Search things like Amazon. Select ISBN number from the Web page. Click bookmarklette. Determine whether or not your local library owns the item.

Summary

Obviously the hacks created in this short period of time by a small number of people illustrate just a tiny bit of what could be done with the APIs. More importantly and IMHO, what these APIs really demonstrate is the ways librarians can have more control over their computing environment if they were to learn to exploit these tools to their greatest extent. Web Service computing techniques are particularly powerful because they are not wedded to any specific user interface. They simply provide the means to query remote services and get back sets of data. It is then up to librarians and developers — working together — to figure out what to do with the data. As I’ve said somewhere previously, “Just give me the data.”

I believe the Hackathon was a success, and I encourage OCLC to sponsor more of them.

VUFind at PALINET

Eric Lease Morgan — Fri, 07 Nov 2008 03:37:47 +0000

I attended a VUFind meeting at PALINET in Philadelphia today, November 6, and this posting summarizes my experiences there.

As you may or may not know, VUFind is a “discovery layer” intended to be applied against a traditional library catalog. Originally written by Andrew Nagy of Villanova University, it has been adopted by a handful of libraries across the globe and is being investigated by quite a few more. Technically speaking, VUFind is an open source project based on Solr/Lucene. Extract MARC records from a library catalog. Feed them to Solr/Lucene. Provide access to the index as well as services against the search results.

The meeting was attended by about thirty people. The three people from Tasmania won the prize for coming the furthest, but there were also people from Stanford, Texas A&M, and a number of more regional libraries. The meeting had a barcamp-like agenda. Introduce ourselves. Brainstorm topics for discussion. Discuss. Summarize. Go to bar afterwards. Alas, I didn’t get to go to the bar, but I was there for the balance. The following bullet points summarize each discussion topic:

Jangle – A desire was expressed to implement some sort of API (application programmer interface) to VUFind in order to ensure a greater degree of interoperability. The DLF-DI was mentioned quite a number of times, but Jangle was the focus of the discussion. Unfortunately, not a whole lot of people around the room knew about Jangle, the ATOM Publishing Protocol, or REST-ful computing techniques in general. Because creating an API was desired, there was some knowledge of the XC (eXtensible Catalog) project around the room, and there was curiosity/frustration as to why more collaboration could not be done with XC. Apparently the XC process and their software are not as open and transparent as I had thought. (Note to self: ping the folks at XC and bring this issue to their attention.) In the end, implementing something like Jangle was endorsed.
Non-MARC content – It was acknowledged that non-MARC content ought to be included in any sort of “discovery layer”. A number of people had experimented with including content from their local institutional repositories, digital libraries, and/or collection of theses & dissertations. The process is straight-forward. Get set of metadata. Map it to VUFind/Solr fields. Feed it to the indexer. Done. Other types of data people expressed an interest in incorporating included: EAD, TEI, images, various types of data sets, and mathematical models. From here the discussion quickly evolved into the next topic…
Solrmarc – Through the use of a Java class called MARC4J, a Solr plug-in has been created by the folks at the University of Virginia. This plug-in — Solrmarc — makes it easier to read MARC data and feed it to Solr. There was a lot of discussion whether or not this plug-in should be extended to include other data types, such as the ones outlined above, or to distribute Solrmarc as-is, more akin to a GNU “do one thing and one thing well” type of tool. From my perspective, no specific direction was articulated.
Authority control – We all knew the advantage of incorporating authority lists (names, authors, titles) into VUFind. The general idea was to acquire authority lists. Incorporate this data into the underlying index. Implement “find more like this one” types of services against search results based on the related records linked through authorities. There was then much discussion on how to initially acquire the necessary authority data. We were a bit stymied. After lunch a slightly different tack was taken. Acquire some authority data, say about 1,000 records. Incorporate it into an implementation of VUFind. Demonstrate the functionality to wider audiences. Tackle the problem of getting more complete and updated authority data later.
De-duplication/FRBR – This was probably the shortest discussion point, and it really surrounded FRBR. We ended up asking ourselves, “To what degree do we want to incorporate Web Services such as xISBN into VUFind to implement FRBR-like functionality, or to what degree should ‘real’ FRBRization take place?” Compared to other things, de-duplication/FRBR seemed to be taking a lower priority.
Serials holdings – This discussion was around indexing and/or displaying serials holdings information. There was much talk about the ways various integrated library systems allow libraries to export holdings information, whether or not it was merged with bibliographic information, and how consistent it was from system to system. In general it was agreed that this holdings information ought to be indexed to enable searches such as “Time Magazine 2004”, but displaying the results was seen as problematic. “Why not use your link resolver to address this problem?” was asked. This whole issue too was given a lower priority since more and more serial holdings are increasingly electronic in nature.
Federated search – It was agreed that federated search s?cks, but it is a necessary evil. Techniques for incorporating it into VUFind ranged from: 1) side-stepping the problem by licensing bibliographic data from vendors, 2) side-stepping the problem by acquiring binary Lucene indexes of bibliographic data from vendors, 3) creating some sort of “smart” interface that looks at VUFind search results to automatically select and search federated search targets whose results are hidden behind a tab until selected by the user, or 4) allowing the user to assume some sort of predefined persona (Thomas Jefferson, Isaac Newton, Kurt Godel, etc.) to point toward the selection of search targets. LibraryFind was mentioned as a store for federated search targets. Pazpar2 was mentioned as a tool to do the actual searching.
Development process – The final discussion topic regarded the on-going development process. To what degree should the whole thing be more formalized? Should VUFind be hosted by a third party? Code4Lib? PALINET? A newly created corporation? Is it a good idea to partner with similar initiatives such as OLE (Open Library Environment), XC, DLF-DI, or BlackLight? On one hand, such formalization would give the process more credibility and open more possibilities for financial support, but on the other hand the process would also become more administratively heavy. Personally, I liked the idea of allowing PALINET to host the system. It seems to be an excellent opportunity for such a library-support organization.
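
The recipe from the non-MARC discussion above (get a set of metadata, map it to index fields, feed it to the indexer) can be sketched as follows. This is only an illustration in Python: the field names in `FIELD_MAP`, the function `to_index_document`, and the sample record are hypothetical and are not taken from VUFind's actual Solr schema, and the step of sending the JSON to an indexer is left out.

```python
import json

# Hypothetical mapping from Dublin Core element names to index field names.
FIELD_MAP = {
    "title": "title",
    "creator": "author",
    "subject": "topic",
    "identifier": "id",
}

def to_index_document(metadata):
    """Map a dict of element -> list of values onto index field names."""
    doc = {}
    for element, values in metadata.items():
        field = FIELD_MAP.get(element)
        if field is None:
            continue  # silently skip elements the index does not know about
        # An identifier is single-valued; everything else may repeat.
        doc[field] = values[0] if field == "id" else list(values)
    return doc

record = {
    "title": ["A thesis on lake effect snow"],
    "creator": ["Doe, Jane"],
    "subject": ["Snow", "Lake Michigan"],
    "identifier": ["etd-0001"],
    "rights": ["Open access"],
}

# Many indexers accept a JSON list of documents; here it is simply printed.
print(json.dumps([to_index_document(record)], indent=2))
```

The interesting decisions are all in the mapping table, which is why the discussion kept returning to which data types ought to be supported.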

The day was wrapped up by garnering volunteers to see after each of the discussion points in the hopes of developing them further.

I appreciated the opportunity to attend the meeting, especially since it is quite likely I will be incorporating VUFind into a portal project called the Catholic Research Resources Alliance. I find it amusing the way many “next generation” library catalog systems — “discovery layers” — are gravitating toward indexing techniques and specifically Lucene. Currently, these systems include VUFind, XC, BlackLight, and Primo. All of them provide a means to feed an indexer data, and then provide user access to the index.

Of all the discussions, I enjoyed the one on federated search the most because it toyed with the idea of making the interfaces to our indexes smarter. While this smacks of artificial intelligence, I sincerely think this is an opportunity to incorporate library expertise into search applications.

Dinner with Google

Eric Lease Morgan — Mon, 22 Sep 2008 21:45:32 +0000

On Thursday, September 4 a person from Google named Jon Trowbridge gave a presentation at Notre Dame called “Making scientific datasets universally accessible and useful”. This posting reports on the presentation and dinner afterwards.

The presentation

Jon Trowbridge is a software engineer working for Google. He seems to be an open source software and an e-science type of guy who understands academia. He echoed the mission of Google — “To organize the world’s information and make it universally accessible and useful”, and he described how this mission fits into his day-to-day work. (I sort of wish libraries would have such an easily stated mission. It might clear things up and give us better focus.)

Trowbridge works for a group at Google exploring ways to make large datasets available. He proposes to organize and distribute datasets in the same manner open source software is organized.

He enumerated things people do with data of this type: compute against it, visualize it, search it, do meta-analysis, and create mash-ups. But all of this begs Question 0. “You have to possess the data before you can do stuff with it.” (This is also true in libraries, and this is why I advocate digitization as opposed to licensing content.)

He speculated why scientists have trouble distributing their data, especially if it is more than a terabyte in size. URLs break. Datasets are not very indexable. Datasets are the fodder for new research. He advocated the creation of centralized data “clouds”, and these “clouds” ought to have the following qualities:

archival
librarian-friendly (have some metadata)
citation-friendly
publicly accessible
legally unencumbered
discipline neutral
massively scalable
downloadable via HTTP

As he examined people’s datasets he noticed that many of them are simple hierarchical structures saved to file systems, but they are so huge that transporting them over the network isn’t feasible. After displaying a few charts and graphs, he posited that physically shipping hard disks via FedEx provides the fastest throughput. Given that hard drives can cost as little as 16¢/GB, FedEx can deliver data at a rate of 20 TB/day. Faster and cheaper than just about anybody’s network connection.
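
To put the FedEx figures in network terms, here is a small back-of-the-envelope calculation in Python. It assumes decimal units (1 TB = 10^12 bytes, 1 TB = 1,000 GB) and ignores both the shipping charges and the time needed to copy data on and off the disks; the function names are mine, not Trowbridge's.

```python
def shipping_rate_gbit_per_s(terabytes_per_day):
    """Convert a delivery rate in TB/day to gigabits per second (decimal units)."""
    bits_per_day = terabytes_per_day * 10**12 * 8
    seconds_per_day = 24 * 60 * 60
    return bits_per_day / seconds_per_day / 10**9

def disk_cost_dollars(terabytes, cents_per_gb=16):
    """Cost of the disks themselves at a given price per gigabyte."""
    return terabytes * 1000 * cents_per_gb / 100

rate = shipping_rate_gbit_per_s(20)
print(f"{rate:.2f} Gbit/s sustained")          # 1.85 Gbit/s sustained
print(f"${disk_cost_dollars(20):,.0f} in disks")  # $3,200 in disks
```

In other words, 20 TB/day is the equivalent of a network link running at a sustained 1.85 gigabits per second around the clock.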

The challenge

Given this scenario, Trowbridge gave away 5 TB of hard disk space. He challenged us to fill it up with data and share it with him. He would load the data into his “cloud” and allow people to use it. This is just the beginning of an idea, not a formal service. Host data locally. Provide tools to access and use it. Support e-science.

Personally, I thought it was a pretty good idea. Yes, Google is a company. Yes, I wonder to what degree I can trust Google. Yes, if I make my data accessible then I don’t have a monopoly on it, and others may beat me to the punch. On the other hand, Google has so much money that they can afford to “Do no evil.” I sincerely doubt anybody was trying to pull the wool over our eyes.

Dinner with Jon

After the presentation I and a couple of my colleagues (Mark Dehmlow and Dan Marmion) had dinner with Jon. We discussed what it is like to work for Google. The hiring process. The similarities and differences between Google and libraries. The weather. Travel. Etc.

All in all, I thought it was a great experience. “Thank you for the opportunity!” It is always nice to chat with sets of my peers about my vocation (as well as my avocation).

Unfortunately, we never really got around to talking about the use of data, just its acquisition. The use of data is a niche I believe libraries can fill and Google can’t. Libraries are expected to know their audience. Given this, information acquired through a library setting can be put into the user’s context. This context-setting is a service. Beyond that, other services can be provided against the data. Translate. Analyze. Manipulate. Create word cloud. Trace idea forward and backward. Map. Cite. Save for later and then search. Etc. These are spaces where libraries can play a role, and the lynchpin is the acquisition of the data/information. Other institutions have all but solved the search problem. It is now time to figure out how to put the information to use so we can stop drinking from the proverbial fire hose.

P.S. I don’t think very many people from Notre Dame will be taking Jon up on his offer to host their data.

MyLibrary: A Digital library framework & toolbox

Eric Lease Morgan — Thu, 18 Sep 2008 03:26:12 +0000

I recently had an article published in Information Technology and Libraries (ITAL) entitled “MyLibrary: A Digital library framework & toolkit” (volume 27, number 3, pages 12-24, September 2008). From the abstract:

This article describes a digital library framework and toolkit called MyLibrary. At its heart, MyLibrary is designed to create relationships between information resources and people. To this end, MyLibrary is made up of essentially four parts: 1) information resources, 2) patrons, 3) librarians, and 4) a set of locally-defined, institution-specific facet/term combinations interconnecting the first three. On another level, MyLibrary is a set of object-oriented Perl modules intended to read and write to a specifically shaped relational database. Used in conjunction with other computer applications and tools, MyLibrary provides a way to create and support digital library collections and services. Librarians and developers can use MyLibrary to create any number of digital library applications: full-text indexes to journal literature, a traditional library catalog complete with circulation, a database-driven website, an institutional repository, an image database, etc. The article describes each of these points in greater detail.

http://infomotions.com/musings/mylibrary-framework/

The folks at ITAL are gracious enough to allow authors to distribute their work on the Web as long as the distribution happens after print publication. “Nice policy!”

Many people will remember MyLibrary from more than ten years ago. It is alive and well. It drives a few digital library projects at Notre Dame. It is often associated with customization/personalization, but now it is more about creating relationships between people and information resources through an institution-defined controlled vocabulary — a set of facet/term combinations.

In my opinion, libraries spend too much time describing resources and creating interdependencies between them. Instead, I think libraries should be spending more time creating relationships between resources and people. You can do this in any number of ways, and sets of facet/term combinations are just one. Think up qualities used to describe people. Think up qualities used to describe information resources. Create relationships by bringing resources and people together that share qualities.

MBooks, revisited

Eric Lease Morgan — Tue, 09 Sep 2008 01:36:59 +0000

This posting makes available a stylesheet to render MARCXML from a collection of records called MBooks.

In a previous post — get-mbooks.pl — I described how to use OAI-PMH to harvest MARC records from the MBooks project. The program works; it does what it is supposed to do.

The MBooks collection is growing so I harvested the content again, but this time I wanted to index it. Using an indexer/search engine called Zebra, the process was almost trivial. (See “Getting Started With Zebra” for details.)

Since Zebra supports SRU (Search/Retrieve via URL) out of the box, searches against the index return MARCXML. This will be a common returned XML stream for a while, so I needed to write an XSLT stylesheet to render the output. Thus, mbooks.xsl was born.
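
For readers who have not seen SRU before, a search is nothing more than a URL with a few standard parameters. Here is a minimal sketch in Python of how such a URL might be built. The parameter names (`operation`, `version`, `query`, `maximumRecords`) come from the SRU 1.1 specification; the host, port, database name, and the `dc.title` index are assumptions for illustration, not the address or configuration of my server.

```python
from urllib.parse import urlencode

def sru_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)

# Hypothetical server address and index name.
url = sru_url("http://localhost:9999/mbooks", 'dc.title = "tarzan"')
print(url)
```

Fetching that URL from a suitably configured server would return an XML stream wrapping the matching records, and it is that stream a stylesheet gets applied to.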

What is really “kewl” about the stylesheet is the simple inline Javascript allowing the librarian to view the MARC tags in all their glory. For a little while you can see how this all fits together in a simple interface to the index.

Use mbooks.xsl as you see fit, but remember “Give back to the ‘Net.”

wordcloud.pl

Eric Lease Morgan — Mon, 25 Aug 2008 13:50:25 +0000

Attached should be a simple Perl script called wordcloud.pl. Initialize it with a hash of words and associated integers. Output rudimentary HTML in the form of a word cloud. This hack was used to create the word cloud in a posting called “Last of the Mohicans and services against texts”.
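
For those who do not read Perl, the same idea can be sketched in Python. This is not a translation of wordcloud.pl; it is a minimal illustration of the technique, and the function name, the pixel sizes, and the linear scaling rule are my assumptions.

```python
from html import escape

def word_cloud_html(counts, min_size=10, max_size=36):
    """Render a dict of word -> count as HTML spans with scaled font sizes."""
    if not counts:
        return ""
    low, high = min(counts.values()), max(counts.values())
    spread = (high - low) or 1  # avoid dividing by zero when all counts match
    spans = []
    for word in sorted(counts):
        # Scale each count linearly between the smallest and largest font size.
        size = min_size + (counts[word] - low) * (max_size - min_size) // spread
        spans.append(f'<span style="font-size: {size}px">{escape(word)}</span>')
    return " ".join(spans)

cloud = word_cloud_html({"scout": 30, "heyward": 20, "uncas": 10})
print(cloud)
```

The words are sorted alphabetically so the eye is drawn by size alone, but sorting by frequency would work just as well.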

Last of the Mohicans and services against texts

Eric Lease Morgan — Mon, 25 Aug 2008 13:28:27 +0000

Here is a word cloud representing James Fenimore Cooper’s The Last of the Mohicans; A narrative of 1757. It is a trivial example of how libraries can provide services against documents, not just the documents themselves.

scout heyward though duncan uncas little without own eyes before hawkeye indian young magua much place long time moment cora hand again after head returned among most air huron toward well few seen many found alice manner david hurons voice chief see words about know never woods great rifle here until just left soon white heard father look eye savage side yet already first whole party delawares enemy light continued warrior water within appeared low seemed turned once same dark must passed short friend back instant project around people against between enemies way form munro far feet nor

About the story

While I am not a literary scholar, I am able to read a book and write a synopsis.

Set during the French and Indian War in what was to become upper New York State, the story follows two young women being escorted from one military camp to another. Along the way the hero, Natty Bumppo (also known by quite a number of other names, most notably “Hawkeye” or the “scout”), alerts the convoy that their guide, Magua, is treacherous. Sure enough, Magua kidnaps the women. Fights and battles ensue in a pristine and idyllic setting. Heroic deeds are accomplished by Hawkeye and the “last of the Mohicans” — Uncas. Everybody puts on disguises. In the end, good triumphs over evil but not completely.

Cooper’s style is verbose. Expressive. Flowery. On this level it was difficult to read. Too many words. On the other hand the style was consistent, provided a sort of pattern, and enabled me to read the novel with a certain rhythm.

There were a couple of things I found particularly interesting. First, the allusion to “relish”. I consider this to be a common term now-a-days, but Cooper thought it needed elaboration when used to describe food. Cooper used the word within a relatively short span of text to describe a condiment as well as a feeling. Second, I wonder whether or not Cooper’s description of Indians built on existing stereotypes or created them. “Hugh!”

Services against texts

The word cloud I created is simple and rudimentary. From my perspective, it is just a graphical representation of a concordance, and a concordance has to be one of the most basic of indexes. This particular word cloud (read “concordance” or “index”) allows the reader to get a sense of a text. It puts words in context. It allows the would-be reader to get an overview of the document.

This particular implementation is not pretty, nor is it quick, but it is functional. How could libraries create other services such as these? Everybody can find and get data and information these days. What people desire is help understanding and using the documents. Providing services against texts such as word clouds (concordances) might be one example.

Crowd sourcing TEI files

Eric Lease Morgan — Fri, 15 Aug 2008 19:44:21 +0000

How feasible and/or practical do you think “crowd sourcing” TEI files would be?

I like writing in my books. In fact, I even have a particular system for doing it. Circled things are the subjects of sentences. Squared things are proper nouns. Underlined things connected to the circled and squared things are definitions. Moreover, my books are filled with marginalia. Comments. Questions. See alsos. I call this process ELMTGML (Eric Lease Morgan’s Truly Graphic Mark-up Language), and I find it a whole lot more useful than a simple highlighter pen, where all the mark-up has the same value. Fluorescent yellow.

I think I could easily “crosswalk” my mark-up process to TEI mark-up because there are TEI elements for many of the things I highlight. Given such a thing I could mark-up texts using my favorite editor and then create stylesheets that turn on or turn off my commentary.

Suppose many classic texts were marked-up in TEI. Suppose there were stylesheets that allowed you to turn on or turn off other people’s commentary/annotations or allowed you to turn on or turn off particular people’s commentary/annotation. Wouldn’t that be interesting?
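
To make the idea of turning commentary on and off a bit more concrete, here is a rough sketch. It uses Python rather than a stylesheet, and it assumes each annotation is a TEI note element carrying a resp attribute that identifies the annotator. The identifiers (#elm, #reader2), the sample paragraph, and the function name `hide_notes` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

NOTE = "{http://www.tei-c.org/ns/1.0}note"

TEI = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">The scout paused.'
    '<note resp="#elm">Who is the scout?</note>'
    ' The woods were silent.<note resp="#reader2">Foreshadowing.</note></p>'
)

def hide_notes(root, keep):
    """Remove note elements whose annotator is not in keep, preserving the text."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == NOTE and child.get("resp") not in keep:
                # ElementTree stores the text following an element in its tail,
                # so hand the tail to a neighbor before removing the note.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root

plain = "".join(hide_notes(ET.fromstring(TEI), keep=set()).itertext())
mine = "".join(hide_notes(ET.fromstring(TEI), keep={"#elm"}).itertext())
print(plain)
print(mine)
```

The first call strips everybody's commentary and the second keeps only one annotator's notes, which is the "turn on or turn off particular people's commentary" switch in miniature.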

Moreover, what if some sort of tool, widget, or system were created that allowed anybody to add commentary to texts in the form of TEI mark-up? Do you think this would be feasible? Useful?

Metadata and data structures

Eric Lease Morgan — Wed, 06 Aug 2008 01:18:08 +0000

It is important to understand the differences between metadata and data structures. This posting outlines some of the differences between the two.

Introduction

Every once in a while people ask me for advice that I am usually very happy to give because the answers usually involve succinctly articulating some of the things floating around in my head. Today someone asked:

I’ve been looking at Dublin Core and looking at MODS to arrive at the best metadata for converting MARC records into human readable format. Dublin Core lacks specificity, but maybe I don’t understand it that well. Plus, I cannot find what parts of the MARC are mapped to what–where are the “rules.” I look at Mods and find it overwhelming and I’m not even sure of its intended purpose.

Below is how I replied.

Dublin Core is a list of element names

First of all, please understand that Dublin Core is really just a list of fifteen or so metadata element names. Title. Creator. Publisher. Format. Identifier. Etc. Moreover, each of these names comes with a simple definition denoting the type of content it is expected to represent. Dublin Core is NOT a metadata format. Dublin Core does not define how data should be encoded. It is simply a list of elements.

MARC and XML as data structures

MARC is a metadata format — a data structure — a container — a “bit bucket”. The MARC standard defines how data should be encoded. First there is a leader. It is always 24 characters long and different characters in the leader denote different things. Then there is the directory — a “map” of where the data resides in the file. Finally, there is the data itself which is divided into indicators, fields, and subfields. This MARC standard has been used to hold bibliographic data as well as authority data. In one case the 245 field is intended to encode title/author information. In another case the 245 means something else. In both cases they are using MARC — a data structure.
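The leader, directory, and data layout can be sketched with a few lines of Python. This is an illustration only, assuming the usual ISO 2709 conventions (a 24-character leader, 12-character directory entries made of a 3-character tag, a 4-digit field length, and a 5-digit starting position, plus the standard terminator bytes). The sample record and the function names are invented here, and real-world work is better done with an established MARC library.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
SF = b"\x1f"  # subfield delimiter

def build_record(fields):
    """Build a minimal MARC-style record from (tag, data) pairs."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FT
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    directory += FT
    base = 24 + len(directory)     # where the data begins
    length = base + len(body) + 1  # plus the record terminator
    leader = f"{length:05d}nam  22{base:05d}   4500".encode("ascii")
    return leader + directory + body + RT

def parse_record(record):
    """Use the leader and directory to find each field's data."""
    leader = record[:24]
    base = int(leader[12:17])
    directory = record[24:base - 1]  # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[:3].decode("ascii"), int(entry[3:7]), int(entry[7:12])
        fields.append((tag, record[base + start:base + start + length - 1]))
    return leader.decode("ascii"), fields

def subfields(data):
    """Split a variable field into its two indicators and (code, value) pairs."""
    parts = data[2:].split(SF)[1:]
    return data[:2].decode("ascii"), [(p[:1].decode("ascii"), p[1:].decode("utf-8")) for p in parts]

record = build_record([
    ("100", b"1 " + SF + b"aTwain, Mark"),
    ("245", b"10" + SF + b"aA ghost story"),
])
leader, fields = parse_record(record)
for tag, data in fields:
    print(tag, subfields(data))
```

Notice that nothing in the structure says what a 245 field means. The structure only says where the field's bytes live.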

XML is a second type of data structure. Instead of leaders, directories, and data sections, XML is made up of nested elements where the elements of the file are denoted by a Document Type Definition (DTD) or XML schema. XML is much more flexible than MARC. XML is much more verbose than MARC. There are many industries supporting XML. MARC is supported by a single industry. MARC was cool in its time, but it has grown long in the tooth. XML is definitely the data structure to use now-a-days.

MARCXML and MODS

MARCXML is a specific flavor of XML used to contain 100% of the data in a bibliographic MARC file. It works. It does what it is supposed to do, but in order to really take advantage of it the user needs to know that the 245 field contains title information, the 100 field contains author information, etc. In other words, to use MARCXML the user needs to know the “secret code book” translating library tags into human-readable elements. Moreover, MARCXML retains all of the “syntactical sugar” of MARC. Last name first. First name last. Parentheses around birth and death dates. “pbk” to denote paperback. Etc.

MODS is a second flavor of XML also designed to contain bibliographic data. In at least a couple of ways, MODS is much better than MARCXML. First and foremost, MODS removes the need for a “secret code book” because the element names are human-readable, not integers. Second, some, but not all, of the syntactical sugar is removed.
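The “secret code book” problem shows up as soon as you try to pull a title out of each flavor. The sketch below uses Python's standard library against two small, hand-written fragments. The fragments are my own illustrations and not complete or validated records, though the namespaces are the ones published by the Library of Congress.

```python
import xml.etree.ElementTree as ET

# Hand-written fragments for illustration only.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Twain, Mark,</subfield>
    <subfield code="d">1835-1910.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">A ghost story</subfield>
  </datafield>
</record>"""

MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>A ghost story</title></titleInfo>
  <name type="personal">
    <namePart>Twain, Mark</namePart>
    <namePart type="date">1835-1910</namePart>
  </name>
</mods>"""

def marcxml_title(xml):
    """Finding the title requires knowing that 245 subfield a means title."""
    ns = {"marc": "http://www.loc.gov/MARC21/slim"}
    path = "marc:datafield[@tag='245']/marc:subfield[@code='a']"
    return ET.fromstring(xml).findtext(path, namespaces=ns)

def mods_title(xml):
    """The element names say what they contain."""
    ns = {"mods": "http://www.loc.gov/mods/v3"}
    return ET.fromstring(xml).findtext("mods:titleInfo/mods:title", namespaces=ns)

print(marcxml_title(MARCXML))
print(mods_title(MODS))
```

Both functions return the same string, but only the second can be read without a copy of the MARC documentation at hand.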

When it comes to bibliographic data, I advocate MODS over MARCXML any day. Not perfect, but a step in the right direction. There are utilities to convert MARC to MARCXML and then to MODS. Conversion is almost a trivial computing problem to solve.

The “right” metadata standard

When it comes to choosing the “right” metadata standard it is often about choosing the “right” flavor of XML. VRACore, for example, is more amenable to describing image data. TEI is best suited to describe — mark-up — prose and/or poetry. EAD is the “best” candidate for archival finding aids. Authority data can be represented in a relatively new XML flavor called MADS. METS is used, more or less, to create collections of metadata objects. RDF is similar to METS and is intended to form the basis of the Semantic Web. SKOS is an XML format for thesauri.

In short, there are two things to consider. First, what is your data? Bibliographic? Image? Full texts? Second, what data structure do you want to employ? MARC? XML? Something else such as a tab-delimited file? (Ick!) Or maybe a relational database schema? (Maybe.) In most cases I expect XML will be the data structure you want to employ, and then the question is, “What XML DTD or schema do I want to exploit?”

I allude to many of these issues in an XML workshop I wrote called XML In Libraries.

‘Hope this helps.

Origami is arscient, and so is librarianship

Eric Lease Morgan — Wed, 30 Jul 2008 17:08:52 +0000

To do origami well a person needs to apply both artistic and scientific methods to the process. The same holds true for librarianship.

Arscience

Arscience is a word I have coined to denote the salient aspects of both art and science. It is a type of thinking — thinquing — that is both intuitive as well as systematic. It exemplifies synthesis — the bringing together of ideas and concepts — and analysis — the division of our world into smaller and smaller parts. Arscience is my personal epistemological method employing a Hegelian dialectic — an internal discussion. It juxtaposes approaches to understanding including art and science, synthesis and analysis, as well as faith and experience. These epistemological methods can be compared and contrasted, used or exploited, applied and debated against many of the things we encounter in our lives. Through this process I believe a fuller understanding of many things can be achieved.

Origami

A trivial example is origami. On one hand, origami is very artistic. Observe something in the natural world. Examine its essential parts and take notice of their shape. Acquire a piece of paper. Fold the paper to bring the essential parts together to form a coherent whole. The better your observation skills, the better your command of the medium, the better your origami will be.

On the other hand, you can discover that a square can be inscribed on any plane, and upon a square any number of regular polygons can be further inscribed. All through folding. You can then go about bisecting angles and dividing paper in halves, creating symbols denoting different types of folds, and systematically recording the process so it can be shared with others, ultimately creating a myriad of three-dimensional objects from an essentially two-dimensional thing. Unfold the three-dimensional object to expose its mathematics.

Seemingly conflicting approaches to the same problem result in similar outcomes. Arscience.

Librarianship

The same artistic and scientific processes — an arscient process — can be applied to librarianship. While there are subtle differences between different libraries, they all do essentially the same thing. To some degree they all collect, organize, preserve, and disseminate data, information, and knowledge for the benefit of their respective user populations.

To accomplish these goals the librarian can take both an analysis tack as well as a synthesis tack. Interactions with people are more about politics, feelings, wants, and needs. Such things are not logical but emotional. This is one side of the coin. The other side of the coin includes well-structured processes & workflows, usability studies & statistical analysis, systematic analysis & measurable results. In a hyper-dynamic environment, such as the one we are working in, innovation — thinking a bit outside the box — is a necessary ingredient for moving forward. At the same time, it is not all about creativity but it is also about strategically planning for the near, medium, and long term future.

Librarianship requires both. Librarianship is arscient.

On the move with the Mobile Web

Eric Lease Morgan — Sun, 20 Jul 2008 23:26:56 +0000

On The Move With The Mobile Web by Ellyssa Kroski provides a nice overview of mobile technology and what it presently means for libraries.

What is in the Report

In my most recent list of top technology trends I mentioned mobile devices. Because of this Kroski had a copy of the Library Technology Report she authored, above, sent to me. Its forty-eight pages essentially consist of six chapters (articles) on the topic of the Mobile Web:

What is the Mobile Web? – An overview of Web technology and its use on hand-held, portable devices. I liked the enumeration of Mobile Web benefits such as: constant connectivity, location-aware services, limitless access, and interactive capabilities. Also, texting was described here as a significant use of the Mobile Web. Ironically, I sent my first text message just prior to the 2008 ALA Annual Meeting.
Mobile devices – A listing and description of the hardware, software (operating systems as well as applications), networks, and companies working in the sphere of the Mobile Web. Apparently three companies (Verizon, AT&T, and Sprint Nextel) have 70% of the market share in terms of network accessibility in the United States.
What can you do with the Mobile Web? – Another list and description but this time of application types: email, text messaging, ringtones & wallpaper, music & radio, software & games, instant messaging, social networking, ebooks, social mapping networks (sort of scary if you ask me), search, mapping, audiobooks, television, travel, browsers, news, blogging, food ordering, and widgets.
Library mobile initiatives – A listing and description of what some libraries are doing with the Mobile Web. Ball State University’s Mobile Web presence seems to be out in front in this regard, and PubMed seems pretty innovative as well. For some commentary regarding iPhone-specific applications for libraries see Peter Brantley’s “The Show Room Library”.
How to create a mobile experience – This is more or less a set of guidelines for implementing Mobile Web services. Some of the salient points include: it is about providing information to people who don’t have a computer, think a lot about location-based services, understand the strengths & weaknesses of the technology. I found this chapter to be the most useful.
Getting started with the Mobile Web – A list of fun things to do to educate yourself on what the Mobile Web can do.

Each chapter is complete with quite a number of links and citations for further reading.

Cellphone barcodes

Through my reading of this Report my knowledge of the Mobile Web increased. The most interesting thing I learned was the existence of Semapedia, a project that “strives to tag real-world objects with 2D barcodes that can be read by camera phones.” Go to Semapedia. Enter a Wikipedia URL. Get back a PDF document containing “barcodes” that your cellphone should be able to read (with the appropriate application). Label real-world things with the barcode. Scan the code with your cellphone. See a Wikipedia article describing the thing. Interesting. Below is one of these barcodes for the word “blog” which links to the Mobile Web-ready Wikipedia entry on blogs:

Read the report

I still believe the Mobile Web is going to play a larger role in people’s everyday lives. (Duh!) By extension, I believe it is going to play a larger role in libraries. Ellyssa Kroski’s On The Move With The Mobile Web will give you a leg up on the technology.

TPM — technological protection measures

Eric Lease Morgan — Sun, 20 Jul 2008 18:55:25 +0000

I learned a new acronym a few weeks ago — TPM — which stands for “technological protection measures”, and in the May 2008 issue of College & Research Libraries Kristin R. Eschenfelder wrote an article called “Every library’s nightmare?” and enumerated various types of protection measures employed by publishers to impede the use of electronic scholarly material.

Types of restrictions

In today’s environment, where digital information is increasingly bought, sold, and/or licensed, publishers feel the need to protect their product from duplication. As described by Eschenfelder, these protections — restrictions — come in two forms: soft and hard.

Soft restrictions are “configurations of hardware or software that make certain uses such as printing, saving, copy/pasting, or e-mailing more difficult — but not impossible — to achieve.” The soft restrictions have been divided into the following subtypes:

extent of use – page print limits; PDF download limits; data export limits; suspicious use tracking
obfuscation – need to select items before options become available
omission – not providing buttons or links to enact uses
decomposition – saving document results in many files, making recreating or e-mailing the document difficult
frustration – page chunking in e-books
warning – copyright warnings; end-user licenses on startup

Hard restrictions are “configurations of software or hardware that strictly prevent certain uses.” The hard restrictions have been divided into the following subtypes:

restricted copy and paste OCR – OCR exposed for searching, but not for copying and pasting of text
secure container TPM – use rights vary by resource

To investigate what types of restrictions were put into everyday practice Eschenfelder studied a total of about seventy-five resources from three different disciplines (engineering, history, art history) and tallied the types of restrictions employed.

Salient quotes

A few salient quotes from the article exemplify Eschenfelder’s position on TPM:

“This paper suggests that the soft restrictions that are present in licensed products may have already changed user’s and librarian’s expectations about what the use rights they ought to expect from vendors and their products.” (Page 207)
“One concern is that the library community has already accepted many of the soft use restrictions identified in this paper.” (Page 219)
“[Librarians] should also advocate for removal of use restrictions, or encourage new vendors to offer competing restriction-free products.” (Page 219)
“A more realistic solution might be a shared knowledge base of vendor interfaces and known use restrictions.” (Page 219)
“The paper argues that soft use restrictions deserve more attention from the library community, and that librarians should not accept these restrictions as the natural order of things.” (Page 220)

My commentary

I agree with Eschenfelder.

Many people who work in libraries seem to be there because of the values libraries portray. Examples include but are not limited to: intellectual freedom, education, diversity, equal access to information, preservation of the historical record for future generations, etc. Heaven knows, people who work in libraries are not in it for the money! I fall into the equal access to information camp, and that is why I advocate things like open access publishing and open source software development.

TPM inhibits the free and equal access of information, and I think Eschenfelder makes a good point when she says the “library community has already accepted many of the soft use restrictions.” Why do we accept them? Librarians are not required to purchase and/or license these materials. We have choice. If much of the scholarly publishing industry is driven by the marketplace — supply & demand — then why don’t/can’t we just say, “No”? Nobody is forcing us to spend our money this way. If vendors don’t provide the sort of products and services we desire, then the marketplace will change. Right?

In any event, consider educating yourself on the types of TPM and read Eschenfelder’s article.

Against The Grain is not

Eric Lease Morgan — Tue, 15 Jul 2008 23:24:26 +0000

Against The Grain is not your typical library-related serial.

Last year I had the opportunity to present at the 27th Annual Charleston Conference where I shared my ideas regarding the future of search and how some of those ideas can be implemented in “next-generation” library catalogs. In appreciation of my efforts I was given a one-year subscription to Against The Grain. From the website’s masthead:

Against the Grain (ISSN: 1043-2094) is your key to the latest news about libraries, publishers, book jobbers, and subscription agents. It is a unique collection of reports on the issues, literature, and people that impact the world of books and journals. ATG is published on paper six times a year, in February, April, June, September, and November and December/January.

I try to read the issues as they come out, but I find it difficult. This is not because the content is poor, but rather because there is so much of it! In a few words and phrases, Against The Grain is full, complete, dense, tongue-in-cheek, slightly esoteric, balanced, graphically challenging and at the same time graphically interesting, informative, long, humorous, supported by advertising, somewhat scholarly, personal, humanizing, a realistic reflection of present-day librarianship (especially in regards to technical services in academic libraries), predictable, and consistent. For example, every issue contains a “rumors” article listing bunches and bunches of people, where they are going, and what they are doing. Moreover, the articles are printed in a relatively small typeface in a three-column format. Very dense. To make things easier to read, sort of, all names and titles are bolded. I suppose the dutiful reader could simply scan for names of interest and read accordingly, but there are so many of them. (Incidentally, the bolded names pointed me to the Tenth Fiesole Retreat which piqued my interest because I had given a modified SIG-IR presentation on MyLibrary at the Second Fiesole Retreat. Taking place at Oxford, that was a really cool meeting!)

Don’t get me wrong. I like Against The Grain but it is so full of information and has been so thoroughly put together that I feel almost embarrassed not reading it. I feel like the amount of work put into each issue warrants the same amount of effort on my part to read it.

The latest issue (volume 20, number 3, June 2008) includes a number of articles about Google. For me, the most interesting articles included:

“Kinda just like Google” by Jimmy Ghaphery – an examination of the number of search targets appearing on ARL library home pages. Almost all of them include a search of the catalog. Slightly fewer have searches of meta-search engines. Fewer still include searches of Google and its relatives, and fewest of all, if not non-existent, were searches of locally created indexes like institutional repositories or digital collections. Too many search boxes?
“Giggling Over Google” by Lilia Murray – a description of how Google Docs and Google Custom Search engines can be used and harnessed in libraries. Well-documented. Well-written. Advocates the creation of more Custom Search Engines by librarians. Sounds like a great idea to me.
“Keeping the Enemy Close” by John Wender – compares and contrasts the advantages and disadvantages of including/supporting Google Scholar in an academic library setting. I liked the allusion to Carl Shapiro and Hal Varian’s idea of “information as an ‘experience good’”. Kinda like, “A bird in the hand is worth two in the bush.”
“Measuring the ‘Google Effect’ at JSTOR” by Bruce Heterick – a description of how JSTOR’s usage skyrocketed after its content was indexed by Google.
“Prescription vs. Description in the information-seeking process, or should we encourage our patrons to use Google Scholar?” by Bruce Sanders – contrasts “prescription” and “description” librarianship. One encourages competent, sophisticated searching of databases. The other tailors the library Website to make the patron search strategies as effective as possible. An interesting comparison.
“Medium rare books, PODS wars, instant books brought to you by algorithms” by John D. Riley – describes how a fortune of books was found in the stacks of the Forbes Library as opposed to the library’s special collections.

If you have the time, spend it reading Against The Grain.

E-journal archiving solutions

Eric Lease Morgan — Tue, 15 Jul 2008 03:01:26 +0000

A JISC-funded report on e-journal archiving solutions is an interesting read, and it seems as if no particular solution is the hands-down “winner”.

Terry Morrow, et al. recently wrote a report sponsored by JISC called “A Comparative study of e-journal archiving solutions”. Its goal was to compare & contrast various technical solutions to archiving electronic journals and present an informed opinion on the subject.

Begged and unanswered questions

The report begins by setting the stage. Of particular note is the increased movement to e-only journal solutions many libraries are adopting. This e-only approach begs unanswered questions regarding the preservation and archiving of electronic journals — two similar but different aspects of content curation. To what degree will e-journals suffer from technical obsolescence? Just as importantly, how will the change in publishing business models, where access, not content, is provided through license agreements, affect perpetual access and long-term preservation of e-journals?

Two preservation techniques

The report outlines two broad techniques to accomplish the curation of e-journal content. On one hand there is “source file” preservation where content (articles) is provided by the publisher to a third-party archive. This is the raw data of the articles — possibly SGML files, XML files, Word documents, etc. — as opposed to the “presentation” files intended for display. This approach is seen as being more complete, but relies heavily on active publisher and third party participation. This is the model employed by Portico. The other technique is harvesting. In this case the “presentation” files are archived from the Web. This method is more akin to the traditional way libraries preserved and archived their materials. This is the model employed by LOCKSS.

Compare & contrast

In order to come to their conclusions, Morrow et al. compared & contrasted six different e-journal preservation initiatives while looking through the lens of four possible trigger events. These initiatives (technical archiving solutions) included:

British Library e-Journal Digital Archive – a fledgling initiative by a national library
CLOCKSS – a dark archive of articles using the same infrastructure as LOCKSS
e-Depot – a national library initiative from The Netherlands
LOCKSS – an open source and distributed harvesting implementation
OCLC ECO – an aggregation of aggregators, not really preservation
Portico – a Mellon-backed “source file” approach

The trigger events included:

cancelation of an e-journal title
e-journal no longer available from a publisher
publisher ceased operation
catastrophic hardware or network failure

These characteristics made up a matrix and enabled Morrow, et al. to describe what would happen with each initiative under each trigger event. In summary, they would all function but it seems the LOCKSS solution would provide immediate access to content whereas most of the other solutions would only provide delayed access. Unfortunately, the LOCKSS initiative seems to have less publisher backing than the Portico initiative. On the other hand, the Portico initiative costs more money and assumes a lot of new responsibilities from publishers.

In today’s environment where information is more routinely sold and licensed, I wonder what level of trust can be given to publishers. What’s in it for them? In the end, neither solution — LOCKSS nor Portico — can be considered ideal, and both ought to be employed at the present time. One size does not fit all.

Recommendations

In the end there were ten recommendations:

carry out risk assessments
cooperate with one or more external e-journal archiving solutions
develop standard cross-industry definitions of trigger events and protocols
ensure archiving solutions cover publishers of value to UK libraries
explicitly state perpetual access policies
follow the Transfer Code of Practice
gather and share statistical information about the likelihood of trigger events
provide greater detail of coverage details
review and update this study on a regular basis
take the initiative by specifying archiving requirements when negotiating licenses

Obviously the report went into much greater detail regarding all of these recommendations and how they were derived. Read the report for the details.

There are many aspects that make up librarianship. Preservation is just one of them. Unfortunately, when it comes to preservation of electronic, born-digital content, the jury is still out. I’m afraid we are suffering from a wealth of content right now, but in the future this content may not be accessible because society has not thought very long into the future regarding preservation and archiving. I hope we are not creating a Digital Dark Age as we speak. Implementing ideas from this report will help reduce the possibility of this problem becoming a reality.

Web 2.0 and “next-generation” library catalogs

Eric Lease Morgan — Tue, 15 Jul 2008 01:50:50 +0000

A First Monday article systematically comparing & contrasting Web 1.0 and Web 2.0 website technology recently caught my interest, and I think it points a way to making more informed decisions regarding “next-generation” library catalog interfaces and Internet-based library services in general.

Web 1.0 versus Web 2.0

Graham Cormode and Balachander Krishnamurthy in “Key differences between Web 1.0 and Web 2.0”, First Monday, 13(6): June 2008 thoroughly describe the characteristics of Web 2.0 technology. Their article outlines the features of Web 2.0, describes the structure of Web 2.0 sites, identifies problems with the measurement of Web 2.0 usage, and covers technical issues.

I really liked how it listed some of the identifying characteristics. Web 2.0 sites usually:

encourage user-generated content
exploit AJAX
have a strong social component
support some sort of public API
support the ability to form connections between people
support the posting of content in many forms
treat users as first class entities in the system

The article included a nice matrix of popular websites across the top and services down the side. At the intersection of the rows and columns check marks were placed denoting whether or not the website supported the services. Of all the websites Facebook, YouTube, Flickr, and MySpace ranked as being the most Web 2.0-esque. Not surprising.

The compare & contrast between Web 1.0 and Web 2.0 sites was particularly interesting, and can be used as a sort of standard/benchmark for comparing existing (library) websites to the increasingly expected Web 2.0 format. For example, Web 1.0 sites are characterized as being:

stateless
shaped like a “bow-tie” where there is a front-page linked to many sub-pages and supplemented with many cross links between sub-pages
covering a single topic

Whereas Web 2.0 websites generally:

include a broader mixture of content types
produce groups or feeds of content
rely on user-provided content
represent a shared space
require some sort of log-in function
see “portalization” as a trend

For readers who feel they do not understand the meaning of Web 2.0, the items outlined above and elaborated upon in the article will make the definition of Web 2.0 clearer. Good reading.

Library “catalogs”

The article also included an interesting graphic, Figure 1, illustrating the paths from content creator to consumer in Web 2.0. The image is linked from the article, below:

The far left denotes people creating content. The far right denotes people using content. In the middle are services. When I look at the image I see everything from the center to the far right of the following illustration (of my own design):

This illustration represents a model for a “next-generation” library catalog. On the far left is content aggregation. In the center is content normalization and indexing. On the right are services against the content. The right half of the illustration above is analogous to the entire illustration from Cormode and Krishnamurthy.
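The three parts of the model (aggregation, normalization and indexing, and services) can be sketched as a toy pipeline. Everything below is an assumption made for illustration: the two sample sources, the field names, and the simple word index all stand in for real harvesting, real crosswalks, and a real indexer.

```python
from collections import defaultdict

def aggregate(*sources):
    """Left: gather records from many places into one stream."""
    for source in sources:
        yield from source

def normalize(record):
    """Center: map differing field names onto one common shape."""
    return {
        "title": record.get("title", "").strip(),
        "creator": record.get("creator", record.get("author", "")).strip(),
    }

def index(records):
    """Center: build a simple word-to-document index."""
    docs, inverted = [], defaultdict(set)
    for record in records:
        docs.append(record)
        for word in f"{record['title']} {record['creator']}".lower().split():
            inverted[word.strip(".,;:")].add(len(docs) - 1)
    return docs, inverted

def search(docs, inverted, query):
    """Right: one of many possible services applied against the content."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(inverted.get(word, set()) for word in words))
    return [docs[i] for i in sorted(hits)]

# Hypothetical sources using different field names for the same idea.
catalog = [{"title": "Dracula ", "author": "Stoker, Bram"}]
repository = [{"title": "Dracula's Guest", "creator": "Stoker, Bram"}]

docs, inverted = index(normalize(r) for r in aggregate(catalog, repository))
print(search(docs, inverted, "stoker"))
```

Search is only the first service. Once the content is aggregated and normalized, other services (compare, annotate, review, and so on) could be written against the same data.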

Like the movement from Web 1.0 to Web 2.0, library websites (online “catalogs”) need to be more about users, their content, and services applied against it. “Next-generation” library catalogs will fall short if they are only enhanced implementations of search and browse interfaces. With the advent of digitization, everybody has content. What is needed are tools — services — to make it more useful.

Alex Lite: A Tiny, standards-compliant, and portable catalogue of electronic texts

Eric Lease Morgan — Sat, 12 Jul 2008 16:13:01 +0000

One of the beauties of XML is its ability to be transformed into other plain text files, and that is what I have done with a simple software distribution called Alex Lite.

My TEI publishing system(s)

A number of years ago I created a Perl-based TEI publishing system called “My personal TEI publishing system”. Create a database designed to maintain authority lists (titles and subjects), sets of XSLT files, and TEI/XML snippets. Run reports against the database to create complete TEI files, XHTML files, RSS files, and files designed to be disseminated via OAI-PMH. Once the XHTML files are created, use an indexer to index them and provide a Web-based interface to the index. Using this system I have made accessible more than 150 of my essays, travelogues, and workshop handouts retrospectively converted as far back as 1989. Using this system, many (if not most) of my writings have been available via RSS and OAI-PMH since October 2004.

A couple of years later I morphed the TEI publishing system to enable me to mark-up content from an older version of my Alex Catalogue of Electronic Texts. Once marked up I planned to transform the TEI into a myriad of ebook formats: plain text, plain HTML, “smart” HTML, PalmPilot DOC and eReader, Rocket eBook, Newton Paperback, PDF, and TEI/XML. The mark-up process was laborious and I have only marked up about 100 texts, and you can see the fruits of these labors, but the combination of database and XML technology has enabled me to create Alex Lite.

Alex Lite

Alex Lite is the result of a report written against my second TEI publishing system. Loop through each item in the database and update an index of titles. Create a TEI file against each item. Using XSLT, convert each TEI file into a plain HTML file, a “pretty” XHTML file, and a FO (Formatting Objects) file. Use a FO processor (like FOP) to convert the FO into PDF. Loop through each creator in the database to create an author index. Glue the whole thing together with an index.html file. Save all the files to a single directory and tar up the directory.
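The overall shape of such a report can be sketched in Python. This is not the original Perl, and it leaves out the database, the TEI files, the XSLT, and FOP entirely. A list of dictionaries stands in for the database, plain string templates stand in for the stylesheets, and all of the names are hypothetical.

```python
import html
import tarfile
import tempfile
from pathlib import Path

# A stand-in for the database of items.
ITEMS = [
    {"id": "poe-cask", "creator": "Poe, Edgar Allan", "title": "The Cask Of Amontillado"},
    {"id": "stoker-dracula", "creator": "Stoker, Bram", "title": "Dracula"},
]

def build_distribution(items, workdir):
    """Write one page per item, glue them together with an index, and tar it all up."""
    site = Path(workdir) / "alex-lite"
    site.mkdir()
    links = []
    for item in sorted(items, key=lambda i: (i["creator"], i["title"])):
        page = f"{item['id']}.html"
        title, creator = html.escape(item["title"]), html.escape(item["creator"])
        # A string template standing in for the XSLT transformation.
        (site / page).write_text(
            f"<html><head><title>{title}</title></head>"
            f"<body><h1>{title}</h1><p>{creator}</p></body></html>",
            encoding="utf-8",
        )
        links.append(f'<li>{creator}: <a href="{page}">{title}</a></li>')
    (site / "index.html").write_text(
        "<html><body><ul>" + "".join(links) + "</ul></body></html>",
        encoding="utf-8",
    )
    archive = Path(workdir) / "alex-lite.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(site), arcname="alex-lite")
    return archive

with tempfile.TemporaryDirectory() as workdir:
    archive = build_distribution(ITEMS, workdir)
    with tarfile.open(str(archive)) as tar:
        names = tar.getnames()
print(names)
```

Because every link in the index is relative, the unpacked directory works the same from a hard disk, a CD-ROM, or a Web server.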

The result is a single file that can be downloaded and unpacked to provide immediate access to sets of electronic books in a standards-compliant, operating system independent manner. Furthermore, no network connection is necessary except for the initial acquisition of the distribution. The unpacked directory can then be networked or saved to a CD-ROM. Think of the whole thing as if it were a library.

Give it a whirl; download a version of Alex Lite. Here is a list of all the items in the tiny collection:

Alger Jr., Horatio (1834-1899)
- The Cash Boy
- Cast Upon The Breakers
Bacon, Francis (1561-1626)
- The Essays
- The New Atlantis
Burroughs, Edgar Rice (1875-1950)
- At The Earth’s Core
- The Beasts Of Tarzan
- The Gods Of Mars
- The Jungle Tales Of Tarzan
- The Monster Men
- A Princess Of Mars
- The Return Of Tarzan
- The Son Of Tarzan
- Tarzan And The Jewels Of Opar
- Tarzan Of The Apes
- The Warlord Of Mars
Conrad, Joseph (1857-1924)
- The Heart Of Darkness
- Lord Jim
- The Secret Sharer
Doyle, Arthur Conan (1859-1930)
- The Adventures Of Sherlock Holmes
- The Case Book Of Sherlock Holmes
- His Last Bow
- The Hound Of The Baskervilles
- The Memoirs Of Sherlock Holmes
Machiavelli, Niccolo (1469-1527)
- The Prince
Plato (428-347 B.C.)
- Charmides, Or Temperance
- Cratylus
- Critias
- Crito
- Euthydemus
- Euthyphro
- Gorgias
Poe, Edgar Allan (1809-1849)
- The Angel Of The Odd–An Extravaganza
- The Balloon-Hoax
- Berenice
- The Black Cat
- The Cask Of Amontillado
Stoker, Bram (1847-1912)
- Dracula
- Dracula’s Guest
Twain, Mark (1835-1910)
- The Adventures Of Huckleberry Finn
- A Connecticut Yankee In King Arthur’s Court
- Extracts From Adam’s Diary
- A Ghost Story
- The Great Revolution In Pitcairn
- My Watch: An Instructive Little Tale
- A New Crime
- Niagara
- Political Economy

XSLT

As alluded to above, the beauty of XML is its ability to be transformed into other plain text formats. XSLT allows me to convert the TEI files into other files for different mediums. The distribution includes only simple HTML, “pretty” XHTML, and PDF versions of the texts, but for the XSLT aficionados in the crowd who may want to see the XSLT files, I have included them here:

tei2htm.xsl – used to create plain HTML files complete with metadata
tei2html.xsl – used to create XHTML files complete with metadata as well as simple CSS-enabled navigation
tei2fo.xsl – used to create FO files which were fed to FOP in order to create things designed for printing on paper

Here’s a sample TEI file, Edgar Allan Poe’s The Cask Of Amontillado.

Future work

I believe there is a lot of promise in the marking-up of plain text into XML, specifically works of fiction and non-fiction into TEI. Making available such marked-up texts paves the way for doing textual analysis against them and for enhancing them with personal commentary. It is too bad that the mark-up process, even simple mark-up, is so labor intensive. Maybe I’ll do more of this sort of thing in my copious spare time.

Indexing MARC records with MARC4J and Lucene

Eric Lease Morgan — Wed, 09 Jul 2008 21:35:36 +0000

In anticipation of the eXtensible Catalog (XC) project, I wrote my first Java programs a few months ago to index MARC records, and you can download them from here.

The first uses MARC4J and Lucene to parse and index MARC records. The second uses Lucene to search the index created by the first program. They are very simple programs: functional, not feature-rich. For the budding Java programmer in libraries, these programs could be used as part of a rudimentary self-paced tutorial. From the distribution’s README:

This is the README file for two Java programs called Index and Search.

Index and Search are my first (real) Java programs. Using MARC4J, Index
reads a set of MARC records, parses them (for authors, titles, and call
numbers), and feeds the data to Lucene for indexing. To get the program
going you will need to:

Get the MARC4J .jar files, and make sure they are in your CLASSPATH.

Get the Lucene .jar files, and make sure they are in your CLASSPATH.

Edit Index.java so the value of InputStream points to a set of MARC records.

Create a directory named index in the same directory as the source code.

Compile the source (javac Index.java).

Run the program (java Index).

The program should echo the parsed data to the screen and create an
index in the index directory. It takes me about fifteen minutes to index
700,000 records.

The second program, Search, is designed to query the index created by
the first program. To get it to run you will need to:

Get the Lucene .jar files, and make sure they are in your CLASSPATH.

Make sure the index created by Index is located in the same directory as the source code.

Compile the source (javac Search.java).

Run the program (java Search QUERY, where QUERY is a word or phrase).

The result should be a list of items from the index. Simple.

Enjoy?!
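
The parse-index-search flow described in the README can be illustrated with a small, self-contained sketch. It is written in Python rather than Java, and it does not use MARC4J or Lucene: the sample records, the field names, and the functions (tokenize, build_index, search) are all assumptions made up for illustration. A toy inverted index stands in for what Lucene does far more thoroughly.

```python
from collections import defaultdict

# Hypothetical records standing in for data parsed out of MARC records.
RECORDS = [
    {"id": 1, "author": "Twain, Mark", "title": "A Ghost Story", "callnumber": "PS1322 .G4"},
    {"id": 2, "author": "Poe, Edgar Allan", "title": "The Black Cat", "callnumber": "PS2612 .B5"},
    {"id": 3, "author": "Stoker, Bram", "title": "Dracula's Guest", "callnumber": "PR6037 .T617"},
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def build_index(records):
    """Map each word found in the author, title, and call number to record ids."""
    index = defaultdict(set)
    for record in records:
        for field in ("author", "title", "callnumber"):
            for word in tokenize(record[field]):
                index[word].add(record["id"])
    return index


def search(index, query):
    """Return the sorted ids of records containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


INDEX = build_index(RECORDS)

if __name__ == "__main__":
    print(search(INDEX, "black cat"))
```

A real indexer adds ranking, stemming, and on-disk storage, but the division of labor is the same as in the two Java programs: one step builds the index, and another queries it.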

Encoded Archival Description (EAD) files everywhere

Eric Lease Morgan — Wed, 02 Jul 2008 02:24:20 +0000

I’m beginning to see Encoded Archival Description (EAD) files everywhere, but maybe it is because I am involved with a project called the Catholic Research Resources Alliance (CRRA).

As you may or may not know, EAD files are the “MODS files” of the archival community. These XML files provide the means to administratively describe archival collections as well as describe the things in the collections at the container, folder, or item level.

Columbia University and MARC records

During the past few months, I helped edit and shepherd an article for Code4Lib Journal by Terry Catapano, Joanna DiPasquale, and Stuart Marquis called “Building an archival collections portal”. The article describes the environment and outlines the process folks at Columbia University use to make sets of their archival collections available on the Web. Their particular process begins with sets of MARC records dumped from their integrated library system. Catapano, DiPasquale, and Marquis then crosswalk the MARC to EAD, feed the EAD to Solr/Lucene, and provide access to the resulting index. Their implementation uses a mixture of Perl, XSLT, PHP, and JavaScript. What was most interesting was the way they began the process with MARC records.

Florida State University and tests/tools

Today I read an article by Plato L. Smith II from Information Technology and Libraries (volume 27, number 2, pages 26-30) called “Preparing locally encoded electronic finding aid inventories for union environments: A Publishing model for Encoded Archival Description”. Smith describes how the Florida State University Libraries create their EAD files with NoteTab Light templates and then convert them into HTML and PDF documents using XSLT. They provide access to the results through a content management system, DigiTool. What I found most intriguing about this article were the links to the tests/tools used to enrich their EAD files, namely the RLG EAD Report Card and the Online Archive of California Best Practices Guidelines, Appendix B. While I haven’t set it up yet, the former should check EAD files for conformity (beyond validity), and the latter will help create DACS-compliant EAD Formal Public Identifiers.

Catholic Research Resources Alliance portal

Both of these articles will help me implement the Catholic Research Resources Alliance (CRRA) portal. From a recent workshop I facilitated:

The ultimate goal of the CRRA is to facilitate research in Catholic scholarship. The focus of this goal is directed towards scholars but no one is excluded from using the Alliance’s resources. To this end, participants in the Alliance are expected to make accessible rare, unique, or infrequently held materials. Alliance members include but are not limited to academic libraries, seminaries, special collections, and archives. Similarly, content might include but is not limited to books, manuscripts, letters, directories, newspapers, pictures, music, videos, etc. To date, some of the Alliance members are Boston College, Catholic University, Georgetown University, Marquette University, Seton Hall University, University of Notre Dame, and University of San Diego.

Like the Columbia University implementation, the portal is expected to allow Alliance members to submit MARC records describing individual items. The Catapano, DiPasquale, and Marquis article will help me map my MARC fields to my local index. Like the Florida State University implementation, the portal is expected to allow Alliance members to submit EAD files. The Smith article will help me create unique identifiers. For Alliance members who have neither MARC nor EAD files, the portal is expected to allow them to submit their content via a fill-in-the-blank interface which I am adopting from the good folks at the Archives Hub.

The CRRA portal application is currently based on MyLibrary and an indexer/search engine called KinoSearch. After submitting them to the portal, EAD files and MARC records are parsed and saved to a MySQL database using the Perl-based MyLibrary API. Various reports are then written against the database, again, using the MyLibrary API. These reports are used to create on-the-fly browsable lists of formats, names, subjects, and CRRA “themes”. They are used to create sets of XML files for OAI-PMH harvesting. They are used to feed data to KinoSearch to create an index. (For example, see mylibrary2files.pl and then ead2kinosearch.pl.) Finally, the whole thing is brought together with a single Perl script for searching (via SRU) and browsing.

It is nice to see a growing interest in EAD. I think the archival community has a leg up on its library brethren regarding metadata. They are using XML more and more. Good for them!

Finally, let’s hear it for the ‘Net, free-flowing communication, and open source software. Without these things I would not have been able to accomplish nearly as much as I have regarding the portal. “Thanks guys and gals!”

eXtensible Catalog (XC): A very transparent approach

Eric Lease Morgan — Fri, 27 Jun 2008 00:19:25 +0000

An article by Jennifer Bowen entitled “Metadata to support next-generation library resource discovery: Lessons from the eXtensible Catalog, Phase 1” appeared recently in Information Technology & Libraries, the June 2008 issue. [1]

The article outlines next-steps for the XC Project and enumerates a number of goals for their “‘next-generation’ library catalog” application/system:

provide access to all library resources, digital and non-digital
bring metadata about library resources into a more open Web environment
provide an interface with new Web functionality such as Web 2.0 features and faceted browsing
conduct user research to inform system development
publish the XC code as open-source software

Because I am somewhat involved in the XC Project from past meetings and as a Development Partner, the article did not contain a lot of new news for me, but it did elaborate on a number of points.

Its underlying infrastructure is a good example. Like many “next-generation” library catalog applications/systems, it proposes to aggregate content from a wide variety of sources, normalize the data into a central store (the “hub”), index the content, and provide access to the central store or index through a number of services. This is how Primo, VuFind, and AquaBrowser operate. Many others work in a similar manner; all of these systems have more things in common than differences. Unlike other applications/systems, XC seems to embrace a more transparent and community-driven process.

One of the things that intrigued me most was goal #2. “XC will reveal library metadata not only through its own separate interface…, but will also allow library metadata to be revealed through other Web applications.” This is definitely the way to go. A big part of librarianship is making data, information, and knowledge widely accessible. Our current systems do this very poorly. XC is moving in the right direction in this regard. Kudos.

Another thing that caught my eye was a requirement for goal #3, “The XC system will capture metadata generated by users from any one of the system’s user environments… and harvest it back into the system’s metadata services hub for processing.” This too sounds like a good idea. People are the real sources of information. Let’s figure out ways to harness the knowledge, expertise, and experiences of our users.

What is really nice about XC is the approach they are taking. It is not all about their software and their system. Instead, it is about building on the good work of others and providing direct access to their improvements. “Projects such as the eXtensible Catalog can serve as a vehicle for moving forward by providing an opportunity for libraries to experiment and to then take informed action to move the library community toward a next generation of resource discovery systems.”

I wish more librarians would be thinking about their software development processes in the manner of XC.

[1] The article is immediately available online at http://hdl.handle.net/1802/5757.

Top Tech Trends for ALA (Summer ’08)

Eric Lease Morgan — Thu, 19 Jun 2008 03:59:07 +0000

Here is a non-exhaustive list of Top Technology Trends for the American Library Association Annual Meeting (Summer, 2008). These Trends represent general directions regarding computing in libraries — short-term future directions where, from my perspective, things are or could be going. They are listed in no priority order.

“Bling” in your website – I hate to admit it, but it seems increasingly necessary to make sure your institution’s website is aesthetically appealing. This might seem obvious to you, but considering the fact we all think “content is king” we might have to reconsider. Whether we like it or not, people do judge a book by its cover, and people do judge others on their appearance. Websites aren’t very much different. While librarians are great at organizing information bibliographically, we stink when it comes to organizing things visually. Think graphic design. Break down and hire a graphic designer, and temper their output with usability tests. We all have our various strengths and weaknesses. Graphic designers have something to offer that, in general, librarians lack.
Data sets – Increasingly it is not enough for the scholar or researcher to evaluate old texts or do experiments and then write an article accordingly. Instead it is becoming increasingly important to distribute the data and information the scholar or researcher used to come to their conclusions. This data and information needs to be just as accessible as the resulting article. How will this access be sustained? How will it be described and made available? To what degree will it be important to preserve this data and/or migrate it forward in time? These sorts of questions require some thought. Libraries have experience in these regards. Get your foot in the door, and help the authors address these issues.
Institutional repositories – I don’t hear as much noise about institutional repositories as I used to hear. I think their lack of popularity is directly related to the problems they are designed to solve, namely, long-term access. Don’t get me wrong, long-term access is definitely a good thing, but that is a library value. In order to be compelling, institutional repositories need to solve the problems of depositors, not the librarians. What do authors get by putting their content in an institutional repository that they don’t get elsewhere? If they supported version control, collaboration, commenting, tagging, better syndication and possibilities for content reuse — in other words, services against the content — then institutional repositories might prove to be more popular.
Mobile devices – The iPhone represents a trend in mobile computing. It is both cool and “kewl” for three reasons: 1) its physical interface, complete with pinch and drag touch screen options, makes it easy to use; you don’t need to learn how to write in its language, 2) its always-on and endlessly-accessible connectivity to the Internet makes it trivial to keep in touch, read mail, and “surf the Web”, 3) its software interface is implemented in the form of full-blown applications, not dumbed-down text interfaces with lots of scrolling lists. Apple Computer got it right. Other companies will follow suit. Sooner or later we will all be walking around like people from the Starship Enterprise. “Beam me up, Scotty!” Consider integrating into your services the ability to text the content of library research to a telephone.
Net Neutrality – The Internet, by design, is intended to be neutral, but increasingly Internet Service Providers (ISPs) are twisting the term “neutrality” to mean, “If you pay a premium, then we won’t throttle your network connection.” BitTorrent is a good example. This technique exploits the Internet, making file transfers more efficient, but ISPs want to inhibit it and/or charge more for its use. Yet again, the values and morals of a larger, more established community, in this case capitalism, are influencing the Internet. Similar value changes manifested themselves when email became commonplace. Other values, such as not wasting Internet bandwidth by transferring unnecessarily large files over the ‘Net, have changed as both the technology and the numbers of people using the Internet have changed. Take a stand for “Net Neutrality”.
“Next generation” library catalogs – The profession has finally figured it out. Our integrated library systems don’t solve the problems of our users. Consequently, the idea of the “next generation” library catalog is all the rage, but don’t get too caught up in features such as Did You Mean?, faceted browse, cover art, or the ability to include a wide variety of content in a single interface. Such things are really characteristics and functions of the underlying index. They are all things designed to make it easier to solve the problem of find, but this is not the problem to be solved. Google makes it easy to find. Really easy. We are unable to compete in that arena. Everybody can find, and we are still “drinking” from the proverbial “fire hose”. Instead, think about ways to enable the patron to use the content they find. Put the content into context. Like the institutional repositories, above, and the open access content, below, figure out ways to make the content useful. Empower the patron. Enable them to apply actions against the content, not just the index. Such things are exemplified by action verbs. Tag. Share. Review. Add. Read. Save. Delete. Annotate. Index. Syndicate. Cite. Compare forward and backward in time. Compare and contrast with other documents. Transform into other formats. Distill. Purchase. Sell. Recommend. Rate. Create flip book. Create tag cloud. Find email address of author. Discuss with colleagues. Etc. The list of services implementable by “next generation” library catalogs is as long as the list of things people do with the content they find in libraries. This is one of the greatest opportunities facing our profession.
Open Access Publishing – Like its sister, institutional repositories, I don’t hear as much about open access publishing as I used to hear. We all know it is a “good thing” but like so many things that are “free” its value is only calculated by the amount of money paid for it. “The journals from this publisher are very expensive. We had better promote them and make them readily visible on our website in order for us to get our money’s worth.” In a library setting, the value of material is not based on dollars but rather on things such as but not limited to usefulness, applicability, keen insight, scholarship, and timeliness. Open access publishing content manifests these characteristics as much as traditionally published materials. Open access content can be made even more valuable if its open nature were exploited. Like the content found in institutional repositories, and like the functions of “next generation” library catalogs outlined above, the possibilities for providing services against open access content are almost limitless. More than any other content, open access content combined with content from things like the Open Content Alliance and Project Gutenberg can be freely collected, indexed, searched, and then put into the context of the patron. Create bibliography. Trace citation. Find similar words and phrases between articles and books. Take an active role in making open access publishing more of a reality. Don’t wait for the other guy. You are a part of the solution.
Social networking – Social networking is beyond a trend. It is all but a fact of the Internet. Facebook, MySpace, and LinkedIn as well as Wikipedia, YouTube, Flickr, and Delicious are probably the archetypical social networking sites. They have very little content of their own. Instead, they provide a platform for others to provide content, and then services against that content. (“Does anybody see a trend in these trends, yet?”) What these social networking sites are exploiting is a new form of the numbers game. Given a wide enough audience it is possible to find and create sets of others interested in just about any topic under the sun. These people will be passionate about their particular topic. They will be sincere, adamant, and ardent about making sure the content is up-to-date, accurate, and thoroughly described and accessible. Put your content into these sorts of platforms in the same way the Library of Congress and the Smithsonian Institution have put some of their content into Flickr. A rising tide floats all boats. Put your boat into the water. Participate in this numbers game. It is not really about people using your library, but rather about people using the content you have made available.
Web Services-based APIs – xISBN and thingISBN. The Open Library API. The DLF ILS-DI Technical Recommendation. SRU and OpenSearch. OAI-PMH and now OAI-ORE. RSS and Atom. All of these things are computing techniques called Web Services Application Programmer Interfaces (APIs). They are computer-to-computer interfaces akin to things like Z39.50 of Library Land. They enable computers to unambiguously share data between themselves. A number of years ago implementing Web Services meant learning things like SOAP, WSDL, and UDDI. These things were (are) robust, well-documented, and full-featured. They are also non-trivial to learn. (OCLC’s Terminology Service embedded within Internet Explorer uses these techniques.) After that, REST became more popular. It is simpler, and it exploits the features of HTTP. The idea was (is) to send a URL to a remote computer. Get a response back as XML. Transform the response and put it to use, usually to display things on a Web page. This is the way most of the services work. (“There’s that word again!”) The latest paradigm and increasingly popular technique uses a data structure called JSON as opposed to XML as the form of the server’s response because JSON is easier to process with JavaScript. This is very much akin to AJAX. Despite the subtle differences between each of these Web Services computing techniques, there is a fundamental commonality. Make a request. Wait. Get a response. Do something with the content; make it useful. Moreover, the returned content is devoid of display characteristics. It is just data. It is your responsibility to turn it into information. Learn to: 1) make your content accessible via Web Services, and 2) aggregate content through Web Services in order to enhance your patron’s experience.
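
To make the "get a response, do something with the content" idea from the last trend concrete, here is a minimal Python sketch. It makes no network request: the two canned payloads, their element and key names, and the function names are all assumptions invented for illustration, not the output of any real service. The sketch shows the same data arriving as XML and as JSON, being reduced to plain records, and only then being given a display form.

```python
import html
import json
import xml.etree.ElementTree as ET

# Canned responses standing in for what a remote service might return.
XML_RESPONSE = """<results>
  <record><title>Dracula</title><author>Stoker, Bram</author></record>
  <record><title>Lord Jim</title><author>Conrad, Joseph</author></record>
</results>"""

JSON_RESPONSE = """{"results": [
  {"title": "Dracula", "author": "Stoker, Bram"},
  {"title": "Lord Jim", "author": "Conrad, Joseph"}
]}"""


def records_from_xml(payload):
    """Parse the XML payload into a list of plain dictionaries."""
    root = ET.fromstring(payload)
    return [
        {"title": r.findtext("title"), "author": r.findtext("author")}
        for r in root.findall("record")
    ]


def records_from_json(payload):
    """Parse the JSON payload into a list of plain dictionaries."""
    return [
        {"title": r["title"], "author": r["author"]}
        for r in json.loads(payload)["results"]
    ]


def to_html(records):
    """Turn display-free data into an HTML list for a Web page."""
    items = "".join(
        "<li>{} by {}</li>".format(html.escape(r["title"]), html.escape(r["author"]))
        for r in records
    )
    return "<ul>" + items + "</ul>"


if __name__ == "__main__":
    print(to_html(records_from_json(JSON_RESPONSE)))
```

Both parsers yield the same records, which is the point: the response is just data, and the display step is a separate decision made by whoever consumes it.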

Wow! Where did all of that come from?

(This posting is also available on the LITA Blog. “Lots of copies keep stuff safe.”)

Google Onebox module to search LDAP

Eric Lease Morgan — Mon, 16 Jun 2008 22:13:09 +0000

This posting describes a Google Search Appliance Onebox module for searching an LDAP directory.

At my work I help administer a Google Search Appliance. It is used to index the university’s website. The Appliance includes a function, called Onebox, that allows you to search multiple indexes and combine the results into a single Web page. It is sort of like library metasearch.

In an effort to make it easier for people to find… people, we created a Onebox module, and you can download the distribution if you so desire. It is written in Perl.

In regards to libraries and librarianship, the Onebox technique is something the techno-weenies in our profession ought to consider. Capture the user’s query. Do intelligent processing on it by enhancing it, sending it to the appropriate index, making suggestions, etc., and finally returning the results. In other words, put some smarts into the search interface. You don’t need a Google Search Appliance to do this, just control over your own hardware and software.

From the distribution’s README file:

This distribution contains a number of files implementing a Google Onebox “widget”. It looks people’s names up in an LDAP directory.

The distribution contains the following files:

people.cgi – the raison d’être

people.pl – command-line version of people.cgi

people.png – an image of a person

people.xsl – XSL to convert people.cgi output to HTML

README – this file

LICENSE – the GNU Public License

The “widget” (people.cgi) is almost trivial. Read the value of the query parameter sent as a part of the GET request. Open up a connection to the LDAP server. Query the server. Loop through the results keeping only a number of them as defined by the constant UPPER. Mark-up the results as Google XML. Return the XML to the HTTP client. It is then the client’s responsibility to transform the XML into an HTML (table) snippet for display. (That is what people.xsl is for.)

This widget ought to work in many environments. All you really need to do is edit the values of the constants at the beginning of people.cgi.

This code is distributed under the GNU Public License.

Enjoy.
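
The flow the README describes (read the query, look names up, keep at most UPPER results, mark them up as XML) can be sketched as follows. This is a rough illustration in Python, not the Perl in the distribution. The in-memory DIRECTORY stands in for an LDAP server, and the XML element names are hypothetical; they are not the actual Google Onebox schema. Only the constant UPPER comes from the README.

```python
import xml.etree.ElementTree as ET

UPPER = 3  # maximum number of results to keep, as in the README

# Hypothetical in-memory directory standing in for an LDAP server.
DIRECTORY = [
    {"cn": "Ann Smith", "mail": "asmith@example.edu", "telephoneNumber": "555-0101"},
    {"cn": "Bob Smith", "mail": "bsmith@example.edu", "telephoneNumber": "555-0102"},
    {"cn": "Cy Smithers", "mail": "csmithers@example.edu", "telephoneNumber": "555-0103"},
    {"cn": "Di Smithson", "mail": "dsmithson@example.edu", "telephoneNumber": "555-0104"},
    {"cn": "Ed Jones", "mail": "ejones@example.edu", "telephoneNumber": "555-0105"},
]


def lookup(query, directory=DIRECTORY, upper=UPPER):
    """Return at most `upper` entries whose name contains the query."""
    needle = query.strip().lower()
    if not needle:
        return []
    matches = [entry for entry in directory if needle in entry["cn"].lower()]
    return matches[:upper]


def to_xml(query, entries):
    """Mark up the entries as XML (illustrative element names only)."""
    root = ET.Element("results", {"query": query})
    for entry in entries:
        person = ET.SubElement(root, "person")
        for key in ("cn", "mail", "telephoneNumber"):
            ET.SubElement(person, key).text = entry[key]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml("smith", lookup("smith")))
```

In a real deployment the lookup would be a search against the directory server and the XML would follow whatever format the calling appliance expects; a stylesheet then turns that XML into an HTML snippet for display.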

DLF ILS Discovery Internet Task Group Technical Recommendation

Eric Lease Morgan — Thu, 12 Jun 2008 04:50:20 +0000

I read with great interest the DLF ILS Discovery Internet Task Group (ILS-DI) Technical Recommendation [1], and I definitely think it is a step in the right direction for making the content of library systems more accessible.

In regards to the integrated systems of libraries, the primary purpose of the Recommendations is to:

improve discovery and use of library resources
articulate a clear set of expectations for developers
make recommendations applicable to existing and future systems
ensure the recommendations are feasible
support interoperation and cooperation
be responsive to the user and developer community

To this end the Recommendations list a set of abstract functions integrated library systems “should” implement, and they enumerate a number of concrete bindings that can be used to implement these functions. Each of the twenty-five (25) functions can be grouped into one of four overall categories:

data aggregation – harvest content en masse from the underlying system
search – supply a query and get back a list of matching records
patron services – support things like renew, hold, recall, etc.
OPAC interaction – provide ways to link to outside services

The Recommendations also group the functions into levels of interoperability:

Level 1: basic interface – simple harvest, search, and display record
Level 2: supplemental – Level 1 plus more robust harvest and search
Level 3: alternative – Level 2 plus patron services
Level 4: robust – Level 3 plus reserves functions and support of an explain function

After describing the things outlined above in greater detail, the Recommendations get down to business, listing each function, its parameters, and why it is recommended, and suggesting one or more “bindings”, possible ways the function can be implemented. Compared to most recommendations in my experience, this one is very easy to read, and it is definitely approachable by anybody who calls themselves a librarian. A few examples illustrate the point.

The Recommendations suggest a number of harvest functions. These functions allow a harvesting system to specify a number of date ranges and get back a list of records that have been created or edited within those ranges. These records may be bibliographic, holdings, or authority in nature. These records may be in MARC format, but it is strongly suggested they be in some flavor of XML. The search functions allow a remote application to query the system and get back a list of matching records. Like the harvest functions, records may be returned in MARC but XML is preferred. Patron functions support finding patrons, listing patron attributes, allowing patrons to place holds, recalls, or renewals on items, etc.

There was one thing I especially liked about the Recommendations. Specifically, whenever possible, the bindings were based on existing protocols and “standards”. For example, they advocated the use of OAI-PMH, SRU, OpenSearch, NCIP, ISO Holdings, SIP2, MODS, MADS, and MARCXML.
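
Since OAI-PMH is one of the bindings named here, a date-range harvest request can be sketched as a URL. The following Python is only an illustration: the base URL is hypothetical, and the function name is an assumption. The verb, metadataPrefix, from, and until arguments are standard OAI-PMH request parameters, and oai_dc is used as the default prefix because every OAI-PMH repository is required to support it; whether a given system also offers MARCXML depends on that system.

```python
from datetime import date
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://catalog.example.edu/oai"  # hypothetical repository address


def list_records_url(base_url, start, end, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for records changed between two dates."""
    if start > end:
        raise ValueError("start date must not be later than end date")
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": start.isoformat(),   # YYYY-MM-DD
        "until": end.isoformat(),
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_records_url(BASE_URL, date(2008, 6, 1), date(2008, 6, 12)))
```

A harvester would fetch that URL, parse the XML that comes back, and keep requesting further batches for as long as the response includes a resumption token.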

From my reading, there were only two slightly off kilter things regarding the Recommendations. First, they advocated the possible use of an additional namespace to fill in some blanks existing XML vocabularies are lacking. I suppose this was necessary in order to glue the whole thing together. Second, it took me a while to get my head around the functions supporting links to external services, the OPAC interaction functions. These functions are expected to return Web pages that are static, writable, or transformative in nature. I’ll have to think about these some more.

It is hoped vendors of integrated library systems support these functions natively or they are supported through some sort of add-on system. The eXtensible Catalog (XC) is a good example here. The use of Ex Libris’s X-Server interface is another. At the very least a number of vendors have said they would make efforts to implement Level 1 functionality. This agreement has been called the “Berkeley Accord” and includes: AquaBrowser, BiblioCommons, California Digital Library, Ex Libris, LibLime, OCLC, Polaris Library Systems, SirsiDynix, Talis, and VTLS.

Within my own sphere of hack-dom, I think I could enhance my Alex Catalogue of Electronic Texts to support these Recommendations. Create a (MyLibrary) database. Populate it with the metadata and full-text data of electronic books, open access journal articles, Open Content Alliance materials, records from Wikipedia, and photographic images of my own creation. Write reports in the form of browsable lists or feeds expected to be fed to an indexer. Add an OAI-PMH interface. Make sure the indexer is accessible via SRU. Implement a “my” page for users and enhance it to support the Recommendations. Ironically, much of this work has already been done.

In summary, and as I mentioned previously, these Recommendations are a step in the right direction. The implementation of a “next generation” library catalog is not about re-inventing a better wheel and trying to corner the market with superior or enhanced functionality. Instead it is about providing a platform for doing the work libraries do. For the most part, libraries and their functions have more things in common than they have differences. These Recommendations articulate a lot of these commonalities. Implement them, and kudos to Team DLF ILS-DI.

[1] PDF version of Recommendation – http://tinyurl.com/3lqxx2

HyperNote Pro: a text annotating HyperCard stack

Eric Lease Morgan — Sun, 08 Jun 2008 02:53:48 +0000

In 1992 I wrote a HyperCard stack called HyperNote Pro.

HyperNote allowed you to annotate plain text files, and it really was a hypertext system. Import a plain text file. Click a word to see a note. Option-click a word to create a note. Shift-click a word to create an image note. Option-shift-click a word to link to another document. Use the HyperNote > New HyperNote menu option to duplicate the stack and create a new HyperNote document.

HyperCard is all but dead, and you need an older Macintosh computer to use the application. It was pretty cool. You can download it from my archives. Here is the text from the self-extracting archive:

HyperNote Pro: a text annotating stack by Eric Lease Morgan

HyperNote Pro is a HyperCard stack used to annotate text. It can also create true hypertext links between itself and other documents or applications.

Simply create a new HyperNote Pro stack, import a text file, and add pop–up notes, pictures, and/or hypertext links to the text. The resulting stack can be distributed to anybody with HyperCard 2.0 and they will be able to read or edit your notes and pictures. They will be able to link to other documents if the documents are available.

Here are some uses for HyperNote Pro. Context sensitive help can be created for applications. News or journal articles could be imported and your opinions added. Business reports could be enhanced with graphs. Resumes could go into greater detail without overwhelming the reader. Students could turn in papers and teachers could comment on the text.

Another neat thing about HyperNote Pro is that it is self–replicating. By selecting “New HN…” and choosing a text–file, HyperNote Pro creates a copy of itself except with the text of the chosen file.

HyperNote Pro is free. It requires HyperCard 2.0 to run.

Features:

any size text–file can be imported

format the text with any available font

add/edit pop–up notes and/or pictures to imported text

add true hypertext links to any document or application

includes a “super find” feature

self–replicating

System 7 compatible
        \ /
       - * -      
     \ // \       Eric Lease Morgan, Systems Librarian 
    - * -|\ /     North Carolina State University
     / \ - * -    Box 7111, Room 2111
      |  |/ \     Raleigh, NC 27695-7111
      \ /| |      (919) 515-6182
     - * - |
      / \| /      
       | |/       
    ===========   America Online: EricMorgan
     \=======/    Compu$erve: 71020,2026
      \=====/     Internet: eric_morgan@ncsu.edu
       =====      The Well: emorgan

P.S. Maybe I will be able to upload this stack to TileStack as seen on Slashdot.

Steve Cisler

Eric Lease Morgan — Fri, 06 Jun 2008 05:01:19 +0000

This is a tribute to Steve Cisler, community builder and librarian.

Late last week I learned from Paul Jones’s blog that Steve Cisler had died. He was a mentor to me, and I’d like to tell a few stories describing the ways he assisted me in my career.

I met Steve in 1989 or so after I applied for an Apple Library of Tomorrow (ALOT) grant. The application was simple. “Send us a letter describing what you would do with a computer if you had one.” Being a circuit-rider medical librarian at the Catawba-Wateree Area Health Education Center (AHEC) in rural Lancaster, South Carolina, I outlined how I would travel from hospital to hospital facilitating searches against MEDLINE, sending requests for specific articles via ’fax back to my home base, and having the articles ’faxed back to the hospital the same day. Through this process I proposed to reduce my service’s turn-around time from three days to a few hours.

Those were the best two pages of text I ever wrote in my whole professional career because Apple Computer (Steve Cisler) sent me all the hardware I requested: an Apple Macintosh portable computer and printer. He then sent me more hardware and more software. It kept coming. More hardware. More software.

At this same time I worked with my boss (Martha Groblewski) to get a grant from the National Library of Medicine. This grant piggy-backed on the ALOT grant, and I proceeded to write an expert system in HyperCard. It walked the user through a reference interview, constructed a MEDLINE search, dialed up PubMED, executed the search, downloaded the results, displayed them to the user, allowed the user to make selections, and finally turned around and requested the articles for delivery via DOCLINE. I called it AskEric, about four years before the ERIC Clearinghouse used the same name for their own expert system. In my humble opinion, AskEric was very impressive, and believe it or not, the expert part of the system still works (as long as you have the proper hardware).

It was also during this time that I wrote my first two library catalog applications. The first one, QuickCat, read the output of a catalog card printing program called UltraCard. Taking a clue from OCLC’s (Fred Kilgour’s) 4,2,2,1 indexing technique, it parsed the card data, creating author, title, subject, and keyword indexes based on a limited number of initial characters from each word. It supported simple field searching and Boolean logic. It even supported rudimentary circulation: search results for items that had been checked out were displayed in a different color than the balance of the display. QuickCat earned me the 1991 Meckler Computers In Libraries Software Award. My second catalog application, QuickCat Mac, read MARC records and exploited HyperCard’s free-text searching functionality. Thanks go to Walt Crawford, who taught me about MARC through his book, MARC For Library Use. Thanks go to Steve for encouraging the creativity.

Steve then came to visit. He wanted to see my operation and eat barbecue. During his visit, he brought along a video card, and I had my first digital image taken. The walk to the restaurant where we ate his barbecue was hot and humid, but he insisted on going. “When in South Carolina you eat barbecue”, he said. He was right.

It was time for the annual ALOT conference, and Steve flew me out to Apple Computer’s corporate headquarters. There I met other ALOT grantees including Jean Armor Polly (who coined the phrase “surfing the Internet”), Craig Summerhill who was doing some very interesting work indexing content using BRS, folks from OCLC who were scanning tables-of-contents and trying to do OCR against them, and people from the Smithsonian Institution who were experimenting with a new image file format called JPEG.

I outgrew the AHEC, and with the help of a letter of reference from Steve, I got a systems librarian job at the North Carolina State University Libraries. My boss, John Ulmschneider, put me to work on a document delivery project jointly funded by the National Agricultural Library and an ALOT grant. “One of the reasons I hired you”, John said, “was because of your experience with a previous ALOT grant.” Our application, code-named “The Scan Plan”, was a direct competitor to the fledgling application called Ariel. Our application culminated in an article called “Digitized Document Transmission Using HyperCard”, ironically available as a scanned image from the ERIC Clearinghouse (or this cached version). That year, during ALA, I remember walking through the exhibits. I met up with John and one of his peers, Bil Stahl (University of North Carolina – Charlotte). As we were talking, Charles Bailey (University of Houston) of PACS Review fame joined us. Steve then walked up. Wow! I felt like I was really a part of the in-crowd. They didn’t all know each other, but they knew me. Most of the people whose opinions I respected the most at that particular time were all gathered in one place.

By this time the “Web” was starting to get hot. Steve contacted me and asked, “Would you please write a book on the topic of Macintosh-based Web servers?” Less than one year, one portable computer, and one QuickTake camera later I had written Teaching a New Dog Old Tricks: A Macintosh-Based World Wide Web Starter Kit Featuring MacHTTP and Other Tools. This earned me two more trips. The first was to WebEdge, the first Macintosh WWW Developer’s Conference, where I won a hackfest award for my webcam application called “Save 25¢ or ‘Is Eric In’?” The second was back to Apple headquarters for the Ties That Bind conference where I learned about AppleSearch which (eventually) morphed into the search functionality of Mac OS X, sort of. I remember the Apple Computer software engineers approaching the Apple Computer Library staff and asking, “Librarians, you have content, right? May we have some to index?”

To me it was the Ties That Bind conference that epitomized the Steve Cisler I knew. He described there his passion for community. For sharing. For making content (and software) freely available. We discussed things like “copywrite” as opposed to copyright. It was during this conference he pushed me into talking with a couple of Apple Computer lawyers and convincing them to allow the Tricks book to be freely published. It was during this conference he described how we are all a part of a mosaic. Each of us is a dot. Individually we have our own significance, but put together we can create an even more significant picture. He used an acrylic painting he had recently found to literally illustrate the point, all puns intended. Since then I have used the mosaic as a part of my open source software in libraries handout. I took the things Steve said to heart. Because of Steve Cisler I have been practicing open access publishing and open source software distribution for longer than the phrases have been coined.

A couple more years passed and Apple Computer shut down their library. Steve lost his job, and I sort of lost track of Steve. I believe he did a lot of traveling, and the one time I did see him he was using a Windows computer. He didn’t like it, but he didn’t seem to like Apple either. I tried to thank him quite a number of times for the things he had done for me and my career. He shrugged off my praise and more or less said, “Pass it forward.” He then went “off the ‘Net” and did more traveling. (Maybe I got some of my traveling bug from Steve.) I believe I wrote him a letter or two. A few more years passed, and as I mentioned above, I learned he had died. Ironically, the next day I was off to Santa Clara (California) to give a workshop on XML. I believe Steve lived in Santa Clara. I thought of him as I walked around downtown.

Tears are in my eyes and my heart is in my stomach when I say, “Thank you, Steve. You gave me more than I ever gave in return.” Every once in a while younger people than I come to visit and ask questions. I am more than happy to share what I know. “Steve, I am doing my best to pass it forward.”

Code4Lib Journal Perl module (version .003)

Eric Lease Morgan — Wed, 28 May 2008 18:36:21 +0000

I hacked together a Code4Lib Journal Perl module providing read-only access to the Journal’s underlying WordPress (MySQL) database. You can download the distribution, and the following is from the distribution’s README file:

This is the README file for a Perl module called C4LJ — Code4Lib Journal

Code4Lib Journal is the refereed serial of the Code4Lib community. [1] The community desires to make the Journal’s content as widely accessible as possible. To that end, this Perl module is a read-only API against the Journal’s underlying WordPress database. Its primary purpose is to generate XML files that can be uploaded to the Directory of Open Access Journals and consequently made available through their OAI interface. [2]

Installation

To install the module you first need to have access to a WordPress (MySQL) database styled after the Journal. There is sample data in the distribution’s etc directory.

Next, you need to edit lib/C4LJ/Config.pm. Specifically, you will need to change the values of:

* $DATA_SOURCE – the DSN of your database, and you will probably need to only edit the value of the database name

* $USERNAME – the name of an account allowed to read the database

* $PASSWORD – the password of $USERNAME

Finally, exploit the normal Perl installation procedure: make; make test; make install.

Usage

To use the module, call C4LJ::Article->get_articles. It returns a list of article objects, which you can then process one at a time. Something like this:

  use C4LJ::Article;

  # print the metadata of every article in the database
  foreach ( C4LJ::Article->get_articles ) {
    print '        ID: ' . $_->id       . "\n";
    print '     Title: ' . $_->title    . "\n";
    print '       URL: ' . $_->url      . "\n";
    print '  Abstract: ' . $_->abstract . "\n";
    print '    Author: ' . $_->author   . "\n";
    print '      Date: ' . $_->date     . "\n";
    print '     Issue: ' . $_->issue    . "\n";
    print "\n";
  }

The bin directory contains three sample applications:

1. dump-metadata.pl – the code above, basically

2. c4lj2doaj.pl – given an issue number, output XML suitable for DOAJ

3. c4lj2doaj.cgi – the same as c4lj2doaj.pl but with a Web interface

See the modules’ PODs for more detail.

License

This module is distributed under the GNU General Public License.

Notes

[1] Code4Lib Journal – http://journal.code4lib.org/
[2] DOAJ OAI information – http://www.doaj.org/doaj?func=loadTempl&templ=070509

Open Library, the movie!

Eric Lease Morgan — Tue, 27 May 2008 02:29:54 +0000

For a good time, I created a movie capturing some of the things I saw while attending the Open Library Developer’s Meeting a few months ago. Introducing, Open Library, the movie!

get-mbooks.pl

Eric Lease Morgan — Tue, 27 May 2008 01:50:33 +0000

A few months ago I wrote a program called get-mbooks.pl, and it was used to harvest MARC data from the University of Michigan’s OAI repository of public domain Google Books. You can download the program here, and what follows is the distribution’s README file:

This is the README file for a script called get-mbooks.pl

This script, get-mbooks.pl, is an OAI harvester. It makes a connection to the OAI data provider at the University of Michigan. [1] It then requests the set of public domain Google Books (mbooks:pd) using the marc21 (MARCXML) metadata schema. As the metadata is downloaded, it gets converted into MARC records in communications format through the use of the MARC::File::SAX handler.
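For readers unfamiliar with OAI-PMH, the request described above boils down to a single ListRecords URL. What follows is only a sketch of that URL's shape, not code taken from get-mbooks.pl. The helper names (uri_escape_simple and list_records_url) are made up for illustration, the base URL is the one from note [1], and the escaping routine assumes plain ASCII values. In a real harvest a module such as Net::OAI::Harvester builds and sends these requests for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode every character that is not an unreserved URI character.
# (Hypothetical helper; assumes the value is plain ASCII.)
sub uri_escape_simple {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Build an OAI-PMH ListRecords request URL. (Hypothetical helper.)
# Per the protocol, resumptionToken is an exclusive argument: when it is
# given, metadataPrefix and set are left out of the request.
sub list_records_url {
    my ($base, %args) = @_;
    my @pairs = ('verb=ListRecords');
    if (defined $args{resumptionToken}) {
        push @pairs, 'resumptionToken=' . uri_escape_simple($args{resumptionToken});
    }
    else {
        for my $key (qw(metadataPrefix set)) {
            next unless defined $args{$key};
            push @pairs, "$key=" . uri_escape_simple($args{$key});
        }
    }
    return $base . '?' . join('&', @pairs);
}

# The first request for the public domain set, in MARCXML.
print list_records_url(
    'http://quod.lib.umich.edu/cgi/o/oai/oai',
    metadataPrefix => 'marc21',
    set            => 'mbooks:pd',
), "\n";
# prints http://quod.lib.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=marc21&set=mbooks%3Apd
```

A data provider usually returns the records in batches. Each response carries a resumption token, and the harvester repeats the request with that token until the provider stops sending one.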

The magic of this script lies in MARC::File::SAX. It is a hack written by Ed Summers against the MARC::File::SAX found on CPAN. It converts the metadata sent from the provider into “real” MARC. You will need this hacked version of the module in your Perl path, and it has been saved in the lib directory of this distribution.

To get get-mbooks.pl to work you will first need Perl. Describing how to install Perl is beyond the scope of this README. Next you will need the necessary modules. Installing them is best accomplished through the use of cpan but you will need to be root. As root, run cpan and when prompted, install Net::OAI::Harvester:

$ sudo cpan
cpan> install Net::OAI::Harvester

You will also need the various MARC::Record modules:

$ sudo cpan
cpan> install MARC::Record

When you get this far, and assuming the hacked version of MARC::File::SAX is saved in the distribution’s lib directory, all you need to do next is run the program.

$ ./get-mbooks.pl

Downloading the data is not a quick process, and progress will be echoed in the terminal. At any time after you have gotten some records you can quit the program (ctrl-c) and use the Perl script marcdump to see what you have gotten (marcdump ).

Fun with OAI, Google Books, and MARC.

[1] http://quod.lib.umich.edu/cgi/o/oai/oai

Hello, World!

Eric Lease Morgan — Tue, 27 May 2008 01:09:40 +0000

Hello, World! It is nice to meet you.