Final blog posting

December 19th, 2020

This is probably my final blog posting using the WordPress software, and I hope to pick up posting on Infomotions’ Musings.

WordPress is a significant piece of software, and while its functionality is undeniable, maintaining the software in a constant process. It has become too expensive for me.

Moreover over, blog software, such as WordPress, was suppose to enable two additional types of functionality that have not really come to fruition. The first is/was syndication. Blog software was expected to support things like RSS feeds. While blog software does support RSS, people to not seem to create/maintain lists of blogs and RSS feeds for reading. The idea of RSS has not come to fruition in the expected way. Similarly, blog were expected to support commenting in the form of academic dialog, but that has not really come to fruition either; blog comments are usually terse and do not really foster discussion.

For these reasons, I am foregoing WordPress, and I hope to return to use the of my personal TEI publishing process. I feel as if my personal process will be more long-lasting.

In order to make this transition, I have used a WordPress plug-in called Simply Static. Install the software, play with the settings, create a static site, review results, and repeat if necessary. The software seems to work pretty well. Also, paying the roll of librarian, I have made an effort classify my blog postings while diminishing the number of items in the “miscellaneous” category.

By converting my blog to a static site and removing WordPress from my site, I feel as if I am making the Infomotions web presence simpler and easier to maintain. Sure, I am loosing some functionality, but that loss is smaller than the amount of time, effort, and worry I incur by running software I know too little about.

Posted in Miscellaneous | Comments Off on Final blog posting

Charting & graphing with Tableau Public

May 4th, 2018

They say, “A picture is worth a thousand words”, and through use of something like Tableau this can become a reality in text mining.

After extracting features from a text, you will have almost invariably created lists. Each of the items on the lists will be characterized with bits of context thus transforing the raw data into information. These lists will probably take the shape of matrices (sets of rows & columns), but other data structures exist as well, such as networked graphs. Once the data has been transformed into information, you will want to make sense of the information — turn the information into knowledge. Charting & graphing the data is one way to make that happen.

For example, the reader may have associated each word in a text with a part-of-speech, and then this association was applied across a corpus. The reader might then ask, “To what degree are words associated with each part-of-speech similar or different across items in the corpus? Do different items include similar or different personal pronouns, and therefore, are some documents more male, more female, or more gender neutral?” Alternatively, suppose the named entities have been extracted from items in a corpus, then the reader may want to know, “What countries, states, and/or cities are mentioned in the text/corpus, and to what degree? Are these texts ‘European’ in scope?”

A charting & graphing application like Tableau (or Tableau Public) can address these questions. [1] The first can be answered by enabling the reader to select one more more items from a corpus, select one or more parts-of-speech, counting & tabulating the intersected words, and displaying the result as a word cloud. The second question can be addressed similarly. Allow the reader to select items from a corpus, extract the names of places (countries, states, and cities), and plot the geographic coordinates on a global map. Once these visualizations are complete, they can be saved on the Web for others to use, for example:

Creating visualizations with Tableau (or Tableau Public) takes practice. Not only does the reader need to have structured data in hand, but one needs to be patient in the learning of the interface. To the author’s mind, the whole thing is reminiscent of the venerable HyperCard program from the 1980’s where one was presented with a number of “cards”, and programming interfaces were created by placing “objects” on them.

tableau interface

This workshop comes with two previously created Tableau workbooks located in the etc directory (word-clouds.twbx and maps.twbx). Describing the process to create them is beyond the scope of this workshop, but an outline follows:

amass sets of data, like parts-of-speech or named entities
import the data into Tableau
in the case of the named entities, convert the data to “Geographic Roles”
drag data elements to the Marks, Rows, or Columns cards
make liberal use of the Show Me feature
drag data elements to the Filters card
observe the visualizations and turn your information into knowledge

Tableau is not really intended to be used against textual data/information; Tableau is more useful and more functional when applied to tabular numeric data. After all, the program is called… Tableau. This does not mean Tableau can not be exploited by the text miner. It just means it requires practice and an ability to articulate a question to be answered with the help of a visualization.

Links

[1] Tableau Public – https://public.tableau.com/

Posted in Text Mining and Natural Langauge Processing | Comments Off on Charting & graphing with Tableau Public

Extracting parts-of-speech and named entities with Stanford tools

April 26th, 2018

Extracting specific parts-of-speech as well as “named entities”, and then counting & tabulating them can be quite insightful.

Parts-of-speech include nouns, verbs, adjectives, adverbs, etc. Named entities are specific types of nouns, including but not limited to, the names of people, places, organizations, dates, times, money amounts, etc. By creating features out of parts-of-speech and/or named entities, the reader can answer questions such as:

What is discussed in this document?
What do things do in this document?
How are things described, and how might those descriptions be characterized?
To what degree is the text male, female, or gender neutral?
Who is mentioned in the text?
To what places are things referring?
What happened in the text?

There are a number of tools enabling the reader to extract parts-of-speech, including the venerable Brill part-of-speech tagger implemented in a number of programming languages, CLAWS, the Apache OpenNLP, and a specific part of the Stanford NLP suite of tools called the Stanford Log-linear Part-Of-Speech Tagger. [1] Named entities can be extracted with the Stanford Named Entity Recognizer (NER). [2] This workshop exploits the Standford tools.

The Stanford Log-linear Part-Of-Speech Tagger is written in Java, making it a bit difficult for most readers to use in the manner it was truly designed, the author included. Luckily, the distribution comes with a command-line interface allowing the reader to use the tagger sans any Java programing. Because any part-of-speech or named entity extraction application is the result of a machine learning process, it is necessary to use a previously created computer model. The Stanford tools comes quite a few models from which to choose. The command-line interface also enables the reader to specify different types of output: tagged, XML, tab-delimited, etc. Because of all these options, and because the whole thing uses Java “archives” (read programming libraries or modules), the command-line interface is daunting, to say the least.

After downloading the distribution, the reader ought to be able to change to the bin directory, and execute either one of the following commands:

$ stanford-postagger-gui.sh
> stanford-postagger-gui.bat

The result will be a little window prompting for a sentence. Upon entering a sentence, tagged output will result. This is a toy interface, but demonstrates things quite nicely.

pos gui

The full-blown command-line interface is bit more complicated. From the command-line one can do either of the following, depending on the operating system:

$ stanford-postagger.sh models/english-left3words-distsim.tagger walden.txt
> stanford-postagger.bat models\english-left3words-distsim.tagger walden.txt

The result will be a long stream of tagged sentences, which I find difficult to parse. Instead, I prefer the inline XML output, which is much more difficult to execute but much more readable. Try either:

$ java -cp stanford-postagger.jar: edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-left3words-distsim.tagger -outputFormat inlineXML -outputFormatOptions lemmatize -textFile walden.txt
> java -cp stanford-postagger.jar: edu.stanford.nlp.tagger.maxent.MaxentTagger -model models\english-left3words-distsim.tagger -outputFormat inlineXML -outputFormatOptions lemmatize -textFile walden.txt

In these cases, the result will be a long string of ill-formed XML. With a bit of massaging, this XML is much easier to parse with just about any compute programming language, believe it or not. The tagger can also be run in server mode, which makes batch processing a whole lot easier. The workshop’s distribution comes a server and client application for exploiting these capabilities, but, unfortunately, they won’t run on Windows computers unless some sort of Linux shell has been installed. Some people can issue the following command to launch the server from the workshop’s distribution:

$ ./bin/pos-server.sh

The reader can run the client like this:

$ ./bin/pos-client.pl walden.txt

The result will be a well-formed XML file, which can be redirected to a file, processed by another script converting it into a tab-delimited file, and finally saved to a second file for reading by a spreadsheet, database, or data analysis tool:

$ ./bin/pos-client.pl walden.txt > walden.pos; ./bin/pos2tab.pl walden.pos > walden.tsv

For the purposes of this workshop, the whole of the harvested data has been pre-processed with the Stanford Log-linear Part-Of-Speech Tagger. The result is been mirrored in the parts-of-speech folder/directory. The reader can open the files in the parts-of-speech folder/directory for analysis. For example, you might open them in OpenRefine and try to see what verbs appear most frequently in a given document. My guess the answer will be the lemmas “be” or “have”. The next set of most frequently used verb lemmas will probably be more indicative of the text.

The process of extrating features of name entities is very similar with the Stanford NER. The original Stanford NER distribution comes with a number of jar files, models, and configuration/parameter files. After downloading the distribution, the reader can run a little GUI application, import some text, and run NER. The result will look something like this:

ner gui

The simple command-line interface takes a single file as input, and it outputs a stream of tagged sentences. For example:

$ ner.sh walden.txt
> ner.bat walden.txt

Each tag denotes an entity (i.e. the name of a person, the name of a place, the name of an organization, etc.). Like the result of all machine learning algorithms, the tags are not necessarily correct, but upon closer examination, most of them are pretty close. Like the POS Tagger, this workshop’s distribution comes with a set of scripts/programs that can make the Stanford NER tool locally available as a server. It also comes with a simple client to query the server. Like the workshop’s POS tool, the reader (with a Macintosh or Linux computer) can extract named entities all in two goes:

$ ./bin/pos-server.sh
$ ./bin/pos-client.pl walden.txt > walden.ner; ./bin/pos2tab.pl walden.ner > walden.tsv

Like the workshop’s pre-processed part-of-speech files, the workshop’s corpus has been pre-processed with the NER tool. The preprocessed files ought to be in a folder/directory named… named-entities. And like the parts-of-speech files, the “ner” files are tab-delimited text files readable by spreadsheets, databases, OpenRefine, etc. For example, you might open one of them in OpenRefine and see what names of people trend in a given text. Try to create a list of places (which is not always easy), export them to a file, and open them with Tabeau Public for the purposes of making a geographic map.

Extracting parts-of-speech and named entities straddles simple text mining and natural language processing. Simple text mining is often about counting & tabulating features (words) in a text. These features have little context sans proximity to other features. On the other hand, parts-of-speech and named entities denote specific types of things, namely specific types of nouns, verbs, adjectives, etc. While these thing do not necessarily denote meaning, they do provide more context than simple features. Extracting parts-of-speech and named entities is (more or less) a easy text mining task with more benefit than cost. Extracting parts-of-speech and named entities goes beyond the basics.

Creating a plain text version of a corpus with Tika

April 25th, 2018

It is imperative to create plain text versions of corpus items.

Text mining can not be done without plain text data. This means HTML files need to be rid of markup. It means PDF files need to have been “born digitally” or they need to have been processed with optical character recognition (OCR), and then the underlying text needs to be extracted. Word processor files need to converted to plain text, and the result saved accordingly. The days of plain o’ ASCII text files need to be forgotten. Instead, the reader needs to embrace Unicode, and whenever possible, make sure characters in the text files are encoded as UTF-8. With UTF-8 encoding, one gets all of the nice accent marks so foreign to United States English, but one also gets all of the pretty emoticons increasingly sprinkling our day-to-day digital communications. Moreover, the data needs to be as “clean” as possible. When it comes to OCR, do not fret too much. Given the large amounts of data the reader will process, “bad” OCR (OCR with less than 85% accuracy) can still be quite effective.

Converting harvested data into plain text used to be laborious as well as painful, but then a Java application called Apache Tika came on the scene. [1] Tika comes in two flavors: application and server. The application version can take a single file as input, and it can output metadata as well as any underlying text. The application can also work in batch mode taking a directory as input and saving the results to a second directory. Tika’s server version is much more expressive, more powerful, and very HTTP-like, but it requires more “under the hood” knowledge to exploit to its fullest potential.

For the sake of this workshop, versions of the Tika application and Tika server are included in the distribution, and they have been saved in the lib directory with the names tika-desktop.jar and tika-server.jar. The reader can run the desktop/GUI version of the Tika application by merely double-clicking on it. The result will be a dialog box.

Drag a PDF (or just about any) file on to the window, and Tika will extract the underlying text. To use the command-line interface, something like this could be run to output the help text:

$ java -jar ./lib/tika-desktop.jar --help
> java -jar .\lib\tika-desktop.jar --help

And then something like these commands to process a single file or a whole directory of files:

$ java -jar ./lib/tika-desktop.jar -t <filename>
$ java -jar ./lib/tika-desktop.jar -t -i <input directory> -o <output directory>
> java -jar .\lib\tika-desktop.jar -t <filename>
> java -jar .\lib\tika-desktop.jar -t -i <input directory> -o <output directory>

Try transforming a few files individually as well as in batch. What does the output look like? To what degree is it readable? To what degree has the formatting been lost? Text mining does not take formatting into account, so there is no huge loss in this regard.

Without some sort of scripting, the use of Tika to convert harvested data into plain text can still be tedious. Consequently, the whole of the workshop’s harvested data has been pre-processed with a set of Perl and bash scripts (which probably won’t work on Windows computers unless some sort of Linux shell has been installed):

$ ./bin/tika-server.sh – runs Tika in server mode on TCP port 8080, and waits patiently for incoming connections
$ ./bin/tika-client.pl – takes a file as input, sends it to the server, and returns the plain text while handling the HTTP magic in the middle
$ ./bin/file2txt.sh – a front-end to the second script taking a file and directory name as input, transforming the file into plain text, and saving the result with the same name but in the given directory and with a .txt extension

The entirety of the harvested data has been transformed into plain text for the purposes of this workshop. (“Well, almost all.”) The result has been saved in the folder/directory named “corpus”. Peruse the corpus directory. Compare & contrast its contents with the contents of the harvest directory. Can you find any ommisions, and if so, then can you guess why/how they occurred?

Identifying themes and clustering documents using MALLET

April 25th, 2018

Topic modeling is an unsupervised machine learning process. It is used to create clusters (read “subsets”) of documents, and each cluster is characterized by sets of one or more words. Topic modeling is good at answering questions like, “If I were to describe this collection of documents in a single word, then what might that word be? How about two?” or make statements like, “Once I identify clusters of documents of interest, allow me to read/analyze those documents in greater detail.” Topic modeling can also be used for keyword (“subject”) assignment; topics can be identified and then documents can be indexed using those terms. In order for a topic modeling process to work, a set of documents first needs to be assembled. The topic modeler then, at the very least, takes an integer as input, which denotes the number of topics desired. All other possible inputs can be assumed, such as use of a stop word list or denoting the number of time the topic modeler ought to internally run before it “thinks” it has come the best conclusion.

MALLET is the grand daddy of topic modeling tools, and it supports other functions such as text classification and parsing. [1] It is essentially a set of Java-based libraries/modules designed to be incorporated into Java programs or executed from the command line.

A subset of MALLET’s functionality has been implemented in a program called topic-modeling-tool, and the tool bills itself as “A GUI for MALLET’s implementation of LDA.” [2] Topic-modeling-tool provides an easy way to read what possible themes exist in a set of documents or how the documents might be classified. It does this by creating topics, displaying the results, and saving the data used to create the results for future use. Here’s one way:

Create a set of plain text files, and save them in a single directory.
Run/launch topic-modeling-tool.
Specify where the set of plain text files exist.
Specify where the output will be saved.
Denote the number of topics desired.
Execute the command with “Learn Topics”.

The result will be a set of HTML, CSS, and CSV files saved in the output location. The “answer” can also be read in the tool’s console.

A more specific example is in order. Here’s how to answer the question, “If I were describe this corpus in a single word, then what might that one word be?”:

Repeat Steps #1-#4, above.
Specify a single topic to be calculated.
Press “Optional Settings…”.
Specify “1” as the number of topic words to print.
Press okay.
Execute the command with “Learn Topics”.

What one word can be used to describe your collection?

Iterate the modeling process by slowly increasing the number of desired topics and number of topic words. Personally, I find it interesting to implement a matrix of topics to words. For example, start with one topic and one word. Next, denote two topics with two words. Third, specify three topics with three words. Continue the process until the sets of words (“topics”) seem to make intuitive sense. After a while you may observe clear semantic distinctions between each topic as well as commonalities between each of the topic words. Distinctions and commonalities may include genders, places, names, themes, numbers, OCR “mistakes”, etc.

Introduction to the NLTK

April 25th, 2018

The venerable Python Natural Language Toolkit (NLTK) is well worth the time of anybody who wants to do text mining more programmatically. [0]

For much of my career, Perl has been the language of choice when it came to processing text, but in the recent past it seems to have fallen out of favor. I really don’t know why. Maybe it is because so many other computer languages have some into existence in the past couple of decades: Java, PHP, Python, R, Ruby, Javascript, etc. Perl is more than capable of doing the necessary work. Perl is well-supported, and there are a myriad of supporting tools/libraries for interfacing with databases, indexers, TCP networks, data structures, etc. On the other hand, few people are being introduced to Perl; people are being introduced to Python and R instead. Consequently, the Perl community is shrinking, and the communities for other languages is growing. Writing something in a “dead” language is not very intelligent, but that may be over-stating the case. On the other hand, I’m not going to be able to communicate with very many people if I speak Latin and everybody else is speaking French, Spanish, or German. It behooves the reader to write software in a language apropos to the task as well as a langage used by many others.

Python is a good choice for text mining and natural language processing. The Python NLTK provides functionality akin to much of what has been outlined in this workshop, but it goes much further. More specifically, it interaces with WordNet, a sort of thesaurus on steroids. It interfaces with MALLET, the Java-based classification & topic modeling tool. It is very well-supported and continues to be maintained. Moreover, Python is mature in & of itself. There are a host of Python “distributions/frameworks”. There are any number of supporting libraries/modules for interfacing with the Web, databases & indexes, the local file system, etc. Even more importantly for text mining (and natural language processing) techniques, Python is supported by a set of robust machine learning libraries/modules called scikit-learn. If the reader wants to write text mining or natural language processing applications, then Python is really the way to go.

In the etc directory of this workshop’s distribution is a “Jupyter Notebook” named “An introduction to the NLTK.ipynb”. [1] Notebooks are sort of interactive Python interfaces. After installing Jupyter, the reader ought to be able to run the Notebook. This specific Notebook introduces the use of the NLTK. It walks you through the processes of reading a plain text file, parsing the file into words (“features”). Normalizing the words. Counting & tabulating the results. Graphically illustrating the results. Finding co-occurring words, words with similar meanings, and words in context. It also dabbles a bit into parts-of-speech and named entity extraction.

notebook

The heart of the Notebook’s code follows. Given a sane Python intallation, one can run this proram by saving it with a name like introduction.py, saving a file named walden.txt in the same directory, changing to the given directory, and then running the following command:

python introduction.py

The result ought to be a number of textual outputs in the terminal window as well as a few graphics.

Errors may occur, probably because other Python libraries/modules have not been installed. Follow the error messages’ instructions, and try again. Remember, “Your milage may vary.”

# configure; using an absolute path, define the location of a plain text file for analysis
FILE = 'walden.txt'

# import / require the use of the Toolkit
from nltk import *

# slurp up the given file; display the result
handle = open( FILE, 'r')
data   = handle.read()
print( data )

# tokenize the data into features (words); display them
features = word_tokenize( data )
print( features )

# normalize the features to lower case and exclude punctuation
features = [ feature for feature in features if feature.isalpha() ]
features = [ feature.lower() for feature in features ]
print( features )

# create a list of (English) stopwords, and then remove them from the features
from nltk.corpus import stopwords
stopwords = stopwords.words( 'english' )
features  = [ feature for feature in features if feature not in stopwords ]

# count & tabulate the features, and then plot the results -- season to taste
frequencies = FreqDist( features )
plot = frequencies.plot( 10 )

# create a list of unique words (hapaxes); display them
hapaxes = frequencies.hapaxes()
print( hapaxes )

# count & tabulate ngrams from the features -- season to taste; display some
ngrams      = ngrams( features, 2 )
frequencies = FreqDist( ngrams )
frequencies.most_common( 10 )

# create a list each token's length, and plot the result; How many "long" words are there?
lengths = [ len( feature ) for feature in features ]
plot    = FreqDist( lengths ).plot( 10 )

# initialize a stemmer, stem the features, count & tabulate, and output
from nltk.stem import PorterStemmer
stemmer     = PorterStemmer()
stems       = [ stemmer.stem( feature ) for feature in features ]
frequencies = FreqDist( stems )
frequencies.most_common( 10 )

# re-create the features and create a NLTK Text object, so other cool things can be done
features = word_tokenize( data )
text     = Text( features )

# count & tabulate, again; list a given word -- season to taste
frequencies = FreqDist( text )
print( frequencies[ 'love' ] )

# do keyword-in-context searching against the text (concordancing)
print( text.concordance( 'love' ) )

# create a dispersion plot of given words
plot = text.dispersion_plot( [ 'love', 'war', 'man', 'god' ] )

# output the "most significant" bigrams, considering surrounding words (size of window) -- season to taste
text.collocations( num=10, window_size=4 )

# given a set of words, what words are nearby
text.common_contexts( [ 'love', 'war', 'man', 'god' ] )

# list the words (features) most associated with the given word
text.similar( 'love' )

# create a list of sentences, and display one -- season to taste
sentences = sent_tokenize( data )
sentence  = sentences[ 14 ]
print( sentence )

# tokenize the sentence and parse it into parts-of-speech, all in one go
sentence = pos_tag( word_tokenize( sentence ) )
print( sentence )

# extract named enities from a sentence, and print the results
entities = ne_chunk( sentence )
print( entities )

# done
quit()

Using Voyant Tools to do some “distant reading”

April 24th, 2018

Voyant Tools is often the first go-to tool used by either: 1) new students of text mining and the digital humanities, or 2) people who know what kind of visualization they need/want. [1] Voyant Tools is also one of the longest supported tools described in this bootcamp.

As stated the Tool’s documentation: “Voyant Tools is a web-based text reading and analysis environment. It is a scholarly project that is designed to facilitate reading and interpretive practices for digital humanities students and scholars as well as for the general public.” To that end it offers a myriad of visualizations and tabular reports characterizing a given text or texts. Voyant Tools works quite well, but like most things, the best use comes with practice, a knowledge of the interface, and an understanding of what the reader wants to express. To all these ends, Voyant Tools counts & tabulates the frequencies of words, plots the results in a number of useful ways, supports topic modeling, and the comparison documents across a corpus. Examples include but are not limited to: word clouds, dispersion plots, networked analysis, “stream graphs”, etc.

dispersion chart	network diagram
“stream” chart	word cloud
concordance	topic modeling

Voyant Tools’ initial interface consists of six panes. Each pain encloses a feature/function of Voyant. In the author’s experience, Voyant Tools’ is better experienced by first expanding one of the panes to a new window (“Export a URL”), and then deliberately selecting one of the tools from the “window” icon in the upper left-hand corner. There will then be displayed a set of about two dozen tools for use against a document or corpus.

initial layout

focused layout

Using Voyant Tools the reader can easily ask and answer the following sorts of questions:

What words or phrases appear frequently in this text?
How do those words trend throughout the given text?
What words are used in context with a given word?
If the text were divided into T topics, then what might those topics be?
Visually speaking, how do given texts or sets of text cluster together?

After a more thorough examination of the reader’s corpus, and after making the implicit more explicit, Voyant Tools can be more informative. Randomly clicking through its interface is usually daunting to the novice. While Voyant Tools is easy to use, it requires a combination of text mining knoweldge and practice in order to be used effectively. Only then will useful “distant” reading be done.

[1] Voyant Tools – https://voyant-tools.org/

Posted in Text Mining and Natural Langauge Processing | Comments Off on Using Voyant Tools to do some “distant reading”

Using a concordance (AntConc) to facilitate searching keywords in context

April 23rd, 2018

A concordance is one of the oldest of text mining tools dating back to at least the 13th century when they were used to analyze and “read” religious texts. Stated in modern-day terms, concordances are key-word-in-context (KWIC) search engines. Given a text and a query, concordances search for the query in the text, and return both the query as well as the words surrounding the query. For example, a query for the word “pond” in a book called Walden may return something like the following:

  1.    the shore of Walden Pond, in Concord, Massachuset
  2.   e in going to Walden Pond was not to live cheaply 
  3.    thought that Walden Pond would be a good place fo
  4.    retires to solitary ponds to spend it. Thus also 
  5.    the woods by Walden Pond, nearest to where I inte
  6.    I looked out on the pond, and a small open field 
  7.   g up. The ice in the pond was not yet dissolved, t
  8.   e whole to soak in a pond-hole in order to swell t
  9.   oping about over the pond and cackling as if lost,
  10.  nd removed it to the pond-side by small cartloads,
  11.  up the hill from the pond in my arms. I built the

The use of a concordance enables the reader to learn the frequency of the given query as well as how it is used within a text (or corpus).

Digital concordances offer a wide range of additional features. For example, queries can be phrases or regular expressions. Search results and be sorted by the words on the left or on the right of the query. Queries can be clustered by the proximity of their surrounding words, and the results can be sorted accordingly. Queries and their nearby terms can be scored not only by their frequencies but also by the probability of their existence. Concordances can calculate the postion of a query i a text and illustrate the result in the form of a dispersion plot or histogram.

AntConc screen shot AntConc is a free, cross-platform concordance program that does all of the things listed above, as well as a few others. [1] The interface is not as polished as some other desktop applications, and sometimes the usability can be frustrating. On the other hand, given practice, the use of AntConc can be quite illuminating. After downloading and running AntConc, give these tasks a whirl:

use the File menu to open a single file
use the Word List tab to list token (word) frequencies
use the Settings/Tool Preferences/Word List Category to denote a set of stop words
use the Word List tab to regenerate word frequencies
select a word of interest from the frequency list to display the KWIC; sort the result
use the Concordance Plot tab to display the dispersion plot
select the Collocates tab to see what words are near the selected word
sort the collocates by frequency and/or word; use the result to view the concordance

The use of a concordance is often done just after the creation of a corpus. (Remember, a corpus can include one or more text files.) But the use of a concordance is much more fruitful and illuminating if the features of a corpus are previously made explicit. Concordances know nothing about parts-of-speech nor grammer. Thus they have little information about the words they are analyzing. To a concordance, every word is merely a token — the tiniest bit of data. Whereas features are more akin to information because they have value. It is better to be aware of the information at your disposal as opposed to simple data. Do not rush to the use of a concordance before you have some information at hand.

[1] AntConc – http://www.laurenceanthony.net/software/antconc/

Posted in Text Mining and Natural Langauge Processing | Comments Off on Using a concordance (AntConc) to facilitate searching keywords in context

Word clouds with Wordle

April 22nd, 2018

words A word cloud, or sometimes called a “tag cloud” is a fun, easy, and popular way to visualize the characteristics of a text. Usually used to illustrate the frequency of words in a text, a word clouds make some features (“words”) bigger than others, sometimes colorize the features, and amass the result in a sort of “bag of words” fashion.

Many people disparage the use of word clouds. This is probably because word clouds may have been over used. The characteristics they illustrate are sometimes sophomoric. Or too much value has been given to their meaning. Despite these facts, a word cloud is an excellent way to initialize the analysis of texts.

There are many word cloud applications and programming libraries, but Wordle is probably the easiest to use as well as the most popular. † [1] To get started, use your Web browser and go to the Wordle site. Click the Create tab and type some text into the resulting text box. Submit the form. Your browser may ask for permissions to run a Java application, and if granted, the result ought to be simple word cloud. The next step is to play with Wordle’s customizations: fonts, colors, layout, etc. To begin doing useful analysis, open a file from the workshop’s corpus, and copy/paste it into Wordle. What does the result tell you? Copy/paste a different file into Wordle and then compare/contrast the two word clouds.

places By default, Wordle make effort to normalize the input. It removes stop words, lower-cases letters, removes numbers, etc. Wordle then counts & tabulates the frequencies of each word to create the visualization. But the frequency of words only tells one part of a text’s story. There are other measures of interest. For example, the reader might want to create a word cloud of ngram frequencies, the frequencies of parts-of-speech, or even the log-likelihood scores of significant words. To create the sorts of visualization as word clouds, the reader must first create a colon-delimited list of features/scores, and then submit them under Wordle’s Advanced tab. The challenging part of this process is created the list of features/scores, and the process can be done using a combination of the tools described in the balance of the workshop.

† Since Wordle is a Web-based Java application, it is also a good test case to see whether or not Java is installed and configured on your desktop computer.

[1] Wordle – http://www.wordle.net

Posted in Text Mining and Natural Langauge Processing | Comments Off on Word clouds with Wordle

An introduction to the NLTK: A Jupyter Notebook

April 12th, 2018

The attached file introduces the reader to the Python Natural Langauge Toolkit (NLTK).

The Python NLTK is a set of modules and corpora enabling the reader to do natural langauge processing against corpora of one or more texts. It goes beyond text minnig and provides tools to do machine learning, but this Notebook barely scratches that surface.

This is my first Python Jupyter Notebook. As such I’m sure there will be errors in implementation, style, and functionality. For example, the Notebook may fail because the value of FILE is too operating system dependent, or the given file does not exist. Other failures may/will include the lack of additional modules. In these cases, simply read the error messages and follow the instructions. “Your mileage may vary.”

That said, through the use of this Notebook, the reader ought to be able to get a flavor for what the Toolkit can do without the need to completly understand the Python language.

Posted in Text Mining and Natural Langauge Processing | Comments Off on An introduction to the NLTK: A Jupyter Notebook

What is text mining, and why should I care?

March 28th, 2018

[This is the first of a number of postings on the topic of text mining. More specifically, this is the first draft of an introductory section of a hands-on bootcamp scheduled for ELAG 2018. As I write the bootcamp’s workbook, I hope to post things here. Your comments are most welcome. –ELM]

Amarylis (1 of 3) Text mining is a process used to identify, enumerate, and analyze syntactic and semantic characteristics of a corpus, where a corpus is a collection of documents usually in the form of plain text files. The purpose of this process it to bring to light previously unknown facts, look for patterns & anomalies in the facts, and ultimately have a better understanding of the corpora as a whole.

The simplest of text mining processes merely count & tabulate a document’s “tokens” (usually words but sometimes syllables). The counts & tabulations are akin to the measurements and observations made in the physical and social sciences. Statistical methods can then be applied to the observations for the purposes of answering questions such as:

What is the average length of documents in the collection, and do they exhibit a normal distribution?
What are the most common words/phrases in a document?
What are the most common words/phrases in a corpus?
What are the unique words/phrases in a document?
What are the infrequent words/phrases in a corpus?
What words/phrases exist in every document and to what extent?
Where do given words/phrases appear in a text?
What other words surround a given word/phrase?
What words/phrases are truly representative of a document or corpus?
If a document or corpus where to be described in a single word, then what would that word be? How about described in three words? How about describing a document with three topics where each topic is denoted with five words?

The answers to these questions bring to light a corpus’s previously unknown features enabling the reader to use & understand a corpus more fully. Given the answers to these sorts of questions, a person can learn when Don Quixote actually tilts at windmills, to what degree does Thoreau’s Walden use the word “ice” in the same breath as “pond”, or how has the defintion of “scientific practice” has evolved over time?

Given models created from the results of natural language processing, other characteristics (sentences, parts-of-speech, named entities, etc.) can be parsed. These values can also be counted & tabulated enabling the reader to answer new sets of questions:

How difficult is a document to read?
What is being discussed in a corpus? To what degree are the things the names of people, organizations, places, dates, money amounts, etc? What percentage of the personal pronouns are male, female, or neutral?
What is the action in a corpus? What things happen in a document? Are things explained? Said? Measured?
How are things in the corpus described? Overall, are the connotations positive or negative? Do the connotations evolve within a document?

The documents in a corpus are often associated with metadata such as authors, titles, dates, subjects/keywords, numeric rankings, etc. This metadata can be combined with measurements & observations to answer questions like:

How have the use of specific words/phrases waxed & waned over time?
To what degree do authors write about a given concept?
What are the significant words/phrases described with a given genre?
Are there correlations between words/phrases and given document’s usefulness score?

Again, text mining is a process, and the process usually includes the following steps:

Articulating a research question
Amassing a corpus to study
Coercing the corpus into a form amenable to computer processing
Taking measurements and making observations
Analyzing the results and drawing conclusions

Articulating a research question can be as informally stated as, “I’d like to know more about this corpus” or “I’d like to garner an overview of the corpus before I begin reading it in earnest.” On the other hand, articulating a research question can be as formal as a dissertation’s thesis statement. The purpose of articulating a research question — no matter how formal — is to give you a context for your investigations. Knowing a set of questions to answer helps you determine what tools you will employ in your inquires.

Amarylis (2 of 3) Creating a corpus is not always as easy as you might think. The corpus can be as small as a single document, or as large as millions. The “documents” in the corpus can be anything from tweets from a Twitter feed, Facebook postings, survey comments, magazine or journal articles, reference manuals, books, screen plays, musical lyrics, etc. The original documents may have been born digital or not. If not, then they will need to be digitized in one way or another. It is better if each item in the corpus is associated with metadata, such as authors, titles, dates, keywords, etc. Actually obtaining the documents may be impeded by copyrights, licensing restrictions, or hardware limitations. Once the corpus is obtained, it is useful to organize it into a coherent whole. There is a lot of possible for when it comes to corpus creation.

Coercing a corpus into a form amenable to computer processing is a chore in an of itself. In all cases, the document’s text needs to be in “plain” text. These means the document includes only characters, numbers, punctuation marks, and a limited number of symbols. Plain text files include no graphical formatting. No bold. No italics, no “codes” denoting larger or smaller fonts, etc. Documents are usually manifested as files on a computer’s file system. The files are usually brought together as lists, and each item in the list have many attributes — the metadata describing each item. Furthermore, each document may need to be normalized, and normalization may include changing the case of all letters to lower case, parsing the document into words (usually called “features”), identifying the lemmas or stems of a word, eliminating stop/function words, etc. Coercing your corpus into coherent whole is not to be underestimated. Remember the old adage, “Garbage in, garbage out.”

Ironically, taking measurements and making observations is the easy part. There are a myriad of tools for this purpose, and the bulk of this workshop describes how to use them. One important note: it is imperative to format the measurements and observations in a way amenable to analysis. This usually means a tabular format where each column denotes a different observable characteristic. Without formating measurements and observation in tabular formats, it will be difficult to chart and graph any results.

Analyzing the results and drawing conclusions is the subprocess of looping back to Step #1. It is where you attempt to actually answer the questions previously asked. Keep in mind that human interpretation is a necessary part of this subprocess. Text mining does not present you with truth, only facts. It is up to you to interpret the facts. For example, suppose the month is January and the thermometer outside reads 32º Fahrenheit (0º Centigrade), then you might think nothing is amiss. On the other hand, suppose the month is August, and the thermometer still reads 32º, then what might you think? “It is really cold,” or maybe, “The thermometer is broken.” Either way, you bring context to the observations and interpret things accordingly. Text mining analysis works in exactly the same way.

Amarylis (3 of 3) Finally, text mining is not a replacement for the process of traditional reading. Instead, it ought to be considered as complementary, supplemental, and a natural progression of traditional reading. With the advent of ubiquitous globally networked computers, the amount of available data and information continues to grow exponentially. Text mining provides a means to “read” massive amounts of text quickly and easily. The process is akin the inclusion of now-standard parts of a scholarly book: title page, verso complete with bibliographic and provenance metadata, table of contents, preface, introduction, divisions into section and chapters, footnotes, bibliography, and a back-of-the-book index. All of these features make a book’s content more accessible. Text mining processes applied to books is the next step in accessibility. Text mining is often described as “distant” or “scalable” reading, and it is often contrasted with the “close” reading. This is a false dichotomy, but only after text mining becomes more the norm will the dichotomy fade.

All that said, the totality of this hands-on workshop is based on the following outline:

What is text mining, and why should I care?
Creating a corpus
Creating a plain text version of a corpus with Tika
Using Voyant Tools to do some “distant” reading
Using a concordance, like AntConc, to facilitate searching keywords in context
Creating a simple word list with a text editor
Cleaning & analyzing word lists with OpenRefine
Charting & graphing word lists with Tableau Public
Increasing meaning by extracting parts-of-speech with the Standford POS Tagger
Increasing meaning by extracting named entities with the Standford NER
Identifying themes and clustering documents using MALLET

By the end of the workshop you will have increased your ability to:

identify patterns, anomalies, and trends in a corpus
practice both “distant” and “scalable” reading
enhance & complement your ability to do “close” reading
use & understand any corpus of poetry or prose

The workshop is operating system agnostic, and all the software is freely available on the ‘Net, or already installed on your computer. Active participation requires zero programming, but readers must bring their own computer, and they must be willing to learn how to use a text editor such as NotePad++ or BBEdit. NotePad, WordPad and TextEdit are totally insufficient.

Posted in Text Mining and Natural Langauge Processing | Comments Off on What is text mining, and why should I care?

How to do text mining in 69 words

August 15th, 2017

Doing just about any type of text mining is a matter of: 0) articulating a research question, 1) acquiring a corpus, 2) cleaning the corpus, 3) coercing the corpus into a data structure one’s software can understand, 4) counting & tabulating characteristics of the corpus, and 5) evaluating the results of Step #4. Everybody wants to do Step #4 & #5, but the initial steps usually take more time than desired.

painting

Posted in Text Mining and Natural Langauge Processing | Comments Off on How to do text mining in 69 words

Achieving perfection

June 3rd, 2016

Through the use of the Levenshtein algorithm, I am achieving perfection when it comes to searching VIAF. Well, almost.

trastevare I am making significant progress with VIAF Finder [0], but now I have exploited the use of the Levenshtein algorithm. In fact, I believe I am now able to programmatically choose VIAF identifiers for more than 50 or 60 percent of the authority records.

The Levenshtein algorithm measures the “distance” between two strings. [1] This distance is really the number of keystrokes necessary to change one string into another. For example, the distance between “eric” and “erik” is 1. Similarly the distance between “Stefano B” and “Stefano B.” is still 1. Along with a colleague (Stefano Bargioni), I took a long, hard look at the source code of an OpenRefine reconciliation service which uses VIAF as the backend database. [2] The code included the calculation of a ratio to denote the relative distance of two strings. This ratio is the quotient of the longest string minus the Levenshtein distance divided by the length of the longest string. From the first example, the distance is 1 and the length of the string “eric” is 4, thus the ratio is (4 – 1) / 4, which equals 0.75. In other words, 75% of the characters are correct. In the second example, “Stefano B.” is 10 characters long, and thus the ratio is (10 – 1) / 10, which equals 0.9. In other words, the second example is more correct than the first example.

Using the value of MARC 1xx$a of an authority file, I can then query VIAF. The SRU interface returns 0 or more hits. I can then compare my search string with the search results to create a ranked list of choices. Based on this ranking, I am able to more intelligently choose VIAF identifiers. For example, from my debugging output, if I get 0 hits, then I do nothing:

       query: Lucariello, Donato
        hits: 0

If I get too many hits, then I still do nothing:

       query: Lucas Lucas, Ramón
        hits: 18
     warning: search results out of bounds; consider increasing MAX

If I get 1 hit, then I automatically save the result, which seems to be correct/accurate most of the time, even though the Levenshtein distance may be large:

       query: Lucaites, John Louis
        hits: 1
       score: 0.250     John Lucaites (57801579)
      action: perfection achieved (updated name and id)

If I get many hits, and one of them exactly matches my query, then I “achieved perfection” and I save the identifier:

       query: Lucas, John Randolph
        hits: 3
       score: 1.000     Lucas, John Randolph (248129560)
       score: 0.650     Lucas, John R. 1929- (98019197)
       score: 0.500     Lucas, J. R. 1929- (2610145857009722920913)
      action: perfection achieved (updated name and id)

If I get many hits, and many of them are exact matches, then I simply use the first one (even though it might not be the “best” one):

       query: Lucifer Calaritanus
        hits: 5
       score: 1.000     Lucifer Calaritanus (189238587)
       score: 1.000     Lucifer Calaritanus (187743694)
       score: 0.633     Luciferus Calaritanus -ca. 370 (1570145857019022921123)
       score: 0.514     Lucifer Calaritanus gest. 370 n. Chr. (798145857991023021603)
       score: 0.417     Lucifer, Bp. of Cagliari, d. ca. 370 (64799542)
      action: perfection achieved (updated name and id)

If I get many hits, and none of them are perfect, but the ratio is above a configured threshold (0.949), then that is good enough for me (even if the selected record is not the “best” one):

       query: Palanque, Jean-Remy
        hits: 5
       score: 0.950     Palanque, Jean-Rémy (106963448)
       score: 0.692     Palanque, Jean-Rémy, 1898- (46765569)
       score: 0.667     Palanque, Jean Rémy, 1898- (165029580)
       score: 0.514     Palanque, J. R. (Jean-Rémy), n. 1898 (316408095)
       score: 0.190     Marrou-Davenson, Henri-Irénée, 1904-1977 (2473942)
      action: perfection achieved (updated name and id)

By exploiting the Levenshtein algorithm, and by learning from the good work of others, I have been able to programmatically select VIAF identifiers for more than half of my authority records. When one has as many as 120,000 records to process, this is a good thing. Moreover, this use of the Levenshtein algorithm seems to produce more complete results when compared to the VIAF AutoSuggest API. AutoSuggest identified approximately 20 percent of my VIAF identifiers, while my Levenshtein algorithm/logic identifies more than 40 or 50 percent. AutoSuggest is much faster though. Much.

Fun with the intelligent use of computers, and think of the possibilities.

[0] VIAF Finder – ./2016/05/viaf-finder/

[1] Levenshtein – http://bit.ly/1Wz3qZC

[2] reconciliation service – https://github.com/codeforkjeff/refine_viaf

Posted in Hacks | 1 Comment »

VIAF Finder

May 27th, 2016

This posting describes VIAF Finder. In short, given the values from MARC fields 1xx$a, VIAF Finder will try to find and record a VIAF identifier. [0] This identifier, in turn, can be used to facilitate linked data services against authority and bibliographic data.

Quick start

Here is the way to quickly get started:

download and uncompress the distribution to your Unix-ish (Linux or Macintosh) computer [1]
put a file of MARC records named authority.mrc in the ./etc directory, and the file name is VERY important
from the root of the distribution, run ./bin/build.sh

VIAF Finder will then commence to:

create a “database” from the MARC records, and save the result in ./etc/authority.db
use the VIAF API (specifically the AutoSuggest interface) to identify VAIF numbers for each record in your database, and if numbers are identified, then the database will be updated accordingly [3]
repeat Step #2 but through the use of the SRU interface
repeat Step #3 but limiting searches to authority records from the Vatican
repeat Step #3 but limiting searches to the authority named ICCU
done

Once done the reader is expected to programmatically loop through ./etc/authority.db to update the 024 fields of their MARC authority data.

Manifest

Here is a listing of the VIAF Finder distribution:

00-readme.txt – this file
bin/build.sh – “One script to rule them all”
bin/initialize.pl – reads MARC records and creates a simple “database”
bin/make-dist.sh – used to create a distribution of this system
bin/search-simple.pl – rudimentary use of the SRU interface to query VIAF
bin/search-suggest.pl – rudimentary use of the AutoSuggest interface to query VIAF
bin/subfield0to240.pl – sort of demonstrates how to update MARC records with 024 fields
bin/truncate.pl – extracts the first n number of MARC records from a set of MARC records, and useful for creating smaller, sample-sized datasets
etc – the place where the reader is expected to save their MARC files, and where the database will (eventually) reside
lib/subroutines.pl – a tiny set of… subroutines used to read and write against the database

Usage

If the reader hasn’t figured it out already, in order to use VIAF Finder, the Unix-ish computer needs to have Perl and various Perl modules — most notably, MARC::Batch — installed.

If the reader puts a file named authority.mrc in the ./etc directory, and then runs ./bin/build.sh, then the system ought to run as expected. A set of 100,000 records over a wireless network connection will finish processing in a matter of many hours, if not the better part of a day. Speed will be increased over a wired network, obviously.

But in reality, most people will not want to run the system out of the box. Instead, each of the individual tools will need to be run individually. Here’s how:

save a file of MARC (authority) records anywhere on your file system
not recommended, but optionally edit the value of DB in bin/initialize.pl
run ./bin/initialize.pl feeding it the name of your MARC file, as per Step #1
if you edited the value of DB (Step #2), then edit the value of DB in bin/search-suggest.pl, and then run ./bin/search-suggest.pl
if you want to possibly find more VIAF identifiers, then repeat Step #4 but with ./bin/search-simple.pl and with the “simple” command-line option
optionally repeat Step #5, but this time use the “named” command-line option, and the possible named values are documented as a part of the VAIF API (i.e., “bav” denotes the Vatican
optionally repeat Step #6, but with other “named” values
optionally repeat Step #7 until you get tired
once you get this far, the reader may want to edit bin/build.sh, specifically configuring the value of MARC, and running the whole thing again — “one script to rule them all”
done

A word of caution is now in order. VIAF Finder reads & writes to its local database. To do so it slurps up the whole thing into RAM, updates things as processing continues, and periodically dumps the whole thing just in case things go awry. Consequently, if you want to terminate the program prematurely, try to do so a few steps after the value of “count” has reached the maximum (500 by default). A few times I have prematurely quit the application at the wrong time and blew my whole database away. This is the cost of having a “simple” database implementation.

To do

Alas, search-simple.pl contains a memory leak. Search-simple.pl makes use of the SRU interface to VIAF, and my SRU queries return XML results. Search-simple.pl then uses the venerable XML::XPath Perl module to read the results. Well, after a few hundred queries the totality of my computer’s RAM is taken up, and the script fails. One work-around would be to request the SRU interface to return a different data structure. Another solution is to figure out how to destroy the XML::XPath object. Incidentally, because of this memory leak, the integer fed to simple-search.pl was implemented allowing the reader to restart the process at a different point dataset. Hacky.

Database

The use of the database is key to the implementation of this system, and the database is really a simple tab-delimited table with the following columns:

id (MARC 001)
tag (MARC field name)
_1xx (MARC 1xx)
a (MARC 1xx$a)
b (MARC 1xx$b and usually empty)
c (MARC 1xx$c and usually empty)
d (MARC 1xx$d and usually empty)
l (MARC 1xx$l and usually empty)
n (MARC 1xx$n and usually empty)
p (MARC 1xx$p and usually empty)
t (MARC 1xx$t and usually empty)
x (MARC 1xx$x and usually empty)
suggestions (a possible sublist of names, Levenshtein scores, and VIAF identifiers)
viafid (selected VIAF identifier)
name (authorized name from the VIAF record)

Most of the fields will be empty, especially fields b through x. The intention is/was to use these fields to enhance or limit SRU queries. Field #13 (suggestions) is for future, possible use. Field #14 is key, literally. Field #15 is a possible replacement for MARC 1xx$a. Field #15 can also be used as a sort of sanity check against the search results. “Did VIAF Finder really identify the correct record?”

Consider pouring the database into your favorite text editor, spreadsheet, database, or statistical analysis application for further investigation. For example, write a report against the database allowing the reader to see the details of the local authority record as well as the authority data in VIAF. Alternatively, open the database in OpenRefine in order to count & tabulate variations of data it contains. [4] Your eyes will widened, I assure you.

Commentary

bird First, this system was written during my “artist’s education adventure” which included a three-month stint in Rome. More specifically, this system was written for the good folks at Pontificia Università della Santa Croce. “Thank you, Stefano Bargioni, for the opportunity, and we did some very good collaborative work.”

Second, I first wrote search-simple.pl (SRU interface) and I was able to find VIAF identifiers for about 20% of my given authority records. I then enhanced search-simple.pl to include limitations to specific authority sets. I then wrote search-suggest.pl (AutoSuggest interface), and not only was the result many times faster, but the result was just as good, if not better, than the previous result. This felt like two steps forward and one step back. Consequently, the reader may not ever need nor want to run search-simple.pl.

Third, while the AutoSuggest interface was much faster, I was not able to determine how suggestions were made. This makes the AutoSuggest interface seem a bit like a “black box”. One of my next steps, during the copious spare time I still have here in Rome, is to investigate how to make my scripts smarter. Specifically, I hope to exploit the use of the Levenshtein distance algorithm. [5]

Finally, I would not have been able to do this work without the “shoulders of giants”. Specifically, Stefano and I took long & hard looks at the code of people who have done similar things. For example, the source code of Jeff Chiu’s OpenRefine Reconciliation service demonstrates how to use the Levenshtein distance algorithm. [6] And we found Jakob Voß’s viaflookup.pl useful for pointing out AutoSuggest as well as elegant ways of submitting URL’s to remote HTTP servers. [7] “Thanks, guys!”

Fun with MARC-based authority data!

Links

[0] VIAF – http://viaf.org

[1] VIAF Finder distribution – http://infomotions.com/sandbox/pusc/etc/viaf-finder.tar.gz

[2] VIAF API – http://www.oclc.org/developer/develop/web-services/viaf.en.html

[4] OpenRefine – http://openrefine.org

[5] Levenshtein distance – https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance

[6] Chiu’s reconciliation service – https://github.com/codeforkjeff/refine_viaf

[7] Voß’s viaflookup.pl – https://gist.github.com/nichtich/832052/3274497bfc4ae6612d0c49671ae636960aaa40d2

Posted in Semantic Web and Linked Data | 1 Comment »

Making stone soup: Working together for the advancement of learning and teaching

May 9th, 2016

It is simply not possible for any of us to do our jobs well without the collaboration of others. Yet specialization abounds, jargon proliferates, and professional silos are everywhere. At the same time we all have a shared goal: to advance learning and teaching. How are we to balance these two seemingly conflicting characteristics in our workplace? How can we satisfy the demands of our day-to-day jobs and at the same time contribute to the work of others? ‡

bouquet sunflowers roses

The answer is not technical but instead rooted in what it means to a part of a holistic group of people. The answer is rooted in things like the abilities to listen, to share, to learn, to go beyond tolerance and towards respect, to take a sincere interest in the other person’s point of view, to discuss, and to take to heart the idea that nobody really sees the whole picture.

As people — members of the human race — we form communities with both our strengths & our weaknesses, with things we know would benefit the group & things we would rather not share, with both our beauties and our blemishes. This is part of what it means to be people. There is no denying it, and if we try, then we are only being less of who we really are. To deny it is an unrealistic expectation. We are not gods. We are not actors. We are people, and being people — real people — is a good thing.

Within any community, there are norms of behavior. Without norms of behavior, there is really no community, only chaos and anarchy. In anarchy and chaos, physical strength is oftentimes the defining characteristic of decision-making, but when the physically strong are outnumbered by the emotionally mature and intellectually aware, then chaos and anarchy are overthrown for a more holistic set of decision-making proceses. Examples include democracy, consensus building, and even the possibility governance through benevolent dictatorship.

A community’s norms are both written and unwritten. Our workplaces are good examples of such communities. On one hand there may be policies & procedures, but these policies & procedures usually describe workflows, the methods used to evaluate employees, or to some extent strategic plans. They might outline how meetings are conducted or how teams are to accomplish their goals. On the other hand, these policies & procedures do not necessarily outline how to talk with fellow employees around the virtual water cooler, how to write email messages, nor how to greet each on a day-to-day basis. Just as importantly, our written norms of behavior do not describe how to treat and communicate with people outside one’s own set of personal expertise. Don’t get me wrong. This does not mean I am advocating written norms for such things, but such things do need to be discussed and agreed upon. Such are the beginnings of stone soup.

Increasingly we seem to work in disciplines of specialization, and these specializations, necessarily, generate their own jargon. “Where have all the generalists gone? Considering our current environment, is it really impossible to be a Renaissance Man^h^h^h Person?” Increasingly, the answer to the first question is, “The generalists have gone the way of Leonardo DiVinci.” And the answer to the second question is, “Apparently so.”

For example, some of us lean more towards formal learning, teaching, research, and scholarship. These are the people who have thoroughly studied and now teach a particular academic discipline. These same people have written dissertations, which, almost by defintion, are very specialized, if not unique. They live in a world pursuant of truth while balancing the worlds of rigorous scholarly publishing and student counseling.

There are those among us who thoroughly know the in’s and out’s of computer technology. These people can enumerate the differences between a word processor and a text editor. They can compare & contrast operating systems. These people can configure & upgrade software. They can make computers communicate on the Internet. They can trouble-shoot computer problems when the computers seem — for no apparent reason — to just break.

Finally, there are those among us who specialize in the collection, organization, preservation, and dissemination of data, information, and knowledge. These people identify bodies of content, systematically describe it, make every effort to preserve it for posterity, and share it with their respective communities. These people deal with MARC records, authority lists, and subject headings.

Despite these truisms, we — our communities — need to figure out how to work together, how to bridge the gaps in our knowledge (a consequence of specialization), and how to achieve our shared goals. This is an aspect of our metaphoric stone soup.

So now the problem can be re-articulated. We live and work in communities of unwritten and poorly articulated norms. To complicate matters, because of our specializations, we all approach our situations from different perspectives and use different languages to deal with the situations. As I was discussing this presentation with a dear friend & colleague, the following poem attributed to Prissy Galagarian was brought to my attention†, and it eloquently states the imperative:

  The Person Next to You

  The person next to you is the greatest miracle
   and the greatest mystery you will ever
   meet at this moment.

  The person next to you is an inexhaustible
   reservoir of possibility,
   desire and dread,
   smiles and frowns, laughter and tears,
   fears and hopes,
   all struggling to find expression.

  The person next to you believes in something,
   stands for something, counts for something,
   lives for something, labors for something,
   waits for something, runs from something,
   runs to something.

  The person next to you has problems and fears,
   wonders how they're doing,
   is often undecided and unorganized
   and painfully close to chaos!
   Do they dare speak of it to you?

  The person next to you can live with you
   not just alongside you,
   not just next to you.

  The person next to you is a part of you.
   for you are the person next to them.

How do we overcome these impediments in order to achieve our mutual goals of the workplace? The root of the answer lies in our ability to really & truly respect our fellow employees.

Working together towards a shared goals is a whole lot like making “stone soup”. Do you know the story of “stone soup”? A man comes into a village, and asks the villagers for food. Every time he asks he is told that there is nothing to give. Despite an apparent lack of anything, the man sets up a little fire, puts a pot of water on, and drops a stone into the pot. Curious people come by, and they ask, “What are you doing?” He says, “I’m making stone soup, but I think it needs a bit of flavor.” Wanting to participate, people begin to add their own things to the soup. “I think I have some carrots,” says one villager. “I believe I have a bit of celery,” says another. Soon the pot is filled with bits of this and that and the other thing: onions, salt & pepper, a beef bone, a few tomatoes, a couple of potatoes, etc. In the end, a rich & hearty stew is made, enough for everybody to enjoy. Working together, without judgement nor selfishness, the end result is a goal well-accomplished.

This can happen in the workplace as well. It can happen in our community where the goal is teaching & learning. And in the spirit of cooking, here’s a sort of recipe:

Understand that you do not have all the answers, and in fact, nobody does; nobody has the complete story nor sees the whole picture. Only after working on a task, and completing it at least once, will a holistic perspective begin to develop.
Understand that nobody’s experience is necessarily more important than the others’, including your own. Everybody has something to offer, and while your skills & expertise may be imperative to success, so are the skills & expertise of others. And if there an established hierarchy within your workplace, understand that the hierarchy is all but arbitrary, and maintained by people with an over-developed sense of power. We all have more things in common than differences.
Spend the time to get to know your colleagues, and come to a sincere appreciation of who they are as a person as well as a professional. This part of the “recipe” may include formal or informal social events inside or outside the workplace. Share a drink or a meal. Take a walk outside or through the local museum. Do this in groups of two or more. Such activities provide a way for everybody involved to reflect upon an outside stimulus. Through this process the interesting characteristics of the others will become apparent. Appreciate these characteristics. Do not judge them, but rather respect them.
Remember, listening is a wonderful skill, and when the other person talks for a long time, they will go away thinking they had a wonderful conversation. Go beyond hearing what a person says. Internalize what they say. Ask meaningful & constructive questions, and speak their name frequently during discussions. These things will demonstrate your true intentions. Through this process the others will become a part of you, and you will become a part of them.
Combine the above ingredients, bring them to a boil, and then immediately lower the temperature allowing everything to simmer for a good long time. Keeping the pot boiling will only overheat the soup and make a mess. Simmering will keep many of the individual parts intacked, enable the flavors to mellow, and give you time to set the table for the next stage of the process.

Finally, making stone soup does not require fancy tools. A cast iron pot will work just as well as one made from aluminium or teflon. What is needed is a container large enough to hold the ingredients and withstand the heat. It doesn’t matter whether or not the heat source is gas, electric, or fire. It just has to be hot enough to allow boiling and then simmering. Similarly, stone soup in the workplace does not require Google Drive, Microsoft Office 365, nor any type of wiki. Sure, those things can facilitate project work, but they are not the means for getting to know your colleagues. Only through personal interaction will such knowledge be garnered.

Working together for the advancement of learning & teaching — or just about any other type of project work — is a lot like making stone soup. Everybody contributes a little something, and the result is nourishing meal for all.

‡ This essay was written as a presentation for the AMICAL annual conference which took place in Rome (May 12-14, 2016), and this essay is available as a one-page handout.

† http://fraternalthoughts.blogspot.it/2011/02/person-next-to-you.html

Posted in Librarianship | Comments Off on Making stone soup: Working together for the advancement of learning and teaching

Editing authorities at the speed of four records per minute

April 7th, 2016

This missive outlines and documents an automated process I used to “cleanup” and “improve” a set of authority records, or, to put it another way, how I edited authorities at the speed of four records per minute.

spring As you may or may not know, starting in September 2015, I commenced upon a sort of “leave of absence” from my employer.† This leave took me to Tuscany, Venice, Rome, Provence, Chicago, Philadelphia, Boston, New York City, and back to Rome. In Rome I worked for the American Academy of Rome doing short-term projects in the library. The first project revolved around authority records. More specifically, the library’s primary clientele were Americans, but the catalog’s authority records included a smattering of Italian headings. The goal of the project was to automatically convert as many of the “invalid” Italian headings into “authoritative” Library of Congress headings.

Identify “invalid” headings

pantheon When I first got to Rome I had the good fortune to hang out with Terry Reese, the author of the venerable MarcEdit.‡ He was there giving workshops. I participated in the workshops. I listened, I learned, and I was grateful for a Macintosh-based version of Terry’s application.

When the workshops were over and Terry had gone home I began working more closely with Sebastian Hierl, the director of the Academy’s library.❧ Since the library was relatively small (about 150,000 volumes), and because the Academy used Koha for its integrated library system, it was relatively easy for Sebastian to give me the library’s entire set of 124,000 authority records in MARC format. I fed the authority records into MarcEdit, and ran a report against them. Specifically, I asked MarcEdit to identify the “invalid” records, which really means, “Find all the records not found in the Library of Congress database.” The result was a set of approximately 18,000 records or approximately 14% of the entire file. I then used MarcEdit to extract the “invalid” records from the complete set, and this became my working data.

Search & download

alter I next created a rudimentary table denoting the “invalid” records and the subsequent search results for them. This tab-delimited file included values of MARC field 001, MARC field 1xx, an integer denoting the number of times I searched for a matching record, an integer denoting the number of records I found, an identifier denoting a Library of Congress authority record of choice, and a URL providing access to the remote authority record. This table was initialized using a script called authority2list.pl. Given a file of MARC records, it outputs the table.

I then systematically searched the Library of Congress for authority headings. This was done with a script called search.pl. Given the table created in the previous step, this script looped through each authority, did a rudimentary search for a valid entry, and output an updated version of the table. This script was a bit “tricky”.❦ It first searched the Library of Congress by looking for the value of MARC 1xx$a. If no records were found, then no updating was done and processing continued. If one record was found, then the Library of Congress identifier was saved to the output and processing continued. If many records were found, then a more limiting search was done by adding a date value extracted from MARC 1xx$d. Depending on the second search result, the output was updated (or not), and processing continued. Out of original 18,000 “invalid” records, about 50% of them were identified with no (zero) Library of Congress records, about 30% were associated with multiple headings, and the remaining 20% (approximately 3,600 records) were identified with one and only one Library of Congress authority record.

I now had a list of 3,600 “valid” authority records, and I needed to download them. This was done with a script called harvest.pl. This script is really a wrapper around a program called GNU Wget. Given my updated table, the script looped through each row, and if it contained a URL pointing to a Library of Congress authority record, then the record was cached to the file system. Since the downloaded records were formatted as MARCXML, I then needed to transform them into MARC21. This was done with a pair of scripts: xml2marc.sh and xml2marc.pl. The former simply looped through each file in a directory, and the later did the actual transformation but along the way updated MARC 001 to the value of the local authority record.

Verify and merge

backyard In order to allow myself as well as others to verify that correct records had been identified, I wrote another pair of programs: marc2compare.pl and compare2html.pl. Given two MARC files, marc2compare.pl created a list of identifiers, original authority values, proposed authority values, and URLs pointing to full descriptions of each. This list was intended to be poured into a spreadsheet for compare & contrast purposes. The second script, compare2html.pl, simply took the output of the first and transformed it into a simple HTML page making it easier for a librarian to evaluate correctness.

Assuming the 3,600 records were correct, the next step was to merge/overlay the old records with the new records. This was a two-step process. The first step was accomplished with a script called rename.pl. Given two MARC files, rename.pl first looped through the set of new authorities saving each identifier to memory. It then looped through the original set of authorities looking for records to update. When records to update were found, each was marked for deletion by prefixing MARC 001 with “x-“. The second step employed MarcEdit to actually merge the set of new authorities with the original authorities. Consequently, the authority file increased in size by 3,600 records. It was then up to other people to load the authorities into Koka, re-evaluate the authorities for correctness, and if everything was okay, then delete each authority record prefixed with “x-“.

Done.❀

Summary and possible next steps

In summary, this is how things happened. I:

got a complete dump of original authority 123,329 records
extracted 17,593 “invalid” records
searched LOC for “valid” records and found 3,627 of them
harvested the found records
prefixed the 3,627 001 fields in the original file with “x-“
merged the original authority records with the harvested records
made the new set of 126,956 updated records available

academy There were many possible next steps. One possibility was to repeat the entire process but with an enhanced search algorithm. This could be difficult considering the fact that searches using merely the value of 1xx$a returned zero hits for half of the working data. A second possibility was to identify authoritative records from a different system such as VIAF or Worldcat. Even if this was successful, I wonder how possible it would have been to actually download authority records as MARC. A third possibility was to write a sort of disambiguation program allowing librarians to choose from a set of records. This could have been accomplished by searching for authorities, presenting possibilities, allowing librarians to make selections via an HTML form, caching the selections, and finally, batch updating the master authority list. Here at the Academy we denoted the last possibility as the “cool” one.

Now here’s an interesting way to look at the whole thing. This process took me about two weeks worth of work, and in that two weeks I processed 18,000 authority records. That comes out to 9,000 records/week. There are 40 hours in work week, and consequently, I processed 225 records/hour. Each hour is made up of 60 minutes, and therefore I processed approximately 4 records/minute, and that is 1 record every fifteen seconds for the last two weeks. Wow!?

Finally, I’d like to thank the Academy (with all puns intended). Sebastian, his colleagues, and especially my office mate (Kristine Iara) were all very supportive throughout my visit. They provided intellectual stimulation and something to do while I contemplated my navel during the “adventure”.

Notes

bicycles † Strictly speaking, my adventure was not a sabbatical nor a leave of absence because: 1) as a librarian I was not authorized to take a sabbatical, and 2) I did not have any healthcare issues. Instead, after bits of negotiation, my contract was temporarily changed from full-time faculty to adjunct faculty, and I worked for my employer 20% of the time. The other 80% of time was spent on my “adventure”. And please don’t get me wrong, this whole thing was a wonderful opportunity for which I will be eternally grateful. “Thank you!”

‡ During our overlapping times there in Rome, Terry & I played tourist which included the Colosseum, a happenstance mass at the Pantheon, a Palm Sunday Mass in St. Peter’s Square with tickets generously given to us by Joy Nelson of ByWater Solutions, and a day-trip to Florence. Along the way we discussed librarianship, open source software, academia, and life in general. A good time was had by all.

❧ Ironically, Sebastian & I were colleagues during the dot-com boom when we both worked at North Caroline State University. The world of librarianship is small.

❦ This script — search.pl — was really a wrapper around an application called curl, and thanks go to Jeff Young of OCLC who pointed me to the ATOM interface of the LC Linked Data Service. Without Jeff’s helpful advice, I would have wrestled with OCLC’s various authentication systems and Web Service interfaces.

❀ Actually, I skipped a step in this narrative. Specifically, there are some records in the authority file that were not expected to be touched, even if they are “invalid”. This set of records was associated with a specific call number pattern. Two scripts (fu-extract.pl and fu-remove.pl) did the work. The first extracted a list of identifiers not to touch and the second removed them from my table of candidates to validate.

Posted in MARC and technical services | Comments Off on Editing authorities at the speed of four records per minute

Failure to communicate

March 22nd, 2016

In my humble opinion, what we have here is a failure to communicate.

shrine Libraries, especially larger libraries, are increasingly made up of many different departments, including but not limited to departments such as: cataloging, public services, collections, preservation, archives, and now-a-days departments of computer staff. From my point of view, these various departments fail to see the similarities between themselves, and instead focus on their differences. This focus on the differences is amplified by the use of dissimilar vocabularies and subdiscipline-specific jargon. This use of dissimilar vocabularies causes a communications gap and left unresolved ultimately creates animosity between groups. I believe this is especially true between the more traditional library departments and the computer staff. This communications gap is an impediment to when it comes to achieving the goals of librarianship, and any library — whether it be big or small — needs to address these issues lest it wastes both its time and money.

Here are a few examples outlining failures to communicate:

MARC – MARC is a data structure. The first 24 characters are called the leader. The second section is called the directory, and the third section is intended to contain bibliographic data. The whole thing is sprinkled with ASCII characters 29, 30, and 31 denoting the ends of fields, subfields, and the record itself. MARC does not denote the kinds of data it contains. Yet, many catalogers say they know MARC. Instead, what they really know are sets of rules defining what goes into the first and third sections of the data structure. These rules are known as AACR2/RDA. Computer staff see MARC (and MARCXML) as a data structure. Librarians see MARC as the description of an item akin to a catalog card.
Databases & indexes – Databases & indexes are two sides of the same information retrieval coin. “True” databases are usually relational in nature and normalized accordingly. “False” databases are flat files — simple tables akin to Excel spreadsheets. Librarians excel (no puns intended) at organizing information, and this usually manifests itself through the creation of various lists. Lists of books. Lists of journals. Lists of articles. Lists of authoritative names. Lists of websites. Etc. In today’s world, the most scalable way to maintain lists is through the use of a database, yet most librarians wouldn’t be able to draw an entity relationship diagram — the literal illustration of a database’s structure — to save their lives. With advances in computer technology, the problem of find is no longer solved through the searching of databases but instead through the creation of an index. In reality, modern indexes are nothing more than enhancements of traditional back-of-the-book indexes — lists of words and associated pointers to where those words can be found in a corpus. Computer staff see databases as MySQL and indexes as Solr. Librarians see databases as a matrix of rows & columns, and the searching of databases in a light of licensed content such as JSTOR, Academic Search Primer, or New York Times.
Collections – Collections, from the point of view of a librarian, are sets of curated items with a common theme. Taken as a whole, these collections embody a set of knowledge or a historical record intended for use by students & researchers for the purposes of learning & scholarship. The physical arrangment of the collection — especially in archives — as well as the intellectual arrangment of the collection is significant because they bring together like items or represent the development of an idea. This is why libraries have classification schemes and archives physically arrange their materials in the way they do. Unfortunately, computer staff usually do not understand the concept of “curation” and usually see the arrangements of books — classification numbers — as rather arbitrary.
Services – Many librarians see the library profession as being all about service. These services range from literacy programs to story hours. They range from the answering of reference questions to the circulation of books. They include social justice causes, stress relievers during exam times, and free access to computers with Internet connections. Services are important because the provide the means for an informed public, teaching & learning, and the improvement society in general. Many of these concepts are not in the forefront of the minds of computer staff. Instead, their idea of service is making sure the email system works, people can log into their computers, computer hardware & software are maintained, and making sure the connections to the Internet are continual.

room with a view As a whole, what the profession does not understand is that everybody working in a library has more things in common than differences. Everybody is (suppose to be) working towards the same set of goals. Everybody plays a part in achieving those goals, and it behooves everybody to learn & respect the roles of everybody else. A goal is to curate collections. This is done through physical, intellectual, and virtual arrangment, but it also requires the use of computer technology. Collection managers need to understand more of the computer technology, and the technologist needs to understand more about curation. The application of AACR2/RDA is an attempt to manifest inventory and the dissemination of knowledge. The use of databases & indexes also manifest inventory and dissemination of knowledge. Catalogers and database administrators ought to communicate on the similar levels. Similarly, there is much more to preservation of materials than putting bits on tape. “Yikes!”

What is the solution to these problems? In my opinion, there are many possibilities, but the solution ultimately rests with individuals willing to take the time to learn from their co-workers. It rests in the ability to respect — not merely tolerate — another point of view. It requires time, listening, discussion, reflection, and repetition. It requires getting to know other people on a personal level. It requires learning what others like and dislike. It requires comparing & contrasting points of view. It demands “walking a mile in the other person’s shoes”, and can be accomplished by things such as the physical intermingling of departments, cross-training, and simply by going to coffee on a regular basis.

Again, all of us working in libraries have more similarities than differences. Learn to appreciate the similarities, and the differences will become insignificant. The consequence will be a more holistic set of library collections and services.

Posted in Librarianship | Comments Off on Failure to communicate

Using BIBFRAME for bibliographic description

March 6th, 2016

Bibliographic description is an essential process of librarianship. In the distant past this process took the form of simple inventories. In the last century we saw bibliographic description evolve from the catalog card to the MARC record. With the advent of globally networked computers and the hypertext transfer protocol, we are seeing the emergence of a new form of description called BIBFRAME which is based on the principles of RDF (Resource Description Framework). This essay describes, illustrates, and demonstrates how BIBFRAME can be used to fulfill the promise and purpose of bibliographic description.†

Librarianship as collections & services

Philadelphia Flowers Libraries are about a number of things. Some of those things surround the collection and preservation of materials, most commonly books. Some of those things surround services, most commonly the lending of books.†† But it is asserted here that collections are not really about books nor any other physical medium because those things are merely the manifestation of the real things of libraries: data, information, and knowledge. It is left to another essay as to the degree libraries are about wisdom. Similarly, the primary services of libraries are not really about the lending of materials, but instead the services surround learning and intellectual growth. Librarians cannot say they have lent somebody a book and conclude they have done their job. No, more generally, libraries provide services enabling the reader to use & understand the content of acquired materials. In short, it is asserted that libraries are about the collection, organization, preservation, dissemination, and sometimes evaluation of data, information, knowledge, and sometimes wisdom.

With the advent of the Internet the above definition of librarianship is even more plausible since the materials of libraries can now be digitized, duplicated (almost) exactly, and distributed without diminishing access to the whole. There is no need to limit the collection to physical items, provide access to the materials through surrogates, nor lend the materials. Because these limitations have been (mostly) removed, it is necessary for libraries to think differently their collections and services. To the author’s mind, librarianship has not shifted fast enough nor far enough. As a long standing and venerable profession, and as an institution complete with its own set of governance, diversity, and shear size, change & evolution happen very slowly. The evolution of bibliographic description is a perfect example.

Bibliographic description: an informal history

Bibliographic description happens in between the collections and services of libraries, and the nature of bibliographic description has evolved with technology. Think of the oldest libraries. Think clay tablets and papyrus scrolls. Think of the size of library collections. If a library’s collection was larger than a few hundred items, then the library was considered large. Still, the collections were so small that an inventory was relatively easy for sets of people (librarians) to keep in mind.

Think medieval scriptoriums and the development of the codex. Consider the time, skill, and labor required to duplicate an item from the collection. Consequently, books were very expensive but now had a much longer shelf life. (All puns are intended.) This increased the size of collections, but remembering everything in a collection was becoming more and more difficult. This, coupled with the desire to share the inventory with the outside world, created the demand for written inventories. Initially, these inventories were merely accession lists — a list of things owned by a library and organized by the date they were acquired.

With the advent of the printing press, even more books were available but at a much lower cost. Thus, the size of library collections grew. As it grew it became necessary to organize materials not necessarily by their acquisition date nor physical characteristics but rather by various intellectual qualities — their subject matter and usefulness. This required the librarian to literally articulate and manifest things of quality, and thus the profession begins to formalize the process of analytics as well as supplement their inventory lists with this new (which is not really new) information.

Consider some of the things beginning in the 18th and 19th centuries: the idea of the “commons”, the idea of the informed public, the idea of the “free” library, and the size of library collections numbering 10’s of thousands of books. These things eventually paved the way in the 20th century to open stacks and the card catalog — the most recent incarnation of the inventory list written in its own library short-hand and complete with its ever-evolving controlled vocabulary and authority lists — becoming available to the general public. Computers eventually happen and so does the MARC record. Thus, the process of bibliographic description (cataloging) literally becomes codified. The result is library jargon solidified in an obscure data structure. Moreover, in an attempt to make the surrogates of library collections more meaningful, the information of bibliographic description bloats to fill much more than the traditional three to five catalog cards of the past. With the advent of the Internet comes less of a need for centralized authorities. Self-service and connivence become the norm. When was the last time you used a travel agent to book airfare or reserve a hotel room?

Librarianship is now suffering from a great amount of reader dissatisfaction. True, most people believe libraries are “good things”, but most people also find libraries difficult to use and not meeting their expectations. People search the Internet (Google) for items of interest, and then use library catalogs to search for known items. There is then a strong desire to actually get the item, if it is found. After all, “Everything in on the ‘Net”. Right? To this author’s mind, the solution is two-fold: 1) digitize everthing and put the result on the Web, and 2) employ a newer type of bibliographic description, namely RDF. The former is something for another time. The later is elaborated upon below.

Resource Description Framework

Resource Description Framework (RDF) is essentially relational database technology for the Internet. It is comprised of three parts: keys, relationships, and values. In the case of RDF and akin to relational databases, keys are unique identifiers and usually in the form of URIs (now called “IRIs” — Internationalized Resource Identifiers — but think “URL”). Relationships take the form of ontologies or vocabularies used to describe things. These ontologies are very loosely analogous to the fields in a relational database table, and there are ontologies for many different sets of things, including the things of a library. Finally, the values of RDF can also be URIs but are ultimately distilled down to textual and numeric information.

RDF is a conceptual model — a sort of cosmology for the universe of knowledge. RDF is made real through the use of “triples”, a simple “sentence” with three distinct parts: 1) a subject, 2) a predicate, and 3) an object. Each of these three parts correspond to the keys, relationships, and values outlined above. To extend the analogy of the sentence further, think of subjects and objects as if they were nouns, and think of predicates as if they were verbs. And here is a very important distinction between RDF and relational databases. In relational databases there is the idea of a “record” where an identifier is associated with a set of values. Think of a book that is denoted by a key, and the key points to a set of values for titles, authors, publishers, dates, notes, subjects, and added entries. In RDF there is no such thing as the record. Instead there are only sets of literally interlinked assertions — the triples.

Triples (sometimes called “statements”) are often illustrated as arced graphs where subjects and objects are nodes and predicates are lines connecting the nodes:

[ subject ] --- predicate ---> [ object ]

The “linking” in RDF statements happens when sets of triples share common URIs. By doing so, the subjects of statements end up having many characteristics, and the objects of URIs point to other subjects in other RDF statements. This linking process transforms independent sets of RDF statements into a literal web of interconnections, and this is where the Semantic Web gets its name. For example, below is a simple web of interconnecting triples:

              / --- a predicate ---------> [ an object ]
[ subject ] - | --- another predicate ---> [ another object ]
              \ --- a third predicate ---> [ a third object ]
                                                   |
                                                   |
                                          yet another predicate
                                                   |
                                                   |
                                                  \ /

                                         [ yet another object ]

An example is in order. Suppose there is a thing called Rome, and it will be represented with the following URI: http://example.org/rome. We can now begin to describe Rome using triples:

subjects                 predicates         objects
-----------------------  -----------------  -------------------------
http://example.org/rome  has name           "Rome"
http://example.org/rome  has founding date  "1000 BC"
http://example.org/rome  has description    "A long long time ago,..."
http://example.org/rome  is a type of       http://example.org/city
http://example.org/rome  is a sub-part of   http://example.org/italy

The corresponding arced graph would look like this:

                               / --- has name ------------> [ "Rome" ]
                              |  --- has description -----> [ "A long time ago..." ]
[ http://example.org/rome ] - |  --- has founding date ---> [ "1000 BC" ]
                              |  --- is a sub-part of  ---> [ http://example.org/italy ]
                               \ --- is a type of --------> [ http://example.org/city ]

In turn, the URI http://example.org/italy might have a number of relationships asserted against it also:

subjects                  predicates         objects
------------------------  -----------------  -------------------------
http://example.org/italy  has name           "Italy"
http://example.org/italy  has founding date  "1923 AD"
http://example.org/italy  is a type of       http://example.org/country
http://example.org/italy  is a sub-part of   http://example.org/europe

Now suppose there were things called Paris, London, and New York. They can be represented in RDF as well:

subjects                    predicates          objects
--------------------------  -----------------   -------------------------
http://example.org/paris    has name            "Paris"
http://example.org/paris    has founding date   "100 BC"
http://example.org/paris    has description     "You see, there's this tower..."
http://example.org/paris    is a type of        http://example.org/city
http://example.org/paris    is a sub-part of    http://example.org/france
http://example.org/london   has name            "London"
http://example.org/london   has description     "They drink warm beer here."
http://example.org/london   has founding date   "100 BC"
http://example.org/london   is a type of        http://example.org/city
http://example.org/london   is a sub-part of    http://example.org/england
http://example.org/newyork  has founding date   "1640 AD"
http://example.org/newyork  has name            "New York"
http://example.org/newyork  has description     "It is a place that never sleeps."
http://example.org/newyork  is a type of        http://example.org/city
http://example.org/newyork  is a sub-part of    http://example.org/unitedstates

Furthermore, each of “countries” can be have relationships denoted against them:

subjects                         predicates         objects
-------------------------------  -----------------  -------------------------
http://example.org/unitedstates  has name           "United States"
http://example.org/unitedstates  has founding date  "1776 AD"
http://example.org/unitedstates  is a type of       http://example.org/country
http://example.org/unitedstates  is a sub-part of   http://example.org/northamerica
http://example.org/england       has name           "England"
http://example.org/england       has founding date  "1066 AD"
http://example.org/england       is a type of       http://example.org/country
http://example.org/england       is a sub-part of   http://example.org/europe
http://example.org/france        has name           "France"
http://example.org/france        has founding date  "900 AD"
http://example.org/france        is a type of       http://example.org/country
http://example.org/france        is a sub-part of   http://example.org/europe

The resulting arced graph of all these triples might look like this:

[IMAGINE A COOL LOOKING ARCED GRAPH HERE.]

From this graph, new information can be inferred as long as one is able to trace connections from one node to another node through one or more arcs. For example, using the arced graph above, questions such as the following can be asked and answered:

What things are denoted as types of cities, and what are their names?
What is the oldest city?
What cities were founded after the year 1 AD?
What countries are sub-parts of Europe?
How would you describe Rome?

In summary, RDF is data model — a method for organizing discrete facts into a coherent information system, and to this author, this sounds a whole lot like a generalized form of bibliographic description and a purpose of library catalogs. The model is built on the idea of triples whose parts are URIs or literals. Through the liberal reuse of URIs in and between sets of triples, questions surrounding the information can be answered and new information can be inferred. RDF is the what of the Semantic Web. Everything else (ontologies & vocabularies, URIs, RDF “serializations” like RDF/XML, triple stores, SPARQL, etc.) are the how’s. None of them will make any sense unless the reader understands that RDF is about establishing relationships between data for the purposes of sharing information and increasing the “sphere of knowledge”.

Linked data

Linked data is RDF manifested. It is a process of codifying triples and systematically making them available on the Web. It first involves selecting, creating (“minting”), and maintaining sets of URIs denoting the things to be described. When it comes to libraries, there are many places where authoritative URIs can be gotten including: OCLC’s Worldcat, the Library of Congress’s linked data services, Wikipedia, institutional repositories, or even licensed indexes/databases.

Second, manifesting RDF as linked data involves selecting, creating, and maintaining one or more ontologies used to posit relationships. Like URIs, there are many existing bibliographic ontologies for the many different types of cultural heritage institutions: libraries, archives, and museums. Example ontologies include but are by no means limited to: BIBFRAME, bib.schema.org, the work of the (aged) LOCAH project, EAC-CPF, and CIDOC CRM.

The third step to implementing RDF as linked data is to actually create and maintain sets of triples. This is usually done through the use of a “triple store” which is akin to a relational database. But remember, there is no such thing as a record when it comes to RDF! There are a number of not a huge number of toolkits and applications implementing triple stores. 4store is (or was) a popular open source triple store implementation. Virtuoso is another popular implementation that comes in both open sources as well as commercial versions.

The forth step in the linked data process is the publishing (making freely available on the Web) of RDF. This is done in a combination of two ways. The first is to write a report against the triple store resulting in a set of “serializations” saved at the other end of a URL. Serializations are textual manifestations of RDF triples. In the “old days”, the serialization of one or more triples was manifested as XML, and might have looked something like this to describe the Declaration of Independence and using the Dublin Core and FOAF (Friend of a friend) ontologies:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
  <dcterms:creator>
	<foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
	  <foaf:gender>male</foaf:gender>
	</foaf:Person>
  </dcterms:creator>
</rdf:Description>
</rdf:RDF>

Many people think the XML serialization is too verbose and thus difficult to read. Consequently other serializations have been invented. Here is the same small set of triples serialized as N-Triples:

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix dcterms: <http://purl.org/dc/terms/>.
<http://en.wikipedia.org/wiki/Declaration_of_Independence> dcterms:creator <http://id.loc.gov/authorities/names/n79089957>.
<http://id.loc.gov/authorities/names/n79089957> a foaf:Person;
  foaf:gender "male".

Here is yet another example, but this time serialized as JSON, a data structure first implemented as a part of the Javascript language:

{
"http://en.wikipedia.org/wiki/Declaration_of_Independence": {
  "http://purl.org/dc/terms/creator": [
	{
	  "type": "uri", 
	  "value": "http://id.loc.gov/authorities/names/n79089957"
	}
  ]
}, 
 "http://id.loc.gov/authorities/names/n79089957": {
   "http://xmlns.com/foaf/0.1/gender": [
	 {
	   "type": "literal", 
	   "value": "male"
	 }
   ], 
   "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
	 {
	   "type": "uri", 
	   "value": "http://xmlns.com/foaf/0.1/Person"
	 }
   ]
 }
}

RDF has even been serialized in HTML files by embedding triples into attributes. This is called RDFa, and a snippet of RDFa might look like this:

<div xmlns="http://www.w3.org/1999/xhtml"
  prefix="
    foaf: http://xmlns.com/foaf/0.1/
    rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
    dcterms: http://purl.org/dc/terms/
    rdfs: http://www.w3.org/2000/01/rdf-schema#"
</div>
<div typeof="rdfs:Resource" about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
  <div rel="dcterms:creator">
    <div typeof="foaf:Person" about="http://id.loc.gov/authorities/names/n79089957">
      <div property="foaf:gender" content="male"></div>
    </div>
  </div>
</div>
</div>

Once the RDF is serialized and put on the Web, it is intended to be harvested by Internet spiders and robots. They cache the data locally, read it, and update their local triples stores. This data is then intended to be analyzed, indexed, and used to find or discover new relationships or knowledge.

The second way of publishing linked data is through a “SPARQL endpoint”. SPARQL is a query language very similar to the query language of relational databases (SQL). SPARQL endpoints are usually Web-accesible interfaces allowing the reader to search the underlying triple store. The result is usually a stream of XML. Admitted, SPARQL is obtuse at the very least.

Just like the published RDF, the output of SPARQL queries can be serialized in many different forms. And just like relational databases, triple stores and SPARQL queries are not intended to be used directly by the reader. Instead, something more friendly (but ultimately less powerful and less flexible) is always intended.

So what does this have to do with libraries and specifically bibliographic description? The answer is not that complicated. The what of librarianship has not really changed over the millenium. Librarianship is still about processes of collection, organization, preservation, dissemination, and sometimes evaluation. On the other hand, with the evolution of technology and cultural expectations, the how’s of librarianship have changed dramatically. Considering the current environment, it is time to evolve, yet again. The next evolution is the employment of RDF and linked data as the means of bibliographic description. By doing so the data, information, and knowledge contained in libraries will be more accessible and more useful to the wider community. As time has gone on, the data and metadata of libraries has become less and less librarian-centric. By taking the leap to RDF and linked data, this will only become more true, and this is a good thing for both libraries and the people they serve.

BIBFRAME

Enter BIBFRAME, an ontology designed for libraries and their collections. It is not the only ontology intended to describe libraries and their collections. There are other examples as well, notably, bib.schema.org, FRBR for RDF, MODS and MADS for RDF, and to some extent, Dublin Core. Debates rage on mailing lists regarding the inherent advantages & disadvantages of each of these ontologies. For the most part, the debates seem to be between BIBFRAME, bib.schema.org, and FRBR for RDF. BIBFRAME is sponsored by the Library of Congress and supported by a company called Zepheira. At its very core are the ideas of a work and its instance. In other words, BIBFRAME boils the things of libraries down to two entities. Bib.schema.org is a subset of schema.org, an ontology endorsed by the major Internet search engines (Google, Bing, and Yahoo). And since schema.org is designed to enable the description of just about anything, the implementation of bib.schema.org is seen as a means of reaching the widest possible audience. On the other hand, bib.schema.org is not always seen as being as complete as BIBFRAME. The third contender is FRBR for RDF. Personally, the author has not seen very many examples of its use, but it purports to better serve the needs/desires of the reader through the concepts of WEMI (Work, Expression, Manifestation, and Item).

That said, it is in this author’s opinion, that the difference between the various ontologies is akin to debating the differences between vanilla and chocolate ice cream. It is a matter of opinion, and the flavors are not what is important, but rather it is the ice cream itself. Few people outside libraries really care which ontology is used. Besides, each ontology includes predicates for the things everybody expects: titles, authors, publishers, dates, notes, subjects/keywords, added entries, and locations. Moreover, in this time of transition, it is not feasible to come up with the perfect solution. Instead, this evolution is an iterative process. Give something a go. Try it for a limited period of time. Evaluate. And repeat. We also live in a world of digital data and information. This data and information is, by its very nature, mutable. There is no reason why one ontology over another needs to be debated ad nauseum. Databases (triple stores) support the function of find/replace with ease. If one ontology does not seem to be meeting the desired needs, then (simply) change to another one.††† In short, BIBFRAME may not be the “best” ontology, but right now, it is good enough.

Workflow

Now that the fundamentals have been outlined and elaborated upon, a workflow can be articulated. At the risk of mixing too many metaphors, here is a “recipe” for doing bibliographic description using BIBFRAME (or just about any other bibliographic ontology):

Answer the questions, “What is bibliographic description, and how does it help facilitate the goals of librarianship?”
Understand the concepts of RDF and linked data.
Embrace & understand the strengths & weaknesses of BIBFRAME as a model for bibliographic description.
Design or identify and then install a system for creating, storing, and editing your bibliographic data. This will be some sort of database application whether it be based on SQL, non-SQL, XML, or a triple store. It might even be your existing integrated library system.
Using the database system, create, store, import/edit your bibliographic descriptions. For example, you might simply use your existing integrated library for these purposes, or you might transform your MARC data into BIBFRAME and pour the result into a triple store, like this:
1. Dump MARC records
2. Transform MARC into BIBFRAME
3. Pour the result into a triple-store
4. Sort the triples according to the frequency of literal values
5. Find/replace the most frequently found literals with URIs††††
6. Go to Step #D until tired
7. Use the triple-store to create & maintain ongoing bibliographic description
8. Go to Step #D
Expose your bibliographic description as linked data by writing a report against the database system. This might be as simple as configuring your triple store, or as complicated as converting MARC/AACR2 from your integrated library system to BIBFRAME.
Facilitate the discovery process, ideally through the use of linked data publishing and SPARQL, or directly against the integrated library system.
Go to Step #5 on a daily basis.
Go to Step #1 on an annual basis.

If the profession continues to use its existing integrated library systems for maintaining bibliographic data (Step #4), then the hard problem to solve is transforming and exposing the bibliographic data as linked data in the form of the given ontology. If the profession designs a storage and maintenance system rooted in the given ontology to begin with, then the problem is accurately converting existing data into the ontology and then designing mechanisms for creating/editing the data. The later option may be “better”, but the former option seems less painful and requires less retooling. This author advocates the “better” solution.

After a while, such a system may enable a library to meet the expressed needs/desires of its constituents, but it may present the library with a different set of problems. On one hand, the use of RDF as the root of a discovery system almost literally facilitates a “Web of knowledge”. But on the other hand, to what degree can it be used to do (more mundane) tasks such as circulation and acquisitions? One of the original purposes of bibliographic description was to create a catalog — an inventory list. Acquisitions adds to the list, and circulation modifies the list. To what degree can the triple store be used to facilitate these functions? If the answer is “none”, then there will need to be some sort of outside application interfacing with the triple store. If the answer is “a lot”, then the triple store will need to include an ontology to facilitate acquisitions and circulation.

Prototypical implementation

In the spirit of putting the money where the mouth is, the author has created the most prototypical and toy implementations possible. It is merely a triple store filled with a tiny set of automatically transformed MARC records and made publicly accessible via SPARQL. The triple store was built using a set of Perl modules called Redland. The system supports initialization of a triple store, the adding of items to the store via files saved on a local file system, rudimentary command-line search, a way to dump the contents of the triple store in the form of RDF/XML, and a SPARQL endpoint. [1] Thus, Step #4 from the recipe above has been satisfied.

To facilitate Step #5 a MARC to BIBFRAME transformation tool was employed [2]. The transformed MARC data was very small, and the resulting serialized RDF was valid. [3, 4] The RDF was imported into the triple store and resulted in the storage of 5,382 triples. Remember, there is no such thing as a record in the world of RDF! Using the SPARQL endpoint, it is now possible to query the triple store. [5] For example, the entire store can be dumped with this (dangerous) query:

# dump of everything
SELECT ?s ?p ?o 
WHERE { ?s ?p ?o }

To see what types of things are described one can list only the objects (classes) of the store:

# only the objects
SELECT DISTINCT ?o
WHERE { ?s a ?o }
ORDER BY ?o

To get a list of all the store’s properties (types of relationships), this query is in order:

# only the predicates
SELECT DISTINCT ?p
WHERE { ?s ?p ?o }
ORDER BY ?p

BIBFRAME denotes the existence of “Works”, and to get a list of all the works in the store, the following query can be executed:

# a list of all BIBFRAME Works
SELECT ?s 
WHERE { ?s a <http://bibframe.org/vocab/Work> }
ORDER BY ?s

This query will enumerate and tabulate all of the topics in the triple store. Thus providing the reader with an overview of the breadth and depth of the collection in terms of subjects. The output is ordered by frequency:

# a breadth and depth of subject analsysis
SELECT ( COUNT( ?l ) AS ?c ) ?l
WHERE {
  ?s a <http://bibframe.org/vocab/Topic> . 
  ?s <http://bibframe.org/vocab/label> ?l
}
GROUP BY ?l
ORDER BY DESC( ?c )

All of the information about a specific topic in this particular triple store can be listed in this manner:

# about a specific topic
SELECT ?p ?o 
WHERE { <http://bibframe.org/resources/Ssh1456874771/vil_134852topic10> ?p ?o }

The following query will create the simplest of title catalogs:

# simple title catalog
SELECT ?t ?w ?c ?l ?a
WHERE {
  ?w a <http://bibframe.org/vocab/Work>           .
  ?w <http://bibframe.org/vocab/workTitle>    ?wt .
  ?wt <http://bibframe.org/vocab/titleValue>  ?t  .
  ?w <http://bibframe.org/vocab/creator>      ?ci .
  ?ci <http://bibframe.org/vocab/label>       ?c  .
  ?w <http://bibframe.org/vocab/subject>      ?s  .
  ?s <http://bibframe.org/vocab/label>        ?l  .
  ?s <http://bibframe.org/vocab/hasAuthority> ?a
}
ORDER BY ?t

The following query is akin to a phrase search. It looks for all the triples (not records) containing a specific key word (catholic):

# phrase search
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
  FILTER REGEX ( ?o, 'catholic', 'i' )
}
ORDER BY ?p

Automatically transformed MARC data into BIBFRAME RDF will contain a preponderance of literal values when URIs are really desired. The following query will find all of the literals and sort them by the number of their individual occurrences:

# find all literals
SELECT ?p ?o ( COUNT ( ?o ) as ?c )
WHERE { ?s ?p ?o FILTER ( isLiteral ( ?o ) ) }
GROUP BY ?o 
ORDER BY DESC( ?c )

It behooves the cataloger to identify URIs for these literal values and replace the literals (or supplement) the triples accordingly (Step #5E in the recipe, above). This can be accomplished both programmatically as well as manually by first creating a list of appropriate URIs and then executing a set of INSERT or UPDATE commands against the triple store.

“Blank nodes” (URIs that point to nothing) are just about as bad as literal values. The following query will list all of the blank nodes in a triple store:

# find all blank nodes
SELECT ?s ?p ?o WHERE { ?s ?p ?o FILTER ( isBlank( ?s ) ) }

And the data associated with a particular blank node can be queried in this way:

# learn about a specific blank node
SELECT distinct ?p WHERE { _:r1456957120r7483r1 ?p ?o } ORDER BY ?p

In the case of blank nodes, the cataloger will then want to “mint” new URIs and perform an additional set of INSERT or UPDATE operations against the underlying triple store. This is a continuation of Step #5E.

These SPARQL queries applied against this prototypical implementation have tried to illustrate how RDF can fulfill the needs and requirements of bibliographic description. One can now begin to see how an RDF triple store employing a bibliographic ontology can be used to fulfill some of the fundamental goals of a library catalog.

Summary

This essay defined librarianship as a set of interlocking collections and services. Bibliographic description was outlined in an historical context, with the point being that the process of bibliographic description has evolved with technology and cultural expectations. The principles of RDF and linked data were then described, and the inherent advantages & disadvantages of leading bibliographic RDF ontologies were touched upon. The essay then asserted the need for faster evolution regarding bibliographic description and advocated the use of RDF and BIBFRAME for this purpose. Finally, the essay tried to demonstrate how RDF and BIBFRAME can be used to satisfy the functionality of the library catalog. It did this through the use of a triple store and a SPARQL endpoint. In the end, it is hoped the reader understands that there is no be-all end-all solution for bibliographic description, but the use of RDF technology is the wave of the future, and BIBFRAME is good enough when it comes to the ontology. Moving to the use of RDF for bibliographic description will be painful for the profession, but not moving to RDF will be detrimental.

Notes

† This presentation ought to be also be available as a one-page handout in the form of a PDF document.

†† Moreover, collections and services go hand-in-hand because collections without services are useless, and services without collections are empty. As a buddhist monk once said, “Collections without services is the sound of one hand clapping.” Librarianship requires a healthy balance of both.

††† That said, no matter what a person does, things always get lost in translation. This is true of human language just as much as it is true for the language (data/information) of computers. Yes, data & information will get lost when moving from one data model to another, but still I contend the fundamental and most useful elements will remain.

†††† This process (Step #5E) was coined by Roy Tennant and his colleagues at OCLC as “entification”.

Links

[1] toy implementation – http://infomotions.com/sandbox/bibframe/
[2] MARC to BIBFRAME – http://bibframe.org/tools/transform/start
[3] sample MARC data – http://infomotions.com/sandbox/bibframe/data/data.xml
[4] sample RDF data – http://infomotions.com/sandbox/bibframe/data/data.rdf
[5] SPARQL endpoint – http://infomotions.com/sandbox/bibframe/sparql/

Posted in Semantic Web and Linked Data | 2 Comments »

XML 101

January 6th, 2016

This past Fall I taught “XML 101” online and to library school graduate students. This posting echoes the scripts of my video introductions, and I suppose this posting could also be used as very gentle introduction to XML for librarians.

Introduction

another field I work at the University of Notre Dame, and my title is Digital Initiatives Librarian. I have been a librarian since 1987. I have been writing software since 1976, and I will be your instructor. Using materials and assignments created by the previous instructors, my goal is to facilitate your learning of XML.

XML is a way of transforming data into information. It is a method for marking up numbers and text, giving them context, and therefore a bit of meaning. XML includes syntactical characteristics as well as semantic characteristics. The syntactical characteristics are really rather simple. There are only five or six rules for creating well-formed XML, such as: 1) there must be one and only one root element, 2) element names are case-sensitive, 3) elements must be close properly, 4) elements must be nested properly, 4) attributes must be quoted, and 5) there are a few special characters (&, <, and >) which must be escaped if they are to be used in their literal contexts. The semantics of XML is much more complicated and they denote the intended meaning of the XML elements and attributes. The semantics of XML are embodied in things called DTDs and schemas.

Again, XML is used to transform data into information. It is used to give data context, but XML is also used to transmit this information in an computer-independent way from one place to another. XML is also a data structure in the same way MARC, JSON, SQL, and tab-delimited files are data structures. Once information is encapsulated as XML, it can unambiguously transmitted from one computer to another where it can be put to use.

This course will elaborate upon these ideas. You will learn about the syntax and semantics of XML in general. You will then learn how to manipulate XML using XML-related technologies called XPath and XSLT. Finally, you will learn library-specific XML “languages” to learn how XML can be used in Library Land.

Well-formedness

In this, the second week of “XML 101 for librarians”, you will learn about well-formed XML and valid XML. Well-formed XML is XML that conforms to the five or six syntactical rules. (XML must have one and only one root element. Element names are case sensitive. Elements must be closed. Elements must be nested correctly. Attributes must be quoted. And there are a few special characters that must be escaped (namely &, <, and >). Valid XML is XML that is not only well-formed but also conforms to a named DTD or schema. Think of valid XML as semantically correct.

Jennifer Weintraub and Lisa McAulay, the previous instructors of this class, provide more than a few demonstrations of how to create well-formed as well as valid XML. Oxygen, the selected XML editor for this course is both powerful and full-featured, but using it efficiently requires practice. That’s what the assignments are all about. The readings supplement the demonstrations.

DTD’s and namespaces

DTD’s, schemas, and namespaces put the “X” in XML. They make XML extensible. They allow you to define your own elements and attributes to create your own “language”.

DTD’s — document type declarations — and schemas are the semantics of XML. They define what elements exists, what order they appear in, what attributes they can contain, and just as importantly what the elements are intended to contain. DTD’s are older than schemas and not as robust. Schemas are XML documents themselves and go beyond DTD’s in that they provide the ability to define the types of data elements and attributes contain.

Namespaces allow you, the author, to incorporate multiple DTD and schema definitions into a single XML document. Namespaces provide a way for multiple elements of the same name to exist concurrently in a document. For example, two different DTD’s may contain an element called “title”, but one DTD refers to a title as in the title of a book, and the other refers to “title” as if it were an honorific.

Schemas

Schemas are an alternative and more intelligent alternative to DTDs. While DTDs define the structure of XML documents, schemas do it with more exactness. While DTDs only allow you to define elements, the number of elements, the order of elements, attributes, and entities, schemas allow you to do these things and much more. For example, they allow you to define the types of content that go into elements or attributes. Strings (characters). Numbers. Lists of characters or numbers. Boolean (true/false) values. Dates. Times. Etc. Schemas are XML documents in an of themselves, and therefore they can be validated just like any other XML document with a pre-defined structure.

The reading and writing of XML schemas is very librarian-ish because it is about turning data into information. It is about structuring data so it makes sense, and it does this in an unambiguous and computer-independent fashion. It is too bad our MARC (bibliographic) standards are not as rigorous.

RelaxNG, Schematron, and digital libraries

fields The first is yet another technology for modeling your XML, and it is called RelaxNG. This third modeling technology is intended to be more human readable than schemas and more robust that DTDs. Frankly, I have not seen RelaxNG implements very many times, but it behooves you to know it exists and how it compares to other modeling tools.

The second is Schematron. This tool too is used to validate XML, but instead of returning “ugly” computer-looking error messages, its errors are intended to be more human-readable and describe why things are the way they are instead of just saying “Wrong!”

Lastly, there is an introduction to digital libraries and trends in their current development. More and more, digital libraries are really and truly implementing the principles of traditional librarianship complete with collection, organization, preservation, and dissemination. At the same time, they are pushing the boundaries of the technology and stretching our definitions. Remember, it is not so much about the technology (the how of librarianship) that is important, but rather the why of libraries and librarianship. The how changes quickly. The why changes slowly, albiet sometimes too slowly.

XPath

This week is all about XPath, and it is used to select content from your XML files. It is akin to navigating a computer’s filesystem from the command line in order to learn what is located in different directories.

XPath is made up of expressions which return values of true, false, strings (characters), numbers, or nodes (subsets of XML files). XPath is used in conjunction with other XML technologies, most notably XSTL and XQuery. XSLT is used to transform XML files into other plain text files. XQuery is akin to the structured query language of relational databases.

You will not be able to do very much with XML other than read or write it, unless you understand XPath. An understanding XPath is essencial if you want to do truly interesting things with XML.

XSLT

This week you will be introduced to XSLT, a programming language used to transform XML into other plain text files.

XML is all about information, and it is not about use nor display. In order for XML to be actually useful — to be applied towards some sort of end — specific pieces of data need to be extracted from XML or the whole of the XML file needs to be converted into something else. The most common conversion (or “transformation”) is from some sort of XML into HTML for display in a Web browser. For example, bibliographic XML (MARCXML or MODS) may be transformed into a sort of “catalog card” for display, or a TEI file may be transformed into a set of Web pages, or an EAD file may be transformed into a guide intended for printing. Alternatively, you may want to tranform the bibliographic data into a tab-delimited text file for a spreadsheet or an SQL file for a relational database. Along with other sets of information, an XML file may contain geographic coordinates, and you may want to extract just those coordinates to create a KML file — a sort of map file.

XSLT is a programming language but not like most programming languages you may know. Most programming languages are “procedural” (like Perl, PHP, or Python), meaning they execute their commands in a step-wise manner. “First do this, then do that, then do the other thing.” This can be contrasted with “declarative” programming languages where events occur or are encountered in a data file, and then some sort of execution happens. There are relatively few declarative programming languages, but LISP is/was one of them. Because of the declarative nature of XSLT, the apply-templates command is so important. The apply-templates command sort of tells the XSLT processor to go off and find more events.

Now that you are beginning to learn XSLT and combining it with XPath, you are beginning to do useful things with the XML you have been creating. This is where the real power is. This is where it gets really interesting.

TEI — Text Encoding Initiative

TEI is a granddaddy, when it comes to XML “languages”. It started out as a different from of mark-up, a mark-up called SGML, and SGML was originally a mark-up language designed at IBM for the purposes of creating, maintaining, and distributing internal documentation. Now-a-days, TEI is all but a hallmark of XML.

TEI is a mark-up language for any type of literature: poetry or prose. Like HTML, it is made up of head and body sections. The head is the place for administrative, bibliographic, and provenance metadata. The body is where the poetry or prose is placed, and there are elements for just about anything you can imagine: paragraphs, lines, headings, lists, figures, marginalia, comments, page breaks, etc. And if there is something you want to mark-up, but an element does not explicitly exist for it, then you can almost make up your own element/attribute combination to suit your needs.

TEI is quite easily the most well-documented XML vocabulary I’ve ever seen. The community is strong, sustainable, albiet small (if not tiny). The majority of the community is academic and very scholarly. Next to a few types of bibliographic XML (MARCXML, MODS, OAIDC, etc.), TEI is probably the most commonly used XML vocabulary in Library Land, with EAD being a close second. In libraries, TEI is mostly used for the purpose of marking-up transcriptions of various kinds: letters, runs of out-of-print newsletters, or parts of a library special collection. I know of no academic journals marked-up in TEI, no library manuals, nor any catalogs designed for printing and distribution.

TEI, more than any other type of XML designed for literature, is designed to support the computed critical analysis of text. But marking something up in TEI in a way that supports such analysis is extraordinarily expensive in terms of both time and expertise. Consequently, based on my experience, there are relatively very few such projects, but they do exist.

XSL-FO

As alluded to throughout this particular module, XSL-FO is not easy, but despite this fact, I sincerely believe it is under-utilized tool.

FO stands for “Formatting Objects”, and it in an of itself is an XML vocabulary used to define page layout. It has elements defining the size of a printed page, margins, running headers & footers, fonts, font sizes, font styles, indenting, pagination, tables of contents, back-of-the-book indexes, etc. Almost all of these elements and their attributes use a syntax similar to the syntax of HTML’s cascading stylesheets.

Once an XML file is converted into an FO document, you are expected to feed the FO document to a FO processor, and the FO processor will convert the document into something intended for printing — usually a PDF document.

FO is important because not everything is designed nor intended to be digital. Digital everything is mis-nomer. The graphic design of a printed medium is different from the graphic design of computer screens or smart phones. In my opinion, important XML files ought to be transformed into different formats for different mediums. Sometimes those mediums are screen oriented. Sometimes it is better to print something, and printed somethings last a whole lot longer. Sometimes it is important to do both.

FO is another good example of what XML is all about. XML is about data and information, not necessarily presentation. XSL transforms data/information into other things — things usually intended for reading by people.

EAD — Encoded Archival Description

Encoded Archival Description (or EAD) is the type of XML file used to enumerate, evaluate, and make accessible the contents of archival collections. Archival collections are often the raw and primary materials of new humanities scholarship. They are usually “the papers” of individuals or communities. They may consist of all sorts of things from letters, photographs, manuscripts, meeting notes, financial reports, audio cassette tapes, and now-a-days computers, hard drives, or CDs/DVDs. One thing, which is very important to understand, is that these things are “collections” and not intended to be used as individual items. MARC records are usually used as a data structure for bibliographically describing individual items — books. EAD files describe an entire set of items, and these descriptions are more colloquially called “finding aids”. They are intended to be read as intellectual works, and the finding aids transform collections into coherent wholes.

Like TEI files, EAD files are comprised of two sections: 1) a header and 2) a body. The header contains a whole lot or very little metadata of various types: bibliographic, administrative, provenance, etc. Some of this metadata is in the form of lists, and some of it is in the form of narratives. More than TEI files, EAD files are intended to be displayed on a computer screen or printed on paper. This is why you will find many XSL files transforming EAD into either HTML or FO (and then to PDF).

RDF

RDF is an acronym for Resource Description Framework. It is a data model intended to describe just about anything. The data model is based on an idea called triples, and as the name implies, the triples have three parts: 1) subjects, 2) predicates, and 3) objects.

Subjects are always URIs (think URLs), and they are the things described. Objects can be URIs or literals (words, phrases, or numbers), and objects are the descriptions. Predicates are also always URIs, and they denote the relationship between the subjects and the objects.

The idea behind RDF was this. Describe anything and everthing in RDF. Resuse as many of the URIs used by other people as possible. Put the RDF on the Web. Allow Internet robots/spiders to harvest and cache the RDF. Allow other computer programs to ingest the RDF, analyse it for the similar uses of subjects, predicates, and objects, and in turn automatically uncover new knowledge and new relationships between things.

RDF is/was originally expressed as XML, but the wider community had two problems with RDF. First, there were no “killer” applications using RDF as input, and second, RDF expressed as XML was seen as too verbose and too confusing. Thus, the idea of RDF languished. More recently, RDF is being expressed in other forms such as JSON and Turtle and N3, but there are still no killer applications.

You will hear the term “linked data” in association with RDF, and linked data is the process of making RDF available on the Web.

RDF is important for libraries and “memory” or “cultural heritage” institutions, because the goal of RDF is very similar to the goals of libraries, archives, and museums.

MARC

waves The MARC standard has been the bibliographic bread & butter of Library Land since the late 1960’s. When it was first implemented it was an innovative and effect data structure used primarily for the production of catalog cards. With the increasing availability of computers, somebody got the “cool” idea of creating an online catalog. While logical, the idea did not mature with a balance of library and computing principles. To make a long story short, library principles prevailed and the result has been and continues to be painful for both the profession as well as the profession’s clientele.

MARCXML was intended to provide a pathway out of this morass, but since it was designed from the beginning to be “round tripable” with the original MARC standard, all of the short-comings of the original standard have come along for the ride. The Library Of Congress was aware of these short-comings, and consequently MODS was designed. Unlike MARC and MARCXML, MODS has no character limit and its field names are human-readable, not based on numeric codes. Given that MODS is flavor of XML, all of this is a giant step forward.

Unfortunately, the library profession’s primary access tools — the online catalog and “discovery system” — still heavily rely on traditional MARC for input. Consequently, without a wholesale shift in library practice, the intellectual capital the profession so dearly wants to share is figuratively locked in the 1960’s.

Not a panacea

XML really is an excellent technology, and it is most certainly apropos for the work of cultural heritage institutions such as libraries, archives, and museums. This is true for many reasons:

it is computing platform independent
it requires a minimum of computer technology to read and write
to some degree, it is self-documenting, and
especially considering our profession, it is all about data, information, and knowlege

On the other hand, it does have a number of disadvantages, for example:

it is verbose — not necessarily succinct
while easy to read and write, it can be difficult to process
like all things computer program-esque, it imposes a set of syntactical rules, which people can sometimes find frustrating
its adoption as standard has not been as ubiquitous as desired

To date you have learned how to read, write, and process XML and a number of its specific “flavors”, but you have by no means learned everything. Instead you have received a more than adequate introduction. Other XML topics of importance include:

evolutions in XSLT and XPath
XML-based databases
XQuery, a standardized method for querying sets of XML similar to the standard query language of relational databases
additional XML vocabularies, most notably RSS
a very functional way of making modern Web browsers display XML files
XML processing instructions as well as reserved attributes like lang

In short, XML is not a panacea, but it is an excellent technology for library work.

Summary

You have all but concluded a course on XML in libraries, and now is a good time for a summary.

First of all, XML is one of culture’s more recent attempts at formalizing knowledge. At its root (all puns intended) is data, such as the number like 1776. Through mark-up we might say this number is a year, thus turning the data into information. By putting the information into context, we might say that 1776 is when the Declaration of Independence was written and a new type of government was formed. Such generalizations fall into the realm of knowledge. To some degree, XML facilitates the transformation of data into knowledge. (Again, all puns intended.)

Second, understand that XML is also a data structure defined by the characteristics of well-formedness. By that I mean XML has one and only one root element. Elements must be opened and closed in a hierarchal manner. Attributes of elements must be quoted, and a few special characters must always be escaped. The X in XML stands for “extensible”, and through the use of DTDs and schemas, specific XML “flavors” can be specified.

With this under your belts you then experimented with at least a couple of XML flavors: TEI and EAD. The former is used to mark-up literature. The later is used to describe archival collections. You then learned about the XML transformation process through the application of XSL and XPath, two rather difficult technologies to master. Lastly, you made strong efforts to apply the principles of XML to the principles of librarianship by marking up sets of documents or creating your own knowledge entity. It is hoped you have made a leap from mere technology to system. It is not about Oxygen nor graphic design. It is about the chemistry of disseminating data as unambiguously as possible for the purposes of increasing the sphere of knowledge. With these things understood, you are better equipped to practice librarianship in the current technological environment.

Finally, remember, there is no such thing as a Dublin Core record.

Epilogue — Use and understanding

iceburg This course in XML was really only an introduction. You were expected to read, write, and transform XML. This process turns data into information. All of this is fine, but what about knowledge?

One of the original reasons texts were marked up was to facilitate analysis. Researchers wanted to extract meaning from texts. One way to do that is to do computational analysis against text. To facilitate computational analysis people thought is was necessary for essential characteristics of a text to be delimited. (It is/was thought computers could not really do natural language processing.) How many paragraphs exists? What are the names in a text? What about places? What sorts of quantitative data can be statistically examined? What main themes does the text include? All of these things can be marked-up in a text and then counted (analyzed).

Now that you have marked up sets of letters with persname elements, you can use XPath to not only find persname elements but count them as well. Which document contains the most persnames? What are the persnames in each document. Tabulate their frequency. Do this over a set of documents to look for trends across the corpus. This is only a beginning, but entirely possible given the work you have already done.

Libraries do not facilitate enough quantitative analysis against our content. Marking things up in XML is a good start, but lets go to the next step. Let’s figure out how the profession can move its readership from discovery to analysis — towards use & understand.

Posted in MARC and technical services | Comments Off on XML 101

Mr. Serials continues

January 6th, 2016

The (ancient) Mr. Serials Process continues to support four mailing list archives, specifically, the archives of ACQNET, Colldv-l, Code4Lib, and NGC4Lib, and this posting simply makes the activity explicit.

flowers Mr. Serials is/was a process I developed quite a number of years ago as a method for collecting, organizing, archiving electronic journals (serials). The process worked well for a number of years, until electronic journals were no longer distributed via email. Now-a-days, Mr. Serials only collects the content of a few mailing lists. That’s okay. Things change. No big deal.

On the other hand, from a librarian’s and archivist’s point-of-view, it is important to collect mailing list content in its original form — email. Email uses the SMTP protocol. The communication sent back and forth, between email server and client, is well-structured albiet becoming verbose. Probably “the” standard for saving email on a file system is called mbox. Given a mbox file, it is possible to use any number of well-known applications to read/write mbox data. Heck, all you need is a text editor. Increasingly, email archives are not available from mailing list applications, and if they are, then they are available only to mailing list administrators and/or in a proprietary format. For example, if you host a mailing list on Google, can you download an archive of the mailing list in a form that is easily and universally readable? I think not.

Mr. Serials circumvents this problem. He subscribes to mailing lists, saves the incoming email to mbox files, and processes the mbox files to create searchable/browsable interfaces. The interfaces are not hugely aesthetically appealing, but they are more than functional, and the source files are readily available. Just ask.

Most recently both the ACQNET and Colldv-l mailing lists moved away from their hosting institutions to servers hosted by the American Library Association. This has not been the first time these lists have moved. It probably won’t be the last, but since Mr. Serials continues subscribe to these lists, comprehensive archives persevere. Score a point for librarianship and the work of archives. Long live Mr. Serials.

Posted in MARC and technical services | Comments Off on Mr. Serials continues

Re-MARCable

November 17th, 2015

This blog posting contains: 1) questions/statements about MARC and posted by graduate library school students taking an online XML class I’m teaching this semester, and 2) my replies. Considering my previously published blog posting, you might say this posting is “re-MARCable”.

I’m having some trouble accessing the file named data.marc for the third question in this week’s assignment. It keeps opening in word and all I get is completely unreadable. Is there another way of going about finding the answer for that particular question?

Okay. I have to admit. I’ve been a bit obtuse about the MARC file format.

MARC is/was designed to contain ASCII characters, and therefore it ought to be human-readable. MARC does not contain binary characters and therefore ought to be readable in text editors. DO NOT open the .marc file in your word processor. Use your text editor to open it up. If you have line wrap turned off, then you ought to see one very long line of ugly text. If you turn on line wrap, then you will see many lines of… ugly text. Attached (hopefully) is a screen shot of many MARC records loaded into my text editor. And I rhetorically ask, “How many records are displayed, and how do you know?”

marc

I’m trying to get y’all to answer a non-rhetorical question asked against yourself, “Considering the state of today’s computer technology, how viable is MARC? What are the advantages and disadvantages of MARC?”

I am taking Basic Cataloging and Classification this semester, but we did not discuss octets or have to look at an actual MARC file. Since this is supposed to be read by a machine, I don’t think this file format is for human consumption which is why it looks scary.

[Student], you continue to be a resource for the entire class. Thank you.

Everybody, yes, you will need to open the .marc file in your text editor. All of the files we are creating in this class ought to be readable in your text editor. True and really useful data files ought to be text files so they can be transferred from application to application. Binary files are sometimes more efficient, but not long-lasting. Here in Library Land we are in it for the long haul. Text files are where it is at. PDF is bad enough. Knowing how to manipulate things in a text editor is imperative when it comes to really using a computer. Imperative!!! Everything on the Web is in plain text.

In any event, open the .marc file in your text editor. On a Macintosh that is Text Edit. On Windows it is NotePad or WordPad. Granted all of these particular text editors are rather brain-dead, but they all function necessarily. A better text editor for Macintosh is Text Wrangler, and for Windows is NotePad++. When you open the .marc file, it will look ugly. It will seem unreadable, but that is not the case at all. Instead, a person needs to know the “secret codes” of cataloging, as well as a bit of an obtuse data structure in order to make sense of the whole thing.

Okay. Octets. Such are 8-bit characters, as opposed to the 7-bit characters of ASCII enclosing. The use of 8-bit characters enabled Library Land to integrate characters such as ñ, é, or å into its data. And while Library Land was ahead of the game in this regard, it did not embrace Unicode when it came along:

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. Developed in conjunction with the Universal Character Set standard and published as The Unicode Standard, the latest version of Unicode contains a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets. [1]

Nor did Library Land update its data when changes happened. Consequently, not only do folks outside Library Land need to know how to read and write MARC records (which they can’t), they also need to know and understand the weird characters encodings which we use. In short, the data of Library Land is not very easily readable by the wider community, let alone very many people within our own community. Now that is irony. Don’t you think so!? Our data is literally and figuratively stuck in 1965, and we continue to put it there.

Professor, is this data.marc file suppose to be read only by a machine as [a fellow classmate] suggested?

Only readable by a computer? The answer is both no and yes.

Any data file intended to be shared between systems (sets of applications) ought to be saved as plain text in order to facilitate transparency and eliminate application monopolies/tyrannies. Considering the time when MARC was designed, it fulfilled these requirements. The characters were 7-bits long (ASCII), the MARC codes were few and far between, and its sequential nature allowed it to be shipped back and forth on things like tape or even a modem. (“Remember modems?”) Without the use of an intermediary computer program, is is entirely possible to read and write a MARC records with a decent text editor. So, the answer is “No, MARC is not only readable by a machine.”

On the other hand, considering how much extra data (“information”) the profession has stuffed into MARC data structure, it is really really hard to edit MARC records with a text editor. Library Land has mixed three things into a single whole: data, presentation, and data structure. This is really bad when it comes to computing. For example, a thing may have been published in 1542, but the cataloger is not certain of this date. Consequently, they will enter a data value of [1542]. Well, that is not a date (a number), but rather a string (a word). To make matters worse, the cataloger may think the date (year) of publication is within a particular decade but not exactly sure, and the date may be entered like as [154?]. Ack! Then let’s get tricky and add a copyright notation to a more recent but uncertain date — [c1986]. Does it never end? Then lets’ talk about the names of people. The venerable Fred Kilgour — founder of OCLC — is denoted in cataloging rules as Kilgour, Fred. Well, I don’t think Kilgour, Fred ever backwards talked so make sure his ideas sortable. Given the complexity of cataloging rules, which never simplify, it is really not feasible to read and write MARC records without an intermediate computer program. So, on the other hand, “Yes, an intermediary computer is necessary.” But if this is true, then why don’t catalogers know to read and write MARC records? The answer lies in what I said above. We have mixed three things into a single whole, and that is a really bad idea. We can’t expect catalogers to be computer programmers too.

The bottom line is this. Library Land automated its processes but it never really went to the next level and used computers to enhance library collections and services. All Library Land has done is used computers to facilitate library practice; Library Land has not embraced the true functionality of computers such as its ability to evaluate data/information. We have simply done the same thing. We wrote catalog cards by hand. We then typed catalog cards. We then used a computer to create them.

One more thing, Library Land simply does not have enough computer programmer types. Libraries build collections. Cool. Libraries provide services against the collections. Wonderful. This worked well (more or less) when libraries were physical entities in a localized environment. Now-a-days, when libraries are a part of a global network, libraries need to speak the global language, and that global language is spoken through computers. Computers use relational databases to organize information. Computers use indexes to make the information findable. Computers use well-structured Unicode files (such XML, JSON, and SQL files) to transmit information from one computer to another. In order to function, people who work in libraries (librarians) need to know these sorts of technologies in order to work on a global scale, but realistically speaking, what percentage of librarians, now how to do these thing, let alone know what they are? Probably less than 10%. It needs to be closer to 33%. Where 33% of the people build collections, 33% of the people provide services, and 33% of the people glue the work of the first 66% into a coherent whole. What to do with the remaining 1%? Call them “administrators”.

[1] Unicode – https://en.wikipedia.org/wiki/Unicode

Posted in MARC and technical services | 2 Comments »

MARC, MARCXML, and MODS

November 11th, 2015

This is the briefest of comparisons between MARC, MARCXML, and MODS. Its was written for a set of library school students learning XML.

MARC is an acronym for Machine Readable Cataloging. It was designed in the 1960’s, and its primary purpose was to ship bibliographic data on tape to libraries who wanted to print catalog cards. Consider the computing context of the time. There were no hard drives. RAM was beyond expensive. And the idea of a relational database had yet to be articulated. Consider the idea of a library’s access tool — the card catalog. Consider the best practice of catalog cards. “Generate no more than four or five cards per book. Otherwise, we will not be able to accommodate all of the cards in our drawers.” MARC worked well, and considering the time, it represented a well-designed serial data structure complete with multiple checksum redundancy.

Someone then got the “cool” idea to create an online catalog from MARC data. The idea was logical but grew without a balance of library and computing principles. To make a long story short, library principles sans any real understanding of computing principles prevailed. The result was a bloating of the MARC record to include all sorts of administrative data that never would have made it on to a catalog card, and this data was delimited in the MARC record with all sorts of syntactical “sugar” in the form of punctuation. Moreover, as bibliographic standards evolved, the previously created data was not updated, and sometimes people simply ignored the rules. The consequence has been disastrous, and even Google can’t systematically parse the bibliographic bread & butter of Library Land.* The folks in the archives community — with the advent of EAD — are so much better off.

Soon after XML was articulated the Library Of Congress specified MARCXML — a data structure designed to carry MARC forward. For the most part, it addressed many of the necessary issues, but since it insisted on making the data in a MARCXML file 100% transformable into a “traditional” MARC record, MARCXML falls short. For example, without knowing the “secret codes” of cataloging — the numeric field names — it is very difficult to determine what are the authors, titles, and subjects of a book.

The folks at the Library Of Congress understood these limitations almost from the beginning, and consequently they created an additional bibliographic standard called MODS — Metadata Object Description Schema. This XML-based metadata schema goes a long way in addressing both the computing times of the day and the needs for rich, full, and complete bibliographic data. Unfortunately, “traditional” MARC records are still the data structure ingested and understood by the profession’s online catalogs and “discovery systems”. Consequently, without a wholesale shift in practice, the profession’s intellectual content is figuratively stuck in the 1960’s.

* Consider the hodgepodge of materials digitized by Google and accessible in the HathiTrust. A search for Walden by Henry David Thoreau returns a myriad of titles, all exactly the same.

Readings

MARC (http://www.loc.gov/marc/bibliographic/bdintro.html) – An introduction to the MARC standard
leader (http://www.loc.gov/marc/specifications/specrecstruc.html#leader) – All about the leader of a traditional MARC record
MARC Must Die (http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/) – An essay by Roy Tennent outlining why MARC is not a useful bibliographic format. Notice when it was written.
MARCXML (https://www.loc.gov/standards/marcxml/marcxml-design.html) – Here are the design considerations for MARCXML
MODS (http://www.loc.gov/standards/mods/userguide/) – This is an introduction to MODS

Exercise

This is much more of an exercise than it is an assignment. The goal of the activity is not to get correct answers but instead to provide a framework for the reader to practice critical thinking against some of the bibliographic standards of the library profession. To the best of your ability, and in the form of an written essay between 500 and 1000 words long, answer and address the following questions based on the contents of the given .zip file:

Measured in characters (octets), what is the maximum length of a MARC record? (Hint: It is defined in the leader of a MARC record.)
Given the maximum length of a MARC record (and therefore a MARCXML record), what are some of the limitations this imposes when it comes to full and complete bibliographic description?
Given the attached .zip file, how many bibliographic items are described in the file named data.marc? How many records are described in the file named data.xml? How many records are described in the file named data.mods? How do did you determine the answers to the previous three questions? (Hint: Open and read the files in your favorite text and/or XML editor.)
What is the title of the book in the first record of data.marc? Who is the author of the second record in the file named data.xml. What are the subjects of the third record in the file named data.mods? How did you determine the answers the previous three questions? Be honest.
Compare & contrast the various bibliographic data structures in the given .zip file. There are advantages and disadvantages to all three.

Posted in MARC and technical services | 1 Comment »

“Sum reflextions” on travel

October 25th, 2015

These are “sum reflextions” on travel; travel is a good thing, for many reasons.

I am blogging in front of the Pantheon. Amazing? Maybe. Maybe not. But the ability to travel, see these sorts of things, experience the different languages and cultures truly is amazing. All too often we live in our own little worlds, especially in the United States. I can’t blame us too much. The United States is geographically large. It borders only two other countries. One country speaks Spanish. The other speaks English and French. While the United States is the proverbial “melting pot”, there really isn’t very much cultural diversity in the United States, not compared to Europe. Moreover, the United States does not nearly have the history of Europe. For example, I am sitting in front of a building that was build before the “New World” was even considered as existing. It doesn’t help that the United States’ modern version of imperialism tends to make “United Statesians” feel as if they are the center of the world. I guess, that is some ways, it is not much different than Imperial Rome. “All roads lead to Rome.”

As you may or may not know, I have commenced upon a sort of leave of absence from my employer. In the past six weeks I have moved all of belongings to a cabin in a remote part of Indiana, and I have moved myself to Chicago. From there I began a month-long adventure. It began in Tuscany where I painted and deepened my knowledge of Western art history. I spent a week in Venice where I did more painting, walked up to my knees in water because the streets flooded, and I experienced Giotto’s frescos in Padua. For the past week I experienced Rome and did my best to actively participate in a users group meeting called ADLUG — the remnants of a user’s group meeting surrounding one of the very first integrated library systems — Dobris Libris. I also painted and rode a bicycle along the Appian Way. I am now on my way to Avignon where I will take a cooking class and continue on a “artist’s education”.

appian way Travel is not easy. It requires a lot of planning and coordination. “Where will I be when, and how will I get there? Once I’m there, what am I going to do, and how will I make sure things don’t go awry?” In this way, travel is not for the fient of heart, especially when venturing into territory where you do not know the language. It can be scary. Nor is travel inexpensive. One needs to maintain two households.

Travel is a kind of education that can not be gotten through the reading of books, the watching of television, nor discussion with other people. It is something that must be experienced first hand. Like sculpture, it is literally an experience that can only exist time & space in order to fully appreciate.

What does this have to do with librarianship? On one hand, nothing. On the other hand, everthing. From my perspective, librarianship is about a number of processes applied against a number of things. These processes include collection, organization, preservation, dissemination, and sometimes evaluation. The things of librarianship are data, information, knowledge, and sometimes wisdom. Even today, with the advent of our globally networked computers, the activities of librarianship remain essentially unchanged when compared to the activities of more than a hundred years ago. Libraries still curate collections, organize the collections into useful sets, provide access to the collections, and endeavor to maintain all of these services for the long haul.

Like most people and travel, many librarians (and people who work in libraries) do not have a true appreciation for the work of their colleagues. Sure, everybody applauds everybody else’s work, but have they actually walked in those other people’s shoes? The problem is most acute between the traditional librarians and the people who write computer programs for libraries. Both sets of people have the same goals; they both want to apply the same processes to the same things, but their techniques for accomplishing those goals are disimilar. One wants to take a train to get where they are going, and other wants to fly. This must change lest the profession become even less relevant.

flowers What is the solution? In a word, travel. People need to mix and mingle with the other culture. Call it cross-training. Have the computer programmer do some traditional cataloging for a few weeks. Have the cataloger learn how to design, implement, and maintain a relational database. Have the computer programmer sit at the reference desk for a while in order to learn about service. Have the reference librarian work with the computer programmer and learn how to index content and make it searchable. Have the computer programmer work in an archive or conservatory making books and saving content in gray cardboard boxes. Have the archivist hang out with computer programmer and learn how content is backed up and restored.

How can all this happen? In my opinion, the most direct solution is advocacy from library administration. Without the blessing of library administration everybody will say, “I don’t have time for such ‘travel’.” Well, library work is never done, and time will need to be carved out and taken from the top, like retirement savings, in order for such trips abroad to come to fruition.

The waiters here at my cafe are getting restless. I have had my time here, and it is time to move on. I will come back, probably in the Spring, and I’ll stay longer. In the meantime, I will continue with my own personal education.

Posted in Travelogues | 1 Comment »

What is old is new again

October 22nd, 2015

The “how’s” of librarianship are changing, but not the “what’s”.

(This is an outline for my presentation given at the ADLUG Annual Meeting in Rome (October 21, 2015). Included here are also the one-page handout and slides, both in the form of PDF documents.)

Linked Data

Linked Data is a method of describing objects, and these objects can be the objects in a library. In this way, Linked Data is a type of bibliographic description.

Linked Data is a manifestation of the Semantic Web. It is an interconnection of virtual sentences known as triples. Triples are rudimentary data structures, and as the name implies, they are made of three parts: 1) subjects, 2) predicates, and 3) objects. Subjects always take the form of a URI (think “URL”), and they point to things real or imaginary. Objects can take the form of a URI or a literal (think “word”, “phrase” or “number”). Predicates also take the form of a URI, and they establish relationships between subjects and objects. Sets of predicates are called ontologies or vocabularies and they present the languages of Linked Data.

simple arced graph

Through the curation of sets of triples, and through the re-use of URIs, it is often possible to make explicit assuming information and new knowledge.

There are an increasing number of applications enabling libraries to transform and convert their bibliographic data into Linked Data. One such application is called the ALIADA.

When & if the intellectual content of libraries, archives, and museums is manifested as Linked Data, then new relationships between resources will be uncovered and discovered. Consequently, one of the purposes of cultural heritage institutions will be realized. Thus, Linked Data is a newer, more timely method of describing collections; what is old is new again.

Curation of digital objects

The curation of collections, especially in libraries, does not have to be limited to physical objects. Increasingly new opportunities regarding the curation of digital objects represent a growth area.
With the advent of the Internet there exists an abundance of full-text digital objects just waiting to be harvested, collected, and cached. It is not good enough to link and point to such objects because links break and institutions (websites) dissolve.

Curating digital objects is not easy, and it requires the application of traditional library principles of preservation in order to be fulfilled. It also requires systematic organization and evaluation in order to be useful.

Done properly, there are many advantages to the curation of such digital collections: long-term access, analysis & evaluation, use & re-use, and relationship building. Examples include: the creation of institutional repositories, the creation of bibliographic indexes made up of similar open access journals, and the complete works of an author of interest.

In the recent past I have created “browsers” used to do “distant reading” against curated collections of materials from the HathiTrust, the EEBO-TCP, and JSTOR. Given a curated list of identifiers each of the browsers locally caches the full text of digital object object, creates a “catalog” of the collection, does full text indexing against the whole collection, and generates a set of reports based on the principles of text mining. The result is a set of both HTML files and simple tab-delimited text files enabling the reader to get an overview of the collection, query the collection, and provide the means for closer reading.

How can these tools be used? A reader could first identify the complete works of a specific author from the HathiTrust, say, Ralph Waldo Emerson. They could then identify all of the journal articles in JSTOR written about Ralph Waldo Emerson. Finally the reader could use the HathiTrust and JSTOR browsers to curate the full text of all the identified content to verify previously established knowledge or discover new knowledge. On a broader level, a reader could articulate a research question such as “What are some of the characteristics of early American literature, and how might some of its authors be compared & contrasted?” or “What are some of the definitions of a ‘great’ man, and how have these definitions changed over time?”

The traditional principles of librarianship (collection, organization, preservation, and dissemination) are alive and well in this digital age. Such are the “whats” of librarianship. It is the “hows” of the librarianship that need to evolve in order the profession to remain relevant. What is old is new again.

Posted in Semantic Web and Linked Data | Comments Off on What is old is new again

Painting in Tuscany

October 13th, 2015

As you may or may not know, I have commenced upon a sort of leave of absence from my employer, and I spent the last the better part of the last two weeks painting in Tuscany.

Me and eight other students arrived in Arezzo (Italy) on Wednesday, October 1, and we were greeted by Yves Larocque of Walk The Arts. We then spent the next ten days on a farm/villa very close to Singalunga (Italy) where we learned about color theory, how to mix colors, a bit of Western art history, and art theory. All the while we painted and painted and painted. I have taken a few art classes in my day and this was quite honestly the best one I’ve ever attended. It was thorough, individualized, comprehensive, and totally immersive. Painting in Tuscany was a wonderful way to commence a leave of absence. The process gave me a chance to totally get away, see things from a different vantage point, and begin an assessment.

What does this have to do with librarianship? I don’t know, yet. When I find out I’ll let you know.

Posted in Travelogues | Comments Off on Painting in Tuscany

My water collection predicts the future

September 29th, 2015

As many of you may or may not know, I collect water, and it seems as if my water collection predicts the future, sort of.

Since 1979 or so, I’ve been collecting water. [1] The purpose of the collection is/was enable me to see and experience different parts of the world whenever I desired. As the collection grew and my computer skills developed, I frequently used the water collection as a kind of Guinea pig for digital library projects. For example, my water collection was once manifested as a HyperCard Stack complete with the sound of running water in the background. For a while my water collection was maintained in a FileMaker database that generated sets of HTML. Quite a number of years ago I migrated everything to MySQL and embedded images of the water bottles in fields of the database. This particular implementation also exploited XML and XSLT to dynamically make the content available on the Web. (There was even some RDF output.) After that I included geographic coordinates into the database. This made it easy for me to create maps illustrating whence the water came. To date, there are about two hundred and fifty waters in my collection, but active collecting has subsided in the past few years.

But alas, this past year I migrated my co-located host to a virtual machine. In the process I moved all of my Web-based applications — dating back more than two decades — to a newer version of the LAMP stack, and in the process I lost only a single application — my water collection. I still have all the data, but the library used to integrate XSLT into my web server (AxKit) simply would not work with Apache 2.0, and I have not had the time to re-implement a suitable replacement.

Concurrently, I have been negotiating a two-semester long leave-of-absence from my employer. The “leave” has been granted and commenced a few of weeks ago. The purpose of the leave is two-fold: 1) to develop my skills as a librarian, and 2) to broaden my experience as a person. The first part of my leave is to take a month-long vacation, and that vacation begins today. For the first week I will paint in Tuscany. For the second week I will drink coffee in Venice. During the third week I will give a keynote talk at ADLUG in Rome. [2] Finally, during the fourth week I will learn how to make croissants in Provence. After the vacation is over I will continue to teach “XML 101” to library school graduate students at San Jose State University. [3] I will also continue to work for the University of Notre Dame on a set of three text mining projects (EEBO, JSTOR, and HathiTrust). [4, 5, 6]

As I was getting ready for my “leave” I was rooting through my water collection, and I found four different waters, specifically from: 1) Florence, 2) Venice, 3) Rome, and 4) Nice. As I looked at the dates of when the water was collected, I realized I will be in those exact same four places, on those exact same four days, exactly thirty-three years after I originally collected them. My water collection predicted my future. My water collection is a sort of model of me and my professional career. My water collection has sent me a number of signs.

This “leave-of-absence” (which in not really a leave nor a sabbatical, but instead a temporary change to adjunct faculty status) is a whole lot like going to college for the first time. “Where in the world am I going? What in the world am I going to do? Who in the world will I meet?” It is both exciting and scary at once and at the same time. It is an opportunity I would be foolish to pass up, but it is not as easy as you might imagine. That said, I guess I am presently an artist- and librarian-at-large. I think I need new, albeit temporary, business cards to proclaim my new title(s).

Wish me luck, and “On my mark. Get set. Go!”

blog postings describing my water collection – ./2009/09/water-1-of-3/
ADLUG – http://www.adlug.net
“XML 101” at SJSU – http://ischoolapps.sjsu.edu/facultypages/view.php?fac=morgane
EEBO browser – https://github.com/ndlib/text-analysis-eebo
JSTOR browser – https://github.com/ndlib/text-analysis-jstor
HathiTrust browser – https://github.com/ndlib/text-analysis-htrc

Posted in Miscellaneous | 2 Comments »

Some automated analysis of Richard Baxter’s works

June 13th, 2015

baxter

This page describes a corpus named baxter. It is a programmatically generated report against the full text of all the writing of Richard Baxter (a English Puritan church leader, poet, and hymn-writer) as found in Early English Books Online. It was created using a (fledgling) tool called the EEBO Workset Browser.

General statistics

An analysis of the corpus’s metadata provides an overview of what and how many things it contains, when things were published, and the sizes of its items:

Number of items – 140
Publication date range – 1650 to 1697 (histogram : boxplot)
Sizes in pages – 1 to 1258 (histogram : boxplot)
Total number of pages – 33507
Average number of pages per item – 239

Possible correlations between numeric characteristics of records in the catalog can be illustrated through a matrix of scatter plots. As you would expect, there is almost always a correlation between pages and number of words. Are others exist? For more detail, browse the catalog.

Notes on word usage

By counting and tabulating the words in each item of the corpus, it is possible to measure additional characteristics:

Sizes of items in words – 858 to 532780 (histogram : boxplot)
Total number of words – 4986083
Average number of words per item – 35614
Total number of unique words – 70036
Most common words – god (65230) may (34500) us (32510) one (31437) church (30329) men (28306) would (27141) man (26623) hath (26209) yet (22972) many (21280) much (20667) chriſt (20084) love (19732) world (19611) make (18478) know (17281) therefore (17169) faith (16088) must (15544) though (15314) good (15283) doth (15163) muſt (15089) things (14996)

Perusing the list of all words in the corpus (and their frequencies) as well as all unique words can prove to be quite insightful. Are there one or more words in these lists connoting an idea of interest to you, and if so, then to what degree do these words occur in the corpus?

To begin to see how words of your choosing occur in specific items, search the collection.

Through the creation of locally defined “dictionaries” or “lexicons”, it is possible to count and tabulate how specific sets of words are used across a corpus. This particular corpus employs three such dictionaries — sets of: 1) “big” names, 2) “great” ideas, and 3) colors. Their frequencies are listed below:

Most common “big” names – james (567) augustine (148) aquinas (142) plato (128) aristotle (59) plutarch (55) smith (55) bacon (37) hobbes (35) virgil (30) aurelius (29) gilbert (25) plotinus (23) mill (22) epictetus (18) swift (17) apollonius (13) tacitus (12) galen (12) homer (12) gibbon (9) lucretius (7) sterne (7) hippocrates (6) herodotus (4) For more detail, see the list of “big” name frequencies.
Most common “great” ideas – god (65231) one (31438) man (26624) many (21281) love (19733) world (19612) good (15284) life (13768) law (12074) sin (9698) time (9296) nature (8487) duty (7094) death (6759) truth (6689) religion (6483) soul (5993) matter (5631) peace (5093) particular (4830) mind (4676) government (4538) knowledge (4099) evil (4045) cause (3536) For more detail, see the list of “great” idea frequencies.
Colors – white (302) black (93) red (88) green (54) purple (51) brown (46) orange (9) gray (6) yellow (3) blue (1) For more detail, see the list of color word frequencies.

The distribution of words (histograms and boxplots) and the frequency of words (wordclouds), and how these frequencies “cluster” together can be illustrated:

Histograms – “big” names; “great” ideas; colors
Boxplots – “big” names; “great” ideas; colors
Wordclouds – most common words; “big” names; “great” ideas; colors
Cluster dendrograms – most common words; “big” names; “great” ideas; colors

Items of interest

Based on the information above, the following items (and their associated links) are of possible interest:

Shortest item (1 p.) – Short instructions for the sick: Especially who by contagion, or otherwise, are deprived of the presence of a faithfull pastor. / By Richard Baxter. (TEI : HTML : plain text)
Longest item (1258 p.) – A Christian directory, or, A summ of practical theologie and cases of conscience directing Christians how to use their knowledge and faith, how to improve all helps and means, and to perform all duties, how to overcome temptations, and to escape or mortifie every sin : in four parts … / by Richard Baxter. (TEI : HTML : plain text)
Oldest item (1650) – The saints everlasting rest, or, A treatise of the blessed state of the saints in their enjoyment of God in glory wherein is shewed its excellency and certainty, the misery of those that lose it, the way to attain it, and assurance of it, and how to live in the continual delightful forecasts of it and now published by Richard Baxter … (TEI : HTML : plain text)
Most recent (1697) – Mr. Richard Baxter’s last legacy in select admonitions and directions to all sober dissenters. (TEI : HTML : plain text)
Most thoughtful item – Short instructions for the sick: Especially who by contagion, or otherwise, are deprived of the presence of a faithfull pastor. / By Richard Baxter. (TEI : HTML : plain text)
Least thoughtful item – Dattodiad y qwestiwn mawr, beth sydd raid i ni ei wneuthur fel y byddom gadwedig. Athrawiaethau i fuchedd sanctaidd. / O waith y disinydd parchedig Mr. Richard Baxter. (TEI : HTML : plain text)
Biggest name dropper – R. Baxter’s sence of the subscribed articles of religion (TEI : HTML : plain text)
Fewest quotations – Additions to the poetical fragments of Rich. Baxter written for himself and communicated to such as are more for serious verse than smooth. (TEI : HTML : plain text)
Most colorful – The certainty of the worlds of spirits and, consequently, of the immortality of souls of the malice and misery of the devils and the damned : and of the blessedness of the justified, fully evinced by the unquestionable histories of apparitions, operations, witchcrafts, voices &c. / written, as an addition to many other treatises for the conviction of Sadduces and infidels, by Richard Baxter. (TEI : HTML : plain text)
Ugliest – Richard Baxter his account to his dearly beloved, the inhabitants of Kidderminster, of the causes of his being forbidden by the Bishop of Worcester to preach within his diocess with the Bishop of Worcester’s letter in answer thereunto : and some short animadversions upon the said bishops letter. (TEI : HTML : plain text)

Posted in Text Mining and Natural Langauge Processing | Comments Off on Some automated analysis of Richard Baxter’s works

Marrying close and distant reading: A THATCamp project

April 12th, 2015

The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. By combining both of these processes there is a hope greater comprehension and understanding of a corpus can be gained when compared to using close or distant reading alone. (This text might also be republished at http://dh.crc.nd.edu/sandbox/thatcamp-2015/ as well as http://nd2015.thatcamp.org/2015/04/07/close-and-distant/.)

To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in the margins. As the glue is drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques and with the results made available here. Both the traditional reading as well as the text mining are aimed towards answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” mean have? What sorts of things have “great” men accomplished?

Comparison

Feature	The Prince	Representative Men
Author	Niccolò di Bernardo dei Machiavelli (1469 – 1527)	Ralph Waldo Emerson (1803 – 1882)
Title	The Prince	Representative Men
Date	1532	1850
Fulltext	plain text \| HTML \| PDF \| TEI/XML	plain text \| HTML \| PDF \| TEI/XML
Length	31,179 words	59,600 words
Fog score	23.1	14.6
Flesch score	33.5	52.9
Kincaid score	19.7	11.5
Frequencies	unigrams, bigrams, trigrams, quadgrams, quintgrams	unigrams, bigrams, trigrams, quadgrams, quintgrams
Parts-of-speech	nouns, pronouns, adjectives, verbs, adverbs	nouns, pronouns, adjectives, verbs, adverbs

Search

Search for “man or men” in The Prince. Search for “man or men” in Representative Men.

Observations

I observe this project to be a qualified success.

First, I was able to print and bind my book, and while the glue is still trying, I’m confident the final results will be more than usable. The real tests of the bound book are to see if: 1) I actually read it, 2) I annotate it using my personal method, and 3) if I am able to identify answers to my research questions, above.

bookmaking tools

almost done

Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long and the first. Readability score-wise, Machiavelli is almost certainly written for the more educated and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized into histograms or word clouds. The verbs ought to lemmatized. The frequencies ought to be depicted as ratios compared to the texts. Other measures could be created as well. For example, my Great Books Coefficient could be employed.

How do Emerson and Machiavelli define a “great” man. Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And network diagrams illustrating what words are used “in the same breath” as the word man in both works are not very dissimilar:

“man” in The Prince

“man” in Representative men

I think I’m going to have to read the books to find the answer. Really.

Code

Bunches o’ code was written to produce the reports:

concordance.cgi – the simple search engine
fathom.pl – used to compute the readability scores
file2pos.py – create a parts-of-speech file for later use
network.cgi – used to display words used “in the same breath” a given word
ngrams.pl – compute ngrams
pos.py – count and tabulate parts-of-speech from a previously created file

You can download this entire project — code and all — from http://dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz or ./wp-content/uploads/2015/04/thatcamp-2015.tar.gz.

Posted in Text Mining and Natural Langauge Processing | Comments Off on Marrying close and distant reading: A THATCamp project

Great Books Survey

January 1st, 2015

I am happy to say that the Great Books Survey is still going strong. Since October of 2010 it has been answered 24,749 times by 2,108 people from people all over the globe. To date, the top five “greatest” books are Athenian Constitution by Aristotle, Hamlet by Shakespeare, Don Quixote by Cervantes, Odyssey by Homer, and the Divine Comedy by Dante. The least “greatest” books are Rhesus by Euripides, On Fistulae by Hippocrates, On Fractures by Hippocrates, On Ulcers by Hippocrates, On Hemorrhoids by Hippocrates. “Too bad Hippocrates”.

For more information about this Great Books of the Western World investigation, see the various blog postings.

Posted in Great Books of the Western World | Comments Off on Great Books Survey

Doing What I’m Not Suppose To Do

October 24th, 2014

I suppose I’m doing what I’m not suppose to do. One of those things is writing in books.

I’m attending a local digital humanities conference. One of the presenters described and demonstrated a program from MIT called Annotation Studio. Using this program a person can upload some text to a server, annotate the text, and share the annotations with a wider audience. Interesting!?

I then went for a walk to see an art show. It seems I had previously been to this art museum. The art was… art, but I did not find it beautiful. The themes were disturbing.

I then made it to the library where I tried to locate a copy of my one and only formally published book — WAIS And Gopher Servers. When I was here previously, I signed the book’s title page, and I came back to do the same thing. Alas, the book had been moved to remote storage.

I then proceeded to find another book in which I had written something. I was successful, and I signed the title page. Gasp! Considering the fact that no one had opened the book in years, and the pages were glued together I figured, “What the heck!”

Just as importantly, my contribution to the book — written in 1992 — was a short story called, “A day in the life of Mr. D“. It is an account of how computers would be used in the future. In it the young boy uses it to annotate a piece of text, and he gets to see the text of previous annotators. What is old is new again.

P.S. I composed this blog posting using an iPad. Functional but tedious.

Posted in Miscellaneous | Comments Off on Doing What I’m Not Suppose To Do

Publishing LOD with a bent toward archivists

August 16th, 2014

eye candy by Eric

This essay provides an overview of linked open data (LOD) with a bent towards archivists. It enumerates a few advantages the archival community has when it comes to linked data, as well as some distinct disadvantages. It demonstrates one way to expose EAD as linked data through the use of XSLT transformations and then through a rudimentary triple store/SPARQL endpoint combination. Enhancements to the linked data publication process are then discussed. The text of this essay in the form of a handout as well as a number of support files is can also be found at http://infomotions.com/sandbox/lodlamday/.

Review of RDF

The ultimate goal of LOD is to facilitate the discovery of new information and knowledge. To accomplish this goal, people are expected to make metadata describing their content available on the Web in one or more forms of RDF — Resource Description Framework. RDF is not so much a file format as a data structure. It is a collection of “assertions” in the form of “triples” akin to rudimentary “sentences” where the first part of the sentence is a “subject”, the second part is a “predicate”, and the third part is an “object”. Both the subjects and predicates are required to be Universal Resource Identifiers — URIs. (Think “URLs”.) The subject URI is intended to denote a person, place, or thing. The predicate URI is used to specify relationships between subjects and the objects. When verbalizing RDF assertions, it is usually helpful to prefix predicate URIs with a “is a” or “has a” phrase. For example, “This book ‘has a’ title of ‘Huckleberry Finn'” or “This university ‘has a’ home page of URL”. The objects of RDF assertions are ideally more URIs but they can also be “strings” or “literals” — words, phrases, numbers, dates, geo-spacial coordinates, etc. Finally, it is expected that the URIs of RDF assertions are shared across domains and RDF collections. By doing so, new assertions can be literally “linked” across the world of RDF in the hopes of establishing new relationships. By doing so new new information and new knowledge is brought to light.

Simple foray into publishing linked open data

Manifesting RDF from archival materials by hand is not an easy process because nobody is going to manually type the hundreds of triples necessary to adequately describe any given item. Fortunately, it is common for the description of archival materials to be manifested in the form of EAD files. Being a form of XML, valid EAD files must be well-formed and conform to a specific DTD or schema. This makes it easy to use XSLT to transform EAD files into various (“serialized”) forms of RDF such as XML/RDF, turtle, or JSON-LD. A few years ago such a stylesheet was written by Pete Johnston for the Archives Hub as a part of the Hub’s LOCAH project. The stylesheet outputs XML/RDF and it was written specifically for Archives Hub EAD files. It has been slightly modified here and incorporated into a Perl script. The Perl script reads the EAD files in a given directory and transforms them into both XML/RDF and HTML. The XML/RDF is intended to be read by computers. The HTML is intended to be read by people. By simply using something like the Perl script, an archive can easily participate in LOD. The results of these efforts can be seen in the local RDF and HTML directories. Nobody is saying the result is perfect nor complete, but it is more than a head start, and all of this is possible because the content of archives is often times described using EAD.

Triple stores and SPARQL endpoints

By definition, linked data (RDF) is structured data, and structured data lends itself very well to relational database applications. In the realm of linked data, these database applications are called “triple stores”. Database applications excel at the organization of data, but they are also designed to facilitate search. In the realm of relational databases, the standard query language is called SQL, and there is a similar query language for triples stores. It is called SPARQL. The term “SPARQL endpoints” is used denote a URL where SPARQL queries can be applied to a specific triple store.

4store is an open source triple store application which also supports SPARQL endpoints. Once compiled and installed, it is controlled and managed through a set of command-line applications. These applications support the sorts of things one expects with any other database application such as create database, import into database, search database, dump database, and destroy database. Two other commands turn on and turn off SPARQL endpoints.

For the purposes of LODLAM Training Day, a 4store triple store was created, filled with sample data, and made available as a SPARQL endpoint. If it has been turned on, then the following links ought to return useful information and demonstrating additional ways of publishing linked data:

status page – a sort of home pages for a 4store triple store
test page – an HTML form enabling one to apply SPARQL queries to the triple store
SELECT DISTINCT ?o WHERE {?s a ?o} – a list of all the objects which are types of subjects. Useful for learning what types of things are described in any triple store.
SELECT DISTINCT ?p WHERE {?s ?p ?o} – a list of all the predicates used in the triple store. Useful for learning how things are described in any triple store. If the ontologies used in the original RDF are well-documented, then by following the URIs (URLs) and reading the documentation found there, a person can use the previous two standard queries to understand the contents of any triple store. In this way, linked data (RDF) is self-documenting.
SELECT ?title ?description ?url WHERE { ?s <http://xmlns.com/foaf/0.1/page> ?url . ?s <http://www.w3.org/2000/01/rdf-schema#label> ?title . ?s <http://data.archiveshub.ac.uk/def/scopecontent> ?description } order by ?title – a rudimentary bibliography of all finding aids
SELECT DISTINCT ?subject WHERE { ?s <http://purl.org/dc/terms/subject> ?uri . ?uri <http://www.w3.org/2000/01/rdf-schema#label> ?subject } ORDER BY ?subject – an alphabetical list of all the store’s subject headings

Advantages and disadvantages

The previous sections demonstrate the ease at which archival metadata can be published as linked data. These demonstrations are not the the be-all nor end-all of linked data the publication process. Additional techniques could be employed. Exploiting content negotiation in response to a given URI is an excellent example. Supporting alternative RDF serializations is another example. It behooves the archivist to provide enhanced views of the linked data, which are sometimes called “graphs”. The linked data can be combined with the linked data of other publishers to implement even more interesting services, views, and graphs. All of these things are advanced techniques requiring the skills of additional people (graphic designers, usability experts, computer programmers, systems administrators, allocators of time and money, project managers, etc.). Despite this, given the tools outlined above, it is not too difficult to publish linked data now and today. Such are the advantages.

On the other hand, there are at least two distinct disadvantages. The most significant derives from the inherent nature of archival material. Archival material is almost always rare or unique. Because it is rare and unique, there are few (if any) previously established URIs for the people and things described in archival collections. This is unlike the world of librarianship, where the materials of libraries are often owned my multiple institutions. Union catalogs share authority lists denoting people and institutions. Shared URIs across domains is imperative for the idea of the Semantic Web to come to fruition. The archival community has no such collection of shared URIs. Maybe the community-wide implementation and exploitation of Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) can help resolve this problem. After all, it too is a form of XML which lends itself very will to XSLT transformation.

Second, and almost as importantly, the use of EAD is not really the best way manifest archival metadata for linked data publication. EADs are finding aids. They are essentially narrative essays describing collections as a whole. They tell stories. The controlled vocabularies articulated in the header do not necessarily apply to each of the items in the container list. For good reasons, the items in the container list are minimally described. Consequently, the resulting RDF statement come across rather thin and poorly linked to fuller descriptions. Moreover, different archivists put different emphases on different aspect of EAD description. This makes amalgamated collections of archival linked data difficult to navigate; the linked data requires cleaning and normalization. The solution to these problems might be to create and maintain archival collections in database applications, such as ArchivesSpace, and have linked data published from there. By doing so the linked data publication efforts of the archival community would be more standardized and somewhat centralized.

Summary

This essay has outlined the ease at which archival metadata in the form of EAD can be easily published as linked data. The result is far from perfect, but a huge step in the right direction. Publishing linked data is not an event, but rather an iterative process. There is always room for improvement. Starting today, publish your metadata as linked data.

Posted in Semantic Web and Linked Data | Comments Off on Publishing LOD with a bent toward archivists

Fun with Koha

July 19th, 2014

These are brief notes about my recent experiences with Koha.

Introduction

koha logo As you may or may not know, Koha is a grand daddy of library-related open source software, and it is an integrated library system to boot. Such are no small accomplishments. For reasons I will not elaborate upon, I’ve been playing with Koha for the past number of weeks, and in short, I want to say, “I’m impressed.” The community is large, international, congenial, and supportive. The community is divided into a number of sub-groups: developers, committers, commercial support employees, and, of course, librarians. I’ve even seen people from another open source library system (Evergreen) provide technical support and advice. For the most part, everything is on the ‘Net, well laid out, and transparent. There are some rather “organic” parts to the documentation akin to an “English garden”, but that is going to happen in any de-centralized environment. All in all, and without any patronizing intended, “Kudos to Koha!”

Installation

Looking through my collection of tarballs, I see I’ve installed Koha a number of times over the years, but this time it was challenging. Sparing you all the details, I needed to use a specific version of MySQL (version 5.5), and I had version 5.6. The installation failure was not really Koha’s fault. It is more the fault of MySQL because the client of MySQL version 5.6 outputs a warning message to STDOUT when a password is passed on the command line. This message confused the Koha database initialization process, thus making Koha unusable. After downgrading to version 5.5 the database initialization process was seamless.

My next step was to correctly configure Zebra — Koha’s default underlying indexer. Again, I had installed from source, and my Zebra libraries, etc. were saved in a directory different from the configuration files created by the Koha’s installation process. After correctly updating the value of modulePath to point to /usr/local/lib/idzebra-2.0/ in zebra-biblios-dom.cfg, zebra-authorities.cfg, zebra-biblios.cfg, and zebra-authorities-dom.cfg I could successfully index and search for content. I learned this from a mailing list posting.

Koha “extras”

Koha comes (for free) with a number of “extras”. For example, the Zebra indexer can be deployed as both a Z39.50 server as well as an SRU server. Turning these things on was as simple as uncommenting a few lines in the koha-conf.xml file and opening a few ports in my firewall. Z39.50 is inherently unusable from a human point of view so I didn’t go into configuring it, but it does work. Through the use of XSL stylesheets, SRU can be much more usable. Luckily I have been here before. For example, a long time ago I used Zebra to index my Alex Catalogue as well as some content from the HathiTrust (MBooks). The hidden interface to the Catalogue sports faceted searching and used to support spelling corrections. The MBooks interface transforms MARCXML into simple HTML. Both of these interfaces are quite zippy. In order to get Zebra to recognize my XSL I needed to add an additional configuration directive to my koha-conf.xml file. Specifically, I need to add a docpath element to my public server’s configuration. Once I re-learned this fact, implementing a rudimentary SRU interface to my Koha index was easy and results are returned very fast. I’m impressed.

My big goal is to figure out ways Koha can expose its content to the wider ‘Net. To this end sKoha comes with an OAI-PMH interface. It needs to be enabled, and can be done through the Koha Web-based backend under Home -> Koha Administration -> Global Preferences -> General Systems Preferences -> Web Services. Once enabled, OAI sets can be created through the Home -> Administration -> OAI sets configuration module. (Whew!) Once this is done Koha will respond to OAI-PMH requests. I then took it upon myself to transform the OAI output into linked data using a program called OAI2LOD. This worked seamlessly, and for a limited period of time you can browse my Koha’s cataloging data as linked data. The viability of the resulting linked data is questionable, but that is another blog posting.

Ideas and next steps

Library catalogs (OPACs, “discovery systems”, whatever you want to call them) are not simple applications/systems. They are a mixture of very specialized inventory lists, various types of people with various skills and authorities, indexing, and circulation, etc. Then we — as librarians — add things like messages of the day, record exporting, browsable lists, visualizations, etc. that complicate the whole thing. It is simply not possible to create a library catalog in the “Unix way“. The installation of Koha was not easy for me. There are expenses with open source software, and I all but melted down my server during the installation process. (Everything is now back to normal.) I’ve been advocating open source software for quite a while, and I understand the meaning of “free” in this context. I’m not complaining. Really.

Now that I’ve gotten this far, my next step is to investigate the feasibility of using a different indexer with Koha. Zebra is functional. It is fast. It is multi-faceted (all puns intended). But configuring it is not straight-forward, and its community of support is tiny. I see from rooting around in the Koha source code that Solr has been explored. I have also heard through the grapevine that ElasticSearch has been explored. I will endeavor to explore these things myself and report on what I learn. Different indexers, with more flexible API’s may make the possibility of exposing Koha content as linked data more feasible as well.

Wish me luck.

Posted in MARC and technical services | 1 Comment »

Fun with ElasticSearch and MARC

June 22nd, 2014

For a good time I have started to investigate how to index MARC data using ElasticSearch. This posting outlines some of my initial investigations and hacks.

ElasticSearch seems to be an increasingly popular indexer. Getting it up an running on my Linux host was… trivial. It comes withe a full-fledged Perl interface. Nice! Since ElasticSearch takes JSON as input, I needed to serialize my MARC data accordingly, and MARC::File::JSON seems to do a fine job. With this in hand, I wrote three programs:

index.pl – create an index of MARC records
get.pl – retrieve a specific record from the index
search.pl – query the index

I have some work to do, obviously. First of all, do I really want to index MARC in its raw, communications format? I don’t think so, but that is where I’ll start. Second, the search script doesn’t really search. Instead it simply gets all the records. This is because I really don’t know how to search yet; I don’t really know how to query fields like “245 subfield a”.

index.pl

#!/usr/bin/perl

# configure
use constant INDEX => 'pamphlets';
use constant MARC  => './pamphlets.marc';
use constant MAX   => 100;
use constant TYPE  => 'marc';

# require
use MARC::Batch;
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $batch = MARC::Batch->new( 'USMARC', MARC );
my $count = 0;
my $e     = Search::Elasticsearch->new;

# process each record in the batch
while ( my $record = $batch->next ) {

  # debug
  print $record->title, "\n";
  
  # serialize the record into json
  my $json = &MARC::File::JSON::encode( $record );
  
  # increment
  $count++;
  
  # index; do the work
  $e->index(  index   => INDEX,
                type    => TYPE,
                id      => $count,
                body    => { "$json" }
    );
    
  # check; only do a few
  last if ( $count > MAX );
  
}

# done
exit;

get.pl

# configure 
use constant INDEX => 'pamphlets';
use constant TYPE  => 'marc';

# require
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# get; do the work
my $doc = $e->get( index   => INDEX,
                   type    => TYPE,
                   id      => $ARGV[ 0 ]
);

# reformat and output; done
my $record = MARC::Record->new_from_json( keys( $doc->{ '_source' } ) );
print $record->as_formatted, "\n";
exit;

search.pl

# configure 
use constant INDEX => 'pamphlets';

# require
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# search
my $results = $e->search(
  index => INDEX,
    body  => { query => { match_all => { $ARGV[ 0 ] } } }
);

# output
my $hits = $results->{ 'hits' }->{ 'hits' };
for ( my $i = 0; $i <= $#$hits; $i++ ) {

  my $record = MARC::Record->new_from_json( keys( $hits[ $i ]->{ '_source' } ) );
  print $record->as_formatted, "\n\n";

}

# done
exit;

Posted in MARC and technical services | 2 Comments »

LiAM source code: Perl poetry

February 17th, 2014

#!/usr/bin/perl # Liam Guidebook Source Code; Perl poetry, sort of # Eric Lease Morgan <emorgan@nd.edu> # February 16, 2014 # done exit;

#!/usr/bin/perl # marc2rdf.pl – make MARC records accessible via linked data # Eric Lease Morgan <eric_morgan@infomotions.com> # December 5, 2013 – first cut; # configure use constant ROOT => ‘/disk01/www/html/main/sandbox/liam’; use constant MARC => ROOT . ‘/src/marc/’; use constant DATA => ROOT . ‘/data/’; use constant PAGES => ROOT . ‘/pages/’; use constant MARC2HTML => ROOT . ‘/etc/MARC21slim2HTML.xsl’; use constant MARC2MODS => ROOT . ‘/etc/MARC21slim2MODS3.xsl’; use constant MODS2RDF => ROOT . ‘/etc/mods2rdf.xsl’; use constant MAXINDEX => 100; # require use IO::File; use MARC::Batch; use MARC::File::XML; use strict; use XML::LibXML; use XML::LibXSLT; # initialize my $parser = XML::LibXML->new; my $xslt = XML::LibXSLT->new; # process each record in the MARC directory my @files = glob MARC . “*.marc”; for ( 0 .. $#files ) { # re-initialize my $marc = $files[ $_ ]; my $handle = IO::File->new( $marc ); binmode( STDOUT, ‘:utf8’ ); binmode( $handle, ‘:bytes’ ); my $batch = MARC::Batch->new( ‘USMARC’, $handle ); $batch->warnings_off; $batch->strict_off; my $index = 0; # process each record in the batch while ( my $record = $batch->next ) { # get marcxml my $marcxml = $record->as_xml_record; my $_001 = $record->field( ‘001’ )->as_string; $_001 =~ s/_//; $_001 =~ s/ +//; $_001 =~ s/-+//; print ” marc: $marc\n”; print ” identifier: $_001\n”; print ” URI: http://infomotions.com/sandbox/liam/id/$_001\n”; # re-initialize and sanity check my $output = PAGES . “$_001.html”; if ( ! -e $output or -s $output == 0 ) { # transform marcxml into html print ” HTML: $output\n”; my $source = $parser->parse_string( $marcxml ) or warn $!; my $style = $parser->parse_file( MARC2HTML ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $html = $stylesheet->output_string( $results ); &save( $output, $html ); } else { print ” HTML: skipping\n” } # re-initialize and sanity check my $output = DATA . “$_001.rdf”; if ( ! -e $output or -s $output == 0 ) { # transform marcxml into mods my $source = $parser->parse_string( $marcxml ) or warn $!; my $style = $parser->parse_file( MARC2MODS ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $mods = $stylesheet->output_string( $results ); # transform mods into rdf print ” RDF: $output\n”; $source = $parser->parse_string( $mods ) or warn $!; my $style = $parser->parse_file( MODS2RDF ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $rdf = $stylesheet->output_string( $results ); &save( $output, $rdf ); } else { print ” RDF: skipping\n” } # prettify print “\n”; # increment and check $index++; last if ( $index > MAXINDEX ) } } # done exit; sub save { open F, ‘ > ‘ . shift or die $!; binmode( F, ‘:utf8’ ); print F shift; close F; return; }

#!/usr/bin/perl # ead2rdf.pl – make EAD files accessible via linked data # Eric Lease Morgan <eric_morgan@infomotions.com> # December 6, 2013 – based on marc2linkedata.pl # configure use constant ROOT => ‘/disk01/www/html/main/sandbox/liam’; use constant EAD => ROOT . ‘/src/ead/’; use constant DATA => ROOT . ‘/data/’; use constant PAGES => ROOT . ‘/pages/’; use constant EAD2HTML => ROOT . ‘/etc/ead2html.xsl’; use constant EAD2RDF => ROOT . ‘/etc/ead2rdf.xsl’; use constant SAXON => ‘java -jar /disk01/www/html/main/sandbox/liam/bin/saxon.jar -s:##SOURCE## -xsl:##XSL## -o:##OUTPUT##’; # require use strict; use XML::XPath; use XML::LibXML; use XML::LibXSLT; # initialize my $saxon = ”; my $xsl = ”; my $parser = XML::LibXML->new; my $xslt = XML::LibXSLT->new; # process each record in the EAD directory my @files = glob EAD . “*.xml”; for ( 0 .. $#files ) { # re-initialize my $ead = $files[ $_ ]; print ” EAD: $ead\n”; # get the identifier my $xpath = XML::XPath->new( filename => $ead ); my $identifier = $xpath->findvalue( ‘/ead/eadheader/eadid’ ); $identifier =~ s/[^\w ]//g; print ” identifier: $identifier\n”; print ” URI: http://infomotions.com/sandbox/liam/id/$identifier\n”; # re-initialize and sanity check my $output = PAGES . “$identifier.html”; if ( ! -e $output or -s $output == 0 ) { # transform marcxml into html print ” HTML: $output\n”; my $source = $parser->parse_file( $ead ) or warn $!; my $style = $parser->parse_file( EAD2HTML ) or warn $!; my $stylesheet = $xslt->parse_stylesheet( $style ) or warn $!; my $results = $stylesheet->transform( $source ) or warn $!; my $html = $stylesheet->output_string( $results ); &save( $output, $html ); } else { print ” HTML: skipping\n” } # re-initialize and sanity check my $output = DATA . “$identifier.rdf”; if ( ! -e $output or -s $output == 0 ) { # create saxon command, and save rdf print ” RDF: $output\n”; $saxon = SAXON; $xsl = EAD2RDF; $saxon =~ s/##SOURCE##/$ead/e; $saxon =~ s/##XSL##/$xsl/e; $saxon =~ s/##OUTPUT##/$output/e; system $saxon; } else { print ” RDF: skipping\n” } # prettify print “\n”; } # done exit; sub save { open F, ‘ > ‘ . shift or die $!; binmode( F, ‘:utf8’ ); print F shift; close F; return; }

#!/usr/bin/perl # store-make.pl – simply initialize an RDF triple store # Eric Lease Morgan <eric_morgan@infomotions.com> # # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; # require use strict; use RDF::Redland; # sanity check my $db = $ARGV[ 0 ]; if ( ! $db ) { print “Usage: $0 <db>\n”; exit; } # do the work; brain-dead my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’yes’, hash-type=’bdb’, dir=’$etc'” ); die “Unable to create store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Unable to create model ($!)” unless $model; # “save” $store = undef; $model = undef; # done exit;

#!/user/bin/perl # store-add.pl – add items to an RDF triple store # Eric Lease Morgan <eric_morgan@infomotions.com> # # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; # require use strict; use RDF::Redland; # sanity check #1 – command line arguments my $db = $ARGV[ 0 ]; my $file = $ARGV[ 1 ]; if ( ! $db or ! $file ) { print “Usage: $0 <db> <file>\n”; exit; } # sanity check #2 – store exists die “Error: po2s file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-po2s.db’ ); die “Error: so2p file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-so2p.db’ ); die “Error: sp2o file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-sp2o.db’ ); # open the store my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’no’, hash-type=’bdb’, dir=’$etc'” ); die “Error: Unable to open store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Error: Unable to create model ($!)” unless $model; # sanity check #3 – file exists die “Error: $file not found.\n” if ( ! -e $file ); # parse a file and add it to the store my $uri = RDF::Redland::URI->new( “file:$file” ); my $parser = RDF::Redland::Parser->new( ‘rdfxml’, ‘application/rdf+xml’ ); die “Error: Failed to find parser ($!)\n” if ( ! $parser ); my $stream = $parser->parse_as_stream( $uri, $uri ); my $count = 0; while ( ! $stream->end ) { $model->add_statement( $stream->current ); $count++; $stream->next; } # echo the result warn “Namespaces:\n”; my %namespaces = $parser->namespaces_seen; while ( my ( $prefix, $uri ) = each %namespaces ) { warn ” prefix: $prefix\n”; warn ‘ uri: ‘ . $uri->as_string . “\n”; warn “\n”; } warn “Added $count statements\n”; # “save” $store = undef; $model = undef; # done exit; 10.5 store-search.pl – query a triple store # Eric Lease Morgan <eric_morgan@infomotions.com> # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; my %namespaces = ( “crm” => “http://erlangen-crm.org/current/”, “dc” => “http://purl.org/dc/elements/1.1/”, “dcterms” => “http://purl.org/dc/terms/”, “event” => “http://purl.org/NET/c4dm/event.owl#”, “foaf” => “http://xmlns.com/foaf/0.1/”, “lode” => “http://linkedevents.org/ontology/”, “lvont” => “http://lexvo.org/ontology#”, “modsrdf” => “http://simile.mit.edu/2006/01/ontologies/mods3#”, “ore” => “http://www.openarchives.org/ore/terms/”, “owl” => “http://www.w3.org/2002/07/owl#”, “rdf” => “http://www.w3.org/1999/02/22-rdf-syntax-ns#”, “rdfs” => “http://www.w3.org/2000/01/rdf-schema#”, “role” => “http://simile.mit.edu/2006/01/roles#”, “skos” => “http://www.w3.org/2004/02/skos/core#”, “time” => “http://www.w3.org/2006/time#”, “timeline” => “http://purl.org/NET/c4dm/timeline.owl#”, “wgs84_pos” => “http://www.w3.org/2003/01/geo/wgs84_pos#” ); # require use strict; use RDF::Redland; # sanity check #1 – command line arguments my $db = $ARGV[ 0 ]; my $query = $ARGV[ 1 ]; if ( ! $db or ! $query ) { print “Usage: $0 <db> <query>\n”; exit; } # sanity check #2 – store exists die “Error: po2s file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-po2s.db’ ); die “Error: so2p file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-so2p.db’ ); die “Error: sp2o file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-sp2o.db’ ); # open the store my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’no’, hash-type=’bdb’, dir=’$etc'” ); die “Error: Unable to open store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Error: Unable to create model ($!)” unless $model; # search #my $sparql = RDF::Redland::Query->new( “CONSTRUCT { ?a ?b ?c } WHERE { ?a ?b ?c }”, undef, undef, “sparql” ); my $sparql = RDF::Redland::Query->new( “PREFIX modsrdf: <http://simile.mit.edu/2006/01/ontologies/mods3#>\nSELECT ?a ?b ?c WHERE { ?a modsrdf:$query ?c }”, undef, undef, ‘sparql’ ); my $results = $model->query_execute( $sparql ); print $results->to_string; # done exit;

#!/usr/bin/perl # store-dump.pl – output the content of store as RDF/XML # Eric Lease Morgan <eric_morgan@infomotions.com> # # December 14, 2013 – after wrestling with wilson for most of the day # configure use constant ETC => ‘/disk01/www/html/main/sandbox/liam/etc/’; # require use strict; use RDF::Redland; # sanity check #1 – command line arguments my $db = $ARGV[ 0 ]; my $uri = $ARGV[ 1 ]; if ( ! $db ) { print “Usage: $0 <db> <uri>\n”; exit; } # sanity check #2 – store exists die “Error: po2s file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-po2s.db’ ); die “Error: so2p file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-so2p.db’ ); die “Error: sp2o file not found. Make a store?\n” if ( ! -e ETC . $db . ‘-sp2o.db’ ); # open the store my $etc = ETC; my $store = RDF::Redland::Storage->new( ‘hashes’, $db, “new=’no’, hash-type=’bdb’, dir=’$etc'” ); die “Error: Unable to open store ($!)” unless $store; my $model = RDF::Redland::Model->new( $store, ” ); die “Error: Unable to create model ($!)” unless $model; # do the work my $serializer = RDF::Redland::Serializer->new; print $serializer->serialize_model_to_string( RDF::Redland::URI->new, $model ); # done exit;

#!/usr/bin/perl # sparql.pl – a brain-dead, half-baked SPARQL endpoint # Eric Lease Morgan <eric_morgan@infomotions.com> # December 15, 2013 – first investigations # require use CGI; use CGI::Carp qw( fatalsToBrowser ); use RDF::Redland; use strict; # initialize my $cgi = CGI->new; my $query = $cgi->param( ‘query’ ); if ( ! $query ) { print $cgi->header; print &home; } else { # open the store for business my $store = RDF::Redland::Storage->new( ‘hashes’, ‘store’, “new=’no’, hash-type=’bdb’, dir=’/disk01/www/html/main/sandbox/liam/etc'” ); my $model = RDF::Redland::Model->new( $store, ” ); # search my $results = $model->query_execute( RDF::Redland::Query->new( $query, undef, undef, ‘sparql’ ) ); # return the results print $cgi->header( -type => ‘application/xml’ ); print $results->to_string; } # done exit; sub home { # create a list namespaces my $namespaces = &namespaces; my $list = ”; foreach my $prefix ( sort keys $namespaces ) { my $uri = $$namespaces{ $prefix }; $list .= $cgi->li( “$prefix – ” . $cgi->a( { href=> $uri, target => ‘_blank’ }, $uri ) ); } $list = $cgi->ol( $list ); # return a home page return <<EOF <html> <head> <title>LiAM SPARQL Endpoint</title> </head> <body style=’margin: 7%’> <h1>LiAM SPARQL Endpoint</h1> <p>This is a brain-dead and half-baked SPARQL endpoint to a subset of LiAM linked data. Enter a query, but there is the disclaimer. Errors will probably happen because of SPARQL syntax errors. Remember, the interface is brain-dead. Your milage <em>will</em> vary.</p> <form method=’GET’ action=’./’> <textarea style=’font-size: large’ rows=’5′ cols=’65’ name=’query’ /> PREFIX hub:<http://data.archiveshub.ac.uk/def/> SELECT ?uri WHERE { ?uri ?o hub:FindingAid } </textarea><br /> <input type=’submit’ value=’Search’ /> </form> <p>Here are a few sample queries:</p> <ul> <li>Find all triples with RDF Schema labels – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+rdf%3A%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0ASELECT+*+WHERE+%7B+%3Fs+rdf%3Alabel+%3Fo+%7D%0D%0A”>PREFIX rdf:<http://www.w3.org/2000/01/rdf-schema#> SELECT * WHERE { ?s rdf:label ?o }</a></code></li> <li>Find all items with MODS subjects – <code><a href=’http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+mods%3A%3Chttp%3A%2F%2Fsimile.mit.edu%2F2006%2F01%2Fontologies%2Fmods3%23%3E%0D%0ASELECT+*+WHERE+%7B+%3Fs+mods%3Asubject+%3Fo+%7D’>PREFIX mods:<http://simile.mit.edu/2006/01/ontologies/mods3#> SELECT * WHERE { ?s mods:subject ?o }</a></code></li> <li>Find every unique predicate – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+DISTINCT+%3Fp+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D”>SELECT DISTINCT ?p WHERE { ?s ?p ?o }</a></code></li> <li>Find everything – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D”>SELECT * WHERE { ?s ?p ?o }</a></code></li> <li>Find all classes – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+DISTINCT+%3Fclass+WHERE+%7B+%5B%5D+a+%3Fclass+%7D+ORDER+BY+%3Fclass”>SELECT DISTINCT ?class WHERE { [] a ?class } ORDER BY ?class</a></code></li> <li>Find all properties – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=SELECT+DISTINCT+%3Fproperty%0D%0AWHERE+%7B+%5B%5D+%3Fproperty+%5B%5D+%7D%0D%0AORDER+BY+%3Fproperty”>SELECT DISTINCT ?property WHERE { [] ?property [] } ORDER BY ?property</a></code></li> <li>Find URIs of all finding aids – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+hub%3A%3Chttp%3A%2F%2Fdata.archiveshub.ac.uk%2Fdef%2F%3E+SELECT+%3Furi+WHERE+%7B+%3Furi+%3Fo+hub%3AFindingAid+%7D”>PREFIX hub:<http://data.archiveshub.ac.uk/def/> SELECT ?uri WHERE { ?uri ?o hub:FindingAid }</a></code></li> <li>Find URIs of all MARC records – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+mods%3A%3Chttp%3A%2F%2Fsimile.mit.edu%2F2006%2F01%2Fontologies%2Fmods3%23%3E+SELECT+%3Furi+WHERE+%7B+%3Furi+%3Fo+mods%3ARecord+%7D%0D%0A%0D%0A%0D%0A”>PREFIX mods:<http://simile.mit.edu/2006/01/ontologies/mods3#> SELECT ?uri WHERE { ?uri ?o mods:Record }</a></code></li> <li>Find all URIs of all collections – <code><a href=”http://infomotions.com/sandbox/liam/sparql/?query=PREFIX+mods%3A%3Chttp%3A%2F%2Fsimile.mit.edu%2F2006%2F01%2Fontologies%2Fmods3%23%3E%0D%0APREFIX+hub%3A%3Chttp%3A%2F%2Fdata.archiveshub.ac.uk%2Fdef%2F%3E%0D%0ASELECT+%3Furi+WHERE+%7B+%7B+%3Furi+%3Fo+hub%3AFindingAid+%7D+UNION+%7B+%3Furi+%3Fo+mods%3ARecord+%7D+%7D%0D%0AORDER+BY+%3Furi%0D%0A”>PREFIX mods:<http://simile.mit.edu/2006/01/ontologies/mods3#> PREFIX hub:<http://data.archiveshub.ac.uk/def/> SELECT ?uri WHERE { { ?uri ?o hub:FindingAid } UNION { ?uri ?o mods:Record } } ORDER BY ?uri</a></code></li> </ul> <p>This is a list of ontologies (namespaces) used in the triple store as predicates:</p> $list <p>For more information about SPARQL, see:</p> <ol> <li><a href=”http://www.w3.org/TR/rdf-sparql-query/” target=”_blank”>SPARQL Query Language for RDF</a> from the W3C</li> <li><a href=”http://en.wikipedia.org/wiki/SPARQL” target=”_blank”>SPARQL</a> from Wikipedia</li> </ol> <p>Source code — <a href=”http://infomotions.com/sandbox/liam/bin/sparql.pl”>sparql.pl</a> — is available online.</p> <hr /> <p> <a href=”mailto:eric_morgan\@infomotions.com”>Eric Lease Morgan <eric_morgan\@infomotions.com></a><br /> January 6, 2014 </p> </body> </html> EOF } sub namespaces { my %namespaces = ( “crm” => “http://erlangen-crm.org/current/”, “dc” => “http://purl.org/dc/elements/1.1/”, “dcterms” => “http://purl.org/dc/terms/”, “event” => “http://purl.org/NET/c4dm/event.owl#”, “foaf” => “http://xmlns.com/foaf/0.1/”, “lode” => “http://linkedevents.org/ontology/”, “lvont” => “http://lexvo.org/ontology#”, “modsrdf” => “http://simile.mit.edu/2006/01/ontologies/mods3#”, “ore” => “http://www.openarchives.org/ore/terms/”, “owl” => “http://www.w3.org/2002/07/owl#”, “rdf” => “http://www.w3.org/1999/02/22-rdf-syntax-ns#”, “rdfs” => “http://www.w3.org/2000/01/rdf-schema#”, “role” => “http://simile.mit.edu/2006/01/roles#”, “skos” => “http://www.w3.org/2004/02/skos/core#”, “time” => “http://www.w3.org/2006/time#”, “timeline” => “http://purl.org/NET/c4dm/timeline.owl#”, “wgs84_pos” => “http://www.w3.org/2003/01/geo/wgs84_pos#” ); return \%namespaces; }

# package Apache2::LiAM::Dereference; # Dereference.pm – Redirect user-agents based on value of URI. # Eric Lease Morgan <eric_morgan@infomotions.com> # December 7, 2013 – first investigations; based on Apache2::Alex::Dereference # configure use constant PAGES => ‘http://infomotions.com/sandbox/liam/pages/’; use constant DATA => ‘http://infomotions.com/sandbox/liam/data/’; # require use Apache2::Const -compile => qw( OK ); use CGI; use strict; # main sub handler { # initialize my $r = shift; my $cgi = CGI->new; my $id = substr( $r->uri, length $r->location ); # wants RDF if ( $cgi->Accept( ‘text/html’ )) { print $cgi->header( -status => ‘303 See Other’, -Location => PAGES . $id . ‘.html’, -Vary => ‘Accept’ ) } # give them RDF else { print $cgi->header( -status => ‘303 See Other’, -Location => DATA . $id . ‘.rdf’, -Vary => ‘Accept’, “Content-Type” => ‘application/rdf+xml’ ) } # done return Apache2::Const::OK; } 1; # return true or die

Posted in Semantic Web and Linked Data | Comments Off on LiAM source code: Perl poetry

LiAM SPARQL Endpoint

December 15th, 2013

I have implemented a brain-dead and half-baked SPARQL endpoint to a subset of LiAM linked data, but there is the disclaimer. Errors will probably happen because of SPARQL syntax errors. Your milage will vary.

Here are a few sample queries:

Find all triples with RDF Schema labels – PREFIX rdf:<http://www.w3.org/2000/01/rdf-schema#> SELECT * WHERE { ?s rdf:label ?o }
Find all items with MODS subjects – PREFIX mods:<http://simile.mit.edu/2006/01/ontologies/mods3#> SELECT * WHERE { ?s mods:subject ?o }
Find every predicate of every triple – SELECT ?p WHERE { ?s ?p ?o }
Find everything – SELECT * WHERE { ?s ?p ?o }

Source code — sparql.pl — is online.

Posted in Semantic Web and Linked Data | Comments Off on LiAM SPARQL Endpoint

EAD2RDF

November 10th, 2013

I have played with an XSL stylesheet called EAD2RDF with good success.

Archivists use EAD as their “MARC” records. EAD has its strengths and weakness, just like any metadata standard, but EAD is a flavor of XML. As such it lends itself to XSLT processing. EAD2RDF is a stylesheet written by Pete Johnston. After running it through an XSLT 2.0 processor, it outputs an RDF/XML file. (I have made a resulting RDF/XML file available for you to peruse.) The result validates against the W3C RDF Validator but won’t have a graph created, probably because there are so many triples in the result.

I think archivists as well as computer technologists working in archives ought to take a closer look at EAD2RDF.

Posted in Semantic Web and Linked Data | Comments Off on EAD2RDF

OAI2LOD Server

November 10th, 2013

At first glance, a software package called OAI2LOD Server seems to work pretty well, and on a temporary basis, I have made one of my OAI repositories available as Linked Data — http://infomotions.com:2020/

OAI2LOD Server is a software package, written by Bernhard Haslhofer in 2008. Building, configuring, and running the server was all but painless. I think this has a great deal of potential, and I wonder why it has not been more widely exploited. For more information about the server, see “The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data“

Posted in Semantic Web and Linked Data | Comments Off on OAI2LOD Server

TriLUG, open source software, and satisfaction

December 9th, 2011

This is posting about TriLUG, open source software, and satisfaction for doing a job well-done.

A long time ago, in a galaxy far far away, I lived in Raleigh (North Carolina), and a fledgling community was growing called the Triangle Linux User’s Group (TriLUG). I participated in a few of their meetings. While I was interested in open source software, I was not so interested in Linux. My interests were more along the lines of the application stack, not necessarily systems administration nor Internet networking.

I gave a presentation to the User’s Group on the combined use of PHP and MySQL — “Smart HTML pages with PHP“. Because of this I was recruited to write a Web-based membership application. Since flattery will get you everywhere with me, I was happy to do it. After a couple of weeks, the application was put into place and seemed to function correctly. That was a bit more than ten years ago, probably during the Spring of 2001.

The other day I got an automated email message from the User’s Group. The author of the message wanted to know if I wanted to continue my membership? I replied how that was not necessary since I had long since moved away to northern Indiana.

I then got to wondering whether or not the message I received had been sent by my application. It was a long shot, but I enquired anyway. Sure enough, I got a response from Jeff Schornick, a TriLUG board member, who told me “Yes, your application was the tool that had been used.” How satisfying! How wonderful to know that something I wrote more than ten years ago was still working.

Just as importantly, Jeff wanted to know about open source licensing. I had not explicitly licensed the software, something that I only learned was necessary from Dan Chudnov later. After a bit of back and forth, the original source code was supplemented with the GNU Public License, packaged up, and distributed from a Git repository. Over the years the User’s Group had modified it to overcome a few usability issues, and they wanted to distribute the source code using the most legitimate means possible.

This experience was extremely enriching. I originally offered my skills, and they returned benefits to the community greater than the expense of my time. The community then came back to me because they wanted to express their appreciation and give credit where credit was due.

Open source software not necessarily about computer technology. It is just as much, if not more, about people and the communities they form.

Posted in Hacks, Miscellaneous | Comments Off on TriLUG, open source software, and satisfaction

Use & understand: A DPLA beta-sprint proposal

September 1st, 2011

This essay describes, illustrates, and demonstrates how the Digital Public Library of America (DPLA) can build on the good work of others who support the creation and maintenance of collections and provide value-added services against texts — a concept we call “use & understand”.

This document is available in a three of formats: 1) HTML – for viewing on a desktop Web browser, 2) PDF – for printing, the suggested format, and 3) ePub – for reading on your portable device.

Eric Lease Morgan <emorgan@nd.edu>
University of Notre Dame

September 1, 2011

Executive summary
Introduction and assumptions
Find & get
Use & understand
Examples

Measure size
Measure difficulty
Side bar on quantitative bibliographic data
Measure concept
Plot on a timeline
Count word and phrase frequencies
Display in context
Display the proximity of a given word to other words
Display location of word in a text
Elaborate upon and visualize parts-of-speech analysis

Disclaimer
Software
Implementation how-to’s

Measurement services
Timeline services
Frequency, concordance, proximity, and locations in a text services
Parts-of-speech services
Priorities

Quick links

Word frequencies, concordances
Word/phrase locations
Proximity displays
Plato, Aristotle, and Shakespeare
Catholic Portal
Measuring size
Plot on a timeline
Lookup in Wikipedia and plot on a map
Parts-of-speech analysis
Measuring ideas

Summary
About the author

Executive summary

This Digital Public Library of America (DPLA) beta-sprint proposal “stands on the shoulders of giants” who have successfully implemented the processes of find & get — the traditional functions of libraries. We are sure the DPLA will implement the services of find & get very well. To supplement, enhance, and distinguish the DPLA from other digital libraries, we propose the implementation of “services against text” in an effort to support use & understand.

Globally networked computers combined with an abundance of full text, born-digital materials has made the search engines of Google, Yahoo, and Microsoft a reality. Advances in information retrieval have made relevancy ranking the norm as opposed to the exception. All of these things have made the problems of find & get less acute than they used to be. The problems of find & get will never be completely resolved, but they seem adequately addressed for the majority of people. Enter a few words into a search box. Click go. And select items of interest.

Use & understand is an evolutionary step in the processes and functions of a library. These processes and functions enable the reader to ask and answer questions of large and small sets of documents relatively easily. Through the use of various text mining techniques, the reader can grasp quickly the content of documents, extract some of their meaning, and evaluate them more thoroughly when compared to the traditional application of metadata. Some of these processes and functions include: word/phrase frequency lists, concordances, histograms illustrating the location of words/phrases in a text, network diagrams illustrating what author say “in the same breath” when they mention a given word, plotting publication dates on a timeline, measuring the weight of a concept in a text, evaluating texts based on parts-of-speech, supplementing texts with Wikipedia articles, and plotting place names on a world maps.

We do not advocate the use of these services as replacements for “close” reading. Instead we advocate them as tools to supplement learning, teaching, and scholarship – functions of any library.

Use & understand: A video introduction

Introduction and assumptions

Libraries are almost always a part of a larger organization, and their main functions can be divided into collection building, conservation & preservation, organization & classification, and public service. These functions are very much analogous to the elements of the DPLA articulated by John Palfrey: community, content, metadata, code, and tools & services.

This beta-Sprint proposal is mostly about tools & services, but in order to provide the proposed tools & services, we make some assumptions about and build upon the good work of people working on community, content, metadata, and code. These assumptions follow.

First, the community the DPLA encompasses is just about everybody in the United States. It is not only about the K-12 population. It is not only about students, teachers, and scholars in academia. It is not only about life-long learners, the businessperson, or municipal employees. It is about all of these communities at once and at the same time because we believe all of these communities have more things in common than they have differences. The tools & services described in this proposal can be useful to anybody who is able to read.

Second, the content of the DPLA is not licensed, much of it is accessible in full-text, and freely available for downloading and manipulation. More specifically, this proposal assumes the collections of the DPLA include things like but not necessarily limited to: digitized versions of public domain works, the full-text of open access scholarly journals and/or trade magazines, scholarly and governmental data sets, theses & dissertations, a substantial portion of the existing United States government documents, the archives of selected mailing lists, and maybe even the archives of blog postings and Twitter feeds. Moreover, we assume the DPLA is not merely a metadata repository, but also makes immediately available plain text versions of much of its collection.

Third, this proposal does not assume very many things regarding metadata beyond the need for the most basic of bibliographic information such as unique identifiers, titles, authors, subject/keyword terms, and location codes such as URLs. It does not matter to this proposal how the bibliographic metadata is encoded (MARC, XML, linked data, etc.). On the other hand, this proposal will advocate for additional bibliographic metadata, specifically, metadata that is quantitative in nature. These additions are not necessary for the fulfillment of the proposal, but rather side benefits because of it.

Finally, this proposal assumes the code & infrastructure of the DPLA supports the traditional characteristics of a library. In other words, it is assumed the code & infrastructure of the DPLA provide the means for the creation of collections and the discovery of said items. As described later, this proposal is not centered on the processes of find & get. Instead this proposal assumes the services of find & get are already well-established. This proposal is designed to build on the good work of others who have already spent time and effort in this area. We hope to “stand on the shoulders of giants” in this regard.

Given these assumptions about community, content, metadata, and infrastructure, we will now describe how the DPLA can exploit the current technological environment to provide increasingly useful services to its clientele. Through the process we hope to demonstrate how libraries could evolve and continue to play a meaningful role in our society.

Find & get

While it comes across as trite, with the advent of ubiquitous and globally networked computers, the characteristics of data and information have fundamentally changed. More specifically, since things like books and journals — the traditional meat and potatoes of libraries — no longer need to be manifested in analog forms, their digital manifestations lend themselves to new functionality. For example, digital versions of books and journals can be duplicated exactly, and they are much less limited to distinct locations in space and time. Similarly, advances in information retrieval have made strict Boolean logic applied to against relational databases less desirable to the reader than relevancy ranking algorithms and the application of term frequency/inverse document frequency models against indexes. Combined together these things have made the search engines of Google, Yahoo, and Microsoft a reality. Compared to twenty years ago, this has made the problem of find & get much less acute.

While the problem of find & get will never completely be resolved, many readers (not necessarily librarians) feel the problem is addressed simply enough. Enter a few words into a search box, click Go, and select items of interest. We don’t know about you, but we can find plenty of data & information. The problem now is what to do with it once it is identified.

We are sure any implementation of the DPLA will include superb functionality for find & get. In fact, our proposal assumes such functionality will exist. Some infrastructure will be created allowing for the identification of relevant content. At the very least this content will be described using metadata and/or the full-text will be mirrored locally. This metadata and/or full-text will be indexed and a search interface applied against it. Search results will probably be returned in any number of ordered lists: relevancy, date, author, title, etc. The interface may very well support functionality based on facets. The results of these searches will never be perfect, but in the eyes of most readers, the results will probably be good enough. This being the case, our proposal is intended to build on this good work and enable the reader to do things with content they identify. Thus we propose to build on the process of find & get to support a process we call use & understand.

Use & understand

The problem of find & get is always a means to an end, and very rarely the end itself. People want to do things with the content they find. We call these things “services against texts”, and they are denoted by action verbs including but not limited to:

* analyze * annotate * cite * compare & contrast * confirm * count & tabulate words, phrases, and ideas * delete * discuss * evaluate * find opposite * find similar * graph & visualize * learn from * plot on a map * plot on a timeline * purchase * rate * read * review * save * share * summarize * tag * trace idea * transform

We ask ourselves, “What services can be provisioned to make the sense of all the content one finds on the Internet or in a library? How can the content of a digital work be ‘read’ in such a way that key facts and concepts become readily apparent? And can this process be applied to an entire corpus and/or a reader’s personal search results?” Thus, we see the problem of find & get evolving into the problem of use & understand.

In our opinion, the answers to these questions lie in the combination of traditional library principles with the application of computer science. Because libraries are expected to know the particular information needs of their constituents, libraries are uniquely positioned to address the problem of use & understand. What do people do with the data and information they find & get from libraries, or for that matter, any other place? In high school and college settings, students are expected to read literature and evaluate it. They are expected to compare & contrast it with similar pieces of literature, extract themes, and observe how authors use language. In a more academic setting scholars and researchers are expected to absorb massive amounts of non-fiction in order to keep abreast of developments in their fields. Each disciplinary corpus is whittled down by peer-review. It is reduced through specialization. Now-a-days the corpus is reduced even further through the recommendation processes of social networking. The resulting volume of content is still considered overwhelming by many. Use & understand is a next step in the information flow. It comes after find & get, and it is a process enabling the reader to better ask and answer questions of an entire collection, subcollection, or individual work. By applying digital humanities computing process, specifically text mining and natural language processing, the process of use & understand can be supported by the DPLA. The examples in the following sections demonstrate and illustrate how this can be done.

Again, libraries are almost always a part of a larger organization, and there is an expectation libraries serve their constituents. Libraries do this in any number ways, one of which is attempting to understanding the “information needs” of the broader organization to provide both just-in-time as well as just-in-case collections and services. We are living, working, and learning in an environment of information abundance, not scarsity. Our production economy has all but migrated to a service economy. One of the fuels of service economies is data and information. As non-profit organizations, libraries are unable to compete when it comes to data provision. Consequently libraries may need to refocus and evolve. By combining its knowledge of the reader with the content of collections, libraries can fill a growing need. Because libraries are expected to understand the partiular needs of their particular clientele, libraries are uniquely positioned to fill this niche. Not Google. Not Yahoo. Not Microsoft.

Examples

Measure size

One of the simplest and most rudimentary services against texts the DPLA could provide in order to promote use & understand is to measure the size of documents in terms of word counts in addition to page counts.

Knowing the size of a document is important to the reader because it helps them determine the time necessary to consume the document’s content as well as implies the document’s depth of elaboration. In general, shorter books require less time to read, and longer books go into greater detail. But denoting the sizes of books in terms of page counts is too ambiguous to denote length. For any given book, a large print addition will contain more pages than the same book in paperback form, which will be different again from its first edition hard cover manifestation.

Not only can much of the ambiguity of document lengths be eliminated if they were denoted with word counts, but if bibliographic descriptions were augmented with word counts then meaningful comparisons between texts could easily be brought to light.

Suppose the DPLA has a collection of one million full-text items. Suppose the number of words in each item were counted and saved in bibliographic records. Thus, search results could then be sorted by length. Once bibliographic records were supplemented with word counts it would be possible to calculate the average length of a book in the collection. Similarly, the range of lengths could be associated with a relative scale such as: tiny books, short books, average length books, long books, and tome-like books. Bibliographic displays could then be augmented with gauge-like graphics to illustrate lengths.

Such was done against the Alex Catalogue of Electronic Texts. There are (only) 14,000 full-text documents in the collection, but after counting all the words in all the documents it was determined that the average length of a document is about 150,000 words. A search was then done against the Catalogue for Charles Dickens’s A Christmas Carol, Oliver Twist and David Copperfield, and the lengths of the resulting documents were compared using gauge-like graphics, as illustrated below:

A Christmas Carol

Oliver Twist

David Copperfield

At least a couple of conclusions can be quickly drawn from this comparison. A Christmas Carol is much shorter than David Copperfield, and Oliver Twist is an average length document.

There will certainly be difficulties counting the number of words in documents. Things will need to be considered in order to increase accuracy, things like: whether or not the document in question has been processed with optical character recognition, whether or not things like chapter headers are included, whether or not back-of-the-book indexes are included, whether nor not introductory materials are included. All of this also assumes a parsing program can be written which accurately extracts “words” from a document. The later is, in fact, fodder for an entire computer science project.

Despite these inherent difficulties, denoting the number of words in a document and placing the result in bibliographic records can help foster use & understand. We believe counting the number of words in a document will result in a greater number of benefits when compared to costs.

Measure difficulty

Measuring the inherent difficulty — readability score — of texts enables the reader to make judgements about those texts, and in turn, fosters use & understand. By including such measurements in the bibliographic records and search results, the DPLA will demonstrate ways it can “save the time of the reader”.

In the last century J. Peter Kincaid, Rudolf Flesch, and Robert Gunning worked both independently as well as collaboratively to create models of readability. Based on a set of factors (such as but not limited to: lengths of documents measured in words, the number of paragraphs in documents, the number of sentences in paragraphs, the number of words in sentences, the complexity of words, etc.) numeric values were calculated to determined the reading levels of documents. Using these models things like Dr. Seuss books are consistently determined to be easy to read while things like insurance policies are difficult. Given the full-text of a document in plain text form, it is almost trivial to compute any number of readability scores. The resulting values could be saved in bibliographic records, and these values could be communicated to the reader with the use of gauge-like graphics.

In a rudimentary way, the Alex Catalogue of Electronic texts has implemented this idea. For each item in the Catalogue the Fog, Flesch, and Kincaid readability scores have been calculated and saved to the underlying MyLibrary database. Searches were done against the Catalogue for Charles Dickens’s David Copperfield, Henry David Thoreau’s Walden, and Immanual Kant’s Fundamental Principles Of The Metaphysics Of Morals. The following graphics illustrate the readability scores of each. We believe the results are not surprising, but they are illustrative of this technique’s utility:

David Cooperfield

Walden

Metaphysics of Morals

If readability scores were integrated into bibliographic search engines (“catalogs”), then it would be possible to limit search results by reading level or even sort search results by them. Imagine being able to search a library catalog for all items dealing with Neo-Platonism, asking for shorter items as opposed to longer items, and limiting things further by readability score.

Readability scores are not intended to be absolute. Instead they are intended to be used as guidelines. If the reader is a novice when it comes to particular topic, and the reader is of high school age, that does not mean they are unable to read college level material. Instead, the readability scores would be used to set the expectations of the reader and help them make judgements before they begin reading a book.

Side bar on quantitative bibliographic data

Bibliographic systems are notoriously qualitative in nature making the process of compare & contrast between bibliographic items very subjective. If there were more quantitative data associated with bibliographic records, then mathematical processes could be applied against collections as a whole, subsets of the collection, or even individual items.

Library catalogs are essentially inventory lists denoting what a library owns (or licenses). For the most part, catalogs are used to describe the physical nature of a library collection: authors, titles, publication dates, pagination and size, notes (such as “Includes index.”), and subject terms. Through things like controlled vocabularies and authority lists, the nature of a collection can be posited, and some interesting questions can be answered. Examples include: what is the average age of the items in the collection, what are the collection’s major subject areas, who are the predominate authors of the works in the collection. These are questions whose answers are manifested now-a-days through faceted browse interfaces, but they are questions of the collection as a whole or subsets of the collection, not individual works. They are questions librarians find interesting, not necessarily readers who want to evaluate the significance of a given work.

If the bibliographic systems were to contain quantitative data, then the bibliographic information systems would be more meaningful and more useful. Dates are a very good example. The dates (years) in a library catalog denote when the item in hand (a book) was published, not when the idea in the book was manifested. Consequently, if Plato’s Dialogs were published today, then its library catalog record would have a value of 2011. While such a thing is certainly true, it is misleading. Plato did not write the Dialogs this year. They were written more than 2,500 years ago. Given our current environment, why can’t a library catalog include this sort of information?

Suppose the reader wanted to read all the works of Henry David Thoreau. Suppose the library catalog had accurately denoted the all the items in its collection by this author with the authority term, “Thoreau, Henry David”. Suppose the reader did an author search for “Thoreau, Henry David” and a list of twenty-five items was returned. Finally, suppose the reader wanted to begin by reading Thoreau’s oldest work first and progress to his latest. Using a library catalog, such a thing would not be possible because the dates in bibliographic records denote the date of publication, not the date of first conception or manifestation.

Suppose the reader wanted to plot on a timeline when Thoreau’s works were published, and the reader wanted to compare this with the complete works of Longfellow or Walt Whitman. Again, such a thing would not be possible because the dates in a library catalog denote publication dates, not when ideas were originally manifested. Why shouldn’t a library catalog enable the reader to easily create timelines?

To make things even more complicated, publication dates are regularly denoted as strings, not integers. Examples include: [1701], 186?, 19–, etc. These types of values are ambiguous. Their meaning and interpretation is bound to irregularly implemented “syntactical sugar”. Consequently, without all but heroic efforts, it is not easy to do any sort of compare & contrast evaluation when it comes to dates.

The DPLA has the incredible opportunity to make a fresh start when it comes to the definition of library catalogs. We know the DPLA will not want to reinvent the wheel. At the same time we believe the DPLA will want to exploit the current milieu, re-evaluate the possibilities of computer technology, and consequently refine and evolve the meaning of “catalog”. Traditional library catalogs were born in an era of relative information scarcity. Today we are dealing with problems of abundance. Library catalogs need to do many things differently in order to satisfy the needs/desires of the current reader. “Next-generation library catalogs” can do so much more than provide access to local collections. Facilitating ways to evaluate collections, sub-collections, or individual items through the use of quantitative analysis is just one example.

Measure concept

By turning a relevancy ranking algorithm on its head, it is be possible to measure the existence of concepts of a given work. If this were done for many works, then new comparisons between works would be possible, and again, making it possible for the reader to easily compare & contrast items in a corpus or search results. Of all the services against texts examples in this proposal, we know this one is the most avant-garde.

Term frequency/inverse document frequency (TFIDF) is a model at the heart of many relevancy ranking algorithms. Mathematically stated, TFIDF equals:

( c / t ) * log( d / f )

where:

c = number of times the query terms appear in a document
t = total number of words in a document
d = total number of documents in a corpus
f = total number of documents containing the query terms

In other words, TFIDF calculates relevancy (“aboutness”) by multiplying the ratio of query words and document sizes to the ratio of number of documents in a corpus and total frequency of query terms. Thus, if there are three documents each containing the word “music” three times, but one of them is 100 words long and the other two are 200 words long, then the first document is considered more relevant than the other two.

Written language — which is at the very heart of library content — is ambiguous, nuanced, and dynamic. Few, if any, concepts can be completely denoted by a single word or phrase. Instead, a single concept may be better described using a set of words or phrases. For example, music might be denoted thusly:

art, Bach, Baroque, beat, beauty, blues, composition, concert, dance, expression, guitar, harmony, instrumentation, key, keyboard, melody, Mozart, music, opera, percussion, performance, pitch, recording, rhythm, scale, score, song, sound, time, violin

If any document used some or all of these words with any degree of frequency, then it would probably be safe to say the document was about music. This “aboutness” could then be calculated by summing the TFIDF scores of all the music terms in a given document — a thing called the “document overlap measure”. Thus, one document might have a total music “aboutness” measure of 105 whereas another document might have a measure of 55.

We used a process very similar to the one outlined above in an effort to measure the “greatness” of the set of books called The Great Books Of The Western World. Each book in the set was evaluated in terms of it use of the 102 “great ideas” enumerated in the set’s introduction. We summed the computed TFIDF values of each great idea in each book, a value we call the Great Ideas Coefficient. Through this process we determined the “greatest” book in the set was Aristotleʼs Politics because it alluded to the totality of “great ideas” more than the others. Furthermore, we determined that Shakespeare wrote seven of the top ten books when it comes to the idea of love. The following figure illustrates the result of these comparisons. The bars above the line represent books greater than the hypothetical average great book, and the bars below the line are less great than the others.

Measuring the “greatness” of The Great Books of the Western World

The DPLA could implement very similar services against texts in one and/or two ways. First, it could denote any number of themes (like music or “great ideas”) and calculate coefficients denoting the aboutness of those themes for every book in the collection. Readers could then limit their searches by these coefficients or sort their search results accordingly. Find all books with subjects equal to philosophy. Sort the result by the philosophy coefficient.

Second, and possibly better, the DPLA could enable readers to denote their own more specialized and personalized themes. These themes and their aboutness coefficients could then be applied, on-the-fly, to search results. For example, find all books with subject terms equal to gardening, and sort the result by the reader’s personal definition of biology.

As stated earlier, written language is ambiguous and nuanced, but at the same time it is, to some degree, predicable. If it were not predicable, then no one would be able to understand another. Because of this predicability, language, to some degree, can be quantified. Once quantified, it can be measured. Once measured it can be sorted and graphed, and thus new meanings can be expressed and evaluated. The coefficients described in this section, like the measurements of length and readability, are to be taken with a grain of salt, but they can help the reader use & understand library collections, sub-collections, and individual items.

Plot on a timeline

Plotting things on a timeline is an excellent way to put events into perspective, and when written works are described with dates, then they are amenable to visualizations.

The DPLA could put this idea into practice by applying it against search results. The reader could do a search in the “catalog”, and the resulting screen could have a link labeled something like “Plot on a timeline”. By clicking the link the dates of search results could be extracted from the underlying metadata, plotted on a timeline, and displayed. At the very least such a function would enable the reader to visualize when things were published and answer rudimentary questions such as: are there clusters of publications, do the publications span a large swath of time, did one particular author publishing things on regular basis?

The dates in traditional bibliographic metadata denote the publication of an item, as mentioned previously. Consequently the mapping of monographs may not be useful as desired. On the other hand, the dates associated with things of a serial nature (blog postings, twitter feeds, journal articles, etc.) are more akin to dates of conception. We imagine the DPLA systematically harvesting, preserving, and indexing freely available and open access serial literature. This content is much more amenable to plotting on a timeline as illustrated below:

Timeline illustrating when serial literature was published

The timeline was created by aggregating selected RSS feeds, parsing out the dates, and plotting them accordingly. Different colored items represent different feeds. Each item in the timeline is hot providing the means to read the items’ abstracts and optionally viewing the items’ full text.

Plotting things on a timeline is another way the DPLA can build on the good work of find & get and help the reader use & understand.

Count word and phrase frequencies

Akin to traditional back-of-the-book indexes, word and phrase frequency tabulations are one of the simplest and most expedient ways of providing access to and overviews of a text. Like tables of contents and indexes, word and phrase frequecies increase a text’s utility and make texts easier to understand.

Back-of-the-book indexes are expensive to create and the product of an individual’s perspective. Moreover, back-of-the-book indexes are not created for fiction. Why not? Given the full-text of a work any number of back-of-the-book index-like displays could be created to enhance the reader’s experience. For example, by simply tabulating the occurrences of every word in a text (sans, maybe, stop words), and then displaying the resulting list alphabetically, the reader can have a more complete back-of-the-book index generated for them without the help of a subjective indexer. The same tabulation could be done again but instead of displaying the content alphabetically, the results could be ordered by frequency as in a word cloud. In either case each entry in the “index” could be associated with an integer denoting the number of times the word (or phrase) occurs in the text. The word (or phrase) could then be linked to a concordance (see below) in order to display how the word (or phrase) was used in context.

Take for example, Henry David Thoreaus’s Walden. This is a piece of non-fiction about a man who lives alone in the woods by a pond for just about two years. In the book’s introduction Ralph Waldo Emerson describes Thoreau as a man with a keen sense of physical space and an uncanny ability for measurement. The book itself describes one person’s vision of what it means to be human. Upon the creation and display of the 100 most frequently used two-word phrases (bigrams), these statements about the book are born out. Notice the high frequency of quantitative references as well as reference to men:

Compare Walden to James Joyce’s Ulysses, a fictional work describing a day in the life of Leopold Bloom as he walks through Dublin. Notice how almost every single bigram is associated with the name of a person

Interesting? Some people may react to these illustrations and say, “So what? I already knew that.” To which we reply, “Yes, but what about those people who haven’t read these texts?” Imagine being able to tabulate the word frequencies against any given set of texts — a novel, a journal article, a piece of non-fiction, all of the works by a given author or in a given genre. The results are able to tell the reader things about the works. For example, it might alert the reader to the central importance of a person named Bloom. When Bloom is mentioned in the text, then maybe the reader ought to be extra attention to what is being said. Frequency tabulations and word cloud can also alert the reader to what is not said in a text. Apparently religion is not a overarching theme in either of the above examples.

The 100 most frequent two-word phrases in Walden

The 100 most frequent two-word phrases in Ulysses

It is possible to tabulate word frequencies across texts. Again, using A Christmas Carol, Oliver Twist, and David Copperfield as examples, we discover the 6-word phrase “taken with a violent fit of” appears in both David Copperfield and A Christmas Carol. Moreover, the bigram “violent fit” appears on all three works. Specifically, characters in these three Dickens stories have violent fits of laughter, crying, trembling, and coughing. By concatenating the stories together and applying concordancing methods to them (described below) we see there are quite a number of violent things in the three stories:

  n such breathless haste and violent agitation, as seemed to betoken so
  ood-night, good-night!' The violent agitation of the girl, and the app
  sberne) entered the room in violent agitation. 'The man will be taken,
  o understand that, from the violent and sanguinary onset of Oliver Twi
  one and all, to entertain a violent and deeply-rooted antipathy to goi
  eep a little register of my violent attachments, with the date, durati
  cal laugh, which threatened violent consequences. 'But, my dear,' said
  in general, into a state of violent consternation. I came into the roo
  artly to keep pace with the violent current of her own thoughts: soon
  ts and wiles have brought a violent death upon the head of one worth m
   There were twenty score of violent deaths in one long minute of that
  id the woman, making a more violent effort than before; 'the mother, w
   as it were, by making some violent effort to save himself from fallin
  behind. This was rather too violent exercise to last long. When they w
   getting my chin by dint of violent exertion above the rusty nails on
  en who seem to have taken a violent fancy to him, whether he will or n
  peared, he was taken with a violent fit of trembling. Five minutes, te
  , when she was taken with a violent fit of laughter; and after two or
  he immediate precursor of a violent fit of crying. Under this impressi
  and immediately fell into a violent fit of coughing: which delighted T
  of such repose, fell into a violent flurry, tossing their wild arms ab
   and accompanying them with violent gesticulation, the boy actually th
  ght I really must have laid violent hands upon myself, when Miss Mills
   arm tied up, these men lay violent hands upon him -- by doing which,
   every aggravation that her violent hate -- I love her for it now -- c
   work himself into the most violent heats, and deliver the most wither
  terics were usually of that violent kind which the patient fights and
   me against the donkey in a violent manner, as if there were any affin
   to keep down by force some violent outbreak. 'Let me go, will you,--t
  hands with me - which was a violent proceeding for him, his usual cour
  en.' 'Well, sir, there were violent quarrels at first, I assure you,'
  revent the escape of such a violent roar, that the abused Mr. Chitling
  t gradually resolved into a violent run. After completely exhausting h
  , on which he ever showed a violent temper or swore an oath, was this
  ullen, rebellious spirit; a violent temper; and an untoward, intractab
  fe of Oliver Twist had this violent termination or no. CHAPTER III REL
  in, and seemed to presage a violent thunder-storm, when Mr. and Mrs. B
  f the theatre, are blind to violent transitions and abrupt impulses of
  ming into my house, in this violent way? Do you want to rob me, or to

These observations simply beg other questions. Is violence a common theme in Dickens’ works? What other adjectives are used to a greater or lesser degree in Dickens’ works? How do the use of these adjectives differ from other authors of the same time period or within the canon of English literature?

While works of fiction are the basis of most of the examples, there is no reason why similar processes couldn’t be applied to non-fiction as well. We also understand that the general reader will not be interested in these sorts of services against texts. Instead we see these sorts of services more applicable to students in high school and college. We also see these sorts of services being applicable to the scholar or researcher who needs to “read” large numbers of journal article. Finally, we do not advocate the use of these sorts of tools as a replacement for traditional “close” reading. These tools are supplements and additions to the reading process just as tables of contents and back-of-the-book indexes are today.

Display in context

Concordances — one of the oldest literary tools in existence — have got to be some of the more useful services against texts a library could provide because they systematically display words and concepts within the context of the larger written work making it very easy to compare & contrast usage. Originally implemented by Catholic priests as early as 1250 to study religious texts, concordances (sometimes called “key word in context” or KWIC indexes) trivialize the process of seeing how a concept is expressed in a work.

As an example of how concordances can be used to analyze texts, we asked ourselves, “How do Plato, Aristotle, and Shakespeare differ in their definition of man?” To answer this question we amassed all the works of the authors, searched each for the phrase “man is”, and displayed the results in a concordance-like fashion. From the results the reader can see how the definitions of Plato and Aristotle are very similar but much different from Shakespeare’s:

Plato’s definitions

  stice, he is met by the fact that man is a social being, and he tries to harmoni
  ption of Not-being to difference. Man is a rational animal, and is not -- as man
  ss them. Or, as others have said: Man is man because he has the gift of speech;
  wise man who happens to be a good man is more than human (daimonion) both in lif
  ied with the Protagorean saying, 'Man is the measure of all things;' and of this

Aristotle’s definitions

  ronounced by the judgement 'every man is unjust', the same must needs hold good
  ts are formed from a residue that man is the most naked in body of all animals a
  ated piece at draughts. Now, that man is more of a political animal than bees or
  hese vices later. The magnificent man is like an artist; for he can see what is
  lement in the essential nature of man is knowledge; the apprehension of animal a

Shakespeare’s definitions

   what I have said against it; for man is a giddy thing, and this is my conclusio
   of man to say what dream it was: man is but an ass, if he go about to expound t
  e a raven for a dove? The will of man is by his reason sway'd; And reason says y
  n you: let me ask you a question. Man is enemy to virginity; how may we barricad
  er, let us dine and never fret: A man is master of his liberty: Time is their ma

We do not advocate the use of concordances as the be-all and end-all of literary analysis but rather a pointer to bigger questions. Think how much time and energy would have been required if the digitized texts of each of these authors was not available, and if computers could not be applied against them. Concordances, as well as the other services against texts outlined in this proposal, make it easier to ask questions of collections, sub-collections, and individual works. This ease-of-use empowers the reader to absorb, observe, and learn from texts in ways that was not possible previously. We do not advocate these sort of services against texts as replacements for traditional reading processes, but rather we advocate them as alternative and supplemental tools for understanding the human condition or physical environment as manifested in written works.

Herein lies one of the main points of our proposal. By creatively exploiting the current environment where full-text abounds and computing horsepower is literally at everybody’s fingertips, libraries can assist the reader to “read” texts in new and different ways — ways that make it easier to absorb larger amounts of information and ways to understand it from new and additional perspectives. Concordances are just one example.

Display the proximity of a given word to other words

Visualizing the words frequently occurring near a given word is often descriptive and revealing. With the availability of full-text content, creating such visualization is almost trivial and have the potencial for greatly enhancing the reader’s experience. This enhanced reading process is all but impossible when the written word is solely accessible in analog forms, but in a digital form the process is almost easy.

For example, first take the word woodchuck as found in Henry David Thoreau’s Walden. Upon reading the book the reader learns of his literal distaste for the woodchuck. They eat is beans, and he wants to skin them. Compare the same author’s allusions to woodchucks in his work Two Weeks On The Concord And Merrimack Rivers. In this work, when woodchucks are mentioned he also alludes to other small animals such as foxes, minks, muskrats, and squirrels. In other words, the connotations surrounding woodchucks and between the two books are different as illustrated by the following network diagrams:

“woodchuck” in Walden

“woodchuck” in Rivers

The given word — woodchuck — is in the center. Each of the words connected to the given word are the words appearing most frequently near the given word. This same process is then applied to the connected words. Put another way, these network diagrams literally illustrate what an author says, “in the same breath” when they use a given word. Such visualizations are simply not possible through the process of traditional reading without spending a whole lot of effort. The DPLA could implement the sort of functionality described in this section and make the reader’s experience richer. It demonstrates how libraries can go beyond access (a problem that is increasingly not a problem) and move towards use & understand.

We do not advocate the use of this technology to replace traditional analysis, but rather to improve upon it. This technology, like all of the examples in the proposal, makes it easier to find interesting patterns for further investigation.

Display location of word in a text

Sometimes displaying where in a text, percentage-wise, a word or phrase exists can raise interesting questions, and by providing tools to do such visualizations the DPLA will foster the ability to more easily ask interesting questions.

For example, what comes to mind when you think of Daniel Defoe’s Robinson Curose? Do you think of a man shipwrecked on an island and the cannibal named Friday? Ask yourself, when in the story is the man shipwrecked and when does he meet Friday? Early in the story? In the middle? Towards the end? If you guessed early in the story, then you would be wrong because most of the story takes place on a boat, and only three-quarters of the way through the book does Friday appear, as illustrated by the following histogram:

We all know that Herman Melville’s book Moby Dick is about a sailor hunting a great white whale. Looking at a histogram of where the word “white” appears in the story, we see a preponderance of its occurrence forty percent the way through the book. Why? Upon looking at the book more closely we see that one of the chapters is entitled “The Whiteness of the Whale”, and it is almost entirely about the word “white”. This chapter appears about forty percent through the text. Who ever heard of an entire book chapter whose theme was a color?

“friday” in Crusoe

“white” in Moby Dick

In a Catholic pamphlet entitled Letters of an Irish Catholic Layman the word “catholic” is one of the more common and appears frequently in the text towards the beginning as well as the end

“catholic” in Layman

“lake erie” in Layman

“niagara falls” in Layman

After listing the most common two-word phrases in the book we see that there are many references to places in upper New York state:

The 100 most frequently used two-word phrases in Letters of an Irish Catholic Layman

Looking more closely at the locations of “Lake Erie” and “Niagra Falls” in the text, we see that these things are referenced in the places where the word “catholic” is not mentioned

Does the author go off on a tangent? Are there no catholics in these areas? The answers to the questions, and the question of why are left up to the reader, but the important point is the ability to quickly “read” the texts in ways that were not feasible when the books were solely in analog form. Displaying where in a text words or phrases occur literally illustrates new ways to view the content of libraries. These are examples of how the DPLA can build on find & get and increase use & understand.

Elaborate upon and visualize parts-of-speech analysis

Written works can be characterized through parts-of-speech analysis. This analysis can be applied to the whole of a library collection, subsets of the collection, or individual works. The DPLA has the opportunity to increase the functionality of a library by enabling the reader to elaborate upon and visualize parts-of-speech analysis. Such a process will facilitate greater use of the collection and improve understanding of it.

Because the English language follows sets of loosely defined rules, it is possible to systematically classify the words and phrases of written works into parts-of-speech. These include but are not limited to: nouns, pronouns, verbs, adjectives, adverbs, prepositions, punctuation, etc. Once classified, these parts-of-speech can be tabulated and quantitative analysis can begin.

Our own foray’s into parts-of-speech analysis, where the relative percentage use of parts-of-speech were compared, proved fruitless. But the investigation inspired other questions whose answers may be more broadly applied. More specifically, students and scholars are often times more interested in what an author says as opposed to how they say it. Such investigations can gleaned not so much from gross parts-of-speech measurements but rather the words used to denote each parts-of-speech. For example, the following table lists the 10 most frequently used pronouns and the number of times they occur in four works. Notice the differences:

Walden	Rivers	Northanger	Sense
I (1,809)	it (1,314)	her (1,554)	her (2,500)
it (1,507)	we (1,101)	I (1,240)	I (1,917)
my (725)	his (834)	she (1,089)	it (1,711)
he (698)	I (756)	it (1,081)	she (1,553)
his (666)	our (677)	you (906)	you (1,158)
they (614)	he (649)	he (539)	he (1,068)
their (452)	their (632)	his (524)	his (1,007)
we (447)	they (632)	they (379)	him (628)
its (351)	its (487)	my (342)	my (598)
who (340)	who (352)	him (278)	they (509)

While the lists are similar, they are characteristic of work from which they came. The first — Walden — is about an individual who lives on a lake. Notice the prominence of the word “I” and “my”. The second — Rivers — is written by the same author as the first but is about brothers who canoe down a river. Notice the higher occurrence of the word “we” and “our”. The later two works, both written by Jane Austin, are works with females as central characters. Notice how the words “her” and “she” appear in these lists but not in the former two. It looks as if there are patterns or trends to be measured here.

If the implementation of the DPLA were to enable the reader to do this sort of parts-of-speech analysis against search results, then the search results may prove to be more useful.

Nouns and pronouns play a special role in libraries because they are the foundation of controlled vocabularies, authority lists, and many other reference tools. Imagine being able to extract and tabulate all the nouns (things, names, and places) from a text. A word cloud like display would convey a lot of meaning about the text. On the other hand, a simple alphabetical list of the result could very much function like a back-of-the-book index. Each noun or noun phrase could be associated with any number of functions such as but not limited to:

look-up in a controlled vocabulary list in order to find more
look-up in an authority list in order to find more
show in context of the given work (concordance)
elaborate upon using a dictionary, thesaurus, encyclopedia, etc.
plot on a map

We demonstrated the beginnings of the look-up functions in a Code4Lib Journal article called “Querying OCLC Web Services for Name, Subject, and ISBN“. The concordance functionality is described above. The elaboration service is common place in today’s ebook readers. Through an interface designed for mobile devices, we implemented a combination of the elaborate and plot on a map services as a prototype. In this implementation the reader is presented with a tiny collection of classic works. The reader is then given the opportunity to browse the names or places index. After the reader selects a specific name or place the application displays a descriptive paragraph of the selection, an image of the selection, and finally, hypertext links to a Wikipedia article or a Google Maps display.


Screen shots of services against texts on a mobile device

Given the amount of full text content that is expected to be in or linked from the DPLA’s collection, there is so much more potential functionality for the reader. The idea of a library being a storehouse of books and journals is rapidly become antiquated. Because content is so readily available on the ‘Net, there is a need for libraries to evolve beyond its stereotypical function. By combining a knowledge of what readers do with information with the possibilities for full text analysis, the DPLA will empower the reader to more easily ask and answer questions of texts. And in turn, make it easier for the reader to use & understand what they are reading.

Disclaimer

People may believe the techiques described herein run contrary to the traditional processes of “close” reading. From our point of view, nothing could be further from the truth. We sincerely believe the techniques described in this proposal suppliment and enhance the reading process.

We are living in an age where we feel like we are drowning in data and information. But according to Ann Blair this is not a new problem. In her book, Too Much to Know, Blair chronicles in great detail the ways scholars since the 3rd Century have dealt with information overload. While they seem obvious in today’s world, they were innovations in their time. They included but were not limited to: copying texts (St. Jerome in the 3rd Century), creating concordances (Hugh St. Cher in the 13th Century), and filing wooden “cards” in a “catalog” (Athanasius Kircher 17th Century).

St. Jerome

Hugh St. Cher

Athanasius Kircher

Think of all the apparatus associated with a printed book. Books have covers, and sometimes there are dust jackets complete with a description of the book and maybe the author. On the book’s spine is the title and publisher. Inside the book there are cover pages, title pages, tables of contents, prefaces & introductions, tables of figures, the chapters themselves complete with chapter headings at the top of every page, footnotes & references & endnotes, epilogues, and an index or two. These extras — tables of contents, chapter headings, indexes, etc. — did not appear in books with the invention of the codex. Instead their existence was established and evolved over time.

In scholarly detail, Blair documents how these extras — as well as standard reference works like dictionaries, encyclopedias, and catalogs — came into being. She asserts the creation of these things became necessary as the number and lengths of books grew. These tools made the process of understanding the content of books easier. They reenforced ideas, and made the process of returning to previously read information faster. Accordingl to Blair, not everybody thought these tools — especially reference works — were a good idea. To paraphrase, “People only need a few good books, and people should read them over and over again. Things like encyclopedias only make the mind weaker since people area not exercising their memories.” Despite these claims, reference tools and the aparatus of printed books continue to exist and our venerable “sphere of knowledge” continues to grow.

Nobody can claim undertanding of a book if they read only the table of contents, flip through the pages, and glance at the index. Yes, they will have some understanding, but it will only be tertiary. We see the tools described in this proposal akin to tables of contents and back-of-the-book indexes. They are tools to find, get, use, and understand the data, information, and knowledge a book contains. They are a natural evolution considering the existence of books in digital forms. The services against texts described in this proposal enhance and supplement the reading process. They make it easier to compare & contrast the content of single books or an entire corpus. They make it faster and easier to extract pertinate information. Like a back-of-the-book index, they make it easier to ask questions of a text and get answers quickly. The tools described in this proposal are not intended to be end-all and be-all of textual analysis. Instead, they are intended to be pointers to interesting ideas, and it is left up to the reader to flesh out and confirm the ideas after closer reading.

Digital humanities investigations and specifically text mining computing techniques like the ones in this proposal can be viewed as modern-day processes for dealing with and taking advantage of information overload. Digital humanists use computers to evaluate all aspects of human expression. Writing. Music. Theator. Dance. Etc. Text mining is a particular slant on the digital humanities applying this evaluation process against sets of words. We are simply advocating these proceses become integrated with library collections and services.

Software

This section lists the software used to create our Beta-Sprint Propoal examples. All of the software is open source or freely accessible. None of the software is one-of-a-kind because each piece could be replaced by something else providing similar functionality.

Alex Catalogue of Electronic Texts – This is a collection and full-text index of approximately 14,000 public domain documents from the areas of American and English literature as well as Western philosophy. This “digital library”, created and maintined by the author since 1994, is a personal “sandbox” and laboratory for the implementation of new ideas in librarianship.
Google Charts – Implemented through a Javascript API (application programmer interface), Google Charts enabled us to create the histograms in the “display location of word in a text service”. It also provided the guage-like graphics for the “measure size” and “measure difficulty” services.
Google Maps – Another Javascript API, Google Maps was a part of the “plot on a map” service.
Lingua::Concordance – A Perl module, Lingua::Concordance was used to implement the “display in context” service. This module was written by the author.
Lingua::EN::Ngram – Another Perl module written by the author, Lingua::EN::Ngram was used to count and tabulate the words and n-length phrases in a given text. It plays a crucial role “count word and phrase frequencies” service.
Lingua::Fathom – This Perl module formed the basis of the “measure size” and “measure difficulty” services since its primary purpose is to calculate Fog, Flesch, and Kincaid readability scores.
Lingua::Stem::Snowball – This Perl module plays a role in the “measure concept” service. Given words as input, it outputs the words’ roots (or “stems”). These roots were then searched against the index of Alex Catalogue to determine the number of documents (f) containing the root. This value was then used to calculate TFIDF.
Lingua::TreeTagger – This a Perl interface to set of cross-platform binary applications whose purpose is to classify parts-of-speech. Lingua::TreeTagger was used to compare & contrast the ways pronouns were used in four classic works of literature.
MyLibrary – This is a digital library framework written in Perl. At its core are modules to manage library resources, librarians, and patron descriptions. Inter-relationships between resources, librarians, and patrons can be controlled through the creation and maintenance of facet/term combinations. MyLibrary was co-written by the author and implemented the concept of facets before faceted browse became popular. MyLibrary, in combination with Solr, forms the functional basis of the Alex Catalogue.
Protovis – This is the Javascript library used to visualize the “display the proximity of a given word to other words” service.
SIMILE Widgets Timeline – This is a Javascript library used to display timelines. It was used in the “plot on a timeline” service.
Solr – Solr is probably the most popular open source indexer in use by the library community, if not else where. It is used to index the full-text of the Alex Catalogue. It was also used to determine the value of f in the “measure concept” service.
Stanford Named Entity Recognizer – This is the set of Java programs used to extract the names and places from a document. These names and places were then linked to Wikipedia or plotted on a map — the “elaborate upon and visualize parts-of-speech” service.

This short list of software can be used to create a myriad of enhanced library services and tools, but the specific pieces of software listed above are not so important in and of themselves. Instead, they represent types of software which already exist and are freely available for use by anybody. Services against texts facilitating use & understand can be implemented with a wide variety of software applications. The services against texts outlined in this proposal are not limited to the software listed in this section.

Implementation how-to’s

Putting into practice the services against text described in this proposal would not be a trivial task, but process is entirely feasible. This section outlines a number of implementation how-to’s.

Measurement services

The measurement services (size, readability, and concept) would idealy be done against texts as they were added to the collection. The actual calculation of the size and readability scores are not difficult. All that is needed is the full text of the documents and software to do the counting. (Measuring concepts necessitates additional work since TFIDF requires a knowledge of the collection as a whole; measuring concepts can only be done once the bulk of the collection has been built. Measuring concepts is also a computationally intensive process.)

Instead, the challenge includes denoting locations to store the metadata, deciding whether or not to index the metadata, and figuring out how to display the metadata to the reader. The measurements themselves will be integers or decimal numbers. If MARC were the container for the bibliogrpahic data, then any one of a number of local notes could be used for storage. If a relational database were used, then additional fields could be used. If the DPLA wanted to enable the reader to limit or sort search results by any of the measurments, then the values will need to be indexed. We would be willing to guess the underlying indexer for the DPLA will be Solr, since it seems to be the current favorite. Indexing the measurements in Solr will be as easy as creating the necessary fields to a Solr configuration file, and adding the measurements to the fields as the balance of the bibliographic data is indexed. We would not suggest creating any visualizations of the measurements ahead of time, but rather on-the-fly and only as they were needed; the visualizations could probably be implemented using Javascript and embedded into the DPLA’s “catalog”.

Timeline services

Like the measurements, plotting the publication dates or dates of conception on a timeline can be implemented using Javascript and embedded into the DPLA’s “catalog”. For serial literature (blogs, open access journal articles, Twitter feeds, etc.) the addition of meaningful dates will have already been done. For more more traditional library catalog materials (books), the addition of dates of conception will be labor intensive. Therefore such a thing might not be feasible. On the other hand, this might be a great opportunity to practice a bit of crowdsourcing. Consider making a game out of the process, and try to get people outside the DPLA to denote when Plato, Thoreau, Longfellow, and Whitman wrote their great works.

Frequency, concordance, proximity, and locations in a text services

Implementing the frequency, concordance, proximity, and locations in a text services require no preprocessing. Instead these services can all be implemented on-the-fly by a program linked from the DPLA’s “catalog”. These services will require a single argument (a unique identifier) and some optional input parameters. Given a unique identifier, the program can look up basic bibliographic information from the catalog including the URL where the full-text resides, retrieve the full-text, and do the necessary processing. This URL could point to the local file system, or, if the network was deemed fast and reliable, the URL could point to the full-text in remote repositories such as the Internet Archive or the HathiTrust. These specific services against texts have been implemented in the Catholic Research Resources Alliance “Catholic Portal” application using “Analyze using text mining techniques” as the linked text. This is illustrated below:

Screen shot of the “Catholic Portal”

By the middle of September 2011 we expect the Hesburgh Libraries at the University of Notre Dame will have included very similar links in their catalog and “discovery system”. These links will provide access to frequency, concordance, and locations in a text services for sets of digitized Catholic pamphlets.

Parts-of-speech services

Based on our experience, the parts-of-speech services will require pre-processing. This is because the process of classifying words into categories of parts-of-speech is a time- and computing-intensive process. It does not seem feasible to extract the parts-of-speech from a document in real time.

To overcome this limitation, we classified our small sample of texts and saved the result in easily parsable text files. Our various scripts were then applied against these surrogates as opposed to the original documents. It should be noted that these surrogates, while not only computationally expensive, were also expensive in terms of disk space consuming more than double the space of the original.

We suggest one or two alternative strategies for the DPLA. First, determine what particular items from the DPLA’s collection may be the more popular. Once determined, have those items pre-processed outputting the surrogate files. These pre-processed items can then be used for demonstration purposes and generate interest in the parts-of-speech services. Second, when readers want to use these services against items that have not been pre-processed, then have the readers select their items, supply an email address, process the content, and notifiy the readers when the surrogates have been created. This second approach is akin to the just-in-time approach to collection development as opposed to the just-in-case philosophy.

Priorities

Obviously, we think all of the services against texts outlined above are useful, but practically speaking, it is not feasible to implement all of them once. Instead we advocate the following phased approach:

Word/phrase frequency, concordance, proximity, and locations in a text services – We suggest these services be implemented first, mostly because they can be written outside any “discovery system” hosted by the DPLA. Second, these services are the root of many of the other services, so it will be easier to build the others once these have been made available.
Measurments of size and readability – Calculating the values of size and readability on-the-fly is possible but is limiting in functionality. Pre-processing these values is relatively easy, and incorporating the result into the “discovery system” has many benefits. This is why we see these two services as the second highest priority.
Plot dates of publication on a timeline – Plotting dates will be easy enough if the content in question is of a serial nature and the dates represent “dates of conception”. But we are not sure content of a serial nature (blog postings, open access journal literature, Twitter feeds, etc.) will be included in the DPLA’s collection. Consequently, we suggest this service be implemented third.
Parts-of-speech analysis – Implementing services based on parts-of-speech will almost certainly require pre-processing as increase local storage requirements. While these costs are withing the DPLA’s control, they are expenses that may inhibit implementation feasibility. That is why they are listed fourth in the priority order.
After crowdsourcing the content, plot dates of conception on a timeline – We think this is one of the easier and more interesting services, especially if the dates in question are “dates of conception” for books, but alas, this data is not readily available. After figuring out how to acquire dates of conception for traditional catalog-like material — through something like crowdsourcing — implementing this service my be very enlightinging.
Measure ideas – This is probably the most avant-garde service described in the proposal. Its implementation can only be done after the bulk of the DPLA’s collection has been created. Furthermore, calculating TFIDF for a set of related keyword is computationally expensive. This can be a truly useful and innovative service, especially if the reader were able to create a personal concept for comparison. But because of the time and expense, we advocate this service be implemented last.

Quick links

This section lists most of the services outlined in the proposal as well as links to blog postings and example implementations.

Word frequencies, concordances

These URLs point to services generating word frequencies, concordances, histograms illustrating word locations, and network diagrams illustrating word proximities for Walden and Ulysses.

Word/phrase locations

Using the text mining techniques built into the “Catholic Portal” the reader can see where the words/phrases “catholic”, “lake erie”, and “niagara falls” are used in the text.

http://www.catholicresearch.net/concordances/?id=tormarc_lettersofirishca00iris

Proximity displays

Using network diagrams, the reader can see what words Thoreau uses “in the same breath” when he mentions the word “woodchuck”. These proximity displays are also incorporated into just about every item in the Alex Catalogue

./2011/01/visualizing-co-occurrences-with-protovis/

Plato, Aristotle, and Shakespeare

This blog posting first tabulates the most frequently used words by the authors, as well as their definitions of “man” and a “good man”.

./2010/06/the-next-next-generation-library-catalog/

Catholic Portal

The “Portal” is collection of rare, uncommon, and infrequently held materials brought together to facilitate Catholic studies. It includes some full text materials, and they are linked to text mining services.

http://www.catholicresearch.net/Record/tormarc_lettersofirishca00iris

Measuring size

In this blog posting a few works by Charles Dickens are compared & contrasted. The comparisons include size and word/phrase usage.

./2010/12/text-mining-charles-dickens/

Plot on a timeline

This blog posting describes how a timeline was created by plotting the publication dates of RSS feeds.

./2010/12/mits-simile-timeline-widget/

Lookup in Wikipedia and plot on a map

After extracting the names and places from a text, this service grabs Linked Data from DBedia, displays a descriptive paragraph, and allows the reader to look the name or place up in Wikipedia and/or plot it on a world map. This service is specifically designed for mobile devices.

http://dh.crc.nd.edu/sandbox/ner/mobile.html

Parts-of-speech analysis

This blog posting elaborates on how various parts of speech were used in a number of selected classic works.

./2011/02/forays-into-parts-of-speech/

Measuring ideas

The “greatness” of the Great Books was evaluated in a number of blog postings, and the two listed here give a good overview of the methodology.

Summary

In our mind, the combination of digital humanities computing techniques — like all the services against texts outined above — and the practices of librarianship would be a marriage made in heaven. By supplementing the DPLA’s collections with full text materials and then enhancing its systems to facilitate text mining and natural language processing, the DPLA can not only make it easier for readers to find data and information, but it can also make that data and information easier to use & understand.

We know the ideas outlined in this proposal are not typical library functions. But we also apprehend the need to take into account the changing nature of the information landscape. Digital content lends itself to a myriad of new possibilities. We are not saying analog forms of books and journals are antiquated nor useless. No, far from it. Instead, we believe the library profession has figured out pretty well how to exploit and take advantage of that medium and its metadata. On the other hand, the posibilities for full text digital content are still mostly unexplored and represent a vast untapped potencial. Building on and expanding on the education mission of libraries, services against texts may be a niche the profession — and the DPLA — can help fill. The services & tools described in this proposal are really only examples. Any number of additional services against texts could be implemented. We are only limited by our ability to think of action words denoting the things people want to do with texts once they find & get them. By augmenting a library’s traditional functions surrounding collection and sevices with the sorts of things described above, the role of libraries can expand and evolve to include use & understand.

About the author

Eric Lease Morgan considers himself to be a librarian first and a computer user second. His professional goal is to discover new ways to use computers to provide better library service. He has a BA in Philosophy from Bethany College in West Virginia (1982), and an MIS from Drexel University in Philadelphia (1987).

While he has been a practicing librarian for more than twenty years he has been writing software for more than thirty. He wrote his first library catalog in 1989, and it won him an award from Computers in Libraries Magazine. In a reaction to the “serials pricing crisis” he implemented the Mr. Serials Process to collect, organize, archive, index, and disseminate electronic journals. For these efforts he was awarded the Bowker/Ulrich’s Serials Librarianship Award in 2002. An advocate of open source software and open access publishing since before the phrases were coined, just about all of his software and publications are freely available online. One of his first pieces of open source software was a database-driven application called MyLibrary, a term which has become a part of the library vernacular.

As a member of the LITA/ALA Top Technology Trends panel for more than ten years, as well as the owner/moderator of a number of library-related mailing lists (Code4Lib, NGC4Lib, and Usability4Lib), Eric has his fingers on the pulse of the library profession. He coined the phrase “‘next-generation’ library catalog”. More recently, Eric has been applying text mining and other digital humanities computing techniques to his Alex Catalogue of Electronic Texts which he has been maintaining since 1994. Eric relishes all aspects of librarianship. He even makes and binds his own books. In his spare time, Eric plays blues guitar and Baroque recorder. He also enjoys folding origami, photography, growing roses, and fishing.

Tags: #dpla, digital humanities, DPLA (Digital Public Library of America), services against texts, text mining
Posted in Librarianship | 1 Comment »

Raising awareness of open access publications

August 2nd, 2011

I was asked the other day about ways to make people aware of open access journal publications, and this posting echoes much of my response.

Thanks again for taking the time this morning to discuss some of the ways open-access journals are using social media and other technology to distribute content and engage readers. I am on the board of [name deleted] recently transitioned to an open access format, and we are looking to maximize the capabilities of this new, free, and on-line format. To that end, any additional insights you might be able to share about effective social media applications for open-access sources, or other exemplary electronic journals you may be able to recommend, would be most helpful.

As you know, I have not been ignoring you as much as I have been out of town. Thank you for your patience.

I am only able to share my personal experiences here, and they are not intended to be standards of best practices. Yet, here are some ideas:

Exploit RSS – RSS is an XML technology used to syndicate content. It is the foundation of blogs. Do what you can to make sure your journal content is syndicated via RSS. This way people can “subscribe” to your journal and they will get alerts when new content becomes available.
Create a mailing list – On your journal’s site, allow people to submit their email addresses. Keep these email addresses in a list (database) and when new issues of your journal are created, send messages to the people in the list. Do not use the list for any other purpose.
Advertise – Identify mailing lists where discussions take place surrounding the topic of your journal. When your journal creates new issues, send a table of contents sort of message to the mailing lists.
Blog about your journal – If you or any of your colleagues who edit the journal blog, then write up things you find interesting in your journal in your blog. As long as your write up are sincere, people will not see this sort of things as self-promotion.
Use Facebook & Twitter – Do you and your editorial colleagues use Facebook or Twitter? Maybe your journal can have a Facebook page and/or a Twitter account. In either case, post messages about your journal on social networks.
Exploit SEO – SEO is code for “search engine optimization” which itself is code for “make it easy for Google to crawl your site”. If Google can easily crawl your site, then your content will more likely appear in Google search results, and therefore you will get more exposure.
Be regular – Publishing serial publications (blogs, journal articles, etc.) is difficult, but I believe your readers will build up trust for you if you make content available on a consistent basis. Otherwise, I think your publication will loose credibility.
Make your content searchable – When people come to your website, make sure people can easily search & browse the backfires. People will say, “I remember seeing an article on that topic at… I wonder if I can find it again?” Put another way, make sure your website is “usable”.
Allow for comments – While the articles you publish go through some sort of review, make it possible for the readership to comment as well. We no longer live in isolation, nor are we governed by the centralized elite. It is increasingly about the wisdom of the crowd.

The right software makes many of the tasks I outlined easier. I suggest you take a look at Open Journal Systems.

Good luck, and I commend you for going the open access route.

Tags: open access
Posted in Librarianship | Comments Off on Raising awareness of open access publications

Poor man’s restoration

July 25th, 2011

This posting describes a poor man’s restoration process.

Yesterday, I spent about an hour and a half writing down a work/professional to-do list intended to span the next few months. I prioritized things, elaborated on things, and felt I like had the good beginnings of an implementable plan.

I put the fruits of my labors into my pocket and then went rowing around in my boat. After my swim and on the way back to the dock I realized my to-do list was still in my pocket. Sigh. After pulling it out I and seeing the state it was in, I decided to try to salvage it. Opening it up was difficult. Naturally, the paper tore, but I laid it down as flat as I could. I went home to get a few pieces of paper to support and sandwich my soaked to-do list. For the next few hours, as the paper dried in the hot weather we are experiencing, I continually flipped and turned the to-do list so it would not stick to its supports.

Page #1

Page #2

This morning, after the list was was a dry as it was going to be, I photographed both sides of it, did my best color-correct the image, converted the whole thing into a PDF file, and printed the result. While the it looks like heck, the time I spent salvaging my intellectual efforts were much shorter than the time I would have spent recreating the list. Like a blues, such recreations are never exactly the same as the originals. But it would have been a whole lot better if I hadn’t gone swimming with my to-do list in the first place.

I might not have done this restoration process in the “best” way, but that does not detract from the effort itself. I really do enjoy all aspects of library work.

Posted in Librarianship | Comments Off on Poor man’s restoration

My DPLA Beta-Sprint Proposal: The movie

July 22nd, 2011

Please see my updated and more complete Digital Public Library of America Beta-Sprint Proposal. The following posting is/was a precursor.

The organizers of the Digital Public Library of America asked the Beta-Sprint Proposers to create a video outlining the progress of their work. Below is the script of my video as well as the video itself. Be gentle with me. Video editing is difficult.

Introduction

My name is Eric Morgan. I am a Digital Projects Librarian here at the University of Notre Dame, and I am going to outline, ever so briefly, my Digital Public Library of America Beta-Sprint Proposal. In a nutshell, the Proposal describes, illustrates, and demonstrates how the core functionality of a library can move away from “find & get” and towards “use & understand”.

Find & get

With the advent of ubiquitous and globally networked computers, the characteristics of data and information have fundamentally changed. More specifically, things like books and journals — the traditional meat and potatoes of libraries — no longer need to be manifested in analog forms, and their digital manifestations lend themselves to new functionality. For example, digital versions of books and journals can be duplicated exactly, and they are much less limited to distinct locations in space and time. This, in turn, has made things like the search engines of Google, Yahoo, and Microsoft a reality. Compared to twenty years ago, this has made the problem of find & get much less acute. While the problem of find & get will never completely be resolved, many people feel the problem is addressed simply enough. Enter a few words into a search box, click Go, and select items of interest.

Use & undertand

The problem of find & get is always a means to an end, and not the end itself. People want to do things with the content they find. I call these things “services against texts” and they are denoted by action verbs such as analyze, annotate, cite, compare & contrast, confirm, delete, discuss, evaluate, find opposite, find similar, graph & visualize, learn from, plot on a map, purchase, rate, read, review, save, share, summarize, tag, trace idea, or transform. Thus, the problem of find & get is evolving into the problem of use & understand. I ask myself, “What services can be provisioned to make the sense of all the content one finds on the Internet or in a library?” In my opinion, the answer lies in the combination of traditional library principles and the application of computer science. Because libraries are expected to know the particular information needs of their constituents, libraries are uniquely positioned to address the problem of use & understand. Not Google. Not Yahoo. Not Microsoft.

Examples

How do we go about doing this? We begin by exploiting the characteristics of the increasingly available of full text content. Instead of denoting the length of a book by the number of pages it contains, we measure it by the number of words. Thus, we will be able to unambiguously compare & contrast the lengths of documents. By analyzing the lengths of paragraphs, the lengths of sentences, and the lengths of words in a document, we will be able to calculate readability scores, and we will be better able to compare & contrast the intended reading levels of a book or article. By tabulating the words or phrases in multiple documents and then comparing those tabulations with each other libraries will make it easier for readers to learn about the similarities and differences between items in a corpus. Such a service will enable people to answer questions like, “How does the use of the phrase ‘good man’ differ between Plato, Aristotle, and Shakespeare?” If there were tools aware of the named people and places in a document, then a reader’s experience could be enriched with dynamic annotations and plots on a world map. Our ability to come up with ideas for additional services against texts is only limited by our imagination and our ability to understand the information needs of our clientele. My Beta Sprint Proposal demonstrates how many of these ideas can be implemented today and with the currently available technology.

Thank you

Thank you for the opportunity to share some of my ideas about the Digital Public Library of America, my Beta Sprint Proposal, and the role of libraries in the near future.

Tags: DPLA (Digital Public Library of America)
Posted in Librarianship | 1 Comment »

DPLA Beta Sprint Submission

June 20th, 2011

I decided to give it a whirl and particpate in the DPLA Beta Sprint, and below is my submission:

DPLA Beta Sprint Submission

My DPLA Beta Sprint submission will describe and demonstrate how the digitized versions of library collections can be made more useful through the application of text mining and various other digital humanities computing techniques.

Full text content abounds, and full text indexing techniques have matured. While the problem of discovery will never be completely solved, it is much less acute than it was even a decade ago. Whether the library profession or academia believes it or not, most people do not feel as if they have a problem finding data, information, and knowledge. To them it is as easy as entering a few words or phrases into a search box and clicking Go.

It is now time to move beyond the problem of find and spend increased efforts trying to solve the problem of use. What does one do with all the information they find and acquire? How can it be put into the context of the reader? What actions can the reader apply against the content they find? How can it be compared & contrasted? What makes one piece of information — such as a book, an article, a chapter, or even a paragraph — more significant than another? How might the information at hand be used to solve problems or create new insights?

There is no single answer to these questions, but this submission will describe and demonstrate one set of possibilities. It will assume the existence of full text content of just about any type — such as books the Internet Archive, open access journals, or blog postings. It will outline how these texts can be analyzed to find patterns, extract themes, and identify anomalies. It will describe how entire corpora or search results can be post-processed to not only refine the discovery process but also make sense of the results and enable the reader to quickly grasp the essence of textual documents. Since actions speak louder than words, this submission will also present a number of loosely joined applications demonstrating how this analysis can be implemented through Web browsers and/or portable computing devices such as tablet computers.

By exploiting the current environment — full text content coupled with ubiquitous computing horsepower — the DPLA can demonstrate to the wider community how libraries can remain relevant in the current century. This submission will describe and demonstrate a facet of that vision.

Posted in Librarianship | Comments Off on DPLA Beta Sprint Submission

Next-generation library catalogs, or ‘Are we there yet?’

June 1st, 2011

Next-generation library catalogs are really indexes, not catalogs, and increasingly the popular name for such things is “discovery system”. Examples include VuFind, Primo combined with Primo Central, Blacklight, Summon, and to a lesser extent Koha, Evergreen, OLE, and XC. While this may be a well-accepted summary of the situation, I really do not think it goes far enough. Indexers address the problem of find, but in my opinion, find is not the problem to be solved. Everybody can find. Most people believe Google has all but solved that problem. Instead, the problem to solve is use. Just as much as people want to find information, they want to use it, to put it into context, and to understand it. With the advent of so much full text content, the problem of find is much easier to solve than it used to be. What is needed is a “next-generation” library catalog including tools and interfaces designed to make the use and understanding of information easier. Both the “Catholic Portal” and the discovery systems of the Hesburgh Libraries at the University of Notre Dame are beginning to implement some of these ideas. When it comes to “next-generation” library catalogs we might ask the question, “Are we there yet?”. I think the answer is, “No, not yet.”

This text was originally written for a presentation to the Rare Books and Manuscripts Section of the American Library Association during a preconference meeting, June 23, 2011. It is available in a number of formats including this blog posting, a one-page PDF document intended as a handout, and an ePub file.

Numbers of choices

There are currently a number of discovery systems from which a library can choose, and it is very important to note that they have more things in common than differences. VuFind, Primo combined with Primo Central, Summon, and Blacklight are all essentially indexer/search engine combinations. Even more, they all use same “free” and open source software — Lucene — at their core. All of them take some sort of bibliographic data (MARC, EAD, metadata describing journal articles, etc.), stuff it into a data structure (made up authors, titles, key words, and control numbers), index it in the way the information retrieval community has been advocating for at least the past twenty years, and finally, provide a way to query the index with either one-box-one-button or fielded interfaces. Everything else — facets, cover art, reviews, favorites, etc. — is window dressing. When and if any sort of OCLC/EBSCOHost combination manifests itself, I’m sure the underlying technology will be very similar.

Koha, Evergreen, and OLE (Open Library Environment) are more traditional integrated library systems. They automate traditional library processes. Acquisitions. Cataloging. Serials Control. Circulation. Etc. They are database applications, not indexers, designed to manage an inventory. Search — the “OPAC” — is one of these processes. The primary difference between these applications and the integrated library systems of the recent past is their distribution mechanism. Koha and Evergreen are open source software, and therefore as “free as a free kitten”. OLE is still in development, but will be distributed as open source. Everything else is/was licensed for a fee.

When talking about “next-generation” library catalogs and “discovery systems”, many people allude to the Extensible Catalog (XC) which is not catalog nor an index. More accurately, it is system enabling and empowering the library community to manage and transform its bibliographic data on a massive scale. It offer ways for a library to harvest content from OAI-PMH data repositories (such as library catalogs), do extensive find/replace or enhancement operations against the harvested data, expose the result via OAI-PMH again, and finally, support the NCIP protocol so the circulation status of items found in an index can be determined. XC is middleware designed to provide functionality between an integrated library system and discovery system.

Find is not the problem

With the availability of wide-spread full text indexing, the need to organize content according to a classification system — to catalog items — has diminished. This need is not negated, but it is not as necessary as it used to be. In the past, without the availability of wide-spread full text indexing, classification systems provided two functions: 1) to organize the collection into a coherent whole with sub-parts, and 2) to surrogate physical items enumerated in a list. The aggregate of metadata elements — whether they be titles, authors, contributors, key words, subject terms, etc. — acted as “dummies” for the physical item containing the information. They are/were pointers to the book, the journal article, the piece of sheet music, etc. With the advent of wide-spread full text indexing, these two functions are not needed as much as they were in the past. Through the use of statistical analysis and direct access to the thing itself, indexers/search engines make the organization and discovery of information easier and less expenses. Note, I did not say “better”, just simpler and with greater efficiency.

Because wide-spread full text indexing abounds, the problem of find is not as acute as it used to be. In my opinion, it is time to move away from the problem of find and towards the problem of use. What does a person do with the information once they find and acquire it? Does it make sense? Is it valid? Does it have a relationship other things, and if so, then what is that relationship and how does it compare? If these relationships are explored, then what new knowledge might one uncover, or what existing problem might be solved? These are the questions of use. Find is a means to an end, not the end itself. Find is a library problem. Use the problem everybody else wants to solve.

True, classification systems provide a means to discover relationships between information objects, but the predominate classification systems and processes employed today are pre-coordinated and maintained by institutions. As such they posit realities that may or may not match the cognitive perception of today’s readers. Moreover, they are manually applied to information objects. This makes the process literally slow and laborious. Compared to post-coordinated and automated techniques, the manual process of applying classification to information objects is deemed expensive and of diminishing practical use. Put another way, the application of classification systems against information objects today is like icing on a cake, leather trim in a car, or a cherry on a ice cream sundae. They make their associated things richer, but they are not essencial their core purpose. They are extra.

Text mining

Through the use of a process called text mining, it is possible to provide new services against individual items in a collection as well as to collections as a whole. Such services can make information more useful.

Broadly defined, text mining is an automated process for analyzing written works. Rooted in linguistics, it makes the assumption that language — specifically written language — adheres to sets of loosely defined norms, and these norms are manifested in combinations of words, phrases, sentences, lines of a poem, paragraphs, stanzas, chapters, works, corpora, etc. Additionally, linguistics (and therefore text mining) also assumes these manifestations embody human expressions, meanings, and truth. By systematically examining the manifestations of written language as if they were natural objects, the expressions, meanings, and truths of a work may be postulated. Such is the art and science of text mining.

The process of text mining begins with counting, specifically, counting the number of words (n) in a document. This results in a fact — a given document is n words long. By comparing n across a given corpus of documents, new facts can be derived, such as one document is longer than another, shorter than another, or close to an average length. Once words have been counted they can be tallied. The result is a list of words and their associated frequencies. Some words occur often. Others occur infrequently. The examination of such a list tells a reader something about the given document. The comparison of frequency lists between documents tells the reader even more. By comparing the lengths of documents, the frequency of words, and their existence in an entire corpus a reader can learn of the statistical significance of given words. Thus, the reader can begin to determine the “aboutness” of a given document. This rudimentary counting process forms the heart of most relevancy ranking algorithms of indexing applications and is called “term frequency inverse document frequency” or TFIDF.

Not only can words be tallied but they can be grouped into different parts-of-speech (POS): nouns, pronouns, verbs, adjectives, adverbs, prepositions, function (“stop”) words, etc. While it may be interesting to examine the proportional use of each POS, it may be more interesting to examine the individual words in each POS. Are the personal pronouns singular or plural? Are they feminine or masculine? Are the names of places centered around a particular geographic location? Do these places exist in the current time, a time in the past, or a time in future? Compared to other documents, is there a relatively higher or lower use of color words, action verbs, names of famous people, or sets of words surrounding a particular theme? Knowing the answers to these questions can be quite informative. Just as these processes can be applied to words they can be applied to phrases, sentences, paragraphs, etc. The results can be charted, graphed, and visualized. They can be used to quickly characterize single documents or collections of documents.

The results of text mining processes are not to be taken as representations of truth, any more than the application of Library of Congress Subject Headings completely denote the aboutness of text. Text mining builds on the inherent patterns of language, but language is fluid and ambiguous. Therefore the results of text mining lend themselves to interpretation. The results of text mining are intended to be indicators, guides, and points of reference, and all of these things are expected to be interpreted and then used to explain, describe, and predict. Nor is text mining intended to be a replacement for the more traditional process of close reading. The results of text mining are akin to a book’s table of contents and back-of-the-book index. They outline, enumerate, and summarize. Text mining does the same. It is a form of analysis and a way to deal with information overload.

Assuming the availability of increasing numbers of full text information objects, a library’s “discovery system” could easily incorporate text mining for the purposes of enhancing the traditional cataloging process as well as increasing the usefulness of found material. In my opinion, this is the essence of a true “next-generation” library catalog.

Two examples

An organization called the Catholic Research Resources Alliance (CRRA) brings together rare, uncommon, and infrequently held materials into a thing colloquially called the “Catholic Portal”. The content for the Portal comes from a variety of metadata formats (MARC, EAD, and Dublin Core) harvested from participating member institutions. Besides supporting the Web 2.0 features we have all come to expect, it also provides item level indexing of finding aids, direct access to digitized materials, and concordancing services. The inclusion of concordance features makes the Portal more than the usual discovery system.

For example, the St. Michael’s College at the University of Toronto is a member of the CRRA. They have been working with the Internet Archive for a number years, and consequently measurable portions of their collection have been digitized. After being given hundreds of Internet Archive unique identifiers, a program was written which mirrored digital content and bibliographic descriptions (MARC records) locally. The MARC records were ingested into the Portal (an implementation of VuFind), and search results were enhanced to include links to both the locally mirrored content as well as the original digital surrogate. In this way, the Portal is pretty much just like any other discovery system. But the bibliographic displays go further because they contain links to text mining interfaces.

the catholic portal

The “Catholic Portal”

Through these interfaces, the reader can learn many things. For example, in a book called Letters Of An Irish Catholic Layman the word “catholic” is one of the most frequently used. Using the concordance, the reader can see that “Protestants and Roman Catholics are as wide as the poles asunder”, and “good Catholics are not alarmed, as they should be, at the perverseness with which wicked men labor to inspire the minds of all, but especially of youth, with notions contrary to Catholic doctrine”. This is no big surprise, but instead a confirmation. (No puns intended.) On the other hand, some of the statistically most significant two-word phrases are geographic identities (“upper canada”, “new york”, “lake erie”, and “niagara falls”) . This is interesting because such things are not denoted in the bibliographic metadata. Moreover, a histogram plotting where in the document “niagra fals” occurs can be juxtaposed with a similar histogram for the word “catholic”. Why does the author talk about Catholics when they do not talk about upstate New York? Text mining makes it easier to bring these observations to light in a quick and easy-to-use manner.

Concordance highlighting geographic two-word phrases

where is catholic

Where the word “catholic” is located in the text

Where “niagra falls” is located in the text

Some work being done in the The Hesburgh Libraries at the University of Notre Dame is in the same vein. Specifically, the Libraries is scanning Catholic pamphlets, curating the resulting TIFF images, binding them together to make PDF documents, embedding the results of OCR (optical character recognition) into the PDFs, saving the PDFs on a Web server, linking to the PDFs from the catalog and discovery system, and finally, linking to text mining services from the catalog and discovery system. Consequently, once found, the reader will be able to download a digitized version of a pamphlet, print it, read it in the usual way, and analyze it for patterns and meanings in ways that may have been overlooked through the use of traditional analytic methods.

Are we there yet?

Are we there yet? Has the library profession solved the problem of “next-generation” library catalogs and discovery systems? In my opinion, the answer is, “No.” To date the profession continues to automate its existing processes without truly taking advantage of computer technology. The integrated library systems are more open than they used to be. Consequently control over the way they operate is being transfered from vendors to the library community. The OPACs of yesterday are being replaced with the discovery systems of today. They are easier to use and better meet readers’ desires. They are not perfect. They are not catalogs. But they do make the process of find more efficient.

On the other hand, our existing systems do not take advantage of the current environment. They do not exploit the wide array and inherent functionality of available full text literature. Think of the millions of books freely available from the Internet Archive, Google Books, the HathiTrust, and Project Gutenberg. Think of the thousands of open access journal titles. Think about all the government documents, technical reports, theses & dissertations, conference proceedings, blogs, wikis, mailing list archives, and even “tweets” freely available on the Web. Even without the content available through licensing, this content has the makings of a significant library of any type. The next step is to provide enhanced services against this content — services that go beyond discovery and access. Once done, the library profession moves away from being a warehouse to an online place where data and information can be put into context, used to address existing problems, and/or create new knowledge.

The problem of find as reached the point of diminishing returns. The problem of use is now the problem requiring a greater amount of the profession’s attention.

Tags: "next generation" library catalogs, services against texts, text mining
Posted in Librarianship | 2 Comments »

Fun with RSS and the RSS aggregator called Planet

May 25th, 2011

This posting outlines how I refined a number of my RSS feeds and then aggregated them into a coherent whole using Planet.

Many different RSS feeds

I have, more or less, been creating RSS (Real Simple Syndication) feeds since 2002. My first foray was not really with RSS but rather with RDF. At that time the functions of RSS and RDF were blurred. In any event, I used RDF as a way of syndicating randomly selected items from my water collection. I never really pushed the RDF, and nothing really became of it. See “Collecting water and putting it on the Web” for details.

In December of 2004 I started marking up my articles, presentations, and travelogues in TEI and saving the result in a database. The webified version of these efforts was something called Musings on Information and Librarianship. I described the database supporting the process is a specific entry called “My personal TEI publishing system“. A program — make-rss.pl — was used to make the feed.

Since then blogs have become popular, and almost by definition, blogs support RSS in a really big way. My RSS was functional, but by comparison, everybody else’s was exceptional. For many reasons I started drifting away from my personal publishing system in 2008 and started moving towards WordPress. This manifested itself in this blog — Mini-Musings.

To make things more complicated, I started blogging on other sites for specific purposes. About a year ago I started blogging for the “Catholic Portal”, and more recently I’ve been blogging about research data management/curation — Days in the Life of a Librarian — at the University of Notre Dame.

In September of 2009 I started implementing a reading list application. Print an article. Read it. Draw and scribble on it. (Read, “Annotate it.”) Scan it. Convert it into a PDF document. Do OCR against it. Save the result to a Web-accessible file system. Do data entry against a database to describe it. Index the metadata and extracted OCR. And finally, provide a searchable/browsable interface to the whole lot. The result is a fledgling system I call “What’s Eric Reading?” Since I wanted to share my wealth (after all, I am a librarian) I created an RSS feed against this system too.

I was on a roll. I went back to my water collection and created a full-fledged RSS feed against it as well. See the simple Perl script — water2rss.pl — to see how easy it is.

Ack! I now have six different active RSS feeds, not counting the feeds I can get from Flickr and YouTube:

That’s too many, even for an ego surfer like myself. What to do? How can I consolidate these things? How can I present my writings in a single interface? How can I make it easy to syndicate all of this content in a standards-compliant way?

Planet

The answer to my questions is/was Planet — “an awesome ‘river of news’ feed reader. It downloads news feeds published by web sites and aggregates their content together into a single combined feed, latest news first.”

A couple of years ago the Code4Lib community created an RSS “planet” called Planet Code4Lib — “Blogs and feeds of interest to the Code4Lib community, aggregated.” I think it is maintained by Jonathan Rochkind, but I’m not sure. It is pretty nice since it brings together the RSS feeds from quite a number of library “hackers”. Similarly, there is another planet called Planet Cataloging which does the same thing for library cataloging feeds. This one is maintained by Jennifer W. Baxmeyer and Kevin S. Clarke. The combined planets work very well together, except when individual blogs are in both aggregations. When this happens I end up reading the same blog postings twice. Not a big deal. You get what you pay for.

After a tiny bit of investigation, I decided to use Planet to aggregate and serve my RSS feeds. Installation and configuration was trivial. Download and unpack the distribution. Select an HTML template. Edit a configuration file denoting the location of RSS feeds and where the output will be saved. Run the program. Tweak the template. Repeat until satisfied. Run the program on a regular basis, preferably via cron. Done. My result is called Planet Eric Lease Morgan.

The graphic design may not be extraordinarily beautiful, but the content is not necessarily intended to be read via an HTML page. Instead the content is intended to be read from inside one’s favorite RSS reader. Planet not only aggregates content but syndicates it too. Very, very nice.

What I learned

I learned a number of things from this process. First I learned that standards evolve. “Duh!”

Second, my understanding of open source software and its benefits was re-enforced. I would not have been able to do nearly as much if it weren’t for open source software.

Third, the process provided me with a means to reflect on the processes of librarianship. My particular processes for syndicating content needed to evolve in order to remain relevant. I had to go back and modify a number of my programs in order for everything to work correctly and validate. The library profession seemingly hates to do this. We have a mindset of “Mark it and park it.” We have a mindset of “I only want to touch book or record once.” In the current environment, this is not healthy. Change is more the norm than not. The profession needs to embrace change, but then again, all institutions, almost by definition, abhor change. What’s a person to do?

Forth, the process enabled me to come up with a new quip. The written word read transcends both space and time. Fun!?

Finally, here’s an idea for the progressive librarians in the crowd. Use the Planet software to aggregate RSS fitting your library’s collection development policy. Programatically loop through the resulting links to copy/mirror the remote content locally. Curate the resulting collection. Index it. Integrate the subcollection and index into your wider collection of books, jourals, etc. Repeat.

Tags: Planet, RSS
Posted in Hacks, Reviews | Comments Off on Fun with RSS and the RSS aggregator called Planet

Book reviews for Web app development

May 15th, 2011

This is a set of tiny book reviews covering the topic of Web app development for the iPhone, iPad, and iPod Touch.

Unless you’ve been living under a rock for the past three or four years, then you know the increasing popularity of personal mobile computing devices. This has manifested itself through “smart phones” like the iPhone and “tablet computers” like the iPad and to some extent the iPod Touch. These devices, as well as other smart phones and tablet computers, get their network connections from the ether, their screens are smaller than the monitors of desktop computers, and they employ touch screens for input instead of keyboards and mice. All of these things significantly change the user’s experience and thus their expectations.

As a librarian I am interested in providing information services to my clientele. In this increasingly competitive environment where the provision of information services includes players like Google, Amazon, and Facebook, it behooves me to adapt to the wider environment of my clientele as opposed to the other way around. This means I need to learn how to provide information services through mobile computing devices. Google does it. I have to do it too.

Applications for mobile computing devices fall into two categories: 1) native applications, and 2) “Web apps”. The former are binary programs written in compiled languages like Objective-C (or quite possibly Java). These types of applications are operating system-specific, but they are also able to take full advantage of the underlying hardware. This means applications for things like iPhone or iPad can interoperate with the devices’ microphone, camera, speakers, geo-location functions, network connection, local storage, etc. Unfortunately, I don’t know any compiled languages to any great degree, and actually I have little desire to do so. After all, I’m a lazy Perl programmer, and I’ve been that way for almost twenty years.

The second class of applications are Web apps. In reality, these things are simply sets of HTML pages specifically designed for mobiles. These “applications” have the advantage of being operating system independent but are dead in the water without the existence of a robust network connection. These applications, in order to be interactive and meet user expectations, also need to take full advantage of CSS and Javascript, and when it comes to Javascript it becomes imperative to learn and understand how to do AJAX and AJAX-like data acquisition. If I want to provide information services through mobile devices, then the creation of Web apps seems much more feasible. I know how to create well-formed and valid HTML. I can employ the classic LAMP stack to do any hard-core computing. There are a growing number of CSS frameworks making it easy to implement the mobile interface. All I have to do is learn Javascript, and this is not nearly as difficult as it used to be with the emergence of Javascript debuggers and numerous Javascript libraries. For me, Web apps seem to be the way to go.

Over the past couple of years I went out and purchased the following books to help me learn how to create Web apps. Each of them are briefly described below, but first, here’s a word about WebKit. There are at least three HTML frameworks driving the majority of Web browsers these days. Gecko which is the heart of Firefox, WebKit which is the heart of Safari and Chrome, and whatever Microsoft uses as the heart of Internet Explorer. Since I do not own any devices that run the Android or the Windows operating systems, all of my development is limited to Gecko or WebKit based browsers. Luckily, WebKit seems to be increasing in popularity, and this makes it easier for me to rationalize my development in iPhone, iPad, and iPod Touch. The books reviewed below also lean in this direction.

Beginning iPhone And iPad Web Apps (2010, 488 pgs.) by Chris Apers and Daniel Paterson – This is one my more recent purchases and I think I like this book the best. First and foremost, it is the most agnostic of all the books, even though some of the examples use WebKit. True to its title, it describes the use of HTML5, CSS, and Javascript to implement mobile interfaces. This includes whole chapters to the use of vector graphics and fonts, audio and video content, special effects with (WebKit-specific) CSS, touch and gesture events with Javascript, location-aware programming, and client-side data storage. Moreover, this book is the best of the bunch when it comes to describing how mobile interfaces are different from browser-based interfaces. Mobile interfaces are not just smaller versions of their older siblings! If you are going to buy one book, then buy this one. I think it will serve you for the longest period of time.
Building iPhone Apps With HTML, CSS, and Javascript (2010, 166 pgs.) by Jonathan Stark – Being shorter than the previous book, this one is not as thorough but still covers all the bases. On the other hand, unlike the previous title, it does describe how to use a Javascript library for mobile (JQTouch), and how to use PhoneGap to convert a Web app into a native application with many of the native application benefits. This book is a quick read and a good introduction.
Dashcode For Dummies (2011, 436 pgs.) by Jesse Feiler – Dashcode is a development environment originally designed to facilitate the creation of Macintosh OS X dashboard widgets. As you may or may not know, these widgets are self-contained HTML/Javascript/CSS files intended to support simple utility functions. Tell the time. Display the weather. Convert currencies. Render XML files. Etc. Dashcode evolved and now enables the developer to create Web apps for the Macintosh family of i-devices. I bought this book because I own these devices, and I thought the book might help me exploit their particular characteristics. It does not. Dashcode includes no internal links to the underlying hardware. This book describes how to use Dashcode very well, but Dashcode applications are not really the kind I want to create. I suppose I could use Dashcode to create the skin of my application but the overhead may be excessive and the result may be too device dependent.
Developing Hybrid Applications For The iPhone (2009, 195 pgs.) by Lee S. Barney – By introducing the idea of a “hybrid” application, this book picks up where the Dashcode book left off. It does this by describing two Javascript packages (QuickConnectiPhone and PhoneGap) allowing the developer to interact with the underlying hardware. I’ve read this book a couple of times, I’ve looked over it a few more, and in the end I am still challanged. I’m excited about accessing things like hardware’s camera, GPS funcationality, and file system, but after reading this book I’m still confused on actually how to do it. The content of this book is an advanced topic to be tackled after the basics have been mastered.
Safari And WebKit Development For iPhone OS 3.0 (2010, 383 pgs.) by Richard Wagner – This book is practical, and the one I relied upon the most, but only before I bought Beginning iPhone And iPad Web Apps. It gives an overview of WebKit, Javascript, and CSS. It advocates Web app frameworks like iUI, iWebKit, and UIUIKit. It describes how to design interfaces for the small screen of iPhone and iPod Touch. It has a chapter the specific Javascript events supported by iPhone and iPod Touch. Like a couple of the other books, it describes how to use the HTML5 canvas to render graphics. I was excited to learn how to interact with the phone, maps, and SMS functions of the devices, but learned that this is done simply through specialized URLs. When the book talks about “offline applications” it is really talking about local database storage — another feature of HTML5. A couple things I should have explored but haven’t yet include bookmarklets and data URLs. The book describes how to take advantage of these concepts. This book is really a second edition of similar book with a different title but written by the same author in 2008. Its content is not as current as it could be, but the fundamentals are there.

Based on the things I’ve learned from these books, I’ve created several mobile interfaces. Each of them deserve their own blog posting so I will only outline them here:

iMobile – A rough mobile interface to much of the Infomotions domain. Written a little more than a year ago, it combines backend Perl scripts with the iUI Javascript framework to render content. Now that I look back on it, the hacks there are pretty impressive, if I do say so myself. Of particular interest is the image gallery which gets its content from OAI-PMH data stored on the server, and my water collection which reads an XML file of my own design and plots where the water was collected on a Google map. iMobile was created from the knowledge I gained from Safari And WebKit Development For iPhone OS 3.0.
DH@ND – The home page for a fledgling initiative called Digital Humanities at the University of Notre Dame. The purpose of the site is to support sets of tools enabling students and scholars to simultaneously do “close reading” and “distant reading”. It was built using the principles gleaned from the books above combined with a newer Javascript framework called JQueryMobile. There are only two things presently of note there. The first is Alex Lite for Mobile, a mobile interface to a tiny catalogue of classic novels. Browse the collection by author or title. Download and read selected books in ePub, PDF, or HTML formats. The second is Geo-location. After doing named-entity extraction against a limited number of classic novels, this interface displays a word cloud of place names. The user can then click on place names and have them plotted on a Google Map.

Remember, the sites listed above are designed for mobile, primarly driven by the WebKit engine. If you don’t use a mobile device to view the sites, then your milage will vary.

Web app development is beyond a trend. It has all but become an expectation. Web app implementation requires an evolution in thinking about Web design as well as an additional skill set which includes advanced HTML, CSS, and Javascript. These are not your father’s websites. There are a number of books out there that can help you learn about these topics. Listed above are just a few of them.

Tags: ipad, iphone, ipod touch, web apps
Posted in Hacks, Reviews | Comments Off on Book reviews for Web app development

Alex Lite (version 2.0)

April 11th, 2011

This posting describes Alex Lite (version 2.0) — a freely available, standards-compliant distribution of electronic texts and ebooks.

Alex Lite in a browser

Alex Lite on a mobile

A few years ago I created the first version of Alex Lite. Its primary purpose was to: 1) explore and demonstrate how to transform a particular flavor of XML (TEI) into a number of ebook formats, and 2) distribute the result on a CD-ROM. The process was successful. I learned a lot of about XSLT — the primary tool for doing this sort of work.

Since then two new developments have occurred. First, a “standard” ebook format has emerged — ePub. Based on XHTML, this standard specifies packaging up numerous XML files into a specialized ZIP archive. Software is intended to uncompress the file and display the result. Second, mobile devices have become more prevalent. Think “smart phones” and iPads. These two things have been combined to generate an emerging ebook market. Consequently, I decided to see how easy it would be to transform my TEI files into ePub files, make them available on the Web as well as a CD-ROM, and finally implement a “Webapp” for using the whole thing.

Alex Lite (version 2.0) is the result. There you will find a rudimentary Web browser-based “catalogue” of electronic texts. Browsable by authors and titles (no search), a person can read as many as eigthy classic writings in the forms of HTML, PDF, and ePub files. Using just about any mobile device, a person should be able to use a differnt interface to the collection with all of the functionality of the original. The only difference is the form factor, and thus the graphic design.

The entire Alex Lite distribution is designed to be given away and used as a stand-alone “library”. Download the .zip file. Uncompress it (about 116 MB). Optionally save the result on your Web server. Open the distribution’s index.html file with your browser or mobile. Done. Everything is included. Supporting files. HTML files. ePub files. PDF’s. Since all the files have been run through validators, a CD of Alex Lite should be readable for quite some time. Give away copies to your friends and relatives. Alex Lite makes a great gift.

Computers and their networks are extremely fragile. If they were to break, then access to much of world’s current information would suddently become inaccessible. Creating copies of content, like Alex Lite, are a sort of insurance against this catastrophe. Marking-up content in forms like TEI make it realatively easy to migrate ideas forward. TEI is just the information, not display nor container. Using XSLT it is possible to create different containers and different displays. Having copies of content locally enables a person to control their own destiny. Linking to content only creates maintenance nightmares.

Alex Lite is a fun little hack. Share it with your friends, and use it to evolve your definition of a library.

Tags: Alex Catalogue, ePub
Posted in Hacks, Librarianship | Comments Off on Alex Lite (version 2.0)

Where in the world is the mail going?

March 23rd, 2011

For a good time, I geo-located the subscribers from a number of mailing lists, and then plotted them on a Google map. In other words, I asked the question, “Where in the world is the mail going?” The answer was sort of surprising.

I moderate/manage three library-specific mailing lists: Usability4Lib, Code4Lib, and NGC4Lib. This means I constantly get email messages from the LISTSERV application alerting me to new subscriptions, unsubscriptions, bounced mail, etc. For the most part the whole thing is pretty hands-off, and all I have to do is manually unsubscribe people because their address changed. No big deal.

It is sort of fun to watch the subscription requests. They are usually from places within the United States but not always. I then got to wondering, “Exactly where are these people located?” Plotting the answer on a world map would make such things apparent. This process is called geo-location. For me it is easily done by combining a Perl module called Geo::IP with the Google Maps API. The process was not too difficult and implemented in a program called domains2map.pl:

get a list of all the subscribers to a given mailing list
remove all information but the domain of the email addresses
get the latitude and longitude for a given domain — geo-locate the domain
increment the number of times this domain occurs in the list
got to Step #3 for each item in the list
build a set of Javascript objects describing each domain
insert the objects into an HTML template
output the finished HTML

The results are illustrated below.

Usability4Lib – 600 subscribers
interactive map	pie chart

Code4Lib – 1,700 subscribers
interactive map	pie chart

NGC4Lib – 2,100 subscribers
interactive map	pie chart

It is interesting to note how many of the subscribers seem to be located in Mountain View (California). This is because many people use Gmail for their mailing list subscriptions. The mailing lists I moderate/manage are heavily based in the United States, western Europe, and Australia — for the most part, English-speaking countries. There is a large contingent of Usability4Lib subscribers located in Rochester (New York). Gee, I wonder why. Even though the number of subscribers to Code4Lib and NGC4Lib is similar, the Code4Libbers use Gmail more. NGC4Lib seems to have the most international subscription base.

In the interest of providing “access to the data behind the chart”, you can download the data sets: code4lib.txt, ngc4lib.txt, and usability4lib.txt. Fun with Perl, Google Maps, and mailing list subscriptions.

For something similar, take a gander at my water collection where I geo-located waters of the world.

Tags: Code4Lib, geo-location, mailing lists, NGC4Lib, Usability4Lib
Posted in Hacks | Comments Off on Where in the world is the mail going?

Constant chatter at Code4Lib

March 20th, 2011

As illustrated by the chart, it seems as if the chatter was constant during the most recent Code4Lib conference.

For a good time and in the vein of text mining, I made an effort to collect as many tweets with the hash tag #c4l11 as well as the backchannel log files. (“Thanks, lbjay!”). I then parsed the collection into fields (keys, author identifiers, date stamps, and chats/tweets), and stuffed them into a database. I then created a rudimentary tab-delimited text file consisting of a key (representing a conference event), a start time, and an end time. Looping through this file I queried my database returning the number of chats and tweets associated with each time interval. Lastly, I graphed the result.

Constant chatter at Code4Lib, 2011

As you can see there are a number of spikes, most notably associated with keynote presentations and Lightning Talks. Do not be fooled, because each of these events are longer than balance of the events in the conference. The chatter was rather constant throughout Code4Lib 2011.

When talking about the backchannel, many people say, “It is too distracting; there is too much stuff there.” I then ask myself, “How much is too much?” Using the graph as evidence, I can see there are about 300 chats per event. Each event is about 20-30 minutes long. That averages out to 10ish chats per minute or 1 item every 6 seconds. I now have a yardstick. When the chat volume is equal to or greater than 1 item every 6 seconds, then there is too much stuff for many people to follow.

The next step will be to write a program allowing people to select time ranges from the chat/tweet collection, extract the associated data, and apply analysis tools against them. This includes things like concordances, lists of frequently used words and phrases, word clouds, etc.

Finally, just like traditional books, articles, microforms, and audio-visual materials things things like backchannel log files, tweets, blogs, and mailing list archives are forms of human expression. Do what degree do these things fall into the purview of library collections? Why (or why not) should libraries actively collect and archive them? If it is within our purview, then what do libraries need to do differently in order build such collections and take advantage of their fulltext nature?

Tags: c4l11, Code4Lib
Posted in Hacks, Librarianship | Comments Off on Constant chatter at Code4Lib

How “great” are the Great Books?

March 16th, 2011

In this posting I present two quantitative methods for denoting the “greatness” of a text. Through this analysis I learned that Aristotle wrote the greatest book. Shakespeare wrote seven of the top ten books when it comes to love. And Aristophanes’s Peace is the most significant when it comes to war. Once calculated, this description – something I call the “Great Ideas Coefficient” – can be used as a benchmark to compare & contrast one text with another.

Research questions

In 1952 Robert Maynard Hutchins et al. compiled a set of books called the Great Books of the Western World. [1] Comprised of fifty-four volumes and more than a couple hundred individual works, it included writings from Homer to Darwin. The purpose of the set was to cultivate a person’s liberal arts education in the Western tradition. [2]

To create the set a process of “syntopical reading” was first done. [3]. (Syntopical reading is akin to the emerging idea of “distant reading” [4], and at the same time complementary to the more traditional “close reading”.) The result was an enumeration of 102 “Great Ideas” commonly debated throughout history. Through the syntopical reading process, through the enumeration of timeless themes, and after thorough discussion with fellow scholars, the set of Great Books was enumerated. As stated in the set’s introductory materials:

…but the great books posses them [the great ideas] for a considerable range of ideas, covering a variety of subject matters or disciplines; and among the great books the greatest are those with the greatest range of imaginative or intellectual content. [5]

Our research question is then, “How ‘great’ are the Great Books?” To what degree do they discuss the Great Ideas which apparently define their greatness? If such degrees can be measured, then which of the Great Books are greatest?

Great Ideas Coefficient defined

To measure the greatness of any text – something I call a Great Ideas Coefficient – I apply two methods of calculation. Both exploit the use of term frequency inverse document frequency (TFIDF).

TFIDF is a well-known method for calculating statistical relevance in the field of information retrieval (IR). [6] Query terms are supplied to a system and compared to the contents of an inverted index. Specifically, documents are returned from an IR system in a relevancy ranked order based on: 1) the ratio of query term occurrences and the size of the document multiplied by 2) the ratio of the number of documents in the corpus and the number of documents containing the query terms. Mathematically stated, TFIDF equals:

(c/t) * log(d/f)

where:

c = number of times the query terms appear in a document
t = total number of words in a document
d = total number of documents in a corpus
f = total number of documents containing the query terms

For example, suppose a corpus contains 100 documents. This is d. Suppose two of the documents contain a given query term (such as “love”). This is f. Suppose also the first document is 50 words long (t) and contains the word love once (c). Thus, the first document has a TFIDF score of 0.034:

(1/50) * log(100/2) = 0.0339

Where as, if the second document is 75 words long (t) and contains the word love twice (c), then the second document’s TFIDF score is 0.045:

(2/75) * log(100/2) = 0.0453

Thus, the second document is considered more relevant than the first, and by extension, the second document is probably more “about” love than the first. For our purposes relevance and “aboutness” are equated with “greatness”. Consequently, in this example, when it comes to the idea of love, the second document is “greater” than the first. To calculate our first Coefficient I sum all 102 Great Idea TFIDF scores for a given document, a statistic called the “overlap score measure”. [7] By comparing the resulting sums I can compare the greatness of the texts as well as examine correlations between Great Ideas. Since items selected for inclusion in the Great books also need to exemplify the “greatest range of imaginative or intellectual content”, I also produce a Coefficient based on a normalized mean for all 102 Great Ideas across the corpus.

Great Ideas Coefficient calculated

To calculate the Great Ideas Coefficient for each of the Great Books I used the following process:

Mirrored versions of Great Books – By searching and browsing the Internet 222 of the 260 Great Books were found and copied locally, giving us a constant (d) equal to 222.
Indexed the corpus – An inverted index was created. I used Solr for this. [8]
Calculated TFIDF for a given Great Idea – First the given Great Idea was stemmed and searched against the the index resulting in a value for f. Each Great Book was retrieved from the local mirror whereby the size of the work (t) was determined as well as the number of times the stem appeared in the work (c). TFIDF was then calculated.
Repeated Step #3 for each of the Great Ideas – Go to Step #3 each of the Great Ideas.
Summed each of the TFIDF scores – The Great Idea TFIDF scores were added together giving us our first Great Ideas Coefficient for a given work.
Saved the result – Each of the individual scores as well as the Great Ideas Coefficient was saved to a database.
Returned to Step #3 for each of the Great Books – Go to Step #3 each of the other works in the corpus.

The end result was a file in the form of a matrix with 222 rows and 104 columns. Each row represents a Great Book. Each column is a local identifier, a Great Ideas TFIDF score, and a book’s Great Ideas Coefficient. [9]

The Great Books analyzed

Sorting the matrix according to the Great Ideas Coefficient is trivial. Upon doing so I see that Kant’s Introduction To The Metaphysics Of Morals and Aristotle’s Politics are the first and second greatest books, respectively. When the matrix is sorted by the love column, I see Plato’s Symposium come out as number one, but Shakespeare claims seven of the top ten items with his collection of Sonnets being the first. When the matrix is sorted by the war column, then Aristophanes’s Peace is the greatest.

Unfortunately, denoting overall greatness in the previous manner is too simplistic because it does not fit the definition of greatness posited by Hutchins. The Great Books are expected to be great because they exemplify the “greatest range of imaginative or intellectual content”. In other words, the Great Books are great because they discuss and elaborate upon a wide spectrum of the Great Ideas, not just a few. Ironically, this does not seem to be the case. Most of the Great Books have many Great Idea scores equal to zero. In fact, at least two of the Great Ideas – cosmology and universal – have TFIDF scores equal to zero across the entire corpus, as illustrated by Figure 1. This being the case, I might say that none of the Great Books are truly great because none of them significantly discuss the totality of the Great Ideas.

box plots of great ideas
Figure 1 – Box plot scores of Great Ideas

To take this into account and not allow the value of the Great Idea Coefficient to be overwhelmed by one or two Great Idea scores, I calculated the mean TFIDF score for each of the Great Ideas across the matrix. This vector represents an imaginary but “typical” Great Book. I then compared the Great Idea TFIDF scores for each of the Great Books with this central quantity to determine whether or not it is above or below the typical mean. After graphing the result I see that Aristotle’s Politics is still the greatest book with Hegel’s Philosophy Of History being number two, and Plato’s Republic being number three. Figure 2 graphically illustrates this finding, but in a compressed form. Not all works are listed in the figure.

normalized great books
Figure 2 – Individual books compared to the “typical” Great Book

Summary

How “great” are the Great Books? The answer depends on what qualities a person wants to measure. Aristotle’s Politics is great in many ways. Shakespeare is great when it comes to the idea of love. The calculation of the Great Ideas Coefficient is one way to compare & contrast texts in a corpus – “syntopical reading” in a digital age.

Notes

[1] Hutchins, Robert Maynard. 1952. Great books of the Western World. Chicago: Encyclopædia Britannica.

[2] Ibid. Volume 1, page xiv.

[3] Ibid. Volume 2, page xi.

[4] Moretti, Franco. 2005. Graphs, maps, trees: abstract models for a literary history. London: Verso, page 1.

[5] Hutchins, op. cit. Volume 3, page 1220.

[6] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An introduction to information retrieval. Cambridge: Cambridge University Press, page 109.

[7] Ibid.

[8] Solr – http://lucene.apache.org/solr/

[9] This file – the matrix of identifiers and scores – is available at http://bit.ly/cLmabY, but a more useful and interactive version is located at http://bit.ly/cNVKnE

Tags: digital humanities
Posted in Great Books of the Western World | Comments Off on How “great” are the Great Books?

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09
URL: ./

Archives

Categories

Tags

Links

Links

Links

Links

Links

Quick start

Manifest

Usage

To do

Database

Commentary

Links

Identify “invalid” headings

Search & download

Verify and merge

Summary and possible next steps

Notes

Librarianship as collections & services

Bibliographic description: an informal history

Resource Description Framework

Linked data

BIBFRAME

Workflow

Prototypical implementation

Summary

Notes

Links

Introduction

Well-formedness

DTD’s and namespaces

Schemas

RelaxNG, Schematron, and digital libraries

XPath

XSLT

TEI — Text Encoding Initiative

XSL-FO

EAD — Encoded Archival Description

RDF

MARC

Not a panacea

Summary

Epilogue — Use and understanding

Readings

Exercise

Linked Data

Curation of digital objects

baxter

General statistics

Notes on word usage

Items of interest

Comparison

Search

Observations

Code

Review of RDF

Simple foray into publishing linked open data

Triple stores and SPARQL endpoints

Advantages and disadvantages

Summary

Introduction

Installation

Koha “extras”

Ideas and next steps

index.pl

get.pl

search.pl

Table of Contents

Plato’s definitions

Aristotle’s definitions

Shakespeare’s definitions