Alex Catalogue – Infomotions Mini-Musings

Twitter, Facebook, Delicious, and Alex

Eric Lease Morgan — Sat, 18 Sep 2010 23:20:20 +0000

I spent time last evening and this afternoon integrating Twitter, Facebook, and Delicious into the my Alex Catalogue. The process was (almost) trivial:

create Twitter, Facebook, and Delicious accounts
select and configure the Twitter button I desired to use
acquire the Delicious javascript for bookmarking
place the results of Steps #1 and #2 into my HTML
rebuild my pages
install and configure the Twitter application for Facebook

Because of this process I am able to “tweet” from Alex, its search results, any of the etexts in the collection, as well as any results from the use of the concordances. These tweets then get echoed to Facebook.

(I tried to link directly to Facebook using their Like Button, but the process was cumbersome. Iframes. Weird, Facebook-specific Javascript. Pulling too much content from the header of my pages. Considering the Twitter application for Facebook, the whole thing was not worth the trouble.)

I find it challenging to write meaningful 140 character comments on the Alex Catalogue, especially since the URLs take up such a large number of the characters. Still, I hope to regularly find interesting things in the collection and share them with the wider audience. To see the fruits of my labors to date, see my Twitter feed — http://twitter.com/ericleasemorgan.

Only time will tell whether or not this “social networking” thing proves to be beneficial to my library — all puns intended.

Where in the world are windmills, my man Friday, and love?

Eric Lease Morgan — Sun, 12 Sep 2010 22:32:19 +0000

This posting describes how a Perl module named Lingua::Concordance allows the developer to illustrate where in the continum of a text words or phrases appear and how often.

Windmills, my man Friday, and love

When it comes to Western literature and windmills, we often think of Don Quiote. When it comes to “my man Friday” we think of Robinson Crusoe. And when it comes to love we may very well think of Romeo and Juliet. But I ask myself, “How often do these words and phrases appear in the texts, and where?” Using digital humanities computing techniques I can literally illustrate the answers to these questions.

Lingua::Concordance

Lingua::Concordance is a Perl module (available locally and via CPAN) implementing a simple key word in context (KWIC) index. Given a text and a query as input, a concordance will return a list of all the snippets containing the query along with a few words on either side. Such a tool enables a person to see how their query is used in a literary work.

Given the fact that a literary work can be measured in words, and given then fact that the number of times a particular word or phrase can be counted in a text, it is possible to illustrate the locations of the words and phrases using a bar chart. One axis represents a percentage of the text, and the other axis represents the number of times the words or phrases occur in that percentage. Such graphing techniques are increasingly called visualization — a new spin on the old adage “A picture is worth a thousand words.”

In a script named concordance.pl I answered such questions. Specifically, I used it to figure out where in Don Quiote windmills are mentiond. As you can see below they are mentioned only 14 times in the entire novel, and the vast majority of the time they exist in the first 10% of the book.

  $ ./concordance.pl ./don.txt 'windmill'
  Snippets from ./don.txt containing windmill:
	* DREAMT-OF ADVENTURE OF THE WINDMILLS, WITH OTHER OCCURRENCES WORTHY TO
	* d over by the sails of the windmill, Sancho tossed in the blanket, the
	* thing is ignoble; the very windmills are the ugliest and shabbiest of 
	* liest and shabbiest of the windmill kind. To anyone who knew the count
	* ers say it was that of the windmills; but what I have ascertained on t
	* DREAMT-OF ADVENTURE OF THE WINDMILLS, WITH OTHER OCCURRENCES WORTHY TO
	* e in sight of thirty forty windmills that there are on plain, and as s
	* e there are not giants but windmills, and what seem to be their arms a
	* t most certainly they were windmills and not giants he was going to at
	*  about, for they were only windmills? and no one could have made any m
	* his will be worse than the windmills," said Sancho. "Look, senor; thos
	* ar by the adventure of the windmills that your worship took to be Bria
	*  was seen when he said the windmills were giants, and the monks' mules
	*  with which the one of the windmills, and the awful one of the fulling
  
  A graph illustrating in what percentage of ./don.txt windmill is located:
	 10 (11) #############################
	 20 ( 0) 
	 30 ( 0) 
	 40 ( 0) 
	 50 ( 0) 
	 60 ( 2) #####
	 70 ( 1) ##
	 80 ( 0) 
	 90 ( 0) 
	100 ( 0)

If windmills are mentioned so few times, then why do they play so prominently in people’s minds when they think of Don Quiote? To what degree have people read Don Quiote in its entirity? Are windmills as persistent a theme throughout the book as many people may think?

What about “my man Friday”? Where does he occur in Robinson Crusoe? Using the concordance features of the Alex Catalogue of Electronic Texts we can see that a search for the word Friday returns 185 snippets. Mapping those snippets to percentages of the text results in the following bar chart:

Friday in Robinson Crusoe

Obviously the word Friday appears towards the end of the novel, and as anybody who has read the novel knows, it is a long time until Robinson Crusoe actually gets stranded on the island and meets “my man Friday”. A concordance helps people understand this fact.

What about love in Romeo and Juliet? How often does the word occur and where? Again, a search for the word love returns quite a number of snippets (175 to be exact), and they are distributed throughout the text as illustrated below:

love in Romeo and Juliet

“Maybe love is a constant theme of this particular play,” I state sarcastically, and “Is there less love later in the play?”

Digital humanities and librarianship

Given the current environment, where full text literature abounds, digital humanities and librarianship are a match made in heaven. Our library “discovery systems” are essencially indexes. They enable people to find data and information in our collections. Yet find is not an end in itself. In fact, it is only an activity at the very beginning of the learning process. Once content is found it is then read in an attempt at understanding. Counting words and phrases, placing them in the context of an entire work or corpus, and illustrating the result is one way this understanding can be accomplished more quickly. Remember, “Save the time of the reader.”

Integrating digital humanities computing techniques, like concordances, into library “discovery systems” represent a growth opportunity for the library profession. If we don’t do this on our own, then somebody else will, and we will end up paying money for the service. Climb the learning curve now, or pay exorbitant fees later. The choice is ours.

Ngrams, concordances, and librarianship

Eric Lease Morgan — Mon, 30 Aug 2010 05:08:47 +0000

This posting describes how the extraction of ngrams and the implementation of concordances are integrated into the Alex Catalogue of Electronic Texts. Given the increasing availability of full-text content in libraries, the techniques described here could easily be incorporated into traditional library “discovery systems” and/or catalogs, if and only if the library profession were to shift its definition of what it means to practice librarianship.

Lingua::EN::Bigram

During the past couple of weeks, in fits of creativity, one of the things I spent some of my time on was a Perl module named Lingua::EN::Bigram. At version 0.03, it now supports not only bigrams, trigrams, and quadgrams (two-, three-, and four-word phrases, respectively), but also ngrams — multi-word phrases of an arbitrary length.

Given this enhanced functionality, and through the use of a script called ngrams.pl, I learned that the 10 most frequently used 5-word phrases and the number of times they occur in Henry David Thoreau’s Walden seem to surround spacial references:

a quarter of a mile (6)
i have no doubt that (6)
as if it were a (6)
the other side of the (5)
the surface of the earth (4)
the greater part of the (4)
in the midst of a (4)
in the middle of the (4)
in the course of the (3)
two acres and a half (3)

Whereas the same process applied to Thoreau’s A Week on the Concord and Merrimack Rivers returns lengths and references to flowing water, mostly:

a quarter of a mile (8)
on the bank of the (7)
the surface of the water (6)
the middle of the stream (6)
as if it were the (5)
as if it were a (4)
is for the most part (4)
for the most part we (4)
the mouth of this river (4)
in the middle of the (4)

While not always as clear cut as the examples outlined above, the extraction and counting of ngrams usually supports the process of “distant reading” — a phrase coined by Franco Moretti in Graphs, Maps, Trees: Abstract Models for Literary History (2007) to denote the counting, graphing, and mapping of literary texts. With so much emphasis on reading in libraries, I ask myself, “Ought the extraction of ngrams be applied to library applications?”

Concordances

Concordances are literary tools used to evaluate texts. Dating back to as early as the 12th or 13th centuries, they were first used to study religious materials. Concordances take many forms, but they usually list all the words in a text, the number of times each occurs, and most importantly, places where each word within the context of its surrounding text — a key-word in context (KWIC) index. Done by hand, the creation of concordances is tedious and time consuming, and therefore very expensive. Computers make the work of creating a concordance almost trivial.

Each of the full text items in the Alex Catalogue of Electronic Texts (close to 14,000 of them) is accompanied with a concordance. They support the following functions:

list of all the words in the text starting with a given letter and the number of times each occurs
list the most frequently used words in the text and the number of times each occurs
list the most frequently used ngrams in a text and the number of times each occurs
display individual items from the lists above in a KWIC format
enable the student or scholar to search the text for arbitrary words or phrases (regular expressions) and have them displayed in a KWIC format

Such functionality allows people to answer many questions quickly and easily, such as:

Does Mark Twain’s Adventures of Huckleberry Finn contain many words beginning with the letter z, and if so, how many times and in what context?
To what extent does Aristotle’s Metaphysics use the word “good”, and maybe just as importantly, how is the word “evil” used in the same context?
In Jack London’s Call of the Wild the phrase “man in the red sweater” is one of the more frequently used. Who was this man and what role does he play in the story?
Compared to Shakespeare, to what extent does Plato discuss love, and how do the authors’ expositions differ?

The counting of words, the enumeration of ngrams, and the use of concordances are not intended to short-circuit traditional literary studies. Instead, they are intended to supplement and enhance the process. Traditional literary investigations, while deep and nuanced, are not scalable. A person is not able to read, compare & contrast, and then comprehend the essence of all of Shakespeare, all of Plato, and all of Charles Dickens through “close reading”. An individual simply does not have enough time. In the words of Gregory Crane, “What do you do with a million books?” Distant reading, akin to the proceses outlined above, make it easier to compare & contrast large corpora, discover patterns, and illustrate trends. Moreover, such processes are reproducible, less prone to subjective interpretation, and not limited to any particular domain. The counting, graphing, and mapping of literary texts makes a lot of sense.

The home page for the concordances is complete with a number of sample texts. Alternatively, you can search the Alex Catalogue and find an item on your own.

Library “discovery systems” and/or catalogs

The amount of full text content available to libraries has never been greater than it is today. Millions of books have been collectively digitized through Project Gutenberg, the Open Content Alliance, and the Google Books Project. There are thousands of open access journals with thousands upon thousands of freely available scholarly articles. There are an ever-growing number of institutional repositories both subject-based as well as institutional-based. These too are rich with full text content. None of this even considers the myriad of grey literature sites like blogs and mailing list archives.

Library “discovery systems” and/or catalogs are designed to organize and provide access to the materials outlined above, but they need to do more. First of all, the majority of the profession’s acquisitions processes assume collections need to be paid for. With the increasing availability of truly free content on the Web, greater emphasis needs to be placed on harvesting content as opposed to purchasing or licensing it. Libraries are expected to build collections designed to stand the test of time. Brokering access to content through licensing agreements — one of the current trends in librarianship — will only last as long as the money lasts. Licensing content makes libraries look like cost centers and negates the definition of “collections”.

Second, library “discovery systems” and/or catalogs assume an environment of sacristy. They assume the amount of accessible, relevant data and information needed by students, teachers, and researchers is relatively small. Thus, a great deal of the profession’s efforts go into enabling people to find their particular needle in one particular haystack. In reality, current indexing technology makes the process of finding relavent materials trivial, almost intelligent. Implemented correctly, indexers return more content than most people need, and consequently they continue to drink from the proverbial fire hose.

Let’s turn these lemons into lemonade. Let’s redirect some of the time and money spent on purchasing licenses towards the creation of full text collections by systematic harvesting. Let’s figure out how to apply “distant reading” techniques to the resulting collections thus making them, literally, more useful and more understandable. These redirections represent a subtle change in the current direction of librarianship. At the same time, they retain the core principles of the profession, namely: collection, organization, preservation, and dissemination. The result of such a shift will result in an increased expertise on our part, the ability to better control our own destiny, and contribute to the overall advancement of our profession.

What can we do to make these things come to fruition?

Cool URIs

Eric Lease Morgan — Sun, 22 Aug 2010 18:07:42 +0000

I have started implementing “cool” URIs against the Alex Catalogue of Electronic Texts.

As outlined in Cool URIs for the Semantic Web, “The best resource identifiers… are designed with simplicity, stability and manageability in mind…” To that end I have taken to creating generic URIs redirecting user-agents to URLs based on content negotiation — 303 URI forwarding. These URIs also provide a means to request specific types of pages. The shapes of these URIs follow, where “key” is a foreign key in my underlying (MyLibrary) database:

http://infomotions.com/etexts/id/key – generic; redirection based on content negotiation
http://infomotions.com/etexts/page/key – HTML; the text itself
http://infomotions.com/etexts/data/key – RDF; data about the text
http://infomotions.com/etexts/concordance/key – concordance; a means for textual analysis

For example, the following URIs return different versions/interfaces of Henry David Thoreau’s Walden:

This whole thing makes my life easier. No need to remember complicated URLs. All I have to remember is the shape of my URI and the foreign key. Through the process this also makes the URLs easier to type, shorten, distribute, and display.

The downside of this implementation is the need for an always-on intermediary application doing the actual work. The application, implemented as mod_perl module, is called Apache2::Alex::Dereference and available for your perusal. Another downside is the need for better, more robust RDF, but that’s for later.

Alex Catalogue Widget

Eric Lease Morgan — Tue, 16 Mar 2010 03:43:34 +0000

I created my first Apple Macintosh Widget today — Alex Catalogue Widget.

The tool is pretty rudimentary. Enter something into the field. Press return or click the Search button. See search results against the Alex Catalogue of Electronic Texts displayed in your browser. The development process reminded me of hacking in HyperCard. Draw things on the screen — buttons, fields, etc. — and assocate actions (events) with each of them.

Download it and try it for yourself.

Michael Hart in Roanoke (Indiana)

Eric Lease Morgan — Sun, 07 Mar 2010 21:54:18 +0000

On Saturday, February 27, Paul Turner and I made our way to Roanoke (Indiana) to listen to Michael Hart tell stories about electronic texts and Project Gutenberg. This posting describes our experience.

Roanoke and the library

To celebrate its 100th birthday, the Roanoke Public Library invited Michael Hart of Project Gutenberg fame to share his experience regarding electronic texts in a presentation called “Books & eBooks: Past, Present & Future Libraries”. The presentation was scheduled to start around 3 o’clock, but Paul Turner and I got there more than an hour early. We wanted to have time to visit the Library before it closed at 2 o’clock. The town of Roanoke (Indiana) — a bit south west of Fort Wayne — was tiny by just about anybody’s standard. It sported a single blinking red light, a grade school, a few churches, one block of shops, and a couple of eating establishments. According to the man in the bar, the town got started because of the locks that had been built around town.

The Library was pretty small too, but it bursted with pride. About 1,800 square feet in size, it was overflowing with books and videos. There were a couple of comfy chairs for adults, a small table, a set of four computers to do Internet things, and at least a few clocks the wall. They were very proud of the fact that they had become an Evergreen library as a part Evergreen Indiana initiative. “Now is is possible to see what is owned in other, nearby libraries, and borrow things from them as well,” said the Library’s Board Director.

Michael Hart

The presentation itself was not held in the Library but in a nearby church. About fifty (50) people attended. We sat in the pews and contemplated the symbolism of the stained glass windows and wondered how the various hardware placed around the alter was going to be incorporated into the presentation.

Full of smiles and joviality, Michael Hart appeared in a tailless tuxedo, cumber bun, and top hat. “I am now going to pull a library out of my hat,” he proclaimed, and proceeded to withdraw a memory chip. “This chip contains 10’s of thousands of books, and now I’m going to pull a million books out of my pocket,” and he proceed to display a USB drive. Before the year 2020 he sees us capable of carrying around a billion books on some sort of portable device. Such was the essence of his presentation — computer technology enables the distribution and acquisition of “books” in ways never before possible. Through this technology he wants to change the world. “I consider myself to be like Johnny Appleseed, and I’m spreading the word,” at which time I raised my hand and told him Johnny Appleseed (John Chapman) was buried just up the road in Fort Wayne.

Mr. Hart displayed and described a lot of antique hardware. A hard drive that must have weighed fifty (50) pounds. Calculators. Portable computers. Etc. He illustrated how storage mediums were getting smaller and smaller while being able to save more and more data. He was interested in the packaging of data and displayed a memory chip a person can buy from Walmart containing “all of the hit songs from the 50’s and 60’s”. (I wonder how the copyright issues around that one had been addressed.) “The same thing,” he said, “could be done for books but there is something wrong with the economics and the publishing industry.”

Roanoke (Indiana)

public library

He outlined how Project Gutenberg works. First a book is identified as a possible candidate for the collection. Second, the legalities of the making the book available are explored. Next, a suitable edition of the book is located. Fourth, the book’s content is transcribed or scanned. Finally, 100’s of people proof-read the result and ultimately make it available. Hart advocated getting the book out sooner rather than later. “It does not have to be perfect, and we can always the fix errors later.”

He described how the first Project Gutenberg item came into existence. In a very round-about and haphazard way, he enrolled in college. Early on he gravitated towards the computer room because it was air conditioned. Through observation he learned how to use the computer, and to do his part in making the expense of the computer worthwhile, he typed out the United States Declaration of the Independence on July 4th, 1971.

“Typing the books is fun,” he said. “It provides a means for reading in ways you had never read them before. It is much more rewarding than scanning.” As a person who recently learned how to bind books and as a person who enjoys writing in books, I asked Mr. Hart to compare & contrast ebooks, electronic texts, and codexes. “The things Project Gutenberg creates are electronic texts, not ebooks. They are small, portable, easily copyable, and readable by any device. If you can’t read a plain text document on your computer, then you have much bigger problems. Moreover, there is an enormous cost-benefit compared to printed books. Electronic texts are cheap.” Unfortunately, he never really answered the question. Maybe I should have phrased it differently and asked him, the way Paul did, to compare the experience of reading physical books and electronic texts. “I don’t care if it looks like a book. Electronic texts allow me to do more reading.”

“Two people invented open source. Me and Richard Stallman,” he said. Well, I don’t think this is exactly true. Rather, Richard Stahlman invented the concept of GNU software, and Michael Hart may have invented the concept of open access publishing. But the subtle differences between open source software and open access publishing are lost on most people. In both cases the content is “free”. I guess I’m too close to the situation. I too see open source software distribution and open access publishing having more things in common than differences.

church

stained glass

“I knew Project Gutenberg was going to be success when I was talking on the telephone with a representative of the Common Knowledge project and heard a loud crash on the other end of the line. It turns out the representative’s son and friends had broken an annorandak chair while clamoring to read an electronic text.” In any case, he was fanatically passionate about giving away electronic texts. He sited the World eBook Fair, and came to the presentation with plenty of CD’s for distribution.

In the end I had my picture taken with Mr. Hart. We then all retired to the basement for punch and cake where we sang Happy Birthday to Michael. Two birthdays celebrated at the same time.

Reflection

Michael and Eric

Many people are drawn to the library profession as a matter of principle. Service to others. Academic freedom. Preservation of the historical record. I must admit that I am very much the same way. I was drawn to librarianship for two reasons. First, as a person with a BA in philosophy, I saw libraries as a places full of ideas, literally. Second, I saw the profession as a growth industry because computers could be used to disseminate the content of books. In many ways my gut feelings were accurate, but at the same time they were misguided because much of librarianship surrounds workflows, processes that are only a couple of steps away from factory work, and the curation of physical items. To me, just like Mr. Hart, the physical item is not as important as what it manifests. It is not about the book. Rather, it is what is inside the book. Us librarians have tied our identities to the physical book in such a way to be limiting. We have pegged ourselves, portrayed a short-sighted vision, and consequently painted ourselves into a corner. It the carpenter a hammer expert? Is the surgeon a scalpel technician? No, they are builders and healers, respectively. Why must librarianship be identified with books?

I have benefited from Mr. Hart’s work. My Alex Catalogue of Electronic Texts contains many Project Gutenberg texts. Unlike the books from the Internet Archive, the texts are much more amenable to digital humanities computing techniques because they have been transcribed by humans and not scanned by computers. At the same time, the Project Gutenberg texts are not formatted as well for printing or screen display as PDF versions of the same. This is why the use of electronic texts and ebooks is not an either/or situation but rather a both/and, especially when it comes to analysis. Read a well-printed book. Identify item of interest. Locate item in electronic version of book. Do analysis. Return to printed book. The process could work just as well the other way around. Ask a question of the electronic text. Get one or more answers. Examine them in the context of the printed word. Both/and, not either/or.

The company was great, and the presentation was inspiring. I applaud Michael Hart for his vision and seemingly undying enthusiasm. His talk made me feel like I really am on the right track, but change takes time. The free distribution of data and information — whether the meaning of free be denoted as liberty or gratis — is the right thing to do for society in general. We all benefit, and therefore the individual benefits as well. The political “realities” of the situation are more like choices and not Platonic truths. They represent immediate objectives as opposed to long-term strategic goals. I guess this is what you get when you mix the corporeal and ideal natures of humanity.

Who would have known that a trip to Roanoke would turn out to be a reflection of what it means to be human.

Alex Catalogue collection policy

Eric Lease Morgan — Sun, 04 Oct 2009 13:12:08 +0000

This page lists the guidelines for including texts in the Alex Catalogue of Electronic Texts. Originally written in 1994, much of it is still valid today.

Purpose

The primary purpose of the Catalogue is to provide me with the means for demonstrating a concept I call arscience through American and English literature as well as Western philosophy. The secondary purpose of the Catalogue is to provide value-added access to some of the world’s great literature in turn providing the means for enhancing education. Consequently, the items in the collection must satisfy either of these two goals.

Qualities

Listed in priority order, texts in the collection must have the following qualities:

Only texts in the public domain or freely distributed texts will be collected.
Only texts that can be classified as American literature, English literature, or Western philosophy will be included.
Only texts that are considered “great” literature will included. Great literature is broadly defined as literature withstanding the test of time and found in authoritative reference works like the Oxford Companions or the Norton Anthologies.
Only complete works will be collected unless a particular work was never completed in the first place. In other words, partially digitized texts will not be included in the Catalogue.
Whenever possible, collections of short stories or poetry will be included as they were originally published. If the items from the originally published collections have been broken up into individual stories or poems, then those items will be included individually.
The texts in the collection must be written in or translated into English. Otherwise I will not be able to evaluate the texts’ quality nor will the indexing and content-searching work correctly.

File formats

Because of technical limitations and the potential long-term integrity of the Catalogue, texts in the collection, listed in order of preference, should have the following formats:

Plain text files are preferred over HTML files.
HTML files are preferred over compressed files.
Compressed files are preferred over “word processor” files.
Word processed files are the least preferable file format.
Texts in unalterable file formats, such as Adobe Acrobat, will not be included.

In all cases, text that have not been divided into parts are preferred over texts that have been divided. If a particular item is deemed especially valuable and the item has been divided into parts, then efforts will be made to concatenate the individual parts and incorporate the result into the collection. The items in the collection are not necessarily intended to be read online.

Alex, the movie!

Eric Lease Morgan — Sun, 04 Oct 2009 12:58:48 +0000

Created circa 1998, this movie describes the purpose and scope of the Alex Catalogue of Electronic Texts. While coming off rather pompous, the gist of what gets said is still valid and correct. Heck, the links even work. “Thanks Berkeley!”

How to make a book (#1 of 3)

Eric Lease Morgan — Sun, 23 Aug 2009 21:13:55 +0000

This is a series of posts where I will describe and illustrate how to make books. In this first post I will show you how to make a book with a thermo-binding machine. In the second post I will demonstrate how to make a book by simply tearing and folding paper. In the third installment, I will make a traditional book with a traditional cover and binding. The book — or more formally, the codex — is a pretty useful format for containing information.

Fellowes TB 250 thermo-binding machine

The number of full text books found on the Web is increasing at a dramatic pace. A very large number of these books are in the public domain and freely available for downloading. While computers make it easy to pick through smaller parts of books, it is diffcult to read and understand them without printing. Once they are printed you are then empowered to write in the margins, annotate them as you see fit, and share them with your friends. On the other hand, reams of unbound paper is difficult to handle. What to do?

Enter a binding machine, specifically a thermo-binding machine like the Fellowes TB 250. This handy-dandy gizmo allows you to print bunches o’ stuff, encase it in inexpensive covers, and bind it into books. Below is an outline of the binding process and a video demonstration is also available online:

Buy the hardware – The machine costs less than $100 and available from any number of places on the Web. Be sure to purchase covers in a variety of sizes.
Print and gather your papers – Be sure to “jog” your paper nice and neatly.
Turn the machine on – This makes the heating element hot.
Place the paper into the cover – The inside of each cover’s spine is a ribbon of glue. Make sure the paper is touching the glue.
Place the book into the binder – This melts the glue.
Remove the book, and press the glue – The larger the book the more important it is to push the adhesive into the pages.
Go to Step #5, at least once – This makes the pages more secure in the cover.
Remove, and let cool – The glue is hot. Let it set.
Enjoy your book – This is the fun part. Read and scribble in your book to your heart’s content.

Binding and the Alex Catalogue

The Alex Catalogue of Electronic Texts is a collection of fulltext books brought together for the purposes of furthering a person’s liberal arts eduction. While it supports tools for finding, analyzing, and comparing texts, the items are intended to be read in book form as well. Consider printing and binding the PDF or fully transcribed versions of the texts. Your learning will be much more thorough, and you will be able to do more “active” reading.

Binding and libraries

Binding machines are cheap, and they facilitate a person’s learning by enabling users to organize their content. Maybe providing a binding service for library patrons is apropos? Make it easy for people to print things they find in a library. Make it easy for them to use some sort of binding machine. Enable them to take more control over the stuff of their learning, teaching, and research. It certainly sounds like good idea to me. After all, in this day and age, libraries aren’t so much about providing access to information as they are about making information more useful. Binding — books on demand — is just one example.

Browsing the Alex Catalogue

Eric Lease Morgan — Sat, 22 Aug 2009 01:51:44 +0000

The Alex Catalogue is browsable by author names, subject tags, and titles. Just select a browsable list, then a letter, and finally an item.

Browsability is an important feature of any library catalog. It gives you an opportunity to see what the collection contains without entering a query. It is also possible to use browsability to identify similar names, terms, or titles. “Oh look, I hadn’t thought of that idea, and look at the alternative spellings I can use.”

Creating the browsable list is rather trivial. Since all of the underlying content is saved in a relational database, it is rather easy to loop through the fields of “controlled” vocabulary terms and “authority” lists to identify matching etext titles. These lists include:

The later is probably the most interesting since it gives you an idea of the most common words and two-word phrases used in the corpus. For example, look at the list of words starting with the letter “k” and all the ways the word “kant” has been extracted from collection

Indexing and searching the Alex Catalogue

Eric Lease Morgan — Tue, 18 Aug 2009 01:23:59 +0000

The Alex Catalogue of Electronic Texts uses state-of-the-art software to index both the metadata and full text of its content. While the interface accepts complex Boolean queries, it is easier to enter a single word, a number of words, or a phrase. The underlying software will interpret what you enter and do much of hard query syntax work for you.

Indexing

The Catalogue consists of a number of different types of content harvested from different repositories. Most of the content is in the form of electronic texts (“etexts” as opposed to “ebooks”). Think Project Gutenberg, but also items from a defunct gopher archive from Virginia Tech, and more recently digitized materials from the Internet Archive. All of these items benefit from metadata and full text indexing. In other words, things like title words, author names, and computer-generated subject tags are made searchable as well as the full texts of the items.

The collection is supplemented with additional materials such as open access journal titles, open access journal article titles, some content from the HaitiTrust, as well as photographs taken by myself. Presently the full text of these secondary items is not included, just metadata: titles, authors, notes, and subjects. Search results return pointers to the full texts.

Regardless of content type, all metadata and full text is managed in an underlying MyLibrary database. To make the content searchable reports are written against the database and fed to Solr/Lucene for indexing. The Solr/Lucene data structure is rather simple consisting only of a number of Dublin Core-like fields, a default search field, and three facets (creator, subject/tag, and sub-collection). From a 30,000 foot view, this is the process used to index the content of the Catalogue:

extract metadata and full text records from the database
map each record’s fields to the Solr/Lucene data structure
insert each record into Solr/Lucene; index the record
go to Step #1 until all records have been indexed
optimize the index for faster retrieval

Solr/Lucene works pretty well, and interfacing with it was made much simpler through the use of a set of Perl modules called WebService::Solr. On the other hand, there are many ways the index could be improved such as implementing facilitates for sorting and adding weights to various fields. An indexer’s work is never done.

Searching

Because of people’s expectations, searching the index is a bit more complicated and not as straight-forward, but only because the interface is trying to do you some favors.

Solr/Lucene supports single-word, multiple-word, and phrase searches through the use of single or double quote marks. If multi-word queries are entered without Boolean operators, then a Boolean and is assumed.

Since people often enter multiple-word queries, and it is difficult to know whether or not they are really wanting to do a phrase search, the Alex Catalogue converts ambiguous multiple-word queries into more robust Boolean queries. For example a search for “william shakespeare” (sans the quote marks) will get converted into “(william AND shakespeare) OR ‘william shakespeare'” (again, sans the double quote marks) on behalf of the user. This is considered a feature of the Catalogue.

To some degree Solr/Lucene tokenizes query terms, and consequently searches for “book” and “books” return the same number of hits.

Search results are returned in a relevance ranked order. Some time in the future there will be the option of sorting results by date, author, title, and/or a couple of other criteria. Unlike other catalogs, Alex only has a single display — search results. There is no intermediary detailed display; the Catalogue only displays search results or the full text of the item.

In the hopes of making it easier for the user to refine their search, the results page allows the user to automatically turn queries into subject, author, or title searches. It takes advantage of a thesaurus (WordNet) to suggest alternative queries. The system returns “facets” (author names, subject tags, or material types) allowing the user to limit their query with additional terms and narrow search results. The process is not perfect and there are always ways of improving the interface. Usability is never done either.

Summary

Do not try to out think the Alex Catalogue. Enter a word or two. Refine your query using the links on the resulting page. Read & enjoy the discovered texts. Repeat.

Automatic metadata generation

Eric Lease Morgan — Fri, 31 Jul 2009 02:22:02 +0000

I have been having a great deal of success extracting keywords and two-word phrases from documents and assigning them as “subject headings” to electronic texts — automatic metadata generation. In many cases but not all, the set of assigned keywords I’ve created are just as good if not better as the controlled vocabulary terms assigned by librarians.

The problem

The Alex Catalogue is a collection of roughly 14,000 electronic texts. The vast majority come from Project Gutenberg. Some come from the Internet Archive. The smallest number come from a defunct etext collection of Virginia Tech. All of the documents are intended to surround the themes of American and English literature and Western philosophy.

With the exception of the non-fiction works from the Internet Archive, none of the electronic texts were associated with subject-related metadata. With the exception of author names (which are yet to be “well-controlled”), it has been difficult learn the “aboutness” of each of the documents. Such a thing is desirable for two reasons: 1) to enable the reader to evaluate the relevance of document, and 2) to provide a browsable interface to the collection. Without some sort of tags, subject headings, or application of clustering techniques, browsability is all but impossible. My goal was to solve this problem in an automated manner.

The solution

A couple of years ago I used tools such as Lingua::EN::Summarize and Open Text Summarizer to extract keywords and summaries from the etexts and assign them as subject terms. The process worked, but not extraordinarily well. I then learned about Term Frequency Inverse Document Frequency (TFIDF) to calculate “relevance”, and T-Score to calculate the probability of two words appearing side-by-side — bi-grams or two-word phrases. Applying these techniques to the etexts of the Alex Catalogue I have been able to create and add meaningful subject “tags” to each of my documents which then paves the way to browsability. Here is the algorithm I used to implement the solution:

Collect documents – This was done through various harvesting techniques. Etexts are saved to the local file system and what metadata does exist gets saved to a database.
Index the collection – Each of the documents is full-text indexed. Not only does this facilitate Steps #3 and #4, below, it makes the collection searchable.
Calculate a relevancy score (TFIDF) for each word – With the exception of parsing each etext into a set of “words”, counting the number of words in a document and the frequency of each word is easy. Determining the total number of documents in the collection is trivial. By searching the index for each word and getting back the number of documents in which it appears is the work of the indexer. With these four values (number of words in a document, frequency of a word in a document, the number of total documents, and the number of documents where the word appears) TFIDF can be calculated for each word.
Calculate a relevancy score for each bi-gram – Instead of extracting words from an etext, bi-grams (two-word phrases) were extracted and TFIDF is calculated for each of them, just like Step #3.
Save – If the score for each word or bi-gram is greater than an arbitrarily denoted lower bounds, and if the word or bi-gram is not a stop word, then assign the word or bi-gram to the etext. This step was the most time-consuming. It required many dry runs of the algorithm to determine an optimal lower-bounds as well as set of stop words. The lower the bounds the greater number of words and phrases are returned, but as the number of words and phrases increases their apparent usefulness decreases. The words become too common among the controlled vocabulary. At the other end of the scale, a stop word list needed to be created to remove meaningless words and phrases. The stop word problem was complicated in Project Gutenberg texts because of the “fine print” and legalese in most of the documents, and by the OCRed (optical character recognized) text from the Internet Archive. Words like “thofe” where the “f” was really an “s” needed to be removed.
Go to Step #3 for each document in the collection.
Done.

The results

Through this process I discovered a number of things.

First, in regards to fictional works, the words or phrases returned are often pronouns, and these were usually the names of characters from the work. An excellent example is Mark Twain’s Adventures of Huckleberry Finn whose currently assigned terms include: huck, tom, joe, injun joe, aunt polly, tom sawyer, muff potter, and injun joe’s.

Second, in regards to works of non-fiction, the words and phrases returned are also nouns, and these are objects referred to often in the etext. A good example includes John Stuart Mill’s Auguste Comte and Positivism where the assigned words are: comte, phaenomena, metaphysical, science, mankind, social, scientific, philosophy, and sciences.

Third, automatically generated keywords and phrases were many times just as useful as the librarian-assigned Library of Congress Subject headings. Many of the items harvested from the Internet Archive were complete with MARC records. Some of those records included subject headings. During Step #5 (above), I spent time observing the output and comparing it to previously assigned terms. Take for example a work called Universalism in America: A History by Richard Eddy. Its assigned headings included:

Universalism United States History
Unitarian Universalist churches United States

My automatically generated terms/phrases are:

universalist
ballou
hosea ballou
boston
universalist church
sermon
convention
first universalist
universalist quarterly
doctrine
universalist society
restorationist controversy
thomas whittemore
delivered
abner kneeland
sermon delivered
church
universalist meeting
universalist magazine
universal salvation
america
hosea ballon
vers alism
edward turner
general convention
universalism

Granted, the generated list is not perfect. For example, Hosea Ballou is mentioned twice, and the second was probably caused by an OCR error. On the other hand, how was a person to know that Hosea Ballou was even a part of the etext if it weren’t for this process? The same goes for the other people: Thomas Whittemore, Abner Kneeland, and Edward Turner. In defense of controlled vocabulary, the terms “church”, “sermon”, “doctrine”, and “american” could all be assumed from the (rather) hierarchal nature of LCSH, but unless a person understands the nature of LCSH such a thing is not obvious.

As a librarian I understand the power of a controlled vocabulary, but since I am not limited to three to five subject headings per entry, and because controlled vocabularies are often very specific, I have retained the LCSH in each record whenever possible. The more the merrier.

Next steps

Now that the collection has richer metadata, the next steps will be to exploit it. Some of those nexts steps include:

Normalize the data – Each of the subjects are currently saved in a single database field. They need to be normalized across the database to enable database joins and make it easier to generate reports.
Create a browsable interface – Write a set of static Web pages linking keywords and phrases to etexts. This will make it easier to see at a glance the type of content in the collection.
Re-index – Trivial. Send all the data and metadata back to the indexer ultimately improving the precision/recall ratio.
Enhance search experience – Extract the keywords and phrases from search results and display them to the user. Make them linkable to easily “find more like this one.” Extract the same keywords and phrases and use them to implement the increasingly popular browsable facets feature.
Enhance linked data – Generate a report against the database to create (better) RDF files complete with more meaningful (subject) tags. Link these tags to external vocabularies such as WordNet through the use of linked data thus contributing to the Semantic Web and enabling others to benefit from my labors. (Infomotions Man says, ‘Give back to the ‘Net”.)

Fun! Combining traditional librarianship with computer applications; not automating existing workflows as much as exploiting the inherent functions of a computer. Using mathematics to solve large-scale problems. Making it easier to do learning and research. It is the not what of librarianship that needs to change as much as the how.

Alex on Google

Eric Lease Morgan — Fri, 24 Jul 2009 11:41:06 +0000

I don’t exactly know how or why Google sometimes creates nice little screen shots of Web home pages, but it created one for my Alex Catalogue of Electronic Texts. I’ve seen them for other sites on the Web, and some of them even contain search boxes.

I wish I could get Google to make one of these for a greater number of my sites, and I wish I could get the Google Search Appliance to do the same. It is a nifty feature, to say the least.

Lingua::EN::Bigram (version 0.01)

Eric Lease Morgan — Tue, 23 Jun 2009 13:41:36 +0000

Below is the POD (Plain O’ Documentation) file describing a Perl module I wrote called Lingua::EN::Bigram.

The purpose of the module is to: 1) extract all of the two-word phrases from a given text, and 2) rank each phrase according to its probability of occurance. Very nice for doing textual analysis. For example, by applying this module to Mark Twain’s Adventures of Tom Sawyer it becomes evident that the signifcant two-word phrases are names of characters in the story. On the other hand, Ralph Waldo Emerson’s Essays: First Series returns action statements — instructions. On the other hand Henry David Thoreau’s Walden returns “walden pond” and descriptions of pine trees. Interesting.

The code is available here or on CPAN.

NAME

Lingua::EN::Bigram – Calculate significant two-word phrases based on frequency and/or T-Score

SYNOPSIS

  use Lingua::EN::Bigram;
  $bigram = Lingua::EN::Bigram->new;
  $bigram->text( 'All men by nature desire to know. An indication of this...' );
  $tscore = $bigram->tscore;
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {
  
    print "$$tscore{ $_ }\t" . "$_\n";
  
  }

DESCRIPTION

This module is designed to: 1) pull out all of the two-word phrases (collocations or “bigrams”) in a given text, and 2) list these phrases according to thier frequency and/or T-Score. Using this module is it possible to create list of the most common two-word phrases in a text as well as order them by their probable occurance, thus implying significance.

METHODS

new

Create a new, empty bigram object:

  # initalize
  $bigram = Lingua::EN::Bigram->new;

text

Set or get the text to be analyzed:

  # set the attribute
  $bigram->text( 'All good things must come to an end...' );
  
  # get the attribute
  $text = $bigram->text;

words

Return a list of all the tokens in a text. Each token will be a word or puncutation mark:

  # get words
  @words = $bigram->words;

word_count

Return a reference to a hash whose keys are a token and whose values are the number of times the token occurs in the text:

  # get word count
  $word_count = $bigram->word_count;
  
  # list the words according to frequency
  foreach ( sort { $$word_count{ $b } <=> $$word_count{ $a } } keys %$word_count ) {
  
    print $$word_count{ $_ }, "\t$_\n";
  
  }

bigrams

Return a list of all bigrams in the text. Each item will be a pair of tokens and the tokens may consist of words or puncutation marks:

  # get bigrams
  @bigrams = $bigram->bigrams;

bigram_count

Return a reference to a hash whose keys are a bigram and whose values are the frequency of the bigram in the text:

  # get bigram count
  $bigram_count = $bigram->bigram_count;
  
  # list the bigrams according to frequency
  foreach ( sort { $$bigram_count{ $b } <=> $$bigram_count{ $a } } keys %$bigram_count ) {
  
    print $$bigram_count{ $_ }, "\t$_\n";
  
  }

tscore

Return a reference to a hash whose keys are a bigram and whose values are a T-Score — a probabalistic calculation determining the significance of bigram occuring in the text:

  # get t-score
  $tscore = $bigram->tscore;
  
  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {
  
    print "$$tscore{ $_ }\t" . "$_\n";
  
  }

DISCUSSION

Given the increasing availability of full text materials, this module is intended to help “digital humanists” apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the word_count method and allow the user to search for those words in a concordance. The bigram_count method simply returns the frequency of a given bigram, but the tscore method can order them in a more finely tuned manner.

Consider using T-Score-weighted bigrams as classification terms to supplement the “aboutness” of texts. Concatonate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.

Each bigram includes punctuation. This is intentional. Developers may need want to remove bigrams containing such values from the output. Similarly, no effort has been made to remove commonly used words — stop words — from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution came with a script (bin/bigrams.pl) demonstrating how to remove puncutation and stop words from the displayed output.

Finally, this is not the only module supporting bigram extraction. See also Text::NSP which supports n-gram extraction.

TODO

There are probably a number of ways the module can be improved:

the constructor method could take a scalar as input, thus reducing the need for the text method
the distribution’s license should probably be changed to the Perl Aristic License
the addition of alternative T-Score calculations would be nice
it would be nice to support n-grams
make sure the module works with character sets beyond ASCII

ACKNOWLEDGEMENTS

T-Score is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. Page 109.

AUTHOR

Eric Lease Morgan

Alex Lite: A Tiny, standards-compliant, and portable catalogue of electronic texts

Eric Lease Morgan — Sat, 12 Jul 2008 16:13:01 +0000

One the beauties of XML its ability to be transformed into other plain text files, and that is what I have done with a simple software distribution called Alex Lite.

My TEI publishing system(s)

A number of years ago I created a Perl-based TEI publishing system called “My personal TEI publishing system“. Create a database designed to maintain authority lists (titles and subjects), sets of XSLT files, and TEI/XML snippets. Run reports against the database to create complete TEI files, XHTML files, RSS files, and files designed to be disseminated via OAI-PMH. Once the XHTML files are created, use an indexer to index them and provide a Web-based interface to the index. Using this system I have made accessible more than 150 of my essays, travelogues, and workshop handouts retrospectively converted as far back as 1989. Using this system, many (if not most) of my writings have been available via RSS and OAI-PMH since October 2004.

A couple of years later I morphed the TEI publishing system to enable me to mark-up content from an older version of my Alex Catalogue of Electronic Texts. Once marked up I planned to transform the TEI into a myriad of ebook formats: plain text, plain HTML, “smart” HTML, PalmPilot DOC and eReader, Rocket eBook, Newton Paperback, PDF, and TEI/XML. The mark-up process was laborious and I have only marked up about 100 texts, and you can see the fruits of these labors, but the combination of database and XML technology has enabled me to create Alex Lite.

Alex Lite

Alex Lite the result of a report written against my second TEI publishing system. Loop through each item in the database and update an index of titles. Create a TEI file against each item. Using XSLT, convert each TEI file into a plain HTML file, a “pretty” XHTML file, and a FO (Formatting Objects) file. Use a FO processor (like FOP) to convert the FO into PDF. Loop through each creator in the database to create an author index. Glue the whole thing together with an index.html file. Save all the files to a single directory and tar up the directory.

The result is a single file that can be downloaded, unpacked, and provide immediate access to sets of electronic books in an standards-compliant, operating system independent manner. Furthermore, no network connection is necessary except for the initial acquisition of the distribution. This directory can then be networked or saved to a CD-ROM. Think of the whole thing as if it were a library.

Give it a whirl; download a version of Alex Lite. Here is a list of all the items in the tiny collection:

Alger Jr., Horatio (1834-1899)
- The Cash Boy
- Cast Upon The Breakers
Bacon, Francis (1561-1626)
- The Essays
- The New Atlantis
Burroughs, Edgar Rice (1875-1850)
- At The Earth’s Core
- The Beasts Of Tarzan
- The Gods Of Mars
- The Jungle Tales Of Tarzan
- The Monster Men
- A Princess Of Mars
- The Return Of Tarzan
- The Son Of Tarzan
- Tarzan And The Jewels Of Opar
- Tarzan Of The Apes
- The Warlord Of Mars
Conrad, Joseph (1857-1924)
- The Heart Of Darkness
- Lord Jim
- The Secret Sharer
Doyle, Arthur Conan (1859-1930)
- The Adventures Of Sherlock Holmes
- The Case Book Of Sherlock Holmes
- His Last Bow
- The Hound Of The Baskervilles
- The Memoirs Of Sherlock Holmes
Machiavelli, Niccolo (1469-1527)
- The Prince
Plato (428-347 B.C.)
- Charmides, Or Temperance
- Cratylus
- Critias
- Crito
- Euthydemus
- Euthyphro
- Gorgias
Poe, Edgar Allan (1809-1849)
- The Angel Of The Odd–An Extravaganza
- The Balloon-Hoax
- Berenice
- The Black Cat
- The Cask Of Amontillado
Stoker, Bram (1847-1912)
- Dracula
- Dracula’s Guest
Twain, Mark (1835-1910)
- The Adventures Of Huckleberry Finn
- A Connecticut Yankee In King Arthur’s Court
- Extracts From Adam’s Diary
- A Ghost Story
- The Great Revolution In Pitcairn
- My Watch: An Instructive Little Tale
- A New Crime
- Niagara
- Political Economy

XSLT

As alluded to above, the beauty of XML is its ability to be transformed into other plain text formats. XSLT allows me to convert the TEI files into other files for different mediums. The distribution includes only simple HTML, “pretty” XHTML, and PDF versions of the texts, but for the XSLT affectionatos in the crowd who may want to see the XSLT files, I have included them here:

tei2htm.xsl – used to create plain HTML files complete with metadata
tei2html.xsl – used to create XHTML files complete with metadata as well as simple CSS-enabled navigation
tei2fo.xsl – used to create FO files which were fed to FOP in order to create things designed for printing on paper

Here’s a sample TEI file, Edgar Allen Poe’s The Cask Of Amontillado.

Future work

I believe there is a lot of promise in the marking-up of plain text into XML, specifically works of fiction and non-fictin into TEI. Making available such marked-up texts paves the way for doing textual analysis against them and for enhancing them with personal commentary. It is too bad that the mark-up process, even simple mark-up, is so labor intensive. Maybe I’ll do more of this sort of thing in my copius spare time.