Archive for the ‘Alex Catalogue’ Category

Alex, the movie!

Sunday, October 4th, 2009

Created circa 1998, this movie describes the purpose and scope of the Alex Catalogue of Electronic Texts. While coming off rather pompous, the gist of what gets said is still valid and correct. Heck, the links even work. “Thanks Berkeley!”

How to make a book (#1 of 3)

Sunday, August 23rd, 2009

This is a series of posts where I will describe and illustrate how to make books. In this first post I will show you how to make a book with a thermo-binding machine. In the second post I will demonstrate how to make a book by simply tearing and folding paper. In the third installment, I will make a traditional book with a traditional cover and binding. The book — or more formally, the codex — is a pretty useful format for containing information.

Fellowes TB 250 thermo-binding machine

The number of full-text books found on the Web is increasing at a dramatic pace. A very large number of these books are in the public domain and freely available for downloading. While computers make it easy to pick through smaller parts of books, it is difficult to read and understand them without printing. Once they are printed, you are empowered to write in the margins, annotate them as you see fit, and share them with your friends. On the other hand, reams of unbound paper are difficult to handle. What to do?

Enter a binding machine, specifically a thermo-binding machine like the Fellowes TB 250. This handy-dandy gizmo allows you to print bunches o’ stuff, encase it in inexpensive covers, and bind it into books. Below is an outline of the binding process and a video demonstration is also available online:

  1. Buy the hardware – The machine costs less than $100 and is available from any number of places on the Web. Be sure to purchase covers in a variety of sizes.
  2. Print and gather your papers – Be sure to “jog” your paper nice and neatly.
  3. Turn the machine on – This makes the heating element hot.
  4. Place the paper into the cover – The inside of each cover’s spine is a ribbon of glue. Make sure the paper is touching the glue.
  5. Place the book into the binder – This melts the glue.
  6. Remove the book, and press the glue – The larger the book the more important it is to push the adhesive into the pages.
  7. Go to Step #5, at least once – This makes the pages more secure in the cover.
  8. Remove, and let cool – The glue is hot. Let it set.
  9. Enjoy your book – This is the fun part. Read and scribble in your book to your heart’s content.

Binding and the Alex Catalogue

The Alex Catalogue of Electronic Texts is a collection of full-text books brought together for the purpose of furthering a person’s liberal arts education. While it supports tools for finding, analyzing, and comparing texts, the items are intended to be read in book form as well. Consider printing and binding the PDF or fully transcribed versions of the texts. Your learning will be much more thorough, and you will be able to do more “active” reading.

Binding and libraries

Binding machines are cheap, and they facilitate a person’s learning by enabling users to organize their content. Maybe providing a binding service for library patrons is apropos? Make it easy for people to print things they find in a library. Make it easy for them to use some sort of binding machine. Enable them to take more control over the stuff of their learning, teaching, and research. It certainly sounds like a good idea to me. After all, in this day and age, libraries aren’t so much about providing access to information as they are about making information more useful. Binding (books on demand) is just one example.

Browsing the Alex Catalogue

Friday, August 21st, 2009

The Alex Catalogue is browsable by author names, subject tags, and titles. Just select a browsable list, then a letter, and finally an item.

Browsability is an important feature of any library catalog. It gives you an opportunity to see what the collection contains without entering a query. It is also possible to use browsability to identify similar names, terms, or titles. “Oh look, I hadn’t thought of that idea, and look at the alternative spellings I can use.”

Creating the browsable list is rather trivial. Since all of the underlying content is saved in a relational database, it is rather easy to loop through the fields of “controlled” vocabulary terms and “authority” lists to identify matching etext titles. These lists include:

The latter is probably the most interesting since it gives you an idea of the most common words and two-word phrases used in the corpus. For example, look at the list of words starting with the letter “k” and all the ways the word “kant” has been extracted from the collection.

Indexing and searching the Alex Catalogue

Monday, August 17th, 2009

The Alex Catalogue of Electronic Texts uses state-of-the-art software to index both the metadata and full text of its content. While the interface accepts complex Boolean queries, it is easier to enter a single word, a number of words, or a phrase. The underlying software will interpret what you enter and do much of the hard query-syntax work for you.

Indexing

The Catalogue consists of a number of different types of content harvested from different repositories. Most of the content is in the form of electronic texts (“etexts” as opposed to “ebooks”). Think Project Gutenberg, but also items from a defunct gopher archive from Virginia Tech, and more recently digitized materials from the Internet Archive. All of these items benefit from metadata and full text indexing. In other words, things like title words, author names, and computer-generated subject tags are made searchable as well as the full texts of the items.

The collection is supplemented with additional materials such as open access journal titles, open access journal article titles, some content from the HathiTrust, as well as photographs I have taken. Presently the full text of these secondary items is not included, just metadata: titles, authors, notes, and subjects. Search results return pointers to the full texts.

Regardless of content type, all metadata and full text is managed in an underlying MyLibrary database. To make the content searchable, reports are written against the database and fed to Solr/Lucene for indexing. The Solr/Lucene data structure is rather simple, consisting only of a number of Dublin Core-like fields, a default search field, and three facets (creator, subject/tag, and sub-collection). From a 30,000-foot view, this is the process used to index the content of the Catalogue:

  1. extract metadata and full text records from the database
  2. map each record’s fields to the Solr/Lucene data structure
  3. insert each record into Solr/Lucene; index the record
  4. go to Step #1 until all records have been indexed
  5. optimize the index for faster retrieval
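Steps #1 and #2 above can be sketched in a few lines. This is only an illustration in Python, not the Perl and WebService::Solr code actually used by the Catalogue. The record layout, the field names (`creator`, `subject`, `collection`, `text`), and the helper names are all assumptions made for the example, and nothing is sent to an indexer; the sketch stops once the records have been mapped and serialized.

```python
import json

# Hypothetical rows, standing in for a report written against the database.
records = [
    {
        "id": "twain-huck",
        "title": "The Adventures Of Huckleberry Finn",
        "author": "Twain, Mark",
        "tags": ["huck", "tom"],
        "collection": "gutenberg",
        "fulltext": "You don't know about me without you have read a book...",
    },
]


def to_index_doc(record):
    """Map one database record to a flat, Dublin Core-like document."""
    doc = {
        "id": record["id"],
        "title": record["title"],
        "creator": record["author"],         # facet (assumed field name)
        "subject": record["tags"],           # facet (assumed field name)
        "collection": record["collection"],  # facet (assumed field name)
    }
    # The default search field simply concatenates everything worth matching.
    parts = [record["title"], record["author"], *record["tags"],
             record.get("fulltext", "")]
    doc["text"] = " ".join(part for part in parts if part)
    return doc


def build_update_payload(rows):
    """Serialize every mapped record as one JSON array of documents."""
    return json.dumps([to_index_doc(row) for row in rows])


print(build_update_payload(records)[:60])
```

Keeping the mapping in one small function makes it easy to add or rename fields later without touching the loop that walks the database.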

Solr/Lucene works pretty well, and interfacing with it was made much simpler through the use of a set of Perl modules called WebService::Solr. On the other hand, there are many ways the index could be improved, such as implementing facilities for sorting and adding weights to various fields. An indexer’s work is never done.

Searching

Because of people’s expectations, searching the index is less straightforward than indexing it, but only because the interface is trying to do you some favors.

Solr/Lucene supports single-word, multiple-word, and phrase searches through the use of single or double quote marks. If multi-word queries are entered without Boolean operators, then a Boolean AND is assumed.

Since people often enter multiple-word queries, and it is difficult to know whether or not they are really wanting to do a phrase search, the Alex Catalogue converts ambiguous multiple-word queries into more robust Boolean queries. For example a search for “william shakespeare” (sans the quote marks) will get converted into “(william AND shakespeare) OR ‘william shakespeare’” (again, sans the double quote marks) on behalf of the user. This is considered a feature of the Catalogue.
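The conversion described above might look something like the following. This is a minimal Python sketch, not the Catalogue’s own code: the function name `expand_query` is made up for the example, a query counts as “ambiguous” only if it has more than one word, no double quote marks, and no Boolean operators, and the phrase half of the result is wrapped in double quotes.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def expand_query(query):
    """Turn an ambiguous multi-word query into an AND-or-phrase query."""
    terms = query.split()
    # Leave the query alone if the user already supplied some syntax.
    has_syntax = '"' in query or any(t in BOOLEAN_OPERATORS for t in terms)
    if len(terms) < 2 or has_syntax:
        return query
    conjunction = " AND ".join(terms)
    phrase = " ".join(terms)
    return f'({conjunction}) OR "{phrase}"'


print(expand_query("william shakespeare"))
# prints: (william AND shakespeare) OR "william shakespeare"
```

Single-word queries, quoted phrases, and queries that already contain an operator pass through untouched, so people who know the query syntax still get exactly what they typed.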

To some degree Solr/Lucene stems query terms, and consequently searches for “book” and “books” return the same number of hits.

Search results are returned in a relevance ranked order. Some time in the future there will be the option of sorting results by date, author, title, and/or a couple of other criteria. Unlike other catalogs, Alex only has a single display — search results. There is no intermediary detailed display; the Catalogue only displays search results or the full text of the item.

In the hopes of making it easier for the user to refine their search, the results page allows the user to automatically turn queries into subject, author, or title searches. It takes advantage of a thesaurus (WordNet) to suggest alternative queries. The system returns “facets” (author names, subject tags, or material types) allowing the user to limit their query with additional terms and narrow search results. The process is not perfect and there are always ways of improving the interface. Usability is never done either.

Summary

Do not try to outthink the Alex Catalogue. Enter a word or two. Refine your query using the links on the resulting page. Read & enjoy the discovered texts. Repeat.

Automatic metadata generation

Thursday, July 30th, 2009

I have been having a great deal of success extracting keywords and two-word phrases from documents and assigning them as “subject headings” to electronic texts, a process known as automatic metadata generation. In many cases, but not all, the set of keywords I have assigned is just as good as, if not better than, the controlled vocabulary terms assigned by librarians.

The problem

The Alex Catalogue is a collection of roughly 14,000 electronic texts. The vast majority come from Project Gutenberg. Some come from the Internet Archive. The smallest number come from a defunct etext collection at Virginia Tech. All of the documents are intended to center on the themes of American and English literature and Western philosophy.

With the exception of the non-fiction works from the Internet Archive, none of the electronic texts were associated with subject-related metadata. With the exception of author names (which are yet to be “well-controlled”), it has been difficult to learn the “aboutness” of each of the documents. Such a thing is desirable for two reasons: 1) to enable the reader to evaluate the relevance of a document, and 2) to provide a browsable interface to the collection. Without some sort of tags, subject headings, or application of clustering techniques, browsability is all but impossible. My goal was to solve this problem in an automated manner.

The solution

A couple of years ago I used tools such as Lingua::EN::Summarize and Open Text Summarizer to extract keywords and summaries from the etexts and assign them as subject terms. The process worked, but not extraordinarily well. I then learned about Term Frequency Inverse Document Frequency (TFIDF) to calculate “relevance”, and T-Score to calculate the probability of two words appearing side-by-side — bi-grams or two-word phrases. Applying these techniques to the etexts of the Alex Catalogue I have been able to create and add meaningful subject “tags” to each of my documents which then paves the way to browsability. Here is the algorithm I used to implement the solution:

  1. Collect documents – This was done through various harvesting techniques. Etexts are saved to the local file system and what metadata does exist gets saved to a database.
  2. Index the collection – Each of the documents is full-text indexed. Not only does this facilitate Steps #3 and #4, below, it makes the collection searchable.
  3. Calculate a relevancy score (TFIDF) for each word – With the exception of parsing each etext into a set of “words”, counting the number of words in a document and the frequency of each word is easy. Determining the total number of documents in the collection is trivial. Finding the number of documents in which each word appears is the work of the indexer: search the index for the word and count the hits. With these four values (number of words in a document, frequency of a word in a document, the number of total documents, and the number of documents where the word appears) TFIDF can be calculated for each word.
  4. Calculate a relevancy score for each bi-gram – Instead of extracting words from an etext, bi-grams (two-word phrases) are extracted and TFIDF is calculated for each of them, just like Step #3.
  5. Save – If the score for a word or bi-gram is greater than an arbitrarily chosen lower bound, and if the word or bi-gram is not a stop word, then assign the word or bi-gram to the etext. This step was the most time-consuming. It required many dry runs of the algorithm to determine an optimal lower bound as well as a set of stop words. The lower the bound, the greater the number of words and phrases returned, but as the number of words and phrases increases their apparent usefulness decreases. The words become too common among the controlled vocabulary. At the other end of the scale, a stop word list needed to be created to remove meaningless words and phrases. The stop word problem was complicated in Project Gutenberg texts because of the “fine print” and legalese in most of the documents, and by the OCRed (optical character recognized) text from the Internet Archive. Words like “thofe” where the “f” was really an “s” needed to be removed.
  6. Go to Step #3 for each document in the collection.
  7. Done.
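Steps #3 and #5 can be sketched as follows. This is an illustration in Python under several assumptions, not the code used for the Catalogue: it uses one common TFIDF variant (term frequency divided by document length, times the natural log of total documents over documents containing the term), it counts documents by scanning a small in-memory list instead of asking an indexer, and the function names, tokenizer, stop word list, and lower bound are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into crude word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf(term_count, doc_length, total_docs, docs_with_term):
    """Combine the four values named in Step #3 into one score."""
    tf = term_count / doc_length
    idf = math.log(total_docs / docs_with_term)
    return tf * idf


def keywords(doc, corpus, stop_words, lower_bound):
    """Return {word: score} for words in doc that clear the lower bound."""
    tokens = tokenize(doc)
    counts = Counter(tokens)
    doc_sets = [set(tokenize(d)) for d in corpus]
    scores = {}
    for word, count in counts.items():
        if word in stop_words:
            continue
        docs_with_term = sum(1 for words in doc_sets if word in words)
        if docs_with_term == 0:  # doc is not part of the corpus; skip
            continue
        score = tfidf(count, len(tokens), len(corpus), docs_with_term)
        if score > lower_bound:
            scores[word] = score
    return scores


corpus = [
    "huck and tom went to the river",
    "the river was wide",
    "the cat sat on the mat",
]
for word, score in sorted(keywords(corpus[0], corpus, {"the", "and", "to"}, 0.0).items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}\t{word}")
```

Raising `lower_bound` shortens the list of assigned terms, which is the trade-off described in Step #5. Bi-grams (Step #4) could be scored the same way by counting pairs of adjacent tokens instead of single tokens.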

The results

Through this process I discovered a number of things.

First, with regard to fictional works, the words or phrases returned are often proper nouns, and these are usually the names of characters from the work. An excellent example is Mark Twain’s Adventures of Huckleberry Finn whose currently assigned terms include: huck, tom, joe, injun joe, aunt polly, tom sawyer, muff potter, and injun joe’s.

Second, with regard to works of non-fiction, the words and phrases returned are also nouns, and these are objects referred to often in the etext. A good example is John Stuart Mill’s Auguste Comte and Positivism where the assigned words are: comte, phaenomena, metaphysical, science, mankind, social, scientific, philosophy, and sciences.

Third, automatically generated keywords and phrases were many times just as useful as the librarian-assigned Library of Congress Subject Headings. Many of the items harvested from the Internet Archive were complete with MARC records. Some of those records included subject headings. During Step #5 (above), I spent time observing the output and comparing it to previously assigned terms. Take for example a work called Universalism in America: A History by Richard Eddy. Its assigned headings included:

  • Universalism United States History
  • Unitarian Universalist churches United States

My automatically generated terms/phrases are:

  • universalist
  • ballou
  • hosea ballou
  • boston
  • universalist church
  • sermon
  • convention
  • first universalist
  • universalist quarterly
  • doctrine
  • universalist society
  • restorationist controversy
  • thomas whittemore
  • delivered
  • abner kneeland
  • sermon delivered
  • church
  • universalist meeting
  • universalist magazine
  • universal salvation
  • america
  • hosea ballon
  • vers alism
  • edward turner
  • general convention
  • universalism

Granted, the generated list is not perfect. For example, Hosea Ballou is mentioned twice, and the second mention (“hosea ballon”) was probably caused by an OCR error. On the other hand, how was a person to know that Hosea Ballou was even a part of the etext if it weren’t for this process? The same goes for the other people: Thomas Whittemore, Abner Kneeland, and Edward Turner. In defense of controlled vocabulary, the terms “church”, “sermon”, “doctrine”, and “america” could all be assumed from the (rather) hierarchical nature of LCSH, but unless a person understands the nature of LCSH such a thing is not obvious.

As a librarian I understand the power of a controlled vocabulary, but since I am not limited to three to five subject headings per entry, and because controlled vocabularies are often very specific, I have retained the LCSH in each record whenever possible. The more the merrier.

Next steps

Now that the collection has richer metadata, the next steps will be to exploit it. Some of those next steps include:

  1. Normalize the data – Each of the subjects is currently saved in a single database field. They need to be normalized across the database to enable database joins and make it easier to generate reports.
  2. Create a browsable interface – Write a set of static Web pages linking keywords and phrases to etexts. This will make it easier to see at a glance the type of content in the collection.
  3. Re-index – Trivial. Send all the data and metadata back to the indexer ultimately improving the precision/recall ratio.
  4. Enhance search experience – Extract the keywords and phrases from search results and display them to the user. Make them linkable to easily “find more like this one.” Extract the same keywords and phrases and use them to implement the increasingly popular browsable facets feature.
  5. Enhance linked data – Generate a report against the database to create (better) RDF files complete with more meaningful (subject) tags. Link these tags to external vocabularies such as WordNet through the use of linked data thus contributing to the Semantic Web and enabling others to benefit from my labors. (Infomotions Man says, “Give back to the ’Net.”)

Fun! Combining traditional librarianship with computer applications; not automating existing workflows as much as exploiting the inherent functions of a computer. Using mathematics to solve large-scale problems. Making it easier to do learning and research. It is not the what of librarianship that needs to change as much as the how.

Lingua::EN::Bigram (version 0.01)

Tuesday, June 23rd, 2009

Below is the POD (Plain O’ Documentation) file describing a Perl module I wrote called Lingua::EN::Bigram.

The purpose of the module is to: 1) extract all of the two-word phrases from a given text, and 2) rank each phrase according to its probability of occurrence. Very nice for doing textual analysis. For example, by applying this module to Mark Twain’s Adventures of Tom Sawyer it becomes evident that the significant two-word phrases are names of characters in the story. By contrast, Ralph Waldo Emerson’s Essays: First Series returns action statements (instructions), while Henry David Thoreau’s Walden returns “walden pond” and descriptions of pine trees. Interesting.

The code is available here or on CPAN.

NAME

Lingua::EN::Bigram – Calculate significant two-word phrases based on frequency and/or T-Score

SYNOPSIS

  use Lingua::EN::Bigram;
  $bigram = Lingua::EN::Bigram->new;
  $bigram->text( 'All men by nature desire to know. An indication of this...' );
  $tscore = $bigram->tscore;
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }

DESCRIPTION

This module is designed to: 1) pull out all of the two-word phrases (collocations or “bigrams”) in a given text, and 2) list these phrases according to their frequency and/or T-Score. Using this module it is possible to create a list of the most common two-word phrases in a text as well as order them by their probable occurrence, thus implying significance.

METHODS

new

Create a new, empty bigram object:

  # initialize
  $bigram = Lingua::EN::Bigram->new;

text

Set or get the text to be analyzed:

  # set the attribute
  $bigram->text( 'All good things must come to an end...' );

  # get the attribute
  $text = $bigram->text;

words

Return a list of all the tokens in a text. Each token will be a word or punctuation mark:

  # get words
  @words = $bigram->words;

word_count

Return a reference to a hash whose keys are a token and whose values are the number of times the token occurs in the text:

  # get word count
  $word_count = $bigram->word_count;

  # list the words according to frequency
  foreach ( sort { $$word_count{ $b } <=> $$word_count{ $a } } keys %$word_count ) {

    print $$word_count{ $_ }, "\t$_\n";

  }

bigrams

Return a list of all bigrams in the text. Each item will be a pair of tokens and the tokens may consist of words or punctuation marks:

  # get bigrams
  @bigrams = $bigram->bigrams;

bigram_count

Return a reference to a hash whose keys are a bigram and whose values are the frequency of the bigram in the text:

  # get bigram count
  $bigram_count = $bigram->bigram_count;

  # list the bigrams according to frequency
  foreach ( sort { $$bigram_count{ $b } <=> $$bigram_count{ $a } } keys %$bigram_count ) {

    print $$bigram_count{ $_ }, "\t$_\n";

  }

tscore

Return a reference to a hash whose keys are a bigram and whose values are a T-Score, a probabilistic calculation determining the significance of the bigram occurring in the text:

  # get t-score
  $tscore = $bigram->tscore;

  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }
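For readers curious about the arithmetic, here is a rough sketch of one common T-Score formulation for collocations: the observed bigram count minus the count expected by chance, divided by the square root of the observed count. It is written in Python for illustration and is not the module’s source code; the module’s exact calculation may differ, and the tokenizer and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split text into lower-cased words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def t_scores(text):
    """Return {bigram: t-score} for every adjacent pair of tokens."""
    words = tokens(text)
    total = len(words)
    word_count = Counter(words)
    bigram_count = Counter(zip(words, words[1:]))
    scores = {}
    for (first, second), observed in bigram_count.items():
        # Count expected if the two tokens were independent of each other.
        expected = word_count[first] * word_count[second] / total
        scores[f"{first} {second}"] = (observed - expected) / math.sqrt(observed)
    return scores


text = "walden pond is deep . walden pond is clear . the pond is cold ."
scores = t_scores(text)
for bigram in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(f"{scores[bigram]:.3f}\t{bigram}")
```

As with the module, punctuation marks are kept as tokens, so bigrams containing them will show up in the output unless they are filtered out afterwards.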

DISCUSSION

Given the increasing availability of full text materials, this module is intended to help “digital humanists” apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the word_count method and allow the user to search for those words in a concordance. The bigram_count method simply returns the frequency of a given bigram, but the tscore method can order them in a more finely tuned manner.

Consider using T-Score-weighted bigrams as classification terms to supplement the “aboutness” of texts. Concatenate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.

Each bigram includes punctuation. This is intentional. Developers may want to remove bigrams containing such values from the output. Similarly, no effort has been made to remove commonly used words (stop words) from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution comes with a script (bin/bigrams.pl) demonstrating how to remove punctuation and stop words from the displayed output.

Finally, this is not the only module supporting bigram extraction. See also Text::NSP which supports n-gram extraction.

TODO

There are probably a number of ways the module can be improved:

  • the constructor method could take a scalar as input, thus reducing the need for the text method
  • the distribution’s license should probably be changed to the Perl Artistic License
  • the addition of alternative T-Score calculations would be nice
  • it would be nice to support n-grams
  • make sure the module works with character sets beyond ASCII

ACKNOWLEDGEMENTS

T-Score is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. Page 109.

AUTHOR

Eric Lease Morgan <eric_morgan@infomotions.com>

Alex Lite: A Tiny, standards-compliant, and portable catalogue of electronic texts

Saturday, July 12th, 2008

One of the beauties of XML is its ability to be transformed into other plain text files, and that is what I have done with a simple software distribution called Alex Lite.

My TEI publishing system(s)

A number of years ago I created a Perl-based TEI publishing system called “My personal TEI publishing system”. Create a database designed to maintain authority lists (titles and subjects), sets of XSLT files, and TEI/XML snippets. Run reports against the database to create complete TEI files, XHTML files, RSS files, and files designed to be disseminated via OAI-PMH. Once the XHTML files are created, use an indexer to index them and provide a Web-based interface to the index. Using this system I have made accessible more than 150 of my essays, travelogues, and workshop handouts retrospectively converted as far back as 1989. Using this system, many (if not most) of my writings have been available via RSS and OAI-PMH since October 2004.

A couple of years later I morphed the TEI publishing system to enable me to mark-up content from an older version of my Alex Catalogue of Electronic Texts. Once marked up I planned to transform the TEI into a myriad of ebook formats: plain text, plain HTML, “smart” HTML, PalmPilot DOC and eReader, Rocket eBook, Newton Paperback, PDF, and TEI/XML. The mark-up process was laborious and I have only marked up about 100 texts, and you can see the fruits of these labors, but the combination of database and XML technology has enabled me to create Alex Lite.

Alex Lite

Alex Lite is the result of a report written against my second TEI publishing system. Loop through each item in the database and update an index of titles. Create a TEI file for each item. Using XSLT, convert each TEI file into a plain HTML file, a “pretty” XHTML file, and an FO (Formatting Objects) file. Use an FO processor (like FOP) to convert the FO into PDF. Loop through each creator in the database to create an author index. Glue the whole thing together with an index.html file. Save all the files to a single directory and tar up the directory.

The result is a single file that can be downloaded and unpacked to provide immediate access to sets of electronic books in a standards-compliant, operating system independent manner. Furthermore, no network connection is necessary except for the initial acquisition of the distribution. The unpacked directory can then be networked or saved to a CD-ROM. Think of the whole thing as if it were a library.

Give it a whirl; download a version of Alex Lite. Here is a list of all the items in the tiny collection:

  1. Alger Jr., Horatio (1834-1899)
    • The Cash Boy
    • Cast Upon The Breakers
  2. Bacon, Francis (1561-1626)
    • The Essays
    • The New Atlantis
  3. Burroughs, Edgar Rice (1875-1950)
    • At The Earth’s Core
    • The Beasts Of Tarzan
    • The Gods Of Mars
    • The Jungle Tales Of Tarzan
    • The Monster Men
    • A Princess Of Mars
    • The Return Of Tarzan
    • The Son Of Tarzan
    • Tarzan And The Jewels Of Opar
    • Tarzan Of The Apes
    • The Warlord Of Mars
  4. Conrad, Joseph (1857-1924)
    • The Heart Of Darkness
    • Lord Jim
    • The Secret Sharer
  5. Doyle, Arthur Conan (1859-1930)
    • The Adventures Of Sherlock Holmes
    • The Case Book Of Sherlock Holmes
    • His Last Bow
    • The Hound Of The Baskervilles
    • The Memoirs Of Sherlock Holmes
  6. Machiavelli, Niccolo (1469-1527)
    • The Prince
  7. Plato (428-347 B.C.)
    • Charmides, Or Temperance
    • Cratylus
    • Critias
    • Crito
    • Euthydemus
    • Euthyphro
    • Gorgias
  8. Poe, Edgar Allan (1809-1849)
    • The Angel Of The Odd–An Extravaganza
    • The Balloon-Hoax
    • Berenice
    • The Black Cat
    • The Cask Of Amontillado
  9. Stoker, Bram (1847-1912)
    • Dracula
    • Dracula’s Guest
  10. Twain, Mark (1835-1910)
    • The Adventures Of Huckleberry Finn
    • A Connecticut Yankee In King Arthur’s Court
    • Extracts From Adam’s Diary
    • A Ghost Story
    • The Great Revolution In Pitcairn
    • My Watch: An Instructive Little Tale
    • A New Crime
    • Niagara
    • Political Economy

XSLT

As alluded to above, the beauty of XML is its ability to be transformed into other plain text formats. XSLT allows me to convert the TEI files into other files for different media. The distribution includes only simple HTML, “pretty” XHTML, and PDF versions of the texts, but for the XSLT aficionados in the crowd who may want to see the XSLT files, I have included them here:

  • tei2htm.xsl – used to create plain HTML files complete with metadata
  • tei2html.xsl – used to create XHTML files complete with metadata as well as simple CSS-enabled navigation
  • tei2fo.xsl – used to create FO files which were fed to FOP in order to create things designed for printing on paper

Here’s a sample TEI file, Edgar Allan Poe’s The Cask Of Amontillado.

Future work

I believe there is a lot of promise in the marking-up of plain text into XML, specifically works of fiction and non-fiction into TEI. Making available such marked-up texts paves the way for doing textual analysis against them and for enhancing them with personal commentary. It is too bad that the mark-up process, even simple mark-up, is so labor intensive. Maybe I’ll do more of this sort of thing in my copious spare time.