TEI Toolbox, or "How a geeky librarian reads Horace"

tl;dr: By marking up documents in XML/TEI, you create sets of well-structured narrative data, which in turn enables you to "read" the documents in new & different ways.

Horace, not
Who was Horace and what did he write about? To answer this question, I suppose I could do some sort of Google search and hope for the best. Through an application of my information literacy skills, I suppose I could read an entry about Horace in an encyclopedia, of which I have many. One of those encyclopedias could be Wikipedia, of which I am a fan. Unfortunately, these approaches rely on the judgements of other people, and while other people have more experience & expertise than I do, it is still important for me to make up my own mind. To answer questions -- to educate myself -- I combine the advice of others with personal experience. Thus, the sole use of Google and/or encyclopedias fails me.

To put it another way, in order to answer my question, I ought to read Horace's works. For this librarian, obtaining the complete works of Horace is a trivial task. Search something like Project Gutenberg, the Internet Archive, Google Books, or the HathiTrust. Download the item. Read it in its electronic form, or print it and read it in a more traditional manner. Gasp! I could even borrow a copy from a library or purchase a copy. In the former case, I am not allowed to write in the item, and in the latter case the format may not be amenable to personal annotation. (Don't tell anybody, but I advocate writing in books. I even facilitate workshops on how to systematically do such a thing.)

Obtaining a copy of Horace's works and reading it in a traditional manner is all well and good, but the process is expensive in terms of time, and the process does not easily lend itself to computer assistance. After all, a computer can remember much better than I can. It can process things much faster than I can. And a computer can communicate with other computers much more thoroughly than I can. Thus, this geeky librarian wants to read Horace with the help of a computer.

This is where the TEI Toolbox comes in. The TEI Toolbox is a fledgling system of Bash, Perl, and Python scripts used to create and transform Text Encoding Initiative (TEI) files into other files, and these other files lend themselves to alternative forms of reading. More specifically, given a TEI file, the Toolbox can:

  • validate it
  • parse it into smaller blocks such as chapters and paragraphs, and save the results for later use
  • mark-up each word in each sentence in terms of parts-of-speech; "morphadorn" it
  • transform it into plain text, for other computing purposes
  • transform it into HTML, for online reading
  • transform it into PDF, specifically designed for printing
  • distill its content into a relational (SQLite) database complete with bibliographics, parts-of-speech, and named-entities
  • create a word-embedding (word2vec) database
  • create a (Solr) full-text index complete with parts-of-speech, named-entities, etc.
  • search the totality of the above in any number of different ways
  • compare & contrast documents in any number of different ways
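
A couple of the steps above (parsing a TEI file into paragraphs, and transforming it into plain text) can be sketched in a few lines of Python. This is only an illustration of the idea, not the Toolbox's own code: the sample document is invented, the function names are hypothetical, and I assume chapters are encoded as div elements with an n attribute directly under body.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny, invented TEI document standing in for a real file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Text</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter" n="1">
        <p>This is the <hi>first</hi> paragraph.</p>
        <p>This is the second paragraph.</p>
      </div>
      <div type="chapter" n="2">
        <p>A third paragraph, in a second chapter.</p>
      </div>
    </body>
  </text>
</TEI>"""


def tei_to_paragraphs(tei_xml):
    """Return a list of (chapter, paragraph_text) tuples from a TEI string."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    # Look only inside text/body, so paragraphs in the header are skipped.
    for div in root.findall("./tei:text/tei:body/tei:div", TEI_NS):
        chapter = div.get("n", "")
        for p in div.findall("tei:p", TEI_NS):
            # itertext() flattens inline markup such as <hi>;
            # split/join then normalizes the whitespace.
            text = " ".join("".join(p.itertext()).split())
            paragraphs.append((chapter, text))
    return paragraphs


def tei_to_plain_text(tei_xml):
    """Join the paragraphs into plain text, separated by blank lines."""
    return "\n\n".join(text for _, text in tei_to_paragraphs(tei_xml))


if __name__ == "__main__":
    print(tei_to_plain_text(SAMPLE_TEI))
```

Because each paragraph keeps its chapter number, the same list can feed later steps, such as a database or an index, without re-parsing the XML.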

Thus, given a valid TEI file, I can not only print a version of it amenable to traditional reading (and writing in), but I can also explore & navigate a text for the purposes of scholarly investigation. Such is exactly what I am doing with the complete works of Horace.
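
To give a flavor of what "exploring & navigating" structured text can look like, here is a minimal sketch of distilling paragraphs into SQLite and then querying them. The table layout, the function names, and the sample rows are all assumptions made for illustration; the Toolbox's actual database schema is richer and is not reproduced here.

```python
import sqlite3

# Hypothetical rows: (chapter, paragraph number, text). In practice these
# would come from parsing a TEI file; these are invented for illustration.
ROWS = [
    ("1", 1, "Sing of love and of war."),
    ("1", 2, "The gods give laws to men."),
    ("2", 1, "Love is patient with the foolish."),
]


def build_database(rows):
    """Create an in-memory SQLite database holding one row per paragraph."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE paragraphs (chapter TEXT, number INTEGER, content TEXT)"
    )
    connection.executemany("INSERT INTO paragraphs VALUES (?, ?, ?)", rows)
    connection.commit()
    return connection


def search(connection, fragment):
    """Return (chapter, number) for each paragraph containing the fragment.

    LIKE does substring matching and, by default, ignores ASCII case.
    """
    cursor = connection.execute(
        "SELECT chapter, number FROM paragraphs "
        "WHERE content LIKE ? ORDER BY chapter, number",
        (f"%{fragment}%",),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_database(ROWS)
    print(search(db, "love"))
```

The point is less the SQL itself than the shift it represents: once the text is rows in a table, questions like "where is this word used?" become queries rather than page-flipping.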

My first step was to identify a plain text version of Horace's works, and the version at Project Gutenberg was just fine. Next, I marked up the plain text into valid TEI using a set of Barebones BBEdit macros of my own design. This process was tedious and took me about an hour. I then used my Toolbox's ./bin/carrel-initialize.sh script to create a tiny file system. I then used the ./bin/carrel-build.sh script to perform most of the actions outlined above. This resulted in a set of platform-independent files saved in a directory named "horace". For example, it includes:

To date, I have printed the PDF file, and I plan to bind it before the week is out. I will then commence upon reading (and writing in) it in the traditional manner. In the meantime, I have used the Toolbox to index the whole with Solr, and I have queried the resulting index for some of my favorite themes. Consequently, I have gotten a jump start on my reading. What I think is really cool (or "kewl") is how the search results return pointers to the exact locations of the hits in the HTML file. This means I can view the search results within the context of the whole work, like a concordance on steroids. For example, below are sample results for the query "love AND war". Notice how the results are hyperlinked within the complete work:

  1. While you, great Lollius, declaim at Rome...
  2. O thou fountain of Bandusia, clearer than...
  3. When first Greece, her wars being over, b...

Here are some results for "god AND law":

  1. There was a certain freedman, who, an old...
  2. Orpheus, the priest and Interpreter of th...
  3. O ye elder of the youths, though you are ...

And finally, "(man OR men) AND justice":

  1. What shame or bound can there be to our a...
  2. Damasippus is mad for purchasing antique ...
  3. Have you any regard for reputation, which...
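
For the curious, Boolean queries like the three above are typically sent to Solr as the q parameter of its /select handler. The sketch below only builds the request URLs; it does not contact a server. The host, the port (8983 is Solr's default), and the core name "horace" are assumptions on my part, and build_select_url is a hypothetical helper, not part of the Toolbox.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr instance with a core named "horace".
SOLR_BASE = "http://localhost:8983/solr/horace"


def build_select_url(query, rows=3, base=SOLR_BASE):
    """Build the URL for a query against Solr's standard /select handler."""
    parameters = {
        "q": query,    # the Boolean query, e.g. "love AND war"
        "rows": rows,  # how many hits to return
        "wt": "json",  # ask for a JSON response
    }
    # urlencode() escapes spaces and parentheses so the URL is safe to send.
    return f"{base}/select?{urlencode(parameters)}"


if __name__ == "__main__":
    for query in ("love AND war", "god AND law", "(man OR men) AND justice"):
        print(build_select_url(query))
```

Fetching any of these URLs from a running index would return the matching documents; turning each hit into a link pointing into the HTML file is a separate step that depends on what is stored in the index.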

All of the above only scratches the surface of what is possible with the Toolbox, but the essence of the Toolbox is this: by marking up a document in TEI you transform a narrative text into a set of structured data amenable to computer analysis. From where I sit, the process of marking up a document is a form of close reading. Printing a version of the text and reading (and writing in) it lends itself to additional methods of use & understanding. Finally, by exploiting derivative versions of the text with a computer, even more methods of understanding present themselves. Hopefully, I will share some of those other techniques in future postings. Now, I'm off to my workshop to bind the book, all 400 pages of it...

"Reading is FUNdamental."


Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This document was originally published as an Infomotions Musing.
Date created: 2020-12-29
Date updated: 2020-12-29
Subject(s): reading; XML (eXtensible Mark-up Language);
URL: http://infomotions.com/musings/geek-reading-horace/