Preservationists have the most challenging job

January 3rd, 2010

In the field of librarianship, I think the preservationists have the most challenging job because it is fraught with the greatest number of unknowns.

Twenty-eight (28) CDs

mangled book
mangled book

As I am writing this posting, I am in the middle of an annual processes — archiving the data I created from the previous year. This is something I have been doing since 1986. It began by putting my writings on 3.5 inch “floppy” disks. After a few years, CDs became more feasible, and I have been using them ever since. The first few CDs contain multiple years’ worth of content. This year I will require 14 CDs, and considering the fact that I create duplicates of every CD, this year I will burn 28. It goes with too much saying, this process takes a long time.

Now, I’m not quite a prolific a writer as 28 CDs sound, but the type of content I archive is large and diverse. It begins with my email which I have been systematically collecting since 1997. (“Can you say, ‘Mr. Serials’?”) No, I do not have all of my email, just the email I think is important; email of a significant nature where I actually say something, or somebody actually says something to me. It includes some attachments in the form of PDF documents and image files. It includes, inquiries I get regarding my work and postings to mailing lists that are longer rather than shorter. By the way, I only send plain text email messages because MIME encodings — the process used to include other than plain text content — adds an extra layer of complexity when it comes to reading and parsing email (mbox) archives. How can I be sure future digital archeologists will be able to compute against such stuff? Likewise, nothing gets tape archived (“tarred”), and nothing gets compressed (“zipped”) for all for the same reasons — an extra layer of complexity. Since I am the “owner” of the Code4Lib, NGC4Lib, and Usability4Lib mailing lists, and since was used to be the official archivist for ACQNET, I systematically collect, organize, archive, index, and provide access to these mailing lists using Mr. Serials. Burning the raw (mbox) email files of these lists as well as their browsable HTML counterparts is a part of my annual email preservation process.

The proces continues with the various types of other writings. Each presentation I give has its own folder complete with invitation, logistics, bio & abstract, as well three versions of my presentation: 1) a plain-text version, a one-page handout in the form of a PDF file, and a Word document. (Ick!) If I’m lucky I will remember to archive the TEI version of my remarks which is always longer than one page long and lives in the Musings section of Infomotions. Other types of writings include the plain text versions of blog postings, various versions of essays for publication, etc. At the very least, everything is saved as plain text. Not Word. Not PDF. Not anything that is platform or software-title specific. Otherwise I can’t guarantee it will be readable into the next decade. I figure that if someone can’t read a plain text file, then they have much bigger problems.

Then there is the software. I write lots of software over the period of one year. At least a couple dozen programs. Some of them are simple hacks. Some of them are “studies”, experiments, or investigations. Some of them are extensive intermediaries between relational databases and people using Web browsers. While many of these programs come to me in bursts of creative energy, I would not have the ability to recreate them if they were lost and gone to Big Byte Heaven. When it comes to computers, your data is your most important assest. Not the hardware. Not the software. The data — the content you create. This is the content you can not get back again. This is the content that is unique. This is the content that needs to be backed up and saved against future calamity.

Because some of my data is saved in relational databases, the annual preservation process includes raw database dumps. Again, these are plain text files but in the form of SQL statements. Thank God for mysqldump. It gives me the opportunity to restore my Musings, my blog, my Alex Catalogue, my water collection, and now my Highlights & Annotations. (More on that later.)

Biblioteca Valenciana
Biblioteca Valenciana

All of the content above fits on a single CD. Easily. Again, I’m not that prolific of a writer.

The hard part is the multimedia. As a part of an Apple Library of Tomorrow grant awarded to me by Steve Cisler, I was given an Apple QuickTake camera in 1994 or so. It could store about 24 pictures in 256 colors. It broke when my wife accidentally dropped it into a pond. It still works, if you have the necessary Macintosh hardware and it is plugged in. Presently, I use a 5 megapixel camera. I take the pictures at the highest resolution. I take movies as well. The pictures get edited. The movies get edited as well. This content currently makes up the bulk of the CDs. Six for the movies saved in the Apple movie (.mov) format. One DVD for actual use. Three for the full-scale JPEG images. Three for the iPhoto CDs. While I feel confident the JPEG files will be readable into the future, I’m not so sure about the .mov files, let alone the DVD. I might feel better about some sort of MPEG format, but it seems to be continually changing. Similarly, I suppose I ought to be saving the JPEG files as PNG files. At least that way more of the metadata may be traveling along with the images. For even better preservation, I ought to be putting the movies on video tape. (There is no compression or encryption there). I ought to be printing the photographs on glossy paper and binding the whole lot into books.

This year I started saving my music. I’ve been recording myself playing guitar since 1984. It began with audio cassette tapes. I have about 30 of them labeled and stored away in plastic boxes. I’ve made a couple attempts to digitize them, but the process is very laborious. It is easier to record yourself digitally in the first place and save the resulting files. This year a rooted through my archives and found a number of recordings. Tests of new recording gear and software. Experiments in production techniques. Background music to home videos. Saved as AIFF files, I hope they will be readable in the future.

Once everything gets burnt to CDs, one copy becomes my working copy. The other copy goes to a CD case not to be touched. Soon I will need a new case.

Finally, everything is not digital. In fact, I print a lot. Print that thought-provoking email message. Print that essay. Print this blog posting. Print the code to that computer program. Sign and date the print out. Put it into the archival box. The number of boxes I’m accumulating is now up to about 10.

What can I say. I enjoy all aspects of librarianship.

Preservation

My world of (digital) preservation is miniscule compared to work of academic preservationists, archivists, and curators. If it takes this much effort to systematically collect, organize, and archive one person’s content, then think how much effort would be required to apply the process against the intellectual output of an entire college or university!

U of MN Archive
U of MN Archive

Even if so much people-power were available, this is no insurance against the future. How do we go about preserving digital content? What formats should the content be manifested in? What hardware will be needed to read the media where the data is saved? What software will be necessary to read the data? Too many questions. Too many unknowns. Too many things that are unpredictable. Right now, there only seems to be two solutions, and the real solution is probably a combination of the two. First, make sincere efforts to copy non-proprietary formats of content to physical media — a storage artifact that can be read by the widest variety of computer hardware. Plan on migrating the content as well as the physical media forward as technology changes. Think this process as an a type of insurance. Second, make as many copies of the content as possible in as many formats as possible. Print it. Microfilm it. Put it on tape and spinning disks. Make it available on the Web. While the folks at LOCKSS may not have thought the expression would be used in this manner, it is still true — “Lot’s of copies keep stuff safe.”

I sincerely believe we are in the process of creating a Digital Dark Age. “No, you can not read or access that content. It was created during the late 20th and early 21st centuries. It was a time of prolific exploration, few standards, and many legal barriers.” Something needs to happen differently.

Maybe it doesn’t really matter. Maybe the content that is needed is the content that always lives on “spinning disks” and gets automatically migrated forward. Computers make it easier to create lots of junk. It certainly doesn’t all need to be preserved. On the other hand, those letters from the American Civil War were not necessarily considered important at the time. Many of them were written by unknown people. Yet, these letters are important to us today. Not because of who wrote them, but because they reflect the thinking of the time. They provide pieces of a puzzle that can verify facts or provide alternative perspectives. After years and years, information can grow in importance, and consequently, today, we run the risk of throwing away stuff this is of importance tomorrow.

Preservationists have the hardest job in the field of librarianship. More power to them.

How to make a book (#2 of 3)

January 1st, 2010

This is the second of a three-part series on how to make a book.

The first posting described and illustrated how to use a thermo-binding machine to make a book. This posting describes and illustrates how to “weave” a book together — folding and cutting (or tearing). The process requires no tools. No glue. No sewing. Just paper. Ingenious. The third posting will be about traditional bookmaking.

Attribution

Like so many things in my life, I learned how to do this by reading a… book, but alas, I have misplaced this particular book and I am unable to provide you with a title/citation. (Pretty bad for a librarian!) In any event, the author of the book explained her love of bookmaking. She described her husband as an engineer who thought all of the traditional cutting, gluing, and sewing were unnecessary. She challenged him to create something better. The result was the technique described below. While what he created was not necessarily “better”, it surely showed ingenuity.

The process

Here is process outlined, but you can also see how it is done on YouTube:

  1. Begin with 12 pieces of paper – I use normal printer paper, but the larger 11.5 x 14 inch pieces of paper make for very nicely sized books.
  2. Fold pairs of paper length-wise – In the end, you will have 6 pairs of paper half as big as the originals.
  3. Draw a line down the center of 3 pairs – Demarcate where you will create “slots” for your book by drawing a line half the size of of the inner crease of 3 pairs of paper.
  4. Draw a line along the outside of 3 pairs – Demarcate where you will create “tabs” for your books by drawing two lines from one quarter along the crease towards the outside of the 3 pairs of paper.
  5. Cut along the lines – Actually create the slots and tabs of your books by cutting along the lines drawn in Steps #3 and #Instead of using scissors, you can tear along the creases. (No tools!)
  6. Create mini-books – Take one pair of paper cut as a tab and insert the tab into the slot of another pair. Do this for all of 3 of the slot-tab pairs. The result will be 3 mini-books simply “woven” together.
  7. Weave together the mini-books – Finally, find the slot of one of your mini-books and insert a tab from another mini-book. Do the same with the remaining mini-book.

The result of your labors should be a fully-functional book complete with 48 pages. I use them for temporary projects — notebooks. Yeah, the cover is not very strong. During the use of your book, put the whole thing in a manila or leather folder. Lastly, I know the process is difficult to understand without pictures. Watch the video.

Good and best open source software

December 28th, 2009

What qualities and characteristics make for a “good” piece of open source software? And once that question is answered, then what pieces of library-related open source software can be considered “best”?

I do not believe there is any single, most important characteristic of open source software that qualifies it to be denoted as “best”. Instead, a number of characteristics need to be considered. For example, a program might do one thing and do it well, but if it is bear to install then that counts against it. Similarly, some software might work wonders but it is built on a proprietary infrastructure such as a closed source compiler. Can that software really be considered “open”?

For my own education and cogitation, I have begun to list questions to help me address what I think is the “best” library-related open source software. Your comments would be greatly appreciated. I have listed the questions in (more or less) priority order:

  • Does the software work as advertised? – If the program says it can do one thing, but never does, then this may be a non-starter. On the other hand, accomplishing a particular goal is sometimes relative. In most cases the software might perform excellently, but in others it performs less so. It is unrealistic to expect any software to be all things to all people.
  • To what degree is the software supported? – Support, can mean many things. Most obviously, users of the software want to know whether or not there are one or more people behind the software who can answer questions about it. Where is the developer and how can I get in touch with them? Are they approachable? If the developer is not available, then can support be purchased? Do I get what I pay for when I make this purchase? How expensive is it? Is their website easy to use? Support can also allude to software updates. “Software is never done. If it were, then it would be called hardware.” For example, my favorite XSL processor (xsltproc) and some of its friends work great but recommending it to friends comes with hesitation because I wonder about ongoing maintenance and upgrades to the newer versions of the API. Support also means user community. While open source is about “free” software, it relies on communities for sustainability. Do such communities exist? Are there searchable mailing lists with browsable archives? Are there wikis, virtual and real meetings, and/or IRC channels, etc?
  • Is the documentation thorough? – Is there a man page? A POD? Something that can be printed and annotated? Is there an introduction? FAQ? Glossary of terms? Is there a different guide/section for different types of readers such as systems administrators, programmers, implementors, and/or users? Is the documentation well-written? While I have used plenty of pieces of software and never read the manual, documentation is essencial if the software is expected to be exploited to the highest degree. Few thing in life are truly intuitive. Software is certainly not one of them. Documentation is a form of writing, and writing is something that literally transcends space and time. It is an alternative to having a person giving you instructions.
  • What are the licence terms? – Personally I place a higher value on the viral nature of a GNU-like license, but BSD-like licenses enable commercial enterprise to a greater degree, and whether I like it or not commercial enterprises are all but necessary in the world I live in. (After all, it enabled the creation of favorite personal computer’s operating system.) At the same time, if the licensing is not GNU-like or BSD-like, then the software is not really open source anyway. Right?
  • To what degree is the software easy to install? – Since installing software is usually not a process that needs to be repeated, a difficult installation can be overlooked. On the other hand, if tweaking kernels, installing a huge number of dependencies, requiring a second piece of obscure software that is not supported is required, then all this counts against an open source software distribution.
  • To what degree is the software implemented using the “standard” LAMP stack? – LAMP is an acronym for Linux, Apache, MySQL, and Perl (or PHP, or Python, or just about any other computer language), and the LAMP stack is/was the basis for many pieces of open source applications. The combination is well-supported, well-documented, and easily transportable to different hardware platforms. If the software application is built on LAMP, then the application has a lot going for it.
  • Is the distribution in question an application/system or a library/module? – It is possible to divide software into two group: 1) software that is designed to build other software — libraries/modules, and 2) software that is an an end-in-itself — applications/systems. The former is akin to a tool in a toolbox used to build applications. The later is something intended for an end user. The former requires a computer programmer to truly exploit. The later usually does not require as much specific expertise. Both the module and the application have their place. Each have their own advantages and disadvantages. Depending on the implementor’s environment one might be better suited.
  • To what degree does the software satisfy some sort of real library need? – This question is specific to my particular audience, and is dependent on a definition of librarianship. Collection. Preservation. Organization. Dissemination. Books? Catalogs? Circulation? Reading and information literacy? Physical place fostering community? Etc. For example, librarians love to create lists, and in a digital environment lists are well managed through the use of relational databases. Therefore, does MySQL qualify as a piece of library-related software? Similarly, as Roy Tennant was told one time, “Librarians like to search. Everybody else likes to find.” Does this mean indexers like Solr/Lucene ought to qualify? Maybe the question ought to be rephrased. “To what degree does the software satisfy your or your institution’s needs?”

What sorts of things have I left out? Is there anything here that can be measurable or is everything left to subjective judgement? Just as importantly, can we as a community answer these questions in the list of specific software distributions to come up with the “best” of class?

‘More questions than answers.

Valencia and Madrid: A Travelogue

December 5th, 2009

I recently had the opportunity to visit Valencia and Madrid (Spain) to share some of my ideas about librarianship. This posting describes some of things I saw and learned along the way.

La Capilla de San Francisco de Borja
La Capilla de San Francisco de Borja
Capilla del Santo Cáliz
Capilla del Santo Cáliz

LIS-EPI Meeting

In Valencia I was honored to give the opening remarks at the 4th International LIS-EPI Meeting. Hosted by the Universidad Politécnica de Valencia and organized by Fernanda Mancebo as well as Antonia Ferrer, the Meeting provided an opportunity for librarians to come together and share their experiences in relation to computer technology. My presentation, “A few possibilities for librarianship by 2015” outlined a few near-term futures for the profession. From the introduction:

The library profession is at a cross roads. Computer technology coupled with the Internet have changed the way content is created, maintained, evaluated, and distributed. While the core principles of librarianship (collection, organization, preservation, and dissemination) are still very much apropos to the current milieu, the exact tasks of the profession are not as necessary as they once were. What is a librarian to do? In my opinion, there are three choices: 1) creating services against content as opposed to simply providing access to it, 2) curating collections that are unique to our local institutions, or 3) providing sets of services that are a combination of #1 and #2.

And from the conclusion:

If libraries are representing a smaller and smaller role in the existing information universe, then two choice present themselves. First, the profession can accept this fact, extend it out to its logical conclusion, and see that libraries will eventually play in insignificant role in society. Libraries will not be libraries at all but more like purchasing agents and middle men. Alternatively, we can embrace the changes in our environment, learn how to take advantage of them, exploit them, and change the direction of the profession. This second choice requires a period of transition and change. It requires resources spent against innovation and experimentation with the understanding that innovation and experimentation more often generate failures as opposed to successes. The second option carries with it greater risk but also greater rewards.

toro
toro
robot sculpture
robot sculpture

Josef Hergert

Providing a similar but different vision from my own, Josef Hergert (University of Applied Sciences HTW Chur) described how librarianship ought to be embracing Web 2.0 techniques in a presentation called “Learning and Working in Time of Web 2.0: Reconstructing Information and Knowledge”. To say Hergert was advocating information literacy would be to over-simplify his remarks, yet if you broaden the definition of information literacy to include the use of blogs, wikis, social bookmarking sites — Web 2.0 technologies — then the phrase information literacy is right on target. A number of notable quotes included:

  • We are experiencing many changes in the environment: non-commercial sharing of content, legislative overkill, and “pirate parties”… The definition of “authorship” is changing.
  • The teaching of information literacy courses will help overcome some of the problems.
  • The process of learning is changing because of the Internet… We are now experiencing a greater degree of informal learning as opposed to formal learning… We need as librarians to figure out how to exploit the environment to support learning both formal and informal.
  • The current environment is more than paper, but also about a network of people, and the librarian can help create these networks with [Web 2.0 tools].
  • Provide not only the book but the environment and tools to do the work.

As an aside, I have been using networked computer technologies for more than twenty years. Throughout that time a number of truisms have become apparent. “If you don’t want it copied, then don’t put it on the ‘Net; give back to the ‘Net”, “On the Internet nobody knows that you are a dog”, and “It is like trying to drink from a fire hose” are just a few. Hergert used the newest one, “If it is not on the Internet, then it doesn’t exist.” For better or for worse, I think this is true. Convenience is a very powerful elixer. The ease of acquiring networked data and information is so great compared the time and energy needed to get data and information in analog format that people will get what is simple “good enough”. In order to remain relevant, libraries must put their (full text) content on the ‘Net or be seen as an impediment to learning as opposed to learning’s facilitator.

While I would have enjoyed learning what the other Meeting presenters has to say, it was unrealistic for me to attend the balance of the conference. The translators were going back to Switzerland, and I would not have been able to understand what the presenters were saying. In this regard is sort of felt like the Ugly American, but I have come to realize that the use of English is a purely practical matter. It as nothing to do with a desire to understand American culture.

Bibliteca Valenciana

The next day I have a few others had the extraordinary opportunity to get an inside tour of the Bibliteca Valenciana (Valencia Library). Starting out as a monastery, it was transformed into quite a number of other things, such as a prison, before it became a library. We got to go into the archives, see of of their treasures, and learn about the library’s history. They were very proud of their Don Quixote collection, and we saw their oldest book — a treatise on the Black Death which included receipts for treatments.

Biblioteca Nacional de España

In Madrid I believe visited the Biblioteca Nacional de España (National Library of Spain) and went to their museum. It was free, and I saw an exhibition of original Copernicus, Galileo, Brahe, Kepler, and Newton editions embodying Western scientific progress. Very impressive, and very well done, especially considering the admission fee.

Biblioteca Nacional de España
Biblioteca Nacional
statue
statue

International Institute

Finally, I shared the presentation from the LIS-EPI Meeting at the International Institute. While I advocated changes in the way’s our profession do its work, the attendees at both venues wondered how to about these changes. “We are expected to provide a certain set of services to our patrons here and now. What do we do to learn these new skills?” My answer was grounded in applied research & development. Time must be spent experimenting and “playing” with the new technologies. This should be considered an investment in the profession and its personnel, an investment that will pay off later in new skills and greater flexibility. We work in academia. It behooves us to work academically. This includes explorations into applying our knowledge in new and different ways.

Acknowledgements

Many thanks go to many people for making this professional adventure possible. I am indebted to Monica Pareja from the United Stated Embassy in Madrid. She kept me out of trouble. I thank Fernanda Mancebo and Antonia Ferrer who invited me to the Meeting. Last and certainly not least, I thank my family for allowing to to go to Spain in the first place since the event happened over the Thanksgiving holiday. “Thank you, one and all.”

alley
alley
fountain
fountain

Colloquium on Digital Humanities and Computer Science: A Travelogue

December 4th, 2009

On November 14-16, 2009 I attended the 4th Annual Chicago Colloquium on Digital Humanities and Computer Science at the Illinois Institute of Technology in Chicago. This posting outlines my experiences there, but in a phrase, I found the event to be very stimulating. In my opinion, libraries ought to be embracing the techniques described here and integrating them into their collections and services.

IIT
IIT
Paul Galvin Library
Paul Galvin Library

Day #0 – A pre-conference workshop

Upon arrival I made my way directly to a pre-conference workshop entitled “Machine Learning, Sequence Alignment, and Topic Modeling at ARTFUL” presented by Mark Olsen and Clovis Gladstone. In the workshop they described at least two applications they were using to discover common phrases between texts. The first was called Philomine and the second was called Text::Pair. Both work similarly but Philomine needs to be integrated with Philologic, and Text::Pair is a stand-alone Perl module. Using these tools n-grams are extracted from texts, indexed to the file system, and await searching. By entering phrases into a local search engine, hits are returned that include the phrases and the works where the phrase was found. I believe Text::Pair could be successfully integrated in my Alex Catalogue.

orange, green, and gray
orange, green, and gray
orange and green
orange and green

Day #1

The Colloquium formally began the next day with an introduction by Russell Betts (Illinois Institute of Chicago). His most notable quote was, “We have infinite computer power at our fingertips, and without much thought you can create an infinite amount of nonsense.” Too true.

Marco Büchler (University of Leipzig) demonstrated textual reuse techniques in a presentation called “Citation Detection and Textual Reuse on Ancient Greek Texts”. More specifically, he used textual reuse to highlight differences between texts, graph ancient history, and explore computer science algorithms. Try www.eaqua.net for more.

Patrick Juola’s (Duquesne University) “conjecturator” was the heart of the next presentation called “Mapping Genre Spaces via Random Conjectures”. In short, Juola generated thousands and thousands of “facts” in the form of [subject1] uses [subject2] more or less than [subject3]. He then tested each of these facts for truth against a corpus. Ironically, he was doing much of what Betts alluded to in the introduction — creating nonsense. On the other hand, the approach was innovative.

By exploiting a parts-of-speech (POS) parser, Devin Griffiths (Rutgers University) sought the use of analogies as described in “On the Origin of Theories: The Semantic Analysis of Analogy in Scientific Corpus”. Assuming that an analogy can be defined as a noun-verb-noun-conjunction-noun-verb-noun phrase, Griffith looked for analogies in Darwin’s Origin of Species, graphed the number of analogies against locations in the text, and made conclusions accordingly. He asserted that the use of analogy was very important during the Victorian Age, and he tried to demonstrate this assertion through a digital humanities approach.

The use of LSIDs (large screen information displays) was discussed by Geoffrey Rockwell (McMaster University). While I did not take a whole lot of notes from this presentation, I did get a couple of ideas: 1) figure out a way for a person to “step into” a book, or 2) display a graphic representation of a text on a planetarium ceiling. Hmm…

Kurt Fendt (MIT) described a number of ways timelines could be used in the humanities in his presentation called “New Insights: Dynamic Timelines in Digital Humanities”. Through the process I became aware of the SIMILE timeline application/widget. Very nice.

I learned of the existence of a number of digital humanities grants as described by Michael Hall (NEH). They are both start-up grants as well a grants on advanced topics. See: neh.gov/odh/.

The first keynote speech, “Humanities as Information Sciences”, was given by Vasant Honavar (Iowa State University) in the afternoon. Honavar began with a brief history of thinking and philosophy, which he believes lead to computer science. “The heart of information processing is taking one string and transforming it into another.” (Again, think the introductory remarks.) He advocated the creation of symbols, feeding them into a processor, and coming up with solutions out the other end. Language, he posited, is an information-rich artifact and therefore something that can be analyzed with computing techniques. I liked how he compared science with the humanities. Science observes physical objects, and the humanities observe human creations. Honavar was a bit arscient, and therefore someone to be admired.

subway tunnel
subway tunnel
skyscraper predecessor
skyscraper predecessor

Day #2

In “Computational Phonostylistics: Computing the Sounds of Poetry” Marc Plamondon (Nipissing University) described how he was counting phonemes in both Tennyson’s and Browning’s poetry to validate whether or not Tennyson’s poetry is “musical” or plosive sounding and Browning’s poetry is “harsh” or fricative. To do this he assumed one set of characters are soft and another set are hard. He then counted the number of times each of these sets of characters existed in each of the respective poets’ works. The result was a graph illustrating the “musical” or “harshness” of the poetry. One of the more interesting quotes from Plamondon’s presentation included, “I am interested in quantifying aesthetics.”

In C.W. Forstal’s (SUNY Buffalo) presentation “Features from Frequency: Authorship and Stylistic Analysis Using Repetitive Sound” we learned how he too is counting sound n-grams to denote style. He applied the technique to D.H. Lawrence as well as to the Iliad and Odyssey, and to his mind the technique works to his satisfaction.

The second keynote presentation was give by Stephen Wolfram (Wolfram Research) via teleconference. It was called “What Can Be Made Computable in the Humanities?” He began by describing Mathematica as a tool he used to explore the world around him. All of this assumes that the world consists of patterns, and these patterns can be described through the use of numbers. He elaborated through something he called the Principle of Computational Equivalency — once you get to a certain threshold systems create a level of complexity. Such a principle puts pressure on having the simplest descriptive model as possible. (Such things are standard scientific/philosophic principles. Nothing new here.) Looking for patterns was the name of his game, and one such game was applied to music. Discover the patterns in a type of music. Feed the patterns to a computer. Have the computer generate the music. Most of the time the output works pretty well. He called this WolframTones. He went on to describe WolframAlpha as an attempt to make the world’s knowledge computable. Essentially a front-end to Mathematica, WolframAlpha is a vast collection of content associated with numbers: people and their birth dates, the agriculture output of countries, the price of gold over time, temperatures from across the world, etc. Queries are accepted into the system. Searches are done against its content. Results are returned in the form of best-guess answers complete with graphs and charts. WolframAlpha exposes mathematical processing to the general public in ways that have not been done previously. Wolfram described two particular challenges in the creation of WolframAlpha. One was the collection of content. Unlike Google, Wolfram Research does not necessarily crawl the Internet. Rather it selectively collects the content of a “reference library” and integrates it into the system. Second, and more challenging, has been the design of the user interface. People do not enter structured queries, but structured output is expected. Interpreting people’s input is a difficult task in and of itself. From my point of view, he is probably learning more about human thought processes than the natural world.

red girder sculpture
red girder sculpture
gray sculpture
gray sculpture

Some thoughts

This meeting was worth every single penny, especially considering the fact that there was absolutely no registration fee. Free, except of the my travel costs, hotel, and the price of the banquet. Unbelievable!

Just as importantly, the presentations given at this meeting demonstrate the maturity of the digital humanities. These things are not just toys but practical tools for evaluating (mostly) texts. Given the increasing amount of full text available in library collections, I see very little reason why these sorts of digital humanities applications could not be incorporate into library collections and services. Collect full text content. Index it. Provide access to the index. Get back a set of search results. Select one or more items. Read them. Select one or more items again, and then select an option such as graph analogies, graph phonemes, or list common phrases between texts. People need to do more than read the texts. People need to use the texts, to analyze them, to compare & contrast them with other texts. The tools described in this conference demonstrate that such things are more than possible. All that has to be done is to integrate them into our current (library) systems.

So many opportunities. So little time.

Alex Catalogue collection policy

October 4th, 2009

This page lists the guidelines for including texts in the Alex Catalogue of Electronic Texts. Originally written in 1994, much of it is still valid today.

Purpose

The primary purpose of the Catalogue is to provide me with the means for demonstrating a concept I call arscience through American and English literature as well as Western philosophy. The secondary purpose of the Catalogue is to provide value-added access to some of the world’s great literature in turn providing the means for enhancing education. Consequently, the items in the collection must satisfy either of these two goals.

Qualities

Listed in priority order, texts in the collection must have the following qualities:

  1. Only texts in the public domain or freely distributed texts will be collected.
  2. Only texts that can be classified as American literature, English literature, or Western philosophy will be included.
  3. Only texts that are considered “great” literature will included. Great literature is broadly defined as literature withstanding the test of time and found in authoritative reference works like the Oxford Companions or the Norton Anthologies.
  4. Only complete works will be collected unless a particular work was never completed in the first place. In other words, partially digitized texts will not be included in the Catalogue.
  5. Whenever possible, collections of short stories or poetry will be included as they were originally published. If the items from the originally published collections have been broken up into individual stories or poems, then those items will be included individually.
  6. The texts in the collection must be written in or translated into English. Otherwise I will not be able to evaluate the texts’ quality nor will the indexing and content-searching work correctly.

File formats

Because of technical limitations and the potential long-term integrity of the Catalogue, texts in the collection, listed in order of preference, should have the following formats:

  1. Plain text files are preferred over HTML files.
  2. HTML files are preferred over compressed files.
  3. Compressed files are preferred over “word processor” files.
  4. Word processed files are the least preferable file format.
  5. Texts in unalterable file formats, such as Adobe Acrobat, will not be included.

In all cases, text that have not been divided into parts are preferred over texts that have been divided. If a particular item is deemed especially valuable and the item has been divided into parts, then efforts will be made to concatenate the individual parts and incorporate the result into the collection. The items in the collection are not necessarily intended to be read online.

Alex, the movie!

October 4th, 2009

Created circa 1998, this movie describes the purpose and scope of the Alex Catalogue of Electronic Texts. While coming off rather pompous, the gist of what gets said is still valid and correct. Heck, the links even work. “Thanks Berkeley!”

Collecting water and putting it on the Web (Part III of III)

September 3rd, 2009

This is Part III of an essay about my water collection, specifically a summary, opportunities for future study, and links to the source code. Part I described the collection’s whys and hows. Part II described the process of putting it on the Web.

Summary, future possibilities, and source code

There is no doubt about it. My water collection is eccentric but through my life time I have encountered four other people who also collect water. At least I am not alone.

Putting the collection on the Web is a great study in current technology. It includes relational database design. Doing input/output against the database through a programming language. Exploiting the “extensible” in XML by creating my own mark-up language. Using XSLT to transform the XML for various purposes: display as well as simple transformation. Literally putting the water collection on the map. Undoubtably technology will change, but the technology of my water collection is a representative reflection of the current use of computers to make things available on the Web.

I have made all the software a part of this system available here:

  1. SQL file sans any data – good for study of simple relational database
  2. SQL file complete with data – see how image data is saved in the database
  3. PHP scripts – used to do input/output against the database
  4. waters.xml – a database dump, sans images, in the form of an XML file
  5. waters.xsl – the XSLT used to display the browser interface
  6. waters2markers.xsl – transform water.xml into Google Maps XML file
  7. map.pl – implementation of Google Maps API

My water also embodies characteristics of librarianship. Collection. Acquisition. Preservation. Organization. Dissemination. The only difference is that the content is not bibliographic in nature.

There are many ways access to the collection could be improved. It would be nice to sort by date. It would be nice to index the content and make the collection searchable. I have given thought to transforming the WaterML into FO (Formatting Objects) and feeding the FO to a PDF processor like FOP. This could give me a printed version of the collection complete with high resolution images. I could transform the WaterML into an XML file usable by Google Earth providing another way to view the collection. All of these things are “left up the reader for further study”. Software is never done, nor are library collections.

River Lune
River Lune
Roman Bath
Roman Bath
Ogle Lake
Ogle Lake

Finally, again, why do I do this? Why do I collect the water? Why have a spent so much time creating a system for providing access to the collection? Ironically, I am unable to answer succinctly. It has something to do with creativity. It has something to do with “arsience“. It has something to do with my passion for the library profession and my ability to manifest it through computers. It has something to do with the medium of my art. It has something to do with my desire to share and expand the sphere of knowledge. “Idea. To be an idea. To be an idea and an example to others… Idea”. I really don’t understand it through and through.

Read all the posts in this series:

  1. The whys and hows of the water collection
  2. How the collection is put on the Web
  3. This post

Visit the water collection.

Collecting water and putting it on the Web (Part II of III)

September 3rd, 2009

This is Part II of an essay about my water collection, specifically the process of putting it on the Web. Part I describes the whys and hows of the collection. Part III is a summary, provides opportunities for future study, and links to the source code.

Making the water available on the Web

As a librarian, I am interested in providing access to my collection(s). As a librarian who has the ability to exploit the use of computers, I am especially interested in putting my collection(s) on the Web. Unfortunately, the process is not as easy as the actual collection process, and there have been a number of processes along the way. When I was really into HyperCard I created a “stack” complete with pictures of my water, short descriptions, and an automatic slide show feature that played the sound of running water in the background. (If somebody asks, I will dig up this dinosaur and make it available.) Later I created a Filemaker Pro database of the collection, but that wasn’t as cool as the HyperCard implementation.

Mississippi River
Mississippi River
Mississippi River
Mississippi River
Mississippi River
Mississippi River

The current implementation is more modern. It takes advantage of quite a number of technologies, including:

  • a relational database
  • a set of PHP scripts that do input/output against the database
  • an image processor to create thumbnail images
  • an XSL processor to generate a browsable Web presence
  • the Google Maps API to display content on a world map

The use of each of these technologies is described in the following sections.

Relational database

ER diagram
ER diagram

Since 2002 I have been adding and maintaining newly acquired waters in a relational, MySQL, database. (Someday I hope to get the waters out of those cardboard boxes and add them to the database too. Someday.) The database itself is rather simple. Four tables: one for the waters, one for the collectors, a join table denoting who collected what, and a metadata table consisting of a single record describing the collection as a whole. The entity-relationship diagram illustrates the structure of the database in greater detail.

Probably the most interesting technical characteristic of the database is the image field of type mediumblob in the waters table. When it comes to digital libraries and database design, one of the perennial choices to make is where to save your content. Saving it outside your database makes your database smaller and more complicated but forces you to maintain links to your file system or the Internet where the actual content resides. This can be an ongoing maintenance nightmare and can side-step the preservation issues. On the other hand inserting your content inside the database allows you to keep your content all in once place while “marrying” it to up in your database application. Putting the content in the database also allows you to do raw database dumps making the content more portable and easier to back-up. I’ve designed digital library systems both ways. Each has its own strengths and weaknesses. This is one of the rarer times I’ve put the content into the database itself. Never have I solely relied on maintaining links to off-site content. Too risky. Instead I’ve more often mirrored content locally and maintained two links in the database: one to the local cache and another to the canonical website.

PHP scripts for database input/output

Sets of PHP scripts are used to create, maintain, and report against the waters database. Creating and maintaining database records is tedious but not difficult as long as you keep in mind that there are really only four things you need to do with any database: 1) create records, 2) find records, 3) edit records, and 4) delete records. All that is required is to implement each of these processes against each of the fields in each of the tables. Since PHP was designed for the Web, each of these processes is implemented as a Web page only accessible to myself. The following screen shots illustrate the appearance and functionality of the database maintenance process.

ER diagram
Admin home
ER diagram
Admin waters
ER diagram
Edit water

High-level menus on the right. Sub-menus and data-entry forms in the middle. Simple. One of the nice things about writing applications for oneself is the fact that you don’t have to worry about usability, just functionality.

The really exciting stuff happens when the reports are written against the database. Both of them are XML files. The first is a essentially a database dump — water.xml — complete with the collection’s over-arching metadata record, each of the waters and their metadata, and a list of collectors. The heart of the report-writing process includes:

  1. finding all of the records in the database
  2. converting and saving each water’s image as a thumbnail
  3. initializing the water record
  4. finding all of the water’s collectors
  5. adding each collector to the record
  6. going to Step #5 for each collector
  7. finishing the record
  8. going to Step #2 for each water
  9. saving the resulting XML to the file system

There are two hard parts about this process. The first, “MOGRIFY”, is a shelled out hack to the operating system using an ImageMagik utility to convert the content of the image field into a thumbnail image. Without this utility saving the image from the database to the file system would be problematic. Second, the SELECT statement used to find all the collectors associated with a particular water is a bit tricky. Not really to difficult, just a typical SQL join process. Good for learning relational database design. Below is a code snippet illustrating the heart of this report-writing process:

  # process every found row
  while ($r = mysql_fetch_array($rows)) {

    # get, define, save, and convert the image -- needs error checking
    $image     = stripslashes($r['image']);
    $leafname  = explode (' ' ,$r['name']);
    $leafname  = $leafname[0] . '-' . $r['water_id'] . '.jpg';
    $original  = ORIGINALS  . '/' . $leafname;
    $thumbnail = THUMBNAILS . '/' . $leafname;
    writeReport($original, $image);
    copy($original, $thumbnail);
    system(MOGRIFY . $thumbnail);

    # initialize and build a water record
    $report .= '<water>';
    $report .= "<name water_id='$r[water_id]' lat='$r[lat]' lng='$r[lng]'>" .
               prepareString($r['name']) . '</name>';
    $report .= '<date_collected>';
    $report .= "<year>$r[year]</year>";
    $report .= "<month>$r[month]</month>";
    $report .= "<day>$r[day]</day>";
    $report .= '</date_collected>';

    # find all the COLLECTORS associated with this water, and...
    $sql = "SELECT c.*
            FROM waters AS w, collectors AS c, items_for_collectors AS i
            WHERE w.water_id   = i.water_id
            AND c.collector_id = i.collector_id
            AND w.water_id     = $r[water_id]
            ORDER BY c.last_name, c.first_name";
    $all_collectors = mysql_db_query ($gDatabase, $sql);
    checkResults();

    # ...process each one of them
    $report .= "<collectors>";
    while ($c = mysql_fetch_array($all_collectors)) {

      $report .= "<collector collector_id='$c[collector_id]'><first_name>
                 $c[first_name]</first_name>
                 <last_name>$c[last_name]</last_name></collector>";

    }
    $report .= '</collectors>';

    # finish the record
    $report .= '<description>' . stripslashes($r['description']) .
               '</description></water>';

  }

The result is the following “WaterML” XML content — a complete description of a water, in this case water from Copenhagen:

  <water>
    <name water_id='87' lat='55.6889' lng='12.5951'>Canal
      surrounding Kastellet, Copenhagen, Denmark
    </name>
    <date_collected>
      <year>2007</year>
      <month>8</month>
      <day>31</day>
    </date_collected>
    <collectors>
      <collector collector_id='5'>
        <first_name>Eric</first_name>
        <last_name>Morgan</last_name>
    </collector>
    </collectors>
    <description>I had the opportunity to participate in the
      Ticer Digital Library School in Tilburg, The Netherlands.
      While I was there I also had the opportunity to visit the
      folks at
      <a href="http://indexdata.com">Index Data</a>, a company
      that writes and supports open source software for libraries.
      After my visit I toured around Copenhagen very quickly. I
      made it to the castle (Kastellet), but my camera had run out
      of batteries. The entire Tilburg, Copenhagen, Amsterdam
      adventure was quite informative.
    </description>
  </water>

When I first created this version of the water collection RSS was just coming on line. Consequently I wrote an RSS feed for the water, but then I got realistic. How many people want to get an RSS feed of my water. Crazy?!

XSL processing

Now that the XML file has been created an the images are saved to the file system, the next step is to make a browser-based interface. This is done though an XSLT style sheet and XSL processor called Apache2::TomKit.

Apache2::TomKit is probably the most eclectic component of my online water collection application. Designed to be a replacement for another XSL processor called AxKit, Apache2::TomKit enables the developer to create CGI-like applications, complete with HTTP GET parameters, in the form of XML/XSLT combinations. Specify the location of your XML files. Denote what XSLT files to use. Configure what XSLT processor to use. (I use LibXSLT.) Define an optional cache location. Done. The result is on-the-fly XSL transformations that work just like CGI scripts. The hard part is writing the XSLT.

The logic of my XSLT style sheet — waters.xsl — goes like this:

  1. Get input – There are two: cmd and id. Cmd is used to denote the desired display function. Id is used to denote which water to display
  2. Initialize output – This is pretty standard stuff. Display XHTML head elements and start the body.
  3. Branch – Depending on the value of cmd, display the home page, a collectors page, all the images, all the waters, or a specific water.
  4. Display the content – This is done with the thorough use of XPath expressions.
  5. Done – Complete the XHTML with a standard footer.

Of all the XSLT style sheets I’ve written in my career, waters.xsl is definitely the most declarative in nature. This is probably because the waters.xml file is really data driven as opposed mixed content. The XSLT file is very elegant but challenging for the typical Perl or PHP hacker to quickly grasp.

Once the integration of the XML file, the XSLT style sheet, and Apache2::TomKit is complete, I was able to design URL’s such as the following:

Okay. So its not very REST-ful; the URLs are not very “cool”. Sue me. I originally designed this in 2002.

Waters and Google Maps

In 2006 I used my water collection to create my first mash-up. It combined latitudes and longitudes with the Google Maps API.

Inserting maps into your Web pages via the Google API is a three-step process: 1) create an XML file containing latitudes and longitudes, 2) insert a call to the Google Maps javascript into the head of your HTML, and 3) call the javascript from within the body of your HTML.

For me, all I had to do was: 1) create new fields in my database for latitudes and longitudes, 2) go through each record in the database doing latitude and longitude data-entry, 3) write a WaterML file, 4) write an XSLT file transforming the WaterML into an XML file expected of Google Maps, 5) write a CGI script that takes latitudes and longitudes as input, 6) display a map, and 7) create links from my browser-based interface to the maps.

It may sound like a lot of steps, but it is all very logical, and taken bit by bit is relatively easy. Consequently, I am able to display a world map complete with pointers to all of my water. Conversely, I am able to display a water record and link its location to a map. The following two screen dumps illustrate the idea, and I try to get as close to the actual collection point as possible:

ER diagram
World map
ER diagram
Single water

Read all the posts in this series:

  1. The whys and hows of the water collection
  2. This post
  3. A summary, future directions, and source code

Visit the water collection.

Collecting water and putting it on the Web (Part I of III)

September 3rd, 2009

This is Part I of an essay about my water collection, specifically the whys and hows of it. Part II describes the process of putting the collection on the Web. Part III is a summary, provides opportunities for future study, and links to the source code.

I collect water

It may sound strange, but I have been collecting water since 1978, and to date I believe I have around 200 bottles containing water from all over the world. Most of the water I’ve collected myself, but much of it has also been collected by friends and relatives.

The collection began the summer after I graduated from high school. One of my best friends, Marlin Miller, decided to take me to Ocean City (Maryland) since I had never seen the ocean. We arrived around 2:30 in the morning, and my first impression was the sound. I didn’t see the ocean. I just heard it, and it was loud. The next day I purchased a partially melted glass bottle for 59¢ and put some water, sand, and air inside. I was going keep some of the ocean so I could experience it anytime I desired. (Actually, I believe my first water is/was from the Pacific Ocean, collected by a girl named Cindy Bleacher. She visited there in the late Spring of ‘78, and I asked her to bring some back so I could see it too. She did.) That is how the collection got started.

Cape Cod Bay
Cape Cod Bay
Robins Bay
Robins Bay
Gulf of Mexico
Gulf of Mexico

The impetus behind the collection was reinforced in college — Bethany College (Bethany, WV). As a philosophy major I learned about the history of Western ideas. That included Heraclitus who believed the only constant was change, and water was the essencial element of the universe. These ideas were elaborated upon by other philosophers who thought there was not one essencial element, but four: earth, water, air, and fire. I felt like I was on to something, and whenever I heard of somebody going abroad I asked them bring me back some water. Burton Thurston, a Bethany professor, went to the Middle East on a diplomatic mission. He brought back Nile River water and water from the Red Sea. I could almost see Moses floating in his basket and escaping from the Egyptians.

The collection grew significantly in the Fall of 1982 because I went to Europe. During college many of my friends studied abroad. They didn’t do much studying as much as they did traveling. They were seeing and experiencing all of the things I was learning about through books. Great art. Great architecture. Cities whose histories go back millennia. Foreign languages, cultures, and foods. I wanted to see those things too. I wanted to make real the things I learned about in college. I saved my money from my summer peach picking job. My father cashed in a life insurance policy he had taken out on me when I was three weeks old. Living like a turtle with its house on its back, I did the back-packing thing across Europe for a mere six weeks. Along the way I collected water from the Seine at Notre Dame (Paris), the Thames (London), the Eiger Mountain (near Interlaken, Switzerland) where I almost died, the Agean Sea (Ios, Greece), and many other places. My Mediterranean Sea water from Nice is the prettiest. Because of the all the alge, the water from Venice is/was the most biologically active.

Over the subsequent years the collection has grown at a slower but regular pace. Atlantic Ocean (Myrtle Beach, South Carolina) on a day of playing hooky from work. A pond at Versailles while on my honeymoon. Holy water from the River Ganges (India). Water from Lock Ness. I’m going to grow a monster from DNA contained therein. I used to have some of a glacier from the Canadian Rockies, but it melted. I have water from Three Mile Island (Pennsylvania). It glows in the dark. Amazon River water from Peru. Water from the Missouri River where Lewis & Clarke decided it began. Etc.

Many of these waters I haven’t seen in years. Moves from one home to another have relegated them to cardboard boxes that have never been unpacked. Most assuredly some of the bottles have broken and some of the water has evaporated. Such is the life of a water collection.

Lake Huron
Lake Huron
Trg Bana Jelacica
Trg Bana Jelacica
Jimmy Carter Water
Jimmy Carter Water

Why do I collect water? I’m not quite sure. The whole body of water is the second largest thing I know. The first being the sky. Yet the natural bodies of water around the globe are finite. It would be possible to collect water from everywhere, but very difficult. Maybe I like the challenge. Collecting water is cheap, and every place has it. Water makes a great souvenir, and the collection process helps strengthen my memories. When other people collect water for me it builds between us a special relationship — a bond. That feels good.

What do I do with the water? Nothing. It just sits around my house occupying space. In my office and in the cardboard boxes in the basement. I would like to display it, but over all the bottles aren’t very pretty, and they gather dust easily. I sometimes ponder the idea of re-bottling the water into tiny vials and selling it at very expensive prices, but in the process the air would escape, and the item would lose its value. Other times I imagine pouring the water into a tub and taking a bath it it. How many people could say they bathed in the Nile River, Amazon River, Pacific Ocean, Atlantic Ocean, etc. all at the same time.

How water is collected

The actual process of collecting water is almost trivial. Here’s how:

  1. Travel someplace new and different – The world is your oyster.
  2. Identify a body of water – This should be endemic of the locality such as an ocean, sea, lake, pond, river, stream, or even a public fountain. Natural bodies of water a preferable. Processed water is not.
  3. Find a bottle – In earlier years this was difficult, and I usually purchased a bottle of wine with my meal, kept the bottle and cork, and used the combination as my container. Now-a-days it is easier to root round in a trash can for a used water bottle. They’re ubiquitous, and they too are often endemic of the locality.
  4. Collect the water – Just fill the bottle with mostly water but some of what the water is flowing over as well. The air comes along for the ride.
  5. Take a photograph – Hold the bottle at arm’s length and take a picture it. What you are really doing here is two-fold. Documenting the appearance of the bottle but also documenting the authenticity of the place. The picture’s background supports the fact that water really came from where the collector says.
  6. Label the bottle – On a small piece of paper write the name of the body of water, where it came from, who collected it, and when. Anything else is extra.
  7. Save – Keep the water around for posterity, but getting it home is sometimes a challenge. With the advent of 911 it is difficult to get the water through airport security and/or customs. I have recently found myself checking my bags and incurring a handling fee just to bring my water home. Collecting water is not as cheap as it used to be.

Who can collect water for me? Not just anybody. I have to know you. Don’t take it personally, but remember, part of the goal is relationship building. Moreover, getting water from strangers would jeopardize the collection’s authenticity. Is this really the water they say it is? Call it a weird part of the “collection development policy”.

Pacific Ocean
Pacific Ocean
Rock Run
Rock Run
Salton Sea
Salton Sea

Read all the posts in this series:

  1. This post
  2. How the collection is put on the Web
  3. A summary, future directions, and source code

Visit the water collection.