Archive for July, 2008

Origami is arscient, and so is librarianship

Wednesday, July 30th, 2008

To do origami well a person needs to apply both artistic and scientific methods to the process. The same holds true for librarianship.

Arscience

Arscience is a word I have coined to denote the salient aspects of both art and science. It is a type of thinking — thinquing — that is both intuitive and systematic. It exemplifies synthesis — the bringing together of ideas and concepts — and analysis — the division of our world into smaller and smaller parts. Arscience is my personal epistemological method employing a Hegelian dialectic — an internal discussion. It juxtaposes approaches to understanding, including art and science, synthesis and analysis, as well as faith and experience. These epistemological methods can be compared and contrasted, used or exploited, applied and debated against many of the things we encounter in our lives. Through this process I believe a fuller understanding of many things can be achieved.


Origami

A trivial example is origami. On one hand, origami is very artistic. Observe something in the natural world. Examine its essential parts and take notice of their shape. Acquire a piece of paper. Fold the paper to bring the essential parts together to form a coherent whole. The better your observation skills, the better your command of the medium, the better your origami will be.

On the other hand, you can discover that a square can be inscribed on any plane, and upon a square any number of regular polygons can be further inscribed. All through folding. You can then go about bisecting angles and dividing paper in halves, creating symbols denoting different types of folds, and systematically recording the process so it can be shared with others, ultimately creating a myriad of three-dimensional objects from an essentially two-dimensional thing. Unfold the three-dimensional object to expose its mathematics.

Seemingly conflicting approaches to the same problem result in similar outcomes. Arscience.


Librarianship

The same artistic and scientific processes — an arscient process — can be applied to librarianship. While there are subtle differences between different libraries, they all do essentially the same thing. To some degree they all collect, organize, preserve, and disseminate data, information, and knowledge for the benefit of their respective user populations.

To accomplish these goals the librarian can take both an analysis tack and a synthesis tack. Interactions with people are more about politics, feelings, wants, and needs. Such things are not logical but emotional. This is one side of the coin. The other side of the coin includes well-structured processes & workflows, usability studies & statistical analysis, systematic analysis & measurable results. In a hyper-dynamic environment such as the one we are working in, innovation — thinking a bit outside the box — is a necessary ingredient for moving forward. At the same time, it is not all about creativity; it is also about strategically planning for the near, medium, and long term.

Librarianship requires both. Librarianship is arscient.

On the move with the Mobile Web

Sunday, July 20th, 2008

On The Move With The Mobile Web by Ellyssa Kroski provides a nice overview of mobile technology and what it presently means for libraries.

What is in the Report

In my most recent list of top technology trends I mentioned mobile devices. Because of this, Kroski had a copy of the Library Technology Report she authored, cited above, sent to me. Its forty-eight pages essentially consist of six chapters (articles) on the topic of the Mobile Web:

  1. What is the Mobile Web? – An overview of Web technology and its use on hand-held, portable devices. I liked the enumeration of Mobile Web benefits such as: constant connectivity, location-aware services, limitless access, and interactive capabilities. Also, texting was described here as a significant use of the Mobile Web. Ironically, I sent my first text message just prior to the 2008 ALA Annual Meeting.
  2. Mobile devices – A listing and description of the hardware, software (operating systems as well as applications), networks, and companies working in the sphere of the Mobile Web. Apparently three companies (Verizon, AT&T, and Sprint Nextel) have 70% of the market share in terms of network accessibility in the United States.
  3. What can you do with the Mobile Web? – Another list and description but this time of application types: email, text messaging, ringtones & wallpaper, music & radio, software & games, instant messaging, social networking, ebooks, social mapping networks (sort of scary if you ask me), search, mapping, audiobooks, television, travel, browsers, news, blogging, food ordering, and widgets.
  4. Library mobile initiatives – A listing and description of what some libraries are doing with the Mobile Web. Ball State University’s Mobile Web presence seems to be out in front in this regard, and PubMed seems pretty innovative as well. For some commentary regarding iPhone-specific applications for libraries see Peter Brantley’s “The Show Room Library”.
  5. How to create a mobile experience – This is more or less a set of guidelines for implementing Mobile Web services. Some of the salient points include: it is about providing information to people who don’t have a computer, think a lot about location-based services, understand the strengths & weaknesses of the technology. I found this chapter to be the most useful.
  6. Getting started with the Mobile Web – A list of fun things to do to educate yourself on what the Mobile Web can do.

Each chapter is complete with quite a number of links and citations for further reading.

Cellphone barcodes

Through my reading of this Report my knowledge of the Mobile Web increased. The most interesting thing I learned was the existence of Semapedia, a project that “strives to tag real-world objects with 2D barcodes that can be read by camera phones.” Go to Semapedia. Enter a Wikipedia URL. Get back a PDF document containing “barcodes” that your cellphone should be able to read (with the appropriate application). Label real-world things with the barcode. Scan the code with your cellphone. See a Wikipedia article describing the thing. Interesting. Below is one of these barcodes for the word “blog” which links to the Mobile Web-ready Wikipedia entry on blogs:

barcode
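For the programmatically inclined, generating this sort of two-dimensional barcode yourself is not hard. Below is a minimal Java sketch using the open source ZXing library (and its javase helper module) to encode a URL as a QR code. The class and method names are ZXing's, assuming a reasonably recent version of the library; the URL and file name are purely illustrative, and the symbology may well differ from the one Semapedia actually uses.

```java
import java.nio.file.Paths;

import com.google.zxing.BarcodeFormat;
import com.google.zxing.client.j2se.MatrixToImageWriter;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.qrcode.QRCodeWriter;

public class UrlBarcode {
    public static void main(String[] args) throws Exception {
        // An illustrative URL; the post links to the Wikipedia entry on blogs
        String url = "http://en.wikipedia.org/wiki/Blog";

        // Encode the URL as a 300x300 pixel QR code
        BitMatrix matrix = new QRCodeWriter().encode(url, BarcodeFormat.QR_CODE, 300, 300);

        // Write the barcode to a PNG that can be printed and affixed to a real-world thing
        MatrixToImageWriter.writeToPath(matrix, "PNG", Paths.get("blog-barcode.png"));
    }
}
```

Print the resulting PNG, label something with it, and a barcode-reading camera phone should resolve it back to the URL.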

Read the report

I still believe the Mobile Web is going to play a larger role in people’s everyday lives. (Duh!) By extension, I believe it is going to play a larger role in libraries. Ellyssa Kroski’s On The Move With The Mobile Web will give you a leg up on the technology.

TPM — technological protection measures

Sunday, July 20th, 2008

I learned a new acronym a few weeks ago — TPM — which stands for “technological protection measures”. In the May 2008 issue of College & Research Libraries, Kristin R. Eschenfelder wrote an article called “Every library’s nightmare?” enumerating the various types of protection measures employed by publishers to impede the use of electronic scholarly material.

Types of restrictions

In today’s environment, where digital information is increasingly bought, sold, and/or licensed, publishers feel the need to protect their product from duplication. As described by Eschenfelder, these protections — restrictions — come in two forms: soft and hard.

Soft restrictions are “configurations of hardware or software that make certain uses such as printing, saving, copy/pasting, or e-mailing more difficult — but not impossible — to achieve.” The soft restrictions have been divided into the following subtypes:

  • extent of use – page print limits; PDF download limits; data export limits; suspicious use tracking
  • obfuscation – need to select items before options become available
  • omission – not providing buttons or links to enact certain uses
  • decomposition – saving document results in many files, making recreating or e-mailing the document difficult
  • frustration – page chunking in e-books
  • warning – copyright warnings; end-user licenses on startup

Hard restrictions are “configurations of software or hardware that strictly prevent certain uses.” The hard restrictions have been divided into the following subtypes:

  • restricted copy and paste OCR – OCR exposed for searching, but not for copying and pasting of text
  • secure container TPM – use rights vary by resource

To investigate what types of restrictions were put into everyday practice, Eschenfelder studied a total of about seventy-five resources from three different disciplines (engineering, history, art history) and tallied the types of restrictions employed.

Salient quotes

A few salient quotes from the article exemplify Eschenfelder’s position on TPM:

  • “This paper suggests that the soft restrictions that are present in licensed products may have already changed user’s and librarian’s expectations about what the use rights they ought to expect from vendors and their products.” (Page 207)
  • “One concern is that the library community has already accepted many of the soft use restrictions identified in this paper.” (Page 219)
  • “[Librarians] should also advocate for removal of use restrictions, or encourage new vendors to offer competing restriction-free products.” (Page 219)
  • “A more realistic solution might be a shared knowledge base of vendor interfaces and known use restrictions.” (Page 219)
  • “The paper argues that soft use restrictions deserve more attention from the library community, and that librarians should not accept these restrictions as the natural order of things.” (Page 220)

My commentary

I agree with Eschenfelder.

Many people who work in libraries seem to be there because of the values libraries portray. Examples include but are not limited to: intellectual freedom, education, diversity, equal access to information, preservation of the historical record for future generations, etc. Heaven knows, people who work in libraries are not in it for the money! I fall into the equal access to information camp, and that is why I advocate things like open access publishing and open source software development.

TPM inhibits free and equal access to information, and I think Eschenfelder makes a good point when she says the “library community has already accepted many of the soft use restrictions.” Why do we accept them? Librarians are not required to purchase and/or license these materials. We have a choice. If much of the scholarly publishing industry is driven by the marketplace — supply & demand — then why don’t/can’t we just say, “No”? Nobody is forcing us to spend our money this way. If vendors don’t provide the sort of products and services we desire, then the marketplace will change. Right?

In any event, consider educating yourself on the types of TPM and read Eschenfelder’s article.

Against The Grain is not

Tuesday, July 15th, 2008

Against The Grain is not your typical library-related serial.

Last year I had the opportunity to present at the 27th Annual Charleston Conference where I shared my ideas regarding the future of search and how some of those ideas can be implemented in “next-generation” library catalogs. In appreciation of my efforts I was given a one-year subscription to Against The Grain. From the website’s masthead:

Against the Grain (ISSN: 1043-2094) is your key to the latest news about libraries, publishers, book jobbers, and subscription agents. It is a unique collection of reports on the issues, literature, and people that impact the world of books and journals. ATG is published on paper six times a year, in February, April, June, September, and November and December/January.

I try to read the issues as they come out, but I find it difficult. This is not because the content is poor, but rather because there is so much of it! In a few words and phrases, Against The Grain is full, complete, dense, tongue-in-cheek, slightly esoteric, balanced, graphically challenging and at the same time graphically interesting, informative, long, humorous, supported by advertising, somewhat scholarly, personal, humanizing, a realistic reflection of present-day librarianship (especially in regards to technical services in academic libraries), predictable, and consistent. For example, every issue contains a “rumors” article listing bunches and bunches of people, where they are going, and what they are doing. Moreover, the articles are printed in a relatively small typeface in a three-column format. Very dense. To make things easier to read, sort of, all names and titles are bolded. I suppose the dutiful reader could simply scan for names of interest and read accordingly, but there are so many of them. (Incidentally, the bolded names pointed me to the Tenth Fiesole Retreat, which piqued my interest because I had given a modified SIG-IR presentation on MyLibrary at the Second Fiesole Retreat. Taking place at Oxford, that was a really cool meeting!)

Don’t get me wrong. I like Against The Grain, but it is so full of information and has been so thoroughly put together that I feel almost embarrassed not reading it. I feel like the amount of work put into each issue warrants the same amount of effort on my part to read it.

The latest issue (volume 20, number 3, June 2008) includes a number of articles about Google. For me, the most interesting articles included:

  • “Kinda just like Google” by Jimmy Ghaphery – an examination of the number of search targets appearing on ARL library home pages. Almost all of them include a search of the catalog. Just fewer have searches of meta-search engines. Just fewer than that are pages including searches of Google and its relatives, and fewer than that still, if not non-existent, are searches of locally created indexes like institutional repositories or digital collections. Too many search boxes?
  • “Giggling Over Google” by Lilia Murray – a description of how Google Docs and Google Custom Search engines can be used and harnessed in libraries. Well-documented. Well-written. Advocates the creation of more Custom Search Engines by librarians. Sounds like a great idea to me.
  • “Keeping the Enemy Close” by John Wender – compares and contrasts the advantages and disadvantages of including/supporting Google Scholar in an academic library setting. I liked the allusion to Carl Shapiro and Hal Varian’s idea of “information as an ‘experience good'”. Kinda like, “A bird in the hand is worth two in the bush.”
  • “Measuring the ‘Google Effect’ at JSTOR” by Bruce Heterick – a description of how JSTOR’s usage skyrocketed after its content was indexed by Google.
  • “Prescription vs. Description in the information-seeking process, or should we encourage our patrons to use Google Scholar?” by Bruce Sanders – contrasts “prescription” and “description” librarianship. One encourages competent, sophisticated searching of databases. The other tailors the library Website to make patrons’ search strategies as effective as possible. An interesting comparison.
  • “Medium rare books, PODS wars, instant books brought to you by algorithms” by John D. Riley – describes how a fortune of books was found in the stacks of the Forbes Library as opposed to the library’s special collections.

If you have the time, spend it reading Against The Grain.

E-journal archiving solutions

Monday, July 14th, 2008

A JISC-funded report on e-journal archiving solutions is an interesting read, and it seems as if no particular solution is the hands-down “winner”.

Terry Morrow, et al. recently wrote a report sponsored by JISC called “A Comparative study of e-journal archiving solutions”. Its goal was to compare & contrast various technical solutions to archiving electronic journals and present an informed opinion on the subject.

Begged and unanswered questions

The report begins by setting the stage. Of particular note is the increased movement to e-only journal solutions many libraries are adopting. This e-only approach begs unanswered questions regarding the preservation and archiving of electronic journals — two similar but different aspects of content curation. To what degree will e-journals suffer from technical obsolescence? Just as importantly, how will the change in publishing business models, where access, not content, is provided through license agreements, affect perpetual access and long-term preservation of e-journals?

Two preservation techniques

The report outlines two broad techniques to accomplish the curation of e-journal content. On one hand there is “source file” preservation, where content (articles) is provided by the publisher to a third-party archive. This is the raw data of the articles — possibly SGML files, XML files, Word documents, etc. — as opposed to the “presentation” files intended for display. This approach is seen as being more complete, but it relies heavily on active publisher and third-party participation. This is the model employed by Portico. The other technique is harvesting. In this case the “presentation” files are archived from the Web. This method is more akin to the traditional way libraries have preserved and archived their materials. This is the model employed by LOCKSS.

Compare & contrast

In order to come to their conclusions, Morrow et al. compared & contrasted six different e-journal preservation initiatives while looking through the lens of four possible trigger events. These initiatives (technical archiving solutions) included:

  1. British Library e-Journal Digital Archive – a fledgling initiative by a national library
  2. CLOCKSS – a dark archive of articles using the same infrastructure as LOCKSS
  3. e-Depot – a national library initiative from The Netherlands
  4. LOCKSS – an open source and distributed harvesting implementation
  5. OCLC ECO – an aggregation of aggregators, not really preservation
  6. Portico – a Mellon-backed “source file” approach

The trigger events included:

  1. cancellation of an e-journal title
  2. e-journal no longer available from a publisher
  3. publisher ceased operation
  4. catastrophic hardware or network failure

These characteristics made up a matrix and enabled Morrow, et al. to describe what would happen with each initiative under each trigger event. In summary, they would all function, but it seems the LOCKSS solution would provide immediate access to content whereas most of the other solutions would only provide delayed access. Unfortunately, the LOCKSS initiative seems to have less publisher backing than the Portico initiative. On the other hand, the Portico initiative costs more money and places a lot of new responsibility on publishers.

In today’s environment, where information is more routinely sold and licensed, I wonder what level of trust can be given to publishers. What’s in it for them? In the end, neither solution — LOCKSS nor Portico — can be considered ideal, and both ought to be employed at the present time. One size does not fit all.

Recommendations

In the end there were ten recommendations:

  1. carry out risk assessments
  2. cooperate with one or more external e-journal archiving solutions
  3. develop standard cross-industry definitions of trigger events and protocols
  4. ensure archiving solutions cover publishers of value to UK libraries
  5. explicitly state perpetual access policies
  6. follow the Transfer Code of Practice
  7. gather and share statistical information about the likelihood of trigger events
  8. provide greater detail of coverage details
  9. review and update this study on a regular basis
  10. take the initiative by specifying archiving requirements when negotiating licenses

Obviously the report went into much greater detail regarding all of these recommendations and how they were derived. Read the report for the details.

There are many aspects that make up librarianship. Preservation is just one of them. Unfortunately, when it comes to the preservation of electronic, born-digital content, the jury is still out. I’m afraid we are awash in a wealth of content right now, but in the future this content may not be accessible because society has not thought very far into the future regarding preservation and archiving. I hope we are not creating a Digital Dark Age as we speak. Implementing ideas from this report will help keep this problem from becoming a reality.

Web 2.0 and “next-generation” library catalogs

Monday, July 14th, 2008

A First Monday article systematically comparing & contrasting Web 1.0 and Web 2.0 website technology recently caught my interest, and I think it points a way to making more informed decisions regarding “next-generation” library catalog interfaces and Internet-based library services in general.

Web 1.0 versus Web 2.0

Graham Cormode and Balachander Krishnamurthy, in “Key differences between Web 1.0 and Web 2.0”, First Monday, 13(6): June 2008, thoroughly describe the characteristics of Web 2.0 technology. The article outlines the features of Web 2.0, describes the structure of Web 2.0 sites, identifies problems with the measurement of Web 2.0 usage, and covers technical issues.

I really liked how it listed some of the identifying characteristics. Web 2.0 sites usually:

  • encourage user-generated content
  • exploit AJAX
  • have a strong social component
  • support some sort of public API
  • support the ability to form connections between people
  • support the posting of content in many forms
  • treat users as first class entities in the system

The article included a nice matrix with popular websites across the top and services down the side. At the intersections of the rows and columns, check marks were placed denoting whether or not the website supported the services. Of all the websites, Facebook, YouTube, Flickr, and MySpace ranked as being the most Web 2.0-esque. Not surprising.

The compare & contrast between Web 1.0 and Web 2.0 sites was particularly interesting, and it can be used as a sort of standard/benchmark for comparing existing (library) websites to the increasingly expected Web 2.0 format. For example, Web 1.0 sites are characterized as being:

  • stateless
  • shaped like a “bow-tie” where there is a front-page linked to many sub-pages and supplemented with many cross links between sub-pages
  • covering a single topic

Whereas Web 2.0 websites generally:

  • include a broader mixture of content types
  • produce groups or feeds of content
  • rely on user-provided content
  • represent a shared space
  • require some sort of log-in function
  • see “portalization” as a trend

For readers who feel they do not understand the meaning of Web 2.0, the items outlined above and elaborated upon in the article will make the definition of Web 2.0 clearer. Good reading.

Library “catalogs”

The article also included an interesting graphic, Figure 1, illustrating the paths from content creator to consumer in Web 2.0. The image is linked from the article, below:

Figure 1: Paths from content creator to consumer in Web 2.0

The far left denotes people creating content. The far right denotes people using content. In the middle are services. When I look at the image I see everything from the center to the far right of the following illustration (of my own design):

infrastructure for a next-generation library catalog

This illustration represents a model for a “next-generation” library catalog. On the far left is content aggregation. In the center is content normalization and indexing. On the right are services against the content. The right half of the illustration above is analogous to the entire illustration from Cormode and Krishnamurthy.

Like the movement from Web 1.0 to Web 2.0, library websites (online “catalogs”) need to be more about users, their content, and services applied against it. “Next-generation” library catalogs will fall short if they are only enhanced implementations of search and browse interfaces. With the advent of digitization, everybody has content. What is needed are tools — services — to make it more useful.
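To make the model a bit more concrete, here is a toy Java sketch of the aggregate, normalize-and-index, and services-against-content flow. Everything in it (the method names, the sample records, the naive string matching) is hypothetical and of my own invention; a real implementation would use OAI-PMH harvesting, proper crosswalking, and an indexer such as Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A purely illustrative sketch of the model: aggregate -> normalize & index -> services
public class NextGenCatalogSketch {

    // Far left: content aggregation (in real life: OAI-PMH harvests, MARC dumps, crawls)
    static List<String> aggregate() {
        return Arrays.asList("The Prince by Niccolo Machiavelli", "Dracula by Bram Stoker");
    }

    // Center: normalization and indexing (in real life: crosswalking plus an indexer)
    static List<String> normalizeAndIndex(List<String> rawRecords) {
        List<String> index = new ArrayList<>();
        for (String record : rawRecords) {
            index.add(record.toLowerCase());  // stand-in for real normalization
        }
        return index;
    }

    // Far right: a service applied against the content (in real life: search, annotation, analysis)
    static List<String> searchService(List<String> index, String query) {
        List<String> hits = new ArrayList<>();
        for (String entry : index) {
            if (entry.contains(query.toLowerCase())) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> index = normalizeAndIndex(aggregate());
        System.out.println(searchService(index, "Stoker"));
    }
}
```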

Alex Lite: A tiny, standards-compliant, and portable catalogue of electronic texts

Saturday, July 12th, 2008

One of the beauties of XML is its ability to be transformed into other plain text formats, and that is what I have done with a simple software distribution called Alex Lite.

My TEI publishing system(s)

A number of years ago I created a Perl-based TEI publishing system called “My personal TEI publishing system“. Create a database designed to maintain authority lists (titles and subjects), sets of XSLT files, and TEI/XML snippets. Run reports against the database to create complete TEI files, XHTML files, RSS files, and files designed to be disseminated via OAI-PMH. Once the XHTML files are created, use an indexer to index them and provide a Web-based interface to the index. Using this system I have made accessible more than 150 of my essays, travelogues, and workshop handouts retrospectively converted as far back as 1989. Using this system, many (if not most) of my writings have been available via RSS and OAI-PMH since October 2004.

A couple of years later I morphed the TEI publishing system to enable me to mark up content from an older version of my Alex Catalogue of Electronic Texts. Once marked up, I planned to transform the TEI into a myriad of ebook formats: plain text, plain HTML, “smart” HTML, PalmPilot DOC and eReader, Rocket eBook, Newton Paperback, PDF, and TEI/XML. The mark-up process was laborious, and I have only marked up about 100 texts. You can see the fruits of these labors, and the combination of database and XML technology has enabled me to create Alex Lite.

Alex Lite

Alex Lite is the result of a report written against my second TEI publishing system. Loop through each item in the database and update an index of titles. Create a TEI file for each item. Using XSLT, convert each TEI file into a plain HTML file, a “pretty” XHTML file, and a FO (Formatting Objects) file. Use a FO processor (like FOP) to convert the FO into PDF. Loop through each creator in the database to create an author index. Glue the whole thing together with an index.html file. Save all the files to a single directory and tar up the directory.
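For the curious, the transformation steps above boil down to a couple of XSLT runs plus a pass through a FO processor. Here is a hedged sketch of those two steps written in Java rather than the Perl actually used for the report. The stylesheet names (tei2html.xsl and tei2fo.xsl) come from the XSLT section below, while the Java harness, the file names, and the Apache FOP embedding calls are my own assumptions; FOP's factory methods vary between versions.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;

public class Tei2Ebook {
    public static void main(String[] args) throws Exception {
        File tei = new File("amontillado.xml");  // an illustrative TEI file name
        TransformerFactory factory = TransformerFactory.newInstance();

        // TEI -> "pretty" XHTML
        Transformer toHtml = factory.newTransformer(new StreamSource(new File("tei2html.xsl")));
        toHtml.transform(new StreamSource(tei), new StreamResult(new File("amontillado.html")));

        // TEI -> FO -> PDF via Apache FOP (factory signature varies by FOP version)
        FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("amontillado.pdf"))) {
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
            Transformer toFo = factory.newTransformer(new StreamSource(new File("tei2fo.xsl")));
            toFo.transform(new StreamSource(tei), new SAXResult(fop.getDefaultHandler()));
        }
    }
}
```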

The result is a single file that can be downloaded and unpacked to provide immediate access to sets of electronic books in a standards-compliant, operating system-independent manner. Furthermore, no network connection is necessary except for the initial acquisition of the distribution. This directory can then be networked or saved to a CD-ROM. Think of the whole thing as if it were a library.

Give it a whirl; download a version of Alex Lite. Here is a list of all the items in the tiny collection:

  1. Alger Jr., Horatio (1832-1899)
    • The Cash Boy
    • Cast Upon The Breakers
  2. Bacon, Francis (1561-1626)
    • The Essays
    • The New Atlantis
  3. Burroughs, Edgar Rice (1875-1950)
    • At The Earth’s Core
    • The Beasts Of Tarzan
    • The Gods Of Mars
    • The Jungle Tales Of Tarzan
    • The Monster Men
    • A Princess Of Mars
    • The Return Of Tarzan
    • The Son Of Tarzan
    • Tarzan And The Jewels Of Opar
    • Tarzan Of The Apes
    • The Warlord Of Mars
  4. Conrad, Joseph (1857-1924)
    • The Heart Of Darkness
    • Lord Jim
    • The Secret Sharer
  5. Doyle, Arthur Conan (1859-1930)
    • The Adventures Of Sherlock Holmes
    • The Case Book Of Sherlock Holmes
    • His Last Bow
    • The Hound Of The Baskervilles
    • The Memoirs Of Sherlock Holmes
  6. Machiavelli, Niccolo (1469-1527)
    • The Prince
  7. Plato (428-347 B.C.)
    • Charmides, Or Temperance
    • Cratylus
    • Critias
    • Crito
    • Euthydemus
    • Euthyphro
    • Gorgias
  8. Poe, Edgar Allan (1809-1849)
    • The Angel Of The Odd–An Extravaganza
    • The Balloon-Hoax
    • Berenice
    • The Black Cat
    • The Cask Of Amontillado
  9. Stoker, Bram (1847-1912)
    • Dracula
    • Dracula’s Guest
  10. Twain, Mark (1835-1910)
    • The Adventures Of Huckleberry Finn
    • A Connecticut Yankee In King Arthur’s Court
    • Extracts From Adam’s Diary
    • A Ghost Story
    • The Great Revolution In Pitcairn
    • My Watch: An Instructive Little Tale
    • A New Crime
    • Niagara
    • Political Economy

XSLT

As alluded to above, the beauty of XML is its ability to be transformed into other plain text formats. XSLT allows me to convert the TEI files into other files for different mediums. The distribution includes only simple HTML, “pretty” XHTML, and PDF versions of the texts, but for the XSLT aficionados in the crowd who may want to see the XSLT files, I have included them here:

  • tei2htm.xsl – used to create plain HTML files complete with metadata
  • tei2html.xsl – used to create XHTML files complete with metadata as well as simple CSS-enabled navigation
  • tei2fo.xsl – used to create FO files which were fed to FOP in order to create things designed for printing on paper

Here’s a sample TEI file, Edgar Allan Poe’s The Cask Of Amontillado.

Future work

I believe there is a lot of promise in the marking up of plain text into XML, specifically works of fiction and non-fiction into TEI. Making available such marked-up texts paves the way for doing textual analysis against them and for enhancing them with personal commentary. It is too bad that the mark-up process, even simple mark-up, is so labor intensive. Maybe I’ll do more of this sort of thing in my copious spare time.

Indexing MARC records with MARC4J and Lucene

Wednesday, July 9th, 2008

In anticipation of the eXtensible Catalog (XC) project, I wrote my first Java programs a few months ago to index MARC records, and you can download them from here.

The first uses MARC4J and Lucene to parse and index MARC records. The second uses Lucene to search the index created from the first program. They are very simple programs — functional and not feature-rich. For the budding Java programmer in libraries, these programs could be used as a part of a rudimentary self-paced tutorial. (A hedged sketch of the indexing loop appears after the README, below.) From the distribution’s README:

This is the README file for two Java programs called Index and Search.

Index and Search are my first (real) Java programs. Using Marc4J, Index
reads a set of MARC records, parses them (for authors, titles, and call
numbers), and feeds the data to Lucene for indexing. To get the program
going you will need to:

  1. Get the MARC4J .jar files, and make sure they are in your CLASSPATH.
  2. Get the Lucene .jar files, and make sure they are in your CLASSPATH.
  3. Edit Index.java so the value of InputStream points to a set of MARC records.
  4. Create a directory named index in the same directory as the source code.
  5. Compile the source (javac Index.java).
  6. Run the program (java Index).

The program should echo the parsed data to the screen and create an
index in the index directory. It takes me about fifteen minutes to index
700,000 records.

The second program, Search, is designed to query the index created by
the first program. To get it to run you will need to:

  1. Get the Lucene .jar files, and make sure they are in your CLASSPATH.
  2. Make sure the index created by Index is located in the same directory as the source code.
  3. Compile the source (javac Search.java).
  4. Run the program (java Search <query>, where <query> is a word or phrase).

The result should be a list of items from the index. Simple.
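As promised above, here is a hedged sketch of the kind of loop at the heart of Index. It is not the distributed program verbatim: it parses only the title (MARC field 245, subfield a) rather than authors and call numbers, the input file name is illustrative, and it is written against the MARC4J and Lucene 2.x APIs of the era, which have since changed.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.marc4j.MarcStreamReader;
import org.marc4j.marc.DataField;
import org.marc4j.marc.Record;

public class IndexSketch {
    public static void main(String[] args) throws Exception {
        // A file of MARC records; the name is illustrative
        InputStream input = new FileInputStream("records.mrc");
        MarcStreamReader reader = new MarcStreamReader(input);

        // Write the index to a directory named "index" (Lucene 2.x-era constructor)
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        while (reader.hasNext()) {
            Record record = reader.next();

            // Pull the title proper from MARC field 245, subfield a
            DataField titleField = (DataField) record.getVariableField("245");
            if (titleField == null || titleField.getSubfield('a') == null) continue;
            String title = titleField.getSubfield('a').getData();

            // Echo the parsed data to the screen and add it to the index
            System.out.println(title);
            Document doc = new Document();
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();
    }
}
```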

Enjoy?!

Encoded Archival Description (EAD) files everywhere

Tuesday, July 1st, 2008

I’m beginning to see Encoded Archival Description (EAD) files everywhere, but maybe it is because I am involved with a project called the Catholic Research Resources Alliance (CRRA).

As you may or may not know, EAD files are the “MODS files” of the archival community. These XML files provide the means to administratively describe archival collections as well as describe the things in the collections at the container, folder, or item level.

Columbia University and MARC records

During the past few months, I helped edit and shepherd an article for Code4Lib Journal by Terry Catapano, Joanna DiPasquale, and Stuart Marquis called “Building an archival collections portal“. The article describes the environment and outlines the process folks at Columbia University use to make sets of their archival collections available on the Web. Their particular process begins with sets of MARC records dumped from their integrated library system. Catapano, DiPasquale, and Marquis then crosswalk the MARC to EAD, feed the EAD to Solr/Lucene, and provide access to the resulting index. Their implementation uses a mixture of Perl, XSLT, PHP, and Javascript. What was most interesting was the way they began the process with MARC records.

Florida State University and tests/tools

Today I read an article by Plato L. Smith II from Information Technology and Libraries (volume 27, number 2, pages 26-30) called “Preparing locally encoded electronic finding aid inventories for union environments: A Publishing model for Encoded Archival Description”. [COinS] Smith describes how the Florida State University Libraries create their EAD files with NoteTab Light templates and then convert them into HTML and PDF documents using XSLT. They provide access to the results through the use of a content management system — DigiTool. What I found most intriguing about this article were the links to tests/tools used to enrich their EAD files, namely the RLG EAD Report Card and the Online Archive of California Best Practices Guidelines, Appendix B. While I haven’t set them up yet, the former should check EAD files for conformity (beyond validity), and the latter will help create DACS-compliant EAD Formal Public Identifiers.

Catholic Research Resources Alliance portal

Both of these articles will help me implement the Catholic Research Resources Alliance (CRRA) portal. From a recent workshop I facilitated:

The ultimate goal of the CRRA is to facilitate research in Catholic scholarship. The focus of this goal is directed towards scholars but no one is excluded from using the Alliance’s resources. To this end, participants in the Alliance are expected to make accessible rare, unique, or infrequently held materials. Alliance members include but are not limited to academic libraries, seminaries, special collections, and archives. Similarly, content might include but is not limited to books, manuscripts, letters, directories, newspapers, pictures, music, videos, etc. To date, some of the Alliance members are Boston College, Catholic University, Georgetown University, Marquette University, Seton Hall University, University of Notre Dame, and University of San Diego.

Like the Columbia University implementation, the portal is expected to allow Alliance members to submit MARC records describing individual items. The Catapano, DiPasquale, and Marquis article will help me map my MARC fields to my local index. Like the Florida State University implementation, the portal is expected to allow Alliance members to submit EAD files. The Smith article will help me create unique identifiers. For Alliance members who have neither MARC nor EAD files, the portal is expected to allow them to submit their content via a fill-in-the-blank interface which I am adopting from the good folks at the Archives Hub.

The CRRA portal application is currently based on MyLibrary and an indexer/search engine called KinoSearch. After they are submitted to the portal, EAD files and MARC records are parsed and saved to a MySQL database using the Perl-based MyLibrary API. Various reports are then written against the database, again using the MyLibrary API. These reports are used to create on-the-fly browsable lists of formats, names, subjects, and CRRA “themes”. They are used to create sets of XML files for OAI-PMH harvesting. They are used to feed data to KinoSearch to create an index. (For example, see mylibrary2files.pl and then ead2kinosearch.pl.) Finally, the whole thing is brought together with a single Perl script for searching (via SRU) and browsing.

It is nice to see a growing interest in EAD. I think the archival community has a leg up on its library brethren regarding metadata. They are using XML more and more. Good for them!

Finally, let’s hear it for the ‘Net, free-flowing communication, and open source software. Without these things I would not have been able to accomplish nearly as much as I have regarding the portal. “Thanks guys and gals!”