Archive for the ‘MARC and technical services’ Category

Editing authorities at the speed of four records per minute

Thursday, April 7th, 2016

This missive outlines and documents an automated process I used to “clean up” and “improve” a set of authority records, or, to put it another way, how I edited authorities at the speed of four records per minute.

As you may or may not know, starting in September 2015, I commenced upon a sort of “leave of absence” from my employer.† This leave took me to Tuscany, Venice, Rome, Provence, Chicago, Philadelphia, Boston, New York City, and back to Rome. In Rome I worked for the American Academy in Rome doing short-term projects in the library. The first project revolved around authority records. More specifically, the library’s primary clientele were Americans, but the catalog’s authority records included a smattering of Italian headings. The goal of the project was to automatically convert as many of the “invalid” Italian headings into “authoritative” Library of Congress headings.

Identify “invalid” headings

When I first got to Rome I had the good fortune to hang out with Terry Reese, the author of the venerable MarcEdit.‡ He was there giving workshops. I participated in the workshops. I listened, I learned, and I was grateful for a Macintosh-based version of Terry’s application.

When the workshops were over and Terry had gone home I began working more closely with Sebastian Hierl, the director of the Academy’s library.❧ Since the library was relatively small (about 150,000 volumes), and because the Academy used Koha for its integrated library system, it was relatively easy for Sebastian to give me the library’s entire set of 124,000 authority records in MARC format. I fed the authority records into MarcEdit, and ran a report against them. Specifically, I asked MarcEdit to identify the “invalid” records, which really means, “Find all the records not found in the Library of Congress database.” The result was a set of approximately 18,000 records or approximately 14% of the entire file. I then used MarcEdit to extract the “invalid” records from the complete set, and this became my working data.

Search & download

I next created a rudimentary table denoting the “invalid” records and the subsequent search results for them. This tab-delimited file included values of MARC field 001, MARC field 1xx, an integer denoting the number of times I searched for a matching record, an integer denoting the number of records I found, an identifier denoting a Library of Congress authority record of choice, and a URL providing access to the remote authority record. This table was initialized using a script called authority2list.pl. Given a file of MARC records, it outputs the table.
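The table layout described above can be pictured with a short Python sketch. The original authority2list.pl was a Perl script and is not reproduced here; the column names and the sample record below are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical column names; the post describes six columns but does not name them.
COLUMNS = ["local_id", "heading", "searches", "hits", "lc_id", "lc_url"]

def initialize_rows(records):
    """Turn (001, 1xx) pairs into table rows with empty search results."""
    for local_id, heading in records:
        yield {"local_id": local_id, "heading": heading,
               "searches": 0, "hits": 0, "lc_id": "", "lc_url": ""}

def write_table(rows):
    """Serialize the rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# A made-up sample record, not taken from the Academy's file.
sample = [("12345", "Alighieri, Dante, 1265-1321")]
table = write_table(initialize_rows(sample))
print(table)
```

Each later step only has to fill in the four empty columns, so the same file can be passed from script to script.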

I then systematically searched the Library of Congress for authority headings. This was done with a script called search.pl. Given the table created in the previous step, this script looped through each authority, did a rudimentary search for a valid entry, and output an updated version of the table. This script was a bit “tricky”.❦ It first searched the Library of Congress by looking for the value of MARC 1xx$a. If no records were found, then no updating was done and processing continued. If one record was found, then the Library of Congress identifier was saved to the output and processing continued. If many records were found, then a more limiting search was done by adding a date value extracted from MARC 1xx$d. Depending on the second search result, the output was updated (or not), and processing continued. Out of the original 18,000 “invalid” records, about 50% of them were identified with no (zero) Library of Congress records, about 30% were associated with multiple headings, and the remaining 20% (approximately 3,600 records) were identified with one and only one Library of Congress authority record.
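The two-pass search logic can be sketched in Python as follows. This is not the original search.pl; the search function is a hypothetical stand-in for the remote service, and the names and identifiers in the fake index are invented.

```python
def match_heading(name, dates, search):
    """Return (searches, hits, lc_id) using the two-pass logic described above.

    `search` is a hypothetical callable taking a query string and
    returning a list of Library of Congress identifiers.
    """
    searches = 1
    results = search(name)                      # first pass: 1xx$a only
    if len(results) > 1 and dates:
        # Too many candidates: narrow the query with the date from 1xx$d.
        searches += 1
        results = search(f"{name} {dates}")
    # Only an unambiguous (single) hit is accepted.
    lc_id = results[0] if len(results) == 1 else ""
    return searches, len(results), lc_id

# A stand-in for the remote service, with made-up identifiers.
FAKE_INDEX = {
    "Rossi, Mario": ["n001", "n002"],
    "Rossi, Mario 1901-1980": ["n002"],
    "Bianchi, Anna": ["n003"],
}

def fake_search(query):
    return FAKE_INDEX.get(query, [])

print(match_heading("Rossi, Mario", "1901-1980", fake_search))
print(match_heading("Bianchi, Anna", "", fake_search))
print(match_heading("Verdi, Luca", "", fake_search))
```

Passing the search function in as a parameter keeps the matching rules testable without any network access.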

I now had a list of 3,600 “valid” authority records, and I needed to download them. This was done with a script called harvest.pl. This script is really a wrapper around a program called GNU Wget. Given my updated table, the script looped through each row, and if it contained a URL pointing to a Library of Congress authority record, then the record was cached to the file system. Since the downloaded records were formatted as MARCXML, I then needed to transform them into MARC21. This was done with a pair of scripts: xml2marc.sh and xml2marc.pl. The former simply looped through each file in a directory, and the latter did the actual transformation but along the way updated MARC 001 to the value of the local authority record.
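The 001 update can be illustrated with Python's standard library. This sketch covers only the identifier swap inside a MARCXML record, not the conversion to MARC21 that xml2marc.pl also performed, and both identifiers in the sample are placeholders.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"   # the MARCXML namespace
ET.register_namespace("", MARC_NS)           # keep output free of ns0: prefixes

def set_control_number(marcxml, local_id):
    """Return the MARCXML record with its 001 replaced by local_id."""
    record = ET.fromstring(marcxml)
    for field in record.iter(f"{{{MARC_NS}}}controlfield"):
        if field.get("tag") == "001":
            field.text = local_id
    return ET.tostring(record, encoding="unicode")

# Placeholder identifiers, for illustration only.
sample = (
    f'<record xmlns="{MARC_NS}">'
    '<controlfield tag="001">n00000000</controlfield>'
    '</record>'
)
updated = set_control_number(sample, "12345")
print(updated)
```

Overwriting 001 with the local identifier is what later allows the harvested record to be matched against the record it replaces.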

Verify and merge

In order to allow myself as well as others to verify that correct records had been identified, I wrote another pair of programs: marc2compare.pl and compare2html.pl. Given two MARC files, marc2compare.pl created a list of identifiers, original authority values, proposed authority values, and URLs pointing to full descriptions of each. This list was intended to be poured into a spreadsheet for compare & contrast purposes. The second script, compare2html.pl, simply took the output of the first and transformed it into a simple HTML page making it easier for a librarian to evaluate correctness.

Assuming the 3,600 records were correct, the next step was to merge/overlay the old records with the new records. This was a two-step process. The first step was accomplished with a script called rename.pl. Given two MARC files, rename.pl first looped through the set of new authorities saving each identifier to memory. It then looped through the original set of authorities looking for records to update. When records to update were found, each was marked for deletion by prefixing MARC 001 with “x-”. The second step employed MarcEdit to actually merge the set of new authorities with the original authorities. Consequently, the authority file increased in size by 3,600 records. It was then up to other people to load the authorities into Koha, re-evaluate the authorities for correctness, and if everything was okay, then delete each authority record prefixed with “x-”.
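The mark-then-merge idea can be sketched with plain lists of identifiers. This is an illustration in Python rather than the original rename.pl, it works on 001 values only, and the sample identifiers are made up.

```python
def mark_superseded(original_ids, new_ids, prefix="x-"):
    """Prefix each original 001 that also appears among the new records."""
    replaced = set(new_ids)   # identifiers of the harvested records, held in memory
    return [prefix + i if i in replaced else i for i in original_ids]

original = ["100", "101", "102"]   # made-up local identifiers
harvested = ["101"]                # records found at the Library of Congress

marked = mark_superseded(original, harvested)
merged = marked + harvested        # the merge itself was done with MarcEdit
print(merged)
```

Because the old record is renamed rather than removed, the merged file grows by the number of harvested records until someone deletes the prefixed entries.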

Done.❀

Summary and possible next steps

In summary, this is how things happened. I:

  1. got a complete dump of the original 123,329 authority records
  2. extracted 17,593 “invalid” records
  3. searched LOC for “valid” records and found 3,627 of them
  4. harvested the found records
  5. prefixed the 3,627 001 fields in the original file with “x-”
  6. merged the original authority records with the harvested records
  7. made the new set of 126,956 updated records available

There were many possible next steps. One possibility was to repeat the entire process but with an enhanced search algorithm. This could be difficult considering the fact that searches using merely the value of 1xx$a returned zero hits for half of the working data. A second possibility was to identify authoritative records from a different system such as VIAF or WorldCat. Even if this was successful, I wonder how possible it would have been to actually download authority records as MARC. A third possibility was to write a sort of disambiguation program allowing librarians to choose from a set of records. This could have been accomplished by searching for authorities, presenting possibilities, allowing librarians to make selections via an HTML form, caching the selections, and finally, batch updating the master authority list. Here at the Academy we denoted the last possibility as the “cool” one.

Now here’s an interesting way to look at the whole thing. This process took me about two weeks’ worth of work, and in those two weeks I processed 18,000 authority records. That comes out to 9,000 records/week. There are 40 hours in a work week, and consequently, I processed 225 records/hour. Each hour is made up of 60 minutes, and therefore I processed approximately 4 records/minute, and that is roughly 1 record every fifteen seconds for the last two weeks. Wow!?
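For the curious, the arithmetic can be checked in a few lines of Python, using the rounded figures from the paragraph above.

```python
records = 18_000        # approximate size of the working set
weeks = 2
hours_per_week = 40

per_week = records / weeks
per_hour = per_week / hours_per_week
per_minute = per_hour / 60
seconds_per_record = 60 / per_minute

print(per_week, per_hour, per_minute, seconds_per_record)
```

The exact rate is 3.75 records per minute, or one record every 16 seconds, which rounds to the four per minute of the title.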

Finally, I’d like to thank the Academy (with all puns intended). Sebastian, his colleagues, and especially my office mate (Kristine Iara) were all very supportive throughout my visit. They provided intellectual stimulation and something to do while I contemplated my navel during the “adventure”.

Notes

† Strictly speaking, my adventure was not a sabbatical nor a leave of absence because: 1) as a librarian I was not authorized to take a sabbatical, and 2) I did not have any healthcare issues. Instead, after bits of negotiation, my contract was temporarily changed from full-time faculty to adjunct faculty, and I worked for my employer 20% of the time. The other 80% of time was spent on my “adventure”. And please don’t get me wrong, this whole thing was a wonderful opportunity for which I will be eternally grateful. “Thank you!”

‡ During our overlapping times there in Rome, Terry & I played tourist which included the Colosseum, a happenstance mass at the Pantheon, a Palm Sunday Mass in St. Peter’s Square with tickets generously given to us by Joy Nelson of ByWater Solutions, and a day-trip to Florence. Along the way we discussed librarianship, open source software, academia, and life in general. A good time was had by all.

❧ Ironically, Sebastian & I were colleagues during the dot-com boom when we both worked at North Carolina State University. The world of librarianship is small.

❦ This script — search.pl — was really a wrapper around an application called curl, and thanks go to Jeff Young of OCLC who pointed me to the ATOM interface of the LC Linked Data Service. Without Jeff’s helpful advice, I would have wrestled with OCLC’s various authentication systems and Web Service interfaces.

❀ Actually, I skipped a step in this narrative. Specifically, there are some records in the authority file that were not expected to be touched, even if they are “invalid”. This set of records was associated with a specific call number pattern. Two scripts (fu-extract.pl and fu-remove.pl) did the work. The first extracted a list of identifiers not to touch and the second removed them from my table of candidates to validate.

XML 101

Wednesday, January 6th, 2016

This past Fall I taught “XML 101” online and to library school graduate students. This posting echoes the scripts of my video introductions, and I suppose this posting could also be used as a very gentle introduction to XML for librarians.

Introduction

I work at the University of Notre Dame, and my title is Digital Initiatives Librarian. I have been a librarian since 1987. I have been writing software since 1976, and I will be your instructor. Using materials and assignments created by the previous instructors, my goal is to facilitate your learning of XML.

XML is a way of transforming data into information. It is a method for marking up numbers and text, giving them context, and therefore a bit of meaning. XML includes syntactical characteristics as well as semantic characteristics. The syntactical characteristics are really rather simple. There are only five or six rules for creating well-formed XML, such as: 1) there must be one and only one root element, 2) element names are case-sensitive, 3) elements must be closed properly, 4) elements must be nested properly, 5) attributes must be quoted, and 6) there are a few special characters (&, <, and >) which must be escaped if they are to be used in their literal contexts. The semantics of XML are much more complicated, and they denote the intended meaning of the XML elements and attributes. The semantics of XML are embodied in things called DTDs and schemas.

Again, XML is used to transform data into information. It is used to give data context, but XML is also used to transmit this information in a computer-independent way from one place to another. XML is also a data structure in the same way MARC, JSON, SQL, and tab-delimited files are data structures. Once information is encapsulated as XML, it can be unambiguously transmitted from one computer to another where it can be put to use.

This course will elaborate upon these ideas. You will learn about the syntax and semantics of XML in general. You will then learn how to manipulate XML using XML-related technologies called XPath and XSLT. Finally, you will learn library-specific XML “languages” to learn how XML can be used in Library Land.

Well-formedness

In this, the second week of “XML 101 for librarians”, you will learn about well-formed XML and valid XML. Well-formed XML is XML that conforms to the five or six syntactical rules. (XML must have one and only one root element. Element names are case sensitive. Elements must be closed. Elements must be nested correctly. Attributes must be quoted. And there are a few special characters that must be escaped, namely &, <, and >.) Valid XML is XML that is not only well-formed but also conforms to a named DTD or schema. Think of valid XML as semantically correct.
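One way to see the rules in action is to hand small snippets to an XML parser and watch which ones it rejects. The sketch below uses Python's standard library, not Oxygen, and the snippets are invented; it tests well-formedness only and says nothing about validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One good example, then one violation of each rule.
examples = {
    "good": '<letter date="1776"><to>Abigail</to></letter>',
    "two roots": "<a/><b/>",
    "case mismatch": "<Title>Walden</title>",
    "not closed": "<p>Hello",
    "bad nesting": "<b><i>text</b></i>",
    "unquoted attribute": "<p class=note>text</p>",
    "unescaped ampersand": "<p>Lewis & Clark</p>",
}
for label, text in examples.items():
    print(label, is_well_formed(text))
```

Only the first snippet passes. Replacing the bare ampersand with &amp;amp; would repair the last one.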

Jennifer Weintraub and Lisa McAulay, the previous instructors of this class, provide more than a few demonstrations of how to create well-formed as well as valid XML. Oxygen, the selected XML editor for this course, is both powerful and full-featured, but using it efficiently requires practice. That’s what the assignments are all about. The readings supplement the demonstrations.

DTD’s and namespaces

DTD’s, schemas, and namespaces put the “X” in XML. They make XML extensible. They allow you to define your own elements and attributes to create your own “language”.

DTD’s (document type definitions) and schemas are the semantics of XML. They define what elements exist, what order they appear in, what attributes they can contain, and just as importantly what the elements are intended to contain. DTD’s are older than schemas and not as robust. Schemas are XML documents themselves and go beyond DTD’s in that they provide the ability to define the types of data elements and attributes contain.

Namespaces allow you, the author, to incorporate multiple DTD and schema definitions into a single XML document. Namespaces provide a way for multiple elements of the same name to exist concurrently in a document. For example, two different DTD’s may contain an element called “title”, but one DTD refers to a title as in the title of a book, and the other refers to “title” as if it were an honorific.
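The two kinds of “title” can be shown side by side in a small Python sketch. Both namespace URIs are invented for this example, and the prefixes b and p are arbitrary.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are made up for illustration.
BOOK = "http://example.org/book"
PERSON = "http://example.org/person"

doc = f"""
<record xmlns:b="{BOOK}" xmlns:p="{PERSON}">
  <b:title>Walden</b:title>
  <p:title>Dr.</p:title>
</record>
""".strip()

root = ET.fromstring(doc)

# The parser tells the two elements apart by namespace, not by prefix.
book_title = root.find(f"{{{BOOK}}}title").text
honorific = root.find(f"{{{PERSON}}}title").text
print(book_title, honorific)
```

Both elements are named “title”, yet each can be addressed without ambiguity because the namespace is part of its full name.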

Schemas

Schemas are a more intelligent alternative to DTDs. While DTDs define the structure of XML documents, schemas do it with more exactness. While DTDs only allow you to define elements, the number of elements, the order of elements, attributes, and entities, schemas allow you to do these things and much more. For example, they allow you to define the types of content that go into elements or attributes. Strings (characters). Numbers. Lists of characters or numbers. Boolean (true/false) values. Dates. Times. Etc. Schemas are XML documents in and of themselves, and therefore they can be validated just like any other XML document with a pre-defined structure.

The reading and writing of XML schemas is very librarian-ish because it is about turning data into information. It is about structuring data so it makes sense, and it does this in an unambiguous and computer-independent fashion. It is too bad our MARC (bibliographic) standards are not as rigorous.

RelaxNG, Schematron, and digital libraries

The first is yet another technology for modeling your XML, and it is called RelaxNG. This third modeling technology is intended to be more human readable than schemas and more robust than DTDs. Frankly, I have not seen RelaxNG implemented very many times, but it behooves you to know it exists and how it compares to other modeling tools.

The second is Schematron. This tool too is used to validate XML, but instead of returning “ugly” computer-looking error messages, its errors are intended to be more human-readable and describe why things are the way they are instead of just saying “Wrong!”

Lastly, there is an introduction to digital libraries and trends in their current development. More and more, digital libraries are really and truly implementing the principles of traditional librarianship complete with collection, organization, preservation, and dissemination. At the same time, they are pushing the boundaries of the technology and stretching our definitions. Remember, it is not so much about the technology (the how of librarianship) that is important, but rather the why of libraries and librarianship. The how changes quickly. The why changes slowly, albeit sometimes too slowly.

XPath

This week is all about XPath, and it is used to select content from your XML files. It is akin to navigating a computer’s filesystem from the command line in order to learn what is located in different directories.

XPath is made up of expressions which return values of true, false, strings (characters), numbers, or nodes (subsets of XML files). XPath is used in conjunction with other XML technologies, most notably XSLT and XQuery. XSLT is used to transform XML files into other plain text files. XQuery is akin to the structured query language of relational databases.

You will not be able to do very much with XML other than read or write it, unless you understand XPath. An understanding of XPath is essential if you want to do truly interesting things with XML.
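A small taste of XPath, as a sketch: Python's standard library supports only a limited subset of the language (no functions such as count()), and the catalog below is invented, but the path expressions read the same way they would in a full XPath processor.

```python
import xml.etree.ElementTree as ET

# A made-up, two-record catalog.
catalog = ET.fromstring("""<catalog>
  <book year="1854"><title>Walden</title><author>Thoreau</author></book>
  <book year="1851"><title>Moby Dick</title><author>Melville</author></book>
</catalog>""")

# Every title anywhere below the root, much like listing files in subdirectories.
titles = [t.text for t in catalog.findall(".//title")]

# Only the book whose year attribute equals 1851.
selected = [b.find("title").text
            for b in catalog.findall("./book[@year='1851']")]

print(titles)
print(selected)
```

The first expression walks the whole tree; the second adds a predicate in square brackets to filter on an attribute value.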

XSLT

This week you will be introduced to XSLT, a programming language used to transform XML into other plain text files.

XML is all about information, and it is not about use nor display. In order for XML to be actually useful (to be applied towards some sort of end), specific pieces of data need to be extracted from XML or the whole of the XML file needs to be converted into something else. The most common conversion (or “transformation”) is from some sort of XML into HTML for display in a Web browser. For example, bibliographic XML (MARCXML or MODS) may be transformed into a sort of “catalog card” for display, or a TEI file may be transformed into a set of Web pages, or an EAD file may be transformed into a guide intended for printing. Alternatively, you may want to transform the bibliographic data into a tab-delimited text file for a spreadsheet or an SQL file for a relational database. Along with other sets of information, an XML file may contain geographic coordinates, and you may want to extract just those coordinates to create a KML file, a sort of map file.

XSLT is a programming language but not like most programming languages you may know. Most programming languages are “procedural” (like Perl, PHP, or Python), meaning they execute their commands in a step-wise manner. “First do this, then do that, then do the other thing.” This can be contrasted with “declarative” programming languages where events occur or are encountered in a data file, and then some sort of execution happens. There are relatively few declarative programming languages, but LISP is/was one of them. It is because of the declarative nature of XSLT that the apply-templates command is so important. The apply-templates command sort of tells the XSLT processor to go off and find more events.

Now that you are beginning to learn XSLT and combining it with XPath, you are beginning to do useful things with the XML you have been creating. This is where the real power is. This is where it gets really interesting.

TEI — Text Encoding Initiative

TEI is a granddaddy, when it comes to XML “languages”. It started out as a different form of mark-up, a mark-up called SGML, and SGML grew out of a mark-up language (GML) designed at IBM for the purposes of creating, maintaining, and distributing internal documentation. Now-a-days, TEI is all but a hallmark of XML.

TEI is a mark-up language for any type of literature: poetry or prose. Like HTML, it is made up of head and body sections. The head is the place for administrative, bibliographic, and provenance metadata. The body is where the poetry or prose is placed, and there are elements for just about anything you can imagine: paragraphs, lines, headings, lists, figures, marginalia, comments, page breaks, etc. And if there is something you want to mark-up, but an element does not explicitly exist for it, then you can almost make up your own element/attribute combination to suit your needs.

TEI is quite easily the most well-documented XML vocabulary I’ve ever seen. The community is strong and sustainable, albeit small (if not tiny). The majority of the community is academic and very scholarly. Next to a few types of bibliographic XML (MARCXML, MODS, OAIDC, etc.), TEI is probably the most commonly used XML vocabulary in Library Land, with EAD being a close second. In libraries, TEI is mostly used for the purpose of marking-up transcriptions of various kinds: letters, runs of out-of-print newsletters, or parts of a library special collection. I know of no academic journals marked-up in TEI, no library manuals, nor any catalogs designed for printing and distribution.

TEI, more than any other type of XML designed for literature, is designed to support the computed critical analysis of text. But marking something up in TEI in a way that supports such analysis is extraordinarily expensive in terms of both time and expertise. Consequently, based on my experience, there are relatively very few such projects, but they do exist.

XSL-FO

As alluded to throughout this particular module, XSL-FO is not easy, but despite this fact, I sincerely believe it is an under-utilized tool.

FO stands for “Formatting Objects”, and it, in and of itself, is an XML vocabulary used to define page layout. It has elements defining the size of a printed page, margins, running headers & footers, fonts, font sizes, font styles, indenting, pagination, tables of contents, back-of-the-book indexes, etc. Almost all of these elements and their attributes use a syntax similar to the syntax of HTML’s cascading stylesheets.

Once an XML file is converted into an FO document, you are expected to feed the FO document to an FO processor, and the FO processor will convert the document into something intended for printing, usually a PDF document.

FO is important because not everything is designed nor intended to be digital. Digital everything is a misnomer. The graphic design of a printed medium is different from the graphic design of computer screens or smart phones. In my opinion, important XML files ought to be transformed into different formats for different mediums. Sometimes those mediums are screen oriented. Sometimes it is better to print something, and printed somethings last a whole lot longer. Sometimes it is important to do both.

FO is another good example of what XML is all about. XML is about data and information, not necessarily presentation. XSL transforms data/information into other things — things usually intended for reading by people.

EAD — Encoded Archival Description

Encoded Archival Description (or EAD) is the type of XML file used to enumerate, evaluate, and make accessible the contents of archival collections. Archival collections are often the raw and primary materials of new humanities scholarship. They are usually “the papers” of individuals or communities. They may consist of all sorts of things from letters, photographs, manuscripts, meeting notes, financial reports, audio cassette tapes, and now-a-days computers, hard drives, or CDs/DVDs. One thing, which is very important to understand, is that these things are “collections” and not intended to be used as individual items. MARC records are usually used as a data structure for bibliographically describing individual items — books. EAD files describe an entire set of items, and these descriptions are more colloquially called “finding aids”. They are intended to be read as intellectual works, and the finding aids transform collections into coherent wholes.

Like TEI files, EAD files are comprised of two sections: 1) a header and 2) a body. The header contains a whole lot or very little metadata of various types: bibliographic, administrative, provenance, etc. Some of this metadata is in the form of lists, and some of it is in the form of narratives. More than TEI files, EAD files are intended to be displayed on a computer screen or printed on paper. This is why you will find many XSL files transforming EAD into either HTML or FO (and then to PDF).

RDF

RDF is an acronym for Resource Description Framework. It is a data model intended to describe just about anything. The data model is based on an idea called triples, and as the name implies, the triples have three parts: 1) subjects, 2) predicates, and 3) objects.

Subjects are always URIs (think URLs), and they are the things described. Objects can be URIs or literals (words, phrases, or numbers), and objects are the descriptions. Predicates are also always URIs, and they denote the relationship between the subjects and the objects.
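To make the model concrete, here is a minimal Python sketch that stores triples as plain tuples. It uses no RDF library. The subject URIs under example.org are invented, while the predicates borrow terms from the Dublin Core and FOAF vocabularies.

```python
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# (subject, predicate, object); the example.org URIs are placeholders.
triples = [
    ("http://example.org/book/walden", DCTERMS + "title", "Walden"),
    ("http://example.org/book/walden", DCTERMS + "creator",
     "http://example.org/person/thoreau"),
    ("http://example.org/person/thoreau", FOAF + "name", "Henry David Thoreau"),
]

def describe(subject):
    """Collect every predicate/object pair stated about a subject."""
    return {p: o for s, p, o in triples if s == subject}

def is_uri(value):
    """Crude test separating URI objects from literal objects."""
    return value.startswith(("http://", "https://"))

for s, p, o in triples:
    print(s, p, o, "URI" if is_uri(o) else "literal")
```

Notice how the object of the second triple is the subject of the third. Following such links from one description to the next is how relationships between things are uncovered.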

The idea behind RDF was this. Describe anything and everything in RDF. Reuse as many of the URIs used by other people as possible. Put the RDF on the Web. Allow Internet robots/spiders to harvest and cache the RDF. Allow other computer programs to ingest the RDF, analyse it for the similar uses of subjects, predicates, and objects, and in turn automatically uncover new knowledge and new relationships between things.

RDF is/was originally expressed as XML, but the wider community had two problems with RDF. First, there were no “killer” applications using RDF as input, and second, RDF expressed as XML was seen as too verbose and too confusing. Thus, the idea of RDF languished. More recently, RDF is being expressed in other forms such as JSON and Turtle and N3, but there are still no killer applications.

You will hear the term “linked data” in association with RDF, and linked data is the process of making RDF available on the Web.

RDF is important for libraries and “memory” or “cultural heritage” institutions, because the goal of RDF is very similar to the goals of libraries, archives, and museums.

MARC

The MARC standard has been the bibliographic bread & butter of Library Land since the late 1960’s. When it was first implemented it was an innovative and effective data structure used primarily for the production of catalog cards. With the increasing availability of computers, somebody got the “cool” idea of creating an online catalog. While logical, the idea did not mature with a balance of library and computing principles. To make a long story short, library principles prevailed and the result has been and continues to be painful for both the profession as well as the profession’s clientele.

MARCXML was intended to provide a pathway out of this morass, but since it was designed from the beginning to be “round tripable” with the original MARC standard, all of the short-comings of the original standard have come along for the ride. The Library Of Congress was aware of these short-comings, and consequently MODS was designed. Unlike MARC and MARCXML, MODS has no character limit and its field names are human-readable, not based on numeric codes. Given that MODS is a flavor of XML, all of this is a giant step forward.

Unfortunately, the library profession’s primary access tools — the online catalog and “discovery system” — still heavily rely on traditional MARC for input. Consequently, without a wholesale shift in library practice, the intellectual capital the profession so dearly wants to share is figuratively locked in the 1960’s.

Not a panacea

XML really is an excellent technology, and it is most certainly apropos for the work of cultural heritage institutions such as libraries, archives, and museums. This is true for many reasons:

  1. it is computing platform independent
  2. it requires a minimum of computer technology to read and write
  3. to some degree, it is self-documenting, and
  4. especially considering our profession, it is all about data, information, and knowledge

On the other hand, it does have a number of disadvantages, for example:

  1. it is verbose — not necessarily succinct
  2. while easy to read and write, it can be difficult to process
  3. like all things computer program-esque, it imposes a set of syntactical rules, which people can sometimes find frustrating
  4. its adoption as a standard has not been as ubiquitous as desired

To date you have learned how to read, write, and process XML and a number of its specific “flavors”, but you have by no means learned everything. Instead you have received a more than adequate introduction. Other XML topics of importance include:

  • evolutions in XSLT and XPath
  • XML-based databases
  • XQuery, a standardized method for querying sets of XML similar to the structured query language of relational databases
  • additional XML vocabularies, most notably RSS
  • a very functional way of making modern Web browsers display XML files
  • XML processing instructions as well as reserved attributes like lang

In short, XML is not a panacea, but it is an excellent technology for library work.

Summary

You have all but concluded a course on XML in libraries, and now is a good time for a summary.

First of all, XML is one of culture’s more recent attempts at formalizing knowledge. At its root (all puns intended) is data, such as a number like 1776. Through mark-up we might say this number is a year, thus turning the data into information. By putting the information into context, we might say that 1776 is when the Declaration of Independence was written and a new type of government was formed. Such generalizations fall into the realm of knowledge. To some degree, XML facilitates the transformation of data into knowledge. (Again, all puns intended.)

Second, understand that XML is also a data structure defined by the characteristics of well-formedness. By that I mean XML has one and only one root element. Elements must be opened and closed in a hierarchical manner. Attributes of elements must be quoted, and a few special characters must always be escaped. The X in XML stands for “extensible”, and through the use of DTDs and schemas, specific XML “flavors” can be specified.

With this under your belts you then experimented with at least a couple of XML flavors: TEI and EAD. The former is used to mark-up literature. The latter is used to describe archival collections. You then learned about the XML transformation process through the application of XSL and XPath, two rather difficult technologies to master. Lastly, you made strong efforts to apply the principles of XML to the principles of librarianship by marking up sets of documents or creating your own knowledge entity. It is hoped you have made a leap from mere technology to system. It is not about Oxygen nor graphic design. It is about the chemistry of disseminating data as unambiguously as possible for the purposes of increasing the sphere of knowledge. With these things understood, you are better equipped to practice librarianship in the current technological environment.

Finally, remember, there is no such thing as a Dublin Core record.

Epilogue — Use and understanding

This course in XML was really only an introduction. You were expected to read, write, and transform XML. This process turns data into information. All of this is fine, but what about knowledge?

One of the original reasons texts were marked up was to facilitate analysis. Researchers wanted to extract meaning from texts. One way to do that is to do computational analysis against text. To facilitate computational analysis people thought it was necessary for essential characteristics of a text to be delimited. (It is/was thought computers could not really do natural language processing.) How many paragraphs exist? What are the names in a text? What about places? What sorts of quantitative data can be statistically examined? What main themes does the text include? All of these things can be marked-up in a text and then counted (analyzed).

Now that you have marked up sets of letters with persname elements, you can use XPath to not only find persname elements but count them as well. Which document contains the most persnames? What are the persnames in each document? Tabulate their frequency. Do this over a set of documents to look for trends across the corpus. This is only a beginning, but entirely possible given the work you have already done.
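The counting exercise might look something like the following Python sketch. The two “letters” are tiny made-up documents held in memory; real files would be read from disk and would likely carry headers and namespaces, which this sketch ignores.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical file names and contents, for illustration only.
letters = {
    "letter-1.xml": "<text><p><persname>Abigail</persname> wrote to "
                    "<persname>John</persname>, and <persname>John</persname> "
                    "replied.</p></text>",
    "letter-2.xml": "<text><p><persname>John</persname> went home.</p></text>",
}

def count_names(xml_text):
    """Tabulate the persname values found in one document."""
    root = ET.fromstring(xml_text)
    return Counter(name.text for name in root.iter("persname"))

per_document = {name: count_names(text) for name, text in letters.items()}

# Which document contains the most persnames?
most_names = max(per_document, key=lambda d: sum(per_document[d].values()))

# Frequencies across the whole corpus.
corpus = sum(per_document.values(), Counter())

print(most_names)
print(corpus.most_common())
```

From here it is a short step to charting the frequencies or comparing one correspondent's letters against another's.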

Libraries do not facilitate enough quantitative analysis against our content. Marking things up in XML is a good start, but let’s go to the next step. Let’s figure out how the profession can move its readership from discovery to analysis — towards use & understanding.

Mr. Serials continues

Wednesday, January 6th, 2016

The (ancient) Mr. Serials Process continues to support four mailing list archives, specifically, the archives of ACQNET, Colldv-l, Code4Lib, and NGC4Lib, and this posting simply makes the activity explicit.

Mr. Serials is/was a process I developed quite a number of years ago as a method for collecting, organizing, and archiving electronic journals (serials). The process worked well for a number of years, until electronic journals were no longer distributed via email. Now-a-days, Mr. Serials only collects the content of a few mailing lists. That’s okay. Things change. No big deal.

On the other hand, from a librarian’s and archivist’s point-of-view, it is important to collect mailing list content in its original form — email. Email uses the SMTP protocol. The communication sent back and forth, between email server and client, is well-structured albeit becoming verbose. Probably “the” standard for saving email on a file system is called mbox. Given an mbox file, it is possible to use any number of well-known applications to read/write mbox data. Heck, all you need is a text editor. Increasingly, email archives are not available from mailing list applications, and if they are, then they are available only to mailing list administrators and/or in a proprietary format. For example, if you host a mailing list on Google, can you download an archive of the mailing list in a form that is easily and universally readable? I think not.

Mr. Serials circumvents this problem. He subscribes to mailing lists, saves the incoming email to mbox files, and processes the mbox files to create searchable/browsable interfaces. The interfaces are not hugely aesthetically appealing, but they are more than functional, and the source files are readily available. Just ask.
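To illustrate how approachable mbox really is, here is a minimal sketch in Python. It is not Mr. Serials's own code. The two sample messages, the addresses, and the file name list.mbox are invented for the example, and the sketch only lists senders and subjects, which is the raw material for a browsable interface.

```python
import mailbox
import os
import tempfile

# Two made-up messages in mbox form; each begins with a "From " line
SAMPLE = (
    "From alice@example.org Wed Jan  6 10:00:00 2016\n"
    "From: alice@example.org\n"
    "Subject: Serials pricing\n"
    "\n"
    "First message body.\n"
    "\n"
    "From bob@example.org Wed Jan  6 11:30:00 2016\n"
    "From: bob@example.org\n"
    "Subject: Re: Serials pricing\n"
    "\n"
    "Second message body.\n"
)

def list_subjects(path):
    """Return (sender, subject) pairs for every message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [(message["From"], message["Subject"]) for message in box]
    finally:
        box.close()

# write the sample to a temporary file and read it back
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "list.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(SAMPLE)
    subjects = list_subjects(path)

for sender, subject in subjects:
    print(sender, "|", subject)
```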

Most recently both the ACQNET and Colldv-l mailing lists moved away from their hosting institutions to servers hosted by the American Library Association. This has not been the first time these lists have moved. It probably won’t be the last, but since Mr. Serials continues to subscribe to these lists, comprehensive archives persevere. Score a point for librarianship and the work of archives. Long live Mr. Serials.

Re-MARCable

Tuesday, November 17th, 2015

This blog posting contains: 1) questions/statements about MARC and posted by graduate library school students taking an online XML class I’m teaching this semester, and 2) my replies. Considering my previously published blog posting, you might say this posting is “re-MARCable”.

I’m having some trouble accessing the file named data.marc for the third question in this week’s assignment. It keeps opening in word and all I get is completely unreadable. Is there another way of going about finding the answer for that particular question?

Okay. I have to admit. I’ve been a bit obtuse about the MARC file format.

MARC is/was designed to contain ASCII characters, and therefore it ought to be human-readable. MARC does not contain binary characters and therefore ought to be readable in text editors. DO NOT open the .marc file in your word processor. Use your text editor to open it up. If you have line wrap turned off, then you ought to see one very long line of ugly text. If you turn on line wrap, then you will see many lines of… ugly text. Attached (hopefully) is a screen shot of many MARC records loaded into my text editor. And I rhetorically ask, “How many records are displayed, and how do you know?”

[screenshot: MARC records loaded into a text editor]

I’m trying to get y’all to answer a non-rhetorical question asked of yourselves, “Considering the state of today’s computer technology, how viable is MARC? What are the advantages and disadvantages of MARC?”

I am taking Basic Cataloging and Classification this semester, but we did not discuss octets or have to look at an actual MARC file. Since this is supposed to be read by a machine, I don’t think this file format is for human consumption which is why it looks scary.

[Student], you continue to be a resource for the entire class. Thank you.

Everybody, yes, you will need to open the .marc file in your text editor. All of the files we are creating in this class ought to be readable in your text editor. True and really useful data files ought to be text files so they can be transferred from application to application. Binary files are sometimes more efficient, but not long-lasting. Here in Library Land we are in it for the long haul. Text files are where it is at. PDF is bad enough. Knowing how to manipulate things in a text editor is imperative when it comes to really using a computer. Imperative!!! Everything on the Web is in plain text.

In any event, open the .marc file in your text editor. On a Macintosh that is TextEdit. On Windows it is Notepad or WordPad. Granted all of these particular text editors are rather brain-dead, but they all function well enough. A better text editor for Macintosh is TextWrangler, and for Windows is Notepad++. When you open the .marc file, it will look ugly. It will seem unreadable, but that is not the case at all. Instead, a person needs to know the “secret codes” of cataloging, as well as a bit of an obtuse data structure in order to make sense of the whole thing.

Okay. Octets. Such are 8-bit characters, as opposed to the 7-bit characters of ASCII encoding. The use of 8-bit characters enabled Library Land to integrate characters such as ñ, é, or å into its data. And while Library Land was ahead of the game in this regard, it did not embrace Unicode when it came along:

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. Developed in conjunction with the Universal Character Set standard and published as The Unicode Standard, the latest version of Unicode contains a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets. [1]

Nor did Library Land update its data when changes happened. Consequently, not only do folks outside Library Land need to know how to read and write MARC records (which they can’t), they also need to know and understand the weird character encodings which we use. In short, the data of Library Land is not very easily readable by the wider community, let alone very many people within our own community. Now that is irony. Don’t you think so!? Our data is literally and figuratively stuck in 1965, and we continue to put it there.


Professor, is this data.marc file suppose to be read only by a machine as [a fellow classmate] suggested?

Only readable by a computer? The answer is both no and yes.

Any data file intended to be shared between systems (sets of applications) ought to be saved as plain text in order to facilitate transparency and eliminate application monopolies/tyrannies. Considering the time when MARC was designed, it fulfilled these requirements. The characters were 7-bits long (ASCII), the MARC codes were few and far between, and its sequential nature allowed it to be shipped back and forth on things like tape or even a modem. (“Remember modems?”) Without the use of an intermediary computer program, it is entirely possible to read and write MARC records with a decent text editor. So, the answer is “No, MARC is not only readable by a machine.”

On the other hand, considering how much extra data (“information”) the profession has stuffed into the MARC data structure, it is really really hard to edit MARC records with a text editor. Library Land has mixed three things into a single whole: data, presentation, and data structure. This is really bad when it comes to computing. For example, a thing may have been published in 1542, but the cataloger is not certain of this date. Consequently, they will enter a data value of [1542]. Well, that is not a date (a number), but rather a string (a word). To make matters worse, the cataloger may think the date (year) of publication is within a particular decade but not be exactly sure, and the date may be entered as [154?]. Ack! Then let’s get tricky and add a copyright notation to a more recent but uncertain date — [c1986]. Does it never end? Then let’s talk about the names of people. The venerable Fred Kilgour — founder of OCLC — is denoted in cataloging rules as Kilgour, Fred. Well, I don’t think Kilgour, Fred ever talked backwards to make sure his ideas were sortable. Given the complexity of cataloging rules, which never simplify, it is really not feasible to read and write MARC records without an intermediate computer program. So, on the other hand, “Yes, an intermediary computer program is necessary.” But if this is true, then why don’t catalogers know how to read and write MARC records? The answer lies in what I said above. We have mixed three things into a single whole, and that is a really bad idea. We can’t expect catalogers to be computer programmers too.
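To make the point about dates concrete, here is a hedged sketch of what a program has to do just to turn those bracketed strings back into numbers. The function name, the returned keys, and the decision to read a missing digit as a range are all my own assumptions for the sake of illustration; they are not part of any cataloging standard or library.

```python
import re

# optional "[", optional "c" for copyright, one to four digits,
# then any "-" or "?" placeholders, then an optional "]"
DATE = re.compile(r"^(\[)?(c)?(\d{1,4})([-?]*)\]?$")

def parse_cataloged_date(value):
    """Turn a cataloger's date string into a range of years, or None."""
    match = DATE.match(value.strip())
    if match is None:
        return None
    bracket, copyright_mark, digits, unknown = match.groups()
    missing = 4 - len(digits)          # digits the cataloger left out
    return {
        "earliest": int(digits + "0" * missing),
        "latest": int(digits + "9" * missing),
        "supplied": bracket is not None,        # brackets: supplied by the cataloger
        "copyright": copyright_mark is not None,
        "uncertain": "?" in unknown or missing > 0,
    }

for example in ["1542", "[1542]", "[154?]", "[c1986]", "Kilgour, Fred"]:
    print(example, "->", parse_cataloged_date(example))
```

Even this small sketch has to separate data (the year) from presentation (the brackets and question marks), which is exactly the mixing described above.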

The bottom line is this. Library Land automated its processes but it never really went to the next level and used computers to enhance library collections and services. All Library Land has done is used computers to facilitate library practice; Library Land has not embraced the true functionality of computers such as its ability to evaluate data/information. We have simply done the same thing. We wrote catalog cards by hand. We then typed catalog cards. We then used a computer to create them.

One more thing, Library Land simply does not have enough computer programmer types. Libraries build collections. Cool. Libraries provide services against the collections. Wonderful. This worked well (more or less) when libraries were physical entities in a localized environment. Now-a-days, when libraries are a part of a global network, libraries need to speak the global language, and that global language is spoken through computers. Computers use relational databases to organize information. Computers use indexes to make the information findable. Computers use well-structured Unicode files (such as XML, JSON, and SQL files) to transmit information from one computer to another. In order to function, people who work in libraries (librarians) need to know these sorts of technologies in order to work on a global scale, but realistically speaking, what percentage of librarians know how to do these things, let alone know what they are? Probably less than 10%. It needs to be closer to 33%. Where 33% of the people build collections, 33% of the people provide services, and 33% of the people glue the work of the first 66% into a coherent whole. What to do with the remaining 1%? Call them “administrators”.

[1] Unicode – https://en.wikipedia.org/wiki/Unicode

MARC, MARCXML, and MODS

Wednesday, November 11th, 2015

This is the briefest of comparisons between MARC, MARCXML, and MODS. It was written for a set of library school students learning XML.

MARC is an acronym for Machine Readable Cataloging. It was designed in the 1960’s, and its primary purpose was to ship bibliographic data on tape to libraries who wanted to print catalog cards. Consider the computing context of the time. There were no hard drives. RAM was beyond expensive. And the idea of a relational database had yet to be articulated. Consider the idea of a library’s access tool — the card catalog. Consider the best practice of catalog cards. “Generate no more than four or five cards per book. Otherwise, we will not be able to accommodate all of the cards in our drawers.” MARC worked well, and considering the time, it represented a well-designed serial data structure complete with multiple checksum redundancy.
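For the curious, the serial structure can be sketched in Python. This is a simplified illustration, not a substitute for a real MARC library: the sample fields are invented, the leader values other than the two lengths are placeholders, and character encoding issues are ignored. A record is a 24-character leader, a directory of 12-character entries (tag, field length, starting position), and then the fields themselves, each closed by a terminator.

```python
FT = b"\x1e"   # field terminator
RT = b"\x1d"   # record terminator
SF = b"\x1f"   # subfield delimiter

def build_record(fields):
    """Serialize (tag, data) pairs into the leader/directory/data layout."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FT
        # directory entry: 3-character tag, 4-digit length, 5-digit offset
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    directory += FT
    base = 24 + len(directory)              # where the field data begins
    length = base + len(body) + len(RT)     # length of the whole record
    leader = f"{length:05d}nam  22{base:05d}   4500".encode("ascii")
    return leader + directory + body + RT

def parse_record(raw):
    """Read the leader and directory, then slice out each field."""
    length = int(raw[0:5].decode("ascii"))      # leader positions 00-04
    base = int(raw[12:17].decode("ascii"))      # leader positions 12-16
    directory = raw[24:base - 1].decode("ascii")
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, size, start = entry[0:3], int(entry[3:7]), int(entry[7:12])
        fields[tag] = raw[base + start:base + start + size - 1]
    return length, base, fields

raw = build_record([
    ("100", b"1 " + SF + b"aThoreau, Henry David,"),
    ("245", b"10" + SF + b"aWalden /" + SF + b"cHenry David Thoreau."),
])
length, base, fields = parse_record(raw)
print(raw[0:24].decode("ascii"))
print(length, base, sorted(fields))
```

Notice how the lengths and offsets are stored as fixed-width strings of digits. That design decision, made for tape, has consequences explored in the exercise below.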

Someone then got the “cool” idea to create an online catalog from MARC data. The idea was logical but grew without a balance of library and computing principles. To make a long story short, library principles sans any real understanding of computing principles prevailed. The result was a bloating of the MARC record to include all sorts of administrative data that never would have made it on to a catalog card, and this data was delimited in the MARC record with all sorts of syntactical “sugar” in the form of punctuation. Moreover, as bibliographic standards evolved, the previously created data was not updated, and sometimes people simply ignored the rules. The consequence has been disastrous, and even Google can’t systematically parse the bibliographic bread & butter of Library Land.* The folks in the archives community — with the advent of EAD — are so much better off.

Soon after XML was articulated the Library of Congress specified MARCXML — a data structure designed to carry MARC forward. For the most part, it addressed many of the necessary issues, but since it insisted on making the data in a MARCXML file 100% transformable into a “traditional” MARC record, MARCXML falls short. For example, without knowing the “secret codes” of cataloging — the numeric field names — it is very difficult to determine what are the authors, titles, and subjects of a book.
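Here is a small sketch of what those “secret codes” mean in practice. The sample record is made up, and the LABELS table is my own tiny (and incomplete) mapping of three common tags; the point is that the XML itself never says “author” or “title”, so the program has to bring that knowledge along.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# A made-up record for illustration
MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Thoreau, Henry David,</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Walden /</subfield>
      <subfield code="c">Henry David Thoreau.</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Natural history</subfield>
      <subfield code="z">Massachusetts</subfield>
    </datafield>
  </record>
</collection>"""

# the "secret codes": nothing in the markup supplies these meanings
LABELS = {"100": "author", "245": "title", "650": "subject"}

def describe(record):
    """Collect the text of the fields whose numeric tags we know."""
    result = {}
    for field in record.findall("marc:datafield", NS):
        label = LABELS.get(field.get("tag"))
        if label is None:
            continue
        parts = [s.text for s in field.findall("marc:subfield", NS) if s.text]
        result.setdefault(label, []).append(" ".join(parts))
    return result

root = ET.fromstring(MARCXML)
records = [describe(record) for record in root.findall("marc:record", NS)]
print(records)
```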

The folks at the Library of Congress understood these limitations almost from the beginning, and consequently they created an additional bibliographic standard called MODS — Metadata Object Description Schema. This XML-based metadata schema goes a long way in addressing both the computing times of the day and the needs for rich, full, and complete bibliographic data. Unfortunately, “traditional” MARC records are still the data structure ingested and understood by the profession’s online catalogs and “discovery systems”. Consequently, without a wholesale shift in practice, the profession’s intellectual content is figuratively stuck in the 1960’s.

* Consider the hodgepodge of materials digitized by Google and accessible in the HathiTrust. A search for Walden by Henry David Thoreau returns a myriad of titles, all exactly the same.

Readings

  1. MARC (http://www.loc.gov/marc/bibliographic/bdintro.html) – An introduction to the MARC standard
  2. leader (http://www.loc.gov/marc/specifications/specrecstruc.html#leader) – All about the leader of a traditional MARC record
  3. MARC Must Die (http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/) – An essay by Roy Tennant outlining why MARC is not a useful bibliographic format. Notice when it was written.
  4. MARCXML (https://www.loc.gov/standards/marcxml/marcxml-design.html) – Here are the design considerations for MARCXML
  5. MODS (http://www.loc.gov/standards/mods/userguide/) – This is an introduction to MODS

Exercise

This is much more of an exercise than it is an assignment. The goal of the activity is not to get correct answers but instead to provide a framework for the reader to practice critical thinking against some of the bibliographic standards of the library profession. To the best of your ability, and in the form of a written essay between 500 and 1000 words long, answer and address the following questions based on the contents of the given .zip file:

  1. Measured in characters (octets), what is the maximum length of a MARC record? (Hint: It is defined in the leader of a MARC record.)
  2. Given the maximum length of a MARC record (and therefore a MARCXML record), what are some of the limitations this imposes when it comes to full and complete bibliographic description?
  3. Given the attached .zip file, how many bibliographic items are described in the file named data.marc? How many records are described in the file named data.xml? How many records are described in the file named data.mods? How did you determine the answers to the previous three questions? (Hint: Open and read the files in your favorite text and/or XML editor.)
  4. What is the title of the book in the first record of data.marc? Who is the author of the second record in the file named data.xml? What are the subjects of the third record in the file named data.mods? How did you determine the answers to the previous three questions? Be honest.
  5. Compare & contrast the various bibliographic data structures in the given .zip file. There are advantages and disadvantages to all three.

Fun with Koha

Saturday, July 19th, 2014

These are brief notes about my recent experiences with Koha.

Introduction

As you may or may not know, Koha is a grand daddy of library-related open source software, and it is an integrated library system to boot. Such are no small accomplishments. For reasons I will not elaborate upon, I’ve been playing with Koha for the past number of weeks, and in short, I want to say, “I’m impressed.” The community is large, international, congenial, and supportive. The community is divided into a number of sub-groups: developers, committers, commercial support employees, and, of course, librarians. I’ve even seen people from another open source library system (Evergreen) provide technical support and advice. For the most part, everything is on the ‘Net, well laid out, and transparent. There are some rather “organic” parts to the documentation akin to an “English garden”, but that is going to happen in any de-centralized environment. All in all, and without any patronizing intended, “Kudos to Koha!”

Installation

Looking through my collection of tarballs, I see I’ve installed Koha a number of times over the years, but this time it was challenging. Sparing you all the details, I needed to use a specific version of MySQL (version 5.5), and I had version 5.6. The installation failure was not really Koha’s fault. It is more the fault of MySQL because the client of MySQL version 5.6 outputs a warning message to STDOUT when a password is passed on the command line. This message confused the Koha database initialization process, thus making Koha unusable. After downgrading to version 5.5 the database initialization process was seamless.

My next step was to correctly configure Zebra — Koha’s default underlying indexer. Again, I had installed from source, and my Zebra libraries, etc. were saved in a directory different from the one named in the configuration files created by Koha’s installation process. After correctly updating the value of modulePath to point to /usr/local/lib/idzebra-2.0/ in zebra-biblios-dom.cfg, zebra-authorities.cfg, zebra-biblios.cfg, and zebra-authorities-dom.cfg I could successfully index and search for content. I learned this from a mailing list posting.

Koha “extras”

Koha comes (for free) with a number of “extras”. For example, the Zebra indexer can be deployed as both a Z39.50 server as well as an SRU server. Turning these things on was as simple as uncommenting a few lines in the koha-conf.xml file and opening a few ports in my firewall. Z39.50 is inherently unusable from a human point of view so I didn’t go into configuring it, but it does work. Through the use of XSL stylesheets, SRU can be much more usable. Luckily I have been here before. For example, a long time ago I used Zebra to index my Alex Catalogue as well as some content from the HathiTrust (MBooks). The hidden interface to the Catalogue sports faceted searching and used to support spelling corrections. The MBooks interface transforms MARCXML into simple HTML. Both of these interfaces are quite zippy. In order to get Zebra to recognize my XSL I needed to add an additional configuration directive to my koha-conf.xml file. Specifically, I needed to add a docpath element to my public server’s configuration. Once I re-learned this fact, implementing a rudimentary SRU interface to my Koha index was easy and results are returned very fast. I’m impressed.

My big goal is to figure out ways Koha can expose its content to the wider ‘Net. To this end Koha comes with an OAI-PMH interface. It needs to be enabled, and can be done through the Koha Web-based backend under Home -> Koha Administration -> Global Preferences -> General Systems Preferences -> Web Services. Once enabled, OAI sets can be created through the Home -> Administration -> OAI sets configuration module. (Whew!) Once this is done Koha will respond to OAI-PMH requests. I then took it upon myself to transform the OAI output into linked data using a program called OAI2LOD. This worked seamlessly, and for a limited period of time you can browse my Koha’s cataloging data as linked data. The viability of the resulting linked data is questionable, but that is another blog posting.
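Both SRU and OAI-PMH are nothing more than HTTP requests with well-known parameters, and the following Python sketch builds a pair of them. The host name, the port, and both paths are assumptions made for illustration (your koha-conf.xml and Web server configuration decide the real ones). The parameter names themselves (operation, version, query, maximumRecords, verb, metadataPrefix) come from the SRU and OAI-PMH specifications. Nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoints; substitute the values from your own installation
SRU_BASE = "http://koha.example.org:9998/biblios"
OAI_BASE = "http://koha.example.org/cgi-bin/koha/oai.pl"

def sru_url(query, maximum=10):
    """Build an SRU searchRetrieve request for the given query."""
    parameters = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": query,
        "maximumRecords": str(maximum),
    }
    return SRU_BASE + "?" + urlencode(parameters)

def oai_url(verb, **arguments):
    """Build an OAI-PMH request, such as ListRecords or Identify."""
    return OAI_BASE + "?" + urlencode({"verb": verb, **arguments})

sru = sru_url("walden", maximum=5)
oai = oai_url("ListRecords", metadataPrefix="oai_dc")
print(sru)
print(oai)
```

Paste the resulting URLs into a Web browser, and, if the services are turned on, XML comes back. An XSL stylesheet can then make the SRU response palatable to people.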

Ideas and next steps

Library catalogs (OPACs, “discovery systems”, whatever you want to call them) are not simple applications/systems. They are a mixture of very specialized inventory lists, various types of people with various skills and authorities, indexing, and circulation, etc. Then we — as librarians — add things like messages of the day, record exporting, browsable lists, visualizations, etc. that complicate the whole thing. It is simply not possible to create a library catalog in the “Unix way”. The installation of Koha was not easy for me. There are expenses with open source software, and I all but melted down my server during the installation process. (Everything is now back to normal.) I’ve been advocating open source software for quite a while, and I understand the meaning of “free” in this context. I’m not complaining. Really.

Now that I’ve gotten this far, my next step is to investigate the feasibility of using a different indexer with Koha. Zebra is functional. It is fast. It is multi-faceted (all puns intended). But configuring it is not straight-forward, and its community of support is tiny. I see from rooting around in the Koha source code that Solr has been explored. I have also heard through the grapevine that ElasticSearch has been explored. I will endeavor to explore these things myself and report on what I learn. Different indexers, with more flexible APIs, may make the possibility of exposing Koha content as linked data more feasible as well.

Wish me luck.

Fun with ElasticSearch and MARC

Sunday, June 22nd, 2014

For a good time I have started to investigate how to index MARC data using ElasticSearch. This posting outlines some of my initial investigations and hacks.

ElasticSearch seems to be an increasingly popular indexer. Getting it up and running on my Linux host was… trivial. It comes with a full-fledged Perl interface. Nice! Since ElasticSearch takes JSON as input, I needed to serialize my MARC data accordingly, and MARC::File::JSON seems to do a fine job. With this in hand, I wrote three programs:

  1. index.pl – create an index of MARC records
  2. get.pl – retrieve a specific record from the index
  3. search.pl – query the index

I have some work to do, obviously. First of all, do I really want to index MARC in its raw, communications format? I don’t think so, but that is where I’ll start. Second, the search script doesn’t really search. Instead it simply gets all the records. This is because I really don’t know how to search yet; I don’t really know how to query fields like “245 subfield a”.
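As a thought experiment on the “245 subfield a” question, here is a sketch in Python (not Perl, and not what the scripts below do). It assumes each record is indexed as a real JSON structure following the MARC-in-JSON convention, which may well differ from what MARC::File::JSON emits. The sample record and the dotted field path fields.245.subfields.a are assumptions, and the sketch only builds the request body. It does not talk to ElasticSearch.

```python
import json

# A made-up record in the MARC-in-JSON style (an assumption; see above)
SAMPLE = {
    "leader": "00000nam a2200000 a 4500",
    "fields": [
        {"001": "pamphlet-0001"},
        {"245": {"ind1": "1", "ind2": "0",
                 "subfields": [{"a": "Walden /"},
                               {"c": "Henry David Thoreau."}]}},
    ],
}

def subfield_values(record, tag, code):
    """Pull the values of one subfield out of a MARC-in-JSON record."""
    values = []
    for field in record["fields"]:
        content = field.get(tag)
        if not isinstance(content, dict):   # control fields hold plain strings
            continue
        for subfield in content["subfields"]:
            if code in subfield:
                values.append(subfield[code])
    return values

def match_query(tag, code, text):
    """Build a match query against one subfield, using a dotted field path."""
    return {"query": {"match": {f"fields.{tag}.subfields.{code}": text}}}

titles = subfield_values(SAMPLE, "245", "a")
body = match_query("245", "a", "walden")
print(titles)
print(json.dumps(body))
```

If the index were built this way, the body above is what would be handed to the search method in place of match_all.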

index.pl

#!/usr/bin/perl

# configure
use constant INDEX => 'pamphlets';
use constant MARC  => './pamphlets.marc';
use constant MAX   => 100;
use constant TYPE  => 'marc';

# require
use MARC::Batch;
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $batch = MARC::Batch->new( 'USMARC', MARC );
my $count = 0;
my $e     = Search::Elasticsearch->new;

# process each record in the batch
while ( my $record = $batch->next ) {

  # debug
  print $record->title, "\n";
  
  # serialize the record into json
  my $json = &MARC::File::JSON::encode( $record );
  
  # increment
  $count++;
  
  # index; do the work
  # (hack: the entire JSON string is stored as the single key of the body)
  $e->index( index => INDEX,
             type  => TYPE,
             id    => $count,
             body  => { $json => undef }
  );

  # check; only do a few
  last if ( $count >= MAX );
  
}

# done
exit;

get.pl

# configure 
use constant INDEX => 'pamphlets';
use constant TYPE  => 'marc';

# require
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# get; do the work
my $doc = $e->get( index   => INDEX,
                   type    => TYPE,
                   id      => $ARGV[ 0 ]
);

# reformat and output; done
# (the JSON string was stored as the only key of the document)
my ( $json ) = keys %{ $doc->{ '_source' } };
my $record   = MARC::Record->new_from_json( $json );
print $record->as_formatted, "\n";
exit;

search.pl

# configure 
use constant INDEX => 'pamphlets';

# require
use MARC::File::JSON;
use Search::Elasticsearch;
use strict;

# initialize
my $e = Search::Elasticsearch->new;

# search; for now any query in $ARGV[ 0 ] is ignored and all records are matched
my $results = $e->search(
  index => INDEX,
  body  => { query => { match_all => {} } }
);

# output
my $hits = $results->{ 'hits' }->{ 'hits' };
foreach my $hit ( @$hits ) {

  # the JSON string was stored as the only key of the document
  my ( $json ) = keys %{ $hit->{ '_source' } };
  my $record   = MARC::Record->new_from_json( $json );
  print $record->as_formatted, "\n\n";

}

# done
exit;