MARC « Infomotions Mini-Musings

Posts Tagged ‘MARC’

EAD2MARC

Friday, June 5th, 2009

This posting simply shares three hacks I’ve written to enable me to convert EAD files to MARC records, and ultimately add them to my “discovery” layer — VUFind — for the Catholic Portal:

ead2marcxml.sh – Using xsltproc and a modified version of Terry Reese’s XSL stylesheet, converts all the EAD/.xml files in the current directory into MARCXML files. “Thanks Terry!”
marcxml2marc.sh – Using yaz-marcdump, convert all .marcxml files in the current directory into “real” MARC records.
add-001.pl – A hack to add 001 fields to MARC records. Sometimes necessary since the EAD files do not always have unique identifiers.

The distribution is available in the archives, and distributed under the GNU Public License.

Now, off to go fishing.

Tags: Encoded Archival Description (EAD), MARC
Posted in Hacks | Comments Off on EAD2MARC

Metadata and data structures

Tuesday, August 5th, 2008

It is important to understand the differences between metadata and data structures. This posting outlines some of the differences between the two.

Introduction

Every once in a while people ask me for advice that I am usually very happy to give because the answers usually involve succinctly articulating some of the things floating around in my head. Today someone asked:

I’ve been looking at Dublin Core and looking at MODS to arrive at the best metadata for converting MARC records into human readable format. Dublin Core lacks specificity, but maybe I don’t understand it that well. Plus, I cannot find what parts of the MARC are mapped to what–where are the “rules.” I look at Mods and find it overwhelming and I’m not even sure of its intended purpose.

Below is how I replied.

Dublin Core is a list of element names

First of all, please understand that Dublin Core is really just a list of fifteen or so metadata element names. Title. Creator. Publisher. Format. Identifier. Etc. Moreover, each of these names come with simple definitions denoting the type of content they are expected to represent. Dublin Core is NOT a metadata format. Dublin Core does not define how data should be encoded. It is simply a list of elements.

MARC and XML as data structures

MARC is a metadata format — a data structure — a container — a “bit bucket”. The MARC standard defines how data should be encoded. First there is a leader. It is always 24 characters long and different characters in the leader denote different things. Then there is the directory — a “map” of where the data resides in the file. Finally, there is the data itself which is divided into indicators, fields, and subfields. This MARC standard has been used to hold bibliographic data as well as authority data. In one case the 245 field is intended to encode title/author information. In another case the 245 means something else. In both cases they are using MARC — a data structure.

XML is second type of data structure. Instead of leaders, directories, and data sections, XML is made up of nested elements where the elements of the file are denoted by a Document Type Definition (DTD) or XML schema. XML is much more flexible than MARC. XML is much more verbose than MARC. There are many industries supporting XML. MARC is supported by a single industry. MARC was cool in its time, but it has grown long in the tooth. XML is definitely the data structure to use now-a-days.

MARCXML and MODS

MARCXML is a specific flavor of XML used to contain 100% of the data in a bibliographic MARC file. It works. It does what it is suppose to do, but in order to really take advantage of it the user needs to know that the 245 field contains title information, the 100 field contains author information, etc. In other words, to use MARCXML the user needs to know the “secret code book” translating library tags into human-readable elements. Moreover, MARCXML retains all of the “syntactical” sugar of MARC. Last name first. First name last. Parentheses around birth and death dates. “pbk” to denote paperback. Etc.

MODS is a second flavor of XML also designed to contain bibliographic data. In at least a couple of ways, MODS is much better than MARCXML. First and foremost, MODS removes the need for “secret code book” because the element names are human-readable, not integers. Second, some but not all, of the syntactical sugar is removed.

When it comes to bibliographic data, I advocate MODS over MARCXML any day. Not perfect, but a step in the right direction. There are utilities to convert MARC to MARCXML and then to MODS. Conversion is almost a trivial computing problem to solve.

The “right” metadata standard

When it comes to choosing the “right” metadata standard it is often about choosing the “right” flavor of XML. VRACore, for example, is more amenable to describing image data. TEI is best suited to describe — mark-up — prose and/or poetry. EAD is the “best” candidate for archival finding aids. Authority data can be represented in a relatively new XML flavor called MADS. METS is used, more or less, to create collections of metadata objects. RDF is similar to METS and is intended to form the basis of the Semantic Web. SKOS is an XML format for thesauri.

In short, there are two things to consider. First, what is your data? Bibliographic? Image? Full texts? Second, what data structure do you want to employ? MARC? XML? Something else such as a tab-delimited file? (Ick!) Or maybe a relational database schema? (Maybe.) In most cases I expect XML will be the data structure you want to employ, and then the question is, “What XML DTD or schema do I want to exploit?”

I allude to many of these issues in an XML workshop I wrote called XML In Libraries.

‘Hope this helps.

Tags: Dublin Core, MARC, MARCXML, metadata, MODS, XML
Posted in Librarianship | 4 Comments »

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09
URL: ./

Archives

Categories