Metadata and data structures « Infomotions Mini-Musings

Metadata and data structures

It is important to understand the differences between metadata and data structures. This posting outlines some of the differences between the two.

Introduction

Every once in a while people ask me for advice, which I am usually very happy to give because the answers involve succinctly articulating some of the things floating around in my head. Today someone asked:

I’ve been looking at Dublin Core and looking at MODS to arrive at the best metadata for converting MARC records into human readable format. Dublin Core lacks specificity, but maybe I don’t understand it that well. Plus, I cannot find what parts of the MARC are mapped to what–where are the “rules.” I look at MODS and find it overwhelming and I’m not even sure of its intended purpose.

Below is how I replied.

Dublin Core is a list of element names

First of all, please understand that Dublin Core is really just a list of fifteen or so metadata element names. Title. Creator. Publisher. Format. Identifier. Etc. Moreover, each of these names comes with a simple definition denoting the type of content it is expected to represent. Dublin Core is NOT a metadata format. Dublin Core does not define how data should be encoded. It is simply a list of elements.
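To make the point concrete, here is a minimal sketch in Python. The fifteen names are those of the Dublin Core Metadata Element Set; the sample description, the `record` wrapper element, and the helper functions `as_json` and `as_xml` are made up for illustration. The same names can be written out in two entirely different encodings, and Dublin Core itself has nothing to say about which one is used.

```python
import json
import xml.etree.ElementTree as ET

# The fifteen element names of the Dublin Core Metadata Element Set.
DUBLIN_CORE = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# A hypothetical description; the values are illustrative only.
description = {
    "title": "Walden",
    "creator": "Thoreau, Henry David",
    "date": "1854",
}


def as_json(record):
    """Encode the description as JSON."""
    return json.dumps(record, sort_keys=True)


def as_xml(record):
    """Encode the same description as a small, ad hoc XML document."""
    root = ET.Element("record")  # "record" is an invented wrapper name
    for name, value in record.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Every name used in the description comes from the Dublin Core list.
assert all(name in DUBLIN_CORE for name in description)

print(as_json(description))
print(as_xml(description))
```

Both outputs carry the same titles and creators. Only the container differs, and that container is the subject of the next section.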

MARC and XML as data structures

MARC is a metadata format — a data structure — a container — a “bit bucket”. The MARC standard defines how data should be encoded. First there is a leader. It is always 24 characters long and different characters in the leader denote different things. Then there is the directory — a “map” of where the data resides in the file. Finally, there is the data itself which is divided into indicators, fields, and subfields. This MARC standard has been used to hold bibliographic data as well as authority data. In one case the 245 field is intended to encode title/author information. In another case the 245 means something else. In both cases they are using MARC — a data structure.
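The leader, directory, and data layout can be sketched in a few lines of Python. This is a simplified illustration and not a full MARC implementation: the helper names (`build_record`, `parse_record`, `split_subfields`) and the sample values are my own, and the leader codes are typical placeholders. It assumes the usual layout of a 24-character leader, 12-character directory entries (tag, field length, starting position), and the standard field, record, and subfield delimiter bytes.

```python
FIELD_END = b"\x1e"   # terminates the directory and every field
RECORD_END = b"\x1d"  # terminates the record
SUBFIELD = b"\x1f"    # precedes every subfield code


def build_record(fields):
    """Build a tiny MARC record from (tag, data) pairs, for demonstration."""
    directory = b""
    data = b""
    for tag, value in fields:
        value += FIELD_END
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(value), len(data))
        data += value
    directory += FIELD_END
    base = 24 + len(directory)          # where the data section begins
    length = base + len(data) + 1       # whole record, including terminator
    leader = b"%05dnam a22%05d a 4500" % (length, base)
    return leader + directory + data + RECORD_END


def parse_record(raw):
    """Read the leader, walk the directory, and pull out each field."""
    leader = raw[:24]
    base = int(leader[12:17])           # base address of data
    directory = raw[24:base - 1]        # drop the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # Slice the field out of the data section, minus its terminator.
        fields.append((tag, raw[base + start:base + start + length - 1]))
    return leader.decode("ascii"), fields


def split_subfields(value):
    """Split a data field into its two indicators and (code, text) pairs."""
    indicators = value[:2].decode("ascii")
    parts = value[2:].split(SUBFIELD)[1:]
    return indicators, [(p[:1].decode("ascii"), p[1:].decode("utf-8")) for p in parts]


raw = build_record([
    ("100", b"1 \x1faThoreau, Henry David,\x1fd1817-1862."),
    ("245", b"10\x1faWalden /\x1fcHenry David Thoreau."),
])
leader, fields = parse_record(raw)
for tag, value in fields:
    print(tag, split_subfields(value))
```

Notice that nothing in the parser knows what a 245 is. The structure only says where the bytes live; what they mean is a separate matter. For real work an existing MARC library is almost certainly a better choice than hand-rolled parsing like this.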

XML is a second type of data structure. Instead of leaders, directories, and data sections, XML is made up of nested elements, and the elements allowed in a file are defined by a Document Type Definition (DTD) or XML schema. XML is much more flexible than MARC. XML is much more verbose than MARC. There are many industries supporting XML. MARC is supported by a single industry. MARC was cool in its time, but it has grown long in the tooth. XML is definitely the data structure to use nowadays.

MARCXML and MODS

MARCXML is a specific flavor of XML used to contain 100% of the data in a bibliographic MARC file. It works. It does what it is supposed to do, but in order to really take advantage of it the user needs to know that the 245 field contains title information, the 100 field contains author information, etc. In other words, to use MARCXML the user needs to know the “secret code book” translating library tags into human-readable elements. Moreover, MARCXML retains all of the “syntactical” sugar of MARC. Last name first. First name last. Parentheses around birth and death dates. “pbk” to denote paperback. Etc.
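Here is a small sketch of what that looks like in practice, using Python's standard library. The sample record is hand-made and abbreviated, and `CODE_BOOK` and `describe` are hypothetical names of my own; the element names and namespace are those of the MARCXML "slim" schema.

```python
import xml.etree.ElementTree as ET

# A hand-made, abbreviated MARCXML record (illustrative values).
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Thoreau, Henry David,</subfield>
    <subfield code="d">1817-1862.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Walden /</subfield>
    <subfield code="c">Henry David Thoreau.</subfield>
  </datafield>
</record>"""

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# A two-entry excerpt of the "secret code book"; the real one is far longer.
CODE_BOOK = {"100": "author", "245": "title"}


def describe(xml_text):
    """Map the tags we know about to human-readable labels."""
    record = ET.fromstring(xml_text)
    result = {}
    for field in record.findall("marc:datafield", NS):
        label = CODE_BOOK.get(field.get("tag"))
        sub = field.find("marc:subfield[@code='a']", NS)
        if label and sub is not None:
            result[label] = sub.text
    return result


print(describe(MARCXML))
```

Without `CODE_BOOK` the record is just numbers and letters. And even with it, the values come back with their MARC punctuation intact: the trailing comma after the name and the slash after the title.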

MODS is a second flavor of XML also designed to contain bibliographic data. In at least a couple of ways, MODS is much better than MARCXML. First and foremost, MODS removes the need for the “secret code book” because the element names are human-readable, not integers. Second, some, but not all, of the syntactical sugar is removed.
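For comparison, here is a minimal sketch of a MODS description of a book, again read with Python's standard library. The record is hand-made and far from complete; the element names (`titleInfo`, `title`, `name`, `namePart`) and the namespace come from the MODS schema, while the variable names are my own.

```python
import xml.etree.ElementTree as ET

# A hand-made, minimal MODS record (illustrative values).
MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Walden</title>
  </titleInfo>
  <name type="personal">
    <namePart>Thoreau, Henry David</namePart>
    <namePart type="date">1817-1862</namePart>
  </name>
</mods>"""

NS = {"mods": "http://www.loc.gov/mods/v3"}

record = ET.fromstring(MODS)

# No code book needed: the paths say what they mean.
title = record.findtext("mods:titleInfo/mods:title", namespaces=NS)

# Separate the name itself from the typed date part.
parts = record.findall("mods:name/mods:namePart", NS)
names = [p.text for p in parts if p.get("type") is None]
dates = [p.text for p in parts if p.get("type") == "date"]

print(title, names, dates)
```

The title no longer ends in a slash and the dates have their own labeled slot, but the name is still written last name first. Some of the sugar is gone, not all of it.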

When it comes to bibliographic data, I advocate MODS over MARCXML any day. Not perfect, but a step in the right direction. There are utilities to convert MARC to MARCXML and then to MODS. Conversion is almost a trivial computing problem to solve.
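To give a feel for why the conversion is close to trivial, here is a toy sketch that moves just two fields from MARCXML to MODS in Python. It is not one of the real utilities (the Library of Congress publishes XSLT stylesheets for that job), and the function names and the crude punctuation stripping are assumptions of my own.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
MODS_NS = "http://www.loc.gov/mods/v3"

# A hand-made, abbreviated MARCXML record (illustrative values).
SAMPLE = f"""<record xmlns="{MARC_NS}">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Thoreau, Henry David,</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Walden /</subfield>
  </datafield>
</record>"""


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None."""
    path = (f"{{{MARC_NS}}}datafield[@tag='{tag}']/"
            f"{{{MARC_NS}}}subfield[@code='{code}']")
    return record.findtext(path)


def strip_sugar(text):
    """Crudely trim trailing MARC punctuation; would also clip initials."""
    return text.rstrip(" /,.:;")


def marcxml_to_mods(xml_text):
    """Copy the 245$a and 100$a values into a minimal MODS element tree."""
    record = ET.fromstring(xml_text)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title = subfield(record, "245", "a")
    if title:
        info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
        ET.SubElement(info, f"{{{MODS_NS}}}title").text = strip_sugar(title)
    author = subfield(record, "100", "a")
    if author:
        name = ET.SubElement(mods, f"{{{MODS_NS}}}name", {"type": "personal"})
        ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = strip_sugar(author)
    return mods


ET.register_namespace("mods", MODS_NS)
mods = marcxml_to_mods(SAMPLE)
print(ET.tostring(mods, encoding="unicode"))
```

A real converter handles hundreds of tags, indicators, and edge cases, but the shape of the problem is the same: look up a tag, drop the value into a named element.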

The “right” metadata standard

When it comes to choosing the “right” metadata standard it is often about choosing the “right” flavor of XML. VRACore, for example, is more amenable to describing image data. TEI is best suited to describe — mark-up — prose and/or poetry. EAD is the “best” candidate for archival finding aids. Authority data can be represented in a relatively new XML flavor called MADS. METS is used, more or less, to create collections of metadata objects. RDF is similar to METS and is intended to form the basis of the Semantic Web. SKOS is an XML format for thesauri.

In short, there are two things to consider. First, what is your data? Bibliographic? Image? Full texts? Second, what data structure do you want to employ? MARC? XML? Something else such as a tab-delimited file? (Ick!) Or maybe a relational database schema? (Maybe.) In most cases I expect XML will be the data structure you want to employ, and then the question is, “What XML DTD or schema do I want to exploit?”

I allude to many of these issues in an XML workshop I wrote called XML In Libraries.

‘Hope this helps.

Tags: Dublin Core, MARC, MARCXML, metadata, MODS, XML

This entry was posted on Tuesday, August 5th, 2008 at 9:18 pm and is filed under Librarianship.

4 Responses to “Metadata and data structures”

Bruce D'Arcus says:

August 6, 2008 at 11:44 am

I don’t mean to be pedantic, but your conclusions are a little sloppy. You say:

1) “Dublin Core does not define how data should be encoded. It is simply a list of elements.”

So you’re using a concept from XML (“element”) to describe a format that is independent of XML, and so in fact distorting its purpose in the process.

I’d say the core DC terms (title, etc.) are data properties or attributes. But DC is of course much more than just those core terms these days.

2) “RDF is similar to METS”

“Similar” how?? I see very little similarity, except at the most superficial level that both allow mixing of different metadata structures. RDF, however, is a data model; METS is not.

3) “SKOS is an XML format for thesauri”

SKOS is an RDF vocabulary that can (like any RDF) be serialized as XML. But it is NOT fundamentally an “XML format.”

I don’t mean to pick on you, but if you’re making sweeping suggestions like this, you need to be more careful about the details.

Eric Lease Morgan says:

August 6, 2008 at 12:41 pm

Thank you for the feedback, and you make a number of great points.

Yes, the word “properties” would have been better than “elements” to describe Dublin Core items. Yes, exactly, RDF, just like METS, can be used to mix content from various vocabularies together into a single XML file. That is what I meant. Regarding SKOS, again, your distinction is more precise than my description.

The devil is in the details.

Avi Rappoport says:

August 22, 2008 at 9:42 pm

I’m pretty much a fan of MODS at this point, it’s much less fiddly than MARC for my purposes.

However, I’m wondering if there’s any discussion of a “full-text” tag for journal articles, etc. I don’t mean articles that an author has posted, but something more general, like the PLoS or preprint servers. In my case, there’s no URI because the text doesn’t exist as separate from its metadata, the MODS record *is* the online version. Any guidance for me?

Eric Lease Morgan says:

August 26, 2008 at 8:18 am

Avi, thank you for the feedback.

If I understand your question correctly, then I know for certain MARCXML is not designed to contain full-text mark-up. I’m pretty sure MODS is the same way. Both are intended to contain bibliographic metadata. On the other hand, it would be entirely possible to include a link (think “call number”) in either a MARCXML or MODS file pointing to the full-text of a journal article.

As for the mark-up of the journal article itself, I would advocate the use of TEI. Many might think this is overkill since TEI leans towards the very analytic and scholarly, but in reality, TEI is well-suited to general mark-up of text — prose or poetry.

HTH. –ELM

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09
