What is the Open Archives Initiative?

In a sentence, the Open Archives Initiative (OAI) is a protocol built on top of HTTP designed to distribute, gather, and federate meta data. The protocol is expressed in XML. This article describes the problems the OAI is trying to address and outlines how the OAI system is intended to work. By the end of the article you will know more about the OAI and, hopefully, be inspired to implement your own OAI repository or even become a service provider. The canonical home page for the Open Archives Initiative is http://www.openarchives.org/ .

The Problem

Simply stated, the problem is, "How do I identify and locate the information I need?"

We all seem to be drinking from the proverbial fire hose and suffering from at least a little bit of information overload. Internet search engines literally return thousands of hits for the information we need and desire. Items in these search results are often inadequately described, making the selection of particular returned items a hit-or-miss proposition. Bibliographic databases -- indexes of scholarly, formally published journal and magazine literature -- overwhelm the user with too many input options and rarely return the full text of identified articles. Instead, these databases leave the user with a citation requiring a trip to the library, where they will have to navigate a physically large space and hope the article is on the shelf.

From a content provider's point of view, the problem can be stated conversely, "How do I make people aware of the data and information I disseminate?"

There are many people, content providers, who have information to share with the people who really need it. Collecting, organizing, and maintaining the information is only half the battle. Without access these processes are meaningless. Additionally, there may be sets of content providers whose information has something in common, such as subject matter (literature, mathematics, gardening), file format (images, texts, sounds), or community (a library, a business, a user group). These providers may want to co-operate by bringing information about their content together into a single-source search engine, saving the time of the user by reducing the number of databases people have to search and providing access to each provider's content.

The Solution

The OAI addresses the problems outlined above by articulating a method -- a protocol built on top of HTTP -- for sharing meta data buried in Internet-accessible databases. The protocol defines two entities and the language whereby these two entities communicate. The first entity is called a "data provider" or a "repository". For example, a data provider may have a collection of digital images. Each of these images may be described with a set of qualities: title, accession number, date, resolution, narrative description, etc. Alternatively, a data provider may be a pre-print archive -- a collection of pre-published papers -- and therefore each of the items in the archive could be described using title, author, date, summary, and possibly subject heading. Another example could be a list of people, experts in a field of study. The qualities describing this collection may be name, email address, postal address, telephone number, and institutional affiliation.

Thus, the purpose of the first OAI entity -- the data provider -- is to expose the qualities of its collection -- the meta data -- to a second entity, a "service provider". The purpose of the service provider is to harvest the meta data from one or more data providers, in turn creating some sort of value-added utility. This utility is undefined by the protocol but could include things such as a printed directory, a federated index available for searching, a mirror of a data provider, a current awareness service, syndicated news feeds, etc.
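To make the idea of a value-added utility a little more concrete, here is a minimal sketch in Python of one such utility, a federated index. It assumes the meta data has already been harvested into simple dictionaries. The provider names, the record layout, and the build_index function are all hypothetical and exist only for illustration; the protocol itself says nothing about how a service provider stores or indexes what it harvests.

```python
from collections import defaultdict

# Hypothetical harvested meta data, keyed by data provider. In practice
# these records would come from each provider's OAI responses.
HARVESTED = {
    "alex": [
        {"identifier": "twain-tom-40", "title": "Tom Sawyer, Detective"},
    ],
    "example-archive": [
        {"identifier": "x-1", "title": "Detective Stories"},
    ],
}


def build_index(harvested):
    """Map each lower-cased title word to the (provider, identifier) pairs using it."""
    index = defaultdict(set)
    for provider, records in harvested.items():
        for record in records:
            for word in record["title"].lower().split():
                word = word.strip(".,;:!?\"'")
                if word:
                    index[word].add((provider, record["identifier"]))
    return index


index = build_index(HARVESTED)
# One search now covers the content of both providers.
print(sorted(index["detective"]))
```

A single lookup in the index returns items from every harvested provider, which is the time-saving described above.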

In summary, the OAI defines two entities (data provider and service provider) and a protocol for these two entities to share meta data between themselves. The balance of this article describes the protocol in greater detail.

Verbs

The OAI protocol consists of only a few "verbs" (think "commands"), and a set of standardized XML responses. All of the verbs are communicated from the service provider to a data provider via an HTTP request. They are a set of one or more name/value pairs embedded in a URL (as in the GET method) or encoded in the body of the HTTP request (as in the POST method). Most of the verbs can be qualified with additional name/value pairs. The simplest verb is "Identify", and a real example of how this might be passed to a data provider via the GET method is the following: http://www.infomotions.com/alex/oai/?verb=Identify

The example above assumes there is some sort of OAI-aware application saved as the default executable in the /alex/oai directory of the www.infomotions.com host. This application takes the name/value pair, verb=Identify, as input and outputs an XML stream confirming itself as an OAI data provider.

Other verbs work in a similar manner but may include a number of qualifiers in the form of additional name/value pairs. For example, the following verb requests a record, in the Dublin Core meta data format, describing Mark Twain's The Prince And The Pauper: http://www.infomotions.com/alex/oai/?verb=GetRecord&metadataPrefix=oai_dc&identifier=twain-prince-30

Again, the default application in the /alex/oai directory takes the value of the GET request as input and outputs a reply in the form of an XML stream.
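The request side of the protocol can be sketched in a few lines of Python. The example below only builds requests; it does not send them. The base URL is the one used in this article, while the function names are hypothetical. For POST the same name/value pairs travel as form-encoded data instead of being appended to the URL.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base URL taken from the examples in this article.
BASE_URL = "http://www.infomotions.com/alex/oai/"


def build_get_url(base_url, verb, **qualifiers):
    """Return a GET-style URL carrying the verb and its qualifiers."""
    pairs = {"verb": verb, **qualifiers}
    return base_url + "?" + urlencode(pairs)


def build_post_request(base_url, verb, **qualifiers):
    """Return an unsent POST request carrying the same name/value pairs."""
    body = urlencode({"verb": verb, **qualifiers}).encode("ascii")
    return Request(base_url, data=body)


print(build_get_url(BASE_URL, "Identify"))
print(build_get_url(BASE_URL, "GetRecord",
                    metadataPrefix="oai_dc", identifier="twain-prince-30"))
```

The two printed URLs match the Identify and GetRecord examples given above.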

All six of the protocol's verbs are enumerated and very briefly described below:

  1. Identify - This verb is used to verify that a particular service is an OAI repository. The reply to an Identify command includes things like the name of the service, a URL where the services can be reached, the version number of the protocol the repository supports, and the email address to contact for more information. This is by far the easiest verb. Example: http://www.infomotions.com/alex/oai/?verb=Identify
  2. ListMetadataFormats - Meta data takes on many formats, and this command queries the repository for a list of meta data formats the repository supports. In order to be OAI compliant, a repository must at least support the Dublin Core. (For more information about the Dublin Core meta data format see http://dublincore.org/ and http://www.niso.org/standards/resources/Z39-85.pdf .) Example: http://www.infomotions.com/alex/oai/?verb=ListMetadataFormats
  3. ListSets - The data contained in a repository may not necessarily be homogeneous since it might contain information about more than one topic or be saved in more than one format. Therefore the verb ListSets is used to communicate a list of topics or collections of data in a repository. It is quite possible that a repository has no sets, and consequently a reply would contain no set information. Example: http://www.infomotions.com/alex/oai/?verb=ListSets
  4. ListIdentifiers - It is assumed each item in a repository is associated with some sort of unique key -- an identifier. This verb requests a list of the identifiers from a repository. Since this list can be quite long, and since the information in a repository may or may not significantly change over time, this command can take a number of optional qualifiers including a resumption token, date ranges, or set specifications. In short, this command asks a repository, "What items do you have?" Example: http://www.infomotions.com/alex/oai/?verb=ListIdentifiers
  5. GetRecord - This verb provides the means of retrieving a specific meta data record given a specific identifier. It requires two qualifiers: 1) the identifier of the item, and 2) the name of the meta data format the data is expected to be encoded in. The result will be a description of an item in the repository. Example: http://www.infomotions.com/alex/oai/?verb=GetRecord&metadataPrefix=oai_dc&identifier=twain-new-36
  6. ListRecords - This command is a more generalized version of GetRecord. It allows a service provider to retrieve data from a repository without knowing specific identifiers. Instead this command allows the contents of a repository to be dumped en masse. This command can also take a number of qualifiers specifying date ranges or set specifications. This verb has one required qualifier, a meta data specification. Example: http://www.infomotions.com/alex/oai/?verb=ListRecords&metadataPrefix=oai_dc
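The six verbs and their qualifiers can be summarized as a small lookup table. The Python sketch below records which qualifiers the list above describes as required or optional and checks a request against it. The spellings from, until, set, and resumptionToken are my assumption about how the date ranges, set specifications, and resumption token are named, so check them against the specification. The check_request function is hypothetical.

```python
# Required and optional qualifiers for each verb, following the list above.
# The optional names are assumptions; consult the specification.
VERBS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": set()},
    "ListSets": {"required": set(), "optional": set()},
    "ListIdentifiers": {
        "required": set(),
        "optional": {"from", "until", "set", "resumptionToken"},
    },
    "GetRecord": {
        "required": {"identifier", "metadataPrefix"},
        "optional": set(),
    },
    "ListRecords": {
        "required": {"metadataPrefix"},
        "optional": {"from", "until", "set"},
    },
}


def check_request(verb, qualifiers):
    """Return a list of problems with a request; an empty list means none found."""
    if verb not in VERBS:
        return ["unknown verb: " + verb]
    rules = VERBS[verb]
    given = set(qualifiers)
    problems = []
    for name in sorted(rules["required"] - given):
        problems.append("missing required qualifier: " + name)
    for name in sorted(given - rules["required"] - rules["optional"]):
        problems.append("unexpected qualifier: " + name)
    return problems


print(check_request("GetRecord", {"identifier": "twain-new-36"}))
```

A service provider might run such a check before sending a request, and a data provider might run it before answering one.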

Responses -- the XML stream

Upon receiving any one of the verbs outlined above it is the responsibility of the repository to reply in the form of an XML stream, and since this communication is happening on top of the HTTP protocol, the HTTP header's content-type must be text/xml. Error codes are passed via the HTTP status-code.
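A harvester can apply these two rules before it tries to parse anything. The following Python sketch assumes the status code and content-type header have already been read from the HTTP reply; the function name and the wording of its return values are hypothetical.

```python
def check_http_reply(status_code, content_type):
    """Classify a repository's reply before attempting to parse it."""
    if status_code >= 400:
        return "error reported by HTTP status " + str(status_code)
    # The header may carry parameters, e.g. "text/xml; charset=UTF-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/xml":
        return "unexpected content type: " + media_type
    return "ok"


print(check_http_reply(200, "text/xml; charset=UTF-8"))
print(check_http_reply(400, "text/html"))
```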

All responses have a similar format. They begin with an XML declaration. The root of the XML stream always echoes the name of the verb sent in the request as well as a listing of name spaces and schema. This is followed by a date stamp and an echoing of the original request.
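Because every response shares this envelope, a harvester can read it with one small routine regardless of the verb. The Python sketch below uses the standard library's ElementTree and an abbreviated sample response without the XML declaration. The function names are hypothetical, and the check that the root element matches the verb in the echoed request is my own illustration, not a step the protocol prescribes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Abbreviated sample response for illustration.
SAMPLE = """<Identify xmlns="http://www.openarchives.org/OAI/1.0/OAI_Identify">
  <responseDate>2002-02-16T09:40:35-07:00</responseDate>
  <requestURL>http://www.infomotions.com/alex/oai/index.php?verb=Identify</requestURL>
</Identify>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_envelope(xml_text):
    """Return the echoed verb, date stamp, and request URL of a response."""
    root = ET.fromstring(xml_text)
    fields = {local_name(child.tag): (child.text or "").strip() for child in root}
    return {
        "verb": local_name(root.tag),
        "responseDate": fields.get("responseDate"),
        "requestURL": fields.get("requestURL"),
    }


def verb_matches_request(envelope):
    """True when the root element names the verb found in the request URL."""
    query = parse_qs(urlparse(envelope["requestURL"] or "").query)
    return query.get("verb") == [envelope["verb"]]


envelope = read_envelope(SAMPLE)
print(envelope["verb"], verb_matches_request(envelope))
```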

For each of the verbs there are a number of different XML elements expected in the response. For example, the Identify verb requires the elements repositoryName, baseURL, protocolVersion, and adminEmail. Below is a very simple but valid reply to the Identify verb:

<?xml version="1.0" encoding="UTF-8" ?> 

<Identify
  xmlns="http://www.openarchives.org/OAI/1.0/OAI_Identify"
  xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
  xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_Identify
  http://www.openarchives.org/OAI/1.0/OAI_Identify.xsd">

  <responseDate>2002-02-16T09:40:35-07:00</responseDate>
  <requestURL>http://www.infomotions.com/alex/oai/index.php?verb=Identify</requestURL>

  <!-- Identify-specific content -->
  <repositoryName>Alex Catalogue of Electronic Texts</repositoryName> 
  <baseURL>http://www.infomotions.com/alex/</baseURL> 
  <protocolVersion>1.0</protocolVersion> 
  <adminEmail>eric_morgan@infomotions.com</adminEmail> 

</Identify>
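A service provider might confirm that an Identify reply carries the four required elements before trusting a repository. The Python sketch below does so for an abbreviated copy of the reply above; the parse_identify function is hypothetical.

```python
import xml.etree.ElementTree as ET

# The four elements the Identify verb requires, per the text above.
REQUIRED = ["repositoryName", "baseURL", "protocolVersion", "adminEmail"]
NS = "{http://www.openarchives.org/OAI/1.0/OAI_Identify}"

SAMPLE = """<Identify xmlns="http://www.openarchives.org/OAI/1.0/OAI_Identify">
  <repositoryName>Alex Catalogue of Electronic Texts</repositoryName>
  <baseURL>http://www.infomotions.com/alex/</baseURL>
  <protocolVersion>1.0</protocolVersion>
  <adminEmail>eric_morgan@infomotions.com</adminEmail>
</Identify>"""


def parse_identify(xml_text):
    """Return (found values, names of missing required elements)."""
    root = ET.fromstring(xml_text)
    info = {}
    missing = []
    for name in REQUIRED:
        value = root.findtext(NS + name)
        if value is None:
            missing.append(name)
        else:
            info[name] = value.strip()
    return info, missing


info, missing = parse_identify(SAMPLE)
print(info["repositoryName"], missing)
```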

The output of the ListMetadataFormats verb describes the meta data formats supported by the repository. Therefore, the response to a ListMetadataFormats request includes a metadataFormat element with a number of children: metadataPrefix, schema, and metadataNamespace. Here is an example:

<?xml version="1.0" encoding="UTF-8" ?> 

<ListMetadataFormats
  xmlns="http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats"
  xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
  xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats
  http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats.xsd">

  <responseDate>2002-02-16T09:51:49-07:00</responseDate>
  <requestURL>http://www.infomotions.com/alex/oai/index.php?verb=ListMetadataFormats</requestURL> 

  <!-- ListMetadataFormats-specific content -->
  <metadataFormat>
    <metadataPrefix>oai_dc</metadataPrefix> 
    <schema>http://www.openarchives.org/OAI/dc.xsd</schema> 
    <metadataNamespace>http://purl.org/dc/elements/1.1/</metadataNamespace> 
  </metadataFormat>

</ListMetadataFormats>
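A harvester will usually read this reply to learn which metadataPrefix values it may ask for later. The Python sketch below collects the formats from an abbreviated copy of the reply above into a dictionary keyed by prefix; the read_formats function is hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used only for the searches below.
NS = {"oai": "http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats"}

SAMPLE = """<ListMetadataFormats xmlns="http://www.openarchives.org/OAI/1.0/OAI_ListMetadataFormats">
  <metadataFormat>
    <metadataPrefix>oai_dc</metadataPrefix>
    <schema>http://www.openarchives.org/OAI/dc.xsd</schema>
    <metadataNamespace>http://purl.org/dc/elements/1.1/</metadataNamespace>
  </metadataFormat>
</ListMetadataFormats>"""


def read_formats(xml_text):
    """Return {metadataPrefix: {"schema": ..., "metadataNamespace": ...}}."""
    root = ET.fromstring(xml_text)
    formats = {}
    for fmt in root.findall("oai:metadataFormat", NS):
        prefix = fmt.findtext("oai:metadataPrefix", default="", namespaces=NS).strip()
        formats[prefix] = {
            "schema": fmt.findtext("oai:schema", default="", namespaces=NS).strip(),
            "metadataNamespace": fmt.findtext(
                "oai:metadataNamespace", default="", namespaces=NS).strip(),
        }
    return formats


formats = read_formats(SAMPLE)
# A compliant repository should at least offer Dublin Core.
print("oai_dc" in formats)
```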

Perhaps the simplest response is the one to the ListIdentifiers verb. Besides the standard output, it contains a single additional XML element, identifier, repeated once for each item:

<?xml version="1.0" encoding="UTF-8" ?> 

<ListIdentifiers
  xmlns="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers"
  xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
  xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers
  http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers.xsd">

  <responseDate>2002-02-16T10:03:09-07:00</responseDate>
  <requestURL>http://www.infomotions.com/alex/oai/index.php?verb=ListIdentifiers</requestURL> 

  <!-- ListIdentifiers-specific content -->
  <identifier>twain-30-44</identifier> 
  <identifier>twain-adventures-27</identifier> 
  <identifier>twain-adventures-28</identifier> 
  <identifier>twain-connecticut-31</identifier> 
  <identifier>twain-extracts-32</identifier> 

</ListIdentifiers>
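A list of identifiers is mostly useful as input to further requests. The Python sketch below pulls the identifiers out of an abbreviated copy of the reply above and turns each one into a GetRecord request URL. The base URL and the oai_dc prefix come from this article; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = "{http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers}"
BASE_URL = "http://www.infomotions.com/alex/oai/"

SAMPLE = """<ListIdentifiers xmlns="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers">
  <identifier>twain-30-44</identifier>
  <identifier>twain-adventures-27</identifier>
</ListIdentifiers>"""


def read_identifiers(xml_text):
    """Return the identifiers in a ListIdentifiers response, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(NS + "identifier") if el.text]


def get_record_urls(base_url, identifiers, metadata_prefix="oai_dc"):
    """Build one GetRecord URL per identifier."""
    return [
        base_url + "?" + urlencode({
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        })
        for identifier in identifiers
    ]


for url in get_record_urls(BASE_URL, read_identifiers(SAMPLE)):
    print(url)
```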

The last example shows a response to the GetRecord verb. It includes much more information than the previous examples, because it represents the real meat of the matter. XML elements include the record element and all the necessary children of a record as specified by the meta data format:

<?xml version="1.0" encoding="UTF-8" ?> 

<GetRecord
  xmlns="http://www.openarchives.org/OAI/1.0/OAI_GetRecord"
  xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
  xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_GetRecord
  http://www.openarchives.org/OAI/1.0/OAI_GetRecord.xsd">

  <responseDate>2002-02-16T10:09:35-07:00</responseDate>
  <requestURL>http://www.infomotions.com/alex/oai/index.php?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=twain-tom-40</requestURL>

  <!-- GetRecord-specific content -->
  <record>

    <header>
      <identifier>twain-tom-40</identifier> 
      <datestamp>1999</datestamp> 
    </header>

    <metadata>

    <!-- Dublin Core metadata -->
    <dc xmlns="http://purl.org/dc/elements/1.1/"
      xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
      xsi:schemaLocation="http://purl.org/dc/elements/1.1/
      http://www.openarchives.org/OAI/dc.xsd">
        <creator>Twain, Mark</creator> 
        <title>Tom Sawyer, Detective</title> 
        <date>1903</date> 
        <identifier>http://www.infomotions.com/etexts/literature/american/1900-/twain-tom-40.txt</identifier> 
        <rights>This document is in the public domain.</rights> 
        <language>en-US</language> 
        <type>text</type> 
        <format>text/plain</format> 
        <relation>http://www.infomotions.com/alex/</relation> 
        <relation>http://www.infomotions.com/alex/cgi-bin/concordance.pl?cmd=selectConcordance&amp;bookcode=twain-tom-40</relation>
        <relation>http://www.infomotions.com/alex/cgi-bin/configure-ebook.pl?handle=twain-tom-40</relation> 
        <relation>http://www.infomotions.com/alex/cgi-bin/pdf.pl?handle=twain-tom-40</relation> 
        <contributor>Morgan, Eric Lease</contributor> 
        <contributor>Infomotions, Inc.</contributor> 
    </dc>

    </metadata>

  </record>

</GetRecord>
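Reading a record means dealing with two name spaces at once: the OAI one for the record and its header, and the Dublin Core one for the meta data itself. The Python sketch below extracts both from an abbreviated copy of the reply above. Dublin Core elements may repeat (contributor and relation do), so each is collected into a list. The parse_record function is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

OAI = "{http://www.openarchives.org/OAI/1.0/OAI_GetRecord}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<GetRecord xmlns="http://www.openarchives.org/OAI/1.0/OAI_GetRecord">
  <record>
    <header>
      <identifier>twain-tom-40</identifier>
      <datestamp>1999</datestamp>
    </header>
    <metadata>
      <dc xmlns="http://purl.org/dc/elements/1.1/">
        <creator>Twain, Mark</creator>
        <title>Tom Sawyer, Detective</title>
        <contributor>Morgan, Eric Lease</contributor>
        <contributor>Infomotions, Inc.</contributor>
      </dc>
    </metadata>
  </record>
</GetRecord>"""


def parse_record(xml_text):
    """Return the header fields and the Dublin Core elements of one record."""
    root = ET.fromstring(xml_text)
    record = root.find(OAI + "record")
    header = record.find(OAI + "header")
    dc = record.find(OAI + "metadata").find(DC + "dc")
    fields = defaultdict(list)
    for child in dc:
        name = child.tag.rsplit("}", 1)[-1]  # drop the name space prefix
        fields[name].append((child.text or "").strip())
    return {
        "identifier": header.findtext(OAI + "identifier"),
        "datestamp": header.findtext(OAI + "datestamp"),
        "dc": dict(fields),
    }


record = parse_record(SAMPLE)
print(record["identifier"], record["dc"]["title"], record["dc"]["contributor"])
```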

An Example

In an afternoon I created the very beginnings of an OAI data provider application using PHP. The source code to this application is available at http://www.infomotions.com/alex/oai/alex-oai-1.0.tar.gz . Below is a snippet of code implementing the ListIdentifiers verb. When this verb is trapped, ListIdentifiers.php queries the system's underlying (MySQL) database for a list of keys and outputs the list as per the defined protocol:

<?php
	
# begin the response
echo '<ListIdentifiers
  xmlns="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers"
  xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
  xsi:schemaLocation="http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers
  http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers.xsd">';
echo '<responseDate>'. RESPONSEDATE . '</responseDate>';
echo '<requestURL>' . REQUESTURL . '</requestURL>';

# create an sql query and execute it
$sql = "SELECT filename
  FROM titles
  WHERE filename like 'twain%'
  ORDER BY filename";
$rows = mysql_db_query (DATABASE, $sql);
checkResults();

# process each found record
while ($r = mysql_fetch_array($rows)) {

  # display it, escaping any characters that are special in XML
  echo '<identifier>' . htmlspecialchars($r["filename"]) . '</identifier>';

}

# finish the response
echo '</ListIdentifiers>';
	
?>
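For readers who do not use PHP, the same idea can be sketched in Python. This is not a port of the application above, only an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL, the table and column names are borrowed from the PHP snippet, the rows are sample data, and the list_identifiers_response function is hypothetical. The response is abbreviated to the default name space, omitting the schema location.

```python
import sqlite3
from xml.sax.saxutils import escape

NS = "http://www.openarchives.org/OAI/1.0/OAI_ListIdentifiers"


def list_identifiers_response(conn, response_date, request_url):
    """Build a ListIdentifiers response from the keys in the titles table."""
    rows = conn.execute(
        "SELECT filename FROM titles WHERE filename LIKE ? ORDER BY filename",
        ("twain%",),
    )
    parts = [
        '<ListIdentifiers xmlns="%s">' % NS,
        "<responseDate>%s</responseDate>" % escape(response_date),
        "<requestURL>%s</requestURL>" % escape(request_url),
    ]
    for (filename,) in rows:
        # escape() protects against &, <, and > in the stored keys
        parts.append("<identifier>%s</identifier>" % escape(filename))
    parts.append("</ListIdentifiers>")
    return "\n".join(parts)


# Sample data standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (filename TEXT)")
conn.executemany(
    "INSERT INTO titles VALUES (?)",
    [("twain-tom-40",), ("austen-emma-1",), ("twain-30-44",)],
)

print(list_identifiers_response(
    conn,
    "2002-02-16T10:03:09-07:00",
    "http://www.infomotions.com/alex/oai/index.php?verb=ListIdentifiers",
))
```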

Conclusion

This article outlined the intended purpose of the Open Archives Initiative (OAI) protocol coupled with a few examples. Given this introduction you may very well now be able to read the specifications and become a data provider. A more serious challenge includes becoming a service provider, and while Google may provide excellent searching mechanisms for the Internet as a whole, services implementing OAI can provide more specialized ways of exposing the "hidden Web".


Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: This is a pre-edited version of an article appearing in interChange 8:2 (June 2002) pg 18-22.
Date created: 2002-02-25
Date updated: 2004-12-05
Subject(s): articles; OAI (Open Archives Initiative);
URL: http://infomotions.com/musings/what-is-oai/