metadata « Infomotions Mini-Musings

Posts Tagged ‘metadata’

Great Ideas Coefficient

Saturday, March 27th, 2010

This posting outlines a concept I call the Great Ideas Coefficient — an additional type of metadata used to denote the qualities of a text.

Great Ideas Coefficient

In the 1950s a man named Mortimer Adler and colleagues brought together what they thought were the most significant written works of Western civilization. They called this collection the Great Books of the Western World. Before they created the collection they outlined what they thought were the 100 most significant ideas of Western civilization. These are “great ideas” such as but not limited to beauty, courage, education, law, liberty, nature, sin, truth, and wisdom. Interesting.

Suppose you were able to weigh the value of a book based on these “great ideas”. Suppose you had a number of texts and you wanted to rank or list them according to the number of times they mentioned the “great ideas”. Such a thing can be done through the application of TFIDF. Here’s how:

1. create a list of the “great ideas”
2. calculate the TFIDF score for each idea in a given book
3. sum the scores for each idea
4. assign the score to the book
5. go to Step #2 for each book in a corpus
6. sort the corpus based on the total scores
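
The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the Aristotle graphs: the function name `great_ideas_coefficient`, the simple tokenizer, the particular TFIDF variant (term frequency times the natural log of inverse document frequency), and the three toy "texts" are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def great_ideas_coefficient(corpus, ideas):
    """Return {title: summed TFIDF score of the ideas} for each text in corpus."""
    tokens = {title: tokenize(text) for title, text in corpus.items()}
    counts = {title: Counter(words) for title, words in tokens.items()}
    n_docs = len(corpus)
    scores = {}
    for title, words in tokens.items():
        total = 0.0
        for idea in ideas:
            # term frequency: share of the text's words that are this idea
            tf = counts[title][idea] / len(words) if words else 0.0
            # document frequency: number of texts mentioning the idea at all
            df = sum(1 for c in counts.values() if c[idea] > 0)
            idf = math.log(n_docs / df) if df else 0.0
            total += tf * idf
        scores[title] = total
    return scores


# Made-up toy strings standing in for full texts (NOT real Aristotle).
corpus = {
    "metaphysics": "truth and wisdom concern being; wisdom seeks truth in nature",
    "history of animals": "the animals have blood and bones and live in nature",
    "ethics": "courage and virtue and law guide the good life in nature",
}
ideas = ["beauty", "courage", "law", "nature", "truth", "wisdom"]

scores = great_ideas_coefficient(corpus, ideas)
ranked = sorted(scores, key=scores.get, reverse=True)
for title in ranked:
    print(f"{scores[title]:.4f}  {title}")
```

Note that with this variant an idea appearing in every text (here, "nature") contributes nothing, because its inverse document frequency is zero.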

Once the scores are calculated, they can be graphed, and once they are graphed they can be illustrated.

An example of this technique is shown above. For each item in a list of works by Aristotle a Great Ideas Coefficient has been calculated and assigned. The list was then ordered by the score. The score was then plotted graphically. Finally, all the graphs were joined together as an animated GIF image to show the range of scores in the list. Luckily the process seems to work because Aristotle’s Metaphysics ranks at the top with the highest Great Ideas Coefficient, and his History of Animals ranks the lowest. Seems to make sense.

The concept behind the Great Ideas Coefficient is not limited to “great ideas”. Any set of words or phrases could be used. For example, one could create a list of “big names” (Plato, Shakespeare, Galileo, etc.) and calculate a Big Names Coefficient. Alternatively, a person could create a list of other words or phrases for any topic or genre to weigh a set of texts against biology, mathematics, literature, etc.

Find is not the problem that needs to be solved now-a-days. The problem of use and understanding is more pressing. People can find plenty of information. They need (want) assistance in putting the information into context. “Books are for use.” The application of something like the Great Ideas Coefficient may be just one example.

Tags: metadata
Posted in Hacks

Automatic metadata generation

Thursday, July 30th, 2009

I have been having a great deal of success extracting keywords and two-word phrases from documents and assigning them as “subject headings” to electronic texts, that is, automatic metadata generation. In many cases, but not all, the set of keywords I have assigned is just as good as, if not better than, the controlled vocabulary terms assigned by librarians.

The problem

The Alex Catalogue is a collection of roughly 14,000 electronic texts. The vast majority come from Project Gutenberg. Some come from the Internet Archive. The smallest number come from a defunct etext collection at Virginia Tech. All of the documents are intended to center on the themes of American and English literature and Western philosophy.

With the exception of the non-fiction works from the Internet Archive, none of the electronic texts were associated with subject-related metadata. With the exception of author names (which are yet to be “well-controlled”), it has been difficult to learn the “aboutness” of each of the documents. Such a thing is desirable for two reasons: 1) to enable the reader to evaluate the relevance of a document, and 2) to provide a browsable interface to the collection. Without some sort of tags, subject headings, or application of clustering techniques, browsability is all but impossible. My goal was to solve this problem in an automated manner.

The solution

A couple of years ago I used tools such as Lingua::EN::Summarize and Open Text Summarizer to extract keywords and summaries from the etexts and assign them as subject terms. The process worked, but not extraordinarily well. I then learned about Term Frequency Inverse Document Frequency (TFIDF) to calculate “relevance”, and T-Score to calculate the probability of two words appearing side-by-side — bi-grams or two-word phrases. Applying these techniques to the etexts of the Alex Catalogue I have been able to create and add meaningful subject “tags” to each of my documents which then paves the way to browsability. Here is the algorithm I used to implement the solution:
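
The T-Score half of that approach can be illustrated with a short Python sketch. It is an illustration only, not the code run against the Alex Catalogue: the function name `bigram_t_scores`, the `min_count` cutoff, and the made-up sample sentence are all assumptions, and the formula is the common collocation approximation t = (O - E) / sqrt(O), where O is the observed count of a word pair and E is the count expected if the two words were independent.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_t_scores(text, min_count=2):
    """Score adjacent word pairs with t = (O - E) / sqrt(O)."""
    words = tokenize(text)
    n = len(words)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:
            continue  # ignore pairs seen too rarely to judge
        # count expected if w1 and w2 occurred independently of each other
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed)
    return scores


# Made-up sample text, used only to exercise the function.
text = ("aunt polly called tom tom sawyer ran aunt polly sighed "
        "tom sawyer hid from aunt polly")

scores = bigram_t_scores(text)
for pair, t in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{t:.3f}  {' '.join(pair)}")
```

The highest-scoring pairs are candidates for two-word subject phrases; single keywords would be scored separately with TFIDF.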

The results

Through this process I discovered a number of things.

First, in regard to fictional works, the words or phrases returned are often proper nouns, and these are usually the names of characters from the work. An excellent example is Mark Twain’s Adventures of Huckleberry Finn whose currently assigned terms include: huck, tom, joe, injun joe, aunt polly, tom sawyer, muff potter, and injun joe’s.

Second, in regard to works of non-fiction, the words and phrases returned are also nouns, and these are objects referred to often in the etext. A good example is John Stuart Mill’s Auguste Comte and Positivism where the assigned words are: comte, phaenomena, metaphysical, science, mankind, social, scientific, philosophy, and sciences.

Third, automatically generated keywords and phrases were often just as useful as the librarian-assigned Library of Congress Subject Headings. Many of the items harvested from the Internet Archive were complete with MARC records. Some of those records included subject headings. During Step #5 (above), I spent time observing the output and comparing it to previously assigned terms. Take for example a work called Universalism in America: A History by Richard Eddy. Its assigned headings included:

Universalism -- United States -- History
Unitarian Universalist churches -- United States

My automatically generated terms/phrases are:

Granted, the generated list is not perfect. For example, Hosea Ballou is mentioned twice, and the second was probably caused by an OCR error. On the other hand, how was a person to know that Hosea Ballou was even a part of the etext if it weren’t for this process? The same goes for the other people: Thomas Whittemore, Abner Kneeland, and Edward Turner. In defense of controlled vocabulary, the terms “church”, “sermon”, “doctrine”, and “american” could all be assumed from the (rather) hierarchical nature of LCSH, but unless a person understands the nature of LCSH such a thing is not obvious.

As a librarian I understand the power of a controlled vocabulary, but since I am not limited to three to five subject headings per entry, and because controlled vocabularies are often very specific, I have retained the LCSH in each record whenever possible. The more the merrier.

Next steps

Now that the collection has richer metadata, the next steps will be to exploit it. Some of those next steps include:

Normalize the data – All of the subjects for an etext are currently saved in a single database field. They need to be normalized across the database to enable database joins and make it easier to generate reports.
Create a browsable interface – Write a set of static Web pages linking keywords and phrases to etexts. This will make it easier to see at a glance the type of content in the collection.
Re-index – Trivial. Send all the data and metadata back to the indexer ultimately improving the precision/recall ratio.
Enhance search experience – Extract the keywords and phrases from search results and display them to the user. Make them linkable to easily “find more like this one.” Extract the same keywords and phrases and use them to implement the increasingly popular browsable facets feature.
Enhance linked data – Generate a report against the database to create (better) RDF files complete with more meaningful (subject) tags. Link these tags to external vocabularies such as WordNet through the use of linked data, thus contributing to the Semantic Web and enabling others to benefit from my labors. (Infomotions Man says, “Give back to the ’Net.”)
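
As a rough sketch of the first step, normalizing the data, the following Python uses an in-memory SQLite database. Everything about the schema is hypothetical: the table and column names, the semicolon used to delimit subjects within the single field, and the second sample row are assumptions for illustration, not the Alex Catalogue's actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etexts   (id INTEGER PRIMARY KEY, title TEXT, subjects TEXT);
    CREATE TABLE subjects (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
    CREATE TABLE etext_subjects (
        etext_id   INTEGER REFERENCES etexts(id),
        subject_id INTEGER REFERENCES subjects(id),
        PRIMARY KEY (etext_id, subject_id)
    );
""")

# Before: every subject for an etext lives in one delimited field.
conn.executemany(
    "INSERT INTO etexts (title, subjects) VALUES (?, ?)",
    [
        ("Auguste Comte and Positivism", "comte; science; philosophy"),
        ("A hypothetical second text", "science; mankind"),
    ],
)

# After: one row per distinct subject, plus a join table linking it to etexts.
for etext_id, field in conn.execute("SELECT id, subjects FROM etexts").fetchall():
    for term in (t.strip() for t in field.split(";")):
        if not term:
            continue
        conn.execute("INSERT OR IGNORE INTO subjects (term) VALUES (?)", (term,))
        (subject_id,) = conn.execute(
            "SELECT id FROM subjects WHERE term = ?", (term,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO etext_subjects VALUES (?, ?)",
            (etext_id, subject_id),
        )

# A report that the single-field layout makes awkward: etexts per subject.
report = conn.execute("""
    SELECT s.term, COUNT(*) AS n
    FROM subjects s JOIN etext_subjects es ON es.subject_id = s.id
    GROUP BY s.term
    ORDER BY n DESC, s.term
""").fetchall()
print(report)
```

The same join table could also drive the static browse pages and the search facets listed above, since each is essentially a "which etexts share this subject" query.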

Fun! Combining traditional librarianship with computer applications; not automating existing workflows as much as exploiting the inherent functions of a computer. Using mathematics to solve large-scale problems. Making it easier to do learning and research. It is not the what of librarianship that needs to change as much as the how.

Tags: bigrams, metadata, term frequency/inverse document frequency (TFIDF)
Posted in Alex Catalogue, Hacks

Metadata and data structures

Tuesday, August 5th, 2008

It is important to understand the differences between metadata and data structures. This posting outlines some of the differences between the two.

Introduction

Every once in a while people ask me for advice that I am usually very happy to give because the answers usually involve succinctly articulating some of the things floating around in my head. Today someone asked:

I’ve been looking at Dublin Core and looking at MODS to arrive at the best metadata for converting MARC records into human readable format. Dublin Core lacks specificity, but maybe I don’t understand it that well. Plus, I cannot find what parts of the MARC are mapped to what–where are the “rules.” I look at Mods and find it overwhelming and I’m not even sure of its intended purpose.

Below is how I replied.

Dublin Core is a list of element names

First of all, please understand that Dublin Core is really just a list of fifteen or so metadata element names. Title. Creator. Publisher. Format. Identifier. Etc. Moreover, each of these names comes with a simple definition denoting the type of content it is expected to represent. Dublin Core is NOT a metadata format. Dublin Core does not define how data should be encoded. It is simply a list of elements.

MARC and XML as data structures

MARC is a metadata format — a data structure — a container — a “bit bucket”. The MARC standard defines how data should be encoded. First there is a leader. It is always 24 characters long and different characters in the leader denote different things. Then there is the directory — a “map” of where the data resides in the file. Finally, there is the data itself which is divided into indicators, fields, and subfields. This MARC standard has been used to hold bibliographic data as well as authority data. In one case the 245 field is intended to encode title/author information. In another case the 245 means something else. In both cases they are using MARC — a data structure.
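
To make the leader, directory, and data layout concrete, here is a small Python sketch that assembles a tiny MARC-style record and then reads it back. It is a simplified illustration, not a full implementation of the standard: the helper names `build_record` and `parse_record` are hypothetical, the sample record is made up, only data fields with two indicators are handled, and control fields (001 through 009) and character-encoding issues are ignored.

```python
FT, SD, RT = b"\x1e", b"\x1f", b"\x1d"  # field, subfield, record terminators


def build_record(fields):
    """Assemble a record from (tag, indicators, [(code, value), ...]) tuples."""
    directory, data = b"", b""
    for tag, indicators, subfields in fields:
        body = indicators.encode()
        body += b"".join(SD + c.encode() + v.encode() for c, v in subfields)
        body += FT
        # directory entry: 3-char tag, 4-digit length, 5-digit start offset
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode()
        data += body
    base = 24 + len(directory) + 1       # where the data section begins
    length = base + len(data) + 1        # whole record, including terminator
    leader = f"{length:05d}nam  22{base:05d} a 4500".encode()
    return leader + directory + FT + data + RT


def parse_record(raw):
    """Return (leader, fields) by following the directory "map"."""
    leader = raw[:24].decode("ascii")    # the leader is always 24 characters
    base = int(leader[12:17])            # base address of the data section
    directory = raw[24:base - 1]         # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = raw[base + start:base + start + length - 1]  # minus terminator
        indicators = body[:2].decode()
        subfields = [(chunk[:1].decode(), chunk[1:].decode())
                     for chunk in body[2:].split(SD) if chunk]
        fields.append((tag, indicators, subfields))
    return leader, fields


sample = [
    ("100", "1 ", [("a", "Thoreau, Henry David")]),
    ("245", "10", [("a", "Walden")]),
]
raw = build_record(sample)
leader, fields = parse_record(raw)
print(leader)
for tag, indicators, subfields in fields:
    print(tag, repr(indicators), subfields)
```

Nothing in the structure itself says what a 245 field means; that interpretation comes from whichever MARC format (bibliographic, authority, and so on) the record follows.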

XML is a second type of data structure. Instead of leaders, directories, and data sections, XML is made up of nested elements, where the elements allowed in a file are defined by a Document Type Definition (DTD) or XML schema. XML is much more flexible than MARC. XML is much more verbose than MARC. There are many industries supporting XML. MARC is supported by a single industry. MARC was cool in its time, but it has grown long in the tooth. XML is definitely the data structure to use nowadays.

MARCXML and MODS

MARCXML is a specific flavor of XML used to contain 100% of the data in a bibliographic MARC file. It works. It does what it is supposed to do, but in order to really take advantage of it the user needs to know that the 245 field contains title information, the 100 field contains author information, etc. In other words, to use MARCXML the user needs to know the “secret code book” translating library tags into human-readable elements. Moreover, MARCXML retains all of the “syntactical sugar” of MARC. Last name first. First name last. Parentheses around birth and death dates. “pbk” to denote paperback. Etc.

MODS is a second flavor of XML also designed to contain bibliographic data. In at least a couple of ways, MODS is much better than MARCXML. First and foremost, MODS removes the need for the “secret code book” because the element names are human-readable, not integers. Second, some, but not all, of the syntactical sugar is removed.

When it comes to bibliographic data, I advocate MODS over MARCXML any day. Not perfect, but a step in the right direction. There are utilities to convert MARC to MARCXML and then to MODS. Conversion is almost a trivial computing problem to solve.
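
To show what the “secret code book” amounts to in practice, here is a deliberately tiny Python sketch that turns two MARCXML fields into a MODS fragment. It is not one of the conversion utilities mentioned above and covers only a sliver of what they do: the function names are hypothetical, the sample record is made up, and mapping only 245$a to `titleInfo/title` and 100$a to `name/namePart` is an assumption chosen to keep the example short.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Made-up sample record; tags 100 and 245 only make sense with the code book.
marcxml = f"""<record xmlns="{MARC_NS}">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Thoreau, Henry David</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Walden</subfield>
  </datafield>
</record>"""


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None."""
    path = (f"{{{MARC_NS}}}datafield[@tag='{tag}']/"
            f"{{{MARC_NS}}}subfield[@code='{code}']")
    node = record.find(path)
    return node.text if node is not None else None


def marcxml_to_mods(xml_text):
    """Build a minimal MODS element from a MARCXML record string."""
    record = ET.fromstring(xml_text)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title = subfield(record, "245", "a")      # 245 = title statement
    if title:
        title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
        ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    author = subfield(record, "100", "a")     # 100 = main entry, personal name
    if author:
        name = ET.SubElement(mods, f"{{{MODS_NS}}}name")
        ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = author
    return mods


mods = marcxml_to_mods(marcxml)
print(ET.tostring(mods, encoding="unicode"))
```

The output names its parts (`title`, `namePart`) instead of numbering them, though the “last name first” form of the author's name survives the conversion untouched.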

The “right” metadata standard

When it comes to choosing the “right” metadata standard it is often about choosing the “right” flavor of XML. VRA Core, for example, is more amenable to describing image data. TEI is best suited to describing (marking up) prose and/or poetry. EAD is the “best” candidate for archival finding aids. Authority data can be represented in a relatively new XML flavor called MADS. METS is used, more or less, to create collections of metadata objects. RDF is similar to METS and is intended to form the basis of the Semantic Web. SKOS is an RDF-based vocabulary, commonly serialized as XML, for thesauri.

In short, there are two things to consider. First, what is your data? Bibliographic? Image? Full texts? Second, what data structure do you want to employ? MARC? XML? Something else such as a tab-delimited file? (Ick!) Or maybe a relational database schema? (Maybe.) In most cases I expect XML will be the data structure you want to employ, and then the question is, “What XML DTD or schema do I want to exploit?”

I allude to many of these issues in an XML workshop I wrote called XML In Libraries.

‘Hope this helps.

Tags: Dublin Core, MARC, MARCXML, metadata, MODS, XML
Posted in Librarianship

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09
