Archive for the ‘Librarianship’ Category

MyLibrary: A Digital library framework & toolbox

Wednesday, September 17th, 2008

I recently had an article published in Information Technology and Libraries (ITAL) entitled “MyLibrary: A Digital library framework & toolkit” (volume 27, number 3, pages 12-24, September 2008). From the abstract:

This article describes a digital library framework and toolkit called MyLibrary. At its heart, MyLibrary is designed to create relationships between information resources and people. To this end, MyLibrary is made up of essentially four parts: 1) information resources, 2) patrons, 3) librarians, and 4) a set of locally-defined, institution-specific facet/term combinations interconnecting the first three. On another level, MyLibrary is a set of object-oriented Perl modules intended to read and write to a specifically shaped relational database. Used in conjunction with other computer applications and tools, MyLibrary provides a way to create and support digital library collections and services. Librarians and developers can use MyLibrary to create any number of digital library applications: full-text indexes to journal literature, a traditional library catalog complete with circulation, a database-driven website, an institutional repository, an image database, etc. The article describes each of these points in greater detail.

http://infomotions.com/musings/mylibrary-framework/

The folks at ITAL are gracious enough to allow authors to distribute their work on the Web as long as the distribution happens after print publication. “Nice policy!”

Many people will remember MyLibrary from more than ten years ago. It is alive and well. It drives a few digital library projects at Notre Dame. It is often associated with customization/personalization, but now it is more about creating relationships between people and information resources through an institution-defined controlled vocabulary — a set of facet/term combinations.

MyLibrary is about relationships

In my opinion, libraries spend too much time describing resources and creating interdependencies between them. Instead, I think libraries should be spending more time creating relationships between resources and people. You can do this in any number of ways, and sets of facet/term combinations are just one. Think up qualities used to describe people. Think up qualities used to describe information resources. Create relationships by bringing resources and people together that share qualities.
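Here is a minimal sketch of that idea, assuming people and resources are each described with a set of facet/term pairs. It is not MyLibrary's Perl API; the names (`resources`, `patrons`, `recommend`) and the sample facets and terms are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Set, Tuple

# A facet/term pair, e.g. ("Subjects", "Astronomy"). All values are invented.
FacetTerm = Tuple[str, str]

# Qualities used to describe information resources.
resources: Dict[str, Set[FacetTerm]] = {
    "Astrophysical Journal": {("Subjects", "Astronomy"), ("Formats", "Journals")},
    "Oxford English Dictionary": {("Subjects", "English"), ("Formats", "Dictionaries")},
    "arXiv": {("Subjects", "Astronomy"), ("Subjects", "Physics"), ("Formats", "Preprints")},
}

# Qualities used to describe people.
patrons: Dict[str, Set[FacetTerm]] = {
    "alice": {("Subjects", "Astronomy")},
    "bob": {("Subjects", "English"), ("Formats", "Dictionaries")},
}


def recommend(patron: str) -> List[str]:
    """Return resources sharing at least one facet/term pair with the patron,
    ordered by the number of shared pairs and then by name."""
    interests = patrons[patron]
    scored = [(len(interests & terms), name) for name, terms in resources.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for shared, name in ranked if shared > 0]


if __name__ == "__main__":
    print(recommend("alice"))
    print(recommend("bob"))
```

The relationship is nothing more than a set intersection: the more qualities a person and a resource share, the stronger the case for bringing them together.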

Last of the Mohicans and services against texts

Monday, August 25th, 2008

Here is a word cloud representing James Fenimore Cooper’s The Last of the Mohicans; A narrative of 1757. It is a trivial example of how libraries can provide services against documents, not just the documents themselves.

scout  heyward  though  duncan  uncas  little  without  own  eyes  before  hawkeye  indian  young  magua  much  place  long  time  moment  cora  hand  again  after  head  returned  among  most  air  huron  toward  well  few  seen  many  found  alice  manner  david  hurons  voice  chief  see  words  about  know  never  woods  great  rifle  here  until  just  left  soon  white  heard  father  look  eye  savage  side  yet  already  first  whole  party  delawares  enemy  light  continued  warrior  water  within  appeared  low  seemed  turned  once  same  dark  must  passed  short  friend  back  instant  project  around  people  against  between  enemies  way  form  munro  far  feet  nor  

About the story

While I am not a literary scholar, I am able to read a book and write a synopsis.

Set during the French and Indian War in what was to become upper New York State, two young women are being escorted from one military camp to another. Along the way the hero, Natty Bumppo (also known by quite a number of other names, most notably “Hawkeye” or the “scout”), alerts the convoy that their guide, Magua, is treacherous. Sure enough, Magua kidnaps the women. Fights and battles ensue in a pristine and idyllic setting. Heroic deeds are accomplished by Hawkeye and the “last of the Mohicans”, Uncas. Everybody puts on disguises. In the end, good triumphs over evil but not completely.

Cooper’s style is verbose. Expressive. Flowery. On this level it was difficult to read. Too many words. On the other hand the style was consistent, provided a sort of pattern, and enabled me to read the novel with a certain rhythm.

There were a couple of things I found particularly interesting. First, the allusion to “relish”. I consider this to be a common term nowadays, but Cooper thought it needed elaboration when used to describe food. Cooper used the word within a relatively short span of text to describe a condiment as well as a feeling. Second, I wonder whether Cooper’s description of Indians built on existing stereotypes or created them. “Hugh!”

Services against texts

The word cloud I created is simple and rudimentary. From my perspective, it is just a graphical representation of a concordance, and a concordance has to be one of the most basic of indexes. This particular word cloud (read “concordance” or “index”) allows the reader to get a sense of a text. It puts words in context. It allows the would-be reader to get an overview of the document.

This particular implementation is not pretty, nor is it quick, but it is functional. How could libraries create other services such as these? Everybody can find and get data and information these days. What people desire is help understanding and using the documents. Providing services against texts such as word clouds (concordances) might be one example.
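For the curious, here is a rough sketch of how such a service might be computed. It is not the code behind the word cloud above; the stop word list, the function names, and the sample sentence (which is not a quotation from the novel) are all assumptions made for the sake of illustration. It counts words, and then lists a word in its context, which is what a concordance does.

```python
import re
from collections import Counter
from typing import List, Tuple

# A tiny, illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "was", "his", "that", "he", "it", "with"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(text: str, top: int = 10) -> List[Tuple[str, int]]:
    """Return the most frequent non-stop words: the raw data of a word cloud."""
    words = [word for word in tokenize(text) if word not in STOP_WORDS]
    return Counter(words).most_common(top)


def keyword_in_context(text: str, keyword: str, width: int = 3) -> List[str]:
    """Return each occurrence of keyword with `width` words on either side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            lines.append(" ".join(tokens[max(0, i - width): i + width + 1]))
    return lines


if __name__ == "__main__":
    # Invented sample text, not taken from the novel.
    sample = ("The scout looked at Heyward. Heyward followed the scout "
              "into the woods, and the scout was silent.")
    print(word_counts(sample, top=2))
    print(keyword_in_context(sample, "heyward", width=2))
```

Sizing each word according to its count turns the first list into a word cloud; the second list is the beginning of a concordance.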

Metadata and data structures

Tuesday, August 5th, 2008

It is important to understand the differences between metadata and data structures. This posting outlines some of the differences between the two.

Introduction

Every once in a while people ask me for advice that I am usually very happy to give because the answers usually involve succinctly articulating some of the things floating around in my head. Today someone asked:

I’ve been looking at Dublin Core and looking at MODS to arrive at the best metadata for converting MARC records into human readable format. Dublin Core lacks specificity, but maybe I don’t understand it that well. Plus, I cannot find what parts of the MARC are mapped to what–where are the “rules.” I look at Mods and find it overwhelming and I’m not even sure of its intended purpose.

Below is how I replied.

Dublin Core is a list of element names

First of all, please understand that Dublin Core is really just a list of fifteen or so metadata element names. Title. Creator. Publisher. Format. Identifier. Etc. Moreover, each of these names comes with a simple definition denoting the type of content it is expected to represent. Dublin Core is NOT a metadata format. Dublin Core does not define how data should be encoded. It is simply a list of elements.
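To make the distinction concrete, here is a small sketch in which the same Dublin Core element names are poured into two different containers. The fifteen element names and the namespace URI are those of the Dublin Core element set; the `record` wrapper element and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# The fifteen element names of the Dublin Core element set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]
DC_NS = "http://purl.org/dc/elements/1.1/"

# The metadata itself: element names paired with values, nothing more.
record = {
    "title": "The Last of the Mohicans; A narrative of 1757",
    "creator": "Cooper, James Fenimore",
}


def to_json(rec: dict) -> str:
    """One possible container: a JSON object."""
    return json.dumps(rec, sort_keys=True)


def to_xml(rec: dict) -> str:
    """Another possible container: XML. The <record> wrapper is made up."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name in DC_ELEMENTS:
        if name in rec:
            child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
            child.text = rec[name]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_json(record))
    print(to_xml(record))
```

The element names stay the same in both outputs. Only the data structure changes.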

MARC and XML as data structures

MARC is a metadata format — a data structure — a container — a “bit bucket”. The MARC standard defines how data should be encoded. First there is a leader. It is always 24 characters long and different characters in the leader denote different things. Then there is the directory — a “map” of where the data resides in the file. Finally, there is the data itself which is divided into indicators, fields, and subfields. This MARC standard has been used to hold bibliographic data as well as authority data. In one case the 245 field is intended to encode title/author information. In another case the 245 means something else. In both cases they are using MARC — a data structure.
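The leader/directory/data layout can be sketched in a few lines. This is a simplified illustration, not a replacement for a real MARC library: it assumes a record made up only of variable data fields (no 00X control fields, which lack indicators), and the helper names `build_record`, `parse_record`, and `subfields` are hypothetical.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator
SUBFIELD = b"\x1f"    # subfield delimiter


def build_record(fields):
    """Assemble a minimal MARC-style record from (tag, data) byte pairs."""
    directory = b""
    body = b""
    for tag, data in fields:
        data = data + FIELD_END
        # Directory entry: 3-byte tag, 4-digit length, 5-digit starting position.
        directory += b"%s%04d%05d" % (tag, len(data), len(body))
        body += data
    directory += FIELD_END
    base = 24 + len(directory)        # where the data begins
    total = base + len(body) + 1      # plus the record terminator
    leader = b"%05dnam a22%05d a 4500" % (total, base)
    assert len(leader) == 24          # the leader is always 24 characters
    return leader + directory + body + RECORD_END


def parse_record(raw):
    """Read the leader, walk the directory, and slice out each field."""
    leader = raw[:24]
    base = int(leader[12:17])         # base address of data, from the leader
    directory = raw[24:base - 1]      # drop the directory's field terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # Slice the field out of the data section, dropping its terminator.
        fields.append((tag, raw[base + start: base + start + length - 1]))
    return leader, fields


def subfields(data):
    """Split a data field into its indicators and (code, value) pairs."""
    indicators = data[:2].decode("ascii")
    parts = data[2:].split(SUBFIELD)[1:]
    return indicators, [(p[:1].decode("ascii"), p[1:].decode("utf-8")) for p in parts]


if __name__ == "__main__":
    raw = build_record([
        (b"100", b"1 " + SUBFIELD + b"aCooper, James Fenimore"),
        (b"245", b"14" + SUBFIELD + b"aThe last of the Mohicans /"
                 + SUBFIELD + b"cJames Fenimore Cooper."),
    ])
    leader, fields = parse_record(raw)
    print(leader)
    for tag, data in fields:
        print(tag, subfields(data))
```

Notice that nothing in the code knows what a 245 means. The structure only says where the bytes are; the meaning comes from elsewhere.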

XML is a second type of data structure. Instead of leaders, directories, and data sections, XML is made up of nested elements where the elements of the file are denoted by a Document Type Definition (DTD) or XML schema. XML is much more flexible than MARC. XML is much more verbose than MARC. There are many industries supporting XML. MARC is supported by a single industry. MARC was cool in its time, but it has grown long in the tooth. XML is definitely the data structure to use nowadays.

MARCXML and MODS

MARCXML is a specific flavor of XML used to contain 100% of the data in a bibliographic MARC file. It works. It does what it is supposed to do, but in order to really take advantage of it the user needs to know that the 245 field contains title information, the 100 field contains author information, etc. In other words, to use MARCXML the user needs to know the “secret code book” translating library tags into human-readable elements. Moreover, MARCXML retains all of the “syntactical” sugar of MARC. Last name first. First name last. Parentheses around birth and death dates. “pbk” to denote paperback. Etc.

MODS is a second flavor of XML also designed to contain bibliographic data. In at least a couple of ways, MODS is much better than MARCXML. First and foremost, MODS removes the need for the “secret code book” because the element names are human-readable, not integers. Second, some, but not all, of the syntactical sugar is removed.

When it comes to bibliographic data, I advocate MODS over MARCXML any day. Not perfect, but a step in the right direction. There are utilities to convert MARC to MARCXML and then to MODS. Conversion is almost a trivial computing problem to solve.
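The “secret code book” problem shows up as soon as you try to read a title out of each flavor. The sketch below uses two tiny hand-written sample records (trimmed far below what a real record contains) and two hypothetical helper functions. The namespaces and element names are those of MARCXML and MODS.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hand-written, heavily trimmed sample records for illustration only.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Cooper, James Fenimore</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">The last of the Mohicans</subfield>
  </datafield>
</record>"""

MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>The last of the Mohicans</title>
  </titleInfo>
  <name type="personal">
    <namePart>Cooper, James Fenimore</namePart>
  </name>
</mods>"""


def title_from_marcxml(xml_text):
    """You have to know that 245 subfield a means 'title'."""
    root = ET.fromstring(xml_text)
    node = root.find("marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS)
    return node.text if node is not None else None


def title_from_mods(xml_text):
    """The element names say what they contain."""
    root = ET.fromstring(xml_text)
    node = root.find("mods:titleInfo/mods:title", MODS_NS)
    return node.text if node is not None else None


if __name__ == "__main__":
    print(title_from_marcxml(MARCXML))
    print(title_from_mods(MODS))
```

Both functions return the same string. The difference is in how much a newcomer has to look up before writing the query.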

The “right” metadata standard

When it comes to choosing the “right” metadata standard it is often about choosing the “right” flavor of XML. VRACore, for example, is more amenable to describing image data. TEI is best suited to describe — mark-up — prose and/or poetry. EAD is the “best” candidate for archival finding aids. Authority data can be represented in a relatively new XML flavor called MADS. METS is used, more or less, to create collections of metadata objects. RDF is similar to METS and is intended to form the basis of the Semantic Web. SKOS is an XML format for thesauri.

In short, there are two things to consider. First, what is your data? Bibliographic? Image? Full texts? Second, what data structure do you want to employ? MARC? XML? Something else such as a tab-delimited file? (Ick!) Or maybe a relational database schema? (Maybe.) In most cases I expect XML will be the data structure you want to employ, and then the question is, “What XML DTD or schema do I want to exploit?”

I allude to many of these issues in an XML workshop I wrote called XML In Libraries.

‘Hope this helps.

Origami is arscient, and so is librarianship

Wednesday, July 30th, 2008

To do origami well a person needs to apply both artistic and scientific methods to the process. The same holds true for librarianship.

Arscience

Arscience is a word I have coined to denote the salient aspects of both art and science. It is a type of thinking (thinquing) that is both intuitive and systematic. It exemplifies synthesis, the bringing together of ideas and concepts, and analysis, the division of our world into smaller and smaller parts. Arscience is my personal epistemological method employing a Hegelian dialectic, an internal discussion. It juxtaposes approaches to understanding including art and science, synthesis and analysis, as well as faith and experience. These epistemological methods can be compared and contrasted, used or exploited, applied and debated against many of the things we encounter in our lives. Through this process I believe a fuller understanding of many things can be achieved.

Origami

A trivial example is origami. On one hand, origami is very artistic. Observe something in the natural world. Examine its essential parts and take notice of their shape. Acquire a piece of paper. Fold the paper to bring the essential parts together to form a coherent whole. The better your observation skills, the better your command of the medium, the better your origami will be.

On the other hand, you can discover that a square can be inscribed on any plane, and upon a square any number of regular polygons can be further inscribed. All through folding. You can then go about bisecting angles and dividing paper in halves, creating symbols denoting different types of folds, and systematically recording the process so it can be shared with others, ultimately creating a myriad of three-dimensional objects from an essentially two-dimensional thing. Unfold the three-dimensional object to expose its mathematics.

Seemingly conflicting approaches to the same problem result in similar outcomes. Arscience.

Librarianship

The same artistic and scientific processes, taken together as an arscient process, can be applied to librarianship. While there are subtle differences between different libraries, they all do essentially the same thing. To some degree they all collect, organize, preserve, and disseminate data, information, and knowledge for the benefit of their respective user populations.

To accomplish these goals the librarian can take both an analysis tack as well as a synthesis tack. Interactions with people are more about politics, feelings, wants, and needs. Such things are not logical but emotional. This is one side of the coin. The other side of the coin includes well-structured processes & workflows, usability studies & statistical analysis, systematic analysis & measurable results. In a hyper-dynamic environment such as the one we are working in, innovation (thinking a bit outside the box) is a necessary ingredient for moving forward. At the same time, it is not all about creativity; it is also about strategically planning for the near-, medium-, and long-term future.

Librarianship requires both. Librarianship is arscient.

TPM — technological protection measures

Sunday, July 20th, 2008

I learned a new acronym a few weeks ago — TPM — which stands for “technological protection measures”, and in the May 2008 issue of College & Research Libraries Kristin R. Eschenfelder wrote an article called “Every library’s nightmare?” and enumerated various types of protection measures employed by publishers to impede the use of electronic scholarly material.

Types of restrictions

In today’s environment, where digital information is increasingly bought, sold, and/or licensed, publishers feel the need to protect their product from duplication. As described by Eschenfelder, these protections — restrictions — come in two forms: soft and hard.

Soft restrictions are “configurations of hardware or software that make certain uses such as printing, saving, copy/pasting, or e-mailing more difficult — but not impossible — to achieve.” The soft restrictions have been divided into the following subtypes:

  • extent of use - page print limits; PDF download limits; data export limits; suspicious use tracking
  • obfuscation - need to select items before options become available
  • omission - not providing buttons or links to enact uses
  • decomposition - saving document results in many files, making recreating or e-mailing the document difficult
  • frustration - page chunking in e-books
  • warning - copyright warnings; end-user licenses on startup

Hard restrictions are “configurations of software or hardware that strictly prevent certain uses.” The hard restrictions have been divided into the following subtypes:

  • restricted copy and paste OCR - OCR exposed for searching, but not for copying and pasting of text
  • secure container TPM - use rights vary by resource

To investigate what types of restrictions were put into everyday practice, Eschenfelder studied a total of about seventy-five resources from three different disciplines (engineering, history, art history) and tallied the types of restrictions employed.

Salient quotes

A few salient quotes from the article exemplify Eschenfelder’s position on TPM:

  • “This paper suggests that the soft restrictions that are present in licensed products may have already changed user’s and librarian’s expectations about what the use rights they ought to expect from vendors and their products.” (Page 207)
  • “One concern is that the library community has already accepted many of the soft use restrictions identified in this paper.” (Page 219)
  • “[Librarians] should also advocate for removal of use restrictions, or encourage new vendors to offer competing restriction-free products.” (Page 219)
  • “A more realistic solution might be a shared knowledge base of vendor interfaces and known use restrictions.” (Page 219)
  • “The paper argues that soft use restrictions deserve more attention from the library community, and that librarians should not accept these restrictions as the natural order of things.” (Page 220)

My commentary

I agree with Eschenfelder.

Many people who work in libraries seem to be there because of the values libraries portray. Examples include but are not limited to: intellectual freedom, education, diversity, equal access to information, preservation of the historical record for future generations, etc. Heaven knows, people who work in libraries are not in it for the money! I fall into the equal access to information camp, and that is why I advocate things like open access publishing and open source software development.

TPM inhibits the free and equal access of information, and I think Eschenfelder makes a good point when she says the “library community has already accepted many of the soft use restrictions.” Why do we accept them? Librarians are not required to purchase and/or license these materials. We have a choice. If much of the scholarly publishing industry is driven by the marketplace, by supply & demand, then why don’t/can’t we just say, “No”? Nobody is forcing us to spend our money this way. If vendors don’t provide the sort of products and services we desire, then the marketplace will change. Right?

In any event, consider educating yourself on the types of TPM and read Eschenfelder’s article.

Top Tech Trends for ALA (Summer ‘08)

Wednesday, June 18th, 2008

Here is a non-exhaustive list of Top Technology Trends for the American Library Association Annual Meeting (Summer, 2008). These Trends represent general directions regarding computing in libraries — short-term future directions where, from my perspective, things are or could be going. They are listed in no priority order.

  • “Bling” in your website - I hate to admit it, but it seems increasingly necessary to make sure your institution’s website is aesthetically appealing. This might seem obvious to you, but considering the fact we all think “content is king” we might have to reconsider. Whether we like it or not, people do judge a book by its cover, and people do judge others on their appearance. Websites aren’t very much different. While librarians are great at organizing information bibliographically, we stink when it comes to organizing things visually. Think graphic design. Break down and hire a graphic designer, and temper their output with usability tests. We all have our various strengths and weaknesses. Graphic designers have something to offer that, in general, librarians lack.
  • Data sets - Increasingly it is not enough for the scholar or researcher to evaluate old texts or do experiments and then write an article accordingly. Instead it is becoming increasingly important to distribute the data and information the scholar or researcher used to come to their conclusions. This data and information needs to be just as accessible as the resulting article. How will this access be sustained? How will it be described and made available? To what degree will it be important to preserve this data and/or migrate it forward in time? These sorts of questions require some thought. Libraries have experience in these regards. Get your foot in the door, and help the authors address these issues.
  • Institutional repositories - I don’t hear as much noise about institutional repositories as I used to hear. I think their lack of popularity is directly related to the problems they are designed to solve, namely, long-term access. Don’t get me wrong, long-term access is definitely a good thing, but that is a library value. In order to be compelling, institutional repositories need to solve the problems of depositors, not the librarians. What do authors get by putting their content in an institutional repository that they don’t get elsewhere? If they supported version control, collaboration, commenting, tagging, better syndication and possibilities for content reuse — in other words, services against the content — then institutional repositories might prove to be more popular.
  • Mobile devices - The iPhone represents a trend in mobile computing. It is both cool and “kewl” for three reasons: 1) its physical interface, complete with pinch and drag touch screen options, makes it easy to use; you don’t need to learn how to write in its language, 2) its always-on and endlessly-accessible connectivity to the Internet makes it trivial to keep in touch, read mail, and “surf the Web”, 3) its software interface is implemented in the form of full-blown applications, not dumbed-down text interfaces with lots of scrolling lists. Apple Computer got it right. Other companies will follow suit. Sooner or later we will all be walking around like people from the Starship Enterprise. “Beam me up, Scotty!” Consider integrating into your services the ability to text the content of library research to a telephone.
  • Net Neutrality - The Internet, by design, is intended to be neutral, but increasingly Internet Service Providers (ISPs) are twisting the term “neutrality” to mean, “If you pay a premium, then we won’t throttle your network connection.” BitTorrent is a good example. This technique exploits the Internet, making file transfers more efficient, but ISPs want to inhibit it and/or charge more for its use. Yet again, the values and morals of a larger, more established community, in this case capitalism, are influencing the Internet. Similar value changes manifested themselves when email became commonplace. Other values, such as not wasting Internet bandwidth by transferring unnecessarily large files over the ‘Net, have changed as both the technology and the numbers of people using the Internet have changed. Take a stand for “Net Neutrality”.
  • “Next generation” library catalogs - The profession has finally figured it out. Our integrated library systems don’t solve the problems of our users. Consequently, the idea of the “next generation” library catalog is all the rage, but don’t get too caught up in features such as Did You Mean?, faceted browse, cover art, or the ability to include a wide variety of content in a single interface. Such things are really characteristics and functions of the underlying index. They are all things designed to make it easier to solve the problem of find, but this is not the problem to be solved. Google makes it easy to find. Really easy. We are unable to compete in that arena. Everybody can find, and we are still “drinking” from the proverbial “fire hose”. Instead, think about ways to enable the patron to use the content they find. Put the content into context. Like the institutional repositories, above, and the open access content, below, figure out ways to make the content useful. Empower the patron. Enable them to apply actions against the content, not just the index. Such things are exemplified by action verbs. Tag. Share. Review. Add. Read. Save. Delete. Annotate. Index. Syndicate. Cite. Compare forward and backward in time. Compare and contrast with other documents. Transform into other formats. Distill. Purchase. Sell. Recommend. Rate. Create flip book. Create tag cloud. Find email address of author. Discuss with colleagues. Etc. The list of services implementable by “next generation” library catalogs is as long as the list of things people do with the content they find in libraries. This is one of the greatest opportunities facing our profession.
  • Open Access Publishing - Like its sister, institutional repositories, I don’t hear as much about open access publishing as I used to hear. We all know it is a “good thing” but like so many things that are “free” its value is only calculated by the amount of money paid for it. “The journals from this publisher are very expensive. We had better promote them and make them readily visible on our website in order for us to get our money’s worth.” In a library setting, the value of material is not based on dollars but rather on things such as, but not limited to, usefulness, applicability, keen insight, scholarship, and timeliness. Open access publishing content manifests these characteristics as much as traditionally published materials. Open access content can be made even more valuable if its open nature were exploited. Like the content found in institutional repositories, and like the functions of “next generation” library catalogs outlined above, the possibilities for providing services against open access content are almost limitless. More than any other content, open access content combined with content from things like the Open Content Alliance and Project Gutenberg can be freely collected, indexed, searched, and then put into the context of the patron. Create bibliography. Trace citation. Find similar words and phrases between articles and books. Take an active role in making open access publishing more of a reality. Don’t wait for the other guy. You are a part of the solution.
  • Social networking - Social networking is beyond a trend. It is all but a fact of the Internet. Facebook, MySpace, and LinkedIn as well as Wikipedia, YouTube, Flickr, and Delicious are probably the archetypical social networking sites. They have very little content of their own. Instead, they provide a platform for others to provide content, and then services against that content. (“Does anybody see a trend in these trends, yet?”) What these social networking sites are exploiting is a new form of the numbers game. Given a wide enough audience it is possible to find and create sets of others interested in just about any topic under the sun. These people will be passionate about their particular topic. They will be sincere, adamant, and arduous about making sure the content is up-to-date, accurate, and thoroughly described and accessible. Put your content into these sorts of platforms in the same way the Library of Congress and the Smithsonian Institution have put some of their content into Flickr. A rising tide floats all boats. Put your boat into the water. Participate in this numbers game. It is not really about people using your library, but rather about people using the content you have made available.
  • Web Services-based APIs - xISBN and thingISBN. The Open Library API. The DLF ILS-DI Technical Recommendation. SRU and OpenSearch. OAI-PMH and now OAI-ORE. RSS and Atom. All of these things are computing techniques called Web Services Application Programmer Interfaces (APIs). They are computer-to-computer interfaces akin to things like Z39.50 of Library Land. They enable computers to unambiguously share data between themselves. A number of years ago implementing Web Services meant learning things like SOAP, WSDL, and UDDI. These things were (are) robust, well-documented, and full-featured. They are also non-trivial to learn. (OCLC’s Terminology Service embedded within Internet Explorer uses these techniques.) After that REST became more popular. It is simpler, and it exploits the features of HTTP. The idea was (is) to send a URL to a remote computer. Get a response back as XML. Transform the response and put it to use, usually by displaying things on a Web page. This is the way most of the services work. (“There’s that word again!”) The latest paradigm and increasingly popular technique uses a data structure called JSON as opposed to XML as the form of the server’s response because JSON is easier to process with JavaScript. This is very much akin to AJAX. Despite the subtle differences between each of these Web Services computing techniques, there is a fundamental commonality. Make a request. Wait. Get a response. Do something with the content to make it useful. Moreover, the returned content is devoid of display characteristics. It is just data. It is your responsibility to turn it into information. Learn to: 1) make your content accessible via Web Services, and 2) aggregate content through Web Services in order to enhance your patron’s experience.
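The request/response pattern described in the Web Services trend above can be sketched in a few lines. Everything specific here is an assumption: the endpoint `http://example.org/search`, the `q` and `format` parameters, and the shape of the JSON response are all invented, and a stand-in `fake_fetch` function plays the part of the remote server so the example runs without a network connection.

```python
import json
from typing import Callable, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base: str, params: Dict[str, str]) -> str:
    """Make a request: encode the parameters into a URL."""
    return base + "?" + urlencode(params)


def http_get(url: str) -> str:
    """Wait, then get a response: fetch the URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def search_titles(base: str, query: str,
                  fetch: Callable[[str], str] = http_get) -> List[str]:
    """Do something with the content: pull the titles out of the raw data."""
    url = build_url(base, {"q": query, "format": "json"})
    payload = json.loads(fetch(url))
    return [doc["title"] for doc in payload.get("docs", [])]


def fake_fetch(url: str) -> str:
    """Stand-in for a remote server; the response shape is invented."""
    return json.dumps({"docs": [{"title": "The Last of the Mohicans"},
                                {"title": "The Deerslayer"}]})


if __name__ == "__main__":
    print(build_url("http://example.org/search", {"q": "cooper", "format": "json"}))
    for title in search_titles("http://example.org/search", "cooper", fetch=fake_fetch):
        print(title)
```

The returned data carries no display characteristics at all. Turning the list of titles into something a patron can use is left to the caller.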

Wow! Where did all of that come from?

(This posting is also available on the LITA Blog. “Lots of copies keep stuff safe.”)