Archive for the ‘Librarianship’ Category

TFIDF In Libraries: Part I of III (For Librarians)

Monday, April 13th, 2009

This is the first of a three-part series called TFIDF In Libraries, where “relevancy ranking” will be introduced. In this part, term frequency/inverse document frequency (TFIDF) — a common mathematical method of weighing texts for automatic classification and sorting search results — will be described. Part II will illustrate an automatic classification system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of pre-defined “Big Names” and/or “Big Ideas” — an idea apparently called “champion lists”.

The problem, straight Boolean logic

To many of us the phrase “relevancy ranked search results” is a mystery. What does it mean to be “relevant”? How can anybody determine relevance for me? Well, a better phrase might have been “statistically significant search results”. Taking such an approach — the application of statistical analysis against texts — does have its information retrieval advantages over straight Boolean logic. Take for example, the following three documents consisting of a number of words, Table #1:

Document #1   Document #2   Document #3
Word          Word          Word
airplane      book          building
blue          car           car
chair         chair         carpet
computer      justice       ceiling
forest        milton        chair
justice       newton        cleaning
love          pond          justice
might         rose          libraries
perl          shakespeare   newton
rose          slavery       perl
shoe          thesis        rose
thesis        truck         science

A search for “rose” against the corpus will return three hits, but which one should I start reading? The newest document? The document by a particular author or in a particular format? Even if the corpus contained 2,000,000 documents and a search for “rose” returned a mere 100 the problem would remain. Which ones should I spend my valuable time accessing? Yes, I could limit my search in any number of ways, but unless I am doing a known item search it is quite likely the search results will return more than I can use, and information literacy skills will only go so far. Ranked search results (a list of hits based on term weighting) have proven to be an effective way of addressing this problem. All it requires is the application of basic arithmetic against the documents being searched.

Simple counting

We can begin by counting the number of times each word appears in each of the documents, Table #2:

Document #1       Document #2       Document #3
Word         C    Word         C    Word         C
airplane     5    book         3    building     6
blue         1    car          7    car          1
chair        7    chair        4    carpet       3
computer     3    justice      2    ceiling      4
forest       2    milton       6    chair        6
justice      7    newton       3    cleaning     4
love         2    pond         2    justice      8
might        2    rose         5    libraries    2
perl         5    shakespeare  4    newton       2
rose         6    slavery      2    perl         5
shoe         4    thesis       2    rose         7
thesis       2    truck        1    science      1
Totals (T)  46                41                49

Given this simple counting method, searches for “rose” can be sorted by its “term frequency” (TF) — the quotient of the number of times a word appears in each document (C), and the total number of words in the document (T) — TF = C / T. In the first case, rose has a TF value of 0.13. In the second case TF is 0.12, and in the third case it is 0.14. Thus, by this rudimentary analysis, Document #3 is most significant in terms of the word “rose”, and Document #2 is the least. Document #3 has the highest percentage of content containing the word “rose”.
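The arithmetic can be sketched in a few lines of Python. This is only an illustration using the counts and totals from Table #2; the function and variable names are made up for the example.

```python
# Counts (C) of "rose" and total words (T) per document, from Table #2.
rose_counts = {"Document #1": 6, "Document #2": 5, "Document #3": 7}
totals = {"Document #1": 46, "Document #2": 41, "Document #3": 49}


def term_frequency(count, total):
    """TF = C / T."""
    return count / total


tf = {doc: term_frequency(rose_counts[doc], totals[doc]) for doc in rose_counts}

# Order the documents from highest to lowest TF.
ranked = sorted(tf, key=tf.get, reverse=True)
for doc in ranked:
    print(f"{doc}: {tf[doc]:.2f}")
```

The loop lists Document #3 first and Document #2 last, matching the ordering described above.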

Accounting for common words

Unfortunately, this simple analysis needs to be offset by considering terms that occur frequently across the entire corpus. Good examples are stop words or the word “human” in MEDLINE. Such words are nearly meaningless because they appear so often. Consider Table #3, which includes the number of documents in the corpus containing each word (DF), and the quotient of the total number of documents (D or in this case, 3) and DF: IDF = D / DF. Words with higher scores are more significant across the entire corpus. Search terms whose IDF (“inverse document frequency”) score approaches 1 are close to useless because they exist in just about every document:

Document #1            Document #2            Document #3
Word        DF  IDF    Word        DF  IDF    Word        DF  IDF
airplane     1  3.0    book         1  3.0    building     1  3.0
blue         1  3.0    car          2  1.5    car          2  1.5
chair        3  1.0    chair        3  1.0    carpet       1  3.0
computer     1  3.0    justice      3  1.0    ceiling      1  3.0
forest       1  3.0    milton       1  3.0    chair        3  1.0
justice      3  1.0    newton       2  1.5    cleaning     1  3.0
love         1  3.0    pond         1  3.0    justice      3  1.0
might        1  3.0    rose         3  1.0    libraries    1  3.0
perl         2  1.5    shakespeare  1  3.0    newton       2  1.5
rose         3  1.0    slavery      1  3.0    perl         2  1.5
shoe         1  3.0    thesis       2  1.5    rose         3  1.0
thesis       2  1.5    truck        1  3.0    science      1  3.0
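The DF and IDF columns can be reproduced with a small Python sketch. It assumes each document is reduced to its set of distinct words (Table #1) and uses the plain ratio D / DF as defined in this posting; many implementations take a logarithm of that ratio instead. All names are illustrative.

```python
# Each document reduced to its set of distinct words, from Table #1.
documents = {
    "Document #1": {"airplane", "blue", "chair", "computer", "forest", "justice",
                    "love", "might", "perl", "rose", "shoe", "thesis"},
    "Document #2": {"book", "car", "chair", "justice", "milton", "newton",
                    "pond", "rose", "shakespeare", "slavery", "thesis", "truck"},
    "Document #3": {"building", "car", "carpet", "ceiling", "chair", "cleaning",
                    "justice", "libraries", "newton", "perl", "rose", "science"},
}


def document_frequency(word, docs):
    """DF = number of documents containing the word."""
    return sum(1 for words in docs.values() if word in words)


def inverse_document_frequency(word, docs):
    """IDF = D / DF (the plain ratio used in this posting, no logarithm)."""
    df = document_frequency(word, docs)
    if df == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return len(docs) / df


for word in ("rose", "perl", "milton"):
    print(word, inverse_document_frequency(word, documents))
```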

Term frequency/inverse document frequency (TFIDF)

By taking into account these two factors, term frequency (TF) and inverse document frequency (IDF), it is possible to assign “weights” to search results and therefore order them statistically. Put another way, a search result’s score (“ranking”) is the product of TF and IDF:

TFIDF = TF * IDF where:

  • TF = C / T where C = number of times a given word appears in a document and T = total number of words in a document
  • IDF = D / DF where D = total number of documents in a corpus, and DF = total number of documents containing a given word

Table #4 is a combination of all the previous tables with the addition of the TFIDF score for each term:

Document #1
Word C T TF D DF IDF TFIDF
airplane 5 46 0.109 3 1 3.0 0.326
blue 1 46 0.022 3 1 3.0 0.065
chair 7 46 0.152 3 3 1.0 0.152
computer 3 46 0.065 3 1 3.0 0.196
forest 2 46 0.043 3 1 3.0 0.130
justice 7 46 0.152 3 3 1.0 0.152
love 2 46 0.043 3 1 3.0 0.130
might 2 46 0.043 3 1 3.0 0.130
perl 5 46 0.109 3 2 1.5 0.163
rose 6 46 0.130 3 3 1.0 0.130
shoe 4 46 0.087 3 1 3.0 0.261
thesis 2 46 0.043 3 2 1.5 0.065
Document #2
Word C T TF D DF IDF TFIDF
book 3 41 0.073 3 1 3.0 0.220
car 7 41 0.171 3 2 1.5 0.256
chair 4 41 0.098 3 3 1.0 0.098
justice 2 41 0.049 3 3 1.0 0.049
milton 6 41 0.146 3 1 3.0 0.439
newton 3 41 0.073 3 2 1.5 0.110
pond 2 41 0.049 3 1 3.0 0.146
rose 5 41 0.122 3 3 1.0 0.122
shakespeare 4 41 0.098 3 1 3.0 0.293
slavery 2 41 0.049 3 1 3.0 0.146
thesis 2 41 0.049 3 2 1.5 0.073
truck 1 41 0.024 3 1 3.0 0.073
Document #3
Word C T TF D DF IDF TFIDF
building 6 49 0.122 3 1 3.0 0.367
car 1 49 0.020 3 2 1.5 0.031
carpet 3 49 0.061 3 1 3.0 0.184
ceiling 4 49 0.082 3 1 3.0 0.245
chair 6 49 0.122 3 3 1.0 0.122
cleaning 4 49 0.082 3 1 3.0 0.245
justice 8 49 0.163 3 3 1.0 0.163
libraries 2 49 0.041 3 1 3.0 0.122
newton 2 49 0.041 3 2 1.5 0.061
perl 5 49 0.102 3 2 1.5 0.153
rose 7 49 0.143 3 3 1.0 0.143
science 1 49 0.020 3 1 3.0 0.061

Given TFIDF, a search for “rose” still returns three documents ordered by Documents #3, #1, and #2. A search for “newton” returns only two items ordered by Documents #2 (0.110) and #3 (0.061). In the latter case, Document #2 is almost twice as “relevant” as Document #3 (0.110 / 0.061 is about 1.8). TFIDF scores can be summed to take into account Boolean unions (or) or intersections (and).
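As a rough sketch of such a search, the Python below computes TFIDF from the counts in Table #2 and ranks the hits. It is not the Perl program promised for Part II; the function names are invented for illustration, and multi-word queries are treated as a Boolean union whose scores are summed.

```python
# Word counts (C) per document, copied from Table #2.
counts = {
    "Document #1": {"airplane": 5, "blue": 1, "chair": 7, "computer": 3,
                    "forest": 2, "justice": 7, "love": 2, "might": 2,
                    "perl": 5, "rose": 6, "shoe": 4, "thesis": 2},
    "Document #2": {"book": 3, "car": 7, "chair": 4, "justice": 2,
                    "milton": 6, "newton": 3, "pond": 2, "rose": 5,
                    "shakespeare": 4, "slavery": 2, "thesis": 2, "truck": 1},
    "Document #3": {"building": 6, "car": 1, "carpet": 3, "ceiling": 4,
                    "chair": 6, "cleaning": 4, "justice": 8, "libraries": 2,
                    "newton": 2, "perl": 5, "rose": 7, "science": 1},
}


def tfidf(word, doc):
    """TFIDF = (C / T) * (D / DF); 0.0 when the word is absent from doc."""
    c = counts[doc].get(word, 0)
    if c == 0:
        return 0.0
    t = sum(counts[doc].values())
    df = sum(1 for words in counts.values() if word in words)
    return (c / t) * (len(counts) / df)


def search(words):
    """Rank documents containing any query word, summing the TFIDF scores."""
    scores = {doc: sum(tfidf(w, doc) for w in words) for doc in counts}
    hits = [(doc, score) for doc, score in scores.items() if score > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


for doc, score in search(["newton"]):
    print(f"{doc}: {score:.3f}")
```

Calling search(["rose"]) returns all three documents, with Document #3 first and Document #2 last.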

Automatic classification

TFIDF can also be applied a priori to indexing/searching to create browsable lists; hence, automatic classification. Consider Table #5 where each word is listed in a sorted TFIDF order:

Document #1           Document #2           Document #3
Word         TFIDF    Word         TFIDF    Word         TFIDF
airplane     0.326    milton       0.439    building     0.367
shoe         0.261    shakespeare  0.293    ceiling      0.245
computer     0.196    car          0.256    cleaning     0.245
perl         0.163    book         0.220    carpet       0.184
chair        0.152    pond         0.146    justice      0.163
justice      0.152    slavery      0.146    perl         0.153
forest       0.130    rose         0.122    rose         0.143
love         0.130    newton       0.110    chair        0.122
might        0.130    chair        0.098    libraries    0.122
rose         0.130    thesis       0.073    newton       0.061
blue         0.065    truck        0.073    science      0.061
thesis       0.065    justice      0.049    car          0.031

Given such a list it would be possible to take the first three terms from each document and call them the most significant subject “tags”. Thus, Document #1 is about airplanes, shoes, and computers. Document #2 is about Milton, Shakespeare, and cars. Document #3 is about buildings, ceilings, and cleaning.

Probably a better way to assign “aboutness” to each document is to first denote a TFIDF lower bound and then assign to each document the terms scoring above it. Assuming a lower bound of 0.2, Document #1 is about airplanes and shoes. Document #2 is about Milton, Shakespeare, cars, and books. Document #3 is about buildings, ceilings, and cleaning.
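Both tagging approaches, the first three terms and the lower bound, can be sketched in Python. The scores are copied from Table #5 (only the top six terms per document, for brevity), and the function names are hypothetical.

```python
# TFIDF scores from Table #5 (top six terms per document, for brevity).
scores = {
    "Document #1": {"airplane": 0.326, "shoe": 0.261, "computer": 0.196,
                    "perl": 0.163, "chair": 0.152, "justice": 0.152},
    "Document #2": {"milton": 0.439, "shakespeare": 0.293, "car": 0.256,
                    "book": 0.220, "pond": 0.146, "slavery": 0.146},
    "Document #3": {"building": 0.367, "ceiling": 0.245, "cleaning": 0.245,
                    "carpet": 0.184, "justice": 0.163, "perl": 0.153},
}


def ranked_terms(term_scores):
    """Terms sorted from highest to lowest TFIDF score."""
    return sorted(term_scores, key=term_scores.get, reverse=True)


def top_terms(term_scores, n=3):
    """The n highest scoring terms."""
    return ranked_terms(term_scores)[:n]


def terms_above(term_scores, lower_bound=0.2):
    """Every term whose score exceeds the lower bound."""
    return [t for t in ranked_terms(term_scores) if term_scores[t] > lower_bound]


for doc, term_scores in scores.items():
    print(doc, top_terms(term_scores), terms_above(term_scores))
```

A fixed number of tags forces every document to have the same number of subjects, while a lower bound lets strongly focused documents collect more tags than diffuse ones.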

Discussion and conclusion

Since the beginning, librarianship has focused on the semantics of words in order to create a cosmos from an apparent chaos. “What is this work about? Read the descriptive information regarding a work (author, title, publisher, date, notes, etc.) to work out in your mind its importance.” Unfortunately, this approach leaves much up to interpretation. One person says this document is about horses, and the next person says it is about husbandry.

The mathematical approach is more objective and much more scalable. While not perfect, there is much less interpretation required with TFIDF. It is just about mathematics. Moreover, it is language independent; it is possible to weigh terms and provide relevance ranking without knowing the meaning of a single word in the index.

In actuality, the whole thing is not an either/or sort of question, but instead a both/and sort of question. Human interpretation provides an added value, definitely. At the same time the application of mathematics (“Can you say ‘science?’”) proves to be quite useful too. The approaches complement each other; they are arscient. Much of how we have used computers in libraries has simply been to automate existing processes. We have still to learn how to truly take advantage of a computer’s functionality. It can remember things a whole lot better than we can. It can add a whole lot faster than we can. Because of this it is almost trivial to calculate ( C / T ) * ( D / DF ) over an entire corpus of 2,000,000 MARC records or even 1,000,000 full text documents.

None of these ideas are new. It is possible to read articles describing these techniques going back about 40 years. Why has our profession not used them to our advantage? Why is it taking us so long? If you have an answer, then enter it in the comment box below.

This first posting has focused on the fundamentals of TFIDF. Part II will describe a Perl program implementing relevancy ranking and automatic classification against sets of given text files. Part III will explore the idea of using TFIDF to enable users to find documents alluding to “great ideas” or “great people”.

Code4Lib Open Source Software Award

Thursday, March 5th, 2009

As a community, let’s establish the Code4Lib Open Source Software Award.

Lots of good work gets produced by the Code4Lib community, and I believe it is time to acknowledge these efforts in some tangible manner. Our profession is full of awards for leadership, particular aspects of librarianship, scholarship, etc. Why not an award for the creation of software? After all, the use of computers and computer software is an essential part of our day-to-day work. Let’s grant an award for something we value: good, quality, open source software.

While I think the idea of an award is a laudable one, I have more questions than answers about the process of implementing it. Is such a thing sustainable, and if so, then how? Who is eligible for the award? Only individuals? Teams? Corporate entities? How are awardees selected? Nomination? Vote? A combination of the two? What qualities should the software exemplify? Something that solves a problem for many people? Something with a high “cool factor”? Great documentation? Easy to install? Well-supported with a large user base? Developed within the past year?

As a straw man for discussion, I suggest something like the following:

  • Regarding selection, I suggest there be a committee who solicits nominations and selects the awardee(s). As the years go by an individual from the committee drops off and the/an awardee becomes a member.
  • Regarding who is eligible, I suggest it be individuals, teams, or corporate entities. Awardees must be willing to serve on the next year’s nominating committee.
  • Regarding what is eligible, I suggest the software be open source, directly library-related, and developed within the past two years.
  • Regarding the timing, I suggest this be an annual award given at each Code4Lib conference.

These are just suggestions to get us started. What do you think? Consider sharing your thoughts as comments below, in channel, or on the Code4Lib mailing list.

Eric Lease Morgan’s Top Tech Trends for ALA Mid-Winter, 2009

Monday, February 9th, 2009

This is a list of “top technology trends” written for ALA Mid-Winter, 2009. They are presented in no particular order. [This text was originally published on the LITA Blog, but it is duplicated here because "lots of copies keep stuff safe." --ELM]

Indexing with Solr/Lucene works well – Lucene seems to have become the gold standard when it comes to open source indexer/search engine platforms. Solr, a Web Services interface to Lucene, is increasingly the preferred way to read & write Lucene indexes. Librarians love to create lists. Books. Journals. Articles. Movies. Authoritative names and subjects. Websites. Etc. All of these lists beg for organization. Thus, (relational) databases. But lists need to be short, easily sortable, and/or searchable in order to be useful as finding aids. Indexers make things searchable, not databases. The library profession needs to get its head around the creation of indexes. The Solr/Lucene combination is a good place to start, er, catch up.

Linked data is a new name for the Semantic Web – The Semantic Web is about creating conceptual relationships between things found on the Internet. Believe it or not, the idea is akin to the ultimate purpose of a traditional library card catalog. Have an item in hand. Give it a unique identifier. Systematically describe it. Put all the descriptions in one place and allow people to navigate the space. By following the tracings it is possible to move from one manifestation of an idea to another ultimately providing the means to the discovery, combination, and creation of new ideas. The Semantic Web is almost exactly the same thing except the “cards” are manifested using RDF/XML on computers through the Internet. From the beginning RDF has gotten a bad name. “Too difficult to implement, and besides the Semantic Web is a thing of science fiction.” Recently the term “linked data” has been used to denote the same process of creating conceptual relationships between things on the ‘Net. It is the Semantic Web by a different name. There is still hope.

Blogging is peaking – There is no doubt about it. The Blogosphere is here to stay, yet people have discovered that it is not very easy to maintain a blog for the long haul. The technology has made it easier to compose and distribute one’s ideas, much to the chagrin of newspaper publishers. On the other hand, the really hard work is coming up with meaningful things to say on a regular basis. People have figured this out, and consequently many blogs have gone by the wayside. In fact, I’d be willing to bet that the number of new blogs is decreasing, and the number of postings to existing blogs is decreasing as well. Blogging was “kewl”; it is cool but also hard work. Blogging is peaking. And by the way, I dislike those blogs which are only partially syndicated. They allow you to read the first 256 characters or so of an entry, and then encourage you to go to their home site to read the whole story whereby you are bombarded with loads of advertising.

Word/tag clouds abound – It seems very fashionable to create word/tag clouds now-a-days. When you get right down to it, word/tag clouds are a whole lot like concordances — one of the first types of indexes. Each word (or tag) in a document is itemized and counted. Stop words are removed, and the results are sorted either alphabetically or numerically by count. This process — especially if it were applied to significant phrases — could be a very effective and visual way to describe the “aboutness” of a file (electronic book, article, mailing list archive, etc.). An advanced feature is to hyperlink each word, tag, or phrase to specific locations in the file. Given a set of files on similar themes, it might be interesting to create word/tag clouds against them in order to compare and contrast. Hmmm…
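The counting behind a word cloud or concordance can be sketched in Python as follows. The stop word list and the sample sentence are made up for illustration, and a real stop word list would be far longer.

```python
import re
from collections import Counter

# A tiny, illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def word_counts(text, stop_words=STOP_WORDS):
    """Itemize and count each word in the text, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


sample = "The scout and the chief heard the scout call to the chief of the scout party."
counts = word_counts(sample)

by_count = counts.most_common()       # sorted numerically by count
alphabetical = sorted(counts.items())  # sorted alphabetically by word

print(by_count)
print(alphabetical)
```

In a cloud, the count would typically drive the font size of each word.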

“Next Generation” library catalogs seem to be defined – From my perspective, the profession has stopped asking questions about the definition of “next generation” library catalogs. I base this statement on two things. First, the number of postings and discussion on a mailing list called NGC4Lib has dwindled. There are fewer questions and even less discussion. Second, the applications touting themselves, more or less, as “next generation” library catalog systems all have similar architectures. Ingest content from various sources. Normalize it into an internal data structure. Store the normalized data. Index the normalized data. Provide access to the index as well as services against the index such as tag, review, and Did You Mean? All of this is nice, but it really isn’t very “next generation”. Instead it is slightly more of the same. An index allows people to find, but people are still drinking from the proverbial fire hose. Anybody can find. In my opinion, the current definition of “next generation” does not go far enough. Library catalogs need to provide an increased number of services against the content, not just services against the index. Compare & contrast. Do morphology against. Create word cloud from. Translate. Transform. Buy. Review. Discuss. Share. Preserve. Duplicate. Trace idea, citation, and/or author forwards & backwards. It is time to go beyond novel ways to search lists.

SRU is becoming more viable – SRU (Search/Retrieve via URL) is a Web Services-based protocol for searching databases/indexes. Send a specifically shaped URL to a remote HTTP server. Get back a specifically shaped response. SRU has been joined with a no-longer competing standard called OpenSearch in the form of an Abstract Protocol Definition, and the whole is on its way to becoming an OASIS standard. Just as importantly, an increasing number of the APIs supporting the external-facing OCLC Grid Services (WorldCat, Identities, Registries, Terminologies, Metadata Crosswalk) use SRU as the query interface. SRU has many advantages, but some of those advantages are also disadvantages. For example, its query language (CQL) is expressive, especially compared to OpenSearch or Google, but at the same time, it is not easy to implement. Second, the nature of SRU responses can range from rudimentary and simple to obtuse and complicated. Moreover, the response is always in XML. These factors make transforming the response for human consumption sometimes difficult to implement. Despite all these things, I think SRU is a step in the right direction.
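To make the “specifically shaped URL” concrete, here is a minimal Python sketch that builds an SRU searchRetrieve request. The base URL is a placeholder, not a real server, and the CQL query and the helper name are only examples; the parameter names (operation, version, query, maximumRecords) come from the SRU 1.1 specification.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, maximum_records=10, version="1.1"):
    """Build an SRU searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    # urlencode escapes spaces, quotes, and equals signs in the query.
    return base_url + "?" + urlencode(params)


# http://example.org/sru is a hypothetical endpoint used only for illustration.
url = sru_search_url("http://example.org/sru", 'dc.title = "rose"')
print(url)
```

Sending the URL with any HTTP client should return an XML searchRetrieveResponse, which then needs to be transformed for human consumption.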

The pendulum of data ownership is swinging – I believe it was Francis Bacon who said, “Knowledge is power”. In my epistemological cosmology, knowledge is based on information, and information is based on data. (Going the other way, knowledge leads to wisdom, but that is another essay.) Therefore, he who owns or has access to the data will ultimately have more power. Google increasingly has more data than just about anybody. They have a lot of power. OCLC increasingly “owns” the bibliographic data created by its membership. Ironically, this data, in the case of both Google and OCLC, is not freely available, even when the data was created for the benefit of the wider whole. I see this movement akin to the movement of a pendulum swinging one way and then the other. On my more pessimistic days I view it as a battle. On my calmer days I see it as a natural tendency, a give and take. Many librarians I know are in the profession, not for the money, but to support some sort of cause. Intellectual freedom. The right to read. Diversity. Preservation of the historical record. If I have a cause, then it is about the free and equal access to information. This is why I advocate open access publishing, open source software, and Net Neutrality. When data and information are “owned” and “sold” an environment of information haves and have-nots manifests itself. Ultimately, this leads to individual gain but not necessarily the improvement of the human condition as a whole.

The Digital Dark Age continues – We, as a society, are continuing to create a Digital Dark Age. Considering all of the aspects of librarianship, the folks who deal with preservation, conservation, and archives have the toughest row to hoe. It is ironic. On one hand there is more data and information available than just about anybody knows what to do with. On the other hand, much of this data and information will not be readable, let alone available, in the foreseeable future. Somebody is going to want to do research on the use of blogs and email. What libraries are archiving this data? We are writing reports and summaries in binary and proprietary formats. Such things are akin to music distributed on 8-track tapes. Where are the gizmos enabling us to read these formats? We increasingly license our most desired content, scholarly journal articles, and in the end we don’t own anything. With the advent of Project Gutenberg, Google Books, and the Open Content Alliance the numbers of freely available electronic books rival the collections of many academic libraries. Who is collecting these things? Do we really want to put all of our eggs into one basket and trust these entities to keep them for the long haul? The HathiTrust understands this phenomenon, and “Lots of copies keep stuff safe.” Good. In the current environment of networked information, we need to re-articulate the definition of “collection”.

Finally, regarding change. It manifests itself along a continuum. At one end is evolution. Slow. Many false starts. Incremental. At the other end is revolution. Fast. Violent. Decisive. Institutions and their behaviors change slowly. Otherwise they wouldn’t be the same institutions. Librarianship is an institution. Its behavior changes slowly. This is to be expected.

Mr. Serials is dead. Long live Mr. Serials

Sunday, January 11th, 2009

This posting describes the current state of the Mr. Serials Process.

Background

Round about 1994 when I was employed by the North Carolina State University Libraries, Susan Nutter, the Director, asked me to participate in an ARL Collection Analysis Project (CAP). The goal of the Project was to articulate a mission/vision statement for the Libraries’ fledgling Collection Development Department. “It will be a professional development opportunity”, she told me. I don’t think she knows how much of an opportunity it really was.

Through the CAP I, along with a number of others (Margaret Hunt, John Abbott, Caroline Argentati, and Orion Pozo) became acutely aware of the “serials pricing crisis”. Academic writes article. Article gets peer-reviewed. Publisher agrees to distribute article in exchange for copyright. Article gets published in journal. Library subscribes to journal at an ever-increasing price. Academic reads journal. Repeat.

The whole “crisis” made me frustrated (angry), and others were frustrated too. Why did prices need to be increasing so dramatically? Why couldn’t the Academe coordinate peer-review? Why couldn’t the Internet be used as a distribution medium? Some people tried to answer some of these questions differently than the norm, and the result was the creation of electronic journals distributed via email such as the venerable Bryn Mawr Classical Review, Psycoloquy, Postmodern Culture, and PACS Review.

Given this environment, I sought to be a part of the solution instead of perpetuating the problem. I created the Mr. Serials Process: a set of applications/scripts that collected, archived, indexed, and re-distributed sets of electronic journals. I figured I could demonstrate to the library and academic communities that if everybody does their part, then there would be less of a need for commercial publishers, entities who were exploiting the system and more interested in profit than the advancement of knowledge. Mr. Serials was “born” around 1994 and documented in an article from Serials Review. Mr. Serials, now 14 years old, would be considered a child by most people’s standards. Yet, fourteen years is a long time in Internet years.

Mr. Serials is dead

For all intents and purposes, Mr. Serials is dead because his process was based on the distribution of electronic serials via email. His death was long and drawn out. The final nail driven into his coffin came when ACQNET, one of the original “journals” he collected, moved from Appalachian State University to iBiblio a few months ago. After the move Mr. Serials was no longer considered the official archivist of the content, and his era had passed.

This is not a big deal. Change happens. Processes evolve. Besides, Mr. Serials created a legacy for himself, a set of early electronic serial literature exemplifying the beginnings of networked scholarly communication which includes more than thirty titles archived at serials.infomotions.com.

Long live Mr. Serials

At the same time, Mr. Serials is alive and well. Maybe, like many people his age, he is going through an adolescence.

In the middle 1990s electronic journals were distributed via email. As such the Mr. Serials Process used procmail to filter incoming mail. He then used a Hypercard program to create configuration files denoting the locations of bibliographic data in journal titles. He then used a Perl program reading the configuration files, automatically extracting the bibliographic information from each issue, removing the email header, and saving the resulting journal article in a specified location. Initially, the whole collection was made available via a Gopher server and indexed with WAIS. Later, the collection was made available via an HTTP server and other indexing technologies were used but many of them are broken.
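The header-stripping step can be illustrated with a short Python sketch. This is not the original Perl program; it assumes the bibliographic data lives in the Subject and Date headers, which is a simplification, and the sample message and function name are invented.

```python
from email import message_from_string

# A made-up issue as it might arrive by email (RFC 822 format).
raw_message = """From: editor@example.org
Subject: Example Review 94.1.1
Date: Mon, 3 Jan 1994 12:00:00 -0500

This is the text of the article.
"""


def split_issue(raw):
    """Separate bibliographic headers from the article text."""
    message = message_from_string(raw)
    metadata = {"title": message["Subject"], "date": message["Date"]}
    body = message.get_payload()  # the text with the email header removed
    return metadata, body


metadata, body = split_issue(raw_message)
print(metadata)
print(body)
```

The body could then be saved to a location derived from the metadata, much as the Process saved each article in a specified place.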

Somewhere along the line, some of the “journals” became mailing lists, and the Process was modified to take advantage of an archiving program called Hypermail. Like the original Process, the archived materials are accessible via a Web server and indexed with some sort of search engine technology. (There have been so many.) With the movement of ACQNET, the original “journals” have all gone away, but Mr. Serials has picked up a few mailing lists along the way, notably colldv-l, Code4Lib, and NGC4Lib. Consequently, Mr. Serials is not really dead, just transformed.

A lot of the credit goes to procmail, Hypermail, Web servers, and indexers. Procmail reads incoming mail and processes it accordingly. File it here. File it there. Delete it. Send it off to another process. Hypermail makes pretty email archives which are more or less configurable. It allows one to keep email messages in their original RFC 822 (mbox) format and reuse them for many purposes. We all know what HTTP servers do. Indexers complement the Hypermail process by providing searchable interfaces to the collection. The indexer used against colldv-l, Code4Lib, and NGC4Lib is called KinoSearch and is implemented through an SRU interface.

Mr. Serials is a modern day library process. It has a set of collection development goals. It acquires content. It organizes content. It archives and preserves content. It redisseminates content. The content it currently collects may not be extraordinarily scholarly, but someday somebody is going to want it. It is a special collection. Much of its success is a testament to open source software. All the tools it uses are open source. In fact most of them were distributed as open source even before the phrase was coined.

Long live Mr. Serials.

Snow blowing and librarianship

Sunday, December 7th, 2008

I don’t exactly know why, but I enjoy snow blowing.

snow blower


I think it began when I was in college. My freshman year I stayed on during January, earning money from Building & Grounds. For much of the time they simply said, “Go shovel some snow.” It was quiet, peaceful, and solitary. It was physical labor. It was a good time to think, and the setting was inspirational.

A couple of years later, in order to fulfill a graduation requirement, I needed to design and complete a “social practicum”. I decided to shovel snow for my neighbors. Upon asking them for permission, I got a lot of strange looks. “Why would you want to shovel my snow?”, they’d ask. I’d say, “Because I am more able to do it than you. I’m just being helpful and providing a social service.” Surprisingly, many people did not take me up on my offer, but a few did.

I now live and work in northern Indiana only forty-five minutes from Lake Michigan where “lake effect” snow is common. I own a big, bad snowblower. It gives me a sense of power, and even though it disturbs the quiet, I enjoy the process of cleaning my driveway and sidewalk. I enjoy trying to figure out the most efficient way to get the job done. I enjoy it so much I even snow blow around the block.

Snow blowing and librarianship

What does this have to do with librarianship? In reality, not a whole lot. On the other hand, one of the aspects of librarianship, especially librarianship in public libraries, is community service — providing means for improving society. My clearing of snow for my neighbors is done in a similar vein, and it works for me. I can do something for my fellow man and have fun at the same time. Weird?

P.S. Mowing the grass gives me the same sort of feelings.

MyLibrary: A Digital library framework & toolbox

Wednesday, September 17th, 2008

I recently had an article published in Information Technology and Libraries (ITAL) entitled “MyLibrary: A Digital library framework & toolkit” (volume 27, number 3, pages 12-24, September 2008). From the abstract:

This article describes a digital library framework and toolkit called MyLibrary. At its heart, MyLibrary is designed to create relationships between information resources and people. To this end, MyLibrary is made up of essentially four parts: 1) information resources, 2) patrons, 3) librarians, and 4) a set of locally-defined, institution-specific facet/term combinations interconnecting the first three. On another level, MyLibrary is a set of object-oriented Perl modules intended to read and write to a specifically shaped relational database. Used in conjunction with other computer applications and tools, MyLibrary provides a way to create and support digital library collections and services. Librarians and developers can use MyLibrary to create any number of digital library applications: full-text indexes to journal literature, a traditional library catalog complete with circulation, a database-driven website, an institutional repository, an image database, etc. The article describes each of these points in greater detail.

http://infomotions.com/musings/mylibrary-framework/

The folks at ITAL are gracious enough to allow authors to distribute their work on the Web as long as the distribution happens after print publication. “Nice policy!”

Many people will remember MyLibrary from more than ten years ago. It is alive and well. It drives a few digital library projects at Notre Dame. It is often associated with customization/personalization, but now it is more about creating relationships between people and information resources through an institution-defined controlled vocabulary — a set of facet/term combinations.

MyLibrary is about relationships

In my opinion, libraries spend too much time describing resources and creating interdependencies between them. Instead, I think libraries should be spending more time creating relationships between resources and people. You can do this in any number of ways, and sets of facet/term combinations are just one. Think up qualities used to describe people. Think up qualities used to describe information resources. Create relationships by bringing resources and people together that share qualities.
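The idea of bringing people and resources together through shared qualities can be sketched with Python sets. This is not the MyLibrary Perl API; the facet/term pairs, resource names, and patron identifiers below are all hypothetical.

```python
# Hypothetical facet/term combinations describing resources and patrons.
resources = {
    "Journal of Example Astronomy": {("Subjects", "Astronomy"), ("Formats", "Journals")},
    "Example Philosophy Index": {("Subjects", "Philosophy"), ("Formats", "Indexes")},
}
patrons = {
    "patron-a": {("Subjects", "Astronomy")},
    "patron-b": {("Subjects", "Philosophy"), ("Formats", "Journals")},
}


def recommend(patron_terms, all_resources):
    """Resources sharing at least one facet/term combination with a patron."""
    return sorted(
        name for name, terms in all_resources.items() if terms & patron_terms
    )


for patron, terms in patrons.items():
    print(patron, recommend(terms, resources))
```

Requiring more than one shared combination, or weighting some facets over others, would be natural refinements.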

Last of the Mohicans and services against texts

Monday, August 25th, 2008

Here is a word cloud representing James Fenimore Cooper’s The Last of the Mohicans; A narrative of 1757. It is a trivial example of how libraries can provide services against documents, not just the documents themselves.

scout  heyward  though  duncan  uncas  little  without  own  eyes  before  hawkeye  indian  young  magua  much  place  long  time  moment  cora  hand  again  after  head  returned  among  most  air  huron  toward  well  few  seen  many  found  alice  manner  david  hurons  voice  chief  see  words  about  know  never  woods  great  rifle  here  until  just  left  soon  white  heard  father  look  eye  savage  side  yet  already  first  whole  party  delawares  enemy  light  continued  warrior  water  within  appeared  low  seemed  turned  once  same  dark  must  passed  short  friend  back  instant  project  around  people  against  between  enemies  way  form  munro  far  feet  nor  

About the story

While I am not a literary scholar, I am able to read a book and write a synopsis.

Set during the French and Indian War in what was to become upper New York State, two young women are being escorted from one military camp to another. Along the way the hero, Natty Bumppo (also known by quite a number of other names, most notably “Hawkeye” or the “scout”), alerts the convoy that their guide, Magua, is treacherous. Sure enough, Magua kidnaps the women. Fights and battles ensue in a pristine and idyllic setting. Heroic deeds are accomplished by Hawkeye and the “last of the Mohicans” — Uncas. Everybody puts on disguises. In the end, good triumphs over evil but not completely.

Cooper’s style is verbose. Expressive. Flowery. On this level it was difficult to read. Too many words. On the other hand, the style was consistent, provided a sort of pattern, and enabled me to read the novel with a certain rhythm.

There were a couple of things I found particularly interesting. First, the allusion to “relish”. I consider this to be a common term now-a-days, but Cooper thought it needed elaboration when used to describe food. Cooper used the word within a relatively short span of text to describe a condiment as well as a feeling. Second, I wonder whether or not Cooper’s description of Indians built on existing stereotypes or created them. “Hugh!”

Services against texts

The word cloud I created is simple and rudimentary. From my perspective, it is just a graphical representation of a concordance, and a concordance has to be one of the most basic of indexes. This particular word cloud (read “concordance” or “index”) allows the reader to get a sense of a text. It puts words in context. It allows the would-be reader to get an overview of the document.

This particular implementation is not pretty, nor is it quick, but it is functional. How could libraries create other services such as these? Everybody can find and get data and information these days. What people desire is help understanding and using the documents. Providing services against texts such as word clouds (concordances) might be one example.
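
To make the idea concrete, here is a minimal sketch in Python of the two things a word cloud rests on: a tally of word frequencies and a keyword-in-context (concordance) listing. It is not the program used to create the cloud above. The stop word list is a tiny illustrative one, the function names are hypothetical, and the sample text is a made-up sentence, not a quotation from the novel.

```python
import re
from collections import Counter

# A tiny, illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "was", "his", "he", "that"}


def word_counts(text, stop_words=STOP_WORDS):
    """Tally how often each non-stop word occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


def concordance(text, keyword, width=30):
    """List each occurrence of keyword with `width` characters of context."""
    lines = []
    pattern = r"\b%s\b" % re.escape(keyword)
    for match in re.finditer(pattern, text, re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        lines.append(text[start:end].replace("\n", " "))
    return lines


# Made-up sample text for demonstration purposes only.
sample = "The scout watched the woods. The scout raised his rifle, and the woods fell silent."

counts = word_counts(sample)
print(counts.most_common(5))
print(concordance(sample, "rifle", width=15))
```

Scaling the font size of each word by its count is all it takes to turn the tally into a cloud.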

Metadata and data structures

Tuesday, August 5th, 2008

It is important to understand the differences between metadata and data structures. This posting outlines some of the differences between the two.

Introduction

Every once in a while people ask me for advice that I am usually very happy to give because the answers usually involve succinctly articulating some of the things floating around in my head. Today someone asked:

I’ve been looking at Dublin Core and looking at MODS to arrive at the best metadata for converting MARC records into human readable format. Dublin Core lacks specificity, but maybe I don’t understand it that well. Plus, I cannot find what parts of the MARC are mapped to what–where are the “rules.” I look at Mods and find it overwhelming and I’m not even sure of its intended purpose.

Below is how I replied.

Dublin Core is a list of element names

First of all, please understand that Dublin Core is really just a list of fifteen or so metadata element names. Title. Creator. Publisher. Format. Identifier. Etc. Moreover, each of these names comes with a simple definition denoting the type of content it is expected to represent. Dublin Core is NOT a metadata format. Dublin Core does not define how data should be encoded. It is simply a list of elements.
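
A small sketch in Python may make the distinction clearer. The same three Dublin Core element names are poured into two different containers, JSON and XML. The description values, the wrapping "metadata" element, and the function names are assumptions made for illustration. Only the element names and the Dublin Core namespace URI come from the standard.

```python
import json
import xml.etree.ElementTree as ET

# The Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Element names are Dublin Core; the values are illustrative.
description = {
    "title": "The Last of the Mohicans",
    "creator": "Cooper, James Fenimore",
    "format": "text/plain",
}


def as_json(description):
    """Encode the description as a JSON object."""
    return json.dumps(description, sort_keys=True)


def as_xml(description):
    """Encode the same description as namespaced XML elements."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in sorted(description.items()):
        ET.SubElement(root, "{%s}%s" % (DC_NS, name)).text = value
    return ET.tostring(root, encoding="unicode")


print(as_json(description))
print(as_xml(description))
```

The names stay the same in both outputs. Only the data structure changes.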

MARC and XML as data structures

MARC is a metadata format — a data structure — a container — a “bit bucket”. The MARC standard defines how data should be encoded. First there is a leader. It is always 24 characters long and different characters in the leader denote different things. Then there is the directory — a “map” of where the data resides in the file. Finally, there is the data itself which is divided into indicators, fields, and subfields. This MARC standard has been used to hold bibliographic data as well as authority data. In one case the 245 field is intended to encode title/author information. In another case the 245 means something else. In both cases they are using MARC — a data structure.
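
The leader/directory/data layout can be sketched in a few lines of Python. This is a simplified illustration, not a replacement for a real MARC library. It assumes ASCII-only data (so character counts equal byte counts), builds a tiny made-up record, and then reads it back by following the leader and the directory. The function names and the sample field values are hypothetical.

```python
FIELD_TERMINATOR = "\x1e"
RECORD_TERMINATOR = "\x1d"
SUBFIELD_DELIMITER = "\x1f"


def build_record(fields):
    """Assemble a minimal MARC-like record from (tag, data) pairs."""
    directory, body = "", ""
    for tag, data in fields:
        data += FIELD_TERMINATOR
        # Directory entry: 3-character tag, 4-digit length, 5-digit start.
        directory += "%s%04d%05d" % (tag, len(data), len(body))
        body += data
    directory += FIELD_TERMINATOR
    base = 24 + len(directory)          # where the data begins
    length = base + len(body) + 1       # plus the record terminator
    leader = "%05dnam a22%05d a 4500" % (length, base)  # always 24 characters
    return leader + directory + body + RECORD_TERMINATOR


def parse_record(record):
    """Use the leader and directory to pull the fields out of a record."""
    leader = record[:24]
    base = int(leader[12:17])           # base address of data
    directory = record[24:base - 1]     # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = record[base + start:base + start + length].rstrip(FIELD_TERMINATOR)
        if tag < "010":                 # control fields: no indicators or subfields
            fields.append((tag, None, data))
        else:
            indicators, *subfields = data.split(SUBFIELD_DELIMITER)
            fields.append((tag, indicators, [(s[0], s[1:]) for s in subfields]))
    return leader, fields


# Made-up sample data for demonstration.
sample_fields = [
    ("001", "demo0001"),
    ("100", "1 " + SUBFIELD_DELIMITER + "aCooper, James Fenimore"),
    ("245", "14" + SUBFIELD_DELIMITER + "aThe last of the Mohicans"),
]

record = build_record(sample_fields)
leader, fields = parse_record(record)
for field in fields:
    print(field)
```

Notice that nothing in the structure says what a 245 field means. That knowledge lives outside the container.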

XML is a second type of data structure. Instead of leaders, directories, and data sections, XML is made up of nested elements where the elements of the file are denoted by a Document Type Definition (DTD) or XML schema. XML is much more flexible than MARC. XML is much more verbose than MARC. There are many industries supporting XML. MARC is supported by a single industry. MARC was cool in its time, but it has grown long in the tooth. XML is definitely the data structure to use now-a-days.

MARCXML and MODS

MARCXML is a specific flavor of XML used to contain 100% of the data in a bibliographic MARC file. It works. It does what it is supposed to do, but in order to really take advantage of it the user needs to know that the 245 field contains title information, the 100 field contains author information, etc. In other words, to use MARCXML the user needs to know the “secret code book” translating library tags into human-readable elements. Moreover, MARCXML retains all of the “syntactical” sugar of MARC. Last name first. First name last. Parentheses around birth and death dates. “pbk” to denote paperback. Etc.

MODS is a second flavor of XML also designed to contain bibliographic data. In at least a couple of ways, MODS is much better than MARCXML. First and foremost, MODS removes the need for the “secret code book” because the element names are human-readable, not integers. Second, some, but not all, of the syntactical sugar is removed.

When it comes to bibliographic data, I advocate MODS over MARCXML any day. Not perfect, but a step in the right direction. There are utilities to convert MARC to MARCXML and then to MODS. Conversion is almost a trivial computing problem to solve.
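
To show the difference between the two flavors, here is a minimal sketch in Python that reads a small MARCXML record and writes a few MODS elements. It is not one of the conversion utilities mentioned above, and it handles only the 245 and 100 fields. The sample record is hand-made for illustration, and the function names are hypothetical. The namespaces and the element names (datafield, subfield, titleInfo, title, name, namePart) are the ones MARCXML and MODS define.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
MODS_NS = "http://www.loc.gov/mods/v3"

# A hand-made MARCXML record for demonstration.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Cooper, James Fenimore,</subfield>
    <subfield code="d">1789-1851.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="4">
    <subfield code="a">The last of the Mohicans</subfield>
  </datafield>
</record>"""


def subfield(record, tag, code):
    """Look up a subfield by its numeric tag and code (the 'secret code book')."""
    path = "m:datafield[@tag='%s']/m:subfield[@code='%s']" % (tag, code)
    node = record.find(path, {"m": MARC_NS})
    return node.text if node is not None else None


def to_mods(record):
    """Copy title and author into human-readable MODS elements."""
    ET.register_namespace("mods", MODS_NS)
    mods = ET.Element("{%s}mods" % MODS_NS)
    title_info = ET.SubElement(mods, "{%s}titleInfo" % MODS_NS)
    ET.SubElement(title_info, "{%s}title" % MODS_NS).text = subfield(record, "245", "a")
    name = ET.SubElement(mods, "{%s}name" % MODS_NS)
    # Strip trailing punctuation: some, but not all, of the sugar is removed.
    author = subfield(record, "100", "a").rstrip(" ,.")
    ET.SubElement(name, "{%s}namePart" % MODS_NS).text = author
    return mods


mods = to_mods(ET.fromstring(MARCXML))
print(ET.tostring(mods, encoding="unicode"))
```

The author's name is still "last name first" in the output, which is an example of the syntactical sugar that survives the trip.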

The “right” metadata standard

When it comes to choosing the “right” metadata standard it is often about choosing the “right” flavor of XML. VRACore, for example, is more amenable to describing image data. TEI is best suited to describe — mark-up — prose and/or poetry. EAD is the “best” candidate for archival finding aids. Authority data can be represented in a relatively new XML flavor called MADS. METS is used, more or less, to create collections of metadata objects. RDF is similar to METS and is intended to form the basis of the Semantic Web. SKOS is an XML format for thesauri.

In short, there are two things to consider. First, what is your data? Bibliographic? Image? Full texts? Second, what data structure do you want to employ? MARC? XML? Something else such as a tab-delimited file? (Ick!) Or maybe a relational database schema? (Maybe.) In most cases I expect XML will be the data structure you want to employ, and then the question is, “What XML DTD or schema do I want to exploit?”

I allude to many of these issues in an XML workshop I wrote called XML In Libraries.

‘Hope this helps.

Origami is arscient, and so is librarianship

Wednesday, July 30th, 2008

To do origami well a person needs to apply both artistic and scientific methods to the process. The same holds true for librarianship.

Arscience

Arscience is a word I have coined to denote the salient aspects of both art and science. It is a type of thinking — thinquing — that is both intuitive as well as systematic. It exemplifies synthesis — the bringing together of ideas and concepts — and analysis — the division of our world into smaller and smaller parts. Arscience is my personal epistemological method employing a Hegelian dialectic — an internal discussion. It juxtaposes approaches to understanding including art and science, synthesis and analysis, as well as faith and experience. These epistemological methods can be compared and contrasted, used or exploited, applied and debated against many of the things we encounter in our lives. Through this process I believe a fuller understanding of many things can be achieved.

Origami

A trivial example is origami. On one hand, origami is very artistic. Observe something in the natural world. Examine its essential parts and take notice of their shape. Acquire a piece of paper. Fold the paper to bring the essential parts together to form a coherent whole. The better your observation skills, the better your command of the medium, the better your origami will be.

On the other hand, you can discover that a square can be inscribed on any plane, and upon a square any number of regular polygons can be further inscribed. All through folding. You can then go about bisecting angles and dividing paper in halves, creating symbols denoting different types of folds, and systematically recording the process so it can be shared with others, ultimately creating a myriad of three-dimensional objects from an essentially two-dimensional thing. Unfold the three-dimensional object to expose its mathematics.

Seemingly conflicting approaches to the same problem result in similar outcomes. Arscience.

Librarianship

The same artistic and scientific processes — an arscient process — can be applied to librarianship. While there are subtle differences between different libraries, they all do essentially the same thing. To some degree they all collect, organize, preserve, and disseminate data, information, and knowledge for the benefit of their respective user populations.

To accomplish these goals the librarian can take both an analysis tack as well as a synthesis tack. Interactions with people are more about politics, feelings, wants, and needs. Such things are not logical but emotional. This is one side of the coin. The other side of the coin includes well-structured processes & workflows, usability studies & statistical analysis, systematic analysis & measurable results. In a hyper-dynamic environment, such as the one we are working in, innovation — thinking a bit outside the box — is a necessary ingredient for moving forward. At the same time, it is not all about creativity; it is also about strategically planning for the near, medium, and long term future.

Librarianship requires both. Librarianship is arscient.

TPM — technological protection measures

Sunday, July 20th, 2008

I learned a new acronym a few weeks ago — TPM — which stands for “technological protection measures”, and in the May 2008 issue of College & Research Libraries Kristin R. Eschenfelder wrote an article called “Every library’s nightmare?” and enumerated various types of protection measures employed by publishers to impede the use of electronic scholarly material.

Types of restrictions

In today’s environment, where digital information is increasingly bought, sold, and/or licensed, publishers feel the need to protect their product from duplication. As described by Eschenfelder, these protections — restrictions — come in two forms: soft and hard.

Soft restrictions are “configurations of hardware or software that make certain uses such as printing, saving, copy/pasting, or e-mailing more difficult — but not impossible — to achieve.” The soft restrictions have been divided into the following subtypes:

  • extent of use – page print limits; PDF download limits; data export limits; suspicious use tracking
  • obfuscation – need to select items before options become available
  • omission – not providing buttons or links to enact uses
  • decomposition – saving document results in many files, making recreating or e-mailing the document difficult
  • frustration – page chunking in e-books
  • warning – copyright warnings; end-user licenses on startup

Hard restrictions are “configurations of software or hardware that strictly prevent certain uses.” The hard restrictions have been divided into the following subtypes:

  • restricted copy and paste OCR – OCR exposed for searching, but not for copying and pasting of text
  • secure container TPM – use rights vary by resource

To investigate what types of restrictions were put into everyday practice, Eschenfelder studied a total of about seventy-five resources from three different disciplines (engineering, history, art history) and tallied the types of restrictions employed.

Salient quotes

A few salient quotes from the article exemplify Eschenfelder’s position on TPM:

  • “This paper suggests that the soft restrictions that are present in licensed products may have already changed user’s and librarian’s expectations about what the use rights they ought to expect from vendors and their products.” (Page 207)
  • “One concern is that the library community has already accepted many of the soft use restrictions identified in this paper.” (Page 219)
  • “[Librarians] should also advocate for removal of use restrictions, or encourage new vendors to offer competing restriction-free products.” (Page 219)
  • “A more realistic solution might be a shared knowledge base of vendor interfaces and known use restrictions.” (Page 219)
  • “The paper argues that soft use restrictions deserve more attention from the library community, and that librarians should not accept these restrictions as the natural order of things.” (Page 220)

My commentary

I agree with Eschenfelder.

Many people who work in libraries seem to be there because of the values libraries portray. Examples include but are not limited to: intellectual freedom, education, diversity, equal access to information, preservation of the historical record for future generations, etc. Heaven knows, people who work in libraries are not in it for the money! I fall into the equal access to information camp, and that is why I advocate things like open access publishing and open source software development.

TPM inhibits the free and equal access of information, and I think Eschenfelder makes a good point when she says the “library community has already accepted many of the soft use restrictions.” Why do we accept them? Librarians are not required to purchase and/or license these materials. We have choice. If much of the scholarly publishing industry is driven by the marketplace — supply & demand — then why don’t/can’t we just say, “No”? Nobody is forcing us to spend our money this way. If vendors don’t provide the sort of products and services we desire, then the marketplace will change. Right?

In any event, consider educating yourself on the types of TPM and read Eschenfelder’s article.