Text mining against NGC4Lib « Infomotions Mini-Musings

Text mining against NGC4Lib

I “own” a mailing list called NGC4Lib. Its purpose is to provide a forum for the discussion of all things “next generation library catalog”. As of this writing, there are about 2,000 subscribers.

Lately I have been asking myself, “What sorts of things get discussed on the list and who participates in the discussion?” I thought I’d try to answer this question with a bit of text mining. This analysis only covers the current year to date, 2010.

Author names

Even though there are as many as 2,000 subscribers, only a tiny few actually post comments. The following pie and line charts illustrate the point without naming any names. As you can see, eleven (11) people contribute 50% of the postings.

posters
11 people post 50% of the messages

The line chart illustrates the same point differently; a few people post a lot. We definitely have a long tail going on here.

posters
They definitely represent a long tail
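For readers who want to try a similar count on their own list archive, here is a minimal sketch in Python. It is not the code used for this posting. It assumes you already have one author name per message in a list; the function name `posters_covering` and the sample names are made up for illustration.

```python
from collections import Counter

def posters_covering(authors, share=0.5):
    """Return the smallest number of top posters whose messages reach `share` of the total."""
    counts = Counter(authors)
    total = sum(counts.values())
    running = 0
    # Walk the posters from most to least prolific, accumulating their messages.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= share * total:
            return rank
    return len(counts)

# Hypothetical data: one entry per message, holding the poster's name.
sample = ["ann"] * 5 + ["bob"] * 3 + ["cy", "di"]
print(posters_covering(sample))
```

Sorting the same counts from largest to smallest and plotting them gives the long-tail curve.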

Subject lines

The most frequently used individual subject line words more or less reflect traditional library cataloging practices. MARC. MODS. Cataloging. OCLC. But also notice how the word “impasse” is included. This may reflect something about the list.

The subject words look “traditional”

I’m not quite sure what to make of the most commonly used subject word bigrams.

subject bigrams
Don’t know what to make of these bigrams
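Counting words and bigrams like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script behind the charts: the tokenizer keeps only runs of the letters a to z, words shorter than four characters are dropped (as confirmed in the comments below), and the names `tokenize` and `count_words_and_bigrams` and the sample subject lines are invented.

```python
import re
from collections import Counter

def tokenize(text, min_length=4):
    """Lowercase the text and keep alphabetic words of at least `min_length` letters."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_length]

def count_words_and_bigrams(lines, min_length=4):
    """Count single words and adjacent word pairs, one line at a time."""
    words, bigrams = Counter(), Counter()
    for line in lines:
        tokens = tokenize(line, min_length)
        words.update(tokens)
        # Pair each token with its successor; pairs never cross line boundaries.
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams

# Hypothetical subject lines, for illustration only.
subjects = ["MARC records and cataloging", "Re: MARC records and MODS"]
words, bigrams = count_words_and_bigrams(subjects)
print(words.most_common(3))
print(bigrams.most_common(2))
```

Because short words are removed before pairing, a bigram here can join two words that were separated by a dropped word such as “and”. Whether the original analysis paired words before or after filtering is not stated, so treat that ordering as an assumption.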

Body words

The most frequently used individual words in the body of the postings tell a nice story. Library. Information. Data. HTTP. But notice what is not there — books. I also don’t see things like collections, acquisitions, public, services, nor value or evaluation. Hmm…

body words
These tell a nice story

The most frequently used bigrams in the body of the messages tell an even more interesting story because they are dominated by the names of people and things.

body bigrams
Names of people and things

The phrases “information services” and “technical services” do not necessarily fit my description. Using a concordance to see how these words were being used, I discovered they were overwhelmingly a part of one or more persons’ email signatures or job descriptions. Not what I was hoping for. (Sigh.)
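A concordance simply lists every occurrence of a phrase together with the text around it. The sketch below shows one simple way to build such a listing in Python; it is not the tool used for this posting, and the function name `concordance`, the `width` parameter, and the sample signature are all hypothetical.

```python
import re

def concordance(text, phrase, width=30):
    """Return each occurrence of `phrase` with up to `width` characters of context on each side."""
    lines = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten newlines so each hit prints on a single line.
        lines.append(text[start:end].replace("\n", " "))
    return lines

# Hypothetical message body that ends in an email signature.
sample = "Jane Doe\nHead of Technical Services\nExample University Library"
for line in concordance(sample, "technical services", width=12):
    print(line)
```

Reading down such a listing makes it easy to spot when a phrase shows up mostly in signatures rather than in the discussion itself.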

Conclusions

Based on these observations, as well as my personal experience, I believe the NGC4Lib mailing list needs more balance. It needs more balance in a couple of ways:

There are too few people who post the majority of the content. The opinions of eleven people do not, IMHO, represent the ideas and beliefs of more than 2,000. I am hoping these few people understand this and will moderate themselves accordingly.
The discussion is too much focused, IMHO, on traditional library cataloging. There is so much more to the catalog than metadata. We need to be asking questions about what it contains, how that stuff is selected and how it gets in there, what the stuff is used for, and how all of this fits into the broader, worldwide information environment. We need to be discussing issues of collection and dissemination, not just organization. Put another way, I wish I had not used the word “catalog” in the name of the list because I think the word brings along too many connotations and preconceived ideas.

As the owner of the list, what will I do? Frankly, I don’t know. Your thoughts and comments are welcome.

Tags: NGC4Lib, text mining

This entry was posted on Friday, June 25th, 2010 at 11:23 am and is filed under Text Mining and Natural Language Processing.

5 Responses to “Text mining against NGC4Lib”

Tweets that mention Text mining against NGC4Lib « Infomotions Mini-Musings -- Topsy.com says:

June 25, 2010 at 4:05 pm

[…] This post was mentioned on Twitter by infopeep, Becky Yoose. Becky Yoose said: Text mining against NGC4Lib ./2010/06/text-mining-against-ngc4lib/ […]
Karen Coyle says:

June 25, 2010 at 4:10 pm

Eric, I’m guessing that your software only took into account words of 4 letters or more — because I looked for RDA and didn’t find it anywhere.
Eric Lease Morgan says:

June 25, 2010 at 4:32 pm

Karen, that is exactly correct. I removed all words from the output that were less than 4 characters long. By doing so I removed more noise than signal. It was a trade-off.
Leo Robert Klein says:

June 26, 2010 at 1:28 pm

What’s traffic like? The fact that half the content comes from ‘just’ 11 members may not be that big of a deal in cases where monthly totals are that high.
Training and Technical Service says:

September 1, 2010 at 9:31 am

Kind of skeptical that it only came from 11 members.

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09
URL: ./
