I “own” a mailing list called NGC4Lib. Its purpose is to provide a forum for the discussion of all things “next generation library catalog”. As of this writing, there are about 2,000 subscribers.
Lately I have been asking myself, “What sorts of things get discussed on the list and who participates in the discussion?” I thought I’d try to answer this question with a bit of text mining. This analysis only covers the current year to date, 2010.
Author names
Even though there are as many as 2,000 subscribers, only a tiny few actually post comments. The following pie and line charts illustrate the point without naming any names. As you can see, eleven (11) people contribute 50% of the postings.
11 people post 50% of the messages
The line chart illustrates the same point differently; a few people post a lot. We definitely have a long tail going on here.
They definitely represent a long tail
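For readers who want to run the same kind of check on their own list archive, here is a minimal Python sketch of the "how many people account for half the postings" calculation. It is not the code behind the charts above. The function name `posters_for_share` and the sample data are made up for illustration, and it assumes you have already extracted one author name per message.

```python
from collections import Counter


def posters_for_share(authors, share=0.5):
    """Return how many of the most active authors account for `share` of all posts.

    `authors` is a list with one entry per message (hypothetical input format).
    """
    counts = Counter(authors)
    total = sum(counts.values())
    running = 0
    # Walk authors from most to least active, accumulating their post counts.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= share * total:
            return rank
    return len(counts)


# Made-up data: ten messages from four authors.
sample = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
print(posters_for_share(sample))        # authors needed to reach 50% of posts
print(posters_for_share(sample, 0.75))  # authors needed to reach 75% of posts
```

Sorting the same counts from largest to smallest and plotting them gives the long-tail line chart.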
Subject lines
The most frequently used individual subject line words more or less reflect traditional library cataloging practices. MARC. MODS. Cataloging. OCLC. But also notice how the word “impasse” is included. This may reflect something about the list.
The subject words look “traditional”
I’m not quite sure what to make of the most commonly used subject word bigrams.
Don’t know what to make of these bigrams
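Counting frequent words and bigrams like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the software used for the charts: the tokenizer, the function names `tokenize` and `top_ngrams`, and the sample subject lines are all hypothetical. The one detail taken from this post is the cutoff that drops words shorter than 4 characters, which is confirmed in the comments below.

```python
import re
from collections import Counter


def tokenize(text, min_length=4):
    """Lowercase the text and keep only words of at least `min_length` characters."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if len(w) >= min_length]


def top_ngrams(lines, n=1, k=10, min_length=4):
    """Return the `k` most common n-grams (as tuples) across all lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line, min_length)
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Made-up subject lines for illustration.
subjects = [
    "MARC and MODS records",
    "Re: MARC and MODS records",
    "Cataloging impasse",
]
print(top_ngrams(subjects, n=1))  # single words
print(top_ngrams(subjects, n=2))  # bigrams
```

One design choice to be aware of: because short words are removed before pairing, a bigram such as "marc mods" can join two words that were separated by "and" in the original subject line. That may be part of why some bigrams look odd.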
Body words
The most frequently used individual words in the body of the postings tell a nice story. Library. Information. Data. HTTP. But notice what is not there — books. I also don’t see things like collections, acquisitions, public, services, nor value or evaluation. Hmm…
The most frequently used bigrams in the body of the messages tell an even more interesting story because they are dominated by the names of people and things.
The phrases “information services” and “technical services” do not necessarily fit my description. Using a concordance to see how these words were being used, I discovered they were overwhelmingly a part of one or more persons’ email signatures or job descriptions. Not what I was hoping for. (Sigh.)
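A concordance of this kind (a keyword-in-context listing) is easy to approximate. The Python sketch below is only an illustration and not the tool used for the analysis above; the function name `concordance`, the `width` parameter, and the sample message are assumptions.

```python
import re


def concordance(text, phrase, width=30):
    """Return each occurrence of `phrase` with `width` characters of context on each side."""
    # Collapse line breaks and runs of spaces so each hit reads as a single line.
    flat = " ".join(text.split())
    hits = []
    for match in re.finditer(re.escape(phrase), flat, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(flat))
        hits.append(flat[start:end])
    return hits


# Made-up message body ending in an email signature.
message = """I think the catalog should expose more data.

Jane Doe
Head of Technical Services
Example Library"""

for hit in concordance(message, "technical services", width=20):
    print(hit)
```

Reading the context around each hit is what reveals whether a phrase is part of the discussion or, as here, part of a signature block.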
Conclusions
Based on these observations, as well as my personal experience, I believe the NGC4Lib mailing list needs more balance. It needs more balance in a couple of ways:
- There are too few people who post the majority of the content. The opinions of eleven people do not, IMHO, represent the ideas and beliefs of more than 2,000. I am hoping these few people understand this and will moderate themselves accordingly.
- The discussion is too focused, IMHO, on traditional library cataloging. There is so much more to the catalog than metadata. We need to be asking questions about what it contains, how that stuff is selected and how it gets in there, what the stuff is used for, and how all of this fits into the broader, worldwide information environment. We need to be discussing issues of collection and dissemination, not just organization. Put another way, I wish I had not used the word “catalog” in the name of the list because I think the word brings along too many connotations and preconceived ideas.
As the owner of the list, what will I do? Frankly, I don’t know. Your thoughts and comments are welcome.
Tags: NGC4Lib, text mining
Eric, I’m guessing that your software only took into account words of 4 letters or more — because I looked for RDA and didn’t find it anywhere.
Karen, that is exactly correct. I removed all words from the output that were less than 4 characters long. By doing so I removed more noise than signal. It was a trade-off.
What’s traffic like? The fact that half the content comes from ‘just’ 11 members may not be that big of a deal in cases where monthly totals are that high.
Kind of skeptical that it only came from 11 members.