Visualizing co-occurrences with Protovis « Infomotions Mini-Musings

Visualizing co-occurrences with Protovis

This posting describes how I am beginning to visualize co-occurrences with a JavaScript library called Protovis. Put another way, I am trying to answer the question, “What did Henry David Thoreau say in the same breath when he used the word ‘walden’?”

“In the same breath”

Network diagrams are great ways to illustrate relationships. In such diagrams nodes represent some sort of entity, and lines connecting nodes represent some sort of relationship. Nodes clustered together and sharing many lines denote some kind of similarity. Conversely, nodes whose lines are long and not interconnected represent entities outside the norm or at a distance. Network diagrams are a way of visualizing complex relationships.

Are you familiar with the phrase “in the same breath”? It is usually used to denote the relationship between two or more ideas. “He mentioned both ‘love’ and ‘war’ in the same breath.” This is exactly one of the things I want to do with texts. Concordances provide this sort of functionality. Given a word or phrase, a concordance will find the query in a corpus and display the words on either side of it. Also known as KWIC (key word in context) indexes, concordances make it easier to read how words or phrases are used in relationship with their surrounding words. Network diagrams seem like a good way to see, to visualize, how words or phrases are used within the context of surrounding words.
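As a rough illustration of what a concordance does, here is a minimal JavaScript sketch. It is not the code behind the scripts described in this posting (those are written in Perl). The function name kwic, the width parameter, and the sample sentence are all made up for the example, and the query is assumed to be a plain word rather than a regular expression.

```javascript
// Minimal keyword-in-context (KWIC) sketch. `kwic` and `width` are
// hypothetical names, not part of any library.
function kwic(text, query, width) {
  // Collapse whitespace so line breaks do not distort the character window.
  const flat = text.replace(/\s+/g, " ");
  // Treat the query as a literal word: escape regular-expression characters.
  const escaped = query.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp("\\b" + escaped + "\\b", "gi");
  const lines = [];
  let match;
  while ((match = pattern.exec(flat)) !== null) {
    const after = match.index + match[0].length;
    lines.push({
      left: flat.slice(Math.max(0, match.index - width), match.index),
      keyword: match[0],
      right: flat.slice(after, Math.min(flat.length, after + width)),
    });
  }
  return lines;
}

// Print each hit with the keyword lined up in a column.
const breathSample =
  "He mentioned both love and war in the same breath. Love, he said, outlasts war.";
for (const line of kwic(breathSample, "war", 15)) {
  console.log(line.left.padStart(15) + " [" + line.keyword + "] " + line.right);
}
```

Each hit is returned as its left context, the keyword, and its right context, which is essentially one row of a concordance display.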

Protovis is a JavaScript charting library developed by the Stanford Visualization Group. Using Protovis a developer can create all sorts of traditional graphs (histograms, box plots, line charts, pie charts, scatter plots) through a relatively easy-to-learn API (application programming interface). One of the graphs Protovis supports is an interactive simulation of network diagrams called “force-directed layouts”. After experiencing some of the work done by a few of my colleagues (“Thank you, Michael Clark and Ed Summers.”), I wondered whether or not network diagrams could be used to visualize co-occurrences in texts. After discovering Protovis, I decided to try to implement something along these lines.

Implementation

The implementation of the visualization requires the recursive creation of a term matrix. Given a word (or regular expression), find the query in a text (or corpus). Identify and count the d most frequently used words within b number of characters. Repeat this process d times with each co-occurrence. For example, suppose the text is Walden by Henry David Thoreau, the query is “spring”, d is 5, and b is 50. The implementation finds all the occurrences of the word “spring”, gets the text 50 characters on either side of it, finds the 5 most commonly used words in those characters, and repeats the process for each of those words. The result is the following matrix:

spring	day	morning	first	winter
day	days	night	every	today
morning	spring	say	day	early
first	spring	last	yet	though
winter	summer	pond	like	snow

Thus, the most common co-occurrences for the word “spring” are “day”, “morning”, “first”, and “winter”. Each of these co-occurrences is recursively used to find more co-occurrences. In this example, the word “spring” co-occurs with times of day and seasons. These words then co-occur with more times of day and more seasons. Similarities and patterns begin to emerge. Depending on the complexity of a writer’s sentence structure, the value of b (“breath”) may need to be increased or decreased. As the value of d (“detail”) is increased or decreased, so is the number of co-occurrences returned.
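The process described above can be sketched in JavaScript as follows. This is only one reading of it, not the Perl behind the actual scripts, and it rests on several assumptions. The stop word list is invented. The query is treated as a plain word rather than a regular expression. Words cut in half by the edges of the character window are dropped. Each row is assumed to hold d terms with the query itself counted as the first, because that is the shape of the matrix shown. The names contextAround, cooccurrences, and termMatrix are hypothetical.

```javascript
// Hypothetical stop list; the posting does not say which words are ignored.
const STOP_WORDS = new Set(["a", "an", "and", "as", "at", "by", "for", "in",
  "is", "it", "of", "on", "that", "the", "to", "was", "with"]);

// Return the text within `breath` characters on either side of the span
// [from, to), dropping any word cut in half by the window's edges.
function contextAround(flat, from, to, breath) {
  const start = Math.max(0, from - breath);
  const end = Math.min(flat.length, to + breath);
  let left = flat.slice(start, from);
  let right = flat.slice(to, end);
  if (start > 0 && /[a-z]/.test(flat[start - 1])) left = left.replace(/^[a-z]+/, "");
  if (end < flat.length && /[a-z]/.test(flat[end])) right = right.replace(/[a-z]+$/, "");
  return left + " " + right;
}

// Return up to `count` of the most frequent words near each occurrence of `query`.
function cooccurrences(text, query, count, breath) {
  const flat = text.toLowerCase().replace(/\s+/g, " ");
  const needle = query.toLowerCase();
  const pattern = new RegExp("\\b" + needle + "\\b", "g");
  const tally = new Map();
  let match;
  while ((match = pattern.exec(flat)) !== null) {
    const context = contextAround(flat, match.index, match.index + match[0].length, breath);
    // Crude tokenizer: runs of letters only.
    for (const word of context.match(/[a-z]+/g) || []) {
      if (word === needle || STOP_WORDS.has(word)) continue;
      tally.set(word, (tally.get(word) || 0) + 1);
    }
  }
  // Most frequent first; ties are broken alphabetically.
  return [...tally.entries()]
    .sort((x, y) => y[1] - x[1] || (x[0] < y[0] ? -1 : 1))
    .map(([word]) => word)
    .slice(0, count);
}

// Build the matrix: one row for the query, then one row per co-occurrence.
function termMatrix(text, query, detail, breath) {
  const first = [query, ...cooccurrences(text, query, detail - 1, breath)];
  const rows = [first];
  for (const term of first.slice(1)) {
    rows.push([term, ...cooccurrences(text, term, detail - 1, breath)]);
  }
  return rows;
}

// A toy text, not Walden.
const seasons = "spring day. spring day. spring morning.";
console.log(termMatrix(seasons, "spring", 3, 8));
```

Because the window is measured in characters, a word sitting between two nearby occurrences of the query is counted once for each of them. A real implementation may or may not want that behavior.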

Once this matrix is constructed, Protovis requires it to be converted into a simple JSON (JavaScript Object Notation) data structure. In this example, “spring” points to “day”, “morning”, “first”, and “winter”. “Day” points to “days”, “night”, “every”, and “today”. Etc. As terms point to multiple other terms, a network diagram emerges, and the magic of Protovis is put to work. See the following illustration:

“spring” in Walden
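The conversion from matrix to JSON might look something like the following sketch. The function name matrixToGraph is hypothetical. The output shape is an assumption based on the data used in the Protovis force-directed layout examples: a list of nodes, each with a nodeName, plus a list of links whose source and target are positions in the node list. The value of 1 given to every link is a placeholder, since the posting does not say how links are weighted.

```javascript
// Convert rows such as ["spring", "day", "morning", ...] into { nodes, links }.
// The first term in each row points to every other term in that row.
function matrixToGraph(matrix) {
  const index = new Map(); // term -> position in `nodes`
  const nodes = [];
  const links = [];
  const seen = new Set();  // guards against duplicate links

  function nodeFor(term) {
    if (!index.has(term)) {
      index.set(term, nodes.length);
      nodes.push({ nodeName: term });
    }
    return index.get(term);
  }

  for (const [source, ...targets] of matrix) {
    const from = nodeFor(source);
    for (const target of targets) {
      const to = nodeFor(target);
      const key = from + ">" + to;
      if (from === to || seen.has(key)) continue;
      seen.add(key);
      links.push({ source: from, target: to, value: 1 }); // placeholder weight
    }
  }
  return { nodes, links };
}

// The matrix for "spring" shown earlier in this posting.
const springMatrix = [
  ["spring", "day", "morning", "first", "winter"],
  ["day", "days", "night", "every", "today"],
  ["morning", "spring", "say", "day", "early"],
  ["first", "spring", "last", "yet", "though"],
  ["winter", "summer", "pond", "like", "snow"],
];
console.log(JSON.stringify(matrixToGraph(springMatrix), null, 2));
```

Terms that appear in more than one row, such as “spring” and “day”, become a single node with several links, which is what pulls related words together in the diagram.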

It is interesting enough to see the co-occurrences of any given word in a text, but it is even more interesting to compare the co-occurrences between texts. Below are a number of visualizations from Thoreau’s Walden. Notice how the word “walden” frequently co-occurs with the words “pond”, “water”, and “woods”. This makes a lot of sense because Walden Pond is a pond located in the woods. Notice how the word “fish” is associated with “pond”, “fish”, and “fishing”. Pretty smart, huh?

“walden” in Walden

“fish” in Walden

“woodchuck” in Walden

“woods” in Walden

Compare these same words with the co-occurrences in a different work by Thoreau, A Week on the Concord and Merrimack Rivers. Given the same inputs the outputs are significantly different. For example, notice the difference in co-occurrences given the word “woodchuck”.

“walden” in Rivers

“fish” in Rivers

“woodchuck” in Rivers

“woods” in Rivers

Give it a try

Give it a try for yourself. I have written three CGI scripts implementing the things outlined above:

In each implementation you are given the opportunity to input your own queries, define the “size of the breath”, and the “level of detail”. The result is an interactive network diagram visualizing the most frequent co-occurrences of a given term.

The root of the Perl source code is located at http://infomotions.com/sandbox/network-diagrams/.

Implications for librarianship

The visualization of co-occurrences obviously has implications for text mining and the digital humanities, but it also has implications for the field of librarianship.

Given the current environment where data and information abound in digital form, libraries have found themselves in an increasingly competitive environment. What are libraries to do? Lest they become marginalized, librarians cannot rest on their “public good” laurels. Merely providing access to information is not good enough. Everybody feels as if they have plenty of access to information. What is needed are methods and tools for making better use of the data and information they acquire. Implementing text mining and visualization interfaces is one way to accomplish that goal within the context of online library services. Do a search in the “online catalog”. Create a subset of interesting content. Click a button to read the content from a distance. Provide ways to analyze and summarize the content, thus saving the time of the reader.

We librarians have to do something differently. Think like an entrepreneur. Take account of your resources. Examine the environment. Innovate and repeat.

Tags: digital humanities, Protovis

This entry was posted on Sunday, January 9th, 2011 at 8:34 pm and is filed under Hacks, Librarianship.

12 Responses to “Visualizing co-occurrences with Protovis”

Tweets that mention Visualizing co-occurrences with Protovis « Infomotions Mini-Musings -- Topsy.com says:

January 9, 2011 at 11:33 pm

[…] This post was mentioned on Twitter by infopeep, Eric Lease Morgan. Eric Lease Morgan said: Visualizing co-occurrences with Protovis: #digitalhumanities; #protovis — http://bit.ly/dKOvyX […]
Király Péter says:

January 10, 2011 at 10:53 am

Eric,

When I view the examples, none of the words appear as labels, only the nodes and axes, while in your screenshot there are labels as well. The JS API would be interesting to use in the XC UI for facets or for visualizing subjects. Thank you for the post!

Péter
Eric Lease Morgan says:

January 10, 2011 at 10:55 am

Péter, other people have said the same thing. No labels. Weird. What browser are you using? I am using Safari on a Macintosh. WFM.
Király Péter says:

January 10, 2011 at 11:43 am

I have tested on 64 bit Ubuntu with Firefox and Chrome.
Eric Lease Morgan says:

January 10, 2011 at 11:47 am

I have been able to verify this “bug”. More later, I hope. Thank you.
Király Péter says:

January 10, 2011 at 12:02 pm

I have tested it on Windows XP with Firefox, IE8 and Chrome. IE8 reported an error, but the other two browsers produced the same result as on Ubuntu. “No label” means that no labels are displayed on the diagram, but a label does appear as a little popup when the mouse is over a node.
Eric Lease Morgan says:

January 10, 2011 at 12:04 pm

Péter, yep, exactly.
Eric Lease Morgan says:

January 10, 2011 at 5:43 pm

Resolved? After changing a Javascript line from:

force.label.add(pv.Label).font('14px sans-serif').textStyle('bold');

to:

force.label.add(pv.Label).font('14px sans-serif');

the problem seems to have gone away on my part. Fixed?
Király Péter says:

January 10, 2011 at 6:10 pm

Yes, now it is OK in both FF and Chrome on both Ubuntu and Windows XP, but not in IE8. It reported a “this object not supported this property of method” problem at protovis.js line 82 char. 261.
Eric Lease Morgan says:

January 10, 2011 at 7:56 pm

Péter, thank you for checking, and yes, the visualization does not work with IE since Protovis creates SVG to render images, and IE does not support SVG. Alas. This is why I am hesitant to implement my network diagrams in my Alex Catalogue. (Sigh.)
Arvind says:

April 8, 2011 at 1:57 pm

I am looking for a different type of graphing where I wish to click on a logical node to reveal its inner content at the next level of hierarchy. Any way this can be done with Protovis?
Eric Lease Morgan says:

April 9, 2011 at 5:36 pm

@Arvind, yes, I believe the sort of functionality you describe is possible with Protovis because it is possible to associate a hyperlink with each of the nodes. Click on a node. Go to another URL.

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09