Forays into parts-of-speech

This posting is the first of my text mining essays focusing on parts-of-speech. Based on the most rudimentary investigations, outlined below, it seems as if there is not much utility in the classification and description of texts in terms of their percentage use of parts-of-speech.

Background

For the past year or so I have spent a lot of my time counting words. Many of my friends and colleagues look at me strangely when I say this. I have to admit, it does sound sort of weird. On the other hand, the process has enabled me to easily compare & contrast entire canons in terms of length and readability, locate statistically significant words & phrases in individual works, and visualize both with charts & graphs. Through the process I have developed two Perl modules (Lingua::EN::Ngram and Lingua::Concordance), and I have integrated them into my Alex Catalogue of Electronic Texts. Many people are still skeptical about the utility of these endeavors, and my implementations do not seem to be compelling enough to sway their opinions. Oh well, such is life.

My ultimate goal is to figure out ways to exploit the current environment and provide better library service. The current environment is rich with full text. It abounds. I ask myself, “How can I take advantage of this full text to make the work of students, teachers, and scholars both easier and more productive?” My current answer centers on the creation of tools that take advantage of the full text, making it easier for people to “read” larger quantities of information, find patterns in it, and through the process create new knowledge.

Much of my work has been based on rudimentary statistics with little regard to linguistics. Through the use of computers I strive to easily find patterns of meaning across works — an aspect of linguistics. I think such a thing is possible because the use of language assumes systems and patterns. If it didn’t then communication between ourselves would be impossible. Computers are all about systems and patterns. They are very good at counting and recording data. By using computers to count and record characteristics of texts, I think it is possible to find patterns that humans overlook or don’t figure as significant. I would really like to take advantage of core reference works which are full of meaning — dictionaries, thesauri, almanacs, biographies, bibliographies, gazetteers, encyclopedias, etc. — but the ambiguous nature of written language makes the automatic application of such tools challenging. By classifying individual words as parts-of-speech (POS), some of this ambiguity can be reduced. This posting is my first foray into this line of reasoning, and only time will tell if it is fruitful.

Comparing parts-of-speech across texts

My first experiment compares & contrasts POS usage across texts. “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?”, I asked myself. “Do some works contain a greater number of nouns, verbs, and adjectives than others?” If so, then maybe this would be one way to differentiate works, and make it easier for the student to both select a work for reading as well as understand its content.

POS tagging

To answer these questions, I first need to identify the POS in a document. In the English language there are eight generally accepted POS: 1) nouns, 2) pronouns, 3) verbs, 4) adverbs, 5) adjectives, 6) prepositions, 7) conjunctions, and 8) interjections. Since I am a “lazy Perl programmer”, I sought a POS tagger and in the end settled on one called Lingua::TreeTagger, a full-featured wrapper around a command-line application called TreeTagger. Using a probabilistic approach related to the Hidden Markov Model (it estimates tag-transition probabilities with decision trees, hence the name), TreeTagger systematically goes through a document and guesses the POS for each word. According to the research, it can do this with 96% accuracy because it has accurately modeled the systems and patterns of the English language alluded to above. For example, it knows that sentences begin with capital letters and end with punctuation marks. It knows that capitalized words in the middle of sentences are usually the names of things, and the names of things are nouns. It knows that many adverbs end in “ly”. It knows that adjectives often precede nouns. Similarly, it knows the word “the” also precedes nouns. In short, it has done its best to model the syntactical nature of a number of languages, and it uses these models to denote the POS in a document.

For example, below is the first sentence from Abraham Lincoln’s Gettysburg Address:

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Using Lingua::TreeTagger it is trivial to convert the sentence into the following XML snippet where each element contains two attributes (a lemma of the word in question and its POS) and the word itself:

<pos><w lemma="Four" type="CD">Four</w> <w lemma="score" type="NN">score</w> <w lemma="and" type="CC">and</w> <w lemma="seven" type="CD">seven</w> <w lemma="year" type="NNS">years</w> <w lemma="ago" type="RB">ago</w> <w lemma="our" type="PP$">our</w> <w lemma="father" type="NNS">fathers</w> <w lemma="bring" type="VVD">brought</w> <w lemma="forth" type="RB">forth</w> <w lemma="on" type="IN">on</w> <w lemma="this" type="DT">this</w> <w lemma="continent" type="NN">continent</w> <w lemma="," type=",">,</w> <w lemma="a" type="DT">a</w> <w lemma="new" type="JJ">new</w> <w lemma="nation" type="NN">nation</w> <w lemma="," type=",">,</w> <w lemma="conceive" type="VVN">conceived</w> <w lemma="in" type="IN">in</w> <w lemma="Liberty" type="NP">Liberty</w> <w lemma="," type=",">,</w> <w lemma="and" type="CC">and</w> <w lemma="dedicate" type="VVN">dedicated</w> <w lemma="to" type="TO">to</w> <w lemma="the" type="DT">the</w> <w lemma="proposition" type="NN">proposition</w> <w lemma="that" type="IN/that">that</w> <w lemma="all" type="DT">all</w> <w lemma="man" type="NNS">men</w> <w lemma="be" type="VBP">are</w> <w lemma="create" type="VVN">created</w> <w lemma="equal" type="JJ">equal</w> <w lemma="." type="SENT">.</w></pos>

Each POS is represented by a different code. TreeTagger uses as many as 58 codes. Some of the less obscure are: CD for cardinal number, CC for coordinating conjunction, NN for singular noun, NNS for plural noun, JJ for adjective, VBP for a present-tense form of the verb to be other than the third-person singular (such as “are”), etc.

Using a slightly different version of the same trivial code, Lingua::TreeTagger can output a delimited stream where each line represents a record and the delimited values are words, lemmas, and POS. The first ten records from the sentence above are displayed below:

Word	Lemma	POS
Four	Four	CD
score	score	NN
and	and	CC
seven	seven	CD
years	year	NNS
ago	ago	RB
our	our	PP$
fathers	father	NNS
brought	bring	VVD
forth	forth	RB

In the end I wrote a simple program — tag.pl — taking a file name as input and streaming to standard output the tagged text in delimited form. Executing the code and saving the output to a file is simple:

$ bin/tag.pl corpus/walden.txt > pos/walden.pos

Consequently, I now have a way to quickly and easily denote the POS for each word in a given plain text file.
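For readers who would rather work with the tagged output outside of Perl, the delimited records can be read with a few lines of Python. This is only a sketch under stated assumptions: it presumes the output is tab-delimited with a Word/Lemma/POS header row, as in the table above, and the function name read_tagged is hypothetical rather than part of tag.pl or Lingua::TreeTagger.

```python
import csv
import io

# First ten records of the Gettysburg Address sentence, as listed above.
# Assumption: fields are separated by tabs and a header row is present.
SAMPLE = """Word\tLemma\tPOS
Four\tFour\tCD
score\tscore\tNN
and\tand\tCC
seven\tseven\tCD
years\tyear\tNNS
ago\tago\tRB
our\tour\tPP$
fathers\tfather\tNNS
brought\tbring\tVVD
forth\tforth\tRB
"""


def read_tagged(stream):
    """Yield (word, lemma, pos) tuples from a tab-delimited stream."""
    # QUOTE_NONE keeps a literal quotation-mark token from confusing the parser.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip the header row and any malformed lines.
        if len(row) != 3 or row == ["Word", "Lemma", "POS"]:
            continue
        yield tuple(row)


records = list(read_tagged(io.StringIO(SAMPLE)))
for word, lemma, pos in records:
    print(word, lemma, pos)
```

To read a real file instead of the sample, pass an open file handle (for example, one opened on pos/walden.pos) in place of the StringIO object.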

Counting and summarizing

Now that the POS of a given document are identified, the next step is to count and summarize them. Counting is something at which computers excel, and I wrote another program — summarize.pl — to do the work. The program’s input takes the following form:

summarize.pl <all|simple|other|pronouns|nouns|verbs|adverbs|adjectives> <t|l> <filename>

The first command line argument denotes what POS will be output. “All” denotes the POS defined by TreeTagger. “Simple” denotes TreeTagger POS mapped to the eight generally accepted POS of the English language. The use of “nouns”, “pronouns”, “verbs”, “adverbs”, and “adjectives” tells the script to output the tokens (words) or lemmas in each of these classes.

The second command line argument tells the script whether to tally tokens (words) or lemmas when counting specific items.

The last argument is the file to read, and it is expected to be in the form of tag.pl’s output.

Using summarize.pl to count the simple POS in Lincoln’s Address, the following output is generated:

$ summarize.pl simple t address.pos
noun 41
pronoun 29
adjective 21
verb 51
adverb 31
determiner 35
preposition 39
conjunction 11
interjection 0
symbol 2
punctuation 39
other 11

In other words, of the 272 words found in the Gettysburg Address 41 are nouns, 29 are pronouns, 21 are adjectives, etc.
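The “simple” mode depends on a mapping from TreeTagger codes to broader classes. That mapping is not spelled out in this posting, so the rules below are a guess at one reasonable mapping, not the one summarize.pl actually uses; the function name simple_pos and the punctuation set are likewise assumptions. The tags are those of the first sentence of the Address, copied from the XML snippet above.

```python
from collections import Counter

# Tags of the 34 tokens in the first sentence of the Address, in order.
TAGS = (
    "CD NN CC CD NNS RB PP$ NNS VVD RB IN DT NN , DT JJ NN , VVN IN NP , "
    "CC VVN TO DT NN IN/that DT NNS VBP VVN JJ SENT"
).split()

# Assumed set of punctuation codes; adjust to match the tagger's output.
PUNCTUATION = {",", ":", "(", ")", "``", "''", "SENT"}


def simple_pos(tag):
    """Map a TreeTagger code to a simple class (hypothetical rules)."""
    if tag in PUNCTUATION:
        return "punctuation"
    if tag.startswith(("NN", "NP")):
        return "noun"
    if tag.startswith(("PP", "WP")):
        return "pronoun"
    if tag.startswith("V") or tag == "MD":
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    if tag.startswith("RB") or tag == "WRB":
        return "adverb"
    if tag in {"DT", "PDT", "WDT"}:
        return "determiner"
    if tag.startswith("IN") or tag == "TO":
        return "preposition"
    if tag == "CC":
        return "conjunction"
    if tag == "UH":
        return "interjection"
    if tag in {"SYM", "#", "$"}:
        return "symbol"
    return "other"  # everything else, e.g. CD (cardinal numbers)


counts = Counter(simple_pos(tag) for tag in TAGS)
for name, n in counts.most_common():
    print(name, n)
```

Different choices here (for example, whether “to” counts as a preposition or whether modals count as verbs) will shift the totals slightly, which is worth keeping in mind when comparing counts produced by different tools.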

Using a different form of the command, a list of all the pronouns in the Address, sorted by the number of occurrences, can be generated:

$ summarize.pl pronouns t address.pos
we 10
it 5
they 3
who 3
us 3
our 2
what 2
their 1

In other words, the word “we” — a particular pronoun — was used 10 times in the Address.
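The difference between tallying tokens and tallying lemmas (the t and l switches) can be sketched as follows. This is an illustration only: the function tally, its parameters, and the choice to lowercase tokens are assumptions, not a description of summarize.pl. The records come from the XML snippet above.

```python
from collections import Counter

# (word, lemma, tag) records taken from the first sentence of the Address.
RECORDS = [
    ("years", "year", "NNS"),
    ("fathers", "father", "NNS"),
    ("brought", "bring", "VVD"),
    ("conceived", "conceive", "VVN"),
    ("dedicated", "dedicate", "VVN"),
    ("men", "man", "NNS"),
    ("are", "be", "VBP"),
    ("created", "create", "VVN"),
]


def tally(records, tag_prefix, use_lemma=False):
    """Count tokens (or lemmas) whose tag starts with tag_prefix.

    Returns (item, count) pairs sorted by descending count, then alphabetically.
    """
    counter = Counter()
    for word, lemma, tag in records:
        if tag.startswith(tag_prefix):
            # Assumption: tokens are folded to lowercase before counting.
            counter[lemma if use_lemma else word.lower()] += 1
    return sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))


print(tally(RECORDS, "V"))                  # verb tokens as written
print(tally(RECORDS, "V", use_lemma=True))  # verb lemmas (dictionary forms)
```

Tallying lemmas collapses inflected forms such as “brought” and “brings” into one entry, while tallying tokens keeps them apart.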

Consequently, I now have a tool enabling me to count the POS in a document.

Preliminary analysis

I now have the tools necessary to answer one of my initial questions, “Do some works contain a greater number of nouns, verbs, and adjectives than others?” To answer this I collected nine sets of documents for analysis:

Henry David Thoreau’s Excursions (73,734 words; Flesch readability score: 57)
Henry David Thoreau’s Walden (106,486 words; Flesch readability score: 55)
Henry David Thoreau’s A Week on the Concord and Merrimack Rivers (117,670 words; Flesch readability score: 56)
Jane Austen’s Sense and Sensibility (119,625 words; Flesch readability score: 54)
Jane Austen’s Northanger Abbey (76,497 words; Flesch readability score: 58)
Jane Austen’s Emma (156,509 words; Flesch readability score: 60)
all of the works of Plato (1,162,460 words; Flesch readability score: 54)
all of the works of Aristotle (950,078 words; Flesch readability score: 50)
all of the works of Shakespeare (856,594 words; Flesch readability score: 72)

Using tag.pl I created POS files for each set of documents. I then used summarize.pl to output counts of the simple POS from each POS file. For example, after creating a POS file for Walden, I summarized the results and learned that it contains 23,272 nouns, 10,068 pronouns, 8,118 adjectives, etc.:

$ summarize.pl simple t walden.pos
noun 23272
pronoun 10068
adjective 8118
verb 17695
adverb 8289
determiner 13494
preposition 16557
conjunction 5921
interjection 37
symbol 997
punctuation 14377
other 2632

I then copied this information into a spreadsheet and calculated the relative percentage of each POS, discovering that 19% of the tokens (words, punctuation, and symbols) in Walden are nouns, 8% are pronouns, 7% are adjectives, etc. See the table below:

POS	%
noun	19
pronoun	8
adjective	7
verb	15
adverb	7
determiner	11
preposition	14
conjunction	5
interjection	0
symbol	1
punctuation	12
other	2
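The spreadsheet step can also be done in a few lines of code. The sketch below uses the Walden counts shown above and divides each by the total number of tokens; the variable names are illustrative. The rounded results agree with the table when every token, including punctuation and symbols, is included in the total.

```python
# Simple POS counts for Walden, copied from the summarize.pl output above.
WALDEN = {
    "noun": 23272,
    "pronoun": 10068,
    "adjective": 8118,
    "verb": 17695,
    "adverb": 8289,
    "determiner": 13494,
    "preposition": 16557,
    "conjunction": 5921,
    "interjection": 37,
    "symbol": 997,
    "punctuation": 14377,
    "other": 2632,
}

# The denominator is every tagged token, not only the words.
total = sum(WALDEN.values())

# Percentage of the total for each POS, rounded to the nearest whole number.
percentages = {pos: round(100 * n / total) for pos, n in WALDEN.items()}

for pos, pct in percentages.items():
    print(f"{pos}\t{pct}")
```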

I repeated this process for each of the nine sets of documents and tabulated them here:

POS	Excursions	Rivers	Walden	Sense	Northanger	Emma	Aristotle	Shakespeare	Plato	Average
noun	20	20	19	17	17	17	19	25	18	19
verb	14	14	15	16	16	17	15	14	15	15
punctuation	13	13	12	15	15	15	11	16	13	14
preposition	13	13	14	13	13	12	15	9	14	13
determiner	12	12	11	7	8	7	13	6	11	10
pronoun	7	7	8	12	11	11	5	11	7	9
adverb	6	6	7	8	8	8	6	6	6	7
adjective	7	7	7	5	6	6	7	5	6	6
conjunction	5	5	5	3	3	3	5	3	6	4
other	2	2	2	3	3	3	3	3	3	3
symbol	1	1	1	1	1	0	1	2	1	1
interjection	0	0	0	0	0	0	0	0	0	0
Percentage and average of parts-of-speech usage in 9 works or corpora

The result was very surprising to me. Despite the wide range of document sizes, and despite the wide range of genres, the relative percentages of POS are very similar across all of the documents. The last column in the table represents the average percentage of each POS use. Notice how each individual POS value differs very little from the average.
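One way to put a number on how far the individual values stray from the average is to compute, for each POS, the mean and the range (largest value minus smallest) across the nine columns. The sketch below uses the percentages from the table above; the choice of the range as a measure of spread is mine, and other measures such as the standard deviation would serve equally well.

```python
from statistics import mean

# Column order: Excursions, Rivers, Walden, Sense, Northanger, Emma,
# Aristotle, Shakespeare, Plato (percentages copied from the table above).
TABLE = {
    "noun":         [20, 20, 19, 17, 17, 17, 19, 25, 18],
    "verb":         [14, 14, 15, 16, 16, 17, 15, 14, 15],
    "punctuation":  [13, 13, 12, 15, 15, 15, 11, 16, 13],
    "preposition":  [13, 13, 14, 13, 13, 12, 15, 9, 14],
    "determiner":   [12, 12, 11, 7, 8, 7, 13, 6, 11],
    "pronoun":      [7, 7, 8, 12, 11, 11, 5, 11, 7],
    "adverb":       [6, 6, 7, 8, 8, 8, 6, 6, 6],
    "adjective":    [7, 7, 7, 5, 6, 6, 7, 5, 6],
    "conjunction":  [5, 5, 5, 3, 3, 3, 5, 3, 6],
    "other":        [2, 2, 2, 3, 3, 3, 3, 3, 3],
    "symbol":       [1, 1, 1, 1, 1, 0, 1, 2, 1],
    "interjection": [0, 0, 0, 0, 0, 0, 0, 0, 0],
}

# For each POS: (rounded mean, range in percentage points).
summary = {
    pos: (round(mean(values)), max(values) - min(values))
    for pos, values in TABLE.items()
}

for pos, (average, spread) in summary.items():
    print(f"{pos:<13}{average:>4}{spread:>4}")
```

The rounded means reproduce the Average column, and the range column shows at a glance which POS vary most from corpus to corpus.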

This analysis can be illustrated in a couple of ways. First, below are nine pie charts. Each slice of each pie represents a different POS. Notice how all the dark blue slices (nouns) are very similar in size. Notice how all the red slices (verbs), again, are very similar. The only noticeable exception is in Shakespeare where there is a greater number of nouns and pronouns (dark green).

[Nine pie charts, one per corpus: Thoreau’s Excursions, Thoreau’s Walden, Thoreau’s Rivers, Austen’s Sense, Austen’s Northanger, Austen’s Emma, all of Plato, all of Aristotle, all of Shakespeare]

The similarity across all the documents can be further illustrated with a line graph:

Across the X axis is each POS. Up and down the Y axis is the percentage of usage. Notice how the values for each POS in each document are closely clustered. Each set of documents uses roughly the same proportion of nouns, pronouns, verbs, adjectives, adverbs, etc.

Maybe such a relationship between POS is one of the patterns of well-written documents? Maybe it is representative of works standing the test of time? I don’t know, but I doubt I am the first person to make such an observation.

Conclusion

My initial questions were, “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?” and “Do some works contain a greater number of nouns, verbs, and adjectives than others?” Based on this foray and rudimentary analysis the answers are, “No, there are not significant differences, and no, works do not contain markedly different proportions of nouns, verbs, adjectives, etc.”

Of course, such a conclusion is faulty without further calculations. I will quite likely commit an error of induction if I base my conclusions on a sample of only nine items. While it would require a greater amount of effort on my part, it is not beyond possibility for me to calculate the average POS usage for every item in my Alex Catalogue. I know there will be some differences — especially considering the items having gone through optical character recognition — but I do not know the degree of difference. Such an investigation is left for a later time.

Instead, I plan to pursue a different line of investigation. The current work examined how texts were constructed, but in actuality I am more interested in the meanings works express. I am interested in what they say more than how they say it. Such meanings may be gleaned not so much from gross POS measurements but rather the words used to denote each POS. For example, the following table lists the 10 most frequently used pronouns and the number of times they occur in four works. Notice the differences:

Walden	Rivers	Northanger	Sense
I (1,809)	it (1,314)	her (1,554)	her (2,500)
it (1,507)	we (1,101)	I (1,240)	I (1,917)
my (725)	his (834)	she (1,089)	it (1,711)
he (698)	I (756)	it (1,081)	she (1,553)
his (666)	our (677)	you (906)	you (1,158)
they (614)	he (649)	he (539)	he (1,068)
their (452)	their (632)	his (524)	his (1,007)
we (447)	they (632)	they (379)	him (628)
its (351)	its (487)	my (342)	my (598)
who (340)	who (352)	him (278)	they (509)

While the lists are similar, they are characteristic of the works from which they came. The first, Walden, is about an individual who lives on a lake. Notice the prominence of the words “I” and “my”. The second, Rivers, is written by the same author as the first but is about brothers who canoe down a river. Notice the higher occurrence of the words “we” and “our”. The latter two works, both written by Jane Austen, are works with females as central characters. Notice how the words “her” and “she” appear in these lists but not in the former two. (Compare these lists of pronouns with the list from Lincoln’s Address and even more interesting things appear.) It looks as if there are patterns or trends to be measured here.
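The differences between the lists can be pulled out mechanically with set operations. The sketch below uses only the pronouns from the table above (counts omitted) and asks which pronouns appear in Austen’s top-ten lists but not Thoreau’s, and vice versa; the variable names are illustrative.

```python
# Ten most frequent pronouns per work, copied from the table above.
TOP_PRONOUNS = {
    "Walden": ["I", "it", "my", "he", "his", "they", "their", "we", "its", "who"],
    "Rivers": ["it", "we", "his", "I", "our", "he", "their", "they", "its", "who"],
    "Northanger": ["her", "I", "she", "it", "you", "he", "his", "they", "my", "him"],
    "Sense": ["her", "I", "it", "she", "you", "he", "his", "him", "my", "they"],
}

# Pool the lists by author.
thoreau = set(TOP_PRONOUNS["Walden"]) | set(TOP_PRONOUNS["Rivers"])
austen = set(TOP_PRONOUNS["Northanger"]) | set(TOP_PRONOUNS["Sense"])

# Pronouns found in one author's lists but not the other's.
austen_only = austen - thoreau
thoreau_only = thoreau - austen

print("Austen only: ", sorted(austen_only))
print("Thoreau only:", sorted(thoreau_only))
```

With only four works this is little more than a curiosity, but the same comparison applied to many works might begin to show whether such differences are systematic.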

‘More later.

Tags: digital humanities

This entry was posted on Saturday, February 5th, 2011 at 8:33 pm and is filed under Hacks.

3 Responses to “Forays into parts-of-speech”

Laurie McGowan says:

February 5, 2011 at 10:32 pm

Interesting stuff, Eric! Do you suppose there are already literary categories that would mimic your pronoun lists in describing the voice of a particular work? Or could you test your lists against those categories, assuming that they exist? Also wonder if some of your work could apply to transformational grammar to find common units across languages?

FYI – it’s “Flesch” readability (not Flesh)

Look forward to seeing more of this.

Laurie
Eric Lease Morgan says:

February 6, 2011 at 8:57 am

Laurie, spelling fixed. Thank you. Regarding the other, ‘sounds interesting and we should talk. –ELM
Aris Xanthos says:

March 5, 2011 at 5:14 am

Thanks for this interesting report, I’m glad you found Lingua::TreeTagger useful.

Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Date created: 2008-05-26
Date updated: 2010-05-09