This posting is the first of my text mining essays focusing on parts-of-speech. Based on the most rudimentary investigations, outlined below, it seems as if there is not much utility in the classification and description of texts in terms of their percentage use of parts-of-speech.
For the past year or so I have spent a lot of my time counting words. Many of my friends and colleagues look at me strangely when I say this. I have to admit, it does sound sort of weird. On the other hand, the process has enabled me to easily compare & contrast entire canons in terms of length and readability, locate statistically significant words & phrases in individual works, and visualize both with charts & graphs. Through the process I have developed two Perl modules (Lingua::EN::Ngram and Lingua::Concordance), and I have integrated them into my Alex Catalogue of Electronic Texts. Many people are still skeptical about the utility of these endeavors, and my implementations do not seem to be compelling enough to sway their opinions. Oh well, such is life.
My ultimate goal is to figure out ways to exploit the current environment and provide better library service. The current environment is rich with full text. It abounds. I ask myself, “How can I take advantage of this full text to make the work of students, teachers, and scholars both easier and more productive?” My current answer centers on the creation of tools that take advantage of the full text, making it easier for people to “read” larger quantities of information, find patterns in it, and through the process create new knowledge.
Much of my work has been based on rudimentary statistics with little regard to linguistics. Through the use of computers I strive to easily find patterns of meaning across works — an aspect of linguistics. I think such a thing is possible because the use of language assumes systems and patterns. If it didn’t then communication between ourselves would be impossible. Computers are all about systems and patterns. They are very good at counting and recording data. By using computers to count and record characteristics of texts, I think it is possible to find patterns that humans overlook or don’t figure as significant. I would really like to take advantage of core reference works which are full of meaning — dictionaries, thesauri, almanacs, biographies, bibliographies, gazetteers, encyclopedias, etc. — but the ambiguous nature of written language makes the automatic application of such tools challenging. By classifying individual words as parts-of-speech (POS), some of this ambiguity can be reduced. This posting is my first foray into this line of reasoning, and only time will tell if it is fruitful.
Comparing parts-of-speech across texts
My first experiment compares & contrasts POS usage across texts. “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?”, I asked myself. “Do some works contain a greater number of nouns, verbs, and adjectives than others?” If so, then maybe this would be one way to differentiate works, and make it easier for the student to both select a work for reading as well as understand its content.
To answer these questions, I need to first identify the POS in a document. In the English language there are eight generally accepted POS: 1) nouns, 2) pronouns, 3) verbs, 4) adverbs, 5) adjectives, 6) prepositions, 7) conjunctions, and 8) interjections. Since I am a “lazy Perl programmer”, I sought a POS tagger and in the end settled on one called Lingua::TreeTagger, a full-featured wrapper around a command-line application called TreeTagger. Using a process called the Hidden Markov Model, TreeTagger systematically goes through a document and guesses the POS for a given word. According to the research, it can do this with 96% accuracy because it has accurately modeled the systems and patterns of the English language alluded to above. For example, it knows that sentences begin with capital letters and end with punctuation marks. It knows that capitalized words in the middle of sentences are the names of things and the names of things are nouns. It knows that most adverbs end in “ly”. It knows that adjectives often precede nouns. Similarly, it knows the word “the” also precedes nouns. In short, it has done its best to model the syntactical nature of a number of languages and it uses these models to denote the POS in a document.
For example, below is the first sentence from Abraham Lincoln’s Gettysburg Address:
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Using Lingua::TreeTagger it is trivial to convert the sentence into the following XML snippet where each element contains two attributes (a lemma of the word in question and its POS) and the word itself:
<pos><w lemma="Four" type="CD">Four</w> <w lemma="score" type="NN">score</w> <w lemma="and" type="CC">and</w> <w lemma="seven" type="CD">seven</w> <w lemma="year" type="NNS">years</w> <w lemma="ago" type="RB">ago</w> <w lemma="our" type="PP$">our</w> <w lemma="father" type="NNS">fathers</w> <w lemma="bring" type="VVD">brought</w> <w lemma="forth" type="RB">forth</w> <w lemma="on" type="IN">on</w> <w lemma="this" type="DT">this</w> <w lemma="continent" type="NN">continent</w> <w lemma="," type=",">,</w> <w lemma="a" type="DT">a</w> <w lemma="new" type="JJ">new</w> <w lemma="nation" type="NN">nation</w> <w lemma="," type=",">,</w> <w lemma="conceive" type="VVN">conceived</w> <w lemma="in" type="IN">in</w> <w lemma="Liberty" type="NP">Liberty</w> <w lemma="," type=",">,</w> <w lemma="and" type="CC">and</w> <w lemma="dedicate" type="VVN">dedicated</w> <w lemma="to" type="TO">to</w> <w lemma="the" type="DT">the</w> <w lemma="proposition" type="NN">proposition</w> <w lemma="that" type="IN/that">that</w> <w lemma="all" type="DT">all</w> <w lemma="man" type="NNS">men</w> <w lemma="be" type="VBP">are</w> <w lemma="create" type="VVN">created</w> <w lemma="equal" type="JJ">equal</w> <w lemma="." type="SENT">.</w></pos>
Each POS is represented by a different code. TreeTagger uses as many as 58 codes. Some of the less obscure are: CD for cardinal number, CC for coordinating conjunction, NN for noun, NNS for plural noun, JJ for adjective, VBP for a present-tense form of the verb to be (such as “are”), etc.
Using a slightly different version of the same trivial code, Lingua::TreeTagger can output a delimited stream where each line represents a record and the delimited values are words, lemmas, and POS. The first ten records from the sentence above are displayed below:
In the end I wrote a simple program — tag.pl — taking a file name as input and streaming to standard output the tagged text in delimited form. Executing the code and saving the output to a file is simple:
$ bin/tag.pl corpus/walden.txt > pos/walden.pos
Consequently, I now have a way to quickly and easily denote the POS for each word in a given plain text file.
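The conversion from tagged XML to delimited records can be illustrated with a short sketch. It is written in Python rather than Perl, it is not the code behind tag.pl, and it assumes a tab delimiter and the column order word, lemma, POS. The input is simply the first ten elements of the XML snippet above; the name `to_records` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# First ten <w> elements of the tagged Gettysburg sentence, copied from the XML snippet.
TAGGED = (
    '<pos>'
    '<w lemma="Four" type="CD">Four</w> '
    '<w lemma="score" type="NN">score</w> '
    '<w lemma="and" type="CC">and</w> '
    '<w lemma="seven" type="CD">seven</w> '
    '<w lemma="year" type="NNS">years</w> '
    '<w lemma="ago" type="RB">ago</w> '
    '<w lemma="our" type="PP$">our</w> '
    '<w lemma="father" type="NNS">fathers</w> '
    '<w lemma="bring" type="VVD">brought</w> '
    '<w lemma="forth" type="RB">forth</w>'
    '</pos>'
)


def to_records(xml_text, delimiter="\t"):
    """Return one 'word<delim>lemma<delim>POS' string per <w> element."""
    root = ET.fromstring(xml_text)
    return [
        delimiter.join((w.text or "", w.get("lemma", ""), w.get("type", "")))
        for w in root.iter("w")
    ]


if __name__ == "__main__":
    # Assumed layout: tab-delimited, one record per line.
    for record in to_records(TAGGED):
        print(record)
```

Each printed line carries the three values for one word, which is the shape the counting step expects as input.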
Counting and summarizing
Now that the POS of a given document are identified, the next step is to count and summarize them. Counting is something at which computers excel, and I wrote another program — summarize.pl — to do the work. The program’s input takes the following form:
summarize.pl <all|simple|other|pronouns|nouns|verbs|adverbs|adjectives> <t|l> <filename>
The first command line argument denotes what POS will be output: “all” denotes the POS defined by TreeTagger, and “simple” denotes TreeTagger POS mapped to the eight generally accepted POS of the English language. The use of “nouns”, “pronouns”, “verbs”, “adverbs”, and “adjectives” tells the script to output the tokens (words) or lemmas in each of these classes.
The second command line argument tells the script whether to tally tokens (words) or lemmas when counting specific items.
The last argument is the file to read, and it is expected to be in the form of tag.pl’s output.
Using summarize.pl to count the simple POS in Lincoln’s Address, the following output is generated:
$ summarize.pl simple t address.pos
In other words, of the 272 words found in the Gettysburg Address 41 are nouns, 29 are pronouns, 21 are adjectives, etc.
Using a different form of the script, a list of all the pronouns in the Address, sorted by the number of occurrences, can be generated:
$ summarize.pl pronouns t address.pos
In other words, the word “we” — a particular pronoun — was used 10 times in the Address.
Consequently, I now have a tool enabling me to count the POS in a document.
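To make the counting step concrete, here is a minimal Python sketch of the two modes described above: a tally of simple POS classes, and a tally of the tokens or lemmas inside one class. It is not summarize.pl. The mapping from TreeTagger codes to simple classes is my own assumption based on tag prefixes, the tab-delimited input layout is assumed, and all function names are hypothetical.

```python
from collections import Counter

# Assumed mapping from tag prefixes to simple classes, checked in order.
# Anything unmatched (CD, DT, SENT, punctuation, ...) falls into "other".
SIMPLE_PREFIXES = [
    ("NN", "noun"), ("NP", "noun"),
    ("PP", "pronoun"), ("WP", "pronoun"),
    ("V", "verb"),
    ("RB", "adverb"),
    ("JJ", "adjective"),
    ("IN", "preposition"), ("TO", "preposition"),
    ("CC", "conjunction"),
    ("UH", "interjection"),
]


def simple_pos(tag):
    """Map a tagger code to one of the simple classes, or 'other'."""
    for prefix, name in SIMPLE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return "other"


def parse(lines, delimiter="\t"):
    """Yield (token, lemma, tag) tuples from delimited records, skipping blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            token, lemma, tag = line.split(delimiter)
            yield token, lemma, tag


def count_simple(records):
    """Count how many records fall into each simple class."""
    return Counter(simple_pos(tag) for _, _, tag in records)


def tally(records, pos_class, use_lemma=False):
    """Count tokens (or lemmas) belonging to one simple class."""
    return Counter(
        lemma if use_lemma else token
        for token, lemma, tag in records
        if simple_pos(tag) == pos_class
    )


# The first ten words of the Gettysburg sentence, in the assumed layout.
SAMPLE = (
    "Four\tFour\tCD\nscore\tscore\tNN\nand\tand\tCC\nseven\tseven\tCD\n"
    "years\tyear\tNNS\nago\tago\tRB\nour\tour\tPP$\nfathers\tfather\tNNS\n"
    "brought\tbring\tVVD\nforth\tforth\tRB\n"
)

if __name__ == "__main__":
    records = list(parse(SAMPLE.splitlines()))
    for name, n in count_simple(records).most_common():
        print(name, n)
    print(tally(records, "noun", use_lemma=True).most_common())
```

The sketch leaves case untouched, so “We” and “we” would be counted separately; whether to fold case is a design choice the description above does not settle.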
I now have the tools necessary to answer one of my initial questions, “Do some works contain a greater number of nouns, verbs, and adjectives than others?” To answer this I collected nine sets of documents for analysis:
- Henry David Thoreau’s Excursions (73,734 words; Flesch readability score: 57)
- Henry David Thoreau’s Walden (106,486 words; Flesch readability score: 55)
- Henry David Thoreau’s A Week on the Concord and Merrimack Rivers (117,670 words; Flesch readability score: 56)
- Jane Austen’s Sense and Sensibility (119,625 words; Flesch readability score: 54)
- Jane Austen’s Northanger Abbey (76,497 words; Flesch readability score: 58)
- Jane Austen’s Emma (156,509 words; Flesch readability score: 60)
- all of the works of Plato (1,162,460 words; Flesch readability score: 54)
- all of the works of Aristotle (950,078 words; Flesch readability score: 50)
- all of the works of Shakespeare (856,594 words; Flesch readability score: 72)
Using tag.pl I created POS files for each set of documents. I then used summarize.pl to output counts of the simple POS from each POS file. For example, after creating a POS file for Walden, I summarized the results and learned that it contains 23,272 nouns, 10,068 pronouns, 8,118 adjectives, etc.:
$ summarize.pl simple t walden.pos
I then copied this information into a spreadsheet and calculated the relative percentage of each POS discovering that 19% of the words in Walden are nouns, 8% are pronouns, 7% are adjectives, etc. See the table below:
I repeated this process for each of the nine sets of documents and tabulated them here:
|interjection||0||0||0||0||0||0||0||0||0||0|
Percentage and average of parts-of-speech usage in 9 works or corpora
The result was very surprising to me. Despite the wide range of document sizes, and despite the wide range of genres, the relative percentages of POS are very similar across all of the documents. The last column in the table represents the average percentage of each POS use. Notice how each individual POS value differs very little from the average.
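The spreadsheet arithmetic is simple enough to sketch in Python. The first function turns raw counts into relative percentages; the other two compute the per-POS average across documents and the largest deviation from it. The only real figures used are the Gettysburg Address counts quoted earlier (41 nouns, 29 pronouns, and 21 adjectives out of 272 words). The two-document example at the end uses invented percentages purely to exercise the code, and all function names are hypothetical.

```python
def relative_percentages(counts, total=None):
    """Convert raw POS counts to percentages of `total` (default: sum of counts)."""
    total = sum(counts.values()) if total is None else total
    if total <= 0:
        raise ValueError("total must be positive")
    return {pos: 100.0 * n / total for pos, n in counts.items()}


def column_averages(tables):
    """Average each POS percentage across documents; a missing POS counts as 0."""
    labels = sorted({pos for table in tables.values() for pos in table})
    return {
        pos: sum(table.get(pos, 0.0) for table in tables.values()) / len(tables)
        for pos in labels
    }


def max_deviation(tables, averages):
    """Largest absolute gap between any document's value and the average, per POS."""
    return {
        pos: max(abs(table.get(pos, 0.0) - avg) for table in tables.values())
        for pos, avg in averages.items()
    }


if __name__ == "__main__":
    # Counts quoted in the text for the Gettysburg Address (272 words in all).
    address = relative_percentages(
        {"noun": 41, "pronoun": 29, "adjective": 21}, total=272
    )
    for pos, pct in address.items():
        print(f"{pos}: {pct:.1f}%")

    # Invented percentages for two hypothetical documents, for illustration only.
    tables = {"doc_a": {"noun": 20.0, "verb": 14.0}, "doc_b": {"noun": 22.0, "verb": 15.0}}
    averages = column_averages(tables)
    print(averages, max_deviation(tables, averages))
```

A small maximum deviation for every POS is the numerical form of the observation that the values cluster around the average.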
This analysis can be illustrated in a couple of ways. First, below are nine pie charts. Each slice of each pie represents a different POS. Notice how all the dark blue slices (nouns) are very similar in size. Notice how all the red slices (verbs), again, are very similar. The only noticeable exception is in Shakespeare where there is a greater number of nouns and pronouns (dark green).
[Figure: pie charts of POS usage, one per set of documents. Surviving panel titles: all of Plato; all of Aristotle; all of Shakespeare.]
The similarity across all the documents can be further illustrated with a line graph:
Across the X axis is each POS. Up and down the Y axis is the percentage of usage. Notice how the values for each POS in each document are closely clustered. Each set of documents uses roughly the same proportion of nouns, pronouns, verbs, adjectives, adverbs, etc.
Maybe such a relationship between POS is one of the patterns of well-written documents? Maybe it is representative of works standing the test of time? I don’t know, but I doubt I am the first person to make such an observation.
My initial questions were, “To what degree are there significant differences between authors’ and genres’ usage of various parts-of-speech?” and “Do some works contain a greater number of nouns, verbs, and adjectives than others?” Based on this foray and rudimentary analysis the answers are, “No, there are not significant differences, and no, works do not contain different proportions of nouns, verbs, adjectives, etc.”
Of course, such a conclusion is faulty without further calculations. I will quite likely commit an error of induction if I base my conclusions on a sample of only nine items. While it would require a greater amount of effort on my part, it is not beyond possibility for me to calculate the average POS usage for every item in my Alex Catalogue. I know there will be some differences — especially considering the items having gone through optical character recognition — but I do not know the degree of difference. Such an investigation is left for a later time.
Instead, I plan to pursue a different line of investigation. The current work examined how texts were constructed, but in actuality I am more interested in the meanings works express. I am interested in what they say more than how they say it. Such meanings may be gleaned not so much from gross POS measurements but rather the words used to denote each POS. For example, the following table lists the 10 most frequently used pronouns and the number of times they occur in four works. Notice the differences:
|Walden||Rivers||a work by Jane Austen||a second work by Jane Austen|
|I (1,809)||it (1,314)||her (1,554)||her (2,500)|
|it (1,507)||we (1,101)||I (1,240)||I (1,917)|
|my (725)||his (834)||she (1,089)||it (1,711)|
|he (698)||I (756)||it (1,081)||she (1,553)|
|his (666)||our (677)||you (906)||you (1,158)|
|they (614)||he (649)||he (539)||he (1,068)|
|their (452)||their (632)||his (524)||his (1,007)|
|we (447)||they (632)||they (379)||him (628)|
|its (351)||its (487)||my (342)||my (598)|
|who (340)||who (352)||him (278)||they (509)|
While the lists are similar, they are characteristic of the works from which they came. The first, Walden, is about an individual who lives on a lake. Notice the prominence of the words “I” and “my”. The second, Rivers, is written by the same author as the first but is about brothers who canoe down a river. Notice the higher occurrence of the words “we” and “our”. The latter two works, both written by Jane Austen, are works with females as central characters. Notice how the words “her” and “she” appear in these lists but not in the former two. (Compare these lists of pronouns with the list from Lincoln’s Address and even more interesting things appear.) It looks as if there are patterns or trends to be measured here.
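The comparison just described can be checked mechanically. The Python sketch below copies the four top-ten pronoun lists from the table (counts omitted) and reports which pronouns are distinctive to one list relative to a group of others. The labels for the two Austen columns are placeholders, since the table does not name the works, and the function name `distinctive` is made up for this illustration.

```python
# Top-ten pronoun lists copied from the table above, in rank order.
TOP_PRONOUNS = {
    "Walden": ["I", "it", "my", "he", "his", "they", "their", "we", "its", "who"],
    "Rivers": ["it", "we", "his", "I", "our", "he", "their", "they", "its", "who"],
    # Placeholder labels: the table does not say which Austen works these are.
    "Austen (first)": ["her", "I", "she", "it", "you", "he", "his", "they", "my", "him"],
    "Austen (second)": ["her", "I", "it", "she", "you", "he", "his", "him", "my", "they"],
}


def distinctive(target, others):
    """Words in `target` that appear in none of the `others` lists, in rank order."""
    seen = set().union(*others)
    return [word for word in target if word not in seen]


if __name__ == "__main__":
    thoreau = [TOP_PRONOUNS["Walden"], TOP_PRONOUNS["Rivers"]]
    austen = [TOP_PRONOUNS["Austen (first)"], TOP_PRONOUNS["Austen (second)"]]

    for title in ("Austen (first)", "Austen (second)"):
        print(title, "but not Thoreau:", distinctive(TOP_PRONOUNS[title], thoreau))
    for title in ("Walden", "Rivers"):
        print(title, "but not Austen:", distinctive(TOP_PRONOUNS[title], austen))
```

Only membership in the top ten is compared here; a fuller treatment would weigh the counts as well, since “I” ranks first in Walden but fourth in Rivers.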