I uploaded version 0.03 of Lingua::EN::Bigram to CPAN today, and it now supports not just bigrams, trigrams, and quadgrams, but ngrams, that is, phrases of an arbitrary length.
To test it out, I quickly gathered some of my more recent essays, concatenated them, and applied Lingua::EN::Bigram against the result. Below is a list of the top 10 most common bigrams, trigrams, and quadgrams:
bigrams                  trigrams                   quadgrams
52 great ideas           36 the number of           25 the number of times
43 open source           36 open source software    13 the total number of
38 source software       32 as well as              10 at the same time
29 great books           28 number of times         10 number of words in
24 digital humanities    27 the use of              10 when it comes to
23 good man              25 the great books         10 total number of documents
22 full text             23 a set of                10 open source software is
22 search results        20 eric lease morgan       9 number of times a
20 lease morgan          20 a number of             9 as well as the
20 eric lease            19 total number of         9 through the use of
This is not surprising, since I have been writing about the Great Books, digital humanities, indexing, and open source software. Reaffirming.
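For readers who want to see the idea behind such counts, here is a minimal sketch in Python of tallying phrases of an arbitrary length. It is not how Lingua::EN::Bigram is implemented, and it does not use the module's interface; the function names (ngrams, top_ngrams) and the tokenization rule (lowercase runs of letters, digits, and apostrophes) are assumptions made only for illustration, so counts from the module may differ.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as lowercase strings.

    Tokenization here is an assumption: words are runs of letters,
    digits, and apostrophes; everything else is treated as a separator.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_ngrams(text, n, limit=10):
    """Return the `limit` most frequent n-word phrases as (phrase, count) pairs."""
    return Counter(ngrams(text, n)).most_common(limit)


if __name__ == "__main__":
    # Hypothetical sample text standing in for the concatenated essays.
    sample = "open source software is great and open source software is free"
    for n in (2, 3, 4):
        print(f"{n}-word phrases:")
        for phrase, count in top_ngrams(sample, n, limit=3):
            print(count, phrase)
```

Because the phrase length is just a parameter, the same two functions cover bigrams, trigrams, quadgrams, and longer phrases.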
Lingua::EN::Bigram is available locally as well as from CPAN.
Where can I get large lists of these bigrams, trigrams, and quadgrams in ASCII format?
Frank460, I do not know of places where you can download bigrams, trigrams, etc. Lingua::EN::Bigram is designed to generate such things though. Consider reading a particular Research Blog posting by Google: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html