By: Eric Lease Morgan

Eric Lease Morgan — Thu, 20 May 2010 01:53:55 +0000

Allen Chen brought to my attention the reason why my compare subroutine did not return scores of 1000 when documents were duplicated in the corpus. See “(Argg! Something is incorrect with my trigonometry. When I duplicate a document and run compare.pl the resulting cosine similarity value between the exact same documents is 540, not 1000. What am I doing wrong?)” above.

Upon closer examination of the definition of cosine similarity, he realized that my compare subroutine included too many cosine functions. After editing the subroutine and duplicating a document in the corpus, the correct value of 1000 is returned for identical documents. (Actually, it sometimes returns 999, which I’m going to chalk up to rounding errors.)
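For reference, here is a minimal sketch of the corrected calculation, written in Python rather than the Perl of compare.pl. The function names, the dictionary representation of a document, and the sample weights are all assumptions made for illustration; this is not the code in the downloadable scripts. The last line shows one way an extra cosine call could produce the 540 seen earlier, since cos(1.0) is roughly 0.5403.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

def score(a, b):
    """Scale the similarity to 0..1000; int() truncates rather than rounds."""
    return int(cosine_similarity(a, b) * 1000)

# Hypothetical term weights, for illustration only.
doc = {"love": 0.3, "war": 0.1, "justice": 0.7}
duplicate = dict(doc)
unrelated = {"river": 0.5, "steamboat": 0.4}

same = score(doc, duplicate)        # identical documents
different = score(doc, unrelated)   # no terms in common

# Assumed form of the earlier bug: a cosine taken of a value that is already a cosine.
extra_cosine = int(math.cos(cosine_similarity(doc, duplicate)) * 1000)
```

In this sketch the score is truncated with int(), so a floating-point result a hair under 1.0 would come out as 999 instead of 1000. That may or may not be what happens in the Perl script.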

“Thank you, Allen.”

Now a new problem presents itself. Specifically, the similarity scores for all the other documents are upside down:

  Comparison: scores closer to 1000 approach similarity

      d1    d2   d3   d4   d5   d6

  d1   -   396  459  538  541  320
  d2   -    -   478  247  334  240
  d3   -    -    -   312  304  265
  d4   -    -    -    -   694  438
  d5   -    -    -    -    -   367
  d6   -    -    -    -    -    - 

  d1 = aristotle.txt
  d2 = hegel.txt
  d3 = kant.txt
  d4 = librarianship.txt
  d5 = mississippi.txt
  d6 = plato.txt

Previously, hegel.txt (d2) and plato.txt (d6) were considered very similar, but now they are almost opposites. Something is still not correct, and I sincerely have no idea where to begin looking for a solution.
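One possible place to start looking, offered as a guess rather than a diagnosis: between 0 and 1 the cosine function decreases, so if the earlier subroutine took the cosine of an already-computed similarity, it would have reversed the ranking. Under that assumption the old scores, not the new ones, were the ones upside down. The Python sketch below applies an extra cosine to three scores from the table above. The variable names are hypothetical and this is not the code in compare.pl.

```python
import math

# Scores taken from the table above, keyed by document pair.
new_scores = {("d4", "d5"): 694, ("d1", "d2"): 396, ("d2", "d6"): 240}

# Hypothetical reconstruction of the earlier behavior: cosine applied once too often.
old_scores = {
    pair: int(math.cos(value / 1000) * 1000)
    for pair, value in new_scores.items()
}

# Rank the pairs from most to least similar under each scoring.
new_ranking = sorted(new_scores, key=new_scores.get, reverse=True)
old_ranking = sorted(old_scores, key=old_scores.get, reverse=True)
```

If this guess is right, hegel.txt and plato.txt would have scored above 950 before the fix only because their true similarity was the lowest in the table.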

I have updated the downloadable scripts, but the compare subroutine in them is still broken.

By: Infomotions Mini-Musings » Blog Archive » TFIDF In Libraries: Part I of III (For Librarians) / Eric Lease Morgan

Sun, 31 May 2009 20:32:21 +0000

[…] system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of […]

By: Infomotions Mini-Musings » Blog Archive » TFIDF In Libraries: Part II of III (For programmers) / Eric Lease Morgan

Sun, 31 May 2009 20:31:43 +0000

[…] through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techniques will be explored to the end of filtering search […]

Comments on: TFIDF In Libraries: Part III of III (For thinkers)
