<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	>
<channel>
	<title>
	Comments on: TFIDF In Libraries: Part III of III (For thinkers)	</title>
	<atom:link href="./index.html" rel="self" type="application/rss+xml" />
	<link>./../index.html</link>
	<description>Artist- and Librarian-At-Large</description>
	<lastBuildDate>
	Sat, 04 Jun 2016 18:04:58 +0000	</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.1.8</generator>
			<item>
				<title>
				By: Eric Lease Morgan				</title>
				<link>./../comment-page-1/index.html#comment-1712</link>
		<dc:creator><![CDATA[Eric Lease Morgan]]></dc:creator>
		<pubDate>Thu, 20 May 2010 01:53:55 +0000</pubDate>
		<guid isPermaLink="false">./../../../../index.html?p=286#comment-1712</guid>
					<description><![CDATA[Allen Chen brought to my attention the reason why my compare subroutine did not return scores of 1000 when documents were duplicated in the corpus. See &quot;(Argg! Something is incorrect with my trigonometry. When I duplicate a document and run compare.pl the resulting cosine similarity value between the exact same documents is 540, not 1000. What am I doing wrong?)&quot; above.

Upon closer examination of the definition of Cosine Similarity, he realized that my compare subroutine included too many cosine functions. After I edited the subroutine and duplicated a document in the corpus, it returned the correct value of 1000 for identical documents. (Actually, it sometimes returns scores of 999, which I&#039;m going to chalk up to rounding errors.)

&quot;Thank you, Allen.&quot;

Now a new problem presents itself. Specifically, the similarity scores for all the other documents are upside down:

&lt;pre&gt;
  Comparison: scores closer to 1000 approach similarity

      d1    d2   d3   d4   d5   d6

  d1   -   396  459  538  541  320
  d2   -    -   478  247  334  240
  d3   -    -    -   312  304  265
  d4   -    -    -    -   694  438
  d5   -    -    -    -    -   367
  d6   -    -    -    -    -    - 

  d1 = aristotle.txt
  d2 = hegel.txt
  d3 = kant.txt
  d4 = librarianship.txt
  d5 = mississippi.txt
  d6 = plato.txt&lt;/pre&gt;

Previously, hegel.txt (d2) and plato.txt (d6) were considered very similar, but now they are almost opposites. Something is still not correct, and I sincerely have no idea where to begin looking for a solution.

I have updated the downloadable scripts, but the compare subroutine in them is still broken.]]></description>
		<content:encoded><![CDATA[<p>Allen Chen brought to my attention the reason why my compare subroutine did not return scores of 1000 when documents were duplicated in the corpus. See &#8220;(Argg! Something is incorrect with my trigonometry. When I duplicate a document and run compare.pl the resulting cosine similarity value between the exact same documents is 540, not 1000. What am I doing wrong?)&#8221; above.</p>
<p>Upon closer examination of the definition of Cosine Similarity, he realized that my compare subroutine included too many cosine functions. After I edited the subroutine and duplicated a document in the corpus, it returned the correct value of 1000 for identical documents. (Actually, it sometimes returns scores of 999, which I&#8217;m going to chalk up to rounding errors.)</p>
<p>&#8220;Thank you, Allen.&#8221;</p>
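<p>For reference, here is a minimal sketch of the textbook definition of cosine similarity, written in Python rather than the Perl of compare.pl. It is not the code from the downloadable scripts. The function names, the dictionary-of-weights representation, and the scaling to 1000 are all assumptions made for illustration. The point is that the dot product divided by the product of the two vector lengths is already the cosine, so no further cosine function needs to be applied to it.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors.

    a and b are dicts mapping a term to its weight (for example a TFIDF
    score). This layout is an assumption, not the one used in compare.pl.
    """
    # Dot product over the terms of a; terms missing from b contribute 0.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())

    # Euclidean length of each vector.
    length_a = math.sqrt(sum(weight * weight for weight in a.values()))
    length_b = math.sqrt(sum(weight * weight for weight in b.values()))

    # An empty document has no direction, so treat it as similar to nothing.
    if length_a == 0.0 or length_b == 0.0:
        return 0.0

    # This ratio is the cosine itself; do not pass it through cos() again.
    return dot / (length_a * length_b)


def score(a, b):
    """Scale the similarity to an integer where 1000 means identical."""
    return round(1000 * cosine_similarity(a, b))


doc = {"love": 0.2, "war": 0.5}
print(score(doc, doc))                 # 1000
print(score({"a": 1.0}, {"b": 1.0}))   # 0
```

<p>The sketch rounds rather than truncates when scaling. Truncating a value that sits just under 1000 because of floating-point error would give 999, which may be one way such scores arise, though that has not been checked against the scripts here.</p>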
<p>Now a new problem presents itself. Specifically, the similarity scores for all the other documents are upside down:</p>
<pre>
  Comparison: scores closer to 1000 approach similarity

      d1    d2   d3   d4   d5   d6

  d1   -   396  459  538  541  320
  d2   -    -   478  247  334  240
  d3   -    -    -   312  304  265
  d4   -    -    -    -   694  438
  d5   -    -    -    -    -   367
  d6   -    -    -    -    -    - 

  d1 = aristotle.txt
  d2 = hegel.txt
  d3 = kant.txt
  d4 = librarianship.txt
  d5 = mississippi.txt
  d6 = plato.txt</pre>
<p>Previously, hegel.txt (d2) and plato.txt (d6) were considered very similar, but now they are almost opposites. Something is still not correct, and I sincerely have no idea where to begin looking for a solution.</p>
<p>I have updated the downloadable scripts, but the compare subroutine in them is still broken.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				By: Infomotions Mini-Musings &#187; Blog Archive &#187; TFIDF In Libraries: Part I of III (For Librarians) / Eric Lease Morgan				</title>
				<link>./../comment-page-1/index.html#comment-1041</link>
		<dc:creator><![CDATA[Infomotions Mini-Musings &#187; Blog Archive &#187; TFIDF In Libraries: Part I of III (For Librarians) / Eric Lease Morgan]]></dc:creator>
		<pubDate>Sun, 31 May 2009 20:32:21 +0000</pubDate>
		<guid isPermaLink="false">./../../../../index.html?p=286#comment-1041</guid>
					<description><![CDATA[[...] system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of [...]]]></description>
		<content:encoded><![CDATA[<p>[&#8230;] system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of [&#8230;]</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				By: Infomotions Mini-Musings &#187; Blog Archive &#187; TFIDF In Libraries: Part II of III (For programmers) / Eric Lease Morgan				</title>
				<link>./../comment-page-1/index.html#comment-1040</link>
		<dc:creator><![CDATA[Infomotions Mini-Musings &#187; Blog Archive &#187; TFIDF In Libraries: Part II of III (For programmers) / Eric Lease Morgan]]></dc:creator>
		<pubDate>Sun, 31 May 2009 20:31:43 +0000</pubDate>
		<guid isPermaLink="false">./../../../../index.html?p=286#comment-1040</guid>
					<description><![CDATA[[...] through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techniques will be explored to the end of filtering search [...]]]></description>
		<content:encoded><![CDATA[<p>[&#8230;] through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting techniques will be explored to the end of filtering search [&#8230;]</p>
]]></content:encoded>
						</item>
			</channel>
</rss>
