Comments on: TFIDF In Libraries: Part I of III (For Librarians)

By: Infomotions Mini-Musings » Blog Archive » Great Ideas Coefficient / Eric Lease Morgan

Sat, 27 Mar 2010 11:58:10 +0000

[…] they mentioned the “great ideas”. Such a thing can be done through the application of TFIDF. Here’s […]

By: Infomotions Mini-Musings » Blog Archive » Automatic metadata generation / Eric Lease Morgan

Fri, 31 Jul 2009 02:22:06 +0000

[…] but not extraordinarily well. I then learned about Term Frequency Inverse Document Frequency (TFIDF) to calculate “relevance”, and T-Score to calculate the probability of two words […]

By: Infomotions Mini-Musings » Blog Archive » Text mining: Books and Perl modules / Eric Lease Morgan

Thu, 04 Jun 2009 02:14:58 +0000

[…] my explorations of term frequency/inverse document frequency (TFIDF) I became aware of a relatively new field of study called text mining. In many ways, text mining is […]

By: Infomotions Mini-Musings » Blog Archive » TFIDF In Libraries: Part III of III (For thinkers) / Eric Lease Morgan

Sun, 31 May 2009 20:30:42 +0000

[…] is the third of the three-part series on the topic of TFIDF in libraries. In Part I the why’s and wherefore’s of TFIDF were outlined. In Part II TFIDF subroutines and […]

By: Infomotions Mini-Musings » Blog Archive » TFIDF In Libraries: Part II of III (For thinkers) / Eric Lease Morgan

Sun, 31 May 2009 20:28:25 +0000

[…] is the third of the three-part series on the topic of TFIDF in libraries. In Part I the why’s and wherefore’s of TFIDF were outlined. In Part II TFIDF subroutines and […]

By: Infomotions Mini-Musings » Blog Archive » TFIDF In Libraries: Part II of III (For programmers) / Eric Lease Morgan

Tue, 21 Apr 2009 02:42:42 +0000

[…] where relevancy ranking techniques are explored through a set of simple Perl programs. In Part I relevancy ranking was introduced and explained. In Part III additional word/document weighting […]

By: egarcia

egarcia — Tue, 14 Apr 2009 15:58:58 +0000

Hi, there:

I read your article with interest. Here are a few points worth mentioning:

1. IDF is defined as log(D/d), where D is the number of documents in a collection and d is the number of documents mentioning a given term, regardless of whether those documents are relevant to that term. The base of the log does not matter (it can be base 10, 2, etc.). Logs are taken because most scoring functions in IR are assumed to be additive and because terms are assumed to be independent from one another (even though this is often not exactly the case).

2. IDF is a measure of the discriminatory power of a term (term specificity), but it does not measure relevancy. Indeed, IDF is a term weight score in the absence of relevance information.

3. IDF is a small pixel in the bigger picture of the Robertson-Sparck Jones Probabilistic Model (RSJ-PM). A tutorial explaining this model is available at http://www.miislita.com/.

4. With unstructured, unfocused, and generic collections at the scale of the Web (e.g., those indexed by commercial search engines like Google), several authors have questioned the stability of IDF and its reliability as a scoring function.
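
To make point 1 concrete, here is a minimal Python sketch of IDF = log(D/d). The collection size, the document frequencies, and the helper name `idf` are all made up for illustration; none of them come from the article. The sketch only tries to show two things: that changing the log base rescales every score by the same constant (so the ordering of terms does not change), and that, if terms are treated as independent, the log turns a product of probabilities into a sum of IDF scores.

```python
import math


def idf(total_docs, docs_with_term, base=10):
    """Return log(D/d): D documents in the collection, d of them containing the term."""
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("need 0 < d <= D")
    return math.log(total_docs / docs_with_term, base)


# Hypothetical collection of D = 1000 documents with invented document frequencies.
D = 1000
doc_freq = {"the": 1000, "library": 100, "tfidf": 10}

idf10 = {term: idf(D, d, base=10) for term, d in doc_freq.items()}
idf2 = {term: idf(D, d, base=2) for term, d in doc_freq.items()}

# A different base multiplies every score by one constant, so the ranking is the same.
ranking10 = sorted(doc_freq, key=idf10.get, reverse=True)
ranking2 = sorted(doc_freq, key=idf2.get, reverse=True)

# Under the independence assumption, the chance that a document mentions both
# terms is the product (d_a/D) * (d_b/D); taking -log of it gives the sum of the IDFs.
joint = (doc_freq["library"] / D) * (doc_freq["tfidf"] / D)
summed = idf10["library"] + idf10["tfidf"]

if __name__ == "__main__":
    print("ranking (base 10):", ranking10)
    print("ranking (base 2): ", ranking2)
    print("sum of IDFs matches -log10(joint):", math.isclose(-math.log10(joint), summed))
```

Note that a term appearing in every document ("the" in this toy data) gets an IDF of zero, which is the sense in which IDF measures term specificity rather than relevance.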

Regards

Dr. Edel Garcia
http://www.miislita.com/