
Term Frequency — Inverse Document Frequency

$$
TF_{ij} = \frac{f_{ij}}{n_j} \qquad IDF_{i} = 1 + \log\left(\frac{N}{c_i}\right) \qquad TFIDF_{ij} = TF_{ij} \times IDF_{i}
$$

Where:

  • $f_{ij}$ is the frequency of term $i$ in document $j$
  • $n_j$ is the total number of terms in document $j$
  • $N$ is the total number of documents
  • $c_i$ is the number of documents containing term $i$

TF-IDF is an improvement over “bag of words”: frequently repeated words (such as “stop words”) receive less weight.
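The three formulas above can be sketched directly in Python. The base-10 logarithm is an assumption here (it matches the numbers in the worked example below; natural log is also common):

```python
import math

def tf(f_ij, n_j):
    # term frequency: occurrences of the term divided by document length
    return f_ij / n_j

def idf(N, c_i):
    # inverse document frequency; base-10 log assumed to match the
    # worked example below (other bases are also used in practice)
    return 1 + math.log10(N / c_i)

def tfidf(f_ij, n_j, N, c_i):
    return tf(f_ij, n_j) * idf(N, c_i)
```

For example, a term appearing 2 times in a 4-term document, present in 2 of 3 documents, scores `tfidf(2, 4, 3, 2)` ≈ 0.588.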

Example

Original documents:

  1. abc abc def ghi
  2. abc def ghi jkl
  3. ghi jkl mno pqr

Converted to term frequency vectors (absolute):

|   | abc | def | ghi | jkl | mno | pqr |
|---|-----|-----|-----|-----|-----|-----|
| 1 | 2   | 1   | 1   | 0   | 0   | 0   |
| 2 | 1   | 1   | 1   | 1   | 0   | 0   |
| 3 | 0   | 0   | 1   | 1   | 1   | 1   |
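Counting absolute term frequencies can be sketched with `collections.Counter`:

```python
from collections import Counter

docs = ["abc abc def ghi", "abc def ghi jkl", "ghi jkl mno pqr"]
vocab = ["abc", "def", "ghi", "jkl", "mno", "pqr"]

# count term occurrences per document; Counter returns 0 for absent terms
counts = [Counter(d.split()) for d in docs]
vectors = [[c[t] for t in vocab] for c in counts]
# vectors[0] == [2, 1, 1, 0, 0, 0]
```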

Converted to term frequency vectors (relative):

|   | abc  | def  | ghi  | jkl  | mno  | pqr  |
|---|------|------|------|------|------|------|
| 1 | 0.5  | 0.25 | 0.25 | 0    | 0    | 0    |
| 2 | 0.25 | 0.25 | 0.25 | 0.25 | 0    | 0    |
| 3 | 0    | 0    | 0.25 | 0.25 | 0.25 | 0.25 |
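Relative frequencies are the absolute counts divided by document length, as a sketch:

```python
docs = ["abc abc def ghi", "abc def ghi jkl", "ghi jkl mno pqr"]
vocab = ["abc", "def", "ghi", "jkl", "mno", "pqr"]

tf_vectors = []
for d in docs:
    terms = d.split()
    # divide each term count by the total number of terms in the document
    tf_vectors.append([terms.count(t) / len(terms) for t in vocab])
# tf_vectors[0] == [0.5, 0.25, 0.25, 0.0, 0.0, 0.0]
```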

Document frequencies:

|     | abc   | def   | ghi | jkl   | mno   | pqr   |
|-----|-------|-------|-----|-------|-------|-------|
| DF  | 2     | 2     | 3   | 2     | 1     | 1     |
| IDF | 1.17… | 1.17… | 1.0 | 1.17… | 1.47… | 1.47… |
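Document frequencies and IDF values can be sketched as follows (base-10 log assumed, which reproduces the values above):

```python
import math

docs = ["abc abc def ghi", "abc def ghi jkl", "ghi jkl mno pqr"]
vocab = ["abc", "def", "ghi", "jkl", "mno", "pqr"]
N = len(docs)

# DF: in how many documents does each term appear
df = {t: sum(t in d.split() for d in docs) for t in vocab}
# IDF: 1 + log10(N / DF)
idf = {t: 1 + math.log10(N / df[t]) for t in vocab}
# df["abc"] == 2, idf["ghi"] == 1.0, idf["mno"] ≈ 1.477
```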

TF-IDF:

|   | abc    | def    | ghi  | jkl    | mno    | pqr    |
|---|--------|--------|------|--------|--------|--------|
| 1 | 0.585  | 0.2925 | 0.25 | 0      | 0      | 0      |
| 2 | 0.2925 | 0.2925 | 0.25 | 0.2925 | 0      | 0      |
| 3 | 0      | 0      | 0.25 | 0.2925 | 0.3675 | 0.3675 |
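Putting the steps together, a full-pipeline sketch (base-10 log assumed; the table above multiplies by IDF truncated to two decimals, e.g. 1.17, so full-precision results differ slightly, e.g. ≈ 0.588 instead of 0.585):

```python
import math

docs = ["abc abc def ghi", "abc def ghi jkl", "ghi jkl mno pqr"]
vocab = ["abc", "def", "ghi", "jkl", "mno", "pqr"]
N = len(docs)

# IDF per term: 1 + log10(N / document frequency)
idf = {t: 1 + math.log10(N / sum(t in d.split() for d in docs)) for t in vocab}

# TF-IDF: relative term frequency times IDF
tfidf_vectors = []
for d in docs:
    terms = d.split()
    tfidf_vectors.append([terms.count(t) / len(terms) * idf[t] for t in vocab])
```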

Finally, TF-IDF vectors can be compared using cosine similarity.
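A minimal cosine similarity sketch, applied to two TF-IDF vectors from the table above:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: dot product over norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# TF-IDF vectors of documents 1 and 3 from the table above
doc1 = [0.585, 0.2925, 0.25, 0, 0, 0]
doc3 = [0, 0, 0.25, 0.2925, 0.3675, 0.3675]

similarity = cosine_similarity(doc1, doc3)  # low: only "ghi" is shared
```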