Term Frequency — Inverse Document Frequency
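$$\text{tf-idf}(t, d) = \frac{f_{t,d}}{n_d} \cdot \left(1 + \log_{10} \frac{N}{\mathrm{df}_t}\right)$$

Where:
- $f_{t,d}$ is the frequency of term $t$ in document $d$
- $n_d$ is the total number of terms in document $d$
- $N$ is the total number of documents
- $\mathrm{df}_t$ is the number of documents containing term $t$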
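The definitions above can be turned into a small function. This is a minimal sketch, not a reference implementation: the function name `tf_idf` and its parameter names are illustrative, and the `1 + log10(N / df)` form of IDF is an assumption chosen because it reproduces the IDF values in the example that follows (about 1.17 for a term found in 2 of 3 documents). Other TF-IDF variants use a different logarithm base or smoothing.

```python
import math


def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """TF-IDF of one term in one document.

    term_count: occurrences of the term in the document
    doc_length: total number of terms in the document
    num_docs:   total number of documents
    doc_freq:   number of documents containing the term
    """
    if term_count == 0 or doc_freq == 0:
        return 0.0  # a term that is absent contributes no weight
    tf = term_count / doc_length               # relative term frequency
    idf = 1 + math.log10(num_docs / doc_freq)  # assumed IDF variant
    return tf * idf


# Term occurring 2 times in a 4-term document, found in 2 of 3 documents.
print(round(tf_idf(2, 4, 3, 2), 3))  # 0.588
```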
TF-IDF is an improvement over plain “bag of words” counts: words that appear in many documents (such as “stop words”) receive less weight.
Example
Original documents:
1. abc abc def ghi
2. abc def ghi jkl
3. ghi jkl mno pqr
Converted to term frequency vectors (absolute):
Doc | abc | def | ghi | jkl | mno | pqr |
---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | 0 | 0 | 0 |
2 | 1 | 1 | 1 | 1 | 0 | 0 |
3 | 0 | 0 | 1 | 1 | 1 | 1 |
Converted to term frequency vectors (relative):
Doc | abc | def | ghi | jkl | mno | pqr |
---|---|---|---|---|---|---|
1 | 0.5 | 0.25 | 0.25 | 0 | 0 | 0 |
2 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 |
3 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.25 |
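The absolute and relative term frequency vectors can be reproduced with a few lines of Python. This sketch assumes whitespace tokenization and an alphabetically sorted vocabulary; the variable names are illustrative.

```python
from collections import Counter

docs = ["abc abc def ghi", "abc def ghi jkl", "ghi jkl mno pqr"]

# Split each document on whitespace and build a sorted vocabulary.
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})

absolute = []  # raw counts per document
relative = []  # counts divided by document length
for tokens in tokenized:
    counts = Counter(tokens)  # missing terms count as 0
    absolute.append([counts[term] for term in vocab])
    relative.append([counts[term] / len(tokens) for term in vocab])

print(vocab)
for row in absolute:
    print(row)
for row in relative:
    print(row)
```

Dividing by document length keeps long documents from dominating simply because they contain more words.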
Document frequencies:
Measure | abc | def | ghi | jkl | mno | pqr |
---|---|---|---|---|---|---|
DF | 2 | 2 | 3 | 2 | 1 | 1 |
IDF | 1.176 | 1.176 | 1.0 | 1.176 | 1.477 | 1.477 |
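The DF and IDF rows can be computed as sketched below. The IDF variant `1 + log10(N / df)` is an assumption inferred from the values in the table (a term present in every document gets exactly 1.0); libraries often use the natural logarithm or add smoothing instead.

```python
import math

docs = ["abc abc def ghi", "abc def ghi jkl", "ghi jkl mno pqr"]

# Use sets so that each document counts a term at most once.
term_sets = [set(doc.split()) for doc in docs]
vocab = sorted(set().union(*term_sets))
n_docs = len(docs)

# Document frequency: number of documents containing the term.
df = {term: sum(term in terms for terms in term_sets) for term in vocab}

# Inverse document frequency (assumed variant, base-10 logarithm).
idf = {term: 1 + math.log10(n_docs / df[term]) for term in vocab}

for term in vocab:
    print(term, df[term], round(idf[term], 3))
```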
TF-IDF:
Doc | abc | def | ghi | jkl | mno | pqr |
---|---|---|---|---|---|---|
1 | 0.588 | 0.294 | 0.25 | 0 | 0 | 0 |
2 | 0.294 | 0.294 | 0.25 | 0.294 | 0 | 0 |
3 | 0 | 0 | 0.25 | 0.294 | 0.369 | 0.369 |
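Putting the steps together, the whole TF-IDF matrix can be sketched in one function. The name `tf_idf_matrix` is hypothetical, and the sketch assumes whitespace tokenization, non-empty documents, and the `1 + log10(N / df)` IDF variant. Because it keeps full precision for IDF, results can differ in the third decimal from a hand calculation that rounds IDF first.

```python
import math
from collections import Counter


def tf_idf_matrix(docs):
    """Return (vocab, matrix) with one TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]  # assumes non-empty documents
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Count each term once per document to get document frequencies.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: 1 + math.log10(n_docs / df[term]) for term in vocab}

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Relative term frequency times IDF for every vocabulary term.
        matrix.append([counts[term] / len(tokens) * idf[term] for term in vocab])
    return vocab, matrix


docs = ["abc abc def ghi", "abc def ghi jkl", "ghi jkl mno pqr"]
vocab, matrix = tf_idf_matrix(docs)
print(vocab)
for row in matrix:
    print([round(weight, 3) for weight in row])
```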
Finally, TF-IDF vectors can be compared with cosine similarity, for example to measure how similar two documents are.
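As an illustration, the sketch below compares the three example documents by the cosine of the angle between their TF-IDF vectors. The function name `cosine_similarity` is illustrative, and the hard-coded vectors are the example weights recomputed with unrounded IDF and rounded to three decimals.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# TF-IDF vectors over the vocabulary abc, def, ghi, jkl, mno, pqr.
doc1 = [0.588, 0.294, 0.25, 0.0, 0.0, 0.0]
doc2 = [0.294, 0.294, 0.25, 0.294, 0.0, 0.0]
doc3 = [0.0, 0.0, 0.25, 0.294, 0.369, 0.369]

print("doc1 vs doc2:", cosine_similarity(doc1, doc2))
print("doc2 vs doc3:", cosine_similarity(doc2, doc3))
print("doc1 vs doc3:", cosine_similarity(doc1, doc3))
```

Documents 1 and 2 share three terms and score highest, while documents 1 and 3 share only `ghi`, the least informative term, and score lowest.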