TF-IDF model

In the previous post, we talked about the bag-of-words model, which has many limitations. Today we take a step further and see if we can fix one of them: every word is treated as equally important.

💡 The crux of the problem: how do we define a word's importance?

One idea is: The more frequently a word appears within a single document, the more important it is for that document. For instance, in an article discussing dogs, the word “dog” is likely to appear frequently, reflecting the document’s main topic.

But what if a word appears very frequently in all documents? For example, the word “of” may appear quite often in every document. Can we say “of” is important? Clearly not. So we have a clue here: if a word has a high frequency in every document, it is probably not significant and does not convey much information.

Therefore, a reasonable solution should consider a word's frequency within a single document while also taking into account its frequency across multiple documents. TF-IDF balances these two aspects.

In summary, the intuition behind TF-IDF is: similar documents may use similar words, but the importance of different words should vary.

TF-IDF combines Term Frequency (TF) and Inverse Document Frequency (IDF). Let's break it down into these two parts.

The TF can be seen as a function of document $d$ and word $w$, and the equation is:

$$\text{TF}(w, d)=\frac{\text{frequency of}\ w\ \text{in}\ d}{\text{word counts of } d}$$

That is, we just need to calculate the frequency of word $w$ in the document $d$, and then divide it by the total number of words in $d$.
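For example (the numbers are made up purely for illustration), if the word “dog” appears 3 times in a 100-word document $d$, then:

$$\text{TF}(\text{dog}, d)=\frac{3}{100}=0.03$$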

🐛 In Scikit-Learn, the computation of TF is a bit different. It doesn’t involve dividing by the total number of words in the document. The purpose of dividing is to normalize the $\text{TF}(w, d)$ values within the document $d$, making them add up to one. In Scikit-Learn, this normalization process is performed after the TF-IDF calculation. We will demonstrate this with an example later.

The goal of IDF is to reduce the importance of common words that appear in almost every document. Therefore, IDF is a function of the word $w$ and the $corpus$:

$$ \text{IDF}(w, corpus)=\log\frac{\text{document count of }corpus}{1+\text{count of documents which contain }w} $$

We add one in the denominator to avoid division by 0.
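For example, in a corpus of 4 documents where only 1 document contains the word “dog”:

$$\text{IDF}(\text{dog}, corpus)=\log\frac{4}{1+1}=\log 2$$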

🤔️ The $corpus$ is generally fixed, so it can be treated as a constant. In that case, IDF can be considered a function of the word $w$ alone.

💡 Note the $\log$ here. Are we using $\log_2$, $\log_{10}$, or $\ln$? Different frameworks may vary. Scikit-Learn uses $\ln$.

🐛 In Scikit-Learn, the calculation of IDF differs from the equation mentioned above. By default (with `smooth_idf=True`), Scikit-Learn uses the following formula:

$$ \text{IDF}(w, corpus)=\log\frac{1+\text{document count of }corpus}{1+\text{count of documents which contain }w} + 1 $$

🤔️ In my opinion, the modification made by Scikit-Learn ensures that $\text{IDF}(w)$ cannot be less than 1. With the original equation, if a word $w$ appears in every document within the corpus, $\text{IDF}(w)$ could be negative. Therefore, Scikit-Learn's modification seems more practical: it provides a more intuitive comparison of the IDF values of different words.
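For instance, in a 4-document corpus, a word $w$ that appears in all 4 documents gets $\text{IDF}(w)=\log\frac{1+4}{1+4}+1=1$ under Scikit-Learn's formula, whereas the original formula would give $\log\frac{4}{1+4}<0$.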

Putting the two parts together:

$$ \text{TF-IDF}(w, d, corpus)=\text{TF}(w, d) \times \text{IDF}(w, corpus) $$

Implementing the TF-IDF algorithm is straightforward. In practice, however, you will probably just use the well-established APIs provided by Scikit-Learn. So here, we will delve into how TF-IDF is calculated in Scikit-Learn, continuing with the official example:

```python
toy_corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tokenized_toy_corpus = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'is', 'the', 'second', 'second', 'document'],
    ['and', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]
```

Let's retrieve the TF-IDF matrix using the API:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
# set norm=None to disable the final normalization step, for easier comparison
vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(toy_corpus)
```

We can access the TF-IDF matrix using `X.toarray()` (note that I set `norm=None` in the code snippet above).
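The table below is just that matrix with columns labeled by the learned vocabulary. Roughly, it can be reproduced like this (assuming a recent Scikit-Learn version where `get_feature_names_out` is available):

```python
print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

# each row corresponds to document1 ... document4
print(X.toarray().round(5))
```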

|           | and     | document | first   | is      | one     | second  | the | third   | this    |
|-----------|---------|----------|---------|---------|---------|---------|-----|---------|---------|
| document1 | 0.0     | 1.22314  | 1.51082 | 1.22314 | 0.0     | 0.0     | 1.0 | 0.0     | 1.22314 |
| document2 | 0.0     | 1.22314  | 0.0     | 1.22314 | 0.0     | 3.83258 | 1.0 | 0.0     | 1.22314 |
| document3 | 1.91629 | 0.0      | 0.0     | 0.0     | 1.91629 | 0.0     | 1.0 | 1.91629 | 0.0     |
| document4 | 0.0     | 1.22314  | 1.51082 | 1.22314 | 0.0     | 0.0     | 1.0 | 0.0     | 1.22314 |

Let me also put the bag-of-words matrix here for comparison:

|           | and | document | first | is | one | second | the | third | this |
|-----------|-----|----------|-------|----|-----|--------|-----|-------|------|
| document1 | 0   | 1        | 1     | 1  | 0   | 0      | 1   | 0     | 1    |
| document2 | 0   | 1        | 0     | 1  | 0   | 2      | 1   | 0     | 1    |
| document3 | 1   | 0        | 0     | 0  | 1   | 0      | 1   | 1     | 0    |
| document4 | 0   | 1        | 1     | 1  | 0   | 0      | 1   | 0     | 1    |

🤔️ Comparing these two matrices, we can see that the importance of “document” and “first” within document1 now differs: the TF-IDF value of “document” is 1.22314, while the TF-IDF value of “first” is 1.51082, because these two words appear in different numbers of documents across the corpus. The bag-of-words model fails to capture this and assigns both an importance of 1.

We can retrieve the IDF value of each word by accessing the `idf_` attribute:

```python
print(vectorizer.idf_)
# [1.91629073 1.22314355 1.51082562 1.22314355 1.91629073 
#  1.91629073 1.         1.91629073 1.22314355]
```

🤔️ If we multiply this IDF vector with the matrix output of the bag-of-words model (note that the IDF vector will be broadcast across the rows), we obtain exactly the TF-IDF matrix calculated by Scikit-Learn; see the short verification snippet after the list below. This confirms what we mentioned earlier:

  • Scikit-Learn directly uses the output of the bag-of-words model as TF.
  • Scikit-Learn’s IDF calculation differs from the standard approach.
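Here is a quick check of that claim (a minimal sketch; it reuses `toy_corpus`, `vectorizer`, and `X` from the snippets above):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# the raw bag-of-words counts, i.e. what Scikit-Learn uses as TF
bow = CountVectorizer().fit_transform(toy_corpus).toarray()

# broadcast the IDF vector over the rows of the bag-of-words matrix
manual_tfidf = bow * vectorizer.idf_

print(np.allclose(manual_tfidf, X.toarray()))  # True
```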

Now let's implement it ourselves. We assume that each document within the corpus is already tokenized, and we follow Scikit-Learn's definition of TF-IDF.

🐛 The code below is not optimized, just for demonstration :)

```python
import math


def TF(word: str, tokenized_document: list[str]) -> float:
    # Scikit-Learn style TF: the raw count of `word` in the document,
    # without dividing by the total number of words
    return tokenized_document.count(word)


def IDF(word: str, tokenized_corpus: list[list[str]]) -> float:
    # count how many documents contain `word`
    doc_count_contains_word = 0
    for doc in tokenized_corpus:
        if word in doc:
            doc_count_contains_word += 1

    # smoothed IDF, as used by Scikit-Learn by default
    return math.log((1 + len(tokenized_corpus)) / (1 + doc_count_contains_word)) + 1


def TF_IDF(
    word: str, tokenized_document: list[str], tokenized_corpus: list[list[str]]
) -> float:
    return TF(word, tokenized_document) * IDF(word, tokenized_corpus)
```
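We can spot-check a couple of entries against the TF-IDF table above:

```python
# "first": TF = 1 in document1, and it appears in 2 of the 4 documents,
# so the result is 1 * (ln(5/3) + 1) ≈ 1.5108
print(TF_IDF('first', tokenized_toy_corpus[0], tokenized_toy_corpus))

# "second": TF = 2 in document2, and it appears in only 1 document,
# so the result is 2 * (ln(5/2) + 1) ≈ 3.8326
print(TF_IDF('second', tokenized_toy_corpus[1], tokenized_toy_corpus))
```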

In this post, we investigated how TF-IDF improves on the bag-of-words model: it additionally considers a word's frequency across all documents to adjust the word's importance. Through the analysis of the official Scikit-Learn example, we have also verified that the importance of different words now differs. This stands as a noteworthy enhancement over the bag-of-words model.

From an implementation perspective, calculating TF-IDF only requires traversing the corpus once. During the traversal, a few additional variables are maintained: the number of documents each word appears in, the total number of processed documents, etc. You may take a look at the source code of the `Dictionary` class in Gensim for more details.
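As a rough sketch of that idea (this is not Gensim's actual code, just an illustration of the single-pass bookkeeping):

```python
from collections import defaultdict


def corpus_statistics(tokenized_corpus: list[list[str]]) -> tuple[int, dict[str, int]]:
    num_docs = 0
    dfs: dict[str, int] = defaultdict(int)  # word -> number of documents containing it
    for doc in tokenized_corpus:
        num_docs += 1
        for word in set(doc):  # count each word at most once per document
            dfs[word] += 1
    return num_docs, dfs
```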