Better TF-IDF: BM25

You have probably encountered BM25 numerous times when reading papers from the LLM (RAG) or information retrieval domain. It is a ranking algorithm that computes relevance scores for documents given a user query.

If you examine the BM25 formula carefully, you will notice its similarity to the classic TF-IDF. In fact, BM25 is an enhanced version of TF-IDF, as we’ll explore shortly.

Before we dive into the mechanics of BM25, let’s establish some notation:

  • $q$ represents the user query.
  • $token_i$ represents the $i$-th token after tokenizing the user query $q$.
  • $d$ represents a document in the corpus.

The relevance score $\texttt{score}(q,d)$ between a user query $q$ and a document $d$ is

$$ \texttt{score}(q,d)=\sum_i\texttt{IDF}(token_i)\cdot\frac{\texttt{freq}(token_i,d)\cdot(k_1+1)}{\texttt{freq}(token_i,d)+k_1\cdot(1-b+b\cdot\frac{|d|}{avgdl})} $$

where

  • $\texttt{freq}(token_i, d)$ represents the frequency of $token_i$ in document $d$.
  • $\texttt{IDF}$ is the inverse document frequency.
  • $avgdl$ is the average document length across the corpus.
  • $k_1, b$ are the hyperparameters in BM25.
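To make the formula concrete, here is a minimal pure-Python sketch of the score. The whitespace tokenization and the particular smoothed IDF variant are my own assumptions, not rank-bm25’s exact implementation.

import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    # Minimal sketch of the formula above; real libraries differ in details.
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    freqs = Counter(doc_tokens)

    score = 0.0
    for token in query_tokens:
        # number of corpus documents containing the token
        df = sum(1 for d in corpus_tokens if token in d)
        # one common smoothed IDF variant (an assumption; several exist)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = freqs[token]
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

# Tiny illustrative usage
corpus = [doc.split() for doc in ["the cat sat", "dogs chase the cat", "an unrelated sentence"]]
print(bm25_score("the cat".split(), corpus[0], corpus))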

The BM25 formula looks like the TF-IDF formula :)

Let’s simplify the $\texttt{TF}$ term in BM25 a little by ignoring the document-length normalization for now:

$$ \frac{\texttt{freq}(token_i,d)\cdot(k_1+1)}{\texttt{freq}(token_i,d)+k_1\cdot(1-b+b\cdot\frac{|d|}{avgdl})}\approx \frac{\texttt{freq}(token_i,d)\cdot(k_1+1)}{\texttt{freq}(token_i,d)+k_1} $$

From this formula, we can observe that the $\texttt{TF}$ term has an upper bound. As $\texttt{freq}(token_i, d)\rightarrow\infty$, the value approaches the upper bound $k_1+1$. This non-linearity is intentional: it ensures that relevance scores saturate rather than growing indefinitely with term frequency.

Let’s consider a concrete example using the simplified $\texttt{TF}$ term. Suppose a document has 1000 tokens, and the term foobar appears either 10 times or 500 times.

  • foobar appears 10 times.
    • TF-IDF: $\texttt{TF}=10/1000=0.01$
    • BM25 ($k_1=2$): $\texttt{TF}=10\cdot(2+1)/(10+2)=2.5$
  • foobar appears 500 times.
    • TF-IDF: $\texttt{TF}=500/1000=0.5$
    • BM25 ($k_1=2$): $\texttt{TF}=500\cdot(2+1)/(500+2)\approx3$

TF-IDF implicitly assumes that a 50-fold increase in term frequency (500/10 = 50) should result in a corresponding 50-fold increase in relevance (0.5/0.01 = 50). With BM25, the saturation effect of the non-linear curve means the score only grows by a factor of about 1.2 (3/2.5 = 1.2).
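You can check the arithmetic above with a tiny sketch of the simplified $\texttt{TF}$ term (length normalization dropped, $k_1=2$ as in the example; the function name is mine):

def bm25_tf(freq, k1=2.0):
    # simplified BM25 TF term: freq * (k1 + 1) / (freq + k1)
    return freq * (k1 + 1) / (freq + k1)

doc_len = 1000
for freq in (10, 500):
    print(freq, freq / doc_len, round(bm25_tf(freq), 3))
# 10  -> TF-IDF TF = 0.01, BM25 TF = 2.5
# 500 -> TF-IDF TF = 0.5,  BM25 TF = 2.988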

I’ve drawn a diagram to illustrate how $k_1$ influences the non-linear effects.
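If you want to plot these curves yourself, a quick matplotlib sketch (my own, not the original figure) of the simplified TF term for a few $k_1$ values could look like this:

import numpy as np
import matplotlib.pyplot as plt

freq = np.linspace(0, 50, 500)
for k1 in (0.5, 1.2, 2.0, 5.0):
    # simplified BM25 TF term (length normalization dropped)
    plt.plot(freq, freq * (k1 + 1) / (freq + k1), label=f"k1={k1}")
plt.xlabel("term frequency")
plt.ylabel("BM25 TF term")
plt.legend()
plt.show()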

We ignored this part of the denominator earlier, when explaining the first improvement:

$$ 1-b+b\cdot\frac{|d|}{avgdl} $$

The reason is that a longer document naturally has higher token frequencies, which should be penalized. That’s exactly what this term does, and $b$ controls the degree of penalization (see the short sketch after the list below).

  • If $b=0$, there is no penalization at all.
  • If $b=1$, this term equals $|d|/avgdl$.
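Here is a small sketch of that normalization factor, assuming an average document length of 500 tokens (the numbers are only illustrative):

def length_norm(doc_len, avgdl, b):
    # the 1 - b + b * |d| / avgdl factor from the TF denominator
    return 1 - b + b * doc_len / avgdl

avgdl = 500
for b in (0.0, 0.75, 1.0):
    # a document twice the average length vs. an average-length one
    print(b, length_norm(1000, avgdl, b), length_norm(500, avgdl, b))
# b=0.0  -> 1.0,  1.0   (no penalization)
# b=0.75 -> 1.75, 1.0
# b=1.0  -> 2.0,  1.0   (full |d|/avgdl penalty)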

$$ k_1\in[1.2, 2.0] $$

The parameter $k_1$ controls the saturation rate: the higher $k_1$ is, the slower the saturation occurs (closer to the top-left).

$$ b\in[0,1] $$

The parameter $b$ controls the strength of the long-document penalization. It’s usually set to 0.75.

Now let’s see how to use the BM25 algorithm in Python. I use the rank-bm25 library, and this is a runnable example from its homepage.

First, tokenize the corpus and make a BM25 model.

from rank_bm25 import BM25Okapi

# A toy corpus of three short documents
corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]

# Whitespace tokenization keeps the example simple
tokenized_corpus = [doc.split(" ") for doc in corpus]

# Build the BM25 model from the tokenized corpus
bm25 = BM25Okapi(tokenized_corpus)

The user query should also be tokenized.

# Handle query
query = "windy London"
tokenized_query = query.split(" ")
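You can also inspect the raw BM25 score of every document in the corpus with the get_scores method.

# One score per document, in the original corpus order
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)
# Only the second document contains "windy" and "London",
# so it is the only one with a non-zero score.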

You can retrieve the n most relevant documents with the get_top_n method.

print(bm25.get_top_n(tokenized_query, corpus, 1))
# ['It is quite windy in London']
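If you want to experiment with the hyperparameters discussed earlier, BM25Okapi also accepts them as constructor arguments (the defaults are k1=1.5 and b=0.75 in the version I’ve used; check your installed release).

# Same corpus, but with explicit hyperparameters
bm25_tuned = BM25Okapi(tokenized_corpus, k1=2.0, b=0.75)
print(bm25_tuned.get_top_n(tokenized_query, corpus, n=1))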