Better TF-IDF: BM25
Intro
You have probably encountered BM25 many times when reading papers in the LLM (RAG) or information retrieval domains. BM25 is a ranking method that computes a relevance score for each document given a user query.
If you examine the BM25 formula carefully, you will notice its similarity to the classic TF-IDF. In fact, BM25 is an enhanced version of TF-IDF, as we’ll explore shortly.
The BM25 Formula
Before we dive into the mechanism of BM25, let’s establish some notation:
- $q$ represents the user query.
- $token_i$ represents the $i$-th token after tokenizing the user query $q$.
- $d$ represents a document in the corpus.
The relevance score $\texttt{score}(q,d)$ between a user query $q$ and a document $d$ is
$$ \texttt{score}(q,d)=\sum_i\texttt{IDF}(token_i)\cdot\frac{\texttt{freq}(token_i,d)\cdot(k_1+1)}{\texttt{freq}(token_i,d)+k_1\cdot\left(1-b+b\cdot\frac{|d|}{avgdl}\right)} $$
where
- $\texttt{freq}(token_i, d)$ represents the frequency of $token_i$ in document $d$.
- $\texttt{IDF}$ is the inverse document frequency.
- $avgdl$ is the average document length across the corpus.
- $k_1, b$ are the hyperparameters in BM25.
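To make the formula concrete, here is a minimal from-scratch sketch in Python. The smoothed IDF variant used here is an assumption on my part (BM25 implementations differ in the exact IDF formula), and the toy corpus is purely illustrative.

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query with BM25."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    score = 0.0
    for token in query_tokens:
        df = sum(1 for doc in corpus if token in doc)  # document frequency
        # Smoothed IDF; BM25 variants differ slightly in this formula.
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        freq = doc_tokens.count(token)
        denom = freq + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score

corpus = [doc.split() for doc in [
    "it is quite windy in london",
    "how is the weather today",
]]
print(bm25_score("windy london".split(), corpus[0], corpus))
```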
BM25 vs. TF-IDF
The BM25 formula looks a lot like the TF-IDF formula: a sum over query tokens of $\texttt{IDF}\cdot\texttt{TF}$. The difference lies in two improvements to the $\texttt{TF}$ term.
Improvement 1 - Non-linear TF with saturation
Let’s make the $\texttt{TF}$ term in BM25 a little simpler by temporarily ignoring the document-length factor (i.e., setting $b=0$):
$$ \frac{\texttt{freq}(token_i,d)\cdot(k_1+1)}{\texttt{freq}(token_i,d)+k_1\cdot(1-b+b\cdot\frac{|d|}{avgdl})}\approx \frac{\texttt{freq}(token_i,d)\cdot(k_1+1)}{\texttt{freq}(token_i,d)+k_1} $$
From this formula, we can observe that the $\texttt{TF}$ term has an upper bound. As $\texttt{freq}(token_i, d)\rightarrow\infty$, the value approaches the upper bound of $k_1+1$. This non-linearity is intentional: it ensures that relevance scores saturate rather than increase indefinitely with term frequency.
Let’s consider a concrete example. Suppose a document has 1000 tokens, and the term `foobar` appears either 10 times or 500 times.

If `foobar` appears 10 times:
- TF-IDF: $\texttt{TF}=10/1000=0.01$
- BM25 ($k_1=2$): $\texttt{TF}=10\cdot(2+1)/(10+2)=2.5$

If `foobar` appears 500 times:
- TF-IDF: $\texttt{TF}=500/1000=0.5$
- BM25 ($k_1=2$): $\texttt{TF}=500\cdot(2+1)/(500+2)\approx3$
TF-IDF implicitly assumes that a 50-fold increase in term frequency (500/10 = 50) should produce a corresponding 50-fold increase in relevance (0.5/0.01 = 50). With BM25’s saturation effect, where the curve is non-linear, the relevance contribution grows only 1.2-fold (3/2.5 = 1.2).
I’ve drawn a diagram to illustrate how $k_1$ influences the non-linear effects.
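If you’d like to reproduce such a diagram yourself, here is a small matplotlib sketch of the simplified $\texttt{TF}$ term for a few $k_1$ values (the particular values are my own picks for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

freq = np.linspace(0, 50, 500)  # term frequency
for k1 in [0.5, 1.2, 2.0, 5.0]:
    # Simplified BM25 TF term (b = 0, so no length normalization).
    tf = freq * (k1 + 1) / (freq + k1)
    plt.plot(freq, tf, label=f"$k_1={k1}$")

plt.xlabel("term frequency")
plt.ylabel("BM25 TF term")
plt.title("BM25 TF saturation for different $k_1$")
plt.legend()
plt.show()
```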
Improvement 2 - Penalizing long documents
We previously ignored this length-normalization term when explaining the first improvement:
$$ 1-b+b\cdot\frac{|d|}{avgdl} $$
The reasoning is that a longer document naturally has higher token frequencies, which should be penalized. That is exactly what this term does, with $b$ controlling the degree of penalization, as the quick numeric check after this list illustrates.
- If $b=0$, there is no penalization at all.
- If $b=1$, the term reduces to $|d|/avgdl$, i.e., full length normalization.
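Here’s a quick numeric check of the penalty factor (the average and document lengths below are made-up numbers for illustration):

```python
avgdl = 1000  # assumed average document length

for b in [0.0, 0.5, 0.75, 1.0]:
    for doc_len in [500, 1000, 2000]:
        norm = 1 - b + b * doc_len / avgdl
        # norm > 1 shrinks the TF term (penalty); norm < 1 boosts it.
        print(f"b={b:.2f} |d|={doc_len}: factor={norm:.2f}")
```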
Related Hyperparameters
$k_1$
$$ k_1\in[1.2, 2.0] $$
The parameter $k_1$ controls the saturation rate: the higher $k_1$ is, the more slowly the $\texttt{TF}$ term saturates.
$b$
$$ b\in[0,1] $$
The parameter $b$ controls the strength of the long-document penalization. It’s usually set to 0.75.
Python implementation - rank-bm25
Now let’s see how to use the BM25 algorithm in Python. I use the rank-bm25 library, and this is a runnable example from its homepage.
First, tokenize the corpus and make a BM25 model.
```python
from rank_bm25 import BM25Okapi

# Make BM25
corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
```
The user query should also be tokenized.
```python
# Handle query
query = "windy London"
tokenized_query = query.split(" ")
```
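Before retrieving documents, you can also inspect the raw per-document scores with the library’s `get_scores` method:

```python
# One BM25 score per corpus document; only the second
# document shares tokens with the query, so it scores highest.
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)
```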
You can retrieve the `n` most relevant documents with the `get_top_n` method.
```python
print(bm25.get_top_n(tokenized_query, corpus, n=1))
# ['It is quite windy in London']
```
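If you want to tune the hyperparameters discussed earlier, `BM25Okapi` also accepts them as constructor arguments (as far as I know, the defaults are `k1=1.5` and `b=0.75`):

```python
# Rebuild the model with a higher k1 (slower saturation)
# and stronger length penalization.
bm25_tuned = BM25Okapi(tokenized_corpus, k1=2.0, b=0.9)
print(bm25_tuned.get_top_n(tokenized_query, corpus, n=1))
```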