Dynamic Time Warping: From Euclidean Distance to Optimal Alignment
You can find the code demo here
Drain: simgple but effective log parsing algorithm
Backgroup
Web service platforms generate a large volume of unstructured logs. However, machine learning and data analysis require structured input.
Therefore, extracting structured information from unstructured logs is a critical problem. A naive approach is to use regular expressions to extract structured information. However, this method has some drawbacks1
- The volume of logs is so large that manually crafting regex patterns is impractical.
- Logs may come from different components, each with its own log format.
The Drain algorithm was proposed in 2017. At that time, many log parsing algorithms focused on offline batch processing. However, logs in web service platforms are generated as a stream. Therefore, the drain algorithm focuses on online stream processing.
Suffix array: find the needle in the hay
Suffix array
By definition, a suffix array (denoted as sa) contains the starting indices of all suffixes. It’s simply an int array where sa[i] represents the starting index of the corresponding suffix.
Taking fizzbuzz as an example, its suffix array is 4 0 1 5 7 3 6 2. The details are shown in the following table.
Suffix array sa |
Corresponding suffix |
|---|---|
| 4 | buzz |
| 0 | fizzbuzz |
| 1 | izzbuzz |
| 5 | uzz |
| 7 | z |
| 3 | zbuzz |
| 6 | zz |
| 2 | zzbuzz |
How to generate a suffix array
Naive way
Let $N$ represent the length of the string. The naive algorithm is straightforward: generate all possible suffixes and then sort them using an efficient sorting algorithm. This requires one sort iteration, which may involve $O(N log\ N)$ comparisons. In the worst case, each string comparison takes $O(N)$ time. Therefore, the overall time complexity would be
Better TF-IDF: BM25
Intro
You probably encountered BM25 numerous times when reading papers from the LLM (RAG) or information retrieval domain. This algorithm is a ranking method that computes relevance scores for documents given a user query.
If you examine the BM25 formula carefully, you will notice its similarity to the classic TF-IDF. In fact, BM25 is an enhanced version of TF-IDF, as we’ll explore shortly.
The BM25
Before we dive into the mechanism of BM25, let’s establish some notations:
t-SNE + K-Means: Data visualization and Clustering
K-Means Algorithm
I recently used the K-Means algorithm for clustering in my work. While reviewing my notes on the topic, I decided to publish them online :)
K-Means is a clustering algorithm that assigns each sample to one of the $K$ clusters.
How does AK-Means algorithm work?
Data Preparation
K-Means algorithm relies on distance calculations, so data should be normalized to prevent features with larger scales from dominating the results. The normalization can be achieved by the Scikit-Learn library as follows.
Association Rule Mining: the Apriori Algorithm
Introduction
In my recent work, I’ve been analyzing the correlation between various features in Android APKs. These features include IP characteristics, URL attributes, permission settings, and more. Feature correlation refers to identifying relationships between different features from the data, such as certain feature combinations that frequently appear together.
Relying solely on manual analysis would be impractical given the massive dataset—that’s when I remembered an algorithm from my data mining course: the Apriori algorithm! :)
Async + Leaky Bucket: How to Batch LLM API Calls Efficiently
Background
Recently, at work, I’ve been working on setting up a LLM evaluation platform. There’s one particular scenario: we need to call an LLM API provided by another department to run model evaluations on a test dataset, but this LLM API has a rate limit of a maximum of 2 calls per second (2 RPS). Thus, my task essentially boils down to: How to maximize concurrency to speed up model evaluation while strictly adhering to the API rate limits. In this brief post, I will share my thoughts about approaching this task.
Programming with Categories: Functor
Intro
What’s a functor? You might use it daily without realizing it. For example, calling map on a collection means you’re using a functor.
This post explains functors from two perspectives: category theory and programming. By this end, you will have a deeper understanding of the concept.
Functor in category theory
This section assumes you know what a category is. If you haven’t heard of it before, think of a category as a collection of objects and the relationships between them.
Transformer architecture variation: Rotary Position Embedding (RoPE)
A Recap of Self-attention Mechanism
In self-attention, the query ($\mathbf q_m$), key ($\mathbf k_n$), and value ($\mathbf v_n$) vectors are computed as follows:
$$ \begin{aligned} \mathbf q_m&=f_q(\mathbf x_m,m)\\ \mathbf k_n&=f_k(\mathbf x_n,n)\\ \mathbf v_n&=f_v(\mathbf x_n,n) \end{aligned} $$
Here, the $\mathbf x_i$ is the $i$-th token embedding, while $n$ and $m$ denote different positions.
The attention score between position $m$ and $n$ is computed as:
$$ \alpha_{m,n}=\frac{exp(\frac{\mathbf q_m^T\mathbf k_n}{\sqrt d})}{\sum_{j=1}^Nexp(\frac{\mathbf q_m^T\mathbf k_j}{\sqrt d})} $$
Transformer architecture variation: RMSNorm
Intro
It’s been 8 years since the famous transformer architecture was first proposed. You might have noticed that some modifications to the original design - for instance, most large language models (LLMs) now use RMSNorm1 instead of LayerNorm. Today I will briefly introduce RMSNorm, but first, let’s recap LayerNorm.
LayerNorm Recap
$$ \mathbf y=\frac{\mathbf x-E[\mathbf x]}{\sqrt{Var(\mathbf x)+\epsilon}}*\gamma+\beta $$
The equation above shows how LayerNorm works. If we ignore the scaling factors ($\gamma, \beta$), LayerNorm’s behavior becomes intuitive: it transforms each input $\mathbf x$ into a feature vector with zero mean and unit standard deviation .
Kosaraju's Algorithm Explained
Intro
During my daily coding practice, I encountered an interesting problem - 1682. Flight Routes Check. Solving this problem requires finding all strongly connected components (SCCs) in a directed graph. After some research, I discovered Kosaraju’s algorithm, which solves this problem in linear time. That is, the time complexity is
$$ O(V+E) $$
Where $V$ refers to the nodes and $E$ refers to the edges in the graph.
By interesting, I mean that Kosaraju’s algorithm is easy to implement yet a bit tricky to understand fully. In my opinion, knowing why it works matters more than just memorizing how to code it. That’s why I’m sharing this short post - to break down the key insights.
One for all: the torch.einsum API
Motivations
In PyTorch, multiple APIs exist for matrix multiplication operations. However, these functions often lead to memorization challenges. Additionally, many of these APIs require explicit dimension manipulation (e.g., permuting, reshaping)
Does there exist a magic API that can cover all the use cases? A potential unified solution is the torch.einsum API.
What is torch.einsum ?
The syntax of torch.einsum is