Async + Leaky Bucket: How to Batch LLM API Calls Efficiently

MartinLwx published on 2025-06-18 included in category ML-DL

Background

Recently, at work, I’ve been working on setting up a LLM evaluation platform. There’s one particular scenario: we need to call an LLM API provided by another department to run model evaluations on a test dataset, but this LLM API has a rate limit of a maximum of 2 calls per second (2 RPS). Thus, my task essentially boils down to: How to maximize concurrency to speed up model evaluation while strictly adhering to the API rate limits. In this brief post, I will share my thoughts about approaching this task.

Programming with Categories: Functor

MartinLwx published on 2025-06-01 included in category Programming-Languages

Intro

What’s a functor? You might use it daily without realizing it. For example, calling map on a collection means you’re using a functor.

This post explains functors from two perspectives: category theory and programming. By this end, you will have a deeper understanding of the concept.

Functor in category theory

Warning

This section assumes you know what a category is. If you haven’t heard of it before, think of a category as a collection of objects and the relationships between them.

Transformer architecture variation: Rotary Position Embedding (RoPE)

MartinLwx published on 2025-05-24 included in category ML-DL

A Recap of Self-attention Mechanism

In self-attention, the query ($\mathbf q_m$), key ($\mathbf k_n$), and value ($\mathbf v_n$) vectors are computed as follows:

$$ \begin{aligned} \mathbf q_m&=f_q(\mathbf x_m,m)\\ \mathbf k_n&=f_k(\mathbf x_n,n)\\ \mathbf v_n&=f_v(\mathbf x_n,n) \end{aligned} $$

Here, the $\mathbf x_i$ is the $i$-th token embedding, while $n$ and $m$ denote different positions.

The attention score between position $m$ and $n$ is computed as:

$$ \alpha_{m,n}=\frac{exp(\frac{\mathbf q_m^T\mathbf k_n}{\sqrt d})}{\sum_{j=1}^Nexp(\frac{\mathbf q_m^T\mathbf k_j}{\sqrt d})} $$

Transformer architecture variation: RMSNorm

MartinLwx published on 2025-05-11 included in category ML-DL

Intro

It’s been 8 years since the famous transformer architecture was first proposed. You might have noticed that some modifications to the original design - for instance, most large language models (LLMs) now use RMSNorm¹ instead of LayerNorm. Today I will briefly introduce RMSNorm, but first, let’s recap LayerNorm.

LayerNorm Recap

$$ \mathbf y=\frac{\mathbf x-E[\mathbf x]}{\sqrt{Var(\mathbf x)+\epsilon}}*\gamma+\beta $$

The equation above shows how LayerNorm works. If we ignore the scaling factors ($\gamma, \beta$), LayerNorm’s behavior becomes intuitive: it transforms each input $\mathbf x$ into a feature vector with zero mean and unit standard deviation .

Kosaraju's Algorithm Explained

MartinLwx published on 2025-04-26 included in category Algorithm

Intro

During my daily coding practice, I encountered an interesting problem - 1682. Flight Routes Check. Solving this problem requires finding all strongly connected components (SCCs) in a directed graph. After some research, I discovered Kosaraju’s algorithm, which solves this problem in linear time. That is, the time complexity is

$$ O(V+E) $$

Where $V$ refers to the nodes and $E$ refers to the edges in the graph.

By interesting, I mean that Kosaraju’s algorithm is easy to implement yet a bit tricky to understand fully. In my opinion, knowing why it works matters more than just memorizing how to code it. That’s why I’m sharing this short post - to break down the key insights.

One for all: the torch.einsum API

MartinLwx published on 2025-04-14 included in category ML-DL

Motivations

In PyTorch, multiple APIs exist for matrix multiplication operations. However, these functions often lead to memorization challenges. Additionally, many of these APIs require explicit dimension manipulation (e.g., permuting, reshaping)

Does there exist a magic API that can cover all the use cases? A potential unified solution is the torch.einsum API.

What is torch.einsum ?

The syntax of torch.einsum is

Two different APIs related to Process Pool in Python

MartinLwx published on 2025-03-30 included in category Programming-Languages

Intro

As a machine learning engineer, I work with vast amounts of data, performing tasks such as data cleaning and information extraction. These tasks are typically data-parallel and do not involve race conditions. Such workloads are usually referred to as CPU-intensive tasks. Typically, these tasks can be formulated as map(fn, data).

Python is limited by its Global Interpreter Lock (GIL). As a result, using multithreading could not improve the performance. Instead, multiprocessing should be used. Usually, you would not manually manage all the launched processes but would instead use a process pool. Today, I’d like to discuss two different APIs related to multiprocessing: multiprocessing.Pool and concurrent.futures.ProcessPoolExecutor, and provide a simple comparison.

Class Hierarchy Analysis: a quick way to generate call graph

MartinLwx published on 2025-03-19 included in category Program-Analysis

The key to call graph construction

For an OOP programming language, the key challenge in call graph construction is handling the virtual call, as it may involve multiple target methods, as shown in the following table¹.

	Static Call	Special Call	Virtual Call
Instruction	`invokestatic`	`invokespecial`	`invokeinterface, invokevirtual`
Receiver Objects	❌	✅	✅
Target Methods	Static Method	Constructor, Private Instance Method, Superclass Instance Method	Other Instance Method
Count of Possible Target Methods	1	1	$\ge 1$ (polymorphism)
Determinancy	Compile-time	Compile-time	Run-time

The method call and method signature

Let’s take Java as an example; a method call may have this form.

Prefix Sum Array: the secret to fast range sum query and more

MartinLwx published on 2025-03-15 included in category Algorithm

Motivations

There is a type of problem where you are given an array $arr$ and $Q$ queries. Each query is represented as $query(l, r)$, which asks for the sum of the elements in the subarray $[l, r]$, i.e., $arr[l] + arr[l + 1] + … + arr[r]$.

If we handle each query using a brute-force approach, the time complexity will be $O(N)$. Thus, solving $Q$ queries would require $O(NQ)$. Is there a more efficient approach?

Weight Tying in Language Models: A Technique to Parameter efficiency

MartinLwx published on 2025-03-11 included in category ML-DL

Intro

Quote

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation - Attention is All You Need, Section 3.4. Embeddings and Softmax¹

What is Multi-Head Attention (MHA)

MartinLwx published on 2025-03-04 included in category ML-DL

What’s Multi-Head Attention (MHA)

In last post I have explained how the self-attention mechanism works. Today let’s take a step further and explore multi-head attention (MHA), which is the full version of self-attention as described in the original paper¹. Since I have covered most of the foundation concepts in last post, this post will be short. :)

Previously, we mentioned that the self-attention mechanism has three import matrices.

$$ \mathbf Q,\mathbf K,\mathbf V\in\mathcal{R}^{n\times d} $$

An Explanation of Self-Attention mechanism in Transformer

MartinLwx published on 2025-03-02 included in category ML-DL

Info

Further reading: