Transformer architecture variation: RMSNorm

It’s been 8 years since the famous transformer architecture was first proposed. You might have noticed that some modifications have been made to the original design - for instance, most large language models (LLMs) now use RMSNorm[1] instead of LayerNorm. Today I will briefly introduce RMSNorm, but first, let’s recap LayerNorm.

$$ \mathbf y=\frac{\mathbf x-\mathrm{E}[\mathbf x]}{\sqrt{\mathrm{Var}(\mathbf x)+\epsilon}}\odot\gamma+\beta $$

The equation above shows how LayerNorm works. If we ignore the learnable scaling factors ($\gamma, \beta$), LayerNorm’s behavior becomes intuitive: it transforms each input $\mathbf x$ into a feature vector with zero mean and unit standard deviation.
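
To make this concrete, here is a minimal PyTorch sketch of LayerNorm (the function name `layer_norm` and the toy shapes are mine for illustration); it matches the built-in torch.nn.functional.layer_norm up to numerical tolerance.

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each vector over its last dimension to zero mean and
    # unit standard deviation, then apply the learnable scale and shift.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * gamma + beta

x = torch.randn(2, 8)
gamma, beta = torch.ones(8), torch.zeros(8)
out = layer_norm(x, gamma, beta)
assert torch.allclose(out, F.layer_norm(x, (8,), gamma, beta), atol=1e-6)
```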

Kosaraju's Algorithm Explained

During my daily coding practice, I encountered an interesting problem - 1682. Flight Routes Check. Solving this problem requires finding all strongly connected components (SCCs) in a directed graph. After some research, I discovered Kosaraju’s algorithm, which solves this problem in linear time. That is, the time complexity is

$$ O(V+E) $$

where $V$ is the number of nodes and $E$ is the number of edges in the graph.
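
Here is a minimal Python sketch of Kosaraju’s algorithm (the interface kosaraju(n, edges), returning a component id per node, is my own choice). It does one DFS pass to record finish order, then a second pass over the reversed graph; for large graphs, an iterative DFS would be needed to avoid Python’s recursion limit.

```python
from collections import defaultdict

def kosaraju(n, edges):
    # Build the graph and its reverse.
    graph, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)

    # Pass 1: DFS on the original graph, recording nodes by finish time.
    order, seen = [], [False] * n
    def dfs1(u):
        seen[u] = True
        for v in graph[u]:
            if not seen[v]:
                dfs1(v)
        order.append(u)
    for u in range(n):
        if not seen[u]:
            dfs1(u)

    # Pass 2: DFS on the reversed graph in reverse finish order;
    # every tree discovered here is one strongly connected component.
    comp = [-1] * n
    def dfs2(u, c):
        comp[u] = c
        for v in rev[u]:
            if comp[v] == -1:
                dfs2(v, c)
    c = 0
    for u in reversed(order):
        if comp[u] == -1:
            dfs2(u, c)
            c += 1
    return comp  # comp[u] is the SCC id of node u

print(kosaraju(4, [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]))  # [0, 0, 1, 1]
```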

One for all: the torch.einsum API

In PyTorch, multiple APIs exist for matrix multiplication. However, these functions can be hard to remember, and many of them require explicit dimension manipulation (e.g., permuting, reshaping).

Does there exist a magic API that can cover all these use cases? A potential unified solution is the torch.einsum API.

The syntax of torch.einsum is `torch.einsum(equation, *operands)`, where the equation string labels the dimensions of every operand and of the desired output.
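
Below are a few small examples (the tensor names and shapes are mine for illustration) showing how one equation string can replace several specialized calls:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 5)

# Plain matrix multiplication: (3, 4) x (4, 5) -> (3, 5), replacing torch.mm.
C = torch.einsum("ik,kj->ij", A, B)
assert torch.allclose(C, A @ B, atol=1e-6)

# Batched matrix multiplication, replacing torch.bmm.
X = torch.randn(8, 3, 4)
Y = torch.randn(8, 4, 5)
Z = torch.einsum("bik,bkj->bij", X, Y)
assert torch.allclose(Z, torch.bmm(X, Y), atol=1e-6)

# Transpose: just reorder the output labels - no .permute() needed.
At = torch.einsum("ij->ji", A)
assert torch.allclose(At, A.T)
```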

Two different APIs related to Process Pool in Python

As a machine learning engineer, I work with vast amounts of data, performing tasks such as data cleaning and information extraction. These tasks are typically data-parallel and do not involve race conditions. Such workloads are usually referred to as CPU-intensive tasks. Typically, these tasks can be formulated as `map(fn, data)`.

Python is limited by its Global Interpreter Lock (GIL). As a result, multithreading cannot improve the performance of such CPU-bound work; multiprocessing should be used instead. Usually, you would not manage all the launched processes manually but would instead use a process pool. Today, I’d like to discuss two different APIs related to multiprocessing: multiprocessing.Pool and concurrent.futures.ProcessPoolExecutor, and provide a simple comparison.
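
As a minimal side-by-side sketch (the worker function fn and the pool sizes are hypothetical), both APIs can run the same map(fn, data) workload:

```python
import concurrent.futures
import multiprocessing

def fn(x):
    # Stand-in for a CPU-bound task, e.g. cleaning one record.
    return x * x

data = range(10)

if __name__ == "__main__":
    # Option 1: the older multiprocessing.Pool API.
    with multiprocessing.Pool(processes=4) as pool:
        results1 = pool.map(fn, data)

    # Option 2: the newer concurrent.futures API (map returns an iterator).
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        results2 = list(executor.map(fn, data))

    assert results1 == results2  # [0, 1, 4, 9, ...]
```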

Class Hierarchy Analysis: a quick way to generate call graphs

For an object-oriented programming language, the key challenge in call graph construction is handling virtual calls, since a single call site may resolve to multiple target methods, as shown in the following table[1].

|  | Static Call | Special Call | Virtual Call |
| --- | --- | --- | --- |
| Instruction | invokestatic | invokespecial | invokeinterface, invokevirtual |
| Receiver Objects | ✗ | ✓ | ✓ |
| Target Methods | Static Method | Constructor, Private Instance Method, Superclass Instance Method | Other Instance Method |
| Count of Possible Target Methods | 1 | 1 | $\ge 1$ (polymorphism) |
| Determinacy | Compile-time | Compile-time | Run-time |

Let’s take Java as an example; a method call may have the form `receiver.method(args)`. For a virtual call, the method that actually runs depends on the run-time type of the receiver. Class Hierarchy Analysis (CHA) resolves such a call site conservatively: it takes the declared type of the receiver and collects the matching methods declared in that class and in all of its subclasses.
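
To make the idea concrete, here is a toy Python sketch of CHA-style resolution; the class hierarchy, the declares table, and cha_resolve are hypothetical illustrations rather than a real analyzer:

```python
# A toy class hierarchy: Shape declares draw(), Circle overrides it,
# Square inherits it from Shape.
subclasses = {"Shape": ["Circle", "Square"], "Circle": [], "Square": []}
declares = {"Shape": {"draw"}, "Circle": {"draw"}, "Square": set()}

def cha_resolve(declared_type, method):
    # CHA: collect the method from the declared type and from every
    # subclass that (re)declares it - a conservative over-approximation.
    targets, stack = set(), [declared_type]
    while stack:
        cls = stack.pop()
        if method in declares[cls]:
            targets.add(f"{cls}.{method}")
        stack.extend(subclasses[cls])
    return targets

print(cha_resolve("Shape", "draw"))  # {'Shape.draw', 'Circle.draw'}
```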

Prefix Sum Array: the secret to fast range sum query and more

There is a type of problem where you are given an array $arr$ and $Q$ queries. Each query is represented as $query(l, r)$, which asks for the sum of the elements in the subarray $[l, r]$, i.e., $arr[l] + arr[l+1] + \dots + arr[r]$.

If we handle each query using a brute-force approach, the time complexity will be $O(N)$ per query. Thus, solving $Q$ queries would require $O(NQ)$. Is there a more efficient approach?
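
There is: precompute a prefix sum array once, and every query becomes two array reads. Here is a minimal Python sketch (the helper names build_prefix and range_sum are my own):

```python
def build_prefix(arr):
    # prefix[i] holds the sum of arr[0..i-1]; prefix[0] = 0.
    prefix = [0] * (len(arr) + 1)
    for i, x in enumerate(arr):
        prefix[i + 1] = prefix[i] + x
    return prefix

def range_sum(prefix, l, r):
    # Sum of arr[l..r] (inclusive): O(1) per query after O(N) preprocessing.
    return prefix[r + 1] - prefix[l]

arr = [3, 1, 4, 1, 5, 9]
prefix = build_prefix(arr)
print(range_sum(prefix, 1, 3))  # 1 + 4 + 1 = 6
```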

What is Multi-Head Attention (MHA)?

In the last post, I explained how the self-attention mechanism works. Today, let’s take a step further and explore multi-head attention (MHA), which is the full version of self-attention as described in the original paper[1]. Since I covered most of the foundational concepts in the last post, this post will be short. :)

Previously, we mentioned that the self-attention mechanism has three important matrices: the query, key, and value projections $W_Q$, $W_K$, and $W_V$.
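
As a preview of where we are headed, here is a minimal PyTorch sketch of MHA (no masking or dropout; the weight shapes and function name are my own illustration): each head attends independently, and the heads are then concatenated and projected back.

```python
import torch

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    B, T, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input, then split the model dimension into heads.
    def split(t):
        return t.view(B, T, num_heads, d_head).transpose(1, 2)  # (B, h, T, d_head)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)

    # Scaled dot-product attention, computed for all heads in parallel.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (B, h, T, T)
    out = torch.softmax(scores, dim=-1) @ v           # (B, h, T, d_head)

    # Concatenate the heads and apply the output projection.
    return out.transpose(1, 2).reshape(B, T, d_model) @ Wo

d_model, num_heads = 64, 8
x = torch.randn(2, 10, d_model)
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads).shape)  # (2, 10, 64)
```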

The Flow of GraphRAG

Current RAG techniques cannot answer global questions about a corpus. For example, we may want to know the main topic of the corpus. Usually, the answer does not exist verbatim anywhere in the corpus; producing it requires understanding the whole corpus and summarizing it. Such global questions are called query-focused summarization (QFS) problems in the paper[1]. A naive RAG technique cannot handle this situation.