One for all: the torch.einsum API
Motivations
In PyTorch, there are multiple APIs for matrix multiplication. However, they are hard to memorize, and many of them require explicit dimension manipulation (e.g., permuting, reshaping).
Is there a single "magic" API that can cover all of these use cases? A potential unified solution is the `torch.einsum` API.
What is torch.einsum?
The syntax of `torch.einsum` is `torch.einsum(equation, *operands)`, where `equation` is a string describing the contraction and `operands` are the input tensors.
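As a quick illustration (a minimal sketch, not part of the original excerpt), the snippet below shows how a single equation string can express products that would otherwise need separate calls or an explicit transpose:

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)
c = torch.randn(5, 4)
batch_a = torch.randn(2, 3, 4)
batch_b = torch.randn(2, 4, 5)

# Plain matrix multiplication, equivalent to a @ b.
mm = torch.einsum("ij,jk->ik", a, b)

# Batched matrix multiplication, equivalent to torch.bmm(batch_a, batch_b).
bmm = torch.einsum("bij,bjk->bik", batch_a, batch_b)

# Multiply by the transpose of c without an explicit permute/transpose call.
mt = torch.einsum("ij,kj->ik", a, c)
```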
Two different APIs related to Process Pool in Python
Intro
As a machine learning engineer, I work with vast amounts of data, performing tasks such as data cleaning and information extraction. These tasks are typically data-parallel and do not involve race conditions. Such workloads are usually referred to as CPU-intensive tasks. Typically, these tasks can be formulated as `map(fn, data)`.
Python is limited by its Global Interpreter Lock (GIL). As a result, multithreading cannot improve the performance of such tasks; multiprocessing should be used instead. Usually, you would not manage all the launched processes manually but would instead use a process pool. Today, I'd like to discuss two different APIs related to multiprocessing, `multiprocessing.Pool` and `concurrent.futures.ProcessPoolExecutor`, and provide a simple comparison.
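To make the comparison concrete, here is a minimal sketch (the task function and worker counts are placeholders) that runs the same `map(fn, data)` workload through both APIs:

```python
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

def fn(x):
    # Stand-in for a CPU-intensive, data-parallel task.
    return x * x

data = list(range(10))

if __name__ == "__main__":
    # Option 1: multiprocessing.Pool
    with Pool(processes=4) as pool:
        results_pool = pool.map(fn, data)

    # Option 2: concurrent.futures.ProcessPoolExecutor
    with ProcessPoolExecutor(max_workers=4) as executor:
        results_executor = list(executor.map(fn, data))

    assert results_pool == results_executor
```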
Class Hierarchy Analysis: a quick way to generate call graphs
The key to call graph construction
For an object-oriented programming language, the key challenge in call graph construction is handling virtual calls, as a single call site may resolve to multiple target methods, as shown in the following table1.
|  | Static Call | Special Call | Virtual Call |
|---|---|---|---|
| Instruction | `invokestatic` | `invokespecial` | `invokeinterface`, `invokevirtual` |
| Receiver Objects | ❌ | ✅ | ✅ |
| Target Methods | Static Method | Constructor, Private Instance Method, Superclass Instance Method | Other Instance Method |
| Count of Possible Target Methods | 1 | 1 | $\ge 1$ (polymorphism) |
| Determinacy | Compile-time | Compile-time | Run-time |
The method call and method signature
Let’s take Java as an example; a method call may have this form.
Prefix Sum Array: the secret to fast range sum query and more
Motivations
There is a type of problem where you are given an array $arr$ and $Q$ queries. Each query is represented as $query(l, r)$, which asks for the sum of the elements in the subarray $[l, r]$, i.e., $arr[l] + arr[l + 1] + \dots + arr[r]$.
If we handle each query with a brute-force scan, each query takes $O(N)$ time, so answering all $Q$ queries requires $O(NQ)$. Is there a more efficient approach?
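The post's title already hints at the answer, so here is a minimal sketch of the idea (the sample array is a placeholder): build a prefix-sum array once in $O(N)$, then answer each range-sum query in $O(1)$.

```python
def build_prefix_sum(arr):
    # prefix[i] holds the sum of arr[0..i-1]; prefix[0] is 0.
    prefix = [0] * (len(arr) + 1)
    for i, x in enumerate(arr):
        prefix[i + 1] = prefix[i] + x
    return prefix

def range_sum(prefix, l, r):
    # Sum of arr[l..r] (inclusive), answered in O(1).
    return prefix[r + 1] - prefix[l]

arr = [3, 1, 4, 1, 5, 9, 2, 6]
prefix = build_prefix_sum(arr)
assert range_sum(prefix, 2, 5) == 4 + 1 + 5 + 9
```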
Weight Tying in Language Models: A Technique for Parameter Efficiency
Intro
> In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation. (Attention Is All You Need, Section 3.4, Embeddings and Softmax1)
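A minimal PyTorch sketch of that idea (not taken from the paper or the post; the sizes are arbitrary placeholders): the pre-softmax projection simply reuses the embedding matrix, so the two layers share one parameter tensor.

```python
import torch.nn as nn

vocab_size, d_model = 10000, 512  # placeholder sizes

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Weight tying: the output projection shares the (vocab_size, d_model)
# weight tensor with the input embedding.
lm_head.weight = embedding.weight

assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```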
What is Multi-Head Attention (MHA)
What’s Multi-Head Attention (MHA)
In the last post, I explained how the self-attention mechanism works. Today, let's take a step further and explore multi-head attention (MHA), which is the full version of self-attention as described in the original paper1. Since I covered most of the foundational concepts in the last post, this post will be short. :)
Previously, we mentioned that the self-attention mechanism has three important matrices.
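As a refresher (a minimal single-head sketch, not taken from the post; the sizes are placeholders), the three matrices project the input into queries, keys, and values, which are then combined by scaled dot-product attention:

```python
import math
import torch
import torch.nn as nn

d_model = 512  # placeholder model dimension

# The three projection matrices: queries, keys, and values.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(2, 10, d_model)  # (batch, sequence length, d_model)
Q, K, V = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention (single head).
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)
out = scores.softmax(dim=-1) @ V  # (2, 10, d_model)
```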
An Explanation of Self-Attention mechanism in Transformer
Further reading:
From Basic Block to Control Flow Graph
Note: Three-Address Code is the basis of the Basic Block (BB), and the Basic Block is the foundation of the Control Flow Graph (CFG). Therefore, before reading this post, it is recommended that you first understand Three-Address Code. You may refer to my previous post.
What is Three-Address Code (3AC/TAC)
Further reading
The Flow of GraphRAG
Motivation
Current RAG techniques cannot answer global questions about a corpus. For example, we may want to know the overall topic of the corpus. Usually, the answer is not stated explicitly anywhere in the corpus; producing it requires understanding the whole corpus and summarizing it. Such global questions are called query-focused summarization (QFS) problems in this paper1. A naive RAG technique cannot handle such a situation.
Reading Notes: Outrageously Large Neural Networks-The Sparsely-Gated Mixture-of-Experts Layer
Motivations
A model's performance is related to its parameter count: the bigger the model is, the more performant it tends to be. However, the computational cost also increases. To mitigate this problem, various forms of conditional computation have been proposed to increase model performance without a proportional increase in computational cost1.
Today I would like to share the Sparsely-Gated Mixture-of-Experts Layer (MoE) as proposed in this paper.1

MoE architecture
There are $n$ experts in the MoE layer (denoted $E_1, E_2, \dots, E_n$), and they are controlled by a gating network $G$. The output of the gating network $G$ is a vector of length $n$.
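The paper's layer uses sparse (top-k) gating, but the core combination rule is $y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$. Below is a minimal dense-gating sketch of that rule (layer sizes and expert definitions are placeholders, not the paper's configuration):

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    # A minimal, densely gated Mixture-of-Experts layer.
    def __init__(self, d_in, d_out, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)  # gating network G

    def forward(self, x):
        # G(x): a length-n vector of expert weights for each input.
        gate_scores = self.gate(x).softmax(dim=-1)                        # (batch, n)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, d_out, n)
        # y = sum_i G(x)_i * E_i(x)
        return (expert_outs * gate_scores.unsqueeze(1)).sum(dim=-1)

moe = SimpleMoE(d_in=16, d_out=16, n_experts=4)
y = moe(torch.randn(8, 16))  # (8, 16)
```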
What is the Python decorator really?
Intro
If you can understand this statement: a Python function is a first-class function, then I believe you will have no problem understanding Python decorators either. This statement means that a function is also a value, just like any other primitive type (int, str, float, etc.), and can be passed as an argument to a function or returned as a function's output.
You may have heard of the technical term higher-order function, which refers to a function that takes a function as an argument and/or returns a function. So we know that a Python decorator is a kind of higher-order function.
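To make that concrete, here is a minimal sketch (the decorator and the decorated function are made-up examples): `log_calls` takes a function and returns a new function that wraps it, which is exactly what the `@` syntax applies.

```python
import functools

def log_calls(fn):
    # A decorator is a higher-order function: it accepts a function
    # and returns a new function that wraps it.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"calling {fn.__name__} with {args} {kwargs}")
        return fn(*args, **kwargs)
    return wrapper

@log_calls  # equivalent to: add = log_calls(add)
def add(a, b):
    return a + b

add(1, 2)  # prints the call, then returns 3
```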