Weight Tying in Language Models: A Technique for Parameter Efficiency
Intro
"In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation." (Attention Is All You Need, Section 3.4 "Embeddings and Softmax"1)
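To make this concrete, here is a minimal PyTorch sketch of weight tying (my own illustration, not code from the paper or the post); the class and variable names are placeholders:

```python
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Toy LM head that shares its weights with the input embedding."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the pre-softmax projection reuses the embedding matrix,
        # saving vocab_size * d_model parameters.
        self.proj.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids)   # (batch, seq, d_model)
        # ... transformer blocks would go here ...
        return self.proj(h)         # (batch, seq, vocab_size) logits
```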
What is Multi-Head Attention (MHA)
In the last post, I explained how the self-attention mechanism works. Today, let's take a step further and explore multi-head attention (MHA), which is the full version of self-attention as described in the original paper1. Since I covered most of the foundational concepts in the last post, this post will be short. :)
Previously, we mentioned that the self-attention mechanism has three important matrices.
$$ \mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{n\times d} $$
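For a concrete picture of the shapes involved, here is a minimal PyTorch sketch of multi-head attention (my own illustration, not the post's code); it splits $d$ evenly across $h$ heads and omits the learned per-head projections and the final output projection:

```python
import torch
import torch.nn.functional as F

def multi_head_attention(Q, K, V, h):
    """Q, K, V: (n, d) tensors; h: number of heads, with d divisible by h."""
    n, d = Q.shape
    d_head = d // h
    # Split the model dimension into h heads: (h, n, d_head).
    q = Q.view(n, h, d_head).transpose(0, 1)
    k = K.view(n, h, d_head).transpose(0, 1)
    v = V.view(n, h, d_head).transpose(0, 1)
    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (h, n, n)
    weights = F.softmax(scores, dim=-1)
    out = weights @ v                                   # (h, n, d_head)
    # Concatenate the heads back into a single (n, d) output.
    return out.transpose(0, 1).reshape(n, d)
```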
An Explanation of the Self-Attention Mechanism in the Transformer
Further reading:
From Basic Block to Control Flow Graph
Note: The Three-Address Code is the basis of the Basic Block (BB), and the Basic Block is the foundation of the Control Flow Graph (CFG). Therefore, before reading this post, it's recommended that you first understand the Three-Address Code. You may refer to my previous post:
What is Three-Address Code (3AC/TAC)
Further reading
The Flow of GraphRAG
Motivation
Current RAG techniques cannot answer global questions about the corpus. For example, we may want to know what the topic of the corpus is. Usually, the answer does not exist in any single passage of the corpus; it requires understanding the whole corpus and producing a summary. Such global questions are called query-focused summarization (QFS) problems in this paper1. A naive RAG technique cannot handle such a situation.
It is also unrealistic to put all the text in the corpus into the context window of an LLM. Even if we could, the LLM might miss information in the middle of the context window.
Reading Notes: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Motivations
A model's performance is related to its number of parameters: the bigger the model, the more performant it tends to be. However, the computational cost also increases. To mitigate this problem, various forms of conditional computation have been proposed to increase model performance without a proportional increase in computational cost1.
Today I would like to share the Sparsely-Gated Mixture-of-Experts (MoE) layer as proposed in this paper1.

MoE architecture
There are $n$ experts in the MoE layer (denoted as $E_1, E_2, \dots, E_n$), and they are controlled by a gating network $G$. The output of the gating network $G$ is a vector of length $n$.
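As the paper describes, the layer's output is the gate-weighted sum of the expert outputs, $y = \sum_{i=1}^{n} G(x)_i E_i(x)$, where $G(x)$ is sparse. Below is a simplified PyTorch sketch (my own illustration) using plain top-$k$ gating; it omits the noise term and the load-balancing losses from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Simplified sparsely-gated MoE layer: y = sum_i G(x)_i * E_i(x)."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        # Each expert is a small network; plain linear layers keep the sketch short.
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)  # gating network G
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (d_model,). The gate scores all n experts, but only the top-k are evaluated.
        logits = self.gate(x)                      # (n_experts,)
        top_vals, top_idx = logits.topk(self.k)
        gates = F.softmax(top_vals, dim=-1)        # sparse G(x): zero for unselected experts
        return sum(g * self.experts[int(i)](x) for g, i in zip(gates, top_idx))
```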
What is the Python decorator, really?
Intro
If you can understand this statement: a Python function is a first-class function, then I believe you will have no problem understanding the Python decorator too. This statement means that a function is also a value, just like any other primitive type (int, str, float, etc.), and can be passed as an argument to a function or returned as a function's output.
You may have heard of the technical term higher-order function, which means that its arguments contain a function and/or it returns a function. So we know that the Python decorator is a kind of higher-order function.
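As a minimal illustration (my own example, not the post's code), a decorator is just a higher-order function applied with the `@` syntax:

```python
import functools

def logged(func):
    """A decorator: takes a function and returns a wrapped function."""
    @functools.wraps(func)            # keep the original name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args}, {kwargs}")
        return func(*args, **kwargs)
    return wrapper

@logged                               # same as: add = logged(add)
def add(a, b):
    return a + b

add(1, 2)   # prints "calling add with (1, 2), {}" and returns 3
```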
Reading Notes: Generalization through Memorization: Nearest Neighbor Language Models
Motivation
A language model solves two subproblems:
- Mapping sentence prefixes to fixed-size representations.
- Using these representations to predict the next token in the context.
The $k\texttt{NN-LM}$ proposed in this paper hypothesizes that the representation learning problem may be easier than the prediction problem.
kNN-LM
The following figure demonstrates the idea behind the $k\texttt{NN-LM}$ model.

Data Preparation
To use the $k\texttt{NN-LM}$, we need to preprocess the documents in the corpus. The preprocessing procedure can be divided into a few steps. Take the following sentence as an example.
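Whatever the example sentence is, the general shape of the preprocessing is the same: for every position in the corpus, store the representation of the prefix as a key and the token that follows it as the value. Here is a rough Python sketch (my own illustration; `encode` is a hypothetical stand-in for the LM's context encoder):

```python
import numpy as np

def build_datastore(corpus_tokens, encode):
    """Build the kNN-LM datastore: one (key, value) pair per token position.

    corpus_tokens: list of token lists, e.g. [["Obama", "was", "born", "in", "Hawaii"]]
    encode: hypothetical function mapping a prefix to a fixed-size vector
    """
    keys, values = [], []
    for tokens in corpus_tokens:
        for i in range(1, len(tokens)):
            prefix, next_token = tokens[:i], tokens[i]
            keys.append(encode(prefix))    # key: representation of the prefix
            values.append(next_token)      # value: the token that follows it
    return np.stack(keys), values          # later indexed for nearest-neighbor search
```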
How the KNN Algorithm Works
What is the KNN Algorithm
By definition, we know that the KNN algorithm does not have a training process.
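To show what "no training process" means in practice, here is a minimal NumPy sketch (my own illustration): all the work happens at prediction time by comparing the query against the stored training points.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # "Training" is just storing X_train / y_train; the work happens here, at query time.
    distances = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = ["a", "a", "b", "b"]
print(knn_predict(X_train, y_train, np.array([4.8, 5.1])))  # -> "b"
```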
What is a Phantom Type in OCaml
Syntax
The left side of `=` represents the type, while the right side represents the value.
Reading Notes: In-Context Retrieval-Augmented Language Models
The idea
In-Context RALM1 is a RAG technique for autoregressive LMs. In summary, RAG involves using a retriever during model inference to fetch relevant documents, which are then concatenated with the original input.
In the in-context learning setting, some examples are placed before the user's input, and then everything is fed to the LLM. In-Context RALM works in a similar way: it directly concatenates the most relevant retrieved document in front of the model's input. The advantage is that there is no need to retrain the LLM. A diagram created with Mermaid is shown below.
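In addition to the diagram, here is a minimal Python sketch of the same flow (my own illustration; `retrieve` and `llm_generate` are hypothetical placeholders):

```python
def in_context_ralm(user_input: str, retrieve, llm_generate, top_k: int = 1) -> str:
    """Prepend the most relevant retrieved document(s) to the input, then generate.

    retrieve(query, k): hypothetical retriever returning a list of document strings
    llm_generate(prompt): hypothetical call to a frozen, unmodified LLM
    """
    docs = retrieve(user_input, top_k)
    # The retrieved text goes in front of the original input; no retraining is needed.
    prompt = "\n\n".join(docs) + "\n\n" + user_input
    return llm_generate(prompt)
```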