Neovim Setup for OCaml
Updates:
LLM inference optimization - KV Cache
Background
The secret behind LLMs is that they generate tokens one by one, each conditioned on all of the previous tokens.
Let’s assume that we have already generated $t$ tokens, denoted by $x_{1:t}$. In the next iteration, the LLM will generate $x_{1:t+1}$. Note that the first $t$ tokens are the same.
$$x_{1:t+1}=\text{LLM}(x_{1:t})$$
The next iteration is similar.
$$x_{1:t+2}=\text{LLM}(x_{1:t+1})$$
In summary, each iteration uses the output of the previous round as the new input to the LLM. This process continues until the output reaches a predefined maximum length or the LLM generates a special token that signals the end of generation.
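To make this loop concrete, here is a minimal decoding sketch in Python; generate and llm_forward are hypothetical names, with llm_forward standing in for a real model call that maps the current token sequence to the next token:

```python
def generate(llm_forward, prompt_tokens, max_new_tokens, eos_token):
    """Naive autoregressive decoding: the whole prefix is fed back in at every step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = llm_forward(tokens)   # recomputes over all previous tokens
        tokens.append(next_token)
        if next_token == eos_token:        # special token that ends generation
            break
    return tokens
```

Recomputing over the shared prefix $x_{1:t}$ at every step is exactly the redundancy that the KV cache is designed to remove.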
LoRA fine-tuning
What’s LoRA
Since the era of LLMs (large language models) arrived, fine-tuning them has become a challenge because the models are extremely large, making full fine-tuning difficult. There are mainly two approaches: freeze the entire LLM and perform prompt tuning or in-context learning, or freeze the entire LLM and insert small trainable modules. Today, I will introduce LoRA (Low-Rank Adaptation), which follows the latter approach. It is a work proposed by the Microsoft team [1].
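As a rough illustration of the low-rank idea (a sketch under my own naming, not the paper's code), the pretrained weight is kept frozen and only a small update of the form BA is trained; in PyTorch it might look like this, where LoRALinear, r, and alpha are assumed names and hyperparameters:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Stand-in for the pretrained weight; frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Only A and B are trained; B starts at zero so the update is a no-op at first.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Equivalent to x @ (W + scaling * B @ A).T
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)
```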
The next lexicographical permutation problem
Intro
Occasionally, you may want to get the next or previous lexicographical permutation of a sequence. How would you do that? If you are a C++ programmer, you are probably familiar with the next_permutation [1] and prev_permutation [2] APIs. However, Python does not provide counterparts, so the topic today is how to do this in Python. Since the solutions for the previous and the next lexicographical permutation are very similar, let us focus on the next lexicographical permutation problem.
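For reference, here is a sketch of the classic three-step algorithm (the same idea std::next_permutation implements), written as a hypothetical helper next_permutation:

```python
def next_permutation(seq):
    """Return the next lexicographical permutation of seq as a new list,
    or None if seq is already the last permutation."""
    a = list(seq)
    # 1. Find the rightmost index i with a[i] < a[i + 1].
    i = len(a) - 2
    while i >= 0 and a[i] >= a[i + 1]:
        i -= 1
    if i < 0:
        return None  # fully descending: no next permutation
    # 2. Find the rightmost index j > i with a[j] > a[i], then swap.
    j = len(a) - 1
    while a[j] <= a[i]:
        j -= 1
    a[i], a[j] = a[j], a[i]
    # 3. Reverse the suffix after i so it becomes the smallest ordering.
    a[i + 1:] = reversed(a[i + 1:])
    return a

print(next_permutation([1, 2, 3]))  # [1, 3, 2]
print(next_permutation([3, 2, 1]))  # None
```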
BPE Tokenization Demystified: Implementation and Examples
A taxonomy of tokenization methods
In NLP, one of the core problems is how to tokenize the text. There are three methods available:
- Char-level
- Word-level
- Subword-level
Let’s talk about the Char-level tokenizer first. That is, we tokenize the text into a stream of characters. For instance, highest -> h, i, g, h, e, s, t. One advantage of the Char-level tokenizer is that the vocabulary won’t be that large: its size equals the size of the alphabet, so you probably won’t run into the infamous out-of-vocabulary (OOV) problem. However, the downside is that a single character does not convey much information, and we end up with far too many tokens after tokenizing. Imagine that a simple word like highest already gives us 7 tokens😨
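A char-level tokenizer is almost trivial to write, which is part of its appeal; here is a toy sketch (char_tokenize is a made-up helper, not from the post):

```python
def char_tokenize(text):
    """Char-level tokenization: every character becomes its own token."""
    return list(text)

tokens = char_tokenize("highest")
print(tokens)       # ['h', 'i', 'g', 'h', 'e', 's', 't']
print(len(tokens))  # 7 tokens for a single word
```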
TF-IDF model
What is the TF-IDF model
In a previous post, we talked about the bag-of-words model, which has many limitations. Today we take a step further and see whether we can fix one of them: every word has the same importance.
The crux of the problem: how do we define word importance?
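As a rough sketch of one common answer, weight each word by its term frequency multiplied by its inverse document frequency; tf_idf below is a made-up helper, and the post may use a different weighting variant:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of tokenized documents. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return result

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tf_idf(docs)[0])  # "the" appears everywhere and gets weight 0; "cat" and "sat" get positive weights
```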
Bag-of-Words model
What is the bag-of-words model?
In NLP, we need to represent each document as a vector because machine learning models only accept numbers as input. That is, we want to find a magic function such that: $$ f(\text{document}) = \text{vector} $$
Today’s topic is the bag-of-words (BoW) model, which transforms a document into a vector representation.
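A minimal sketch of such a function, assuming a fixed vocabulary and simple raw counts (bag_of_words is a made-up helper):

```python
from collections import Counter

def bag_of_words(doc_tokens, vocab):
    """Map a tokenized document to a count vector over a fixed vocabulary."""
    counts = Counter(doc_tokens)
    return [counts[word] for word in vocab]

vocab = ["i", "love", "hate", "nlp"]
print(bag_of_words(["i", "love", "nlp"], vocab))  # [1, 1, 0, 1]
```

Note that word order is thrown away and every word counts equally.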
💡 Although the BoW model is outdated in 2023, I still encourage you to learn from the history and think about some essential problems:
A trick to calculating partial derivatives in machine learning
Intro
Like me, you may have difficulties when trying to calculate partial derivatives in machine learning. Even though I found a good reference cookbook for deriving gradients, I still got confused. Today, I want to share a practical technique I recently learned from this video: when calculating partial derivatives in machine learning, you can treat everything as if it were a scalar and then make the shapes match.
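As a hypothetical worked example of the trick (not from the post), take $L=\lVert Wx-y\rVert^2$: pretending everything is a scalar gives $\partial L/\partial W = 2(Wx-y)x$, and the only way to arrange those pieces so the result has the same shape as $W$ is the outer product $2(Wx-y)x^\top$. A quick numerical sanity check:

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
W, x, y = rng.standard_normal((m, n)), rng.standard_normal(n), rng.standard_normal(m)

r = W @ x - y                # residual, shape (m,)
grad_W = 2 * np.outer(r, x)  # shape (m, n): the only arrangement that matches W

# Finite-difference check of one entry.
eps = 1e-6
W_perturbed = W.copy()
W_perturbed[0, 0] += eps
numeric = (np.sum((W_perturbed @ x - y) ** 2) - np.sum(r ** 2)) / eps
print(np.isclose(grad_W[0, 0], numeric, atol=1e-4))  # True
```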
Demystifying PyTorch's Strides Format
Intro
Even though I have been using NumPy and PyTorch for a long time, I never really knew how they implement the underlying tensors and why they are so efficient. Recently, while studying the course Deep Learning Systems, I finally got the opportunity to try implementing tensors on my own. After going through the process, my understanding of tensors is much better 🧐
As a PyTorch user, is it necessary to understand the underlying tensor storage mechanism? I believe it is essential. In most cases, understanding the underlying principles helps you grasp higher-level concepts better. For example, understanding the tensor storage mechanism can help you answer the following questions:
How to memorize the Red-black tree
Intro
If you are attracted by the title of this blog, I believe you may agree with me: the process of memorizing the insertion and deletion operations of the Red-black tree can be incredibly arduous. It entails keeping track of complex tree rotations and the necessity to recolor nodes as required. I once read the renowned Introduction to Algorithms written by CLRS. However, there are so many cases to remember that I quickly got overwhelmed.
Git bundle guide
What is the git bundle command
git bundle is a relatively less commonly used git command. Its purpose is to package a git repo into a single file, which can then be used by others to recreate the original git repo. Additionally, git bundle supports incremental updates. Before I learned about the git bundle command, I would usually just use tar czf some_git_repo to create an archive of the repo. Recently, I accidentally discovered git bundle and found it quite useful🍻.
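As a quick illustration (the repository and file names are made up), the round trip boils down to git bundle create on one side and a clone from the bundle file on the other; scripted in Python it might look like this:

```python
import subprocess

# Hypothetical paths, just to illustrate the round trip.
repo = "some_git_repo"

# Pack every branch and tag of the repo into a single bundle file.
subprocess.run(
    ["git", "-C", repo, "bundle", "create", "../repo.bundle", "--all"],
    check=True,
)

# Elsewhere, the bundle file acts like a read-only remote:
# cloning from it recreates the original repository.
subprocess.run(["git", "clone", "repo.bundle", "restored_repo"], check=True)
```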
Understanding GAT through MPNN
What’s MPNN
Justin Gilmer et al. proposed the MPNN (Message Passing Neural Network) framework [1] for describing graph neural network models used in supervised learning on graphs. I found this to be a useful framework: it gives a clear understanding of how different GNN models work and makes it easy to grasp the differences between them. Considering a node $v$ on the graph $G$, the update procedure for its vector representation $h_v$ is as follows:
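In the notation of Gilmer et al., the step is usually written as a message phase followed by an update phase, where $N(v)$ denotes the neighbors of $v$, $e_{vw}$ the feature of edge $(v, w)$, and $M_t$, $U_t$ the message and update functions:

$$m_v^{t+1}=\sum_{w\in N(v)}M_t\left(h_v^t,\,h_w^t,\,e_{vw}\right)$$

$$h_v^{t+1}=U_t\left(h_v^t,\,m_v^{t+1}\right)$$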