The Weight Tying Technique in Language Models

MartinLwx, filed under ML-DL

2025-03-11

Introduction

Quote

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation - Attention is All You Need, Section 3.4. Embeddings and Softmax¹

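The sentence above comes from Section 3.4, Embeddings and Softmax, of Attention is All You Need¹. What does it mean?

It means the authors share the weights of the token embedding layer with the weights of the linear transformation applied right before Softmax. This is known as Weight Tying, a technique introduced in the paper Using the Output Embedding to Improve Language Models ². In this post I briefly cover the idea and its implementation.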

What Is Weight Tying

flowchart LR
    wte(input embedding)
    lm_head(output embedding)
    wte --> ... --> lm_head

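A language model usually contains two weight matrices:

input embedding (denoted $\mathbf U$): maps each input token to a token embedding; in PyTorch this corresponds to nn.Embedding
output embedding (denoted $\mathbf V$): maps a hidden vector to scores over the vocabulary, which Softmax then turns into a probability distribution over tokens; in PyTorch this corresponds to nn.Linear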
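To make the two roles concrete, here is a minimal sketch of the data flow. The sizes and token ids are made up for illustration, and a real model would run its Transformer (or RNN) layers between the two matrices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_dim = 10, 4  # hypothetical toy sizes

U = nn.Embedding(vocab_size, hidden_dim)           # input embedding
V = nn.Linear(hidden_dim, vocab_size, bias=False)  # output embedding

token_ids = torch.tensor([1, 5, 7])    # three made-up token ids
hidden = U(token_ids)                  # shape (3, hidden_dim); model layers would go here
logits = V(hidden)                     # shape (3, vocab_size)
probs = torch.softmax(logits, dim=-1)  # one distribution over the vocabulary per position
```

Each row of `probs` sums to 1, so it can be read as the predicted distribution for the next token at that position.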

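The authors argue that training places similar expectations on $\mathbf U$ and $\mathbf V$¹:

For $\mathbf U$, semantically similar tokens should have similar token embeddings
For $\mathbf V$, semantically similar tokens should receive similar probability scores in the output distribution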

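On top of that, $\mathbf U$ and $\mathbf V$ have the same shape. Given all this, can the two be merged so that they share one set of weights?

The answer is yes. See the original paper ² for the experimental details, which I won't go into here.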
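The shape claim is easy to check in PyTorch, where both layers store their weight as a `(vocab_size, hidden_dim)` matrix. The sketch below, with made-up sizes, also shows what sharing amounts to: the output layer scores a hidden vector by taking its dot product with every row of the embedding matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_dim = 10, 4  # hypothetical toy sizes
U = nn.Embedding(vocab_size, hidden_dim)
V = nn.Linear(hidden_dim, vocab_size, bias=False)

# Both weights are stored as (vocab_size, hidden_dim)
same_shape = U.weight.shape == V.weight.shape

# Using U's matrix as the output layer: nn.Linear computes x @ W.T
h = torch.randn(2, hidden_dim)            # two made-up hidden vectors
logits_manual = h @ U.weight.T            # shape (2, vocab_size)
logits_linear = F.linear(h, U.weight)     # same computation via the functional API
```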

Implementing Weight Tying

In PyTorch, $\mathbf U$ is implemented with nn.Embedding and $\mathbf V$ with nn.Linear, as follows:

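import torch.nn as nn

vocab_size, hidden_dim = 3, 4  # toy sizes

U = nn.Embedding(vocab_size, hidden_dim)           # weight shape: (vocab_size, hidden_dim)
V = nn.Linear(hidden_dim, vocab_size, bias=False)  # weight shape: (vocab_size, hidden_dim)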

Weight Tying is simple to implement in PyTorch: just make the weight attributes of the two modules point to the same tensor.

U.weight = V.weight
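A quick way to convince yourself that the assignment really ties the two layers is to check that both modules hold the same Parameter object and that a change made through one is visible through the other. This is a small self-contained sketch with toy sizes, not part of the original paper's code.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 3, 4  # toy sizes
U = nn.Embedding(vocab_size, hidden_dim)
V = nn.Linear(hidden_dim, vocab_size, bias=False)
U.weight = V.weight  # tie: both modules now hold the same Parameter

same_object = U.weight is V.weight

# An in-place change made through V is visible through U
with torch.no_grad():
    V.weight[0].fill_(1.0)
row_seen_by_U = U(torch.tensor([0]))  # embedding of token 0

# Gradients from both uses accumulate into the single shared tensor
loss = V(U(torch.tensor([1, 2]))).sum()
loss.backward()
grad_shape = U.weight.grad.shape
```

One detail to keep in mind: the direction of the assignment decides which initialization survives. With `U.weight = V.weight` the shared tensor keeps the default initialization of nn.Linear, while `V.weight = U.weight` would keep that of nn.Embedding.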

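Summary

In a language model, the benefit of sharing weights between the input embedding $\mathbf U$ and the output embedding $\mathbf V$ is clear: the two matrices are stored and trained once rather than twice, which halves their combined parameter count, while model quality stays about the same and the output perplexity can even go down ². The implementation is also straightforward.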
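To see where the saving comes from, the sketch below counts parameters for a hypothetical toy model (`TinyLM`, its layer names, and its sizes are invented for illustration) with and without tying. The difference is exactly one `vocab_size × hidden_dim` matrix; the rest of the model is unaffected.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical toy model: embedding -> one hidden layer -> output head."""

    def __init__(self, vocab_size, hidden_dim, tie_weights=True):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, hidden_dim)
        self.body = nn.Linear(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        if tie_weights:
            self.lm_head.weight = self.wte.weight  # share one Parameter

    def forward(self, token_ids):
        return self.lm_head(torch.tanh(self.body(self.wte(token_ids))))

def count_params(model):
    # parameters() yields a shared Parameter only once
    return sum(p.numel() for p in model.parameters())

vocab_size, hidden_dim = 100, 16
n_tied = count_params(TinyLM(vocab_size, hidden_dim, tie_weights=True))
n_untied = count_params(TinyLM(vocab_size, hidden_dim, tie_weights=False))
saved = n_untied - n_tied  # equals vocab_size * hidden_dim
```

In models with a large vocabulary and a small body, this one matrix can be a large share of the total, which is why the saving matters in practice.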
