Transformer Architecture Changes: Rotary Position Embedding (RoPE)

MartinLwx, filed under ML-DL

2025-05-24

Self-Attention Recap

Let $\mathbf x_i$ denote a token embedding without positional information. Then $\mathbf q_m,\mathbf k_n,\mathbf v_n$ are computed as

$$ \begin{aligned} \mathbf q_m&=f_q(\mathbf x_m,m)\\ \mathbf k_n&=f_k(\mathbf x_n,n)\\ \mathbf v_n&=f_v(\mathbf x_n,n) \end{aligned} $$

Here $n$ and $m$ are positions: $\mathbf k$ and $\mathbf v$ both come from position $n$, $\mathbf q$ comes from position $m$, and we assume $m > n$.

The attention score between positions $m$ and $n$ is then:

$$ \alpha_{m,n}=\frac{\exp\left(\frac{\mathbf q_m^T\mathbf k_n}{\sqrt d}\right)}{\sum_{j=1}^N\exp\left(\frac{\mathbf q_m^T\mathbf k_j}{\sqrt d}\right)} $$

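where

- $\alpha_{m,n}$ is the attention score
- $N$ is the length of the input sequence

Every position $i$ gets an attention score $\alpha_{m, i}$ in the same way. Weighting the value vectors by these scores gives the updated representation $\mathbf o_m$ of position $m$: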

$$ \mathbf o_m=\sum_{i=1}^N\alpha_{m,i}\mathbf v_i $$
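The two formulas above can be sketched in NumPy for a single query. This is a minimal illustration, not code from the original post: the function name `attention_output`, the random inputs, and the sizes `N = 5`, `d = 8` are assumptions chosen for the example, and no causal mask is applied because the sum in the formula runs over all $N$ positions.

```python
import numpy as np

def attention_output(q_m, K, V):
    """Return the weights alpha_{m,:} and the output o_m for one query vector."""
    d = q_m.shape[-1]
    scores = K @ q_m / np.sqrt(d)     # q_m^T k_j / sqrt(d) for every position j
    scores = scores - scores.max()    # shift for numerical stability; softmax is unchanged
    weights = np.exp(scores)
    alpha = weights / weights.sum()   # softmax over the N positions
    o_m = alpha @ V                   # sum_i alpha_{m,i} * v_i
    return alpha, o_m

rng = np.random.default_rng(0)
N, d = 5, 8                           # illustrative sizes (assumption)
K = rng.normal(size=(N, d))           # one key per row
V = rng.normal(size=(N, d))           # one value per row
q_m = rng.normal(size=d)

alpha, o_m = attention_output(q_m, K, V)
```

The weights in `alpha` are positive and sum to 1, so `o_m` is a convex combination of the rows of `V`.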

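How RoPE Works

Looking at the self-attention formulas, the core is how the attention score between the query vector and the key vector is computed. The authors look for functions $f_q, f_k$ such that the inner product depends on the two input vectors and on their positions only through the relative distance $m-n$¹

Formally: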

$$ \langle f_q(\mathbf x_m,m),f_k(\mathbf x_n,n)\rangle=g(\mathbf x_m,\mathbf x_n,m-n) $$

In the 2-dimensional case, the authors propose that $f_q,f_k$ first apply a weight transform and then left-multiply the result by a rotation matrix $\mathbf R$, that is

$$ \begin{split} f_q(\mathbf x_m,m)&=\mathbf{R}_m(\mathbf W_q\mathbf x_m) \\ &=\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} (\mathbf W_q\mathbf x_m) =\mathbf q_m \end{split} $$

So for a vector $\mathbf x_m$ at position $m$, the rotation angle is always $m\theta$.

Similarly,

$$ \begin{split} f_k(\mathbf x_n,n)&=\mathbf{R}_n(\mathbf W_k\mathbf x_n) \\ &=\begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix} (\mathbf W_k\mathbf x_n) =\mathbf k_n \end{split} $$

What is $\mathbf q_m^T\mathbf k_n=(\mathbf R_m(\mathbf W_q\mathbf x_m))^T(\mathbf R_n(\mathbf W_k\mathbf x_n))$ equal to, then? We can derive it:

$$ \begin{split} \mathbf q_m^T\mathbf k_n&=(\mathbf R_m(\mathbf W_q\mathbf x_m))^T(\mathbf R_n(\mathbf W_k\mathbf x_n))\\ &=(\mathbf W_q\mathbf x_m)^T\mathbf R_m^T\mathbf R_n(\mathbf W_k\mathbf x_n) \\ &=(\mathbf W_q\mathbf x_m)^T\mathbf R_m^{-1}\mathbf R_n(\mathbf W_k\mathbf x_n) \\ &=(\mathbf W_q\mathbf x_m)^T\mathbf R_{-m}\mathbf R_n(\mathbf W_k\mathbf x_n) \\ &=(\mathbf W_q\mathbf x_m)^T\mathbf R_{n-m}(\mathbf W_k\mathbf x_n) \end{split} $$

The positions now enter the attention score only through the relative distance $n-m$.
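The result of this derivation can be checked numerically in the 2-dimensional case. The sketch below is only an illustration: the value `theta = 0.3`, the random $\mathbf W_q, \mathbf W_k$, the inputs `x_a`, `x_b`, and the helper names `rot` and `score` are all assumptions made for the example.

```python
import numpy as np

def rot(angle):
    """2x2 rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.3                               # arbitrary base angle (assumption)
rng = np.random.default_rng(0)
W_q = rng.normal(size=(2, 2))
W_k = rng.normal(size=(2, 2))
x_a = rng.normal(size=2)                  # token that plays the role of x_m
x_b = rng.normal(size=2)                  # token that plays the role of x_n

def score(m, n):
    """q_m^T k_n with the query at position m and the key at position n."""
    q = rot(m * theta) @ (W_q @ x_a)
    k = rot(n * theta) @ (W_k @ x_b)
    return q @ k

s1 = score(5, 2)        # n - m = -3
s2 = score(105, 102)    # n - m = -3 again, shifted by 100 positions
s3 = score(5, 3)        # n - m = -2

# Closed form from the last line of the derivation, with n - m = -3
closed_form = (W_q @ x_a) @ rot(-3 * theta) @ (W_k @ x_b)
```

Here `s1` and `s2` agree up to floating-point error and both match `closed_form`, while `s3` differs because the relative distance changed.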

Tip

The derivation uses these properties of rotation matrices:

$$ \mathbf R_m\mathbf R_n=\mathbf R_{m+n} $$

and

$$ \mathbf R^{-1}=\mathbf R^T $$

and

$$ \mathbf R_m^{-1}=\mathbf R_{-m} $$
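These three identities are easy to confirm numerically. The snippet below is a small sanity check rather than a proof; the angle `theta = 0.7` and the positions `m = 3`, `n = 5` are arbitrary values picked for the example.

```python
import numpy as np

def rot(angle):
    """2x2 rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.7          # arbitrary base angle (assumption)
m, n = 3, 5          # arbitrary positions (assumption)
R_m, R_n = rot(m * theta), rot(n * theta)

composes = np.allclose(R_m @ R_n, rot((m + n) * theta))                # R_m R_n = R_{m+n}
inverse_is_transpose = np.allclose(np.linalg.inv(R_m), R_m.T)          # R^{-1} = R^T
inverse_is_negative = np.allclose(np.linalg.inv(R_m), rot(-m * theta)) # R_m^{-1} = R_{-m}
```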

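What if $\mathbf x_m,\mathbf x_n$ are $d$-dimensional? For a vector of length $d$ (with $d$ even), treat each two adjacent coordinates as a pair, which gives $d/2$ pairs. Each pair is multiplied by its own rotation matrix, so the rotation matrix for the whole input at position $m$ is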

$$ \mathbf R=\begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix} $$

Tip

Here $\mathbf R$ is still a rotation matrix, because $\mathbf R\mathbf R^{T}=\mathbf I$ and $\det (\mathbf R)=1$.

The angles $\theta_i$ are computed as

$$ \theta_i=10000^{-2(i-1)/d} $$

where $i=1,2,\dots,d/2$.
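As a quick illustration of the formula for $\theta_i$, the sketch below computes the angles for a small dimension. The function name `rope_frequencies` and the choice `d = 8` are assumptions for the example; the base 10000 comes from the formula above.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2(i-1)/d) for i = 1, ..., d/2."""
    assert d % 2 == 0, "d must be even so coordinates can be paired"
    i = np.arange(1, d // 2 + 1)          # i = 1, ..., d/2
    return base ** (-2.0 * (i - 1) / d)

theta = rope_frequencies(8)
# approximately [1.0, 0.1, 0.01, 0.001]
```

The first pair rotates fastest ($\theta_1 = 1$), and later pairs rotate more and more slowly as $i$ grows.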

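The matrix $\mathbf R$ above is sparse, so computing $\mathbf R\mathbf x$ as a full matrix multiplication spends most of its work on zeros. An equivalent element-wise implementation is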

$$ \mathbf R\mathbf x= \begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4\\ \vdots\\ x_{d-1}\\ x_d \end{pmatrix}\otimes \begin{pmatrix} \cos m\theta_1\\ \cos m\theta_1\\ \cos m\theta_2\\ \cos m\theta_2\\ \vdots\\ \cos m\theta_{d/2}\\ \cos m\theta_{d/2} \end{pmatrix}+ \begin{pmatrix} -x_2\\ x_1\\ -x_4\\ x_3\\ \vdots\\ -x_{d}\\ x_{d-1} \end{pmatrix}\otimes \begin{pmatrix} \sin m\theta_1\\ \sin m\theta_1\\ \sin m\theta_2\\ \sin m\theta_2\\ \vdots\\ \sin m\theta_{d/2}\\ \sin m\theta_{d/2} \end{pmatrix} $$
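The element-wise form can be sketched in NumPy and compared against the dense block-diagonal matrix. This is a minimal sketch under assumptions: the helper names `rope_frequencies`, `rope_matrix_interleaved`, and `rope_interleaved` are made up for this example, indices are 0-based in the code, and the dense matrix is built only to check the result.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_1, ..., theta_{d/2} as defined above (0-based exponent here)."""
    return base ** (-2.0 * np.arange(d // 2) / d)

def rope_matrix_interleaved(m, d):
    """Dense block-diagonal rotation matrix for position m, used only for checking."""
    R = np.zeros((d, d))
    for j, t in enumerate(rope_frequencies(d)):
        c, s = np.cos(m * t), np.sin(m * t)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R

def rope_interleaved(x, m):
    """Element-wise form: x * cos + swapped(x) * sin, for a 1-D vector x."""
    d = x.shape[-1]
    angles = m * np.repeat(rope_frequencies(d), 2)  # theta_1, theta_1, theta_2, theta_2, ...
    swapped = np.empty_like(x)
    swapped[0::2] = -x[1::2]                        # -x_2, -x_4, ...
    swapped[1::2] = x[0::2]                         #  x_1,  x_3, ...
    return x * np.cos(angles) + swapped * np.sin(angles)

x = np.random.default_rng(0).normal(size=8)
same = np.allclose(rope_interleaved(x, 7), rope_matrix_interleaved(7, 8) @ x)
```

The element-wise version costs $O(d)$ per vector, whereas the dense product costs $O(d^2)$.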

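Implementing RoPE

There are two ways to implement RoPE:

1. Treat $x_{2i+1}, x_{2i+2}$ as one pair and rotate it. This is the method described above; see the LLaMA implementation for reference.
2. Treat $x_{i}, x_{i+d/2}$ as one pair and rotate it; see the linked discussion for details.

To show why the second method also works, here is a brief look at its math: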

$$ \mathbf R\mathbf x= \begin{pmatrix} \cos m\theta_1 & 0 & \cdots & 0 & -\sin m\theta_1 & 0 & \cdots & 0 \\ 0 & \cos m\theta_2 & \cdots & 0 & 0 & -\sin m\theta_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \cos m\theta_{d/2} & 0 & 0 & \cdots & -\sin m\theta_{d/2} \\ \sin m\theta_1 & 0 & \cdots & 0 & \cos m\theta_1 & 0 & \cdots & 0 \\ 0 & \sin m\theta_2 & \cdots & 0 & 0 & \cos m\theta_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sin m\theta_{d/2} & 0 & 0 & \cdots & \cos m\theta_{d/2} \end{pmatrix} \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_{d/2}\\ x_{d/2+1}\\ x_{d/2+2}\\ \vdots\\ x_d \end{pmatrix} $$

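The matrix $\mathbf R$ here is also a rotation matrix:

- $\mathbf R\mathbf R^T=\mathbf I$
- $\det (\mathbf R) = 1$. Reordering the coordinates so that $x_i$ and $x_{i+d/2}$ become adjacent applies the same permutation to the rows and to the columns, and turns this matrix into the rotation matrix of the first method. Every row swap is matched by a column swap, so the total number of swaps is even. Swapping any two rows or two columns flips the sign of the determinant, so an even number of swaps leaves it unchanged, and it equals 1.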
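Both properties, and the link to the first method, can be checked numerically for a small case. In this sketch the helper names, the sizes `d = 8`, `m = 7`, and the index array `perm` are assumptions for the example; `perm` lists, for each slot of the adjacent-pair layout, which coordinate of the split-half layout lands there.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    return base ** (-2.0 * np.arange(d // 2) / d)

def rope_matrix_half(m, d):
    """Rotation matrix of the second method: pairs (x_i, x_{i+d/2})."""
    t = m * rope_frequencies(d)
    C, S = np.diag(np.cos(t)), np.diag(np.sin(t))
    return np.block([[C, -S], [S, C]])

def rope_matrix_interleaved(m, d):
    """Block-diagonal rotation matrix of the first method: adjacent pairs."""
    R = np.zeros((d, d))
    for j, t in enumerate(rope_frequencies(d)):
        c, s = np.cos(m * t), np.sin(m * t)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R

d, m = 8, 7
R_half = rope_matrix_half(m, d)

perm = np.arange(d).reshape(2, d // 2).T.reshape(-1)  # [0, 4, 1, 5, 2, 6, 3, 7]
P = np.eye(d)[perm]                                   # permutation matrix: (P x)[i] = x[perm[i]]
R_permuted = P @ R_half @ P.T                         # same permutation on rows and columns

is_orthogonal = np.allclose(R_half @ R_half.T, np.eye(d))
det_is_one = np.isclose(np.linalg.det(R_half), 1.0)
matches_method_1 = np.allclose(R_permuted, rope_matrix_interleaved(m, d))
```

Because the same `P` acts on both sides, $\det(\mathbf P\mathbf R\mathbf P^T)=\det(\mathbf P)^2\det(\mathbf R)=\det(\mathbf R)$, which is the parity argument in matrix form.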

This matrix $\mathbf R$ is sparse as well, so the product can be rewritten in the following element-wise form

$$ \mathbf R\mathbf x= \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_{d/2}\\ x_{d/2+1}\\ x_{d/2+2}\\ \vdots\\ x_d \end{pmatrix}\otimes \begin{pmatrix} \cos m\theta_1\\ \cos m\theta_2\\ \vdots\\ \cos m\theta_{d/2}\\ \cos m\theta_{1}\\ \cos m\theta_{2}\\ \vdots\\ \cos m\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -x_{d/2+1}\\ -x_{d/2+2}\\ \vdots\\ -x_d\\ x_1\\ x_2\\ \vdots\\ x_{d/2} \end{pmatrix}\otimes \begin{pmatrix} \sin m\theta_1\\ \sin m\theta_2\\ \vdots\\ \sin m\theta_{d/2}\\ \sin m\theta_{1}\\ \sin m\theta_{2}\\ \vdots\\ \sin m\theta_{d/2} \end{pmatrix} $$
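The split-half form translates almost line by line into NumPy. The sketch below is an illustration under assumptions: the names `rotate_half` and `rope_half`, the random vectors, and the positions are chosen for the example and are not taken from any particular library.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    return base ** (-2.0 * np.arange(d // 2) / d)

def rotate_half(x):
    """Map (x_1..x_{d/2}, x_{d/2+1}..x_d) to (-x_{d/2+1}..-x_d, x_1..x_{d/2})."""
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def rope_half(x, m):
    """Second method: x * cos + rotate_half(x) * sin at position m."""
    d = x.shape[-1]
    angles = m * np.tile(rope_frequencies(d), 2)  # theta_1..theta_{d/2}, then repeated
    return x * np.cos(angles) + rotate_half(x) * np.sin(angles)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

near = rope_half(q, 5) @ rope_half(k, 2)       # positions 5 and 2
far = rope_half(q, 105) @ rope_half(k, 102)    # same relative distance, shifted by 100
```

`near` and `far` agree up to floating-point error, so this layout has the same relative-position property as the first one. For the same input, the rotated vectors themselves differ element by element between the two methods unless the coordinates are permuted accordingly.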

¹ Su, Jianlin, et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv preprint arXiv:2104.09864 (2021).
