
RoPE → Attention: The Complete Pipeline

news 2026/7/23 4:06:04

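This document walks through the full "Transformer input → RoPE → Attention" pipeline as one coherent sequence. Each step gives the math and a PyTorch example that can be used as a reference or as a starting point for an implementation.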

Transformer Forward Pass + RoPE, End to End

1️⃣ Input: Token → Embedding

Math:

Given a token sequence $t_1, t_2, \dots, t_n$, the embedding lookup is:

$$E = \text{Embedding}(t) \in \mathbb{R}^{n \times d}$$

Example code:

Python
import torch
import torch.nn as nn

vocab_size = 10000
d = 8  # small for the demo; real models typically use 512 or 1024

embedding = nn.Embedding(vocab_size, d)

tokens = torch.tensor([[1, 5, 9, 2]])  # shape: (1, n), here n = 4
E = embedding(tokens)                  # shape: (1, 4, 8)
print(E.shape)  # torch.Size([1, 4, 8])

2️⃣ Linear Projections to Q / K / V

Math:

$$Q = E W_Q,\quad K = E W_K,\quad V = E W_V$$

$$W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$$

Example code:

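Python
d_k = d  # single head in this demo, so the projection keeps the model width

# nn.Linear computes x @ weight.T, which plays the role of E W_Q in the math.
W_Q = nn.Linear(d, d_k, bias=False)
W_K = nn.Linear(d, d_k, bias=False)
W_V = nn.Linear(d, d_k, bias=False)

Q = W_Q(E)  # (1, n, d_k)
K = W_K(E)  # (1, n, d_k)
V = W_V(E)  # (1, n, d_k)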

3️⃣ Build the RoPE Angles

Math:

$$\theta_{i,pos} = \frac{pos}{10000^{2i/d_k}}$$

$$i = 0, 1, \dots, d_k/2 - 1, \qquad pos = 0, 1, \dots, n-1$$

Example code:

Python
def get_rope_angles(seq_len, dim):
    pos = torch.arange(seq_len).float()    # (n,)
    i = torch.arange(0, dim, 2).float()    # (d/2,), holds the even indices 2i = 0, 2, 4, ...
    inv_freq = 1.0 / (10000 ** (i / dim))  # (d/2,), equals 10000^(-2i/d)
    theta = torch.outer(pos, inv_freq)     # (n, d/2)
    return theta

theta = get_rope_angles(seq_len=E.shape[1], dim=d_k)
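
To make the frequency schedule concrete, here is a small worked example for the demo setting d_k = 8. It is only an illustrative sketch that uses the standard library instead of PyTorch; the helper name `rope_inv_freq` is hypothetical and not part of the original code.

```python
import math

def rope_inv_freq(dim, base=10000.0):
    """Return 1 / base^(2i/dim) for i = 0 .. dim/2 - 1 (hypothetical helper)."""
    return [1.0 / (base ** (two_i / dim)) for two_i in range(0, dim, 2)]

inv_freq = rope_inv_freq(8)
# With dim = 8 the exponents 2i/dim are 0, 0.25, 0.5, 0.75,
# so each pair rotates roughly 10x slower than the previous one.

for i, f in enumerate(inv_freq):
    wavelength = 2 * math.pi / f  # number of positions for one full rotation
    print(f"pair {i}: inv_freq={f:.4f}, wavelength={wavelength:.1f}")

# The angle for a given position is pos * inv_freq[i]; here pos = 3.
theta_pos3 = [3 * f for f in inv_freq]
```

The first pair completes a rotation every few tokens, while the last pair needs thousands of positions, which is how a single vector can carry both short-range and long-range position information.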

4️⃣ Compute sin / cos

$$\sin(\theta), \quad \cos(\theta)$$

Example code:

Python
sin = theta.sin()[None, :, :]  # (1, n, d/2); the leading axis broadcasts over the batch
cos = theta.cos()[None, :, :]  # (1, n, d/2)

5️⃣ Apply RoPE (2D Rotation)

Math:

Each pair of dimensions $(2i, 2i+1)$ is rotated by the angle $\theta = \theta_{i,pos}$:

$$\begin{aligned} x'_{2i} &= x_{2i} \cos\theta - x_{2i+1} \sin\theta \\ x'_{2i+1} &= x_{2i} \sin\theta + x_{2i+1} \cos\theta \end{aligned}$$

Example code:

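Python
def apply_rope(x, sin, cos):
    # x: (B, n, d); sin, cos: (1, n, d/2)
    x1 = x[..., ::2]   # even dimensions x_{2i}
    x2 = x[..., 1::2]  # odd dimensions x_{2i+1}
    # Stack and flatten so the rotated pairs land back at indices 2i and 2i+1,
    # matching the formula. A plain torch.cat would put all even outputs first;
    # Q'K'^T would be unchanged, but the layout would not match the math.
    x_rot = torch.stack([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos
    ], dim=-1).flatten(-2)
    return x_rot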
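
The same rotation can also be written as a complex multiplication: treat each pair $(x_{2i}, x_{2i+1})$ as the complex number $x_{2i} + j\,x_{2i+1}$ and multiply it by $e^{j\theta}$. The sketch below checks that view against the real-valued formula. It is a self-contained illustration, not the original author's code; `rope_angles`, `rope_real`, and `rope_complex` are hypothetical names, and both variants write the rotated pairs back in interleaved order, as in the equations above.

```python
import torch

def rope_angles(seq_len, dim, base=10000.0):
    pos = torch.arange(seq_len).float()
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(pos, inv_freq)  # (n, dim/2)

def rope_real(x, theta):
    # Real-valued 2D rotation of each (even, odd) pair, interleaved output.
    sin, cos = theta.sin(), theta.cos()
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_complex(x, theta):
    # View each pair as a complex number and multiply by e^{j*theta}.
    x_c = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
    rot = torch.polar(torch.ones_like(theta), theta)  # unit-modulus e^{j*theta}
    return torch.view_as_real(x_c * rot).flatten(-2)

torch.manual_seed(0)
x = torch.randn(1, 4, 8)
theta = rope_angles(4, 8)

same = torch.allclose(rope_real(x, theta), rope_complex(x, theta), atol=1e-6)
norm_kept = torch.allclose(rope_real(x, theta).norm(dim=-1), x.norm(dim=-1), atol=1e-5)
pos0_unchanged = torch.allclose(rope_real(x, theta)[:, 0], x[:, 0])
print(same, norm_kept, pos0_unchanged)
```

The two extra checks illustrate properties that follow from RoPE being a pure rotation: vector norms are preserved, and position 0 (angle 0) leaves the input untouched.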

6️⃣ Apply RoPE to Q / K

$$Q' = \text{RoPE}(Q), \quad K' = \text{RoPE}(K)$$

Example code:

Python
Q_rot = apply_rope(Q, sin, cos)  # (1, n, d_k)
K_rot = apply_rope(K, sin, cos)  # (1, n, d_k)

7️⃣ Attention

Math:

$$A = \frac{Q' {K'}^T}{\sqrt{d_k}}$$

$$\alpha = \mathrm{softmax}(A)$$

$$\text{Output} = \alpha V$$

Example code:

Python
scores = torch.matmul(Q_rot, K_rot.transpose(-2, -1)) / (d_k ** 0.5)  # (1, n, n)
attn = torch.softmax(scores, dim=-1)                                  # each row sums to 1
output = torch.matmul(attn, V)                                        # V is not rotated
print(output.shape)  # torch.Size([1, 4, 8]), i.e. (1, n, d)
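
For reference, PyTorch 2.0 and later provide `torch.nn.functional.scaled_dot_product_attention`, which computes the same $\mathrm{softmax}(Q'{K'}^T/\sqrt{d_k})\,V$ in a single call. The sketch below compares it with the manual version on random tensors that stand in for $Q'$, $K'$, and $V$; the tensor names and shapes are assumptions made for the demo.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n, d_k = 1, 4, 8
q = torch.randn(B, n, d_k)  # stands in for Q' (already rotated)
k = torch.randn(B, n, d_k)  # stands in for K' (already rotated)
v = torch.randn(B, n, d_k)  # V is used as-is, without RoPE

# Manual attention, following the three equations above.
scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)
attn = torch.softmax(scores, dim=-1)
manual = attn @ v

# Fused version: the default scale is 1/sqrt(last dim of q) and no mask is applied.
fused = F.scaled_dot_product_attention(q, k, v)

rows_sum_to_one = torch.allclose(attn.sum(dim=-1), torch.ones(B, n), atol=1e-6)
matches = torch.allclose(manual, fused, atol=1e-5)
print(rows_sum_to_one, matches)
```

Because RoPE only changes Q and K before the dot product, it works with either form of attention without further changes.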

8️⃣ Pipeline Summary

Math version:

$$\begin{aligned} E &= \text{Embedding}(t) \\ Q &= E W_Q, \quad K = E W_K, \quad V = E W_V \\ Q' &= R_{\theta(pos)}(Q), \quad K' = R_{\theta(pos)}(K) \\ \text{Attn} &= \mathrm{softmax}\Big( \frac{Q' {K'}^T}{\sqrt{d_k}} \Big) V \end{aligned}$$

Code version:

Python
# 1. Embedding
E = embedding(tokens)

# 2. Linear projection
Q = W_Q(E)
K = W_K(E)
V = W_V(E)

# 3. RoPE angles
theta = get_rope_angles(E.shape[1], d_k)
sin = theta.sin()[None, :, :]
cos = theta.cos()[None, :, :]

# 4. Apply RoPE
Q_rot = apply_rope(Q, sin, cos)
K_rot = apply_rope(K, sin, cos)

# 5. Attention
scores = torch.matmul(Q_rot, K_rot.transpose(-2, -1)) / (d_k ** 0.5)
attn = torch.softmax(scores, dim=-1)
output = torch.matmul(attn, V)

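9️⃣ Key Takeaways

RoPE is applied only to Q and K; V is left unchanged.
The rotation encodes position, so the attention score between two positions depends only on their relative offset.
Compared with the traditional additive scheme $E + P$ or a learned PE, RoPE adds no learnable parameters and is generally regarded as more stable and better at extrapolating to longer sequences.
With multi-head attention, RoPE is applied within each head. The rotation frequencies are fixed rather than learned, but they range from fast to slow, so each head's learned projections can draw on different frequency bands to capture relationships at multiple scales.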
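
The relative-position property can be checked numerically: rotate a query to position m and a key to position n, and the dot product should depend only on m − n. The sketch below is a minimal check on random vectors; the helper `rotate_at` is a hypothetical name introduced for this illustration.

```python
import torch

def rotate_at(x, pos, base=10000.0):
    """Apply the RoPE rotation for one position `pos` to a vector x of even length."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    theta = pos * inv_freq
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack([x1 * theta.cos() - x2 * theta.sin(),
                        x1 * theta.sin() + x2 * theta.cos()], dim=-1).flatten(-2)

torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)

# Same offset m - n = 3 at two different absolute positions.
score_a = rotate_at(q, 5) @ rotate_at(k, 2)
score_b = rotate_at(q, 45) @ rotate_at(k, 42)

# A different offset (m - n = 1), for comparison.
score_c = rotate_at(q, 5) @ rotate_at(k, 4)

same_offset_same_score = torch.allclose(score_a, score_b, atol=1e-4)
print(same_offset_same_score, score_a.item(), score_c.item())
```

Shifting both positions by the same amount multiplies q and k by the same extra rotation, which cancels in the dot product; only the offset between them survives.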


Complete script:

Python
# -*- coding: utf-8 -*-
"""
Created on Fri Apr 3 14:13:21 2026

@author: luogan
"""
import torch
import torch.nn as nn

vocab_size = 10000
d = 8  # small for the demo; real models typically use 512 or 1024

embedding = nn.Embedding(vocab_size, d)

tokens = torch.tensor([[1, 5, 9, 2]])  # shape: (1, n)
E = embedding(tokens)                  # (1, 4, 8)
print(E.shape)  # torch.Size([1, 4, 8])

d_k = d

W_Q = nn.Linear(d, d_k, bias=False)
W_K = nn.Linear(d, d_k, bias=False)
W_V = nn.Linear(d, d_k, bias=False)

Q = W_Q(E)
K = W_K(E)
V = W_V(E)


def get_rope_angles(seq_len, dim):
    pos = torch.arange(seq_len).float()    # (n,)
    i = torch.arange(0, dim, 2).float()    # (d/2,), even indices 2i
    inv_freq = 1.0 / (10000 ** (i / dim))  # (d/2,)
    theta = torch.outer(pos, inv_freq)     # (n, d/2)
    return theta


theta = get_rope_angles(seq_len=E.shape[1], dim=d_k)

sin = theta.sin()[None, :, :]  # (1, n, d/2)
cos = theta.cos()[None, :, :]  # (1, n, d/2)


def apply_rope(x, sin, cos):
    # x: (B, n, d)
    x1 = x[..., ::2]   # even dimensions
    x2 = x[..., 1::2]  # odd dimensions
    # Interleave the rotated pairs so index 2i / 2i+1 matches the formula.
    x_rot = torch.stack([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos
    ], dim=-1).flatten(-2)
    return x_rot


Q_rot = apply_rope(Q, sin, cos)
K_rot = apply_rope(K, sin, cos)

scores = torch.matmul(Q_rot, K_rot.transpose(-2, -1)) / (d_k ** 0.5)
attn = torch.softmax(scores, dim=-1)
output = torch.matmul(attn, V)
print(output.shape)  # torch.Size([1, 4, 8]), i.e. (1, n, d)
