Stop Memorizing Formulas! Hand-Write a Self-Attention Layer in Python and Truly Understand the Core of the Transformer

news 2026/4/23 23:42:44

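The most effective way to understand Self-Attention is not to memorize the formula but to implement it yourself. This article builds a complete Self-Attention layer from scratch in Python with NumPy and walks through the code step by step: generating the Q, K, and V matrices, computing the scaled dot product, normalizing with Softmax, and taking the weighted sum. A small numeric example, with the shape of every intermediate matrix noted along the way, makes this seemingly complicated mechanism concrete and easy to work with.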

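1. Environment Setup and Basic Concepts

Before writing any code, we need to pin down a few key concepts. Self-Attention is the core component of the Transformer architecture: it lets the model dynamically attend to different parts of the input while processing a sequence. Unlike traditional attention mechanisms, where queries typically come from one sequence and keys and values from another, the Query, Key, and Value in Self-Attention all come from the same input.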

Prepare a Python environment (3.8+ recommended). We only need NumPy and the standard library's math module:

    import math

    import numpy as np

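The roles of the three core matrices:

Q (Query): the "question" posed by the token currently being processed
K (Key): the "index" that each query is matched against
V (Value): the actual "content" that gets weighted and summed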
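
One way to build intuition for these roles is to treat attention as a "soft" dictionary lookup: a query is compared with every key, and the values are blended according to how well each key matches. The sketch below is only an illustration; the toy `keys`, `values`, and `query` arrays are made up for this example and do not appear elsewhere in the article.

```python
import numpy as np

# Hypothetical toy data: three keys, each paired with one scalar value
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([10.0, 20.0, 30.0])
query = np.array([0.0, 5.0])  # points strongly along the second key

scores = keys @ query                    # similarity of the query to each key
weights = np.exp(scores - scores.max())  # softmax, shifted for stability
weights /= weights.sum()
result = weights @ values                # values blended by match quality

print("weights:", np.round(weights, 3))
print("result:", result)
```

Almost all of the weight lands on the second key, so the result is approximately its value, 20. A hard dictionary lookup would return exactly one value; attention returns a weighted mixture.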

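2. Input Data and Parameter Initialization

Let's define a simple input sequence in which each token is represented by a 4-dimensional vector (real models typically use 512 or 768 dimensions):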

    # Input matrix X: 3 tokens, 4 dimensions each
    X = np.array([
        [1, 0, 1, 0],  # token 1
        [0, 2, 0, 2],  # token 2
        [1, 1, 1, 1],  # token 3
    ])
    print("Shape of input matrix X:", X.shape)  # (3, 4)

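Next, initialize the trainable weight matrices. In practice these parameters are learned; here we set them by hand to small integers so the arithmetic is easy to follow. Each matrix projects the 4-dimensional input down to 3 dimensions:

    # Weight matrices (in a real model these are randomly initialized and learned during training)
    W_Q = np.array([
        [1, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [0, 1, 1],
    ])
    W_K = np.array([
        [0, 0, 1],
        [1, 1, 0],
        [0, 1, 0],
        [1, 1, 0],
    ])
    W_V = np.array([
        [0, 2, 0],
        [0, 3, 0],
        [1, 0, 3],
        [1, 1, 0],
    ])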
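
For comparison, here is a rough sketch of what random initialization might look like. The scaling by 1/sqrt(d_model) is one common choice and is an assumption here, not something the article prescribes; the `_rand` names are hypothetical and kept separate so they do not overwrite the hand-set matrices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the sketch is reproducible
d_model, d_k = 4, 3                  # same sizes as the hand-set matrices

# Assumed scheme: Gaussian entries scaled by 1/sqrt(d_model)
scale = 1.0 / np.sqrt(d_model)
W_Q_rand = rng.standard_normal((d_model, d_k)) * scale
W_K_rand = rng.standard_normal((d_model, d_k)) * scale
W_V_rand = rng.standard_normal((d_model, d_k)) * scale

print(W_Q_rand.shape)  # (4, 3)
```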

3. Computing the Q, K, and V Matrices

Now compute the query, key, and value matrices:

    # Compute Q, K, V
    Q = np.dot(X, W_Q)  # (3,4) x (4,3) -> (3,3)
    K = np.dot(X, W_K)  # (3,4) x (4,3) -> (3,3)
    V = np.dot(X, W_V)  # (3,4) x (4,3) -> (3,3)

    print("Q matrix:\n", Q)
    print("K matrix:\n", K)
    print("V matrix:\n", V)

Output:

    Q matrix:
     [[1 0 2]
     [2 2 2]
     [2 1 3]]
    K matrix:
     [[0 1 1]
     [4 4 0]
     [2 3 1]]
    V matrix:
     [[1 2 3]
     [2 8 0]
     [2 6 3]]

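4. Computing the Attention Scores

The key step is to compute the attention scores, scale them, and normalize each row with Softmax:

    # Compute the scaled attention scores
    d_k = Q.shape[-1]  # feature dimension of the queries and keys, 3 here
    attention_scores = np.dot(Q, K.T) / math.sqrt(d_k)
    print("Scaled attention scores:\n", attention_scores)

    # Softmax normalization over each row
    def softmax(x):
        # Subtract the row maximum for numerical stability
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    attention_weights = softmax(attention_scores)
    print("Normalized attention weights:\n", attention_weights)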

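Output (rounded here to four decimal places; NumPy prints more digits):

    Scaled attention scores:
     [[1.1547 2.3094 2.3094]
     [2.3094 9.2376 6.9282]
     [2.3094 6.9282 5.7735]]
    Normalized attention weights:
     [[0.1361 0.4319 0.4319]
     [0.0009 0.9088 0.0903]
     [0.0074 0.7547 0.2378]]

Each row sums to 1. Token 1 splits its attention evenly between tokens 2 and 3, while tokens 2 and 3 both attend mostly to token 2, whose key has the largest dot product with their queries.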
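
To see where these numbers come from, it can help to redo a single row by hand. The sketch below recomputes the attention weights for token 1 only, using the first row of Q and the full K matrix from section 3; the variable names `q1`, `raw`, and `scaled` are introduced just for this check.

```python
import math

import numpy as np

q1 = np.array([1, 0, 2])                          # first row of Q (token 1)
K = np.array([[0, 1, 1], [4, 4, 0], [2, 3, 1]])   # K matrix from section 3

raw = K @ q1                   # dot product of q1 with every key
scaled = raw / math.sqrt(3)    # divide by sqrt(d_k), with d_k = 3
weights = np.exp(scaled - scaled.max())
weights = weights / weights.sum()

print(raw)                     # [2 4 4]
print(np.round(scaled, 4))     # [1.1547 2.3094 2.3094]
print(np.round(weights, 4))    # [0.1361 0.4319 0.4319]
```

The raw dot products are 2, 4, and 4, so tokens 2 and 3 tie and share most of the weight equally.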

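5. Weighted Sum and Final Output

The last step uses the attention weights to take a weighted sum of the rows of V:

    # Compute the weighted sum
    output = np.dot(attention_weights, V)  # (3,3) x (3,3) -> (3,3)
    print("Final output:\n", output)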

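Output (rounded to four decimal places):

    Final output:
     [[1.8639 6.3194 1.7042]
     [1.9991 7.8141 0.2735]
     [1.9926 7.4796 0.7359]]

Row 2 is close to the second row of V, [2 8 0], because token 2 puts about 91% of its attention on itself.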
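
The four steps above can be collected into a single function. This is a minimal sketch rather than a reference implementation; the function name `self_attention` and its choice to return both the output and the weights are assumptions made for this example.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, shifted by the row maximum for numerical stability
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention on one sequence."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # project the same input three ways
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (seq_len, seq_len)
    weights = softmax(scores)                  # each row sums to 1
    return weights @ V, weights

# Same input and weights as in section 2
X = np.array([[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]])
W_Q = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
W_K = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
W_V = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

output, weights = self_attention(X, W_Q, W_K, W_V)
print(np.round(output, 4))
```

Returning the weights alongside the output makes it easy to inspect or plot the attention pattern later.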

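6. Common Problems and Debugging Tips

A few typical problems come up during implementation:

Dimension mismatch errors:
- Make sure matrix multiplication dimensions line up: (n,d) x (d,m) → (n,m)
- Call print(matrix.shape) whenever you need to check dimensions
Forgetting to scale the dot product:
- Always divide by √d_k; otherwise the dot products grow with the dimension, Softmax saturates, and its gradients become vanishingly small
- The scaling factor uses d_k, the last dimension of K (and of Q)
Softmax numerical stability:
- Subtract the row maximum before exponentiating to prevent overflow
```
exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
```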

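Abnormal attention patterns:
- Check whether the weight distribution looks reasonable
- Visualizing the attention matrix helps with debugging:

    import matplotlib.pyplot as plt

    plt.imshow(attention_weights, cmap='viridis')
    plt.colorbar()
    plt.show()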
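
The effect of the scaling factor is easier to appreciate with a realistic dimension. The sketch below uses seeded random vectors with d_k = 512 (illustrative data, not from the article) and compares the largest Softmax weight with and without dividing by √d_k.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)       # seeded random data, purely illustrative
d_k = 512
q = rng.standard_normal(d_k)         # one query
K = rng.standard_normal((10, d_k))   # ten keys

raw = K @ q                          # spread of these scores grows like sqrt(d_k)
unscaled = softmax(raw)
scaled = softmax(raw / np.sqrt(d_k))

print("largest weight without scaling:", unscaled.max())
print("largest weight with scaling:   ", scaled.max())
```

Without scaling, the largest weight is pushed toward 1 and the others toward 0, which is the saturated regime where gradients are tiny. Dividing by √d_k keeps the distribution noticeably flatter.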

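7. Extension: Implementing Multi-Head Attention

A real Transformer uses multi-head attention. Here is a simplified version without the final output projection. The projected width must be divisible by the number of heads, so the 4x3 matrices from section 2 cannot be split into two heads; the example passes separate 4x4 weights instead:

    def multi_head_attention(X, W_Q, W_K, W_V, num_heads=2):
        d_proj = W_Q.shape[-1]  # projected width, split evenly across heads
        assert d_proj % num_heads == 0
        depth = d_proj // num_heads

        # Split the feature axis into heads: (seq_len, d_proj) -> (seq_len, num_heads, depth)
        def split_heads(x):
            return x.reshape(x.shape[0], num_heads, depth)

        Q = split_heads(np.dot(X, W_Q))
        K = split_heads(np.dot(X, W_K))
        V = split_heads(np.dot(X, W_V))

        # Compute attention separately for each head
        scaled_attention = []
        for h in range(num_heads):
            attn = np.dot(Q[:, h, :], K[:, h, :].T) / math.sqrt(depth)
            attn = softmax(attn)
            scaled_attention.append(np.dot(attn, V[:, h, :]))

        # Concatenate the head outputs back to (seq_len, d_proj)
        return np.concatenate(scaled_attention, axis=-1)

    # 4x4 demo weights (seeded random values) so that two heads of depth 2 fit
    rng = np.random.default_rng(0)
    W_Q4, W_K4, W_V4 = (rng.standard_normal((4, 4)) for _ in range(3))
    print("Multi-head attention output:\n", multi_head_attention(X, W_Q4, W_K4, W_V4))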
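
The per-head loop can also be written without a Python loop, and full multi-head attention normally ends with an output projection that mixes the heads. The following is a hedged sketch of both ideas: the function name `multi_head_attention_vectorized`, the `W_O` matrix, and the 4x4 random weights are all assumptions introduced for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_head_attention_vectorized(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_proj = X.shape[0], W_Q.shape[1]
    assert d_proj % num_heads == 0
    depth = d_proj // num_heads

    def split(x):
        # (seq_len, d_proj) -> (num_heads, seq_len, depth)
        return x.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(depth)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ V                         # (num_heads, seq_len, depth)
    # Move heads back next to the feature axis and merge them
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_proj)
    return concat @ W_O                                 # output projection mixes the heads

rng = np.random.default_rng(0)
X = np.array([[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]], dtype=float)
W_Q, W_K, W_V, W_O = (rng.standard_normal((4, 4)) for _ in range(4))

out = multi_head_attention_vectorized(X, W_Q, W_K, W_V, W_O, num_heads=2)
print(out.shape)  # (3, 4)
```

Putting the head axis first lets `@` treat it as a batch dimension, so all heads are computed in one matrix multiplication.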

8. Optimizations for Real-World Use

In a production setting we also need to consider the following.

Batching:

    # Assume batch_size=32, seq_len=10, dim=512
    batch_X = np.random.rand(32, 10, 512)
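
With a leading batch axis, the same attention computation carries over almost unchanged because NumPy's `@` operator broadcasts over leading dimensions. The sketch below assumes a projection width of 64 and seeded random weights; the function name `batched_self_attention` is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def batched_self_attention(X, W_Q, W_K, W_V):
    # X: (batch, seq_len, d_model); the weights are shared across the batch
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                        # (batch, seq_len, d_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])   # (batch, seq_len, seq_len)
    return softmax(scores) @ V                                 # (batch, seq_len, d_k)

rng = np.random.default_rng(0)
batch_X = rng.random((32, 10, 512))
# Assumed sizes: project 512 -> 64, scaled to keep the scores moderate
W_Q, W_K, W_V = (rng.standard_normal((512, 64)) * 512 ** -0.5 for _ in range(3))

out = batched_self_attention(batch_X, W_Q, W_K, W_V)
print(out.shape)  # (32, 10, 64)
```

Only the transpose changes compared with the single-sequence version: `K.T` would reverse all three axes, so the last two axes are swapped explicitly.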

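Masking (for variable-length sequences padded to a common length):

    def get_padding_mask(seq):
        # seq: (batch, seq_len) token ids, assuming id 0 is reserved for padding
        # 1.0 marks a real token, 0.0 marks a padding position
        mask = (seq != 0).astype('float32')
        # Add axes so the mask broadcasts over (batch, heads, query_len, key_len)
        return mask[:, np.newaxis, np.newaxis, :]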
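
A mask only helps once it is applied to the scores before Softmax. The sketch below shows one common way to do that, replacing the scores of padded keys with a large negative number; the token ids, the random scores, and the value -1e9 are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def get_padding_mask(seq):
    # 1.0 for real tokens, 0.0 for padding (id 0 assumed to mean padding)
    mask = (seq != 0).astype('float32')
    return mask[:, np.newaxis, np.newaxis, :]

seq = np.array([[5, 7, 0, 0]])   # hypothetical ids: two real tokens, two pads
mask = get_padding_mask(seq)     # shape (1, 1, 1, 4)

rng = np.random.default_rng(0)
scores = rng.standard_normal((1, 2, 4, 4))  # (batch, heads, query_len, key_len)

# Keep scores for real keys; push padded keys to a very negative value
masked_scores = np.where(mask == 1, scores, -1e9)
weights = softmax(masked_scores)

print(np.round(weights[0, 0], 3))
```

After Softmax, the columns for the two padded positions are effectively zero, so padding never contributes to the weighted sum.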

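Positional encoding:

    def positional_encoding(max_len, d_model):
        pos = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
        i = np.arange(d_model)[np.newaxis, :]    # (1, d_model)
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / d_model)
        pe = pos * angle_rates                   # (max_len, d_model)
        pe[:, 0::2] = np.sin(pe[:, 0::2])  # even indices
        pe[:, 1::2] = np.cos(pe[:, 1::2])  # odd indices
        return pe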
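
Self-Attention by itself is insensitive to token order, so the encoding is typically added to the token embeddings before the attention layer. The sketch below repeats the function so it runs on its own and uses all-zero placeholder embeddings, which are an assumption for illustration only.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / d_model)
    pe = pos * angle_rates
    pe[:, 0::2] = np.sin(pe[:, 0::2])  # even indices
    pe[:, 1::2] = np.cos(pe[:, 1::2])  # odd indices
    return pe

pe = positional_encoding(max_len=10, d_model=4)
embeddings = np.zeros((10, 4))   # placeholder token embeddings
X_with_pos = embeddings + pe     # element-wise addition, shapes must match

print(pe.shape)  # (10, 4)
print(pe[0])     # [0. 1. 0. 1.]
```

Position 0 encodes to alternating 0 and 1 because sin(0) = 0 and cos(0) = 1; every later position gets a distinct pattern of values in [-1, 1].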

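After writing it by hand, you can see that Self-Attention is essentially a carefully designed sequence of matrix operations. In real projects we use the optimized implementations in PyTorch or TensorFlow, but understanding the underlying computation helps you locate the cause quickly when a model misbehaves, and helps you tune the hyperparameters of the attention mechanism with more confidence.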

http://www.jsqmd.com/news/689687/