
Transformer Architecture Basics Explained (with Diagrams + Python Demo)

news 2026/5/12 18:26:24

I. Why Did the Transformer Come About?

Before the Transformer appeared, the mainstream sequence models were:

RNN (recurrent neural network)
LSTM / GRU

    import torch
    import torch.nn as nn

    # Define an RNN
    rnn = nn.RNN(input_size=128, hidden_size=128)

    # Input: sequence length = 5, batch size = 1
    x = torch.rand(5, 1, 128)
    h = torch.zeros(1, 1, 128)  # initial hidden state

    outputs = []
    for t in range(5):
        # ❗ Must be computed one time step at a time (serial)
        out, h = rnn(x[t].unsqueeze(0), h)
        outputs.append(out)
    # outputs holds the result of every time step

    import torch
    import torch.nn as nn

    # Define an LSTM
    lstm = nn.LSTM(input_size=128, hidden_size=128)

    x = torch.rand(5, 1, 128)
    h = torch.zeros(1, 1, 128)  # hidden state
    c = torch.zeros(1, 1, 128)  # cell state (the key addition over a plain RNN)

    outputs = []
    for t in range(5):
        # ❗ Still serial: step t needs (h, c) from step t-1
        out, (h, c) = lstm(x[t].unsqueeze(0), (h, c))
        outputs.append(out)

    # Transformer example
    import torch
    import torch.nn as nn

    # Multi-head attention
    attn = nn.MultiheadAttention(embed_dim=128, num_heads=8)

    # The whole sequence is fed in at once: (sequence length, batch, embed_dim)
    x = torch.rand(5, 1, 128)

    # Q = K = V = x (self-attention)
    output, weights = attn(x, x, x)

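RNNs and LSTMs have two critical problems:

❌ No parallelism across time steps: step t has to wait for step t-1, so training is slow
❌ Long-range dependencies are hard to capture (the LSTM cell state eases this but does not eliminate it)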

II. Overall Transformer Architecture (Core Structure)

Input → Encoder → Decoder → Output

         ┌───────────────┐
Input  → │  Encoder × N  │
         └───────────────┘
                 ↓
         ┌───────────────┐
Target → │  Decoder × N  │ → Output
         └───────────────┘

III. Encoder Structure (Key Part)

In a Transformer, the Encoder is the part responsible for "understanding" the input.

Each Encoder layer contains two core modules:

👉 Self-Attention + Feed Forward (a position-wise feed-forward network)

Input
  ↓
Multi-Head Self-Attention
  ↓
Add & LayerNorm
  ↓
Feed Forward Network
  ↓
Add & LayerNorm
  ↓
Output
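The data flow of one Encoder layer can be sketched in plain Python. This is only a rough sketch under stated assumptions, not a full layer: `attention_stub` and `feed_forward_stub` are hypothetical placeholders for the real sub-layers, and `layer_norm` leaves out the learnable scale and bias that `nn.LayerNorm` has. What it does show is the "Add & LayerNorm" step: the sub-layer output is added back onto its input (residual connection) and the sum is normalized per token.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learnable scale/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by LayerNorm, applied token by token."""
    return [layer_norm([a + b for a, b in zip(tok, out)])
            for tok, out in zip(x, sublayer_out)]

def attention_stub(x):
    # Hypothetical placeholder: every token receives the average of all tokens.
    n = len(x)
    avg = [sum(tok[i] for tok in x) / n for i in range(len(x[0]))]
    return [avg[:] for _ in x]

def feed_forward_stub(x):
    # Hypothetical placeholder: a position-wise ReLU.
    return [[max(0.0, v) for v in tok] for tok in x]

def encoder_layer(x):
    x = add_and_norm(x, attention_stub(x))     # Multi-Head Self-Attention -> Add & LayerNorm
    x = add_and_norm(x, feed_forward_stub(x))  # Feed Forward Network -> Add & LayerNorm
    return x

# 3 tokens, each a 4-dimensional vector (made-up numbers).
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.0, -2.0, 1.0]]
encoded = encoder_layer(tokens)
print(len(encoded), len(encoded[0]))
```

The output has the same shape as the input, which is what allows N such layers to be stacked.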

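1️⃣ Self-Attention

1. What it does: 👉 lets every word "see" all the words in the sentence and decide which ones to attend to

2. An intuitive example: The animal didn't cross the street because it was too tired

Question: what does "it" refer to?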

animal ❓
street ❓

👉 A trained Self-Attention layer can learn this on its own:

"it" attends more strongly to "animal"
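To make "attends more strongly" concrete, here is a toy sketch. The 2-D vectors are hand-picked for illustration (a real model learns them, and nothing here is evidence of what a trained model does); the point is only the mechanism: dot products between the query for "it" and each key are turned into weights by a softmax.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hand-picked vectors, for illustration only.
query_it = [1.0, 0.2]
keys = {
    "animal": [0.9, 0.1],
    "street": [0.1, 0.8],
    "tired":  [0.6, 0.3],
}

# One dot-product score per candidate word, then softmax into weights.
scores = [sum(q * k for q, k in zip(query_it, key)) for key in keys.values()]
weights = dict(zip(keys, softmax(scores)))
for word, w in weights.items():
    print(f"{word}: {w:.2f}")
```

With these made-up vectors, "animal" receives the largest weight because its key points in nearly the same direction as the query.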

3. The math (core)

Step 1: Generate Q / K / V

The input vector x goes through three linear projections:

Q = xWq
K = xWk
V = xWv
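A minimal sketch of these three projections, using nested lists instead of tensors. The matrices `Wq`, `Wk`, `Wv` below are hypothetical values chosen so the result is easy to check by hand; in a real model they are learned parameters.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# 3 tokens, d_model = 2. All numbers are made up for illustration.
x = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Hypothetical projection matrices.
Wq = [[1.0, 0.0], [0.0, 1.0]]  # identity: Q equals x
Wk = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two features
Wv = [[2.0, 0.0], [0.0, 2.0]]  # doubles every feature

Q = matmul(x, Wq)
K = matmul(x, Wk)
V = matmul(x, Wv)
print(Q)
print(K)
print(V)
```

The same input x yields three different views of each token: what it is looking for (Q), what it offers for matching (K), and what it contributes to the output (V).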

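Step 2: Compute the attention weights

weights = softmax(QKᵀ / √d_k), where d_k is the dimension of the key vectors

Step 3: Weighted sum

output = weights · V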
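Steps 2 and 3 together are scaled dot-product attention: softmax(QKᵀ / √d_k), followed by a weighted sum over the rows of V. Below is a small pure-Python sketch with made-up inputs; the function name `attention` and the nested-list representation are choices made for this illustration.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    # Step 2: score every query against every key, scale by sqrt(d_k), softmax per row.
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]
    # Step 3: each output row is a weighted sum of the rows of V.
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
               for j in range(len(V[0]))] for w_row in weights]
    return output, weights

# 3 tokens, d_k = 2 (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors.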

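4. Multi-Head Attention

Instead of running Attention only once, the model runs several heads in parallel, and each head can specialize in a different kind of relation, for example:

Head1: learns syntactic relations
Head2: learns semantic relations
Head3: learns positional relations
...

Finally, the heads are concatenated: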

Concat(head1, head2, ...) → Linear
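The split-and-concatenate idea can be sketched as follows. This is a deliberate simplification, not how a library implements it: each head uses its slice of x directly as Q, K and V (no learned per-head projections), and the final Linear layer is omitted. The helper names `split_heads` and `multi_head_attention` are made up for this example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists; returns the output rows."""
    d_k = len(K[0])
    out = []
    for q_row in Q:
        scores = [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        w = softmax(scores)
        out.append([sum(wi * v_row[j] for wi, v_row in zip(w, V)) for j in range(len(V[0]))])
    return out

def split_heads(x, num_heads):
    """Slice each token vector into num_heads equal chunks: one token list per head."""
    d_head = len(x[0]) // num_heads
    return [[tok[h * d_head:(h + 1) * d_head] for tok in x] for h in range(num_heads)]

def multi_head_attention(x, num_heads):
    # Simplification: Q = K = V = the head's slice of x; no learned projections.
    heads = [attention(h, h, h) for h in split_heads(x, num_heads)]
    # Concat(head1, head2, ...): glue the per-head outputs back together per token.
    return [[v for head in heads for v in head[t]] for t in range(len(x))]

# 3 tokens, d_model = 4, split into 2 heads of size 2 (made-up numbers).
x = [[1.0, 0.0, 0.5, 2.0], [0.0, 1.0, 1.5, -1.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, num_heads=2)
print(len(out), len(out[0]))
```

Each head computes its own attention weights from its own slice, which is what lets different heads focus on different relations before the results are merged.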

👉 In essence:

the sentence is understood from multiple "angles"

5. Simplified PyTorch implementation

    import torch
    import torch.nn.functional as F

    def self_attention(x):
        d_model = x.size(-1)
        # Random matrices stand in for the learned projections Wq, Wk, Wv
        Wq = torch.rand(d_model, d_model)
        Wk = torch.rand(d_model, d_model)
        Wv = torch.rand(d_model, d_model)
        Q = x @ Wq
        K = x @ Wk
        V = x @ Wv
        # Scaled dot-product scores, then softmax over the key dimension
        scores = Q @ K.transpose(-2, -1) / (d_model ** 0.5)
        weights = F.softmax(scores, dim=-1)
        return weights @ V

    x = torch.rand(1, 5, 64)  # (batch, sequence length, d_model)
    out = self_attention(x)
    print(out.shape)  # torch.Size([1, 5, 64])