
Dissecting nanoGPT: a detailed walkthrough

news 2026/3/27 4:07:53

nanoGPT structure

- Summary
- Attention mechanism
- nanoGPT inference file
- Model file
  - generate function
  - GPT forward
    - Flow
    - Dimensions
  - Block forward
    - Flow
    - Dimensions
  - LayerNorm layer
  - CausalSelfAttention layer
    - Flow
    - Dimensions
  - MLP
    - Flow
    - Dimensions

Summary

This blog is only a record of my own notes from studying the GPT-2 model, and most of the content was generated by AI. The code used here comes from https://github.com/karpathy/nanoGPT.git

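Attention mechanism

$\mathrm{softmax}\left(QK^T / \sqrt{d_k}\right) V$
Attention really comes down to three values, $Q$, $K$ and $V$; once you understand how these three are computed, you understand the mechanism. The code that follows implements scaled dot-product attention. Start with the tensor shapes: the computation only involves the last two dimensions. Q has shape (ql, dim) and K has shape (kl, dim), so $QK^T$ has shape (ql, kl). Softmax is applied along the last dimension and does not change the shape. V has shape (kl, dim), so multiplying the (ql, kl) weights by V gives a final shape of (ql, dim), which matches the output row in the docstring.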

import math
from typing import Optional

import torch
import torch.nn as nn


class BaseAttention(nn.Module):
    """
    Tensor          Type            Shape
    ===========================================================================
    q               float           (..., query_len, dims)
    k               float           (..., kv_len, dims)
    v               float           (..., kv_len, dims)
    mask            bool            (..., query_len, kv_len)
    ---------------------------------------------------------------------------
    output          float           (..., query_len, dims)
    ===========================================================================
    """

    def __init__(self, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self,
                q: torch.Tensor,
                k: torch.Tensor,
                v: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # scaled scores: (..., query_len, dims) x (..., dims, kv_len) -> (..., query_len, kv_len)
        x = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        if mask is not None:
            # positions where mask is True get a large negative score, so softmax gives them ~0 weight
            x += mask.type_as(x) * x.new_tensor(-1e4)
        # softmax over the kv_len dimension; the shape does not change
        x = self.dropout(x.softmax(-1))
        # weighted sum of values: (..., query_len, kv_len) x (..., kv_len, dims) -> (..., query_len, dims)
        return torch.matmul(x, v)
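
The shape bookkeeping described above can be checked by hand with a small pure-Python sketch. This is only an illustration under simple assumptions: nested lists stand in for 2-D tensors, there is no batch dimension, mask, or dropout, and the helper names (`softmax`, `matmul`, `attention`) are made up for this example rather than taken from nanoGPT.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """softmax(q @ k^T / sqrt(d_k)) @ v for 2-D nested lists."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]               # (dims, kv_len)
    scores = matmul(q, k_t)                             # (query_len, kv_len)
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]          # softmax keeps the shape
    return weights, matmul(weights, v)                  # (query_len, dims)

# Hypothetical toy sizes: query_len=2, kv_len=3, dims=4
q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
v = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

weights, out = attention(q, k, v)
print(len(weights), len(weights[0]))  # query_len, kv_len
print(len(out), len(out[0]))          # query_len, dims
```

Each row of `weights` sums to 1, and the output has one row per query with the same width as the values.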

nanoGPT inference file

The nanoGPT inference file is sample.py. The demo generates num_samples samples in a loop; each sample is a passage of text that extends the prompt by max_new_tokens tokens. An excerpt:

with torch.no_grad():  # inference: no gradients are computed
    with ctx:
        # generate num_samples samples
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')

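Model file

The nanoGPT model file is model.py. Its core is the model definition; for now I only look at the forward functions.

generate function

This function runs max_new_tokens steps of autoregressive prediction, appending one sampled token per step. Some explanatory notes are included as comments in the code.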

@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    """
    Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
    the sequence max_new_tokens times, feeding the predictions back into the model each time.
    Most likely you'll want to make sure to be in model.eval() mode of operation for this.
    """
    for _ in range(max_new_tokens):
        # if the sequence context is growing too long we must crop it at block_size
        # (keep all rows, and only the last block_size columns)
        idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
        # forward the model to get the logits for the index in the sequence
        logits, _ = self(idx_cond)
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        # (every token scoring below the k-th largest logit is set to -inf)
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution (random draw, one token per row)
        idx_next = torch.multinomial(probs, num_samples=1)
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
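
One step of the sampling loop (temperature scaling, top-k filtering, softmax, random draw) can be sketched in plain Python. This is a simplified illustration for a single sequence, not nanoGPT code: the helper names `next_token_probs` and `crop_context`, the toy logits, and the fixed seed are assumptions made for the example.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits for one position into sampling probabilities."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        threshold = sorted(scaled, reverse=True)[k - 1]   # k-th largest logit
        # anything below the k-th largest logit becomes -inf, i.e. probability 0
        scaled = [s if s >= threshold else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def crop_context(tokens, block_size):
    """Keep only the last block_size tokens once the context grows too long."""
    return tokens if len(tokens) <= block_size else tokens[-block_size:]

# Hypothetical logits over a 4-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
probs = next_token_probs(logits, temperature=0.8, top_k=2)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_id)
```

With `top_k=2`, only the two highest-scoring tokens keep non-zero probability, so the sampled index is always one of them. A temperature below 1 sharpens the distribution; above 1 flattens it.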

GPT forward

Flow

Dimensions

# model forward pass
def forward(self, idx, targets=None):
    device = idx.device  # device the input lives on
    b, t = idx.size()    # batch size, sequence length
    assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
    # position indices 0 .. t-1
    pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)
    # forward the GPT model itself
    # look up token embeddings from the token indices
    tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
    # look up position embeddings from the position indices
    pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
    # pos_emb is broadcast over the batch dimension, so x has shape (b, t, n_embd)
    x = self.transformer.drop(tok_emb + pos_emb)
    for block in self.transformer.h:
        x = block(x)
    x = self.transformer.ln_f(x)
    if targets is not None:
        # if we are given some desired targets also calculate the loss (training mode)
        logits = self.lm_head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
    else:
        # inference-time mini-optimization: only forward the lm_head on the very last position
        # (no loss is computed)
        logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
        loss = None
    return logits, loss
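
To keep track of the dimensions in this forward pass, the shapes can be written out as a small lookup. The function below is only a bookkeeping sketch: `gpt_forward_shapes` is a made-up helper, and the sizes in the example call are hypothetical, not values read from a nanoGPT config.

```python
def gpt_forward_shapes(b, t, n_embd, vocab_size, block_size, has_targets):
    """Return the tensor shapes at each stage of the GPT forward pass."""
    assert t <= block_size, "sequence is longer than block_size"
    shapes = {
        "idx": (b, t),
        "pos": (t,),
        "tok_emb": (b, t, n_embd),
        "pos_emb": (t, n_embd),   # broadcast over the batch dim when added to tok_emb
        "x": (b, t, n_embd),      # unchanged by every Block and by ln_f
    }
    if has_targets:
        # training: logits for every position, flattened before cross_entropy
        shapes["logits"] = (b, t, vocab_size)
        shapes["logits_flat"] = (b * t, vocab_size)
        shapes["targets_flat"] = (b * t,)
    else:
        # inference: lm_head runs on the last position only, time dim is kept
        shapes["logits"] = (b, 1, vocab_size)
    return shapes

# Hypothetical sizes chosen only for illustration
train_shapes = gpt_forward_shapes(b=2, t=5, n_embd=8, vocab_size=100, block_size=16, has_targets=True)
infer_shapes = gpt_forward_shapes(b=2, t=5, n_embd=8, vocab_size=100, block_size=16, has_targets=False)
print(train_shapes["logits"], infer_shapes["logits"])
```

The inference branch keeps a time dimension of size 1, which is why generate indexes the result with `logits[:, -1, :]`.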

block forward

Flow

Dimensions

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        # first layer normalization (not a linear layer)
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        # causal self-attention layer
        self.attn = CausalSelfAttention(config)
        # second layer normalization
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        # feed-forward MLP
        self.mlp = MLP(config)

    # forward pass: normalize, apply the sub-layer, then add the residual
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
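
The two lines of Block.forward follow a pre-norm residual pattern: normalize, run the sub-layer, add the result back onto the input. A minimal sketch on a single vector, with trivial stand-in functions in place of the real LayerNorm, attention, and MLP (all names here are hypothetical), makes the residual path visible:

```python
def block_forward(x, ln_1, attn, ln_2, mlp):
    """Pre-norm residual block applied to one vector (a list of floats)."""
    # x = x + attn(ln_1(x))
    x = [xi + ai for xi, ai in zip(x, attn(ln_1(x)))]
    # x = x + mlp(ln_2(x))
    x = [xi + mi for xi, mi in zip(x, mlp(ln_2(x)))]
    return x

# Stand-in sub-layers, used only to show how the residual additions behave
identity = lambda v: list(v)
zeros = lambda v: [0.0] * len(v)
ones = lambda v: [1.0] * len(v)

x = [0.5, -1.0, 2.0]
same = block_forward(x, identity, zeros, identity, zeros)    # sub-layers contribute nothing
shifted = block_forward(x, identity, ones, identity, ones)   # each sub-layer adds 1.0
print(same, shifted)
```

When both sub-layers output zeros the block passes its input through unchanged, and the output always has the same shape as the input.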

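LayerNorm layer

LayerNorm is a normalization layer, not a linear layer. It normalizes over the last dimension, so its input and output have the same shape.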

class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))                    # scale, initialized to ones
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None   # optional shift, initialized to zeros

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
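
What layer normalization computes over the last dimension can be written out for a single vector. This is a plain-Python sketch of the standard formula (subtract the mean, divide by the square root of the biased variance plus eps, then scale and optionally shift); the function name and the sample input are chosen only for this example.

```python
import math

def layer_norm(x, weight, bias=None, eps=1e-5):
    """Layer-normalize one vector, then apply a per-element scale and optional shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n          # biased variance (divide by n)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    out = [w * v for w, v in zip(weight, normed)]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, weight=[1.0] * 4, bias=None)   # unit weight, no bias, as at initialization
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(y, y_mean, y_var)
```

With unit weights and no bias, the output has mean close to 0 and variance close to 1, and its length equals the input length.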

CausalSelfAttention layer

Flow

Dimensions

# causal self-attention
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        # (one linear layer produces q, k and v together, for efficiency)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection: maps the concatenated head outputs back to n_embd
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            # (a lower-triangular matrix: 1 on and below the diagonal, 0 above it)
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2) # split the fused projection into q, k, v
        # reshape into heads: (B, T, C) -> (B, T, n_head, head_dim) -> transpose -> (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs) (batch, heads, seq_len, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            # apply the causal mask (hide positions the token is not allowed to see)
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            # softmax turns the scores into attention weights
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            # weighted sum of the values
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        # merge the heads
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
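
Two steps in this layer are easy to get wrong when tracing by hand: how the embedding vector is split into heads, and what the causal mask does to the softmax. The plain-Python sketch below covers both for one sequence and one head; `split_heads` and `causal_softmax` are illustrative helpers, not functions from nanoGPT, and the all-zero score matrix is a toy input.

```python
import math

def split_heads(vec, n_head):
    """Split one C-dimensional vector into n_head contiguous chunks of size C // n_head."""
    hs = len(vec) // n_head
    return [vec[h * hs:(h + 1) * hs] for h in range(n_head)]

def causal_softmax(scores):
    """Row-wise softmax over a (T, T) score matrix where position i only sees j <= i."""
    out = []
    for i, row in enumerate(scores):
        # entries above the diagonal are set to -inf, so exp() makes them 0
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

heads = split_heads(list(range(12)), n_head=3)   # C=12 -> 3 heads of size 4

scores = [[0.0] * 4 for _ in range(4)]           # toy (T, T) scores with T=4
att = causal_softmax(scores)
for row in att:
    print(row)
```

With equal scores, row i spreads its weight evenly over positions 0..i and assigns exactly zero to every later position, which is the "attend only to the left" behaviour the mask enforces.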

MLP

Flow

Dimensions

# MLP layer
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # linear layer from n_embd to 4 * n_embd; with bias it has 4 * n_embd * n_embd + 4 * n_embd parameters
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        # linear layer projecting back from 4 * n_embd to n_embd
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    # forward pass: expand, apply GELU, project back, dropout
    def forward(self, x):
        x = self.c_fc(x)      # (B, T, n_embd) -> (B, T, 4 * n_embd)
        x = self.gelu(x)
        x = self.c_proj(x)    # (B, T, 4 * n_embd) -> (B, T, n_embd)
        x = self.dropout(x)
        return x
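
The parameter count mentioned in the c_fc comment can be checked with simple arithmetic. The helper below is a hypothetical sketch; it assumes the two linear layers shown above and uses n_embd = 768 (the GPT-2 small embedding width) only as an example value.

```python
def mlp_param_count(n_embd, bias=True):
    """Count the parameters of the two linear layers in the MLP."""
    hidden = 4 * n_embd
    # c_fc: weight (hidden x n_embd) plus an optional bias of size hidden
    c_fc = n_embd * hidden + (hidden if bias else 0)
    # c_proj: weight (n_embd x hidden) plus an optional bias of size n_embd
    c_proj = hidden * n_embd + (n_embd if bias else 0)
    return c_fc, c_proj

c_fc_params, c_proj_params = mlp_param_count(768, bias=True)
print(c_fc_params, c_proj_params, c_fc_params + c_proj_params)
```

GELU and dropout have no learnable parameters, so these two numbers are the whole MLP. With `bias=False` the two layers have the same number of weights.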