LLM Architecture Fundamentals: Transformer and Attention Mechanisms
1. Technical Analysis
1.1 LLM Architecture Overview
LLMs (Large Language Models) are built on the Transformer architecture:

Input layer → Embedding → Transformer Blocks → Output layer

Each Transformer block consists of:
- Multi-Head Attention
- Feed Forward Network
- Layer Normalization
- Residual Connection
1.2 Core Transformer Components
| Component | Role | Complexity |
|---|---|---|
| Multi-Head Attention | Captures relationships between positions | O(n²d) |
| Feed Forward | Non-linear transformation | O(nd²) |
| LayerNorm | Stabilizes training | O(nd) |
| Residual | Eases gradient propagation | O(nd) |
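The quadratic term in the table comes from the n × n attention-score matrix. A minimal shape check (the sizes are illustrative only):

```python
import torch

n, d = 1024, 64                 # sequence length, head dimension (illustrative)
Q, K, V = (torch.randn(n, d) for _ in range(3))

scores = Q @ K.T                # (n, n): n² dot products of length d -> O(n²d)
weights = torch.softmax(scores / d ** 0.5, dim=-1)
out = weights @ V               # second O(n²d) step; result is (n, d)
print(scores.shape, out.shape)  # torch.Size([1024, 1024]) torch.Size([1024, 64])
```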
1.3 LLM Model Comparison
| Model | Parameters | Architecture | Characteristics |
|---|---|---|---|
| GPT-3 | 175B | Decoder-only | Strong general-purpose ability |
| PaLM | 540B | Decoder-only | Strong reasoning ability |
| Llama | 65B | Decoder-only | Open source |
| T5 | 11B | Encoder-Decoder | Multi-task |
2. Core Implementation
2.1 Multi-Head Attention
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Separate projections for queries, keys, values, and the final output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, num_heads, seq, d_k)
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        Q = self.split_heads(self.W_q(Q), batch_size)
        K = self.split_heads(self.W_k(K), batch_size)
        V = self.split_heads(self.W_v(V), batch_size)
        output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads: (batch, num_heads, seq, d_k) -> (batch, seq, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(output)
        return output, attn_weights
```
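A quick smoke test of the module above (the dimensions are arbitrary, chosen only to show the shapes):

```python
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 16, 512)    # (batch, seq_len, d_model)
out, attn = mha(x, x, x)       # self-attention: Q = K = V = x
print(out.shape, attn.shape)   # (2, 16, 512) and (2, 8, 16, 16)
```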
2.2 Transformer Block
```python
import math

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Post-norm residual sublayers: LayerNorm(x + Dropout(Sublayer(x)))
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12, d_ff=3072, num_layers=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Buffer (not a parameter) so it moves with the model across devices
        self.register_buffer('positional_encoding',
                             self._create_positional_encoding(d_model))
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])
        self.fc = nn.Linear(d_model, vocab_size)

    def _create_positional_encoding(self, d_model, max_len=5000):
        # Sinusoidal positional encoding, shape (1, max_len, d_model) for broadcasting
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x):
        seq_len = x.size(1)
        x = self.embedding(x) + self.positional_encoding[:, :seq_len]
        # Causal mask: each position attends only to itself and earlier positions
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).bool()
        for layer in self.layers:
            x = layer(x, mask)
        return self.fc(x)
```
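A forward pass through a deliberately small configuration (hypothetical sizes, picked only so it runs quickly):

```python
model = GPTModel(vocab_size=1000, d_model=128, num_heads=4, d_ff=512, num_layers=2)
tokens = torch.randint(0, 1000, (2, 10))  # batch of 2 sequences of 10 token ids
logits = model(tokens)
print(logits.shape)                       # (2, 10, 1000): next-token logits per position
```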
2.3 LLM Inference
```python
class LLMInference:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.model.eval()

    def generate(self, prompt, max_len=100, temperature=1.0, top_k=50):
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        with torch.no_grad():
            for _ in range(max_len):
                outputs = self.model(input_ids)
                logits = outputs[:, -1, :] / temperature
                if top_k > 0:
                    # Keep only the top-k logits; mask the rest to -inf
                    # ([-1] keeps the threshold broadcastable for batch > 1)
                    v, _ = torch.topk(logits, top_k)
                    logits[logits < v[:, [-1]]] = float('-inf')
                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
                input_ids = torch.cat([input_ids, next_token], dim=1)
                if next_token.item() == self.tokenizer.eos_token_id:
                    break
        return self.tokenizer.decode(input_ids[0], skip_special_tokens=True)

class GPTDecoder:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def beam_search(self, prompt, max_len=100, beam_size=5):
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        beams = [(input_ids, 0.0)]  # (sequence, cumulative log-probability)
        for _ in range(max_len):
            new_beams = []
            for beam, score in beams:
                outputs = self.model(beam)
                logits = outputs[:, -1, :]
                probs = F.log_softmax(logits, dim=-1)
                top_probs, top_indices = torch.topk(probs, beam_size)
                # Expand each beam with its beam_size best continuations
                for i in range(beam_size):
                    new_beam = torch.cat([beam, top_indices[:, i].unsqueeze(1)], dim=1)
                    new_score = score + top_probs[:, i].item()
                    new_beams.append((new_beam, new_score))
            # Keep the beam_size highest-scoring candidates
            new_beams.sort(key=lambda x: x[1], reverse=True)
            beams = new_beams[:beam_size]
            if beams[0][0][0, -1].item() == self.tokenizer.eos_token_id:
                break
        best_beam = beams[0][0]
        return self.tokenizer.decode(best_beam[0], skip_special_tokens=True)
```
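The wrapper assumes a Hugging Face-style tokenizer (an `encode(..., return_tensors='pt')` method plus an `eos_token_id`). A sketch pairing it with the GPT-2 tokenizer and the untrained `GPTModel` above (`transformers` is an assumed dependency, and the untrained model will emit noise):

```python
from transformers import GPT2Tokenizer  # assumed dependency for tokenization only

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPTModel(vocab_size=tokenizer.vocab_size,
                 d_model=128, num_heads=4, d_ff=512, num_layers=2)
sampler = LLMInference(model, tokenizer)
print(sampler.generate("The Transformer architecture", max_len=20, top_k=40))
```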
3. Performance Comparison
3.1 LLM Model Comparison
| Model | Parameters (B) | Training data (TB) | Inference speed (tokens/s) |
|---|---|---|---|
| GPT-3 | 175 | 45 | 200 |
| PaLM | 540 | 780 | 100 |
| Llama-2 | 70 | 2 | 500 |
| Mistral | 7 | 0.8 | 1000 |
3.2 Attention Mechanism Comparison
| Type | Complexity | Quality | Use case |
|---|---|---|---|
| Full Attention | O(n²) | Best | Short sequences |
| Sparse Attention | O(n log n) | Good | Long sequences |
| Linear Attention | O(n) | Fair | Very long sequences |
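Sparse attention restricts which positions each token may attend to. A minimal sliding-window mask (one common sparse pattern; the window size is illustrative) that plugs into the `mask` argument of `MultiHeadAttention` above:

```python
def sliding_window_mask(seq_len, window=4):
    # Position i may attend only to positions i-window+1 .. i (causal and local)
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)    # rel[i, j] = j - i
    return ((rel <= 0) & (rel > -window)).int()  # 1 = attend, 0 = masked

print(sliding_window_mask(8, window=3))          # banded lower-triangular 0/1 matrix
```

Note that a dense mask by itself does not reduce the O(n²) cost; practical sparse-attention kernels exploit the band structure to skip the masked blocks entirely.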
3.3 Generation Strategy Comparison
| Strategy | Quality | Diversity | Speed |
|---|---|---|---|
| Greedy | Medium | Low | Fast |
| Beam Search | High | Low | Slow |
| Top-K | High | Medium | Medium |
| Top-P | High | High | Medium |
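`LLMInference.generate` above implements Top-K; Top-P (nucleus) sampling instead keeps the smallest set of tokens whose cumulative probability exceeds a threshold. A sketch of the standard filtering step (the 0.9 default is a common choice, not taken from this section):

```python
def top_p_filter(logits, top_p=0.9):
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cum_probs > top_p
    remove[..., 1:] = remove[..., :-1].clone()  # shift right so the boundary token survives
    remove[..., 0] = False                      # always keep the most probable token
    sorted_logits[remove] = float('-inf')
    # Undo the sort so logits line up with the original vocabulary order
    return sorted_logits.gather(-1, sorted_idx.argsort(-1))
```

Sampling then proceeds exactly as in `generate`: `probs = F.softmax(top_p_filter(logits), dim=-1)` followed by `torch.multinomial`.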
4. Best Practices
4.1 LLM Selection
```python
def select_llm(task_type, constraints):
    # Simple rule-of-thumb routing based on deployment constraints
    if constraints.get('open_source', False):
        return 'Llama-2'
    elif constraints.get('speed', False):
        return 'Mistral'
    else:
        return 'GPT-4'

class LLMFactory:
    @staticmethod
    def create(config):
        if config['type'] == 'gpt':
            from transformers import GPT2LMHeadModel
            return GPT2LMHeadModel.from_pretrained(config['model_name'])
        elif config['type'] == 'llama':
            from transformers import LlamaForCausalLM
            return LlamaForCausalLM.from_pretrained(config['model_name'])
```
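For example (the constraint keys follow the dict convention above):

```python
print(select_llm('chat', {'open_source': True}))  # -> 'Llama-2'
print(select_llm('chat', {'speed': True}))        # -> 'Mistral'
print(select_llm('chat', {}))                     # -> 'GPT-4'
```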
4.2 LLM Deployment
```python
class LLMDeployer:
    def __init__(self, model, tokenizer, config):
        self.model = model
        self.tokenizer = tokenizer
        self.config = config

    def optimize(self):
        if self.config.get('quantize', False):
            self.model = self._quantize_model()
        if self.config.get('compile', False):
            self.model = torch.compile(self.model)  # requires PyTorch 2.x

    def _quantize_model(self):
        # Dynamic int8 quantization of the Linear layers
        from torch.ao.quantization import quantize_dynamic
        return quantize_dynamic(self.model, {torch.nn.Linear})

    def serve(self):
        from fastapi import FastAPI
        app = FastAPI()

        @app.post('/generate')
        def generate(prompt: str):
            # Tokenize and decode around generation rather than passing raw text;
            # assumes a Hugging Face-style model with a .generate() method
            input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
            output_ids = self.model.generate(input_ids, max_new_tokens=100)
            return {'response': self.tokenizer.decode(output_ids[0],
                                                      skip_special_tokens=True)}

        return app
```
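A hypothetical end-to-end wiring of the factory and deployer (the model name, quantization flag, and server settings are illustrative; `uvicorn` is an assumed dependency):

```python
import uvicorn
from transformers import GPT2Tokenizer

model = LLMFactory.create({'type': 'gpt', 'model_name': 'gpt2'})
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

deployer = LLMDeployer(model, tokenizer, {'quantize': True})
deployer.optimize()
uvicorn.run(deployer.serve(), host='0.0.0.0', port=8000)
```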
5. Summary
LLMs are the core technology of contemporary NLP:
- Transformer: the foundational architecture of LLMs
- Attention mechanism: captures relationships across positions in text
- Generation strategy: determines the quality and diversity of the output
- Model selection: pick the model that matches the task requirements
Takeaways from the comparisons above:
- Llama-2 leads among the open-source models compared here
- GPT-4 leads in overall capability
- Quantization can substantially speed up inference
- Choose a model according to the requirements of the task