
PyTorch Automatic Differentiation: Backpropagation and Computational Graph Construction

news 2026/5/10 6:41:55


1. Technical Analysis

1.1 Definition of Automatic Differentiation

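Automatic differentiation (AD) computes the derivatives of a function to machine precision by applying the chain rule to the elementary operations that make it up. PyTorch implements it by recording those operations in a computational graph during the forward pass: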

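    import torch

    x = torch.tensor(2.0, requires_grad=True)  # leaf tensor tracked by autograd
    y = x ** 2                                 # the forward pass records the pow operation
    y.backward()                               # the backward pass computes dy/dx
    print(x.grad)                              # tensor(4.), since dy/dx = 2x = 4 at x = 2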

1.2 Structure of the Computational Graph

Computational Graph
├── Leaf nodes: input tensors
├── Intermediate nodes: results of operations
└── Root node: the output tensor
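The three node roles can be observed directly on PyTorch tensors. The following is a minimal sketch, assuming a tiny graph `x -> y -> z` chosen only for illustration; the variable name `parents_of_z` is hypothetical and not part of any API.

```python
import torch

# Leaf node: created directly by the user, so it has no grad_fn.
x = torch.tensor(2.0, requires_grad=True)
# Intermediate node: produced by an operation on x.
y = x ** 2
# Root node: the final output that backward() starts from.
z = 2 * y

print(x.is_leaf, x.grad_fn)                 # leaf tensors carry no grad_fn
print(y.is_leaf, type(y.grad_fn).__name__)  # non-leaf, records the pow step
print(z.is_leaf, type(z.grad_fn).__name__)  # non-leaf, records the mul step

# Each grad_fn links back to the functions that produced its inputs.
# Inputs that need no gradient (the constant 2) appear as None and are skipped.
parents_of_z = [fn for fn, _ in z.grad_fn.next_functions if fn is not None]
print(parents_of_z)
```

Walking `grad_fn.next_functions` recursively from the root is one way to traverse the whole recorded graph.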

1.3 Backpropagation Flow

Forward pass:   x ──(pow)── y = x² ──(mul)── z = 2y
Backward pass:  dz/dx = dz/dy * dy/dx = 2 * 2x = 4x
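The chain-rule result can be checked against autograd. This is a small sketch, assuming x = 3 as an arbitrary test point; `retain_grad()` is called only so that the intermediate derivative dz/dy stays available for inspection.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.retain_grad()        # keep dz/dy on the intermediate node for inspection
z = 2 * y
z.backward()

dz_dy = y.grad         # local derivative of the mul step: 2
dz_dx = x.grad         # chain rule: 2 * 2x = 4x, which is 12 at x = 3
print(dz_dy, dz_dx)
```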

2. Core Implementation

2.1 Building a Computational Graph by Hand

    class MyTensor:
        """Scalar value that remembers which operation produced it."""

        def __init__(self, value, grad_fn=None):
            self.value = value
            self.grad_fn = grad_fn  # node that created this tensor; None for leaves
            self.grad = 0.0

        def backward(self, grad=1.0):
            # Accumulate the incoming gradient, then pass it on to the producing node.
            self.grad += grad
            if self.grad_fn:
                self.grad_fn.backward(grad)


    class AddNode:
        def __init__(self, a, b):
            self.a = a
            self.b = b

        def backward(self, grad):
            # d(a + b)/da = d(a + b)/db = 1
            self.a.backward(grad)
            self.b.backward(grad)


    class MulNode:
        def __init__(self, a, b):
            self.a = a
            self.b = b

        def backward(self, grad):
            # d(a * b)/da = b and d(a * b)/db = a
            self.a.backward(grad * self.b.value)
            self.b.backward(grad * self.a.value)


    def add(a, b):
        return MyTensor(a.value + b.value, AddNode(a, b))


    def mul(a, b):
        return MyTensor(a.value * b.value, MulNode(a, b))

2.2 Automatic Differentiation in Practice with PyTorch

    import torch


    class LinearModel(torch.nn.Module):
        def __init__(self, input_dim, output_dim):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.randn(input_dim, output_dim))
            self.bias = torch.nn.Parameter(torch.randn(output_dim))

        def forward(self, x):
            return x @ self.weight + self.bias


    class GradientAccumulator:
        """Sums gradients over several backward passes before one optimizer step."""

        def __init__(self, model):
            self.model = model
            self.accumulated_grads = {}
            for name, param in model.named_parameters():
                self.accumulated_grads[name] = torch.zeros_like(param)

        def accumulate(self):
            # Call after each loss.backward(). param.grad is cleared afterwards,
            # because autograd itself adds into it and would double count.
            for name, param in self.model.named_parameters():
                if param.grad is not None:
                    self.accumulated_grads[name] += param.grad
                    param.grad = None

        def apply(self, optimizer):
            for name, param in self.model.named_parameters():
                # Clone so that param.grad never aliases the accumulation buffer.
                param.grad = self.accumulated_grads[name].clone()
            optimizer.step()
            optimizer.zero_grad()
            self.reset()

        def reset(self):
            for name in self.accumulated_grads:
                self.accumulated_grads[name].zero_()


    def compute_gradients(model, inputs, targets, loss_fn):
        model.zero_grad()  # start clean so the result covers this batch only
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        gradients = {}
        for name, param in model.named_parameters():
            if param.grad is not None:
                gradients[name] = param.grad.detach().clone()
        return gradients, loss.item()
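Gradient accumulation can also rely on autograd's own behavior of adding into `param.grad` on every `backward()` call. The sketch below assumes a toy `torch.nn.Linear` model, random data, and a mean-reduced MSE loss, all chosen for illustration; it checks that four micro-batches of two samples reproduce the gradient of one batch of eight.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

# Reference: gradient of the loss over the full batch of 8 samples.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.detach().clone()

# Accumulation: 4 micro-batches of 2 samples each.
accum_steps = 4
model.zero_grad()
for x_chunk, y_chunk in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    # Divide by accum_steps so the summed micro-batch means equal the full-batch mean.
    loss = loss_fn(model(x_chunk), y_chunk) / accum_steps
    loss.backward()  # adds into param.grad instead of overwriting it
accumulated_grad = model.weight.grad.detach().clone()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-5))
```

Dividing each micro-batch loss by the number of accumulation steps matters only for mean-reduced losses; with a sum-reduced loss the division would be dropped.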

2.3 Custom Backward Passes

    import torch


    class CustomReLU(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input):
            ctx.save_for_backward(input)  # needed to know where the gradient is zero
            return input.clamp(min=0)

        @staticmethod
        def backward(ctx, grad_output):
            input, = ctx.saved_tensors
            grad_input = grad_output.clone()
            grad_input[input < 0] = 0     # ReLU passes no gradient for negative inputs
            return grad_input


    class CustomLinear(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input, weight, bias):
            ctx.save_for_backward(input, weight)
            output = input @ weight + bias
            return output

        @staticmethod
        def backward(ctx, grad_output):
            # Assumes a 2-D input of shape (batch, in_features).
            input, weight = ctx.saved_tensors
            grad_input = grad_output @ weight.T
            grad_weight = input.T @ grad_output
            grad_bias = grad_output.sum(0)
            # One gradient per forward argument, in the same order.
            return grad_input, grad_weight, grad_bias


    class CustomModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.randn(10, 20))
            self.bias = torch.nn.Parameter(torch.randn(20))

        def forward(self, x):
            x = CustomReLU.apply(x)
            x = CustomLinear.apply(x, self.weight, self.bias)
            return x
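A hand-written `backward` is easy to get wrong, so it helps to compare it against finite differences with `torch.autograd.gradcheck`. The sketch below uses a hypothetical `Cube` function (not part of the code above) to keep the example self-contained; the tolerances are illustrative choices.

```python
import torch


class Cube(torch.autograd.Function):
    """Hypothetical example op: y = x ** 3 with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2  # d(x^3)/dx = 3x^2


# gradcheck needs double precision for its finite differences to be accurate.
x = torch.randn(5, dtype=torch.double, requires_grad=True)
passed = torch.autograd.gradcheck(Cube.apply, (x,), eps=1e-6, atol=1e-4)
print(passed)
```

`gradcheck` raises an error describing the mismatch when the analytical and numerical Jacobians disagree, and returns `True` otherwise.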

2.4 Computational Graph Optimization

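    import torch
    from torch.nn.utils.fusion import fuse_conv_bn_eval


    class GraphOptimizer:
        @staticmethod
        def fuse_operations(model):
            # Fuse every Conv2d that is directly followed by a BatchNorm2d inside
            # a Sequential container. fuse_conv_bn_eval takes the two modules
            # (not the container) and requires both to be in eval mode.
            fused_modules = []
            for name, module in model.named_modules():
                if isinstance(module, torch.nn.Sequential):
                    children = list(module.children())
                    for conv, bn in zip(children, children[1:]):
                        if isinstance(conv, torch.nn.Conv2d) and isinstance(bn, torch.nn.BatchNorm2d):
                            fused_modules.append(fuse_conv_bn_eval(conv.eval(), bn.eval()))
            return fused_modules

        @staticmethod
        def eliminate_common_subexpressions(graph):
            # Keep only the first node for each distinct string representation.
            subexpressions = {}
            optimized_graph = []
            for node in graph:
                key = str(node)
                if key not in subexpressions:
                    subexpressions[key] = node
                    optimized_graph.append(node)
            return optimized_graph


    def optimize_model(model):
        model.eval()
        for module in model.modules():
            if isinstance(module, torch.nn.Conv2d):
                # Note: weight_norm reparameterizes the weight into magnitude and
                # direction. It is a training-time technique and does not by
                # itself make inference faster.
                torch.nn.utils.weight_norm(module)
        return model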
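Conv-BatchNorm fusion folds the BatchNorm scale and shift into the convolution's weight and bias, so the fused layer should produce the same output in eval mode. The following sketch checks this with `torch.nn.utils.fusion.fuse_conv_bn_eval`; the layer sizes and input shapes are arbitrary assumptions.

```python
import torch
from torch.nn.utils.fusion import fuse_conv_bn_eval

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
bn = torch.nn.BatchNorm2d(8)

# Give BatchNorm non-trivial running statistics, then switch to eval mode.
bn.train()
with torch.no_grad():
    bn(conv(torch.randn(16, 3, 8, 8)))
conv.eval()
bn.eval()

fused = fuse_conv_bn_eval(conv, bn)  # a single Conv2d with folded weights

x = torch.randn(2, 3, 8, 8)
with torch.no_grad():
    reference = bn(conv(x))
    fused_out = fused(x)
max_diff = (reference - fused_out).abs().max().item()
print(max_diff)  # small floating-point difference only
```

Fusion is valid only in eval mode, because in training mode BatchNorm normalizes with per-batch statistics that cannot be folded into fixed weights.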

3. Performance Comparison

3.1 Automatic Differentiation Overhead

Operation	Forward pass	Backward pass	Total time
Simple operation	0.1 ms	0.3 ms	0.4 ms
Complex model	10 ms	30 ms	40 ms
Large model	100 ms	300 ms	400 ms
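Timings like these depend on the hardware and the model, so it is worth measuring them for a specific setup. The sketch below is one way to do that on CPU; the helper name `time_forward_backward`, the small MLP, and the batch size are assumptions made for illustration.

```python
import time
import torch


def time_forward_backward(model, inputs, repeats=20):
    """Return average (forward_ms, backward_ms) over `repeats` runs on CPU."""
    forward_s = backward_s = 0.0
    for _ in range(repeats):
        model.zero_grad()
        t0 = time.perf_counter()
        loss = model(inputs).sum()
        t1 = time.perf_counter()
        loss.backward()
        t2 = time.perf_counter()
        forward_s += t1 - t0
        backward_s += t2 - t1
    return 1000 * forward_s / repeats, 1000 * backward_s / repeats


model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
forward_ms, backward_ms = time_forward_backward(model, torch.randn(64, 256))
print(f"forward {forward_ms:.3f} ms, backward {backward_ms:.3f} ms")
```

On a GPU, kernels run asynchronously, so `torch.cuda.synchronize()` should be called before each clock reading for the numbers to be meaningful.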

3.2 Custom vs. Built-in Operations

Operation type	Forward speed	Backward speed	Memory usage
Built-in	Fast	Fast	Low
Custom	Medium	Slow	High
Mixed	Medium	Medium	Medium

3.3 Gradient Accumulation Comparison

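Accumulation steps	Memory usage	Training speed	Gradient quality
1	High	Fast	Good
4	Low	Medium	Good
8	Very low	Slow	Fairly good
16	Extremely low	Very slow	Fair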

4. Best Practices

4.1 Gradient Checking

    import torch


    def check_gradients(model, inputs, targets, loss_fn, epsilon=1e-6):
        # With epsilon this small, run the model and data in float64
        # (model.double()); float32 rounding error would dominate the result.
        model.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()

        for name, param in model.named_parameters():
            if param.grad is None:
                continue
            analytical_grad = param.grad.detach().clone()
            numerical_grad = torch.zeros_like(param)

            # Perturbing a leaf parameter in place is only allowed under no_grad.
            with torch.no_grad():
                param_flat = param.view(-1)
                for i in range(param.numel()):
                    # Central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
                    param_flat[i] += epsilon
                    loss_plus = loss_fn(model(inputs), targets)
                    param_flat[i] -= 2 * epsilon
                    loss_minus = loss_fn(model(inputs), targets)
                    param_flat[i] += epsilon  # restore the original value
                    numerical_grad.view(-1)[i] = (loss_plus - loss_minus) / (2 * epsilon)

            max_error = torch.abs(analytical_grad - numerical_grad).max()
            print(f"{name}: max error = {max_error}")

4.2 Gradient Clipping

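    import torch


    def clip_gradients(model, max_norm=1.0):
        # Rescale all gradients together so their global L2 norm is at most max_norm.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)


    def adaptive_grad_clip(model, clip_value=1.0):
        # Clip each parameter's gradient separately to an L2 norm of clip_value.
        for param in model.parameters():
            if param.grad is not None:
                grad_norm = param.grad.norm()
                if grad_norm > clip_value:
                    param.grad.mul_(clip_value / grad_norm)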
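The effect of global-norm clipping can be seen by comparing the gradient norm before and after the call. This is a minimal sketch, assuming a toy linear model and a loss scaled up by 100 purely so that the gradient norm exceeds the threshold.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
# Scale the loss up so the gradient norm is well above the clipping threshold.
loss = 100.0 * model(torch.randn(32, 10)).pow(2).mean()
loss.backward()

max_norm = 1.0
# clip_grad_norm_ returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
# Recompute the global L2 norm from the per-parameter norms after clipping.
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(float(norm_before), float(norm_after))
```

Because all gradients are multiplied by the same factor, global-norm clipping preserves the direction of the overall update, whereas per-parameter clipping can change it.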