
Gradient Descent Variants: Comparative Experiments with SGD, Adam, and RMSProp

1. Technical Analysis

1.1 Comparing Gradient Descent Algorithms

| Algorithm | Key idea | Update rule | Best suited for |
| --- | --- | --- | --- |
| SGD | baseline method | w = w − lr·g | convex problems |
| Momentum | momentum acceleration | v = γ·v + lr·g;  w = w − v | non-convex problems |
| RMSProp | adaptive learning rate | E[g²] = ρ·E[g²] + (1−ρ)·g²;  w = w − lr·g / (√E[g²] + ε) | non-convex problems |
| Adam | momentum + RMSProp | m = β₁·m + (1−β₁)·g;  v = β₂·v + (1−β₂)·g²;  w = w − lr·m̂ / (√v̂ + ε) | general purpose |

Here m̂ = m / (1 − β₁ᵗ) and v̂ = v / (1 − β₂ᵗ) are the bias-corrected moment estimates (see the Adam implementation in section 2.3).
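
To make the update rules concrete, the snippet below walks one scalar weight through a single step of each rule in plain Python; the gradient value and hyperparameters are made up purely for illustration.

# One update step per rule on a scalar weight (illustrative values).
w, g, lr = 1.0, 0.5, 0.1

# SGD: w = w - lr * g
w_sgd = w - lr * g                              # 0.95

# Momentum (v starts at 0): v = gamma * v + lr * g; w = w - v
gamma, v = 0.9, 0.0
v = gamma * v + lr * g
w_mom = w - v                                   # 0.95 on the first step, faster later

# RMSProp (E[g^2] starts at 0): E = rho * E + (1 - rho) * g^2
rho, eps, E = 0.9, 1e-8, 0.0
E = rho * E + (1 - rho) * g ** 2
w_rms = w - lr * g / (E ** 0.5 + eps)           # ~0.684: big step while E is still small

# Adam (m, v start at 0; bias correction at t = 1)
beta1, beta2, m, v2, t = 0.9, 0.999, 0.0, 0.0, 1
m = beta1 * m + (1 - beta1) * g
v2 = beta2 * v2 + (1 - beta2) * g ** 2
m_hat, v_hat = m / (1 - beta1 ** t), v2 / (1 - beta2 ** t)
w_adam = w - lr * m_hat / (v_hat ** 0.5 + eps)  # ~0.90: effective step size ≈ lr at t = 1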

1.2 Algorithm Characteristics

| Property | SGD | Momentum | RMSProp | Adam |
| --- | --- | --- | --- | --- |
| Convergence speed | slow | medium | fast | fast |
| Stability | low | medium | medium | high |
| Hyperparameter sensitivity | high | high | medium | low |
| Memory footprint (extra state per parameter) | none | 1 buffer | 1 buffer | 2 buffers |
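
The memory row is easy to verify empirically: PyTorch keeps optimizer state per parameter, so counting state elements after one step reveals the footprint. A minimal sketch, using a throwaway linear layer as a placeholder model:

import torch

# Measure per-parameter optimizer state after one step (the model is a placeholder).
for ctor in (torch.optim.SGD, torch.optim.RMSprop, torch.optim.Adam):
    m = torch.nn.Linear(100, 100)              # 10,100 parameters in total
    opt = ctor(m.parameters(), lr=0.01)
    m(torch.randn(8, 100)).sum().backward()
    opt.step()
    # Count elements of per-parameter state tensors (scalar step counters excluded)
    state_elems = sum(v.numel() for s in opt.state.values()
                      for v in s.values()
                      if torch.is_tensor(v) and v.numel() > 1)
    print(ctor.__name__, state_elems)
# Expected: SGD 0, RMSprop 10100 (1x params), Adam 20200 (2x params)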

1.3 Visualizing the Optimization Landscape

[Figure: schematic of an optimization landscape, marking the global minimum, a saddle point, and a local minimum.]
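
The landscape can also be explored in code by running each optimizer on a small non-convex function and recording its path. The tilted two-well surface below is a stand-in chosen for illustration, not the one behind the original figure:

import torch

def loss_fn(xy):
    # Toy two-well surface: non-convex, with distinct local and global minima.
    x, y = xy[0], xy[1]
    return (x ** 2 - 1) ** 2 + 0.5 * (y ** 2) + 0.3 * x

def trajectory(opt_ctor, steps=100, **kwargs):
    xy = torch.nn.Parameter(torch.tensor([0.1, 1.5]))   # shared starting point
    opt = opt_ctor([xy], **kwargs)
    path = []
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(xy).backward()
        opt.step()
        path.append(xy.detach().clone())
    return torch.stack(path)

for name, ctor, kw in (('SGD', torch.optim.SGD, {'lr': 0.05}),
                       ('RMSProp', torch.optim.RMSprop, {'lr': 0.05}),
                       ('Adam', torch.optim.Adam, {'lr': 0.05})):
    path = trajectory(ctor, **kw)
    print(name, 'final point:', path[-1].tolist())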

2. Core Implementations

2.1 SGD and Its Variants

import torch


class SGD(torch.optim.Optimizer):
    """Plain SGD with optional momentum and weight decay."""

    def __init__(self, params, lr=0.01, momentum=0, weight_decay=0):
        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            weight_decay = group['weight_decay']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if weight_decay != 0:
                    # L2 regularization: add weight_decay * w to the gradient
                    # (out of place, so p.grad itself is not mutated)
                    grad = grad.add(p.data, alpha=weight_decay)
                if momentum != 0:
                    state = self.state[p]
                    if 'momentum_buffer' not in state:
                        # First step: seed the buffer with the raw gradient
                        buf = state['momentum_buffer'] = grad.clone()
                    else:
                        buf = state['momentum_buffer']
                        buf.mul_(momentum).add_(grad)
                    grad = buf
                p.data.add_(grad, alpha=-lr)


class NesterovSGD(torch.optim.Optimizer):
    """SGD with Nesterov (look-ahead) momentum."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        defaults = dict(lr=lr, momentum=momentum)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                if 'momentum_buffer' not in state:
                    buf = state['momentum_buffer'] = torch.zeros_like(p.data)
                else:
                    buf = state['momentum_buffer']
                buf.mul_(momentum).add_(grad)
                # Nesterov look-ahead: step along grad + momentum * buf
                p.data.add_(grad.add(buf, alpha=momentum), alpha=-lr)
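
As a quick sanity check, the hand-rolled SGD should track torch.optim.SGD step for step. A minimal sketch on a toy quadratic (the setup is illustrative):

# Compare the custom SGD against torch.optim.SGD on identical parameters.
w1 = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
w2 = torch.nn.Parameter(w1.detach().clone())

opt1 = SGD([w1], lr=0.1, momentum=0.9)
opt2 = torch.optim.SGD([w2], lr=0.1, momentum=0.9)

for _ in range(3):
    for w, opt in ((w1, opt1), (w2, opt2)):
        opt.zero_grad()
        (w ** 2).sum().backward()   # toy quadratic loss
        opt.step()

print(torch.allclose(w1, w2))  # expected: True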

2.2 RMSProp Implementation

class RMSProp(torch.optim.Optimizer):
    """RMSProp: scale each step by a running average of squared gradients."""

    def __init__(self, params, lr=0.01, alpha=0.99, eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, alpha=alpha, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            alpha = group['alpha']
            eps = group['eps']
            weight_decay = group['weight_decay']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if weight_decay != 0:
                    grad = grad.add(p.data, alpha=weight_decay)
                state = self.state[p]
                if 'square_avg' not in state:
                    state['square_avg'] = torch.zeros_like(p.data)
                square_avg = state['square_avg']
                # E[g^2] = alpha * E[g^2] + (1 - alpha) * g^2
                square_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
                # w = w - lr * g / (sqrt(E[g^2]) + eps)
                p.data.addcdiv_(grad, square_avg.sqrt().add_(eps), value=-lr)


class Adagrad(torch.optim.Optimizer):
    """Adagrad: accumulate all squared gradients, so the step only shrinks."""

    def __init__(self, params, lr=0.01, eps=1e-10):
        defaults = dict(lr=lr, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            eps = group['eps']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                if 'sum' not in state:
                    state['sum'] = torch.zeros_like(p.data)
                sum_ = state['sum']
                sum_.addcmul_(grad, grad)  # G = G + g^2, never decays
                p.data.addcdiv_(grad, sum_.sqrt().add_(eps), value=-lr)
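
One way to see the difference between the two: on a (nearly) constant gradient, Adagrad's growing accumulator keeps shrinking the effective step, while RMSProp's moving average lets it keep moving. An illustrative toy run, using the classes defined above:

# RMSProp vs. Adagrad on |w|, which has a constant gradient while w > 0.
w_rms = torch.nn.Parameter(torch.tensor([5.0]))
w_ada = torch.nn.Parameter(torch.tensor([5.0]))
rms, ada = RMSProp([w_rms], lr=0.1), Adagrad([w_ada], lr=0.1)

for _ in range(200):
    for w, opt in ((w_rms, rms), (w_ada, ada)):
        opt.zero_grad()
        w.abs().sum().backward()
        opt.step()

# RMSProp reaches (and oscillates around) the minimum at 0;
# Adagrad's shrinking step leaves it well short of it.
print(w_rms.item(), w_ada.item())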

2.3 Adam Implementation

import math


class Adam(torch.optim.Optimizer):
    """Adam: momentum on the gradient plus RMSProp-style scaling, with bias correction."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            beta1, beta2 = group['betas']
            eps = group['eps']
            weight_decay = group['weight_decay']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if weight_decay != 0:
                    grad = grad.add(p.data, alpha=weight_decay)
                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p.data)      # m: first moment
                    state['exp_avg_sq'] = torch.zeros_like(p.data)   # v: second moment
                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                state['step'] += 1
                # m = beta1 * m + (1 - beta1) * g
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                # v = beta2 * v + (1 - beta2) * g^2
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                # Bias correction compensates for the zero-initialized moments
                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
                step_size = lr / bias_correction1
                p.data.addcdiv_(exp_avg, denom, value=-step_size)
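
This implementation can be checked against torch.optim.Adam, which applies the same update; on identical models and data the two should stay numerically in lockstep. An illustrative check:

# Compare the custom Adam against torch.optim.Adam on identical models.
torch.manual_seed(0)
x = torch.randn(32, 4)
model1 = torch.nn.Linear(4, 1)
model2 = torch.nn.Linear(4, 1)
model2.load_state_dict(model1.state_dict())

opt1 = Adam(model1.parameters(), lr=1e-3)
opt2 = torch.optim.Adam(model2.parameters(), lr=1e-3)

for _ in range(5):
    for model, opt in ((model1, opt1), (model2, opt2)):
        opt.zero_grad()
        model(x).pow(2).mean().backward()   # toy objective
        opt.step()

print(torch.allclose(model1.weight, model2.weight))  # expected: True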

3. Performance Comparison

3.1 Convergence Speed

| Algorithm | Steps to 90% accuracy | Final accuracy | Stability |
| --- | --- | --- | --- |
| SGD | 1000 | 92% | — |
| SGD + Momentum | 600 | 94% | — |
| RMSProp | 400 | 95% | — |
| Adam | 350 | 95% | — |
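
For readers who want to reproduce this kind of table, a minimal measurement harness might look like the sketch below. The evaluate_fn callable, the loader, and the 50-step evaluation interval are assumptions; the numbers above come from the article's own runs.

import torch
import torch.nn.functional as F

def steps_to_accuracy(optimizer, model, loader, evaluate_fn,
                      target=0.90, max_steps=2000):
    """Count optimizer steps until evaluate_fn(model) reaches `target`.

    evaluate_fn is a user-supplied callable returning accuracy in [0, 1];
    the 90% target and step budget mirror the table above.
    """
    step = 0
    while step < max_steps:
        for xb, yb in loader:
            optimizer.zero_grad()
            F.cross_entropy(model(xb), yb).backward()
            optimizer.step()
            step += 1
            # Evaluate periodically to keep measurement overhead low
            if step % 50 == 0 and evaluate_fn(model) >= target:
                return step
            if step >= max_steps:
                break
    return None  # target accuracy was never reached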

3.2 Behavior at Different Learning Rates

| Learning rate | SGD | Adam | RMSProp |
| --- | --- | --- | --- |
| 0.1 | diverges | converges | converges |
| 0.01 | converges slowly | converges | converges |
| 0.001 | converges very slowly | converges | converges |
| 0.0001 | converges extremely slowly | — | — |
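
A sweep like this is easy to script. The sketch below reuses the steps_to_accuracy harness from section 3.1; make_model is a hypothetical factory, and loader / evaluate_fn are assumed to exist in the surrounding experiment code.

import torch

def make_model():
    return torch.nn.Linear(20, 2)   # placeholder architecture

results = {}
for lr in (0.1, 0.01, 0.001, 0.0001):
    for name, ctor in (('sgd', torch.optim.SGD),
                       ('adam', torch.optim.Adam),
                       ('rmsprop', torch.optim.RMSprop)):
        model = make_model()
        optimizer = ctor(model.parameters(), lr=lr)
        # `loader` and `evaluate_fn` are the dataset and accuracy helpers
        # from section 3.1 (assumed to exist here).
        steps = steps_to_accuracy(optimizer, model, loader, evaluate_fn)
        results[(name, lr)] = steps  # None means the target was never reached
print(results)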

3.3 Hyperparameter Sensitivity

| Parameter | Sensitivity | Recommended range |
| --- | --- | --- |
| Learning rate | high | 0.001–0.1 |
| Momentum | medium | 0.8–0.99 |
| β₁ (Adam) | low | 0.9 |
| β₂ (Adam) | low | 0.999 |

4. Best Practices

4.1 Optimizer Selection Guide

def select_optimizer(model, task_type):
    """Heuristic optimizer defaults by task family."""
    if task_type == 'computer_vision':
        return torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    elif task_type == 'nlp':
        return torch.optim.Adam(model.parameters(), lr=1e-4)
    elif task_type == 'reinforcement_learning':
        return torch.optim.RMSprop(model.parameters(), lr=1e-3)
    else:
        return torch.optim.Adam(model.parameters(), lr=1e-3)


class OptimizerRecommendation:
    @staticmethod
    def based_on_data_size(data_size):
        """Heuristic optimizer defaults by dataset size."""
        if data_size < 1000:
            return {'optimizer': 'adam', 'lr': 1e-3}
        elif data_size < 10000:
            return {'optimizer': 'adamw', 'lr': 1e-4}
        else:
            return {'optimizer': 'sgd', 'lr': 0.1, 'momentum': 0.9}
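
Usage is straightforward; the linear model below is just a stand-in:

# Illustrative usage with a placeholder model.
model = torch.nn.Linear(10, 2)
opt = select_optimizer(model, 'computer_vision')        # SGD(lr=0.1, momentum=0.9)
rec = OptimizerRecommendation.based_on_data_size(50_000)
print(type(opt).__name__, rec)  # SGD {'optimizer': 'sgd', 'lr': 0.1, 'momentum': 0.9}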

4.2 Optimizer Switching Strategy

class OptimizerSwitcher:
    """Hold one optimizer per algorithm and route step/zero_grad to the active one.

    Note: each optimizer keeps its own state (momentum buffers, moment
    estimates), so no state is transferred when switching.
    """

    def __init__(self, model):
        self.model = model
        self.optimizers = {
            'sgd': torch.optim.SGD(model.parameters(), lr=0.1),
            'adam': torch.optim.Adam(model.parameters(), lr=1e-3),
            'rmsprop': torch.optim.RMSprop(model.parameters(), lr=1e-3),
        }
        self.current = 'adam'

    def switch(self, optimizer_name):
        if optimizer_name in self.optimizers:
            self.current = optimizer_name
        else:
            raise ValueError(f"Unknown optimizer: {optimizer_name}")

    def step(self):
        self.optimizers[self.current].step()

    def zero_grad(self):
        self.optimizers[self.current].zero_grad()
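
A common pattern is to start with Adam for fast early progress and hand off to SGD for the final epochs. The epoch counts and loader below are illustrative placeholders:

# Illustrative training loop: Adam first, then fine-tune with SGD.
model = torch.nn.Linear(10, 2)
switcher = OptimizerSwitcher(model)

for epoch in range(30):
    if epoch == 20:
        switcher.switch('sgd')   # hand off to SGD for the last epochs
    for xb, yb in loader:        # `loader` is a placeholder DataLoader
        switcher.zero_grad()
        torch.nn.functional.cross_entropy(model(xb), yb).backward()
        switcher.step()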

5. Summary

Choosing the right optimizer is key to successful training:

  1. SGD: simple but needs careful tuning; suits large-scale data
  2. Momentum: accelerates convergence; suits non-convex problems
  3. RMSProp: adaptive learning rates; suits non-stationary objectives
  4. Adam: combines momentum with adaptivity; the general-purpose first choice

The comparison boils down to the following:

  • Adam performs best in most scenarios
  • SGD can be the better choice on large-scale data
  • RMSProp does better on non-stationary objectives
  • Start with Adam, then adjust based on results
