
From DDPG to MADDPG: Which Lines of Code Change When You Give a Single-Agent Algorithm a 'Teammate's View'?

news 2026/6/17 18:07:12

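From DDPG to MADDPG: A Hands-On Guide to the Core Code Changes

When a single-agent reinforcement learning algorithm has to be extended to a multi-agent setting, MADDPG (Multi-Agent DDPG) offers an elegant solution. This article uses side-by-side code comparisons to show, step by step, how to turn a standard DDPG implementation into a MADDPG version that supports multi-agent cooperation. We focus on three key changes: widening the Critic network's input, adapting the experience replay buffer, and coordinating the training loop.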

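1. Network architecture changes: giving the Critic a global view

DDPG's Critic only has to evaluate one agent's state-action pair. MADDPG's core innovation is to let the Critic see every agent's information during training. This "centralized training" design requires targeted changes to the network structure.

1.1 Widening the Critic's input

In DDPG, the Critic's input is typically a single (state, action) pair. For MADDPG it must accept the global state together with the actions of all agents: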

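import torch.nn as nn

# DDPG Critic: the input is one agent's (state, action) pair
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 256)
        # remaining layers and forward() omitted

# MADDPG Critic: the input is the global state plus every agent's action
class MADDPGCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_agents):
        super().__init__()
        # Input width becomes global state + concatenation of all agents' actions
        # (assumes every agent has the same action_dim)
        self.fc1 = nn.Linear(state_dim + n_agents * action_dim, 256)
        # remaining layers and forward() omitted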

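Key changes:

The input width goes from state_dim + action_dim to state_dim + n_agents * action_dim
The forward pass must concatenate the actions of all agents, always in the same agent order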
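The two changes listed above can be sketched as a complete critic. This is a minimal sketch rather than the article's full network: the class name `CentralizedCritic`, the second hidden layer, and the `hidden` parameter are assumptions added for illustration, and all agents are assumed to share one `action_dim`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CentralizedCritic(nn.Module):
    """Hypothetical full critic: Q(global_state, joint_action) -> scalar."""

    def __init__(self, state_dim, action_dim, n_agents, hidden=256):
        super().__init__()
        # Input is the global state plus every agent's action, concatenated.
        self.fc1 = nn.Linear(state_dim + n_agents * action_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)  # assumed extra hidden layer
        self.out = nn.Linear(hidden, 1)

    def forward(self, state, joint_actions):
        # state: (batch, state_dim)
        # joint_actions: (batch, n_agents * action_dim)
        x = torch.cat([state, joint_actions], dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)


critic = CentralizedCritic(state_dim=10, action_dim=2, n_agents=3)
state = torch.randn(4, 10)
# Per-agent action batches, concatenated in a fixed agent order.
actions = [torch.randn(4, 2) for _ in range(3)]
q_values = critic(state, torch.cat(actions, dim=1))
```

The critic returns one Q-value per batch row. Keeping the concatenation order fixed matters because the first layer's weights are tied to column positions.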

1.2 The Actor stays independent

Unlike the Critic, the Actor still relies only on its local observation at execution time, so its structure does not change:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Actor network (identical in DDPG and MADDPG)
class Actor(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 256)
        self.fc2 = nn.Linear(256, action_dim)

    def forward(self, obs):
        x = F.relu(self.fc1(obs))
        return torch.tanh(self.fc2(x))  # assumes the action space is [-1, 1]
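To make "decentralized execution" concrete, the sketch below builds one actor per agent and lets each one act from its own observation only. The class name `LocalActor` and the observation sizes are assumptions for illustration; the layer layout mirrors the Actor above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalActor(nn.Module):
    """Same layout as the article's Actor; the name is hypothetical."""

    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 256)
        self.fc2 = nn.Linear(256, action_dim)

    def forward(self, obs):
        return torch.tanh(self.fc2(F.relu(self.fc1(obs))))


obs_dims = [8, 10, 6]  # assumed per-agent observation sizes
actors = [LocalActor(d, action_dim=2) for d in obs_dims]

# Execution: every agent acts from its own observation, nothing else.
observations = [torch.randn(d) for d in obs_dims]
with torch.no_grad():
    actions = [
        actor(obs.unsqueeze(0)).squeeze(0)
        for actor, obs in zip(actors, observations)
    ]
```

Because no actor reads another agent's observation, the trained policies can be deployed without the centralized critic.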

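2. Adapting the experience replay buffer

In a multi-agent environment, stored experience must keep every agent's observations aligned to the same time step, so the buffer has to hold both the global state and each agent's individual observation.

2.1 Multi-agent storage layout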

import numpy as np

class MultiAgentReplayBuffer:
    def __init__(self, capacity, obs_dims, state_dim, action_dims):
        self.capacity = capacity
        self.n_agents = len(obs_dims)

        # One observation store per agent
        self.obs_buffers = [np.zeros((capacity, dim)) for dim in obs_dims]
        self.next_obs_buffers = [np.zeros((capacity, dim)) for dim in obs_dims]

        # Global state store
        self.state_buffer = np.zeros((capacity, state_dim))
        self.next_state_buffer = np.zeros((capacity, state_dim))

        # Action and reward stores (one per agent), plus a shared done flag
        self.action_buffers = [np.zeros((capacity, dim)) for dim in action_dims]
        self.reward_buffers = [np.zeros((capacity, 1)) for _ in range(self.n_agents)]
        self.done_buffer = np.zeros((capacity, 1), dtype=np.float32)

        self.pos = 0   # next write position (ring buffer)
        self.size = 0  # number of valid transitions stored

2.2 Changes to the storage interface

    def add(self, obs_list, actions, rewards, next_obs_list, state, next_state, done):
        # Store each agent's own observation, action and reward
        for i in range(self.n_agents):
            self.obs_buffers[i][self.pos] = obs_list[i]
            self.next_obs_buffers[i][self.pos] = next_obs_list[i]
            self.action_buffers[i][self.pos] = actions[i]
            self.reward_buffers[i][self.pos] = rewards[i]

        # Store the global state and the done flag at the same index
        self.state_buffer[self.pos] = state
        self.next_state_buffer[self.pos] = next_state
        self.done_buffer[self.pos] = done

        # Advance the ring-buffer pointer, overwriting the oldest entry when full
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
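Reading from such a buffer needs the same care as writing to it: one index array should be drawn per batch and applied to every agent's store. The helper below is a minimal sketch of that idea; `sample_aligned` is a hypothetical name, and the toy data stores each row's time step so the alignment can be checked.

```python
import numpy as np


def sample_aligned(per_agent_buffers, size, batch_size, rng):
    """Draw one index array and apply it to every agent's buffer.

    Sharing the indices keeps row k of every returned batch tied to the
    same environment step, which a centralized critic relies on.
    """
    idx = rng.integers(0, size, size=batch_size)
    return idx, [buf[idx] for buf in per_agent_buffers]


# Toy data: each row stores its own time step, so alignment is easy to verify.
capacity, n_agents = 100, 3
buffers = [
    np.arange(capacity, dtype=np.float32).reshape(-1, 1)
    for _ in range(n_agents)
]
rng = np.random.default_rng(0)
idx, batches = sample_aligned(buffers, size=capacity, batch_size=8, rng=rng)
```

Sampling each agent's buffer with its own random indices would pair observations and actions from different time steps, and the joint Q-value would be meaningless.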

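3. Coordinating the training loop

MADDPG training has to coordinate parameter updates across several agents, which means restructuring the training loop.

3.1 Centralized Critic update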

    def update_critics(self, agents, batch_size):
        # Assumes `device` and `self.gamma` are defined elsewhere.
        # Sample one batch of indices shared by all agents
        idx = np.random.randint(0, self.size, size=batch_size)

        states = torch.FloatTensor(self.state_buffer[idx]).to(device)
        next_states = torch.FloatTensor(self.next_state_buffer[idx]).to(device)
        dones = torch.FloatTensor(self.done_buffer[idx]).to(device)

        # Collect the stored actions and the target-policy next actions
        all_actions = []
        all_next_actions = []
        with torch.no_grad():
            for i, agent in enumerate(agents):
                next_obs = torch.FloatTensor(self.next_obs_buffers[i][idx]).to(device)
                # Actions actually taken, read from the replay buffer
                all_actions.append(torch.FloatTensor(self.action_buffers[i][idx]).to(device))
                # Next actions from the target policy
                all_next_actions.append(agent.target_actor(next_obs))

        # Concatenate in agent order
        joint_actions = torch.cat(all_actions, dim=1)
        joint_next_actions = torch.cat(all_next_actions, dim=1)

        # Update each agent's Critic
        for i, agent in enumerate(agents):
            rewards = torch.FloatTensor(self.reward_buffers[i][idx]).to(device)

            # TD target
            with torch.no_grad():
                target_q = agent.target_critic(next_states, joint_next_actions)
                y = rewards + (1 - dones) * self.gamma * target_q

            # Current Q-value for the stored joint action
            current_q = agent.critic(states, joint_actions)

            critic_loss = F.mse_loss(current_q, y)
            agent.critic_optimizer.zero_grad()
            critic_loss.backward()
            agent.critic_optimizer.step()
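In equation form, the MADDPG critic update regresses each agent's Q-function onto a one-step TD target. The notation is an assumption chosen to match the code's variable names: $s$ is the global state, $a_j$ the action agent $j$ stored in the replay buffer, $o_j'$ its next observation, $\mu_j'$ its target actor, $Q_i'$ agent $i$'s target critic, and $d$ the done flag.

```latex
y_i = r_i + \gamma\,(1 - d)\, Q_i'\bigl(s',\, \mu_1'(o_1'), \dots, \mu_N'(o_N')\bigr),
\qquad
\mathcal{L}(\theta_i) = \mathbb{E}\Bigl[\bigl(Q_i(s, a_1, \dots, a_N) - y_i\bigr)^2\Bigr]
```

The target uses target-network actions for every agent, while the regressed Q-value uses the actions that were actually executed.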

3.2 Decentralized Actor update

    def update_actors(self, agents, batch_size):
        # Assumes `device` is defined elsewhere.
        idx = np.random.randint(0, self.size, size=batch_size)
        states = torch.FloatTensor(self.state_buffer[idx]).to(device)

        # Update each agent's Actor
        for i, agent in enumerate(agents):
            # Build the joint action in agent order, matching the Critic's input layout
            all_actions = []
            for j, other_agent in enumerate(agents):
                obs_j = torch.FloatTensor(self.obs_buffers[j][idx]).to(device)
                action_j = other_agent.actor(obs_j)
                if j != i:
                    # Other agents' actions are held fixed (no gradient)
                    action_j = action_j.detach()
                all_actions.append(action_j)
            joint_actions = torch.cat(all_actions, dim=1)

            # Policy gradient: maximize this agent's Q-value
            actor_loss = -agent.critic(states, joint_actions).mean()
            agent.actor_optimizer.zero_grad()
            actor_loss.backward()
            agent.actor_optimizer.step()
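The actor update can be summarized as gradient ascent on the agent's own Q-value, with the other agents' actions treated as constants. The symbols are assumptions chosen to match the code: $\mu_i$ is agent $i$'s actor with parameters $\phi_i$, $o_i$ its local observation, and $a_j$ ($j \neq i$) the actions of the other agents, held fixed.

```latex
J(\phi_i) = \mathbb{E}\Bigl[\, Q_i\bigl(s,\, a_1, \dots, \mu_i(o_i), \dots, a_N\bigr) \Bigr],
\qquad
\nabla_{\phi_i} J = \mathbb{E}\Bigl[\, \nabla_{\phi_i}\mu_i(o_i)\;
\nabla_{a_i} Q_i(s, a_1, \dots, a_N)\Big|_{a_i = \mu_i(o_i)} \Bigr]
```

The loss minimized in code is $-J(\phi_i)$. Agent $i$'s action occupies the same slot in the joint action that it occupied when the critic was trained.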

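4. Key adjustments and optimizations in practice

In practice, we found the following adjustments to have a noticeable effect on MADDPG performance:

4.1 Coordinating exploration noise

Exploration noise needs more care in a multi-agent environment: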

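    def get_action(self, obs, noise_scale=0.1):
        # Assumes `device` is defined elsewhere.
        obs = torch.FloatTensor(obs).unsqueeze(0).to(device)
        with torch.no_grad():
            action = self.actor(obs).squeeze(0).cpu().numpy()
        # Gaussian exploration noise; the caller is expected to decay noise_scale over training
        noise = noise_scale * np.random.randn(*action.shape)
        return np.clip(action + noise, -1, 1)  # assumes the action space is [-1, 1]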
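The decay itself happens outside `get_action`. One way to drive it is a multiplicative schedule, sketched below. The start value 0.3 and decay rate 0.9995 match the hyperparameter configuration later in the article; the `floor` parameter and the function name `noise_schedule` are assumptions added for illustration.

```python
def noise_schedule(start, decay, floor, n_episodes):
    """Yield one noise scale per episode: start * decay**k, never below floor."""
    scale = start
    for _ in range(n_episodes):
        yield scale
        scale = max(floor, scale * decay)


# `floor` is an assumed lower bound, not part of the article's setup.
scales = list(noise_schedule(start=0.3, decay=0.9995, floor=0.01, n_episodes=5000))
first, last = scales[0], scales[-1]
```

Each value would be passed as `noise_scale` for one episode, and using the same value for all agents keeps their exploration levels in step.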

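4.2 Tips for more stable training

Technique	DDPG implementation	MADDPG adjustment
Target network update	Updated on its own	Update all agents' target networks in the same step
Experience replay	Uniform sampling	Make sure each batch holds time-aligned transitions for all agents
Learning-rate schedule	Fixed learning rate	May need a more conservative learning-rate decay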
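The first row of the table can be sketched with a soft (Polyak) update applied to every agent in one pass. This is a minimal sketch under assumptions: `soft_update` is a hypothetical helper, and plain `nn.Linear` layers stand in for the real actor and critic networks.

```python
import torch
import torch.nn as nn


def soft_update(target, source, tau):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(s_param, alpha=tau)


# Two hypothetical agents, each a (network, target network) pair.
nets = [nn.Linear(4, 2) for _ in range(2)]
targets = [nn.Linear(4, 2) for _ in range(2)]
for net, tgt in zip(nets, targets):
    tgt.load_state_dict(net.state_dict())  # start identical
    with torch.no_grad():
        net.weight.add_(1.0)  # pretend a gradient step moved the network

# Update every agent's target in the same pass, after all gradient steps.
for net, tgt in zip(nets, targets):
    soft_update(tgt, net, tau=0.01)

gap = (nets[0].weight - targets[0].weight).abs().mean().item()
```

Running the loop after all agents have finished their gradient steps means every target network lags its online network by the same number of updates.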

4.3 Hyperparameter adjustments specific to multi-agent training

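# A typical MADDPG hyperparameter configuration
config = {
    'actor_lr': 1e-4,         # usually a smaller learning rate than in DDPG
    'critic_lr': 1e-3,
    'tau': 0.01,              # soft-update coefficient for target networks
    'gamma': 0.95,            # discount factor
    'batch_size': 1024,       # larger batches to stabilize training
    'buffer_size': int(1e6),  # larger replay buffer
    'noise_start': 0.3,       # initial exploration noise
    'noise_decay': 0.9995,    # noise decay rate
}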

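When extending DDPG to a multi-agent setting, the biggest challenge is not understanding the algorithm but getting the engineering details right. Practical issues such as keeping the agents' experience synchronized and choosing the order of network parameter updates demand particular attention to data consistency and training stability.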


http://www.jsqmd.com/news/708202/