
Step-by-step tutorial: training a drone for fixed-point flight in AirSim with Q-learning and Sarsa (Python code included)

news 2026/5/5 19:50:58

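Reinforcement-learning drone control in AirSim from scratch: Q-learning and Sarsa in practice

In autonomous drone control, reinforcement learning is increasingly used to tackle complex decision-making problems. This article combines the AirSim simulation platform with two classic tabular algorithms, Q-learning and Sarsa, and walks through a working fixed-point flight control project, from environment setup to algorithm implementation.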

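1. Environment setup and project configuration

1.1 Deploying AirSim

First make sure the AirSim simulation environment runs correctly. Windows 10/11 is recommended; configure it as follows:

- Install Unreal Engine 4.27 or later through the Epic Games Launcher
- Download the Blocks environment provided by AirSim, or use a custom scene
- Edit the settings.json configuration file so that the vehicle model and physics parameters match the experiment

Example of the key configuration parameters: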

{
  "SettingsVersion": 1.2,
  "SimMode": "Multirotor",
  "PhysicsEngineName": "FastPhysicsEngine",
  "Vehicles": {
    "Drone1": {
      "VehicleType": "SimpleFlight",
      "X": 0,
      "Y": 0,
      "Z": 0
    }
  }
}
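For reproducible experiments, the same configuration can be generated from Python instead of edited by hand. The sketch below is only a convenience helper under stated assumptions: `build_settings` and `write_settings` are hypothetical names (not part of AirSim), only the core keys of the example are included, and the output path must be wherever your AirSim installation reads settings.json from (commonly the AirSim folder under the user's Documents directory).

```python
import json
from pathlib import Path


def build_settings(vehicle_name="Drone1", x=0.0, y=0.0, z=0.0):
    """Return a settings dict with the core keys of the example configuration."""
    return {
        "SettingsVersion": 1.2,
        "SimMode": "Multirotor",
        "Vehicles": {
            vehicle_name: {
                "VehicleType": "SimpleFlight",
                "X": x,
                "Y": y,
                "Z": z,
            }
        },
    }


def write_settings(path, settings):
    """Write the settings dict as indented JSON and return the resulting path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    return path
```

AirSim reads settings.json at startup, so the simulator has to be restarted after the file changes.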

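1.2 Installing Python dependencies

Create a dedicated conda environment and install the required packages:

conda create -n airsim_rl python=3.8
conda activate airsim_rl
# Installing numpy and msgpack-rpc-python first can avoid an import error during the airsim install
pip install numpy msgpack-rpc-python
pip install airsim pandas pyyaml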

A suggested project layout:

├── configs/
│   └── drone_config.yaml
├── src/
│   ├── envs/
│   │   └── drone_env.py
│   ├── algorithms/
│   │   ├── q_learning.py
│   │   └── sarsa.py
│   └── main.py
└── data/    # training data storage

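2. Wrapping the drone control environment

2.1 State space design

drone_env.py wraps the interaction with AirSim. The design of the state space directly affects how well the algorithms learn; here the state is the drone's position relative to the target, normalized: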

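import airsim
import numpy as np


class DroneNavigationEnv:
    def __init__(self, config_file):
        self.config_file = config_file  # path to the YAML config; not parsed in this basic version
        self.client = airsim.MultirotorClient()
        self.client.confirmConnection()
        # Target position (x, y, z) in metres; AirSim uses NED axes, so negative z is up
        self.target_pos = np.array([10, 0, -3])
        self.done = False  # set by calculate_reward when an episode ends

        # Action space: each action is a 1 m step along the x or y axis
        self.actions = {
            0: 'move_x_1m',
            1: 'move_x_-1m',
            2: 'move_y_1m',
            3: 'move_y_-1m'
        }

    def get_state(self):
        kinematics = self.client.simGetGroundTruthKinematics()
        current_pos = np.array([
            kinematics.position.x_val,
            kinematics.position.y_val,
            kinematics.position.z_val
        ])
        # Normalize the offset from the target by 10 m; rounding to two decimals
        # quantizes positions to 0.1 m cells, which keeps the Q-table finite
        state = (current_pos - self.target_pos) / 10.0
        return state.round(2)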
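Because the Q-table is indexed by state, the rounding in get_state is effectively a discretization: dividing by 10 m and rounding to two decimals yields cells of 0.1 m. The standalone sketch below expresses the same idea with integer grid indices. The function name `grid_state` and the `cell_size` parameter are illustrative assumptions and do not appear in the original environment.

```python
def grid_state(position, target, cell_size=0.1):
    """Map a continuous position to integer grid offsets from the target.

    Integer tuples are hashable, so they can index a Q-table directly and
    avoid string-formatting quirks such as '-0.' versus '0.' in array keys.
    """
    return tuple(
        int(round((p - t) / cell_size)) for p, t in zip(position, target)
    )


# Start position relative to the target at (10, 0, -3), in 1 m cells
print(grid_state((3.0, 0.0, -3.0), (10, 0, -3), cell_size=1.0))  # (-7, 0, 0)
```

A cell size of 1 m matches the 1 m action step and gives a far smaller table than 0.1 m cells, at the cost of a coarser notion of "arrived".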

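2.2 Reward function design

The reward function is what guides the drone's learning. We use an incremental (shaped) reward: progress toward the target is rewarded at every step, with a bonus on arrival and a penalty on collision: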

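def calculate_reward(self, state, new_state):
    # States are normalized offsets from the target, so each norm is distance / 10 m
    old_dist = np.linalg.norm(state[:2])  # ignore the height dimension
    new_dist = np.linalg.norm(new_state[:2])

    # Base reward: progress toward the target (a 1 m step straight toward it is worth +1)
    reward = (old_dist - new_dist) * 10

    # Arrival bonus; 0.5 is in normalized units, i.e. within 5 m horizontally.
    # Lower the threshold if the drone should get closer before the episode ends.
    if new_dist < 0.5:
        reward += 100
        self.done = True

    # Collision penalty
    if self.check_collision():
        reward -= 50
        self.done = True

    return reward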
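The reward function calls self.check_collision(), and the training loop in Section 4 calls env.reset() and env.step(), but none of these are shown in the article. The sketch below illustrates how their AirSim side might look. It is not the author's implementation: `ACTION_DELTAS`, `reset_drone`, and `apply_action` are hypothetical names, the speed and altitude values are assumptions, and the client is passed in as a parameter so the helpers can be exercised with a stub. The client methods themselves (reset, enableApiControl, armDisarm, takeoffAsync, moveToZAsync, moveToPositionAsync, simGetCollisionInfo) belong to the AirSim Python API.

```python
# Displacement in metres (dx, dy) for each action id; mirrors the action table
ACTION_DELTAS = {0: (1.0, 0.0), 1: (-1.0, 0.0), 2: (0.0, 1.0), 3: (0.0, -1.0)}


def reset_drone(client, altitude=-3.0, speed=1.0):
    """Reset the simulation, take off, and move to the working altitude (NED z)."""
    client.reset()
    client.enableApiControl(True)
    client.armDisarm(True)
    client.takeoffAsync().join()
    client.moveToZAsync(altitude, speed).join()


def apply_action(client, action, speed=1.0):
    """Fly a 1 m step for the given action id; return True if a collision was reported."""
    pos = client.simGetGroundTruthKinematics().position
    dx, dy = ACTION_DELTAS[action]
    # Hold the current altitude and block until the move finishes
    client.moveToPositionAsync(
        pos.x_val + dx, pos.y_val + dy, pos.z_val, speed
    ).join()
    return client.simGetCollisionInfo().has_collided
```

Under these assumptions, env.reset() would call reset_drone and return get_state(), while env.step(action) would call apply_action, read the new state, and pass both states to calculate_reward.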

3. Implementing the reinforcement learning algorithms

3.1 Core Q-learning logic

q_learning.py implements the classic tabular Q-learning algorithm:

import random
from collections import defaultdict

import numpy as np


class QLearningAgent:
    def __init__(self, action_space, learning_rate=0.1, discount=0.95, epsilon=0.1):
        # Unseen states start with a zero Q-value for every action
        self.q_table = defaultdict(lambda: np.zeros(len(action_space)))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon
        self.action_space = action_space

    def choose_action(self, state):
        # Epsilon-greedy policy
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(self.action_space)  # explore
        return int(np.argmax(self.q_table[str(state)]))  # exploit

    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[str(state)][action]
        # Off-policy target: bootstrap from the best action in the next state
        max_next_q = np.max(self.q_table[str(next_state)])
        new_q = current_q + self.lr * (reward + self.gamma * max_next_q - current_q)
        self.q_table[str(state)][action] = new_q
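To make the update rule concrete, here is a single update worked through with plain Python lists. This is a minimal standalone sketch, not a replacement for the agent: the function `q_learning_update`, the state labels "s0" and "s1", and the pre-set Q-value are made up for illustration, while the learning rate and discount match the defaults used in the class.

```python
from collections import defaultdict

N_ACTIONS = 4


def q_learning_update(q_table, state, action, reward, next_state,
                      lr=0.1, gamma=0.95):
    """Apply one Q-learning update in place and return the new Q-value."""
    current_q = q_table[state][action]
    # Off-policy target: reward plus the discounted best Q-value of the next state
    td_target = reward + gamma * max(q_table[next_state])
    q_table[state][action] = current_q + lr * (td_target - current_q)
    return q_table[state][action]


q_table = defaultdict(lambda: [0.0] * N_ACTIONS)
q_table["s1"][2] = 2.0  # pretend action 2 already looks good in state s1
new_q = q_learning_update(q_table, "s0", action=0, reward=1.0, next_state="s1")
print(round(new_q, 4))  # 0.29
```

The target is 1.0 + 0.95 × 2.0 = 2.9, and the Q-value moves 10% of the way from 0 toward it, giving 0.29.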

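3.2 Implementing Sarsa

Unlike Q-learning, Sarsa is an on-policy method: its update uses the Q-value of the action actually chosen in the next state rather than the maximum over all actions: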

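import random
from collections import defaultdict

import numpy as np


class SarsaAgent:
    def __init__(self, action_space, learning_rate=0.1, discount=0.95, epsilon=0.1):
        self.q_table = defaultdict(lambda: np.zeros(len(action_space)))
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon
        self.action_space = action_space

    def choose_action(self, state):
        # Same epsilon-greedy policy as the Q-learning agent; the training loop calls it
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(self.action_space)  # explore
        return int(np.argmax(self.q_table[str(state)]))  # exploit

    def learn(self, state, action, reward, next_state, next_action):
        current_q = self.q_table[str(state)][action]
        # On-policy target: bootstrap from the action that will actually be taken
        next_q = self.q_table[str(next_state)][next_action]
        new_q = current_q + self.lr * (reward + self.gamma * next_q - current_q)
        self.q_table[str(state)][action] = new_q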
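The only difference between the two agents is the bootstrap term of the target. The sketch below computes both targets for the same transition so the gap is visible; the function names and the Q-values are invented for illustration.

```python
def q_learning_target(reward, next_q_values, gamma=0.95):
    """Off-policy target: assumes the greedy action is taken next."""
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma=0.95):
    """On-policy target: uses the action the agent will actually take next."""
    return reward + gamma * next_q_values[next_action]


next_q = [0.0, 2.0, -1.0, 0.5]  # Q-values of the next state
explored_action = 2             # an exploratory (non-greedy) choice

print(round(q_learning_target(1.0, next_q), 4))              # 2.9
print(round(sarsa_target(1.0, next_q, explored_action), 4))  # 0.05
```

When the next action is exploratory and poor, Sarsa's target drops accordingly, so its value estimates reflect the cost of exploring; Q-learning ignores it and evaluates the greedy policy.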

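4. Training loop and parameter tuning

4.1 Implementing the main training loop

main.py ties the environment and the algorithms together. The loop relies on env.reset(), which returns the initial state, and env.step(action), which returns a (next_state, reward, done) tuple: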

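# Imports assume main.py is run from the src/ directory of the layout above
from envs.drone_env import DroneNavigationEnv
from algorithms.q_learning import QLearningAgent
from algorithms.sarsa import SarsaAgent


def train_agent(agent_type='qlearning', episodes=500):
    env = DroneNavigationEnv('configs/drone_config.yaml')
    action_space = list(range(len(env.actions)))

    if agent_type == 'qlearning':
        agent = QLearningAgent(action_space)
    else:
        agent = SarsaAgent(action_space)

    for ep in range(episodes):
        state = env.reset()
        action = agent.choose_action(state)
        total_reward = 0
        done = False

        while not done:
            next_state, reward, done = env.step(action)

            if agent_type == 'qlearning':
                agent.learn(state, action, reward, next_state)
                next_action = agent.choose_action(next_state)
            else:
                # Sarsa must learn from the action that is actually executed next,
                # so the same next_action is carried into the following step
                next_action = agent.choose_action(next_state)
                agent.learn(state, action, reward, next_state, next_action)

            state, action = next_state, next_action
            total_reward += reward

        print(f"Episode {ep}: Total Reward = {total_reward}")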
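The loop above only prints each episode's return, which makes convergence hard to judge because individual episodes are noisy. One option is to collect the returns in a list and store them under data/ together with a moving average. The sketch below assumes such a list exists; `moving_average`, `save_rewards`, the window size, and the CSV column names are hypothetical choices rather than part of the original project.

```python
import csv
from pathlib import Path


def moving_average(values, window=20):
    """Trailing mean over up to `window` most recent values, one entry per episode."""
    averages = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        averages.append(running / min(i + 1, window))
    return averages


def save_rewards(path, rewards, window=20):
    """Write episode index, raw return, and smoothed return to a CSV file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    smoothed = moving_average(rewards, window)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["episode", "total_reward", "moving_avg"])
        for ep, (r, avg) in enumerate(zip(rewards, smoothed)):
            writer.writerow([ep, r, avg])
    return path
```

A moving average that flattens out suggests the policy has stopped improving; the resulting CSV can be loaded with pandas for plotting.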

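4.2 Guide to tuning the key parameters

Parameter	Recommended range	Effect	Tuning advice
Learning rate (lr)	0.01-0.2	Controls how far each update moves a Q-value	Start at 0.1 and watch convergence
Discount factor (gamma)	0.9-0.99	Weight given to future rewards	Use higher values for long-horizon tasks
Exploration rate (epsilon)	0.05-0.3	Balance between exploration and exploitation	Start at 0.3 early in training, then decay gradually
Reward scaling	10-100×	Sets the magnitude of the TD updates	Keep the reward scale consistent with lr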
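The table suggests starting epsilon at 0.3 and decaying it gradually. A linear schedule is one simple way to do that; the sketch below is an assumption about how it could be written, with `linear_epsilon` as a hypothetical helper and the start and end values taken from the recommended range in the table.

```python
def linear_epsilon(episode, total_episodes, start=0.3, end=0.05):
    """Linearly anneal epsilon from `start` to `end` over the training run."""
    if total_episodes <= 1:
        return end
    # Fraction of training completed, clamped to [0, 1]
    fraction = min(max(episode / (total_episodes - 1), 0.0), 1.0)
    return start + fraction * (end - start)
```

In the training loop this could be applied at the start of each episode, for example `agent.epsilon = linear_epsilon(ep, episodes)`, since both agents read their exploration rate from the `epsilon` attribute.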

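5. Further optimization and troubleshooting

5.1 Extending the state space

The basic version uses only position information. The state can be extended with more features:

- Velocity vector
- Angle relative to the target
- Battery level (can be simulated)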

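def get_enhanced_state(self):
    kinematics = self.client.simGetGroundTruthKinematics()
    pos = np.array([kinematics.position.x_val, kinematics.position.y_val])
    vel = np.array([kinematics.linear_velocity.x_val, kinematics.linear_velocity.y_val])

    rel_pos = pos - self.target_pos[:2]  # horizontal offset from the target
    rel_angle = np.arctan2(rel_pos[1], rel_pos[0])  # direction of that offset, in radians

    # Scale features to roughly [-1, 1]: position by 10 m, velocity by 5 m/s, angle by pi.
    # The result is continuous and must be discretized before it can index a Q-table.
    return np.concatenate([rel_pos / 10.0, vel / 5.0, [rel_angle / np.pi]])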
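The enhanced state is a vector of continuous values, so a tabular agent needs it binned before it can serve as a Q-table key. The sketch below shows one possible binning scheme using only the standard library. The helper names `make_bins` and `discretize` and the choice of four bins per feature are assumptions for illustration.

```python
from bisect import bisect_right


def make_bins(low, high, n_bins):
    """Return the n_bins - 1 interior edges that split [low, high] evenly."""
    width = (high - low) / n_bins
    return [low + width * i for i in range(1, n_bins)]


def discretize(features, bins_per_feature):
    """Convert continuous features into a hashable tuple of bin indices.

    Values outside the covered range fall into the first or last bin.
    """
    return tuple(
        bisect_right(edges, value)
        for value, edges in zip(features, bins_per_feature)
    )


edges = make_bins(-1.0, 1.0, 4)  # interior edges: [-0.5, 0.0, 0.5]
state_key = discretize([-0.9, 0.1, 0.7, 2.0, -3.0], [edges] * 5)
print(state_key)  # (0, 2, 3, 3, 0)
```

With five features and four bins each the table has at most 4^5 = 1024 states; every added feature or finer bin multiplies that number, which is the main limit on how far a Q-table can be extended.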

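5.2 Solutions to common problems

Training does not converge:
- Check whether the reward function is designed sensibly
- Try a smaller learning rate
- Increase the exploration rate
Training is unstable:
- Implement experience replay
- Add Q-value clipping
- Use a target network (applies to DQN)
AirSim connection problems:
- Make sure the simulation environment is started before the script
- Check the firewall settings
- Verify the IP address configuration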

# Robust connection initialization
import time

import airsim


def connect_airsim(max_retries=5):
    for i in range(max_retries):
        try:
            client = airsim.MultirotorClient()
            client.confirmConnection()
            return client
        except Exception as e:
            print(f"Connection attempt {i + 1} failed: {e}")
            time.sleep(2)  # give the simulator time to finish starting
    raise ConnectionError("Failed to connect to AirSim after multiple attempts")
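Of the stability suggestions listed above, Q-value clipping is the easiest to try with a tabular agent. The sketch below clamps the result of a single TD update. `clipped_q_update` is a hypothetical helper, and the bounds are arbitrary assumptions chosen loosely around the reward range used earlier (+100 arrival bonus, -50 collision penalty); they would need tuning for a real run.

```python
def clipped_q_update(current_q, td_target, lr=0.1, q_min=-100.0, q_max=200.0):
    """One tabular TD update whose result is clamped to [q_min, q_max]."""
    new_q = current_q + lr * (td_target - current_q)
    return max(q_min, min(q_max, new_q))


print(clipped_q_update(0.0, 1.0))      # 0.1
print(clipped_q_update(0.0, 5000.0))   # 200.0 (clamped from 500.0)
```

Clipping only masks runaway values. If it triggers often, the reward scale or learning rate is probably the underlying problem.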

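In practice, I have found that an initial exploration rate of 0.3 with linear decay, combined with a discount factor of 0.95, works well on most simple navigation tasks. For more complex environments, consider replacing the Q-table with a neural-network function approximator, although this makes training significantly more complex.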
