Building a System-Level AI Agent from Scratch: Setting Up the Complete Rust Toolchain

news 2026/6/29 0:25:01

1. Engineering Challenges of a System-Level Agent: Reliability, Performance, and Observability

An AI agent is more than "call the LLM API and parse the response". A system-level agent that is actually usable has to solve three core engineering problems.

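First, reliability. LLM output is nondeterministic: the same input can produce different results, including malformed output. The agent needs retry, fallback, and self-repair mechanisms so that a single failed LLM call does not bring it down.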

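Second, performance. An agent usually performs multi-step reasoning and tool calls, and each step may involve a network request or computation. Executed serially, the step latencies add up linearly, so the agent must support parallel tool calls and streaming output.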

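Third, observability. The agent's decision process is a black box: why did the LLM choose this tool, and why did it generate these arguments? Without detailed execution logs, debugging agent behavior is nearly impossible.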

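Rust's advantages in this setting: a strong type system that checks tool interfaces at compile time, async runtimes such as Tokio that support highly concurrent tool calls, and zero-cost abstractions that keep the agent framework itself from becoming the performance bottleneck.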

2. Agent Architecture: The Three-Stage Plan-Execute-Reflect Model

The core architecture of a system-level agent has three stages: Plan, Execute, and Reflect.

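    flowchart TD
        A[User goal input] --> B[Planning stage\nLLM decomposes the task]
        B --> C[Generate execution plan\nstep list + dependencies]
        C --> D[Execution stage]
        D --> E{Step type}
        E -->|Tool call| F[Call external tool\nAPI / CLI / file operations]
        E -->|LLM reasoning| G[Call LLM\nproduce intermediate result]
        F --> H[Collect execution results]
        G --> H
        H --> I{All steps complete?}
        I -->|No| D
        I -->|Yes| J[Reflection stage\nLLM evaluates the results]
        J --> K{Goal met?}
        K -->|Yes| L[Output final result]
        K -->|No| M[Revise execution plan]
        M --> D
        subgraph sandbox [Security sandbox]
            F
            G
            H
        end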

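The planning stage decomposes the user's goal into a sequence of executable steps. Each step names the tool to call and the expected input. Steps can depend on each other: if step B depends on the output of step A, B must run after A has completed.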

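The execution stage schedules steps according to their dependencies. Steps with no unmet dependencies can run in parallel; steps with dependencies wait for their prerequisites. The result of each step is recorded in the context for later steps and for the reflection stage.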
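The scheduling rule can be illustrated without any async machinery. The following is a minimal std-only sketch, not the framework's scheduler: the function name `execution_waves` and the `HashMap<String, Vec<String>>` input shape are assumptions made for this example. It groups step IDs into "waves" so that every step in a wave depends only on steps from earlier waves.

```rust
use std::collections::{HashMap, HashSet};

/// Groups step IDs into waves. All steps in one wave have their dependencies
/// satisfied by earlier waves, so they could run in parallel.
/// `deps` maps a step ID to the IDs it depends on (hypothetical input shape).
/// Returns None when some step can never become ready (cycle or unknown ID).
pub fn execution_waves(deps: &HashMap<String, Vec<String>>) -> Option<Vec<Vec<String>>> {
    let mut done: HashSet<String> = HashSet::new();
    let mut waves: Vec<Vec<String>> = Vec::new();
    while done.len() < deps.len() {
        // A step is ready when it has not run yet and all its dependencies have.
        let mut ready: Vec<String> = deps
            .iter()
            .filter(|(id, d)| !done.contains(*id) && d.iter().all(|x| done.contains(x)))
            .map(|(id, _)| id.clone())
            .collect();
        if ready.is_empty() {
            return None; // no step can make progress
        }
        ready.sort(); // deterministic order inside a wave
        done.extend(ready.iter().cloned());
        waves.push(ready);
    }
    Some(waves)
}
```

With steps `a` and `b` independent and `c` depending on both, the first wave holds `a` and `b` and the second holds `c`. A plan whose steps depend on each other in a circle yields `None`.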

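The reflection stage evaluates whether the execution results satisfy the user's goal. If they do not, the agent revises the execution plan and runs it again. This loop is capped at a maximum number of rounds to avoid looping forever.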
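A bounded reflect loop might look like the sketch below. It is std-only and synchronous for brevity. The `Verdict` type, the function name, and the two closures that stand in for the execution stage and the LLM evaluation are all hypothetical.

```rust
/// Outcome of one reflection pass (hypothetical type for this sketch).
pub enum Verdict {
    GoalMet,
    /// Carries the revised plan to try in the next round.
    Revise(Vec<String>),
}

/// Runs execute -> reflect until the goal is met or `max_rounds` is used up.
/// `execute` and `reflect` are placeholders for the real execution stage and
/// the LLM-based evaluation. On success, returns the number of rounds used.
pub fn run_with_reflection<E, R>(
    mut plan: Vec<String>,
    max_rounds: u32,
    mut execute: E,
    mut reflect: R,
) -> Result<u32, String>
where
    E: FnMut(&[String]) -> Vec<String>,
    R: FnMut(&[String]) -> Verdict,
{
    for round in 1..=max_rounds {
        let results = execute(plan.as_slice());
        match reflect(results.as_slice()) {
            Verdict::GoalMet => return Ok(round),
            Verdict::Revise(new_plan) => plan = new_plan,
        }
    }
    Err(format!("goal not met after {} reflection rounds", max_rounds))
}
```

Returning an error when the budget runs out, rather than the last partial result, forces the caller to decide explicitly what to do with a plan that did not converge.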

3. Production-Grade Implementation: Core Code of the Rust Agent Framework

    // Dependencies: tokio, serde (with derive), serde_json, async-trait.
    use std::collections::{HashMap, HashSet};
    use std::sync::Arc;
    use std::time::Duration;

    use serde::{Deserialize, Serialize};
    use tokio::sync::RwLock;

    /// Tool definition: an external capability the agent can call.
    #[derive(Debug, Clone, Serialize, Deserialize)]
    pub struct ToolDef {
        pub name: String,
        pub description: String,
        pub parameters: serde_json::Value, // JSON Schema
    }

    /// One step of an execution plan.
    #[derive(Debug, Clone, Serialize, Deserialize)]
    pub struct Step {
        pub id: String,
        pub tool: String,
        pub input: serde_json::Value,
        /// IDs of the steps that must complete before this one.
        pub depends_on: Vec<String>,
        /// Execution status.
        pub status: StepStatus,
        /// Execution result.
        pub result: Option<serde_json::Value>,
    }

    #[derive(Debug, Clone, Serialize, Deserialize)]
    pub enum StepStatus {
        Pending,
        Running,
        Completed,
        Failed(String),
    }

    /// Execution context: stores intermediate results passed between steps.
    /// Arc<RwLock<..>> allows concurrent async reads and writes. Cloning the
    /// context is cheap, and every clone shares the same map.
    #[derive(Clone)]
    pub struct ExecutionContext {
        results: Arc<RwLock<HashMap<String, serde_json::Value>>>,
        /// Maximum number of attempts per step, including the first one.
        max_retries: u32,
    }

    impl ExecutionContext {
        pub fn new(max_retries: u32) -> Self {
            ExecutionContext {
                results: Arc::new(RwLock::new(HashMap::new())),
                max_retries,
            }
        }

        /// Stores the result of a step.
        pub async fn set_result(&self, step_id: &str, result: serde_json::Value) {
            self.results.write().await.insert(step_id.to_string(), result);
        }

        /// Fetches the result of a step.
        pub async fn get_result(&self, step_id: &str) -> Option<serde_json::Value> {
            self.results.read().await.get(step_id).cloned()
        }
    }

    /// Tool executor trait: the uniform interface every tool implements.
    /// The agent framework does not care about the concrete implementation.
    #[async_trait::async_trait]
    pub trait ToolExecutor: Send + Sync {
        async fn execute(&self, input: &serde_json::Value) -> Result<serde_json::Value, String>;
        fn definition(&self) -> ToolDef;
    }

    /// Agent core: the plan-execute-reflect loop.
    pub struct Agent {
        tools: HashMap<String, Arc<dyn ToolExecutor>>,
        context: ExecutionContext,
        /// Upper bound for the reflect loop (not used by the code shown here).
        max_reflect_rounds: u32,
    }

    impl Agent {
        pub fn new(max_retries: u32, max_reflect_rounds: u32) -> Self {
            Agent {
                tools: HashMap::new(),
                context: ExecutionContext::new(max_retries),
                max_reflect_rounds,
            }
        }

        /// Registers a tool under the name given in its definition.
        pub fn register_tool(&mut self, executor: Arc<dyn ToolExecutor>) {
            let def = executor.definition();
            self.tools.insert(def.name, executor);
        }

        /// Runs one tool call with retries. It takes owned arguments so that
        /// the returned future can be moved into a spawned task.
        async fn execute_step(
            executor: Arc<dyn ToolExecutor>,
            ctx: ExecutionContext,
            step_id: String,
            input: serde_json::Value,
        ) -> Result<serde_json::Value, String> {
            let mut attempts = 0u32;
            loop {
                match executor.execute(&input).await {
                    Ok(result) => {
                        ctx.set_result(&step_id, result.clone()).await;
                        return Ok(result);
                    }
                    Err(e) => {
                        attempts += 1;
                        if attempts >= ctx.max_retries {
                            return Err(format!(
                                "step {} failed after {} attempt(s): {}",
                                step_id, attempts, e
                            ));
                        }
                        // Exponential backoff: 500 ms, 1 s, 2 s, ...
                        tokio::time::sleep(Duration::from_millis(500 * 2u64.pow(attempts - 1)))
                            .await;
                    }
                }
            }
        }

        /// Executes a plan, scheduling steps by their dependencies.
        /// Ready steps run in parallel; dependent steps wait for prerequisites.
        pub async fn execute_plan(&self, steps: &mut [Step]) -> Result<(), String> {
            let mut completed: HashSet<String> = HashSet::new();
            loop {
                // Find every pending step whose dependencies are all satisfied.
                let ready_steps: Vec<usize> = steps
                    .iter()
                    .enumerate()
                    .filter(|(_, s)| {
                        matches!(s.status, StepStatus::Pending)
                            && s.depends_on.iter().all(|dep| completed.contains(dep))
                    })
                    .map(|(i, _)| i)
                    .collect();

                if ready_steps.is_empty() {
                    // Either everything has finished, or the plan is stuck.
                    let all_done = steps.iter().all(|s| {
                        matches!(s.status, StepStatus::Completed | StepStatus::Failed(_))
                    });
                    if all_done {
                        break;
                    }
                    return Err("the plan has a dependency cycle or cannot make progress".to_string());
                }

                // Spawn one task per ready step so they run in parallel.
                let mut handles = Vec::new();
                for idx in ready_steps {
                    let executor = self
                        .tools
                        .get(&steps[idx].tool)
                        .ok_or_else(|| format!("unregistered tool: {}", steps[idx].tool))?
                        .clone();
                    let step = &mut steps[idx];
                    step.status = StepStatus::Running;
                    let task = Self::execute_step(
                        executor,
                        self.context.clone(),
                        step.id.clone(),
                        step.input.clone(),
                    );
                    handles.push((idx, tokio::spawn(task)));
                }

                // Wait for every task, so that no step is left marked Running,
                // then report the first failure if there was one.
                let mut first_error: Option<String> = None;
                for (idx, handle) in handles {
                    let outcome = match handle.await {
                        Ok(inner) => inner,
                        Err(e) => Err(format!("step {} task aborted: {}", steps[idx].id, e)),
                    };
                    match outcome {
                        Ok(result) => {
                            steps[idx].status = StepStatus::Completed;
                            steps[idx].result = Some(result);
                            completed.insert(steps[idx].id.clone());
                        }
                        Err(e) => {
                            steps[idx].status = StepStatus::Failed(e.clone());
                            if first_error.is_none() {
                                first_error = Some(e);
                            }
                        }
                    }
                }
                if let Some(e) = first_error {
                    return Err(e);
                }
            }
            Ok(())
        }
    }

Design notes:

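Tool decoupling: the ToolExecutor trait separates tool implementations from the agent framework; adding a tool only requires implementing the trait.
Dependency scheduling: the depends_on field defines dependencies between steps; steps without unmet dependencies run in parallel automatically.
Retry: each step has its own retry logic, with exponential backoff to avoid a retry storm.
Shared context: Arc<RwLock<HashMap>> lets steps exchange data across async tasks.
Isolation: each tool call runs in its own task, so a panic in one tool surfaces as an error instead of taking down the whole agent.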
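The retry loop in the listing computes its delay as 500 ms times 2^(attempts - 1), which grows without bound. A capped variant is sketched below under stated assumptions: `backoff_delay` and its `max` parameter are hypothetical and not part of the framework code.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1),
/// never longer than `max`. Checked arithmetic keeps large attempt counts
/// from overflowing.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32
        .checked_pow(attempt.saturating_sub(1))
        .unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a 500 ms base and a 10 s cap, the delays are 500 ms, 1 s, 2 s, 4 s, 8 s, and 10 s from the sixth attempt on. Adding random jitter is a common further refinement when many agents retry against the same service.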

4. Engineering Trade-offs of the Agent Framework: Flexibility, Cost, and Controllability

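Unreliable LLM planning. Having the LLM generate the execution plan is the agent's core capability, but the LLM may produce unreasonable steps, such as calling a tool that does not exist or creating circular dependencies. The remedy is to constrain the LLM's output format with JSON Schema and to validate the plan before execution: check that every tool exists and that the dependencies do not form a cycle.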
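A pre-execution validation pass could be sketched as follows. `PlanStep` is a hypothetical, trimmed-down view that mirrors the `id`, `tool`, and `depends_on` fields of `Step`. `validate_plan` and `visit` are illustrative names, and the cycle check uses a plain depth-first search.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal view of a plan step for validation (hypothetical type).
pub struct PlanStep {
    pub id: String,
    pub tool: String,
    pub depends_on: Vec<String>,
}

/// Checks that every tool is registered, every dependency names a step in
/// the plan, and the dependency graph contains no cycle.
pub fn validate_plan(steps: &[PlanStep], tools: &HashSet<String>) -> Result<(), String> {
    let by_id: HashMap<&str, &PlanStep> = steps.iter().map(|s| (s.id.as_str(), s)).collect();
    for s in steps {
        if !tools.contains(&s.tool) {
            return Err(format!("step {}: unknown tool {}", s.id, s.tool));
        }
        for dep in &s.depends_on {
            if !by_id.contains_key(dep.as_str()) {
                return Err(format!("step {}: unknown dependency {}", s.id, dep));
            }
        }
    }
    // false = visit in progress, true = fully explored
    let mut state: HashMap<&str, bool> = HashMap::new();
    for s in steps {
        visit(s.id.as_str(), &by_id, &mut state)?;
    }
    Ok(())
}

/// Depth-first search; reaching a step that is still in progress closes a cycle.
fn visit<'a>(
    id: &'a str,
    by_id: &HashMap<&'a str, &'a PlanStep>,
    state: &mut HashMap<&'a str, bool>,
) -> Result<(), String> {
    match state.get(id).copied() {
        Some(true) => return Ok(()),
        Some(false) => return Err(format!("dependency cycle through step {}", id)),
        None => {}
    }
    state.insert(id, false);
    let step: &'a PlanStep = by_id[id];
    for dep in &step.depends_on {
        visit(dep.as_str(), by_id, state)?;
    }
    state.insert(id, true);
    Ok(())
}
```

Rejecting the plan before any tool runs is cheaper than discovering the problem mid-execution, and the error message can be fed back to the LLM as a hint for replanning. The sketch does not check for duplicate step IDs.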

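Token cost control. Every round of planning, execution, and reflection calls the LLM, so token consumption accumulates quickly. An agent running a 5-step task may consume 5,000 to 10,000 tokens. Optimizations include compressing context (sending only the history that is needed), using a smaller model for planning (reserving the large model for key decisions), and caching repeated LLM calls.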
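Of the three optimizations, caching is the easiest to show in isolation. The sketch below memoizes responses by exact prompt text. `LlmCache` is a hypothetical type, and the `call_llm` closure stands in for a real client. Exact-match caching only helps when identical prompts recur and the responses are meant to be reproducible.

```rust
use std::collections::HashMap;

/// Memoizes LLM responses by exact prompt text (hypothetical helper).
pub struct LlmCache {
    entries: HashMap<String, String>,
    pub hits: u64,
    pub misses: u64,
}

impl LlmCache {
    pub fn new() -> Self {
        LlmCache { entries: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Returns the cached response for `prompt`; `call_llm` runs only on a miss.
    pub fn get_or_call<F>(&mut self, prompt: &str, call_llm: F) -> String
    where
        F: FnOnce(&str) -> String,
    {
        if let Some(cached) = self.entries.get(prompt) {
            self.hits += 1;
            return cached.clone();
        }
        self.misses += 1;
        let response = call_llm(prompt);
        self.entries.insert(prompt.to_string(), response.clone());
        response
    }
}
```

The `hits` and `misses` counters make it easy to check whether the cache is paying for itself on a given workload.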

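Execution timeouts and resource limits. An agent can get stuck in an endless loop when the reflection stage keeps revising the plan without converging. Set a maximum number of reflection rounds and a cap on total execution time. Each tool call also needs its own timeout so that one tool cannot block the whole plan.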
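A per-tool timeout can be sketched with the standard library alone, using a worker thread and a channel. `run_with_timeout` is a hypothetical helper. In the Tokio-based framework, the async counterpart would be to wrap the tool future in `tokio::time::timeout`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `tool` on a worker thread and waits at most `limit` for its result.
/// The worker is not cancelled on timeout; it keeps running in the
/// background and its result is discarded.
pub fn run_with_timeout<T, F>(tool: F, limit: Duration) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(tool());
    });
    rx.recv_timeout(limit).map_err(|e| match e {
        mpsc::RecvTimeoutError::Timeout => format!("tool timed out after {:?}", limit),
        // The sender was dropped without sending, which means the tool panicked.
        mpsc::RecvTimeoutError::Disconnected => "tool panicked before returning".to_string(),
    })
}
```

Because a timed-out thread cannot be forcibly stopped, this approach bounds how long the plan waits but not how much work the tool does. Tools that can run long should also check for cancellation themselves.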

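Limited observability. The agent's decisions depend on the LLM's internal reasoning and cannot be fully explained. Logs can capture inputs and outputs, but not why a particular tool was chosen. One improvement is to have the LLM emit its reasoning (chain of thought) during planning and to include that reasoning in the logs.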
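One way to carry that reasoning into the logs is to store it next to each planned step. The record type and log format below are assumptions for illustration only; a real deployment would more likely emit structured events through a logging or tracing library.

```rust
/// One planning decision as it might be written to the execution log
/// (hypothetical type and format).
pub struct DecisionRecord {
    pub step_id: String,
    pub tool: String,
    /// Reasoning text the LLM was asked to emit alongside the plan.
    pub reasoning: String,
}

impl DecisionRecord {
    /// Renders a single-line log entry. Newlines in the reasoning are
    /// replaced with spaces so that one decision stays on one line.
    /// Quotes inside the reasoning are not escaped in this sketch.
    pub fn to_log_line(&self) -> String {
        let reasoning = self.reasoning.replace('\n', " ");
        format!(
            "step={} tool={} reasoning=\"{}\"",
            self.step_id, self.tool, reasoning
        )
    }
}
```

Keeping the reasoning on the same line as the step ID and tool name lets a plain text search answer "why was this tool called" for any step.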

Where the framework applies:

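Scenario	Is the agent framework suitable?
Automated operations tasks	Yes; multi-step tool calling is the core use case
Code generation and review	Partly; key steps need human confirmation
Data analysis pipelines	Yes; the tool-call chain is clear
Real-time interactive chat	No; latency is too high
Security-sensitive operations	Use with caution; a human confirmation step is mandatory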