AI-Assisted Automatic Runbook Generation for Operations: From Experience Documents to Executable Scripts

news 2026/6/12 4:32:21

1. The Runbook "Knowledge Gap": Experience Lives in People's Heads, Not in Documents

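A Runbook, the operating manual for troubleshooting and recovering from failures, is an operations team's most important knowledge asset. In practice it is often the least reliable document: it is incomplete when first written (only known scenarios are recorded), it is not updated promptly (new failure types are never added), and its format is inconsistent (some Runbooks live in a Wiki, some in Markdown, and some only in someone's head). When a failure occurs, the Runbook the on-call engineer finds may be two years old, with steps that no longer apply to the current architecture.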

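AI-assisted Runbook generation can extract operational knowledge from alert history, operation logs, and code repositories, generate structured Runbooks automatically, and keep them updated as the architecture changes.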

2. Architecture of Automatic Runbook Generation

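AI Runbook generation has three layers: the knowledge extraction layer pulls operational knowledge from multiple data sources, the structuring layer organizes that knowledge into a standard Runbook format, and the validation layer checks the Runbook's effectiveness by replaying historical incidents.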

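flowchart TD
    A[Alert history] --> D[Knowledge extraction]
    B[Operation logs] --> D
    C[Code repositories] --> D
    D --> E[Failure pattern recognition]
    D --> F[Operation sequence extraction]
    D --> G[Dependency inference]
    E --> H[Runbook structuring]
    F --> H
    G --> H
    H --> I[Failure description]
    H --> J[Diagnosis steps]
    H --> K[Fix actions]
    H --> L[Verification method]
    I --> M[Historical replay validation]
    J --> M
    K --> M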

3. Engineering Implementation

3.1 Extracting Operational Knowledge

# runbook_generator.py
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunbookStep:
    order: int
    action: str
    command: Optional[str]
    expected_result: str
    rollback_command: Optional[str]


@dataclass
class Runbook:
    alert_name: str
    severity: str
    description: str
    symptoms: list[str]
    diagnosis_steps: list[RunbookStep]
    fix_steps: list[RunbookStep]
    verification_steps: list[RunbookStep]
    estimated_time_minutes: int


class RunbookGenerator:
    def generate_from_alert_history(
        self, alert_name: str, history: list[dict]
    ) -> Runbook:
        # Extract common operation patterns from the alert history
        operations = self._extract_operations(history)
        # Extract command sequences from the recorded operations
        commands = self._extract_commands(operations)
        # Build the structured Runbook
        return self._build_runbook(alert_name, commands)

    def _extract_operations(self, history: list[dict]) -> list[str]:
        # Keep only records that carry resolution notes
        ops = []
        for record in history:
            if record.get('resolution_notes'):
                ops.append(record['resolution_notes'])
        return ops

    def _extract_commands(self, operations: list[str]) -> list[dict]:
        # Extract the shell commands contained in each operation note
        commands = []
        for op in operations:
            # Match ops commands such as kubectl, docker, systemctl
            # (each match runs to the end of its line)
            cmd_patterns = re.findall(
                r'(kubectl\s+\S+.*|docker\s+\S+.*|systemctl\s+\S+.*)', op
            )
            for cmd in cmd_patterns:
                commands.append({
                    'command': cmd.strip(),
                    'context': op,
                })
        return commands

    def _build_runbook(
        self, alert_name: str, commands: list[dict]
    ) -> Runbook:
        # Turn the extracted commands into Runbook steps
        diagnosis = []
        fix = []
        verification = []
        for i, cmd in enumerate(commands):
            step = RunbookStep(
                order=i + 1,
                action=self._infer_action(cmd['command']),
                command=cmd['command'],
                expected_result=self._infer_expected(cmd['command']),
                rollback_command=self._infer_rollback(cmd['command']),
            )
            # Simple classification: get/describe/logs/top -> diagnosis,
            # apply/delete/rollout/scale -> fix, anything else -> verification.
            # Whole tokens are compared so that 'top' does not match 'stop'.
            tokens = cmd['command'].split()
            if any(kw in tokens for kw in ['get', 'describe', 'logs', 'top']):
                diagnosis.append(step)
            elif any(kw in tokens for kw in ['apply', 'delete', 'rollout', 'scale']):
                fix.append(step)
            else:
                verification.append(step)
        return Runbook(
            alert_name=alert_name,
            severity='high',
            description=f'Auto-generated Runbook for alert {alert_name}',
            symptoms=['TODO: extract from the alert rule'],
            diagnosis_steps=diagnosis,
            fix_steps=fix,
            verification_steps=verification,
            # Rough estimate: 2 minutes per diagnosis step, 3 per fix step
            estimated_time_minutes=len(diagnosis) * 2 + len(fix) * 3,
        )

    def _infer_action(self, command: str) -> str:
        if 'kubectl get' in command:
            return 'Check resource status'
        if 'kubectl describe' in command:
            return 'Inspect resource details'
        if 'kubectl logs' in command:
            return 'View container logs'
        if 'kubectl apply' in command:
            return 'Apply configuration change'
        if 'kubectl scale' in command:
            return 'Adjust replica count'
        return 'Run operations command'

    def _infer_expected(self, command: str) -> str:
        if 'kubectl get pods' in command:
            return 'All Pods are in Running state'
        if 'kubectl logs' in command:
            return 'No ERROR-level log entries'
        return 'Command succeeds without errors'

    def _infer_rollback(self, command: str) -> Optional[str]:
        if 'kubectl scale' in command:
            return 'Restore the original replica count'
        if 'kubectl apply' in command:
            return 'kubectl rollout undo'
        return None
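To see what the extraction regex actually captures, here is a small standalone check that reuses the same pattern as `_extract_commands`. The helper name `extract_commands` and the sample resolution note are invented for illustration. Because each alternative ends in `.*`, a match runs to the end of its line, so any prose that follows a command on the same line is captured with it.

```python
import re

# Same pattern as in _extract_commands; each match extends to the end of the line.
CMD_PATTERN = re.compile(r'(kubectl\s+\S+.*|docker\s+\S+.*|systemctl\s+\S+.*)')


def extract_commands(note: str) -> list[str]:
    """Return every kubectl/docker/systemctl fragment found in a resolution note."""
    return [match.strip() for match in CMD_PATTERN.findall(note)]


# Hypothetical resolution note, made up for this example.
note = (
    "Pods were crash-looping.\n"
    "kubectl get pods -n prod\n"
    "Fixed by running kubectl rollout restart deploy/api and waiting 2 min\n"
)

extracted = extract_commands(note)
for command in extracted:
    print(command)
```

The second extracted entry includes the trailing words "and waiting 2 min", which is one reason the generated steps still need human review before anyone runs them.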

3.2 AI-Enhanced Runbook Generation

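# ai_runbook_enhancer.py
from runbook_generator import Runbook


class AIRunbookEnhancer:
    def enhance(self, runbook: Runbook, context: dict) -> Runbook:
        # Use an LLM to fill in information missing from the Runbook
        prompt = f"""
You are a senior operations engineer. Please complete the following auto-generated Runbook.

Alert name: {runbook.alert_name}
Current diagnosis steps: {[s.action for s in runbook.diagnosis_steps]}
Current fix steps: {[s.action for s in runbook.fix_steps]}

System context:
- Cluster version: {context.get('k8s_version', 'unknown')}
- Node count: {context.get('node_count', 'unknown')}
- Main services: {context.get('services', [])}

Please add:
1. A description of the typical symptoms of this failure
2. Missing diagnosis steps (such as checking node status and network connectivity)
3. Detailed parameters and caveats for the fix steps
4. How to verify the fix
5. The estimated time to fix

Output the complete Runbook in JSON format.
"""
        response = self.call_llm(prompt)
        enhanced = self.parse_response(response)
        return enhanced

    def call_llm(self, prompt: str) -> str:
        # Placeholder: not defined in the original listing; plug in an LLM client here
        raise NotImplementedError

    def parse_response(self, response: str) -> Runbook:
        # Placeholder: not defined in the original listing; convert the JSON reply into a Runbook
        raise NotImplementedError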
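The listing leaves `parse_response` unspecified. As a minimal sketch of the first half of that step, the function below pulls a JSON object out of a model reply and checks that expected fields are present. The function name `parse_llm_json` and the key names in `REQUIRED_KEYS` are assumptions for illustration; the actual JSON schema depends on what the prompt asks the model to return.

```python
import json

# Hypothetical field names; adjust to whatever schema the prompt requests.
REQUIRED_KEYS = ("symptoms", "diagnosis_steps", "fix_steps", "estimated_time_minutes")


def parse_llm_json(response: str) -> dict:
    """Extract the outermost JSON object from an LLM reply and validate its keys.

    Models often wrap JSON in explanatory prose, so the text between the first
    '{' and the last '}' is parsed rather than the whole reply.
    """
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    data = json.loads(response[start:end + 1])  # raises ValueError on invalid JSON
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"response is missing keys: {missing}")
    return data


reply = (
    "Here is the completed runbook:\n"
    '{"symptoms": ["5xx rate above 5%"], "diagnosis_steps": [], '
    '"fix_steps": [], "estimated_time_minutes": 10}\n'
    "Let me know if you need changes."
)
parsed = parse_llm_json(reply)
print(parsed["symptoms"])
```

Rejecting replies with missing keys, instead of silently filling defaults, keeps a malformed model answer from turning into a Runbook that looks complete.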

3.3 Runbook Validation and Updates

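# runbook_validator.py
from runbook_generator import Runbook


class RunbookValidator:
    def validate_against_history(
        self, runbook: Runbook, incidents: list[dict]
    ) -> dict:
        """Validate the Runbook's effectiveness against historical incidents."""
        results = {
            'total_incidents': len(incidents),
            'would_resolve': 0,
            'would_fail': 0,
            'missing_steps': [],
        }
        for incident in incidents:
            # Simulate following the Runbook step by step
            simulated = self._simulate_runbook(runbook, incident)
            if simulated['resolved']:
                results['would_resolve'] += 1
            else:
                results['would_fail'] += 1
                results['missing_steps'].extend(
                    simulated['missing_actions']
                )
        # Percentage of incidents the Runbook would have resolved;
        # max(..., 1) avoids division by zero when there are no incidents
        results['effectiveness'] = (
            results['would_resolve'] / max(results['total_incidents'], 1) * 100
        )
        return results

    def _simulate_runbook(self, runbook: Runbook, incident: dict) -> dict:
        # Placeholder: not defined in the original listing. Must return a dict
        # with 'resolved' (bool) and 'missing_actions' (list).
        raise NotImplementedError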
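The listing leaves `_simulate_runbook` unspecified. One possible reading, sketched here as a standalone function, is a coverage check: an incident counts as resolved if every command the responders ran also appears in the Runbook. The `actions_taken` field on the incident record is a hypothetical assumption, and a real replay would likely need fuzzier matching (resource names and namespaces differ between incidents).

```python
def simulate_runbook(runbook_commands: list[str], incident: dict) -> dict:
    """Compare the commands used in a past incident against a Runbook's commands.

    Assumes (hypothetically) that incident['actions_taken'] lists the commands
    the responders ran. Returns the same shape the validator expects.
    """
    def normalize(command: str) -> str:
        # Collapse runs of whitespace so formatting differences do not matter.
        return " ".join(command.split())

    known = {normalize(command) for command in runbook_commands}
    actions = incident.get("actions_taken", [])
    missing = [action for action in actions if normalize(action) not in known]
    # An incident with no recorded actions cannot be confirmed as resolved.
    return {"resolved": bool(actions) and not missing, "missing_actions": missing}


runbook_commands = ["kubectl get pods -n prod", "kubectl rollout undo deploy/api"]
covered = simulate_runbook(
    runbook_commands,
    {"actions_taken": ["kubectl  get pods -n prod", "kubectl rollout undo deploy/api"]},
)
uncovered = simulate_runbook(
    runbook_commands,
    {"actions_taken": ["kubectl get pods -n prod", "kubectl scale deploy/api --replicas=5"]},
)
print(covered, uncovered)
```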

4. Trade-offs of AI Runbook Generation

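Uncertain generation quality: an AI-generated Runbook may contain wrong commands or steps that do not apply to the current environment. Every generated Runbook should be reviewed by a human before it is marked "verified"; unverified Runbooks are for reference only.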
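The review gate can be made explicit in the data model. The sketch below is one way to do it; `ReviewStatus`, `RunbookRecord`, `mark_verified`, and `usable_for_execution` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    UNVERIFIED = "unverified"  # generated, reference only
    VERIFIED = "verified"      # reviewed and approved by a human


@dataclass
class RunbookRecord:
    alert_name: str
    status: ReviewStatus = ReviewStatus.UNVERIFIED  # generated Runbooks start unverified
    reviewed_by: Optional[str] = None


def mark_verified(record: RunbookRecord, reviewer: str) -> RunbookRecord:
    """Promote a Runbook to VERIFIED; a named human reviewer is mandatory."""
    if not reviewer.strip():
        raise ValueError("a named human reviewer is required")
    record.status = ReviewStatus.VERIFIED
    record.reviewed_by = reviewer
    return record


def usable_for_execution(record: RunbookRecord) -> bool:
    """Only verified Runbooks may drive execution; others are reference only."""
    return record.status is ReviewStatus.VERIFIED


record = RunbookRecord(alert_name="HighErrorRate")
print(usable_for_execution(record))
mark_verified(record, reviewer="alice")
print(usable_for_execution(record))
```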

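Coverage of knowledge extraction: command sequences extracted from operation logs may be incomplete, because many operations are performed through a web console or an API and leave no command record. It is advisable to also extract operation records from the GitOps repository (for example, the change history of ArgoCD Application resources).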

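Runbook freshness: after an architecture change, commands in a Runbook may stop working (for example, after a service is renamed or an API version is upgraded). It is advisable to add a Runbook validation step to CI that periodically runs the Runbook's diagnosis commands and confirms they still work.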
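A periodic CI check along these lines might look like the sketch below. It runs only commands that contain a read-only verb and reports the rest as skipped, so the check itself cannot change cluster state. The names `check_diagnosis_commands`, `run_command`, and `READ_ONLY_VERBS` are assumptions for illustration; the runner is injectable so the logic can be exercised without a cluster.

```python
import shlex
import subprocess
from typing import Callable

# Verbs treated as safe to run unattended (an assumption; tune per environment).
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


def run_command(command: str) -> int:
    """Default runner: execute the command and return its exit code."""
    try:
        completed = subprocess.run(
            shlex.split(command), capture_output=True, timeout=30
        )
        return completed.returncode
    except (OSError, ValueError, subprocess.TimeoutExpired):
        # Missing binary, malformed quoting, or a hang all count as failures.
        return 1


def check_diagnosis_commands(
    commands: list[str], runner: Callable[[str], int] = run_command
) -> dict[str, str]:
    """Map each command to 'ok', 'stale', or 'skipped'."""
    report = {}
    for command in commands:
        tokens = command.split()
        if not READ_ONLY_VERBS.intersection(tokens[1:]):
            report[command] = "skipped"  # never run mutating commands in CI
            continue
        report[command] = "ok" if runner(command) == 0 else "stale"
    return report


# Fake runner standing in for a real cluster: the renamed service fails.
fake_runner = lambda command: 1 if "old-name" in command else 0
report = check_diagnosis_commands(
    [
        "kubectl get pods -n prod",
        "kubectl delete pod api-0",
        "kubectl describe svc old-name",
    ],
    runner=fake_runner,
)
print(report)
```

A "stale" result is a prompt to regenerate or review the Runbook, not proof that the command is wrong; the failure may also come from missing CI credentials.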

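Security risk: a Runbook may contain sensitive operations (such as deleting resources or modifying configuration). Auto-generated Runbooks must be labeled with a risk level, and high-risk operations require a second confirmation.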
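Risk labeling can start from a simple verb lookup, as in the sketch below. The verb sets and the names `risk_level` and `may_execute` are illustrative assumptions; a production classifier would also need to consider the target resource and namespace.

```python
# Hypothetical verb sets; extend to match the tools your team uses.
HIGH_RISK_VERBS = {"delete", "drain", "rm", "stop"}
MEDIUM_RISK_VERBS = {"apply", "scale", "rollout", "restart"}


def risk_level(command: str) -> str:
    """Label a command 'high', 'medium', or 'low' based on its verb tokens."""
    tokens = set(command.split()[1:])  # skip the tool name itself
    if tokens & HIGH_RISK_VERBS:
        return "high"
    if tokens & MEDIUM_RISK_VERBS:
        return "medium"
    return "low"


def may_execute(command: str, confirmed: bool = False) -> bool:
    """High-risk commands run only after an explicit extra confirmation."""
    return risk_level(command) != "high" or confirmed


for command in (
    "kubectl get pods -n prod",
    "kubectl scale deploy/api --replicas=3",
    "kubectl delete pod api-0",
):
    print(command, risk_level(command), may_execute(command))
```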

5. Summary

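AI-assisted Runbook generation turns operational knowledge from something "in people's heads" into "executable documents", greatly lowering the barrier to passing that knowledge on. For rollout, start by generating Runbooks for high-frequency alerts, then expand gradually to low-frequency scenarios. Key principles: generated Runbooks must be reviewed by a human, their freshness must be validated regularly, high-risk operations must require a second confirmation, and continuous updating is what keeps a Runbook alive.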
