
LoongSuite Python Agent Explained: Making Your AI Application "Transparent" (Part 2)

news 2026/7/18 9:54:24

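8. Hands-on Case Study: Building an Observable Intelligent Assistant

8.1 Scenario

Suppose we want to build an intelligent research assistant that can:

Accept a research question from the user
Search for relevant material
Analyze and summarize the information
Generate a research report

We will build the application with LangChain and LangGraph, and monitor it with LoongSuite.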

8.2 Project Structure

```
research_assistant/
├── app.py              # Main application
├── tools.py            # Tool definitions
├── agent.py            # Agent definition
├── requirements.txt    # Dependencies
└── config.py           # Configuration
```

8.3 Full Implementation

requirements.txt

```
langchain>=0.1.0
langchain-openai>=0.1.0
langgraph>=0.2.0
loongsuite-distro>=0.5.0
loongsuite-instrumentation-langchain>=0.1.0
loongsuite-instrumentation-langgraph>=0.1.0
```

config.py

```
import os


class Config:
    """Runtime settings, read from environment variables with defaults."""

    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
    OPENAI_BASE_URL = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
```

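tools.py

```
import random
import time

from langchain_core.tools import tool


@tool
def search_web(query: str) -> str:
    """Search the web for information.

    Args:
        query: Search keywords.

    Returns:
        A summary of the search results.
    """
    # Simulate network latency; the results below are mock data.
    time.sleep(random.uniform(0.5, 1.5))

    mock_results = {
        "AI": "Artificial intelligence (AI) is a branch of computer science devoted to building systems that can perform tasks that normally require human intelligence...",
        "machine learning": "Machine learning is one of the core techniques of AI; it lets computers learn from data and improve...",
        "deep learning": "Deep learning is a subset of machine learning that uses neural networks to mimic how the human brain works...",
    }

    for keyword, result in mock_results.items():
        if keyword in query:
            return result

    return f'Search results for "{query}": this is an interesting topic that spans several areas of knowledge...'


@tool
def analyze_data(data: str) -> str:
    """Analyze data and extract key information.

    Args:
        data: The text to analyze.

    Returns:
        The analysis result.
    """
    time.sleep(random.uniform(0.3, 0.8))

    word_count = len(data.split())
    # Treat the first three non-empty sentences as the key points.
    key_points = [s.strip() for s in data.split(".") if s.strip()][:3]

    return f"Analysis: {word_count} words; key points: {'; '.join(key_points)}"


@tool
def generate_report(topic: str, analysis: str) -> str:
    """Generate a research report.

    Args:
        topic: The research topic.
        analysis: The analysis result.

    Returns:
        A formatted research report.
    """
    time.sleep(random.uniform(0.5, 1.0))

    report = f"""
================================
Research Report: {topic}
================================

[Summary]
This report presents an in-depth study of "{topic}".

[Analysis]
{analysis}

[Conclusion]
This study provides a deeper understanding of "{topic}".

================================
Generated at: {time.strftime("%Y-%m-%d %H:%M:%S")}
================================
"""
    return report
```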

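agent.py

```
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

from config import Config
from tools import analyze_data, generate_report, search_web


def create_research_agent():
    """Build a ReAct agent wired to the three research tools."""
    llm = ChatOpenAI(
        model=Config.MODEL_NAME,
        api_key=Config.OPENAI_API_KEY,
        base_url=Config.OPENAI_BASE_URL,
        temperature=0,
    )

    tools = [search_web, analyze_data, generate_report]

    system_prompt = """
You are a professional research assistant. Your job is to:
1. Understand the user's research needs
2. Use the search_web tool to find relevant information
3. Use the analyze_data tool to analyze the search results
4. Use the generate_report tool to produce the final report

Carry out every step carefully and deliver high-quality research results.
"""

    # Note: newer LangGraph releases deprecate `state_modifier` in favor of
    # `prompt`; switch the keyword if your installed version rejects it.
    agent = create_react_agent(llm, tools, state_modifier=system_prompt)
    return agent
```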

app.py

```
from opentelemetry.instrumentation.langchain import LangChainInstrumentor
from opentelemetry.instrumentation.langgraph import LangGraphInstrumentor

from agent import create_research_agent


def setup_telemetry():
    """Enable LangChain and LangGraph instrumentation."""
    # If the app is launched with loongsuite-instrument, these explicit
    # calls may be redundant.
    LangChainInstrumentor().instrument()
    LangGraphInstrumentor().instrument()
    print("✅ Telemetry enabled")


def main():
    setup_telemetry()

    agent = create_research_agent()

    queries = [
        "Please research the history of artificial intelligence for me",
        "What is the difference between machine learning and deep learning?",
        "What are the applications of AI in healthcare?",
    ]

    for query in queries:
        print(f"\n{'=' * 60}")
        print(f"User question: {query}")
        print("=" * 60)

        result = agent.invoke({
            "messages": [{"role": "user", "content": query}]
        })

        # The last message in the returned state is the agent's final answer.
        final_message = result["messages"][-1]
        print(f"\nAssistant reply:\n{final_message.content}")


if __name__ == "__main__":
    main()
```

8.4 Running and Monitoring

Start the application

```
export OPENAI_API_KEY="your-api-key"
export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=SPAN_ONLY

loongsuite-instrument \
    --traces_exporter console \
    --metrics_exporter console \
    --service_name research-assistant \
    python app.py
```

Observe the output

You will see detailed trace information, one JSON object per span:

```
{
    "name": "invoke_agent Research Agent",
    "context": {
        "trace_id": "0xabc123...",
        "span_id": "0xdef456..."
    },
    "attributes": {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.span.kind": "AGENT"
    }
}
```

```
{
    "name": "react step 1",
    "parent_id": "0xdef456...",
    "attributes": {
        "gen_ai.operation.name": "react",
        "gen_ai.react.round": 1
    }
}
```

```
{
    "name": "execute_tool search_web",
    "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "search_web",
        "gen_ai.tool.arguments": "{\"query\": \"人工智能发展历史\"}"
    }
}
```

8.5 Understanding the Trace

Let's walk through one complete request:

```
Trace: user asks "Please research the history of artificial intelligence for me"
│
├── Span: invoke_agent Research Agent (total: 8.5s)
│   │
│   ├── Span: react step 1 (3.2s)
│   │   ├── Span: chat gpt-4o-mini (1.5s)
│   │   │   └── LLM decides to call the search_web tool
│   │   └── Span: execute_tool search_web (1.2s)
│   │       └── Tool returns the search results
│   │
│   ├── Span: react step 2 (2.8s)
│   │   ├── Span: chat gpt-4o-mini (1.3s)
│   │   │   └── LLM decides to call the analyze_data tool
│   │   └── Span: execute_tool analyze_data (0.6s)
│   │       └── Tool returns the analysis result
│   │
│   └── Span: react step 3 (2.5s)
│       ├── Span: chat gpt-4o-mini (1.4s)
│       │   └── LLM decides to call the generate_report tool
│       └── Span: execute_tool generate_report (0.8s)
│           └── Tool returns the final report
```

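From this trace you can see at a glance:

Total time: the whole research task completes in 8.5 seconds
Bottleneck: LLM calls are the largest single cost, about 4.2 seconds (1.5 + 1.3 + 1.4), or roughly half of the total
Tool calls: 3 tools were invoked in total
Reasoning steps: the agent went through 3 rounds of ReAct reasoning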
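The same breakdown can be computed rather than read off by eye. The following is a minimal sketch that hard-codes the durations from the example trace above as a nested dictionary; the `TRACE` structure and the `total_by_prefix` helper are illustrative names, not part of LoongSuite or OpenTelemetry.

```python
# Durations in milliseconds, copied from the example trace above.
TRACE = {
    "name": "invoke_agent Research Agent",
    "duration_ms": 8500,
    "children": [
        {"name": "react step 1", "duration_ms": 3200, "children": [
            {"name": "chat gpt-4o-mini", "duration_ms": 1500, "children": []},
            {"name": "execute_tool search_web", "duration_ms": 1200, "children": []},
        ]},
        {"name": "react step 2", "duration_ms": 2800, "children": [
            {"name": "chat gpt-4o-mini", "duration_ms": 1300, "children": []},
            {"name": "execute_tool analyze_data", "duration_ms": 600, "children": []},
        ]},
        {"name": "react step 3", "duration_ms": 2500, "children": [
            {"name": "chat gpt-4o-mini", "duration_ms": 1400, "children": []},
            {"name": "execute_tool generate_report", "duration_ms": 800, "children": []},
        ]},
    ],
}


def total_by_prefix(span: dict, prefix: str) -> int:
    """Sum the durations of this span and its descendants whose name starts with prefix."""
    own = span["duration_ms"] if span["name"].startswith(prefix) else 0
    return own + sum(total_by_prefix(child, prefix) for child in span["children"])


llm_ms = total_by_prefix(TRACE, "chat ")
tool_ms = total_by_prefix(TRACE, "execute_tool ")
# Time not covered by any LLM or tool span (framework overhead, gaps between spans).
other_ms = TRACE["duration_ms"] - llm_ms - tool_ms

for label, value in [("LLM", llm_ms), ("Tools", tool_ms), ("Other", other_ms)]:
    print(f"{label:<6}{value:>6} ms  ({value / TRACE['duration_ms']:.0%})")
```

For the example trace this gives 4,200 ms of LLM time, 2,600 ms of tool time, and 1,700 ms that no child span accounts for, which is worth investigating separately when it grows.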

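9. Advanced Usage

9.1 Custom Span Attributes

Sometimes you need to add custom span attributes to record business information:

```
from opentelemetry import trace


def process_order_with_telemetry(order_id: str):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("process_order") as span:
        # Attach business context to the span.
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.status", "processing")
        span.set_attribute("business.department", "sales")

        try:
            # process_order is your own business function (not defined here).
            result = process_order(order_id)
            span.set_attribute("order.result", "success")
            span.set_attribute("order.amount", result.amount)
            return result
        except Exception as e:
            span.set_attribute("order.result", "failed")
            span.set_attribute("error.message", str(e))
            # With default settings the context manager records the exception
            # and marks the span as failed when the exception propagates, so
            # an explicit record_exception() call would duplicate the event.
            raise
```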

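9.2 Using Span Events

Span events record notable moments within a span's lifetime:

```
import time

from opentelemetry import trace


def call_llm_with_events(prompt: str):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("llm_call") as span:
        # Events are timestamped automatically, so no explicit timestamp is needed.
        span.add_event("LLM call started", {"prompt.length": len(prompt)})

        start = time.perf_counter()
        # llm is assumed to be a chat model created elsewhere, e.g. ChatOpenAI.
        result = llm.invoke(prompt)
        duration = time.perf_counter() - start

        span.add_event("LLM call finished", {
            "duration_ms": duration * 1000,
            # Chat models return a message object; measure its text content.
            "response.length": len(result.content),
        })

        return result
```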

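9.3 Filtering Sensitive Information

In production you may need to keep sensitive information (API keys, users' private data) out of your telemetry. One option is to suppress instrumentation around the sensitive call:

```
import requests
from opentelemetry.instrumentation.utils import suppress_instrumentation


def call_external_api(api_key: str, data: dict):
    # Instrumented libraries skip span creation inside this block.
    with suppress_instrumentation():
        response = requests.post(
            "https://api.example.com/endpoint",
            headers={"Authorization": f"Bearer {api_key}"},
            json=data,
        )
    return response.json()
```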

Alternatively, use a custom Span Processor:

```
import re

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


class SensitiveDataFilter(SpanProcessor):
    SENSITIVE_PATTERNS = [
        (r"api[_-]?key[\"']?\s*[:=]\s*[\"']?[a-zA-Z0-9_-]+", "api_key=***"),
        (r"password[\"']?\s*[:=]\s*[\"']?[^\s\"']+", "password=***"),
        (r"token[\"']?\s*[:=]\s*[\"']?[a-zA-Z0-9_-]+", "token=***"),
    ]

    def on_end(self, span: ReadableSpan) -> None:
        redacted = {}
        for attr_name, attr_value in span.attributes.items():
            if isinstance(attr_value, str):
                for pattern, replacement in self.SENSITIVE_PATTERNS:
                    attr_value = re.sub(pattern, replacement, attr_value, flags=re.IGNORECASE)
            redacted[attr_name] = attr_value
        # A finished span is read-only and has no set_attribute(). As a
        # workaround, overwrite the private _attributes field; this may break
        # across SDK versions. Register this processor before the exporting one.
        span._attributes = redacted
```
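Before wiring redaction patterns into a processor, it helps to check them against sample values in isolation. The sketch below reuses the three patterns from the SensitiveDataFilter above with only the standard library; `redact`, `redact_attributes`, and the sample attribute values are illustrative and not part of any SDK.

```python
import re

# Same patterns as in SensitiveDataFilter, compiled once up front.
SENSITIVE_PATTERNS = [
    (re.compile(r"api[_-]?key[\"']?\s*[:=]\s*[\"']?[a-zA-Z0-9_-]+", re.IGNORECASE), "api_key=***"),
    (re.compile(r"password[\"']?\s*[:=]\s*[\"']?[^\s\"']+", re.IGNORECASE), "password=***"),
    (re.compile(r"token[\"']?\s*[:=]\s*[\"']?[a-zA-Z0-9_-]+", re.IGNORECASE), "token=***"),
]


def redact(value: str) -> str:
    """Apply every pattern in turn and return the masked string."""
    for pattern, replacement in SENSITIVE_PATTERNS:
        value = pattern.sub(replacement, value)
    return value


def redact_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes with string values masked."""
    return {
        key: redact(value) if isinstance(value, str) else value
        for key, value in attributes.items()
    }


# Made-up attribute values for the check.
sample = {
    "http.url": "https://api.example.com/endpoint?api_key=sk_test_123",
    "db.statement": "login user=alice password: hunter2",
    "gen_ai.usage.input_tokens": 42,
}

for key, value in redact_attributes(sample).items():
    print(f"{key}: {value}")
```

Regex-based masking only catches the formats you anticipate, so treat it as a safety net rather than a guarantee; disabling content capture for sensitive flows remains the more reliable control.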

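9.4 Sampling Strategies

Under heavy traffic you may not need to trace every request. OpenTelemetry supports several sampling strategies:

```
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of root traces; child spans follow their parent's decision.
sampler = ParentBased(root=TraceIdRatioBased(0.1))

provider = TracerProvider(sampler=sampler)
```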
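Ratio-based sampling is deterministic: the decision is derived from the trace ID, so every service that sees the same trace reaches the same verdict. The sketch below is a simplified standard-library model of that idea, comparing the low 64 bits of the trace ID against a threshold; the SDK's actual implementation may differ in detail, and `should_sample` is an illustrative name.

```python
import random

# Only the low 64 bits of the 128-bit trace ID take part in the decision.
TRACE_ID_LIMIT = (1 << 64) - 1


def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic ratio sampling: the same trace ID always gets the same decision."""
    bound = round(rate * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


rng = random.Random(0)  # fixed seed so the demo is reproducible
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
sampled = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"sampled {sampled} of {len(trace_ids)} traces")
```

Because trace IDs are effectively random, the sampled share converges on the configured rate without any coordination between processes.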

9.5 Batch Export

To improve performance, export spans in batches:

```
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")

processor = BatchSpanProcessor(
    exporter,
    max_queue_size=2048,         # maximum number of buffered spans
    schedule_delay_millis=5000,  # interval between exports
    max_export_batch_size=512,   # maximum spans per export request
)

provider = TracerProvider()
provider.add_span_processor(processor)
```
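To make the roles of `max_queue_size` and `max_export_batch_size` concrete, here is a toy, single-threaded model of batch export. It is a sketch only: `MiniBatcher` is a hypothetical class, the real processor exports from a background thread on a timer, and which spans the SDK drops when its queue overflows is an implementation detail this model does not try to reproduce.

```python
from collections import deque


class MiniBatcher:
    """Toy model of batch export: buffer finished spans, export them in chunks."""

    def __init__(self, export, max_queue_size=2048, max_export_batch_size=512):
        self._export = export
        self._queue = deque()
        self._max_queue_size = max_queue_size
        self._max_export_batch_size = max_export_batch_size
        self.dropped = 0

    def on_end(self, span):
        # When the buffer is full, spans are dropped rather than blocking the app.
        if len(self._queue) >= self._max_queue_size:
            self.dropped += 1
            return
        self._queue.append(span)

    def flush(self):
        # Drain the buffer in chunks of at most max_export_batch_size spans.
        while self._queue:
            size = min(len(self._queue), self._max_export_batch_size)
            self._export([self._queue.popleft() for _ in range(size)])


batches = []
batcher = MiniBatcher(batches.append, max_queue_size=4, max_export_batch_size=2)
for name in ["a", "b", "c", "d", "e"]:
    batcher.on_end(name)
batcher.flush()

print(batches, "dropped:", batcher.dropped)
```

The takeaway for tuning: a queue that is too small for your burst rate loses spans silently, while a larger batch size trades fewer network requests for bigger payloads.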

10. Best Practices

10.1 Development Environment

For development, the following setup is recommended:

```
export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=SPAN_ONLY
export OTEL_LOG_LEVEL=debug

loongsuite-instrument \
    --traces_exporter console \
    --metrics_exporter console \
    --service_name my-ai-app-dev \
    python app.py
```

10.2 Production Environment

For production, the following setup is recommended:

```
export OTEL_SERVICE_NAME=my-ai-app-prod
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=your-backend-endpoint
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=your-token"
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=false

loongsuite-instrument python app.py
```
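`OTEL_EXPORTER_OTLP_HEADERS` takes a comma-separated list of `key=value` pairs, and special characters in values (such as the space in a `Bearer` token) are commonly percent-encoded. A quick way to check what a given value will turn into is to parse it yourself. The helper below is a standard-library sketch of that format, not the SDK's own parser; `parse_otlp_headers` and the sample header values are illustrative.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict:
    """Parse 'k1=v1,k2=v2' into a dict, skipping malformed entries."""
    headers = {}
    for item in raw.split(","):
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            continue
        headers[key.strip()] = unquote(value.strip())
    return headers


print(parse_otlp_headers("Authorization=Bearer%20abc123,x-tenant=team-a"))
```

Running a check like this in a deployment script catches a missing `=` or a stray comma before the exporter starts silently failing authentication.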

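10.3 Performance Tips

Tip 1: Choose a sensible sampling rate

```
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# is_production is a flag you define yourself, e.g. from an environment variable.
sampler = TraceIdRatioBased(0.1 if is_production else 1.0)
```

Tip 2: Export asynchronously

```
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# exporter is a span exporter created elsewhere, e.g. OTLPSpanExporter.
processor = BatchSpanProcessor(
    exporter,
    schedule_delay_millis=5000,
    max_export_batch_size=512,
    export_timeout_millis=30000,
)
```

Tip 3: Control message content capture

```
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=SPAN_ONLY
```

Capture message content only when you need it, so that large payloads are not shipped with every trace.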

10.4 Cost Monitoring

Use LoongSuite to monitor token consumption:

```
from opentelemetry import metrics

meter = metrics.get_meter(__name__)

token_counter = meter.create_counter(
    "llm.tokens.total",
    unit="tokens",
    description="Total tokens used",
)


def track_token_usage(span):
    """Add the token counts recorded on a finished LLM span to the counter."""
    input_tokens = span.attributes.get("gen_ai.usage.input_tokens", 0)
    output_tokens = span.attributes.get("gen_ai.usage.output_tokens", 0)

    token_counter.add(input_tokens, {"type": "input"})
    token_counter.add(output_tokens, {"type": "output"})
```
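Token counts become more actionable once they are converted to money. The sketch below aggregates the `gen_ai.usage.*` attributes per model and applies a price table. The prices are HYPOTHETICAL placeholders, and `PRICES_PER_MILLION`, `estimate_cost`, and the sample spans are illustrative names and data, not something LoongSuite provides.

```python
from collections import defaultdict

# HYPOTHETICAL prices in USD per 1M tokens; replace with your provider's real rates.
PRICES_PER_MILLION = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def estimate_cost(spans):
    """Aggregate token usage per model and convert it to an estimated cost."""
    usage = defaultdict(lambda: {"input": 0, "output": 0})
    for attrs in spans:
        model = attrs.get("gen_ai.request.model")
        if model is None:
            continue  # not an LLM span
        usage[model]["input"] += attrs.get("gen_ai.usage.input_tokens", 0)
        usage[model]["output"] += attrs.get("gen_ai.usage.output_tokens", 0)

    costs = {}
    for model, tokens in usage.items():
        price = PRICES_PER_MILLION.get(model)
        if price is None:
            continue  # unknown model: no price to apply
        costs[model] = (
            tokens["input"] * price["input"] + tokens["output"] * price["output"]
        ) / 1_000_000
    return dict(usage), costs


# Attribute dictionaries as they might appear on exported spans (made-up numbers).
spans = [
    {"gen_ai.request.model": "gpt-4o-mini", "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 300},
    {"gen_ai.request.model": "gpt-4o-mini", "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 200},
    {"gen_ai.tool.name": "search_web"},
]

usage, costs = estimate_cost(spans)
print(usage)
print(costs)
```

Keeping the price table outside the aggregation logic makes it easy to update rates, or to load them from configuration, without touching the code that reads spans.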

10.5 Error Tracking

```
from openai import APIConnectionError, RateLimitError  # assumes the OpenAI SDK
from opentelemetry import trace
from opentelemetry.trace.status import Status, StatusCode


def safe_llm_call(prompt: str):
    tracer = trace.get_tracer(__name__)

    # Turn off the automatic exception handling so the status and exception
    # event set below are not duplicated or overwritten on exit.
    with tracer.start_as_current_span(
        "llm_call", record_exception=False, set_status_on_exception=False
    ) as span:
        try:
            # llm is assumed to be a chat model created elsewhere.
            result = llm.invoke(prompt)
            span.set_status(Status(StatusCode.OK))
            return result
        except RateLimitError as e:
            span.set_status(Status(StatusCode.ERROR, "rate_limit_exceeded"))
            span.record_exception(e)
            span.set_attribute("error.type", "rate_limit")
            raise
        except APIConnectionError as e:
            span.set_status(Status(StatusCode.ERROR, "connection_failed"))
            span.record_exception(e)
            span.set_attribute("error.type", "connection")
            raise
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
```
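The `error.type` attribute is most useful when it takes only a handful of fixed values, because dashboards and alerts group by it. One way to keep it that way is to centralize the mapping from exception to label. The sketch below uses built-in exception classes as stand-ins for provider-specific ones; `classify_error` and the label set are hypothetical, not part of any SDK.

```python
def classify_error(exc: BaseException) -> str:
    """Map an exception to a small, fixed set of error.type values."""
    # Rules are checked in order; the first matching class wins.
    rules = [
        (TimeoutError, "timeout"),
        (ConnectionError, "connection"),
        (PermissionError, "auth"),
        (ValueError, "bad_request"),
    ]
    for exc_type, label in rules:
        if isinstance(exc, exc_type):
            return label
    return "unknown"


for error in [TimeoutError("slow"), ConnectionResetError("reset"), KeyError("x")]:
    print(type(error).__name__, "->", classify_error(error))
```

In an `except` block you would then write `span.set_attribute("error.type", classify_error(e))` and keep the free-form message in the status description, where high cardinality does no harm.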

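11. Common Problems and Solutions

11.1 Problem: No Trace Data Appears

Symptom: after running the application, no trace information is printed to the console.

Possible causes and fixes:

The instrumentation package is not installed

```
pip install loongsuite-instrumentation-langchain
```

The instrument() method was never called

```
from opentelemetry.instrumentation.langchain import LangChainInstrumentor

LangChainInstrumentor().instrument()  # make sure this call actually runs
```

The exporter environment variable is not set
```
export OTEL_TRACES_EXPORTER=console
```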

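11.2 Problem: Trace Data Is Incomplete

Symptom: only some spans show up and the trace is broken.

Possible causes and fixes:

Context propagation

Make sure the context is propagated correctly across asynchronous calls:

```
from opentelemetry.context import attach, detach
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()


async def async_task(carrier: dict):
    # carrier holds the propagated headers (e.g. "traceparent") from the caller.
    token = attach(propagator.extract(carrier))
    try:
        await do_something()  # your own coroutine
    finally:
        detach(token)
```

Multiple processes
In a multi-process environment, each process needs its own TracerProvider.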
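When debugging broken traces it helps to know what the `carrier` in the context propagation snippet actually contains. With W3C Trace Context it carries a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below parses that header with the standard library so you can log or inspect it; `parse_traceparent` and `TraceParent` are illustrative names, and the sample value is the example from the W3C specification.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(parse_traceparent(carrier["traceparent"]))
```

If the trace ID logged on the receiving side differs from the one on the sending side, or the header is missing altogether, the break is in propagation rather than in the instrumentation itself.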

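11.3 Problem: Performance Degradation

Symptom: the application becomes noticeably slower once tracing is enabled.

Possible causes and fixes:

Synchronous export blocks the application

Use batch export (see section 9.5):

```
from opentelemetry.sdk.trace.export import BatchSpanProcessor
```

Too much message content is captured

Capture less content:

```
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=false
```

The sampling rate is too high
Lower the sampling rate. The ratio only takes effect when a ratio-based sampler is selected:
```
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```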

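11.4 Problem: Sensitive Information Leaks

Symptom: trace data contains API keys, passwords, or other sensitive information.

Solutions:

Use suppress_instrumentation

```
from opentelemetry.instrumentation.utils import suppress_instrumentation

with suppress_instrumentation():
    sensitive_operation()
```

Filter with a custom Span Processor

```
from opentelemetry.sdk.trace import SpanProcessor


class SensitiveDataFilter(SpanProcessor):
    def on_end(self, span):
        # Redact sensitive attributes here (see section 9.3).
        pass
```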

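11.5 Problem: Conflicts with an Existing Monitoring System

Symptom: the project already uses another monitoring solution, and it conflicts with LoongSuite.

Solutions:

Use a different service name

```
export OTEL_SERVICE_NAME=my-app-loongsuite
```

Use a different export endpoint

```
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://loongsuite-backend:4318/v1/traces
```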

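12. Summary and Outlook

12.1 Key Takeaways

In this article we covered:

What observability is: making your application "transparent" so you can see what happens inside it
What LoongSuite Python Agent is: an open-source observability tool for Python applications from Alibaba, designed specifically for AI applications
How to use it: install the distro, configure instrumentation, start the application
Which frameworks it supports: LangChain, LangGraph, CrewAI, OpenAI, Anthropic, and others
How to read trace data: what spans, traces, and attributes mean
Advanced usage: custom attributes, filtering sensitive information, sampling strategies
Best practices: development and production configuration, performance tuning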

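12.2 Why Choose LoongSuite?

| Feature | LoongSuite | Traditional APM |
| --- | --- | --- |
| AI framework support | ✅ Native support for LangChain, CrewAI, and others | ❌ Requires manual instrumentation |
| Token consumption tracking | ✅ Tracked automatically | ❌ Must be implemented yourself |
| Prompt content capture | ✅ Supported | ❌ Not supported |
| OpenTelemetry compatibility | ✅ Fully compatible | ⚠️ Partially compatible |
| Zero code intrusion | ✅ Supported | ⚠️ Requires code changes |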

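12.3 Outlook

LoongSuite Python Agent is still evolving. Future versions may bring:

Support for more frameworks: AutoGen, MetaGPT, and others
Smarter analysis: automatic detection of performance bottlenecks and anomalies
Cost optimization advice: suggestions for reducing token usage based on trace data
Deeper integration with LoongCollector: end-to-end, full-stack observability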

12.4 References

GitHub repository: https://github.com/alibaba/loongsuite-python-agent
OpenTelemetry documentation: https://opentelemetry.io/docs/
LangChain documentation: https://python.langchain.com/docs/
Jaeger website: https://www.jaegertracing.io/

Appendix: Quick Reference Card

A. Common Commands

```
pip install loongsuite-distro
pip install loongsuite-instrumentation-langchain
loongsuite-bootstrap -a install --latest --auto-detect
loongsuite-instrument --traces_exporter console python app.py
```

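B. Common Environment Variables

| Environment variable | Description | Example value |
| --- | --- | --- |
| `OTEL_SERVICE_NAME` | Service name | `my-ai-app` |
| `OTEL_TRACES_EXPORTER` | Trace exporter | `console`, `otlp` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint | `http://localhost:4318` |
| `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` | Message content capture | `SPAN_ONLY` |
| `OTEL_SEMCONV_STABILITY_OPT_IN` | Opt in to experimental semantic conventions | `gen_ai_latest_experimental` |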

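C. Common Span Attributes

| Attribute | Description |
| --- | --- |
| `gen_ai.operation.name` | Operation type |
| `gen_ai.request.model` | Model name |
| `gen_ai.usage.input_tokens` | Number of input tokens |
| `gen_ai.usage.output_tokens` | Number of output tokens |
| `gen_ai.tool.name` | Tool name |
| `gen_ai.tool.arguments` | Tool arguments |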