
VideoAgentTrek-ScreenFilter in Practice: Feeding Detection Results into Prometheus for GPU Utilization Alerting

news 2026/7/8 9:58:21


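1. Introduction: Closing the Loop from Detection to Monitoring

Picture this: you have deployed a video object detection model that processes large volumes of surveillance footage around the clock, identifying screen content in each frame. Suddenly processing slows down and the video queue starts to back up. Is the model at fault, or has the server run out of resources? Without monitoring, you only find out when users complain.

That is the problem this article addresses. VideoAgentTrek-ScreenFilter is a screen-content detection model that identifies screen targets in videos and images. Detection alone is not enough, though. We also need to know how heavily the GPU is used during inference, how long each frame takes, and whether the system load is normal.

This article walks through a complete engineering exercise: connecting VideoAgentTrek-ScreenFilter's detection results to Prometheus and alerting on GPU utilization in real time. It is meant as more than a toy demo; the setup is intended to be adaptable to a production environment.

You will learn:

How to expose Prometheus metrics from VideoAgentTrek-ScreenFilter
How to configure Prometheus to scrape those metrics
How to define alert rules based on GPU utilization
How to build a monitoring dashboard in Grafana

Whether you are an operations engineer, an algorithm engineer, or a full-stack developer, this setup can help you manage an AI inference service and keep it running reliably.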

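2. Understanding VideoAgentTrek-ScreenFilter's Core Capabilities

Before the implementation, here is a quick review of what VideoAgentTrek-ScreenFilter does. Knowing these characteristics helps us choose sensible metrics.

2.1 Basic Model Information

VideoAgentTrek-ScreenFilter is built on the Ultralytics YOLO architecture and is dedicated to detecting screen-related targets in videos and images. Its key properties are:

Model path: /root/ai-models/xlangai/VideoAgentTrek-ScreenFilter/best.pt
Task type: object detection (detect)
Supported input: images (JPG/PNG) and video files
Output format: visualized result + structured JSON data

2.2 Workflow of the Two Detection Modes

Image detection mode:

Upload image → model inference → annotated image with bounding boxes + JSON details

The JSON contains the class, confidence, and coordinates of each detected target.

Video detection mode:

Upload video → decode frame by frame → inference per frame → annotated video + JSON statistics

Besides the per-frame detection details, the JSON also contains statistics for the whole video, such as the total number of detections and the count per class.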
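To make the video-level statistics concrete, here is a minimal sketch of how per-frame details could be rolled up into a summary. The field names (`frame_index`, `detections`, `class_name`) and the class name `screen` are assumptions for illustration; the service's real JSON layout may differ.

```python
from collections import Counter


def summarize_video(frames):
    """Aggregate per-frame detections into video-level statistics.

    `frames` is a list of dicts shaped like
    {"frame_index": 0, "detections": [{"class_name": ..., "confidence": ...}]}.
    This layout is assumed for illustration only.
    """
    per_class = Counter()
    frames_with_hits = 0
    for frame in frames:
        detections = frame.get("detections", [])
        if detections:
            frames_with_hits += 1
        # Count one hit per detected object, keyed by class name
        per_class.update(d["class_name"] for d in detections)
    return {
        "total_frames": len(frames),
        "frames_with_detections": frames_with_hits,
        "total_detections": sum(per_class.values()),
        "per_class": dict(per_class),
    }
```

The per-class counts are the same quantity that the `detection_objects_total` counter tracks later in this article, just scoped to one video instead of the whole process lifetime.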

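2.3 Performance Characteristics and Monitoring Needs

Given how the model works, these are the key performance indicators to watch:

Inference speed: how long a single image or video frame takes to process
GPU utilization: the compute load on the GPU during inference
Memory usage: GPU memory and system memory consumption
Throughput: how many frames are processed per unit of time
Detection quality: confidence distribution and detection counts per class

These are the indicators we will collect and monitor with Prometheus.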

3. Adding Prometheus Metrics to the Detection Service

Now for the hands-on part. We need to modify the VideoAgentTrek-ScreenFilter code so that it exposes metrics in Prometheus format.

3.1 Creating the Metric Definitions

First, create a module dedicated to defining the Prometheus metrics. In the project root, create metrics.py:

```python
# metrics.py - Prometheus metric definitions
import functools
import time

from prometheus_client import Counter, Gauge, Histogram, Summary

try:
    import pynvml
except ImportError:
    pynvml = None  # GPU metrics will be unavailable


class DetectionMetrics:
    def __init__(self):
        # Counter: total requests, labeled by type (image or video)
        self.requests_total = Counter(
            'detection_requests_total',
            'Total number of detection requests',
            ['type'],
        )
        # Counter: total frames processed
        self.frames_total = Counter(
            'detection_frames_total',
            'Total number of frames processed',
        )
        # Counter: total objects detected, labeled by class
        self.objects_total = Counter(
            'detection_objects_total',
            'Total number of objects detected',
            ['class_name'],
        )
        # Gauge: current GPU utilization
        self.gpu_utilization = Gauge(
            'gpu_utilization_percent',
            'Current GPU utilization percentage',
            ['gpu_id'],
        )
        # Gauge: GPU memory currently in use
        self.gpu_memory_used = Gauge(
            'gpu_memory_used_mb', 'GPU memory used in MB', ['gpu_id'])
        # Gauge: total GPU memory
        self.gpu_memory_total = Gauge(
            'gpu_memory_total_mb', 'Total GPU memory in MB', ['gpu_id'])
        # Histogram: distribution of inference time
        self.inference_duration = Histogram(
            'inference_duration_seconds',
            'Time spent on model inference',
            buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0],
        )
        # Summary: end-to-end processing latency
        self.processing_latency = Summary(
            'processing_latency_seconds',
            'Latency of complete processing',
        )
        # Gauges: current threshold settings
        self.confidence_threshold = Gauge(
            'confidence_threshold', 'Current confidence threshold setting')
        self.iou_threshold = Gauge(
            'iou_threshold', 'Current IOU threshold setting')

        self._device_count = 0
        self._init_gpu_info()

    def _init_gpu_info(self):
        """Initialize NVML once and record each GPU's total memory (a fixed value)."""
        if pynvml is None:
            print("pynvml not installed, GPU metrics will be limited")
            return
        try:
            pynvml.nvmlInit()
            self._device_count = pynvml.nvmlDeviceGetCount()
            for i in range(self._device_count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                self.gpu_memory_total.labels(gpu_id=f'gpu_{i}').set(info.total / 1024 / 1024)
        except Exception as e:
            self._device_count = 0
            print(f"Failed to init GPU info: {e}")

    def update_gpu_metrics(self):
        """Refresh GPU utilization and memory gauges."""
        try:
            # NVML was initialized in _init_gpu_info; no need to re-init per call
            for i in range(self._device_count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                self.gpu_utilization.labels(gpu_id=f'gpu_{i}').set(util.gpu)
                info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                self.gpu_memory_used.labels(gpu_id=f'gpu_{i}').set(info.used / 1024 / 1024)
        except Exception as e:
            print(f"Failed to update GPU metrics: {e}")

    def record_inference(self, duration_seconds):
        """Record one inference duration."""
        self.inference_duration.observe(duration_seconds)

    def record_processing(self):
        """Decorator that records the wrapped function's latency."""
        def decorator(func):
            @functools.wraps(func)  # keep the wrapped function's name and docstring
            def wrapper(*args, **kwargs):
                start_time = time.perf_counter()
                result = func(*args, **kwargs)
                self.processing_latency.observe(time.perf_counter() - start_time)
                return result
            return wrapper
        return decorator


# Global metrics instance
metrics = DetectionMetrics()
```

3.2 Integrating Metrics into the Main Processing Logic

Next, modify the main processing logic of VideoAgentTrek-ScreenFilter to record metrics at the key points. Image detection serves as the example here:

```python
# Image detection function with metric recording
import time

from ultralytics import YOLO

from metrics import metrics

MODEL_PATH = '/root/ai-models/xlangai/VideoAgentTrek-ScreenFilter/best.pt'


def detect_image_with_metrics(image_path, conf_threshold=0.25, iou_threshold=0.45):
    """Run image detection and record Prometheus metrics along the way."""
    # Count the request
    metrics.requests_total.labels(type='image').inc()

    # Publish the thresholds in use
    metrics.confidence_threshold.set(conf_threshold)
    metrics.iou_threshold.set(iou_threshold)

    start_time = time.time()

    try:
        # Load the model on first use and cache it on the function object
        if not hasattr(detect_image_with_metrics, 'model'):
            model_load_start = time.time()
            detect_image_with_metrics.model = YOLO(MODEL_PATH)
            model_load_time = time.time() - model_load_start
            print(f"Model loaded in {model_load_time:.2f}s")
        model = detect_image_with_metrics.model

        # Run inference
        inference_start = time.time()
        results = model(
            image_path,
            conf=conf_threshold,
            iou=iou_threshold,
            device='cuda',  # use the GPU
        )
        inference_duration = time.time() - inference_start

        # Record inference time
        metrics.record_inference(inference_duration)

        # Collect detections
        detections = []
        for result in results:
            boxes = result.boxes
            if boxes is not None:
                for box in boxes:
                    class_id = int(box.cls[0])
                    class_name = model.names[class_id]
                    confidence = float(box.conf[0])
                    coords = box.xyxy[0].tolist()

                    # Count the detected object by class
                    metrics.objects_total.labels(class_name=class_name).inc()

                    detections.append({
                        'class_id': class_id,
                        'class_name': class_name,
                        'confidence': confidence,
                        'xyxy': coords,
                    })

        # Refresh GPU gauges
        metrics.update_gpu_metrics()

        # Count frames (an image counts as one frame)
        metrics.frames_total.inc()

        total_duration = time.time() - start_time
        print(f"Detection completed in {total_duration:.2f}s, found {len(detections)} objects")

        return {
            'success': True,
            'detections': detections,
            'count': len(detections),
            'processing_time': total_duration,
            'inference_time': inference_duration,
        }

    except Exception as e:
        print(f"Detection failed: {e}")
        return {
            'success': False,
            'error': str(e),
        }
```
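The example above covers images only. For video, the same pattern applies once per frame. The following is a minimal sketch of that loop with the model call and the metric hooks passed in as plain callables, so it runs without a GPU. All four parameters are placeholders: in this article's setup, `infer_fn` would wrap the YOLO call, `record_inference` would be `metrics.record_inference`, and `record_frame` would be `metrics.frames_total.inc`.

```python
import time


def process_frames(frames, infer_fn, record_inference, record_frame):
    """Run infer_fn on each frame, reporting per-frame timing and frame counts.

    The arguments are hypothetical hooks: infer_fn stands in for the model
    call, record_inference for a histogram observe, record_frame for a
    counter increment.
    """
    all_detections = []
    for index, frame in enumerate(frames):
        start = time.perf_counter()
        detections = infer_fn(frame)
        # Time only the model call, mirroring inference_duration_seconds
        record_inference(time.perf_counter() - start)
        record_frame()
        all_detections.append({"frame_index": index, "detections": detections})
    return all_detections
```

Recording once per frame, rather than once per video, is what makes `rate(detection_frames_total[5m])` meaningful as a frames-per-second figure.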

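3.3 Adding a Prometheus Metrics Endpoint

For Prometheus to scrape the metrics, the web service needs a dedicated metrics endpoint. Add one to your web application (assuming Flask or FastAPI), using whichever of the two variants matches your framework.

If it is a Flask application:

```python
# Prometheus metrics endpoint (Flask)
from flask import Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

from metrics import metrics


# `app` is your existing Flask application object
@app.route('/metrics')
def metrics_endpoint():
    """Prometheus scrape endpoint."""
    # Refresh the GPU gauges right before each scrape
    try:
        metrics.update_gpu_metrics()
    except Exception as e:
        print(f"Failed to update GPU metrics: {e}")

    # Return the metrics in Prometheus text format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
```

If it is a FastAPI application:

```python
# Prometheus metrics endpoint (FastAPI)
from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

from metrics import metrics


# `app` is your existing FastAPI application object
@app.get("/metrics")
def get_metrics():
    """Prometheus scrape endpoint."""
    # Refresh the GPU gauges right before each scrape
    try:
        metrics.update_gpu_metrics()
    except Exception as e:
        print(f"Failed to update GPU metrics: {e}")

    # Use the Prometheus content type rather than a bare text/plain
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```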
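Before pointing Prometheus at the endpoint, it helps to confirm that the response contains the series you expect. The sketch below parses the text exposition format into a dictionary. It is deliberately simplified and assumes each sample line is `series value` with no trailing timestamp, which matches what `prometheus_client` emits by default.

```python
def parse_samples(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: assumes sample lines carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        # The value is the last whitespace-separated token on the line
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples
```

To use it against a running service, fetch the body first (for example with `urllib.request.urlopen("http://localhost:7860/metrics").read().decode()`, where the host and port are whatever your application listens on) and pass the text to `parse_samples`.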

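3.4 Creating a Standalone Metrics Server (Optional)

If the main application is hard to modify, or you want a more flexible deployment, you can run a standalone metrics server:

```python
# metrics_server.py - standalone metrics server
import time

from prometheus_client import start_http_server

from metrics import metrics


def run_metrics_server(port=8000):
    """Start a standalone Prometheus metrics server."""
    print(f"Starting metrics server on port {port}")
    start_http_server(port)

    # Refresh the GPU gauges periodically
    while True:
        try:
            metrics.update_gpu_metrics()
        except Exception as e:
            print(f"Error updating GPU metrics: {e}")
        time.sleep(5)  # update every 5 seconds


if __name__ == "__main__":
    run_metrics_server(port=8000)
```

This standalone server can run in the same container or on the same host, and it collects GPU information by calling update_gpu_metrics() periodically. Note that prometheus_client keeps its default registry per process, so a separate process exposes only the GPU gauges. The detection counters and the inference histogram are recorded in the main application's process and are visible only on the main application's /metrics endpoint.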
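If you would rather keep everything in the main process, one option is to run the periodic GPU refresh on a background thread instead of in a separate server. The following is a minimal sketch of that idea; `start_gpu_poller` is a hypothetical helper, and `update_fn` would be `metrics.update_gpu_metrics` in this article's setup.

```python
import threading


def start_gpu_poller(update_fn, interval_seconds=5.0):
    """Call update_fn every interval_seconds on a daemon thread.

    Returns a threading.Event; set it to stop the poller.
    """
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            try:
                update_fn()
            except Exception as exc:  # keep polling even if one update fails
                print(f"GPU metrics update failed: {exc}")
            # wait() doubles as an interruptible sleep
            stop_event.wait(interval_seconds)

    threading.Thread(target=loop, name="gpu-poller", daemon=True).start()
    return stop_event
```

With the poller running, the gauges stay fresh between scrapes and between requests, so the `/metrics` handler no longer has to call NVML itself.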

4. Configuring Prometheus Scraping and Alert Rules

The detection service now exposes metrics. Next, configure Prometheus to scrape them and define the alert rules.

4.1 Prometheus Scrape Configuration

Add new jobs to the Prometheus configuration file prometheus.yml:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # scrape every 15 seconds
  evaluation_interval: 15s  # evaluate alert rules every 15 seconds

scrape_configs:
  # existing job configs...

  # VideoAgentTrek-ScreenFilter monitoring
  - job_name: 'videoagent_screenfilter'
    scrape_interval: 10s  # scrape more frequently
    static_configs:
      - targets: ['your-server-ip:7860']  # main application port
        labels:
          service: 'videoagent-detection'
          instance: 'primary'

  # Metrics server (only if you use the standalone server)
  - job_name: 'videoagent_metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['your-server-ip:8000']  # standalone metrics server port
        labels:
          service: 'videoagent-metrics'
          instance: 'metrics-server'
```

4.2 GPU Utilization Alert Rules

Create the alert rules file gpu_alerts.yml:

```yaml
# gpu_alerts.yml
groups:
  - name: gpu_utilization_alerts
    rules:
      # Rule 1: GPU utilization too high
      - alert: HighGPUUtilization
        expr: gpu_utilization_percent > 90
        for: 5m  # must hold for 5 minutes
        labels:
          severity: warning
          service: videoagent-detection
        annotations:
          summary: "GPU utilization too high (instance {{ $labels.instance }})"
          description: |
            GPU {{ $labels.gpu_id }} utilization has exceeded 90% for 5 minutes.
            Current value: {{ $value }}%
            Check the model load or consider scaling out.

      # Rule 2: GPU memory usage too high
      - alert: HighGPUMemoryUsage
        expr: (gpu_memory_used_mb / gpu_memory_total_mb) * 100 > 85
        for: 3m
        labels:
          severity: warning
          service: videoagent-detection
        annotations:
          summary: "GPU memory usage too high (instance {{ $labels.instance }})"
          description: |
            GPU {{ $labels.gpu_id }} memory usage exceeds 85%.
            Current usage: {{ $value }}%
            Possible memory leak or oversized batch.

      # Rule 3: GPU utilization too low (wasted resources)
      - alert: LowGPUUtilization
        expr: gpu_utilization_percent < 10
        for: 10m
        labels:
          severity: info
          service: videoagent-detection
        annotations:
          summary: "GPU utilization too low (instance {{ $labels.instance }})"
          description: |
            GPU {{ $labels.gpu_id }} utilization has stayed below 10% for 10 minutes.
            Current value: {{ $value }}%
            The service may be underloaded; consider optimizing resources.

      # Rule 4: inference latency too high
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m])) > 0.5
        for: 2m
        labels:
          severity: warning
          service: videoagent-detection
        annotations:
          summary: "Inference latency too high (instance {{ $labels.instance }})"
          description: |
            The P95 inference latency exceeds 0.5 seconds.
            Current P95 latency: {{ $value }} seconds
            This may violate real-time requirements.

      # Rule 5: detection service down
      - alert: DetectionServiceDown
        expr: up{job="videoagent_screenfilter"} == 0
        for: 1m
        labels:
          severity: critical
          service: videoagent-detection
        annotations:
          summary: "Detection service down (instance {{ $labels.instance }})"
          description: |
            The video detection service has been down for more than 1 minute.
            Check the service status and logs immediately.

      # Rule 6: processing throughput dropped
      # Note: this also fires while the service is simply idle, so tune the
      # threshold to your expected baseline traffic.
      - alert: LowProcessingThroughput
        expr: rate(detection_frames_total[5m]) < 5
        for: 5m
        labels:
          severity: warning
          service: videoagent-detection
        annotations:
          summary: "Processing throughput too low (instance {{ $labels.instance }})"
          description: |
            The frame processing rate has stayed below 5 frames/second for 5 minutes.
            Current rate: {{ $value }} frames/second
            There may be a performance bottleneck.
```

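4.3 Loading the Alert Rules in Prometheus

Edit the Prometheus configuration to reference the rules file:

```yaml
# prometheus.yml
rule_files:
  - "gpu_alerts.yml"
  # other alert rule files...
```

Restart the Prometheus service for the configuration to take effect:

```bash
# Check the configuration file syntax
promtool check config prometheus.yml

# Restart Prometheus (depending on how it is deployed)
systemctl restart prometheus
# or
docker-compose restart prometheus
```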

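5. Building a Grafana Monitoring Dashboard

With metric data available, we can build a monitoring dashboard in Grafana. The panels below outline a complete dashboard layout.

5.1 GPU Monitoring Panels

Create a set of panels dedicated to GPU status:

Panel 1: Real-time GPU utilization curve

Query: gpu_utilization_percent
Visualization: Time series
Settings:
- Show the utilization curves for all GPUs
- Add threshold lines (70% warning, 90% critical)
- Y-axis range: 0-100%
Alert: highlight when any GPU exceeds 90%

Panel 2: GPU memory usage

Query 1: gpu_memory_used_mb (memory in use)
Query 2: gpu_memory_total_mb (total memory)
Visualization: Stat + Gauge
Settings:
- Show each GPU's memory usage as a percentage (100 * gpu_memory_used_mb / gpu_memory_total_mb)
- Use a gauge to display the usage ratio
- Add color coding (green <70%, yellow 70-85%, red >85%)

Panel 3: GPU temperature (if the hardware supports it)

Query: gpu_temperature_celsius (requires additional collection; metrics.py above does not define it)
Visualization: Gauge
Settings: temperature threshold alert (for example >80°C)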

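5.2 Detection Service Performance Panels

Panel 4: Request statistics

Query 1: rate(detection_requests_total[5m]) (request rate)
Query 2: sum(detection_requests_total) (total requests)
Visualization: Stat + Time series
Settings: break down by type (image/video)

Panel 5: Frame throughput

Query: rate(detection_frames_total[5m])
Visualization: Time series
Settings:
- Show frames processed per second
- Add a target line (for example 30 fps)
- Show the average over the last 1 hour / 24 hours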
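To see what `rate()` does with a counter such as `detection_frames_total`, here is a simplified sketch in plain Python. It treats a drop in the counter as a restart from zero, which is how Prometheus handles counter resets, but it omits the extrapolation to the window boundaries that the real `rate()` applies, so results will differ slightly from PromQL.

```python
def simple_rate(samples):
    """Approximate per-second rate from (timestamp, counter_value) samples.

    A value lower than the previous one is treated as a counter reset.
    Unlike PromQL's rate(), no extrapolation to window boundaries is done.
    """
    if len(samples) < 2:
        return None  # not enough data points in the window
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarted at 0, so the increase is curr itself
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None
```

For example, three samples taken 10 seconds apart with values 100, 400, and 700 give an increase of 600 over 20 seconds, or 30 frames per second. Because resets are absorbed this way, a service restart does not produce a negative spike on the throughput panel.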

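Panel 6: Object detection statistics

Query: rate(detection_objects_total[5m])
Visualization: Time series + Pie chart
Settings:
- Show detection frequency per class
- Pie chart for the share of each class
- Table for the top N classes

Panel 7: Inference latency distribution

Query 1: histogram_quantile(0.95, rate(inference_duration_seconds_bucket[5m])) (P95 latency)
Query 2: histogram_quantile(0.50, rate(inference_duration_seconds_bucket[5m])) (median latency)
Visualization: Time series
Settings: dual Y-axis showing P95 and median latency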
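The P95 and median values in Panel 7 are estimates, not exact measurements: `histogram_quantile` only knows how many observations fell into each bucket and interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that idea in plain Python for a single histogram. It is an illustration of the interpolation, not Prometheus's actual code, and it skips edge cases such as non-monotonic buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return lower_bound
            in_bucket = cumulative - count_below
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cumulative
    return math.nan
```

With the bucket boundaries from metrics.py and, say, 90 of 100 observations at or below 0.2 s and 98 at or below 0.5 s, the P95 estimate lands at 0.3875 s, somewhere between the two boundaries. The practical consequence is that accuracy depends on bucket placement: if the 0.5 s alert threshold matters, it is worth having bucket boundaries close to it.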

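5.3 System Resource Panels

Panel 8: CPU and memory usage

Query 1: rate(process_cpu_seconds_total[5m]) * 100 (CPU usage)
Query 2: process_resident_memory_bytes / 1024 / 1024 (memory usage in MB)
Visualization: Time series + Gauge
Settings: show process-level resource usage

Panel 9: Active connections

Query: http_requests_in_progress (if exposed)
Visualization: Stat
Settings: show the number of requests currently in flight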

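5.4 Exporting the Grafana Dashboard Configuration

Once the panels above are combined into a dashboard, it can be exported as a JSON configuration file:

```json
{
  "dashboard": {
    "title": "VideoAgentTrek-ScreenFilter Monitoring Dashboard",
    "tags": ["video", "detection", "gpu", "prometheus"],
    "timezone": "browser",
    "panels": [
      // detailed configuration of each panel...
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "query": "label_values(up, service)",
          "refresh": 1,
          "type": "query"
        },
        {
          "name": "gpu_id",
          "query": "label_values(gpu_utilization_percent, gpu_id)",
          "refresh": 1,
          "type": "query"
        }
      ]
    },
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "refresh": "10s"
  },
  "overwrite": true
}
```

The panels entry is elided here, and the // comment must be replaced with real panel definitions before the file is valid JSON. The outer wrapper with "dashboard" and "overwrite" is the shape used by Grafana's HTTP API for creating dashboards; when importing through the Grafana UI, use the inner dashboard object.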

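6. In Practice: The Full Flow from Alert to Resolution

Let's walk through a realistic scenario to see how this monitoring and alerting setup behaves.

6.1 Scenario: GPU Utilization Stays High

Timeline:

15:00 - The video detection service starts processing a batch of HD videos
15:05 - GPU utilization reaches 92%; at the next rule evaluation HighGPUUtilization enters the pending state
15:06 - Utilization has been above 90% for 1 minute; the alert is still pending because the for: 5m condition is not yet met
15:10 - GPU utilization is still at 93%; the 5-minute condition is satisfied and the alert fires
15:10:01 - Alertmanager receives the alert and sends notifications according to its routing rules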
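The pending-then-firing behavior in the timeline comes from the `for:` clause. The sketch below simulates it for a simple `value > threshold` rule: the alert is pending from the first evaluation where the condition holds, fires once it has held for the full duration, and resets as soon as one evaluation falls below the threshold. It is a simplified model for illustration, with one sample per evaluation.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) per evaluation for a `> threshold` rule.

    `samples` is a list of (timestamp_seconds, value) pairs, one per evaluation.
    """
    states = []
    active_since = None  # time the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            active_since = None  # any dip below the threshold resets the timer
            state = "inactive"
        states.append((ts, state))
    return states
```

The reset rule is worth keeping in mind when choosing the duration: a workload that oscillates around 90% may never fire a `for: 5m` alert even though the GPU is effectively saturated.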

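Example alert notification:

```text
[WARNING] HighGPUUtilization - GPU utilization too high
Service: videoagent-detection
Instance: primary
GPU: gpu_0
Current utilization: 93%
Duration: 5 minutes
Threshold: 90%
Description: GPU gpu_0 utilization has exceeded 90% for 5 minutes
Suggestion: check the model load or consider scaling out.
Time: 2024-01-15 15:10:01
```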

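6.2 Troubleshooting and Resolution Steps

Once the alert arrives, the operator can act right away:

Step 1: Check the live monitoring

Open the Grafana dashboard and confirm that GPU utilization really is staying high
Check whether the request volume rose abnormally over the same period
Check whether inference latency rose along with it

Step 2: Log in to the server and inspect

```bash
# Check GPU status
nvidia-smi

# View the service log
tail -100 /root/workspace/videoagent-screenfilter.log

# Check the process status
supervisorctl status videoagent-screenfilter

# See the connections currently being handled
netstat -anp | grep 7860
```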

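Step 3: Analyze possible causes

Legitimate high load: there really are many videos to process
Misconfigured parameters: the batch size is too large
Resource contention: other processes are using the GPU
Memory leak: GPU memory keeps growing

Step 4: Take action

```python
# If the load is legitimate, consider adjusting concurrency dynamically.
# Sketch only: `app` is the existing Flask app, and the initial limit of 4
# is an assumed value. The request handlers must also consult
# MAX_CONCURRENT_REQUESTS for this to have any effect.
import pynvml

MAX_CONCURRENT_REQUESTS = 4


def get_gpu_utilization(gpu_index=0):
    """Return the current utilization (%) of one GPU via NVML."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu


# POST rather than GET, because the call changes server state
@app.route('/adjust-concurrency', methods=['POST'])
def adjust_concurrency():
    """Adjust the processing concurrency limit based on GPU load."""
    global MAX_CONCURRENT_REQUESTS

    current_utilization = get_gpu_utilization()

    if current_utilization > 90:
        # GPU utilization is too high: reduce concurrency
        MAX_CONCURRENT_REQUESTS = max(1, MAX_CONCURRENT_REQUESTS - 1)
        return {"status": "reduced", "new_limit": MAX_CONCURRENT_REQUESTS}
    elif current_utilization < 30:
        # GPU utilization is low: increase concurrency
        MAX_CONCURRENT_REQUESTS = min(10, MAX_CONCURRENT_REQUESTS + 1)
        return {"status": "increased", "new_limit": MAX_CONCURRENT_REQUESTS}
    else:
        return {"status": "unchanged", "current_limit": MAX_CONCURRENT_REQUESTS}
```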
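Changing a limit variable only helps if something enforces it. One way to do that, sketched here under the assumption that requests are handled on threads, is a small gate whose limit can be raised or lowered while requests are in flight. `AdjustableLimiter` is a hypothetical helper, not part of VideoAgentTrek-ScreenFilter.

```python
import threading


class AdjustableLimiter:
    """A concurrency gate whose limit can change at runtime."""

    def __init__(self, limit):
        self._limit = limit
        self._active = 0
        self._cond = threading.Condition()

    def set_limit(self, limit):
        with self._cond:
            self._limit = max(1, limit)
            self._cond.notify_all()  # wake waiters in case the limit was raised

    def acquire(self, timeout=None):
        """Block until a slot is free; return False if the timeout expires."""
        with self._cond:
            ok = self._cond.wait_for(lambda: self._active < self._limit, timeout)
            if ok:
                self._active += 1
            return ok

    def release(self):
        with self._cond:
            self._active -= 1
            self._cond.notify()
```

A request handler would call `acquire()` before running inference and `release()` in a `finally` block. Lowering the limit does not interrupt work already running; it only delays new requests until enough in-flight ones finish.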

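Step 5: Verify the fix

Watch whether GPU utilization drops
Confirm the service is still handling requests normally
Keep observing for a while to make sure the problem does not recur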

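6.3 Automated Remediation Script

For common problems, we can write an automated remediation script:

```python
# auto_remediate.py - automatic remediation script
import subprocess
import time

import requests

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"


def check_and_remediate():
    """Check GPU utilization and remediate if it has stayed high."""
    try:
        # Average over 5 minutes so a brief spike does not trigger remediation
        response = requests.get(
            PROMETHEUS_URL,
            params={'query': 'avg_over_time(gpu_utilization_percent[5m])'},
            timeout=10,
        )
    except requests.RequestException as e:
        print(f"Failed to query Prometheus: {e}")
        return

    if response.status_code != 200:
        print(f"Failed to query Prometheus: {response.status_code}")
        return

    for result in response.json()['data']['result']:
        gpu_id = result['metric']['gpu_id']
        utilization = float(result['value'][1])
        print(f"GPU {gpu_id}: {utilization:.1f}%")

        if utilization > 90:
            print(f"GPU {gpu_id} utilization too high, attempting remediation...")
            # Option 1: adjust processing parameters (least disruptive)
            # Option 2: restart the service (blunt but effective) if that fails
            if not adjust_processing_params():
                restart_service()
            # Option 3: shift load (if there are multiple instances)
            transfer_load()


def restart_service():
    """Restart the detection service."""
    print("Restarting detection service...")
    try:
        result = subprocess.run(
            ["supervisorctl", "restart", "videoagent-screenfilter"],
            capture_output=True,
            text=True,
        )
        print(f"Restart result: {result.stdout}")
    except Exception as e:
        print(f"Restart failed: {e}")


def adjust_processing_params():
    """Adjust processing parameters; return True on success."""
    print("Adjusting processing parameters...")
    params = {
        'conf_threshold': 0.35,  # raised from 0.25 so fewer boxes are kept
        'iou_threshold': 0.35,   # lowered from 0.45
        'batch_size': 1,         # one image at a time to reduce GPU memory use
    }
    # /update-params is not defined elsewhere in this article; the service
    # must implement it for this call to work.
    try:
        response = requests.post(
            "http://localhost:7860/update-params",
            json=params,
            timeout=10,
        )
        print(f"Parameter update result: {response.json()}")
        return response.ok
    except Exception as e:
        print(f"Parameter update failed: {e}")
        return False


def transfer_load():
    """Shift load to other instances."""
    print("Attempting to shift load...")
    # With a load balancer in place, adjust its weights here.
    # This is only a placeholder.
    pass


if __name__ == "__main__":
    # Check every 30 seconds
    while True:
        check_and_remediate()
        time.sleep(30)
```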
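A script that checks every 30 seconds can easily restart the service over and over while utilization is still settling. A common safeguard is a cooldown: allow a given action at most once per window. The sketch below shows one way to do that; `RemediationGuard` is a hypothetical helper, and the 600-second window in the usage note is an arbitrary example value.

```python
import time


class RemediationGuard:
    """Allow an action at most once per cooldown window."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the guard can be tested without sleeping
        self._last_run = None

    def allow(self):
        """Return True and start a new window if the cooldown has elapsed."""
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self._cooldown:
            return False
        self._last_run = now
        return True
```

In the script above, this could wrap the restart path, for example `restart_guard = RemediationGuard(600)` at module level and `if restart_guard.allow(): restart_service()` inside the loop, so that a restart is attempted no more than once every ten minutes.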