
Deploying the Qwen3-14B Image: Monitoring GPU, Memory, and Request Metrics with Prometheus + Grafana

news 2026/6/10 23:51:36

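1. Image Overview and Monitoring Requirements

The Qwen3-14B private deployment image gives developers an out-of-the-box LLM inference environment, but in production you also need real-time visibility into system resource usage and the state of the model service. Integrating Prometheus and Grafana provides:

GPU monitoring: GPU memory usage, utilization, temperature, and other key metrics
Memory monitoring: usage trends for system memory and GPU memory
Request monitoring: API call volume, response time, error rate, and so on
Alerting: threshold-based resource alerts that surface problems early

This setup is particularly well suited to long-running model services, where it helps developers tune resource allocation and troubleshoot issues.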

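2. Monitoring System Architecture

2.1 Core Components

The monitoring setup consists of four core components:

Prometheus: collects and stores metrics
Grafana: provides visualization dashboards
Node Exporter: collects basic host metrics
DCGM Exporter: dedicated to GPU monitoring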

2.2 Data Flow Diagram

    [Qwen3-14B service] → [Prometheus] ← [Node Exporter]
                               ↑
                      [Grafana Dashboard]

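3. Deploying the Monitoring Environment

3.1 Installing the Required Components

First, install the required tools inside the Qwen3-14B image environment: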

    # Install Prometheus
    wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
    tar xvfz prometheus-*.tar.gz
    mv prometheus-2.47.0.linux-amd64 /opt/prometheus

    # Install Grafana
    wget https://dl.grafana.com/enterprise/release/grafana-enterprise-10.2.0.linux-amd64.tar.gz
    # Extract directly into /opt/grafana so the result does not depend on the
    # name of the top-level directory inside the tarball
    mkdir -p /opt/grafana
    tar xvfz grafana-*.tar.gz -C /opt/grafana --strip-components=1

    # Install Node Exporter
    wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
    tar xvfz node_exporter-*.tar.gz
    mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

    # Pull the DCGM Exporter image
    docker pull nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04

3.2 Configuring Prometheus

Edit the Prometheus configuration file /opt/prometheus/prometheus.yml:

    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['localhost:9100']

      - job_name: 'dcgm'
        static_configs:
          - targets: ['localhost:9400']

      - job_name: 'qwen-api'
        metrics_path: '/metrics'
        static_configs:
          - targets: ['localhost:8000']
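The `qwen-api` job only returns data if the service on port 8000 actually answers `/metrics` in the Prometheus text exposition format. How the Qwen3-14B image does this is not described here, so the following is only a minimal standard-library sketch of such an endpoint. The class names `Metrics` and `MetricsHandler` and the `serve` helper are hypothetical; the metric names follow the queries used later in this article. A production service would more commonly rely on a Prometheus client library.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Metrics:
    """Thread-safe counters rendered in the Prometheus text exposition format."""

    def __init__(self):
        self._lock = threading.Lock()
        self._requests = {}        # status_code -> number of requests
        self._duration_sum = 0.0   # total seconds spent serving requests
        self._duration_count = 0   # number of observed requests

    def observe(self, status_code, seconds):
        """Record one finished request (hypothetical hook called by the API layer)."""
        with self._lock:
            key = str(status_code)
            self._requests[key] = self._requests.get(key, 0) + 1
            self._duration_sum += seconds
            self._duration_count += 1

    def render(self):
        """Return all metrics as exposition-format text."""
        with self._lock:
            lines = ["# TYPE http_requests_total counter"]
            for code, count in sorted(self._requests.items()):
                lines.append(f'http_requests_total{{status_code="{code}"}} {count}')
            lines.append("# TYPE http_request_duration_seconds summary")
            lines.append(f"http_request_duration_seconds_sum {self._duration_sum}")
            lines.append(f"http_request_duration_seconds_count {self._duration_count}")
            return "\n".join(lines) + "\n"


METRICS = Metrics()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted (port 8000 matches the scrape target above)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `METRICS.observe(200, 0.42)` after each handled request and running `serve()` in a background thread would be enough for Prometheus to start scraping the counters.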

3.3 Starting the Monitoring Services

Create a startup script named start_monitoring.sh:

    #!/bin/bash

    # Start Node Exporter
    nohup node_exporter > /var/log/node_exporter.log 2>&1 &

    # Start DCGM Exporter
    docker run -d --rm --gpus all --name dcgm-exporter \
      -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04

    # Start Prometheus
    cd /opt/prometheus
    nohup ./prometheus --config.file=prometheus.yml > /var/log/prometheus.log 2>&1 &

    # Start Grafana
    cd /opt/grafana/bin
    nohup ./grafana-server > /var/log/grafana.log 2>&1 &

Make the script executable and run it:

    chmod +x start_monitoring.sh
    ./start_monitoring.sh
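Because every service is started in the background, a failed start is easy to miss. The sketch below probes each component over HTTP and reports which ones respond. It assumes the default ports used in this article; `ENDPOINTS`, `http_status`, and `check_endpoints` are illustrative names, not part of any of the tools.

```python
import urllib.error
import urllib.request

# Default ports as used in this guide; adjust them if your setup differs.
ENDPOINTS = {
    "node_exporter": "http://localhost:9100/metrics",
    "dcgm_exporter": "http://localhost:9400/metrics",
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def check_endpoints(endpoints, fetch=http_status):
    """Map each component name to True if its endpoint answered with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Running `check_endpoints(ENDPOINTS)` a few seconds after the startup script finishes returns a dictionary such as `{"prometheus": True, ...}`; any `False` entry points to the log file under /var/log that is worth reading first.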

4. Configuring Grafana Dashboards

4.1 Basic Setup

Open Grafana: http://localhost:3000
Default username/password: admin/admin
Add a Prometheus data source:
- URL: http://localhost:9090
- Access: Server
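The same data source can also be registered without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the request; the function name `build_datasource_request` is hypothetical, and the credentials are the defaults listed above. In the API payload, the "Server" access mode is spelled `proxy`.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build a POST /api/datasources request that registers Prometheus."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",   # shown as "Server" in the Grafana UI
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request, for example `urllib.request.urlopen(build_datasource_request("http://localhost:3000", "admin", "admin", "http://localhost:9090"))`.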

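4.2 Importing the Prebuilt Dashboard

We provide a monitoring dashboard designed specifically for Qwen3-14B. It contains the following key panels:

GPU monitoring:
- GPU memory usage
- GPU utilization
- Temperature
- Power consumption
System resources:
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
API service:
- Request rate
- Response time
- Error rate
- Concurrent requests

Download the dashboard JSON configuration file:

wget https://example.com/qwen-monitoring-dashboard.json

In the Grafana UI, choose "Import" and select this file.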

5. Key Monitoring Metrics Explained

5.1 GPU Metrics

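    # GPU memory usage (%)
    # DCGM_FI_DEV_FB_TOTAL is typically not among the counters DCGM Exporter
    # exports by default, so the total is derived from used + free here; divide
    # by DCGM_FI_DEV_FB_TOTAL instead if that counter is enabled in your setup
    100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

    # GPU utilization (%)
    DCGM_FI_DEV_GPU_UTIL

    # GPU temperature (°C)
    DCGM_FI_DEV_GPU_TEMP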
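These expressions can also be evaluated outside Grafana through the Prometheus HTTP API (`/api/v1/query`), which is handy for scripts and quick checks. The sketch below builds the query URL and parses an instant-vector response. The helper names are hypothetical, and it assumes that DCGM Exporter attaches a `gpu` label to each series.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(response_text):
    """Return {gpu label: value} from an /api/v1/query JSON response."""
    doc = json.loads(response_text)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error', 'unknown error')}")
    values = {}
    for sample in doc["data"]["result"]:
        gpu = sample["metric"].get("gpu", "unknown")   # assumed label name
        _timestamp, value = sample["value"]
        values[gpu] = float(value)   # Prometheus encodes sample values as strings
    return values


def query_gpu_util(base_url="http://localhost:9090"):
    """Fetch the current DCGM_FI_DEV_GPU_UTIL value for every GPU."""
    url = build_query_url(base_url, "DCGM_FI_DEV_GPU_UTIL")
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_instant_vector(resp.read().decode("utf-8"))
```

With Prometheus running, `query_gpu_util()` returns a dictionary keyed by GPU index; the parsing step is kept separate so it can be exercised without a live server.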

5.2 System Resource Metrics

    # CPU usage (%)
    100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    # Memory in use (bytes)
    node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

    # Disk usage of the root filesystem (%)
    100 * (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"}

5.3 API Service Metrics

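    # Request rate by status code (requests per second)
    sum(rate(http_requests_total[1m])) by (status_code)

    # Average response time over the last 5 minutes (seconds)
    # Dividing the raw _sum by _count would give the average since process
    # start, so both sides are wrapped in rate()
    sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

    # Error rate (share of 5xx responses)
    sum(rate(http_requests_total{status_code=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))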
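To make the error-rate expression concrete, the sketch below reproduces its arithmetic on raw counter samples: a per-second rate for each counter, then the ratio of the two. It is a simplification of what `rate()` does (Prometheus additionally extrapolates to the edges of the time window), and the function names are illustrative only.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from 0.
        increase += (curr - prev) if curr >= prev else curr
    return increase


def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


def error_ratio(error_samples, total_samples):
    """Share of requests that failed, as a fraction between 0 and 1."""
    total_rate = per_second_rate(total_samples)
    return per_second_rate(error_samples) / total_rate if total_rate > 0 else 0.0
```

For example, if the total counter grows from 100 to 220 within 60 seconds while the 5xx counter grows from 5 to 11, the rates are 2 and 0.1 requests per second and the error ratio is 0.05, i.e. 5%.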

6. Configuring Alert Rules

Define the key alert rules for Prometheus in /opt/prometheus/alerts.yml:

    groups:
      - name: qwen-alerts
        rules:
          - alert: HighGPUUsage
            expr: DCGM_FI_DEV_GPU_UTIL > 90
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High GPU utilization on {{ $labels.instance }}"
              description: "GPU utilization is {{ $value }}%"

          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}%"

          - alert: APIErrorRateHigh
            # Aggregate by instance so that {{ $labels.instance }} is populated;
            # a plain sum() would drop every label
            expr: sum by (instance) (rate(http_requests_total{status_code=~"5.."}[1m])) / sum by (instance) (rate(http_requests_total[1m])) > 0.05
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High API error rate on {{ $labels.instance }}"
              description: "Error rate is {{ $value }}"
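The `for: 5m` clause means a rule does not fire the moment its expression becomes true: the alert first enters a pending state and only fires once the condition has held at every evaluation for five minutes. The sketch below simulates that behavior for a simple threshold rule. It is a simplified model with hypothetical names, not Prometheus code.

```python
def alert_states(samples, threshold, for_seconds):
    """Return [(timestamp, state)] for successive evaluations of `value > threshold`.

    samples: list of (timestamp_seconds, value), one entry per rule evaluation.
    state is "inactive", "pending", or "firing".
    """
    active_since = None   # time at which the condition first became true
    states = []
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held_for = ts - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            state = "inactive"
        states.append((ts, state))
    return states
```

With one evaluation per minute, a GPU utilization series that dips below 90 even once restarts the five-minute timer, which is why short spikes do not trigger HighGPUUsage.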

Update the Prometheus configuration to reference the alert rules: