Keeping a DeOldify Service Stable: A Complete Guide to Monitoring with Prometheus + Grafana
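1. Why monitor a DeOldify service
Once a DeOldify image colorization service is deployed, which operational problems come up most often? Perhaps users complain in the middle of the night that the service is unavailable, or GPU resources turn out to be exhausted with no obvious cause. Both point to the same need: a proper monitoring system.
As a deep-learning image processing service, DeOldify has several typical characteristics:
- Resource-intensive: model inference needs a large amount of GPU and system memory
- Long-running: the service is usually expected to run 24/7
- Latency-sensitive: users expect colorized results quickly
A system without monitoring is like a car without a dashboard. You cannot tell:
- whether the service is currently healthy
- whether GPU and memory resources are sufficient
- whether request processing speed is normal
- whether there are latent performance bottlenecks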
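2. Monitoring architecture
2.1 Core components
We use the Prometheus + Grafana combination, one of the most widely used open-source monitoring stacks:
- Prometheus: scrapes and stores metrics
- Grafana: visualizes the data and displays alerts
- Node Exporter: collects host system metrics
- NVIDIA GPU Exporter: collects GPU metrics
- Alertmanager: routes alerts fired by Prometheus to notification receivers (section 6.2)
2.2 Data flow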
    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │             │    │             │    │             │
    │  DeOldify   │    │ Prometheus  │    │   Grafana   │
    │   service   │◄──►│   server    │◄──►│  platform   │
    │ (port 7860) │    │ (port 9090) │    │ (port 3000) │
    │             │    │             │    │             │
    └─────────────┘    └─────────────┘    └─────────────┘
           │                  │                  │
           │ metrics          │ queries          │ dashboard config
           ▼                  ▼                  ▼
    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │    Node     │    │   Alert-    │    │ Monitoring  │
    │  Exporter   │    │   manager   │    │ dashboards  │
    │ (port 9100) │    │ (port 9093) │    │             │
    └─────────────┘    └─────────────┘    └─────────────┘
3. Setting up the monitoring environment
3.1 Installing Prometheus
    # Create a dedicated user
    sudo useradd --no-create-home --shell /bin/false prometheus

    # Create the config and data directories
    sudo mkdir -p /etc/prometheus /var/lib/prometheus

    # Download and unpack
    cd /tmp
    wget https://github.com/prometheus/prometheus/releases/download/v2.47.2/prometheus-2.47.2.linux-amd64.tar.gz
    tar -xvf prometheus-2.47.2.linux-amd64.tar.gz
    cd prometheus-2.47.2.linux-amd64

    # Copy the binaries, console assets, and default config
    sudo cp prometheus promtool /usr/local/bin/
    sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
    sudo cp -r consoles console_libraries /etc/prometheus
    sudo cp prometheus.yml /etc/prometheus/
    sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

    # Create the systemd unit (the quoted 'EOF' keeps the backslashes literal)
    sudo tee /etc/systemd/system/prometheus.service > /dev/null <<'EOF'
    [Unit]
    Description=Prometheus Monitoring System
    After=network.target

    [Service]
    User=prometheus
    Group=prometheus
    ExecStart=/usr/local/bin/prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/var/lib/prometheus/ \
      --web.console.templates=/etc/prometheus/consoles \
      --web.console.libraries=/etc/prometheus/console_libraries \
      --web.listen-address=0.0.0.0:9090
    Restart=always
    RestartSec=3

    [Install]
    WantedBy=multi-user.target
    EOF
3.2 Installing Node Exporter
    # Create a system user
    sudo useradd --no-create-home --shell /bin/false node_exporter

    # Download and install
    cd /tmp
    wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
    tar -xvf node_exporter-1.6.1.linux-amd64.tar.gz
    cd node_exporter-1.6.1.linux-amd64
    sudo cp node_exporter /usr/local/bin/
    sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

    # Create the systemd unit
    sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
    [Unit]
    Description=Node Exporter
    After=network.target

    [Service]
    User=node_exporter
    Group=node_exporter
    ExecStart=/usr/local/bin/node_exporter

    [Install]
    WantedBy=multi-user.target
    EOF
3.3 Installing the NVIDIA GPU Exporter
    # Install build dependencies
    sudo apt-get update
    sudo apt-get install -y golang-go git make

    # Download and build from source
    # (build target and output path as given in the original steps;
    #  check the repository README if they differ for your checkout)
    git clone https://github.com/utkuozdemir/nvidia_gpu_exporter.git
    cd nvidia_gpu_exporter
    make build

    # Install the binary and create the systemd unit
    sudo cp bin/nvidia_gpu_exporter /usr/local/bin/
    sudo chmod +x /usr/local/bin/nvidia_gpu_exporter
    sudo tee /etc/systemd/system/nvidia_gpu_exporter.service > /dev/null <<'EOF'
    [Unit]
    Description=NVIDIA GPU Exporter
    After=network.target

    [Service]
    Type=simple
    User=root
    ExecStart=/usr/local/bin/nvidia_gpu_exporter
    Restart=always

    [Install]
    WantedBy=multi-user.target
    EOF
3.4 Installing Grafana
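    # Add the Grafana APT repository
    # (apt-key is deprecated on current Debian/Ubuntu, so the signing key goes into a keyring file)
    sudo apt-get install -y apt-transport-https software-properties-common wget gnupg
    sudo mkdir -p /etc/apt/keyrings
    wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
    echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
    sudo apt-get update

    # Install Grafana
    sudo apt-get install -y grafana

    # Enable and start the service
    sudo systemctl enable grafana-server
    sudo systemctl start grafana-server
4. Configuring metric collection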
4.1 Configuring Prometheus scrape targets
Edit /etc/prometheus/prometheus.yml:
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    # Load the alert rules created in section 6.1
    rule_files:
      - /etc/prometheus/alert_rules.yml

    # Send firing alerts to Alertmanager (section 6.2)
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['localhost:9093']

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'node'
        static_configs:
          - targets: ['localhost:9100']

      - job_name: 'nvidia_gpu'
        static_configs:
          - targets: ['localhost:9835']

      # The custom metrics are served by the exporter from section 4.2 on port 8000;
      # port 7860 is the DeOldify web UI and does not expose /metrics by itself
      - job_name: 'deoldify'
        metrics_path: '/metrics'
        static_configs:
          - targets: ['localhost:8000']
4.2 Adding custom metrics to DeOldify
Create /root/cv_unet_image-colorization/monitoring/metrics_exporter.py:
    import threading

    import psutil
    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    # Metric definitions
    REQUEST_COUNT = Counter('deoldify_requests_total', 'Total number of requests', ['method', 'endpoint'])
    REQUEST_DURATION = Histogram('deoldify_request_duration_seconds', 'Request handling time', ['endpoint'])
    ACTIVE_REQUESTS = Gauge('deoldify_active_requests', 'Number of requests currently in flight')
    PROCESSING_TIME = Histogram('deoldify_processing_seconds', 'Image processing time')
    SUCCESSFUL_PROCESSING = Counter('deoldify_success_total', 'Number of successfully processed images')
    FAILED_PROCESSING = Counter('deoldify_failures_total', 'Number of failed processing attempts', ['reason'])
    CPU_USAGE = Gauge('deoldify_cpu_usage_percent', 'CPU usage')
    MEMORY_USAGE = Gauge('deoldify_memory_usage_bytes', 'Resident memory of this process')
    GPU_USAGE = Gauge('deoldify_gpu_usage_percent', 'GPU usage')
    GPU_MEMORY = Gauge('deoldify_gpu_memory_bytes', 'GPU memory in use')


    class MetricsExporter:
        def __init__(self, port=8000):
            self.port = port
            self.stop_event = threading.Event()

        def start(self):
            """Serve /metrics on self.port and start the background collector."""
            start_http_server(self.port)
            threading.Thread(target=self._collect_system_metrics, daemon=True).start()

        def _collect_system_metrics(self):
            while not self.stop_event.is_set():
                try:
                    # cpu_percent(interval=1) blocks for one second while sampling
                    CPU_USAGE.set(psutil.cpu_percent(interval=1))
                    MEMORY_USAGE.set(psutil.Process().memory_info().rss)
                    self._collect_gpu_metrics()
                except Exception as e:
                    print(f"Error collecting metrics: {e}")
                # Pause 5 s between rounds; returns early once stop_event is set
                self.stop_event.wait(5)

        def _collect_gpu_metrics(self):
            try:
                # Placeholder: the actual GPU collection logic goes here
                pass
            except Exception:
                GPU_USAGE.set(0)
                GPU_MEMORY.set(0)

        def record_request(self, method, endpoint, duration, success=True, error_reason=None):
            REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
            REQUEST_DURATION.labels(endpoint=endpoint).observe(duration)
            if success:
                SUCCESSFUL_PROCESSING.inc()
            else:
                FAILED_PROCESSING.labels(reason=error_reason or 'unknown').inc()

        def record_processing_time(self, duration):
            PROCESSING_TIME.observe(duration)

        def increment_active_requests(self):
            ACTIVE_REQUESTS.inc()

        def decrement_active_requests(self):
            ACTIVE_REQUESTS.dec()


    exporter = MetricsExporter(port=8000)


    def start_metrics_server():
        exporter.start()
5. Grafana dashboard configuration
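Before building dashboards, note that the deoldify_* series only receive data if the exporter from section 4.2 is called from the request path. The following is a minimal sketch of one way to do that, not the service's actual integration: a context manager that wraps a request and reports its duration and outcome. It relies only on the method names defined in metrics_exporter.py; the endpoint name and the colorize call in the usage comment are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_request(exporter, method, endpoint):
    """Report one request to the exporter: active gauge, duration, outcome."""
    exporter.increment_active_requests()
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        # Record the failure with the exception class as the reason, then re-raise
        exporter.record_request(
            method, endpoint, time.perf_counter() - start,
            success=False, error_reason=type(exc).__name__,
        )
        raise
    else:
        exporter.record_request(method, endpoint, time.perf_counter() - start)
    finally:
        # Runs on both paths, so the active-requests gauge cannot drift upward
        exporter.decrement_active_requests()


# Hypothetical usage inside the service's handler:
#
#   from monitoring.metrics_exporter import exporter, start_metrics_server
#   start_metrics_server()
#   with track_request(exporter, "POST", "/colorize"):
#       result = colorize(image)   # placeholder for the real inference call
```

Passing the exporter in as an argument keeps the helper independent of prometheus_client, which makes it easy to exercise with a stub object.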
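5.1 Adding the data source
- Open Grafana (http://localhost:3000)
- In the left-hand menu, choose Configuration → Data Sources (in recent Grafana versions this is under Connections → Data sources)
- Click Add data source and select Prometheus
- Set the URL to http://localhost:9090
- Click Save & Test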
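The same data source can also be registered without the UI. The sketch below builds a POST request for Grafana's /api/datasources HTTP endpoint using only the standard library. It assumes basic authentication with the default admin/admin account, which you would replace with your own credentials or an API token; the function names are illustrative.

```python
import base64
import json
import urllib.request


def build_datasource_payload(name="Prometheus", url="http://localhost:9090"):
    """JSON body describing a Prometheus data source proxied through Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",
        "isDefault": True,
    }


def build_create_request(grafana_url, user, password, payload):
    """Prepare (but do not send) the POST /api/datasources request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def create_datasource(grafana_url="http://localhost:3000", user="admin", password="admin"):
    """Send the request; assumes Grafana is reachable and the credentials are valid."""
    request = build_create_request(grafana_url, user, password, build_datasource_payload())
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, json.loads(response.read().decode())
```

Calling create_datasource() once after Grafana has started should have the same effect as the manual steps above. Grafana rejects a second data source with the same name, so the call is not idempotent.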
5.2 Creating the monitoring dashboards
System resource panel:
CPU usage query (percent):
    100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage query (percent):
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
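To sanity-check the memory panel, the same formula can be applied directly to /proc/meminfo, which is where Node Exporter reads these values on Linux. The sketch below is a cross-check under that assumption rather than part of the monitoring stack; the function names and the sample text are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of values in bytes."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        amount = int(fields[0])
        # Most entries are reported in kB; convert those to bytes
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[key.strip()] = amount
    return values


def memory_usage_percent(meminfo):
    """Same formula as the panel: (MemTotal - MemAvailable) / MemTotal * 100."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) / total * 100


def read_local_memory_usage(path="/proc/meminfo"):
    """Compute the current value on the local Linux host."""
    with open(path) as f:
        return memory_usage_percent(parse_meminfo(f.read()))


# Made-up sample: 16 GB total, 4 GB available
SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
HugePages_Total:       0
"""
print(memory_usage_percent(parse_meminfo(SAMPLE)))  # 75.0
```

If read_local_memory_usage() and the Grafana panel disagree by more than a point or two, the panel is probably pointed at a different instance.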
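GPU monitoring panel:
The exporter installed in section 3.3 prefixes its metrics with nvidia_smi_; confirm the exact names against the output of curl http://localhost:9835/metrics before saving the panel.
GPU utilization (percent):
    nvidia_smi_utilization_gpu_ratio * 100
GPU memory used (MiB):
    nvidia_smi_memory_used_bytes / 1024 / 1024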
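The GPU panel values can be cross-checked against nvidia-smi itself. The sketch below parses the CSV output of an nvidia-smi query. The query fields are real nvidia-smi options, while the function names and sample output are illustrative. It assumes every field is numeric, so a GPU that reports "[N/A]" for a field would need extra handling.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU, without header or unit suffixes
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi_csv(output):
    """Turn rows like '0, 37, 6144, 24576' into dicts (memory values in MiB)."""
    gpus = []
    for line in output.strip().splitlines():
        index, util, used_mib, total_mib = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "utilization_percent": float(util),
            "memory_used_mib": float(used_mib),
            "memory_total_mib": float(total_mib),
        })
    return gpus


def read_gpus():
    """Run nvidia-smi on the local host; requires the NVIDIA driver to be installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_nvidia_smi_csv(result.stdout)


# Made-up sample output for a two-GPU host
SAMPLE = "0, 37, 6144, 24576\n1, 0, 512, 24576\n"
for gpu in parse_nvidia_smi_csv(SAMPLE):
    ratio = gpu["memory_used_mib"] / gpu["memory_total_mib"]
    print(gpu["index"], gpu["utilization_percent"], f"{ratio:.2%}")
```

The same parsed values could also feed the GPU_USAGE and GPU_MEMORY gauges in the placeholder _collect_gpu_metrics method from section 4.2, after converting MiB to bytes.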
Service performance panel:
Request rate (requests per second):
    rate(deoldify_requests_total[5m])
Average response time (seconds):
    rate(deoldify_request_duration_seconds_sum[5m]) / rate(deoldify_request_duration_seconds_count[5m])
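The average-response-time expression works because a histogram exposes two ever-growing counters: _sum (total seconds spent) and _count (number of requests). Dividing their rates over the same window gives seconds per request. The sketch below reproduces that arithmetic with made-up counter readings. It is a simplification: the real rate() also handles counter resets and extrapolates to the window edges.

```python
def average_latency(sum_start, sum_end, count_start, count_end, window_seconds):
    """Approximate rate(_sum[w]) / rate(_count[w]) from two counter readings."""
    sum_rate = (sum_end - sum_start) / window_seconds        # seconds of work per second
    count_rate = (count_end - count_start) / window_seconds  # requests per second
    if count_rate == 0:
        # No requests in the window; PromQL would yield NaN here
        return None
    return sum_rate / count_rate


# Made-up readings taken 5 minutes (300 s) apart:
#   _sum grew from 120.0 s to 180.0 s, _count grew from 400 to 600 requests
latency = average_latency(120.0, 180.0, 400, 600, 300)
print(f"{latency:.3f} s per request")  # 0.300 s per request
```

Here 60 s of processing were spread over 200 requests, so each request took 0.3 s on average. The window length cancels out of the ratio and only determines which requests are included.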
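6. Alerting
6.1 Configuring alert rules
Create /etc/prometheus/alert_rules.yml. Prometheus only loads this file if it is listed under rule_files in /etc/prometheus/prometheus.yml.
    groups:
      - name: deoldify_alerts
        rules:
          - alert: HighGPUMemoryUsage
            # Metric names assume the exporter from section 3.3 (nvidia_smi_ prefix)
            expr: nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "GPU memory usage is too high"
              description: "GPU memory usage has exceeded 90%"

          - alert: ServiceDown
            expr: up{job="deoldify"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "DeOldify service is down"
              description: "The service has stopped responding"
6.2 Configuring Alertmanager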
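Edit /etc/alertmanager/alertmanager.yml. This assumes Alertmanager is already installed as a systemd service listening on port 9093; the installation steps in section 3 do not cover it. Email delivery also needs SMTP settings: the smarthost and sender address below are placeholders to replace with your own, plus credentials if your mail server requires them.
    global:
      smtp_smarthost: 'smtp.example.com:587'   # placeholder
      smtp_from: 'alertmanager@example.com'    # placeholder

    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'email-notifications'

    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'admin@example.com'
            send_resolved: true
7. Startup and verification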
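7.1 Starting all services
    # Pick up the unit files created above
    sudo systemctl daemon-reload
    sudo systemctl enable prometheus node_exporter nvidia_gpu_exporter alertmanager grafana-server
    sudo systemctl start prometheus node_exporter nvidia_gpu_exporter alertmanager grafana-server
7.2 Verifying the monitoring system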
Check service status:
    sudo systemctl status prometheus
    sudo systemctl status node_exporter
    sudo systemctl status nvidia_gpu_exporter
Test metric collection:
    curl http://localhost:9100/metrics   # Node Exporter
    curl http://localhost:9835/metrics   # GPU Exporter
    curl http://localhost:8000/metrics   # DeOldify custom metrics
8. Summary
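With the deployment described in this article, your DeOldify image colorization service now has a complete monitoring setup:
- Resource monitoring: real-time visibility into CPU, memory, and GPU usage
- Performance monitoring: request processing time and success rate
- Alerting: timely notification when something goes wrong
- Visualization: dashboards that present the monitoring data at a glance
This monitoring setup helps you:
- find and resolve performance problems quickly
- plan capacity and scaling sensibly
- improve service stability and reliability
- improve the user experience
With these pieces in place, your DeOldify service has the core of a production-grade monitoring setup and can be operated with more confidence.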
