Microservice Monitoring: Prometheus and Grafana in Practice

news 2026/7/18 10:15:52

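Hi everyone, I'm Ouyang Rui (Rich Own). Today I'd like to talk about microservice monitoring. As a full-stack developer, I see monitoring as the key to keeping a system running stably, so here is some hands-on experience with Prometheus and Grafana.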

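Why monitor?

Scenario	What it gives you
Troubleshooting	Locate the failing component quickly
Performance tuning	Find performance bottlenecks
Capacity planning	Forecast resource needs
Security auditing	Trace anomalous behavior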

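Introduction to Prometheus

Prometheus is an open-source monitoring system with the following features:

A multi-dimensional data model (a metric name plus key/value labels)
A flexible query language (PromQL)
An efficient time series database
Built-in alerting rules (notifications are routed through Alertmanager)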
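To make the multi-dimensional data model concrete: each time series is identified by a metric name plus a set of labels. The sketch below renders one sample roughly the way it appears in the text exposition format. `format_sample` is a hypothetical helper written for this article, not part of any Prometheus library, and it skips the escaping of special characters in label values.

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a line of Prometheus-style text output."""
    if not labels:
        return f"{name} {value}"
    # Sort the labels so the same label set always renders identically.
    pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


print(format_sample("http_requests_total", {"method": "GET", "endpoint": "/"}, 3))
```

Two samples with the same name but different label values are different time series, which is what lets PromQL filter and aggregate by label.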

Installing Prometheus

    # Install with Docker
    docker run -d --name prometheus \
      -p 9090:9090 \
      -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
      prom/prometheus

Configuration file

    # prometheus.yml
    global:
      scrape_interval: 15s       # how often targets are scraped
      evaluation_interval: 15s   # how often rules are evaluated

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']

      - job_name: 'api-service'
        metrics_path: '/metrics'
        static_configs:
          - targets: ['api-service:3000']
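As a rough illustration of how a scrape job turns into HTTP requests, the sketch below builds the URL for each static target. `scrape_urls` and the `api_job` dictionary are hypothetical stand-ins for the parsed YAML; the defaults used (`http` and `/metrics`) are the ones Prometheus applies when `scheme` and `metrics_path` are omitted.

```python
def scrape_urls(job: dict) -> list[str]:
    """Build the URLs at which a job's static targets would be scraped."""
    scheme = job.get("scheme", "http")          # default scheme
    path = job.get("metrics_path", "/metrics")  # default metrics path
    urls = []
    for group in job.get("static_configs", []):
        for target in group.get("targets", []):
            urls.append(f"{scheme}://{target}{path}")
    return urls


# Hypothetical dictionary mirroring the 'api-service' job from the config.
api_job = {
    "job_name": "api-service",
    "static_configs": [{"targets": ["api-service:3000"]}],
    "metrics_path": "/metrics",
}
print(scrape_urls(api_job))
```

Because `/metrics` is already the default, the explicit `metrics_path` in the `api-service` job is optional.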

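Metric types

    from prometheus_client import Counter, Gauge, Histogram, Summary

    # Counter: a value that only goes up (and resets when the process restarts)
    http_requests_total = Counter('http_requests_total', 'Total HTTP requests')

    # Gauge: a value that can go up and down
    memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')

    # Histogram: observations counted into configurable buckets
    request_duration = Histogram('request_duration_seconds', 'Request duration')

    # Summary: count and sum of observations
    response_size = Summary('response_size_bytes', 'Response size')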
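A counter's raw value is rarely graphed directly; what matters is how fast it grows. The sketch below shows the idea behind that calculation, including how a restart (the value dropping back toward zero) can be handled. `simple_rate` is a hypothetical, simplified helper: PromQL's `rate()` additionally extrapolates to the edges of the time window, which is left out here.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a restart).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the whole
        # current value counts as increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# 120 requests in 60 seconds is 2 requests per second.
print(simple_rate([(0, 0), (60, 120)]))
```

A gauge needs no such treatment because its current value is already meaningful on its own.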

Hands-on: monitoring an API service

    from flask import Flask
    from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

    app = Flask(__name__)

    REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
    DURATION = Histogram('request_duration_seconds', 'Request duration')

    @app.route('/')
    @DURATION.time()  # record how long each call to index() takes
    def index():
        REQUESTS.labels(method='GET', endpoint='/').inc()
        return 'Hello World'

    @app.route('/metrics')
    def metrics():
        # Serve all registered metrics in the Prometheus text format
        return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

    if __name__ == '__main__':
        # Listen on all interfaces so a Prometheus container can reach the service
        app.run(host='0.0.0.0', port=3000)

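Grafana configuration

    # Install Grafana with Docker
    # Note: if the API service above already uses port 3000 on this host,
    # map Grafana to a different host port, for example -p 3001:3000
    docker run -d --name grafana \
      -p 3000:3000 \
      -v /path/to/grafana-data:/var/lib/grafana \
      grafana/grafana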

Configuring the data source

    # Add the Prometheus data source (a Grafana provisioning file)
    apiVersion: 1

    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus:9090
        access: proxy
        isDefault: true

Creating a dashboard

    {
      "dashboard": {
        "id": null,
        "title": "API Monitoring",
        "panels": [
          {
            "type": "graph",
            "title": "Request rate",
            "targets": [
              {
                "expr": "rate(http_requests_total[5m])",
                "legendFormat": "{{method}} {{endpoint}}"
              }
            ]
          },
          {
            "type": "graph",
            "title": "Request latency",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))",
                "legendFormat": "P95"
              }
            ]
          }
        ]
      }
    }
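The P95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The sketch below is a simplified Python rendition of that idea, assuming observations are spread evenly inside each bucket; the function and the sample bucket values are illustrative and do not reproduce every edge case of the real PromQL function.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite bound instead of interpolating.
                return lower_bound
            # Interpolate linearly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up bucket counts: 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1s, 100 total.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, sample_buckets))
```

The estimate can only be as precise as the bucket boundaries, so buckets are worth choosing around the latencies you care about.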

Alerting configuration

    # alerting_rules.yml (load it through rule_files in prometheus.yml)
    groups:
      - name: api-alerts
        rules:
          - alert: HighErrorRate
            # http_errors_total is assumed to be exported by the service;
            # the sample app above does not define it
            expr: rate(http_errors_total[5m]) > 0.1
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is {{ $value }} errors/s for API service"

          - alert: HighLatency
            expr: histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m])) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High latency detected"
              description: "P95 latency is {{ $value }}s"
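The `for: 5m` clause means an alert does not fire the moment its expression becomes true: it first sits in a pending state and fires only if the condition keeps holding for the whole duration. The sketch below models that behavior for a single alert; `alert_states` is a hypothetical helper and the evaluation interval is passed in explicitly.

```python
def alert_states(condition: list[bool], interval: int, for_seconds: int) -> list[str]:
    """Return the alert state after each rule evaluation.

    condition[i] says whether the rule expression held at evaluation i;
    evaluations happen every `interval` seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, holds in enumerate(condition):
        now = i * interval
        if not holds:
            # Any evaluation where the condition fails resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# One evaluation per minute with a 5 minute hold, as in the rules above.
print(alert_states([True] * 7, interval=60, for_seconds=300))
```

A short spike that clears before the hold duration elapses therefore never produces a notification, which cuts down on noisy alerts.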

Best practices

1. Metric naming conventions

    # <namespace>_<name>_<unit>, with a _total suffix for counters
    http_requests_total
    memory_usage_bytes
    request_duration_seconds
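A naming convention is easier to keep when it is checked automatically. The sketch below is one possible lint check; the regular expression is the character set Prometheus has traditionally accepted for metric names, while `check_metric_name` and the suffix list are assumptions made for this example and should be adapted to your own rules.

```python
import re

# Character set traditionally accepted for Prometheus metric names.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
# Hypothetical short list of accepted suffixes for this sketch.
UNIT_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio", "_info")


def check_metric_name(name: str) -> list[str]:
    """Return a list of problems found in a metric name (empty if none)."""
    problems = []
    if not METRIC_NAME.fullmatch(name):
        problems.append("invalid characters")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("missing unit or _total suffix")
    return problems


for candidate in ("http_requests_total", "request_duration_ms", "request-duration"):
    print(candidate, check_metric_name(candidate))
```

Base units such as seconds and bytes keep dashboards and alert thresholds comparable across services.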

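2. Label management

    from prometheus_client import Counter

    # Every label passed to .labels() must be declared when the metric is created
    REQUESTS = Counter(
        'http_requests_total', 'Total HTTP requests',
        ['method', 'endpoint', 'status_code'],
    )

    REQUESTS.labels(
        method='GET',
        endpoint='/api/users',
        status_code='200'
    ).inc()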
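Each distinct combination of label values becomes its own time series, so the number of series grows multiplicatively with the labels. The sketch below computes that upper bound; `series_count` and the sample values are hypothetical.

```python
from math import prod


def series_count(label_values: dict[str, set[str]]) -> int:
    """Upper bound on the time series of one metric: the product of
    the number of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Illustrative label values for a request counter.
labels = {
    "method": {"GET", "POST"},
    "endpoint": {"/", "/api/users"},
    "status_code": {"200", "404", "500"},
}
print(series_count(labels))
```

This is why labels should hold a small, bounded set of values; putting something like a user ID or a raw URL with parameters into a label can multiply the series count without limit.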

3. Visualization tips

    {
      "panels": [
        {
          "type": "stat",
          "title": "Average latency",
          "targets": [
            {
              "expr": "rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])"
            }
          ]
        },
        {
          "type": "gauge",
          "title": "Memory usage (%)",
          "targets": [
            {
              "expr": "memory_usage_bytes / memory_total_bytes * 100"
            }
          ]
        }
      ]
    }