Microservice Monitoring: Prometheus and Grafana in Practice
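Hi everyone, I'm Ouyang Rui (Rich Own). Today I want to talk about microservice monitoring. As a full-stack developer, I see monitoring as key to keeping a system running reliably, so here is what I've learned from using Prometheus and Grafana in practice.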
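Why Monitor?
| Scenario | What it gives you |
|---|---|
| Troubleshooting | Locate problems quickly |
| Performance tuning | Find performance bottlenecks |
| Capacity planning | Forecast resource needs |
| Security auditing | Trace abnormal behavior |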
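Introducing Prometheus
Prometheus is an open-source monitoring system with the following features:
- A multi-dimensional data model (metric names plus key/value labels)
- A flexible query language (PromQL)
- An efficient time series database
- Built-in alerting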
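To make the multi-dimensional data model a little more concrete, here is a minimal standard-library sketch. It does not use Prometheus itself: `SAMPLES`, `select`, and `total` are hypothetical names, and the values are made up. It only illustrates the idea that a series is a metric name plus a label set, and that a selector such as `http_requests_total{method="GET"}` filters on those labels.

```python
from typing import Dict, List, Tuple

# A time series is identified by a metric name plus a set of label pairs.
Series = Tuple[str, Dict[str, str]]

# Hypothetical current values for three series of the same metric.
SAMPLES: List[Tuple[Series, float]] = [
    (("http_requests_total", {"method": "GET", "endpoint": "/"}), 120.0),
    (("http_requests_total", {"method": "POST", "endpoint": "/"}), 30.0),
    (("http_requests_total", {"method": "GET", "endpoint": "/users"}), 50.0),
]


def select(name: str, **matchers: str) -> List[Tuple[Series, float]]:
    """Return samples whose name matches and whose labels equal every matcher.

    Roughly what the PromQL selector name{label="value", ...} does.
    """
    return [
        (series, value)
        for series, value in SAMPLES
        if series[0] == name
        and all(series[1].get(key) == wanted for key, wanted in matchers.items())
    ]


def total(samples: List[Tuple[Series, float]]) -> float:
    """Add up the selected values, similar in spirit to sum(...) in PromQL."""
    return sum(value for _, value in samples)


get_requests = select("http_requests_total", method="GET")
print(len(get_requests), total(get_requests))
```

Adding a label multiplies the number of series rather than creating a new metric, which is what lets one query slice the same data by method, endpoint, or both.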
Installing Prometheus

```bash
# Install with Docker
docker run -d --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```

Configuration file
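
```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host names such as node-exporter and api-service resolve only when the
  # containers share a user-defined Docker network.
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'api-service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api-service:3000']
```

Metric types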
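Of the four metric types, the Counter needs the most care when queried: it only ever increases and starts again from zero when the process restarts, so it is normally read as a per-second rate rather than as a raw value. The sketch below is a simplified, standard-library illustration of that idea. `simple_rate` is a hypothetical helper, and unlike PromQL's `rate()` it does not extrapolate to the edges of the time window.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase across the samples, tolerating counter resets.

    Assumes the samples are ordered by strictly increasing timestamp.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0,
        # so the whole current value counts as new increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Scrapes 15 seconds apart, with a restart between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(scrapes))
```

A Gauge, by contrast, is read directly, because its current value is already the quantity of interest.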
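
```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: a value that only increases (and resets to zero on restart)
http_requests_total = Counter('http_requests_total', 'Total HTTP requests')

# Gauge: a value that can go up and down
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')

# Histogram: observations counted into buckets
request_duration = Histogram('request_duration_seconds', 'Request duration')

# Summary: running count and sum of observations
response_size = Summary('response_size_bytes', 'Response size')
```

Hands-on: Monitoring an API Service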

```python
from flask import Flask
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
DURATION = Histogram('request_duration_seconds', 'Request duration')


@app.route('/')
@DURATION.time()  # observe how long each call to index() takes
def index():
    REQUESTS.labels(method='GET', endpoint='/').inc()
    return 'Hello World'


@app.route('/metrics')
def metrics():
    # Serve the metrics in the Prometheus text exposition format
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}


if __name__ == '__main__':
    # Bind to all interfaces so Prometheus can scrape from another container
    app.run(host='0.0.0.0', port=3000)
```

Configuring Grafana
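
```bash
# Install Grafana with Docker
# If the API service already occupies host port 3000, map a different
# host port instead, e.g. -p 3001:3000
docker run -d --name grafana \
  -p 3000:3000 \
  -v /path/to/grafana-data:/var/lib/grafana \
  grafana/grafana
```

Configuring the data source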

```yaml
# Add the Prometheus data source
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true
```

Creating a dashboard
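The latency panel in this section relies on `histogram_quantile()`, which estimates a percentile from cumulative bucket counts instead of from individual request timings. As a rough illustration of that estimate, here is a standard-library sketch using linear interpolation inside the bucket that contains the target rank. `estimate_quantile` and the bucket values are hypothetical, and the sketch assumes positive bucket bounds and leaves out edge cases that Prometheus handles.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket, so the highest finite bound
        # is the best answer available.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical buckets: 50 requests took <= 0.1s, 90 took <= 0.5s, and so on.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(estimate_quantile(0.95, latency_buckets))
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.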

```json
{
  "dashboard": {
    "id": null,
    "title": "API Monitoring",
    "panels": [
      {
        "type": "graph",
        "title": "Request rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "type": "graph",
        "title": "Request latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95"
          }
        ]
      }
    ]
  }
}
```

Alerting configuration
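The alert rules in this section use `for: 5m`, which means the condition has to stay true across consecutive evaluations for five minutes before the alert fires. Until then it is only pending. The standard-library sketch below simulates that behaviour for a single alert. `alert_states` is a hypothetical helper, not Prometheus code, and it ignores details such as separate alert instances per label set.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    evaluations: Iterable[Tuple[float, bool]], for_seconds: float
) -> List[str]:
    """Return the alert state after each evaluation: inactive, pending or firing."""
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp  # first evaluation where the condition holds
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# One false evaluation at t=150 resets the timer, so firing is delayed to t=600.
timeline = [(0, True), (150, False), (300, True), (450, True), (600, True)]
print(alert_states(timeline, for_seconds=300))
```

A longer `for` value filters out short spikes at the cost of a slower notification.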

```yaml
# alerting_rules.yml
# Prometheus loads this file only if it is listed under rule_files in prometheus.yml.
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        # Assumes the service also exports an http_errors_total counter
        expr: rate(http_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          # rate() yields errors per second, not a percentage
          description: "Error rate is {{ $value }} errors/s for API service"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
```

Best practices
1. Metric naming conventions
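Naming conventions are easier to keep when they are checked automatically. The following is a minimal standard-library sketch of such a check. `check_metric_name` and the suffix list are assumptions made for illustration, not an official Prometheus tool, and the regular expression is the character set Prometheus has traditionally accepted for metric names.

```python
import re
from typing import List

# Character set Prometheus has traditionally accepted for metric names.
NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
# Suffixes assumed for this sketch; extend the tuple to match your own units.
UNIT_SUFFIXES = ("_seconds", "_bytes", "_ratio", "_total")


def check_metric_name(name: str) -> List[str]:
    """Return a list of readable problems; an empty list means the name passes."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("contains characters outside [a-zA-Z0-9_:]")
    if name != name.lower():
        problems.append("should be lowercase snake_case")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("should end with a unit or _total suffix")
    return problems


for candidate in ("http_requests_total", "requestDurationMs"):
    print(candidate, check_metric_name(candidate))
```

A check like this could run in a unit test or a pre-commit hook, so that badly named metrics are caught before they reach a dashboard.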

```text
# <namespace>_<name>_<unit>, with a _total suffix for counters
http_requests_total
memory_usage_bytes
request_duration_seconds
```

2. Label management
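Every distinct combination of label values becomes its own time series, so a label such as `endpoint` should not carry raw URLs that contain user or order IDs. One common way to keep the value set bounded is to normalize the path before using it as a label. The sketch below uses only the standard library. `normalize_endpoint`, the `:id` placeholder, and the ID patterns are assumptions chosen for illustration.

```python
import re

# Path segments that look like numeric IDs or UUIDs (patterns assumed for this sketch).
_ID_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize_endpoint(path: str) -> str:
    """Replace ID-like path segments with a placeholder so label values stay bounded."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = [
        ":id" if _ID_SEGMENT.fullmatch(segment) else segment
        for segment in path.split("/")
    ]
    return "/".join(segments)


print(normalize_endpoint("/api/users/42"))
print(normalize_endpoint("/api/users?page=2"))
```

The normalized value can then be passed as the `endpoint` label, so that `/api/users/1` and `/api/users/2` are counted in the same series.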
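
```python
# Assumes REQUESTS was declared with the label names
# ['method', 'endpoint', 'status_code']; otherwise labels() raises ValueError.
REQUESTS.labels(
    method='GET',
    endpoint='/api/users',
    status_code='200'
).inc()
```

3. Visualization tips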

```json
{
  "panels": [
    {
      "type": "stat",
      "title": "Average latency",
      "targets": [
        {
          "expr": "rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])"
        }
      ]
    },
    {
      "type": "gauge",
      "title": "Memory usage (%)",
      "targets": [
        {
          "expr": "memory_usage_bytes / memory_total_bytes * 100"
        }
      ]
    }
  ]
}
```

Summary
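Prometheus and Grafana are a classic pairing for monitoring. With well-designed metrics and sensible dashboards, you can keep the whole system's runtime state in view.
My bearded dragon, Hash, has its own take on monitoring: it keeps a constant eye on changes in its surroundings, which may well be nature's own "monitoring system"!
If you're interested in monitoring, leave a comment and let's talk! I'm Ouyang Rui. The geek's road never ends!
Tech stack: Prometheus · Grafana · Monitoring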
