Introduction to Prometheus and Monitoring Platform Deployment
1. Core architecture overview
```plaintext
┌─────────────────────────────────────────────────────────────────┐
│                     Prometheus Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐                 │
│  │Exporter│  │Exporter│  │Exporter│  │Exporter│  (:9100/9104)   │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘                 │
│      └───────────┴───────────┴───────────┘                      │
│                        │ Pull (15s)                             │
│                        ▼                                        │
│                 ┌──────────────┐                                │
│                 │  Prometheus  │──┐                             │
│                 │    Server    │  │  ┌──────────────┐           │
│                 │ ┌──────────┐ │  └─▶│ Alertmanager │──▶ notify │
│                 │ │   TSDB   │ │     └──────────────┘           │
│                 │ └──────────┘ │                                │
│                 └──────┬───────┘                                │
│                        ▼                                        │
│                 ┌──────────────┐  (:3000)                       │
│                 │   Grafana    │                                │
│                 └──────────────┘                                │
└─────────────────────────────────────────────────────────────────┘
```
Flow: each exporter exposes /metrics → Prometheus pulls it on a fixed interval → samples are stored in the TSDB → firing alerts go to Alertmanager for notification → Grafana queries Prometheus for dashboards
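The first step of this flow, an exporter exposing /metrics, can be sketched with Python's standard library alone. This is only an illustration: the metric name `demo_requests_total`, port 8000, and the function names are hypothetical, and real exporters normally use an official client library rather than hand-written formatting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter; a real exporter reads from the system it monitors.
REQUESTS = {"GET": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Requests handled, by HTTP method.",
        "# TYPE demo_requests_total counter",
    ]
    for method, value in sorted(counters.items()):
        lines.append(f'demo_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS["GET"] += 1
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding the host and port as a static target would let Prometheus pull the counter on its normal scrape interval.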
2. Deployment (Docker Compose)
```yaml
# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'   # keep 15 days of data
      - '--web.enable-lifecycle'              # enables POST /-/reload
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123   # example only; change before exposing Grafana
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host'
      # "$$" is Compose escaping for a literal "$"
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
```
3. Core configuration (prometheus.yml explained)
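```yaml
# prometheus.yml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often rules are evaluated
  external_labels:
    cluster: 'prod'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "rules/*.yml"             # relative to the directory of this file

scrape_configs:
  # Prometheus scraping itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Static targets with an extra label
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'prod'

  # File-based service discovery; the targets/ directory must exist
  # next to this file (it is not mounted by the Compose file above)
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - 'targets/*.json'
        refresh_interval: 30s

  # Kubernetes pod discovery; with no api_server set this only works
  # when Prometheus itself runs inside the cluster
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

  # Relabeling examples
  - job_name: 'relabel_demo'
    static_configs:
      - targets: ['192.168.1.100:8080']
    relabel_configs:
      # instance = host part of __address__ (port stripped)
      - source_labels: [__address__]
        regex: '([^:]+):(\d+)'
        target_label: instance
        replacement: '${1}'
      # attach a fixed label
      - target_label: env
        replacement: 'prod'
      # drop every label whose name matches the regex
      - regex: '__meta_.*'
        action: labeldrop
```
4. Exporter deployment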
Installing node_exporter
```bash
# Download and unpack the release
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# Register a systemd unit
sudo tee /etc/systemd/system/node-exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Start now and on every boot
sudo systemctl daemon-reload && sudo systemctl enable --now node-exporter
```
Common exporters
| Exporter | Port | Monitored target | Key metrics |
|---|---|---|---|
| node_exporter | 9100 | Linux | cpu/mem/disk/net |
| windows_exporter | 9182 | Windows | iis/sqlserver |
| mysqld_exporter | 9104 | MySQL | queries/connections |
| postgres_exporter | 9187 | PostgreSQL | queries/buffers |
| redis_exporter | 9121 | Redis | memory/commands |
| blackbox_exporter | 9115 | HTTP/TCP | probe_success |
| cadvisor | 8080 | Docker | container_* |
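All of these exporters serve the same plain-text exposition format, so one quick sanity check is to parse an exporter's output. The sketch below is deliberately minimal and handles only simple `name{labels} value` lines; escaped label values, timestamps, and exemplars are not handled, and the function name is hypothetical.

```python
import re

# name, optional {labels}, then a single value token (lines with timestamps are skipped)
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a simple sample line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the body returned by `curl http://<target>:9100/metrics` gives a list that is easy to filter by metric name or label.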
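5. PromQL query basics
```promql
# Instant vectors
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)   # CPU usage (%)
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))         # memory usage (%)

# Range vectors (the [5m] / [1h] selectors) passed to rate functions
rate(node_cpu_seconds_total{mode="user"}[5m])   # per-second rate of change
increase(http_requests_total[1h])               # increase over the window

# Aggregation
sum by (instance, job) (rate(node_cpu_seconds_total[5m]))
count(node_cpu_seconds_total)
max by (service) (http_request_duration_seconds_bucket)   # max of raw bucket counters; for latency, histogram_quantile() is usually the better fit

# Functions
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600)      # value predicted 4 hours ahead
irate(node_cpu_seconds_total{mode="user"}[5m])                              # instantaneous rate from the last two samples
label_replace(up{job="node"}, "hostname", "$1", "instance", "([^:]+):.*")   # copy the host part of instance into hostname
```
6. Alertmanager alerting configuration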
Alerting rules
```yaml
# rules/alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80%, current: {{ $value | printf \"%.2f\" }}%"

      # less than 10% of memory available
      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 3m
        labels:
          severity: warning

      # less than 10% of the root filesystem available
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 2m
        labels:
          severity: critical
```
Alertmanager configuration
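```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'   # placeholder; SMTP auth needs a username as well
  smtp_auth_password: 'xxxxxx'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s         # wait before sending the first notification for a new group
  group_interval: 5m      # wait before notifying about new alerts in an existing group
  repeat_interval: 4h     # resend interval for alerts that are still firing
  receiver: 'default-receiver'
  routes:
    # "match" still works in v0.26 but is deprecated in favour of "matchers"
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 10s

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        # placeholder; a Slack webhook URL is required here or as global slack_api_url
        api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

  - name: 'critical-receiver'
    webhook_configs:
      # assumes a DingTalk webhook bridge is reachable under this hostname
      - url: 'http://dingtalk:8060/dingtalk/webhook'

# A firing critical alert mutes warning alerts for the same instance
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']
```
7. Common commands and API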
```bash
# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# TSDB operations
curl http://localhost:9090/api/v1/status/tsdb
# The two admin calls below require Prometheus to run with --web.enable-admin-api
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="test"}'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# HTTP API: instant query
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node"}'

# HTTP API: range query
curl -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=up{job="node"}' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'

# HTTP API: targets, active alerts, loaded rules
curl http://localhost:9090/api/v1/targets
curl http://localhost:9090/api/v1/alerts
curl http://localhost:9090/api/v1/rules
```
8. Troubleshooting common problems
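Problem 1: Target down
```bash
# List targets whose last scrape failed
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
# Check network reachability (replace the <...> placeholders first)
nc -zv <target_ip> <port>
# Inspect the exporter's logs
docker logs node-exporter
# Fetch the metrics endpoint directly
curl http://<target>:9100/metrics
```
Problem 2: Missing metrics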
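```bash
# Does Prometheus know the metric name? (replace <metric> first)
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i '<metric>'
# Counters usually carry a _total suffix; query the full name
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=<metric_name>_total'
# Check which labels the scraped targets carry
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'
```
Problem 3: Alerts not firing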
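```bash
# Are the alerting rules loaded, and what state are they in?
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="alerting")'
# Is Alertmanager healthy? (v2 API; the v1 API is deprecated)
curl -s http://localhost:9093/api/v2/status
# Is a silence suppressing the alert?
curl -s http://localhost:9093/api/v2/silences
```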
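| Check | Purpose |
|---|---|
| up{job="xxx"} | Confirm the target's state (1 = up, 0 = down) |
| rate(x[5m]) > 0 | Verify the counter exists and is increasing |
| ALERTS{alertname="xxx"} | Check the alert's state (pending or firing) |
| promtool check config prometheus.yml | Validate the configuration file |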
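The "Target down" check can also be scripted. The sketch below filters the JSON returned by /api/v1/targets the same way the jq expression in Problem 1 does; the function names and the default base URL are assumptions for illustration.

```python
import json
from urllib.request import urlopen


def down_targets(payload):
    """Return (job, instance, lastError) for every active target reported as down."""
    result = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") == "down":
            labels = target.get("labels", {})
            result.append(
                (labels.get("job"), labels.get("instance"), target.get("lastError", ""))
            )
    return result


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets payload from a Prometheus server (assumed reachable)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)
```

With a running server, `down_targets(fetch_targets())` lists each failing target together with the scrape error Prometheus recorded for it.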
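9. Best practices
Naming conventions
```plaintext
# Metric name: <domain>_<subsystem>_<name>_<unit>
node_memory_MemAvailable_bytes
http_request_duration_seconds

# Labels: app_name, env, region, cluster, instance
# Avoid high-cardinality labels (user_id, ip, and similar)
```
Federation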
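```yaml
# Goes under scrape_configs on the global (federating) Prometheus
- job_name: 'federate'
  honor_labels: true          # keep the job/instance labels of the source servers
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~".+"}'    # pulls every series; narrow this selector in production
  static_configs:
    - targets:
        - 'prometheus-prod:9090'
        - 'prometheus-prod2:9090'
```
High availability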
```plaintext
┌─────────────┐    ┌─────────────┐
│ Prometheus  │    │ Prometheus  │   # dual write: both replicas send the same data
│   Primary   │    │   Replica   │
└──────┬──────┘    └──────┬──────┘
       └────────┬─────────┘
                ▼
        ┌───────────────┐
        │Thanos Receiver│            # unified storage
        └───────────────┘
```
Remote storage
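```yaml
# Added to prometheus.yml on each replica
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
    queue_config:
      capacity: 10000    # samples buffered per shard
      max_shards: 30     # upper bound on parallel senders

# The URL must point at a service that implements the Prometheus remote-read API;
# whether this Thanos endpoint does should be verified for your deployment
remote_read:
  - url: http://thanos-query:10912/api/v1/read
    read_recent: true
```
Performance tuning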
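- Label cardinality: keep the number of label-value combinations below roughly 100,000
- Scrape interval: 5s for high-frequency targets, 60s for low-frequency ones
- Recording rules: pre-aggregate expensive queries
- Storage cleanup: choose a sensible retention period
- Federation and sharding: split Prometheus servers by service domain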
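The cardinality guideline is easier to apply with a quick estimate: the worst-case number of series for one metric is the product of the distinct values of each label. The sketch below assumes made-up label counts and uses the 100,000 figure from the list as a default budget; the function names are hypothetical.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


def over_budget(label_value_counts, budget=100_000):
    """True when the worst-case series count exceeds the budget."""
    return worst_case_series(label_value_counts) > budget


# Hypothetical label counts for a single HTTP metric
safe = {"instance": 50, "method": 5, "status": 8}
risky = dict(safe, user_id=10_000)   # adding a per-user label

print(worst_case_series(safe), over_budget(safe))
print(worst_case_series(risky), over_budget(risky))
```

The real series count is usually lower than this bound because not every combination occurs, but the product shows quickly why a label such as user_id is a problem.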
