
Prometheus Overview and Monitoring Platform Deployment

1. Core Architecture Overview

plaintext

┌──────────────────────────────────────────────────────────────────────────┐
│                         Prometheus Architecture                          │
├──────────────────────────────────────────────────────────────────────────┤
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐                          │
│  │Exporter│  │Exporter│  │Exporter│  │Exporter│  (:9100/9104)            │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘                          │
│      └───────────┴───────────┴───────────┘                               │
│                        │ Pull (15s)                                      │
│                        ▼                                                 │
│                 ┌──────────────┐                                         │
│                 │  Prometheus  │──┐                                      │
│                 │    Server    │  │    ┌──────────────┐                  │
│                 │ ┌──────────┐ │  └───▶│ Alertmanager │──▶Notifications  │
│                 │ │   TSDB   │ │       └──────────────┘                  │
│                 │ └──────────┘ │                                         │
│                 └──────┬───────┘                                         │
│                        ▼                                                 │
│                 ┌──────────────┐  (:3000)                                │
│                 │   Grafana    │                                         │
│                 └──────────────┘                                         │
└──────────────────────────────────────────────────────────────────────────┘

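Flow: Exporters expose /metrics → Prometheus pulls on a fixed interval → the TSDB stores the samples → alert rules fire and Alertmanager sends notifications → Grafana visualizes the data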
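To make the first step of this flow concrete, the sketch below shows roughly what an Exporter's /metrics page contains and how a single sample can be read from it. The payload is made up for illustration and is not captured from a real host; a real node_exporter returns far more series.

```shell
# Illustrative /metrics payload in the Prometheus text exposition format.
# The values are made up; a real node_exporter returns far more series.
metrics_file=$(mktemp)
cat > "$metrics_file" <<'EOF'
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8242913280
EOF

# Skip the "# HELP" / "# TYPE" comment lines and print "series value" pairs.
samples=$(awk '!/^#/ { print $1, $2 }' "$metrics_file")
printf '%s\n' "$samples"

# Read one series by its full name and label set.
idle_seconds=$(awk '$1 == "node_cpu_seconds_total{cpu=\"0\",mode=\"idle\"}" { print $2 }' "$metrics_file")
echo "cpu0 idle seconds: $idle_seconds"

rm -f "$metrics_file"
```

Each non-comment line is one sample: a metric name, an optional label set in braces, and a value. Prometheus attaches the scrape timestamp and the job and instance labels when it pulls the page.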

2. Deployment and Installation (Docker Compose)

yaml

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|$)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
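After starting the stack, it helps to confirm that every service answers on its published port. The following is a minimal sketch, assuming the services are reachable on localhost with the port mappings shown above; the function names `endpoint_status` and `check_stack` are made up for this example.

```shell
# Print "up" if the URL answers with a 2xx status within 3 seconds, else "down".
endpoint_status() {
  if curl -fsS --max-time 3 -o /dev/null "$1" 2>/dev/null; then
    echo up
  else
    echo down
  fi
}

# Probe the health endpoints of the four services on their published ports.
check_stack() {
  for url in \
    http://localhost:9090/-/healthy \
    http://localhost:9093/-/healthy \
    http://localhost:3000/api/health \
    http://localhost:9100/metrics
  do
    printf '%-40s %s\n' "$url" "$(endpoint_status "$url")"
  done
}

# Typical use after starting the stack:
#   docker compose up -d && check_stack
```

A "down" result for a single service usually points at that container; `docker logs <container_name>` is the natural next step.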

3. Core Configuration (prometheus.yml in Detail)

yaml

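# prometheus.yml
global:
  scrape_interval: 15s        # default pull interval
  evaluation_interval: 15s    # how often rules are evaluated
  external_labels:
    cluster: 'prod'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "rules/*.yml"

scrape_configs:
  # Prometheus scraping itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Static targets with an extra label
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'prod'

  # File-based service discovery
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - 'targets/*.json'
        refresh_interval: 30s

  # Kubernetes pod discovery; needs access to the Kubernetes API
  # (in-cluster, or via api_server), so drop this job for the Compose setup
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  # Relabeling examples
  - job_name: 'relabel_demo'
    static_configs:
      - targets: ['192.168.1.100:8080']
    relabel_configs:
      # Copy the host part of __address__ into the instance label
      - source_labels: [__address__]
        regex: '([^:]+):(\d+)'
        target_label: instance
        replacement: '${1}'
      # Attach a fixed label
      - target_label: env
        replacement: 'prod'
      # Drop all labels whose name matches the regex
      - regex: '__meta_.*'
        action: labeldrop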
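The `file_sd` job above reads its targets from `targets/*.json`, resolved relative to the directory of prometheus.yml. Here is a sketch of one such file. It assumes a host directory `./prometheus/targets` that would be mounted at `/etc/prometheus/targets` (the Compose file above does not mount it), and the host names and labels are hypothetical.

```shell
# Host directory assumed to be mounted at /etc/prometheus/targets.
TARGETS_DIR="${TARGETS_DIR:-./prometheus/targets}"
mkdir -p "$TARGETS_DIR"

# Each array entry is a target group: a list of host:port targets plus
# labels shared by all of them. Host names and labels are hypothetical.
cat > "$TARGETS_DIR/app.json" <<'EOF'
[
  {
    "targets": ["app-01:8080", "app-02:8080"],
    "labels": { "env": "prod", "service": "demo-app" }
  }
]
EOF

# Optional syntax check when jq is installed.
if command -v jq >/dev/null 2>&1; then
  jq -e '.[0].targets | length' "$TARGETS_DIR/app.json"
fi
```

Prometheus watches these files and re-reads them at least every `refresh_interval`, so targets can be added or removed without reloading the server.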

4. Exporter Deployment

Installing node_exporter

bash

wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

sudo tee /etc/systemd/system/node-exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload && sudo systemctl enable --now node-exporter

Common Exporters

Exporter            Port   Target       Key metrics
node_exporter       9100   Linux        cpu/mem/disk/net
windows_exporter    9182   Windows      iis/sqlserver
mysqld_exporter     9104   MySQL        queries/connections
postgres_exporter   9187   PostgreSQL   queries/buffers
redis_exporter      9121   Redis        memory/commands
blackbox_exporter   9115   HTTP/TCP     probe_success
cadvisor            8080   Docker       container_*

5. PromQL Query Basics

promql

# Instant vectors
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)   # CPU usage (%)
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))         # memory usage (%)

# Range vectors
rate(node_cpu_seconds_total{mode="user"}[5m])   # per-second rate of change
increase(http_requests_total[1h])               # increase over the window

# Aggregation
sum by (instance, job) (rate(node_cpu_seconds_total[5m]))
count(node_cpu_seconds_total)
max by (service) (http_request_duration_seconds_bucket)

# Functions
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600)   # predicted value 4 hours ahead
irate(node_cpu_seconds_total{mode="user"}[5m])                           # instantaneous rate from the last two samples
label_replace(up{job="node"}, "hostname", "$1", "instance", "([^:]+):.*")
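The arithmetic behind `rate()` and the CPU usage query can be worked through by hand. The sketch below is a simplification with made-up sample values: it divides the counter increase by the elapsed time, whereas the real `rate()` also handles counter resets and extrapolates to the edges of the window.

```shell
# Two hypothetical samples of a counter, 300 seconds (5m) apart.
t1=1700000000; v1=1000
t2=1700000300; v2=1300

# Simplified rate: increase divided by elapsed seconds.
per_second=$(awk -v v1="$v1" -v v2="$v2" -v t1="$t1" -v t2="$t2" \
  'BEGIN { printf "%.2f", (v2 - v1) / (t2 - t1) }')
echo "approx rate: $per_second per second"

# CPU usage as in the first query: 100 - idle_rate * 100.
# An idle rate of 0.85 means the CPU was idle 85% of the time.
cpu_usage=$(awk -v idle=0.85 'BEGIN { printf "%.1f", 100 - idle * 100 }')
echo "cpu usage: $cpu_usage%"
```

The rate of a `*_seconds_total` counter is a fraction of time per second, which is why multiplying the idle rate by 100 and subtracting it from 100 gives a usage percentage.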

6. Alertmanager Alerting Configuration

Alert Rules

yaml

# rules/alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80%, current: {{ $value | printf \"%.2f\" }}%"

      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 3m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 2m
        labels:
          severity: critical
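Alert rules like these can be checked offline before Prometheus loads them. The sketch below writes a promtool unit test for the LowMemory rule. The test file location (`prometheus/tests/`), the synthetic series values, and the instance name are assumptions made for illustration. The test file is kept outside `rules/` so that the `rules/*.yml` glob in prometheus.yml does not try to load it as a rule file.

```shell
# Hypothetical location for rule unit tests, outside the rules/ directory.
TESTS_DIR="${TESTS_DIR:-./prometheus/tests}"
mkdir -p "$TESTS_DIR"

cat > "$TESTS_DIR/alerts_test.yml" <<'EOF'
rule_files:
  - ../rules/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    # Synthetic series: 5% of memory available for 10 minutes.
    input_series:
      - series: 'node_memory_MemAvailable_bytes{instance="node-exporter:9100",job="node"}'
        values: '5+0x10'
      - series: 'node_memory_MemTotal_bytes{instance="node-exporter:9100",job="node"}'
        values: '100+0x10'
    alert_rule_test:
      # The rule has "for: 3m", so it should be firing at the 5 minute mark.
      - eval_time: 5m
        alertname: LowMemory
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: node-exporter:9100
              job: node
EOF

# Run the test only when promtool and the rule file are available.
if command -v promtool >/dev/null 2>&1 && [ -f "$TESTS_DIR/../rules/alerts.yml" ]; then
  promtool test rules "$TESTS_DIR/alerts_test.yml"
fi
```

LowMemory is used here because it has no annotations, so the expected alert only needs its labels. For rules with annotations, the expected annotations would have to be listed under `exp_annotations` as well.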

Alertmanager Configuration

yaml

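# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'   # placeholder; usually the sending account
  smtp_auth_password: 'xxxxxx'
  # Placeholder; slack_configs needs a webhook URL here or as api_url per receiver
  slack_api_url: 'https://hooks.slack.com/services/xxxxxx'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    # "match" still works in v0.26 but is deprecated in favor of "matchers"
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 10s

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'critical-receiver'
    webhook_configs:
      - url: 'http://dingtalk:8060/dingtalk/webhook'

# Suppress warning alerts while a critical alert is firing for the same instance
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']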
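Routing can be exercised without waiting for a real incident by posting a hand-made alert to Alertmanager's v2 API. The following is a sketch under assumptions: Alertmanager is reachable at `http://localhost:9093`, and the alert name, instance, and function names are invented for the test.

```shell
# Build a minimal alert payload for POST /api/v2/alerts.
# The alertname and instance values are hypothetical.
build_test_alert() {
  cat <<'EOF'
[
  {
    "labels": {
      "alertname": "RoutingTest",
      "severity": "critical",
      "instance": "demo-host:9100"
    },
    "annotations": {
      "summary": "Manual test alert"
    }
  }
]
EOF
}

# Send the payload to Alertmanager (default: the port published by the Compose file).
send_test_alert() {
  build_test_alert | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}

# Usage once Alertmanager is running:
#   send_test_alert
```

With the routing tree above, the `severity: critical` label should send this alert to `critical-receiver` after its 10s `group_wait`. Changing the label to `warning` would exercise `default-receiver` instead.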

7. Common Commands and API

bash

# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# TSDB operations (delete_series and clean_tombstones require --web.enable-admin-api)
curl http://localhost:9090/api/v1/status/tsdb
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="test"}'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# HTTP API
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node"}'
curl -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=up{job="node"}' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'
curl http://localhost:9090/api/v1/targets
curl http://localhost:9090/api/v1/alerts
curl http://localhost:9090/api/v1/rules

8. Troubleshooting Common Problems

Problem 1: Target Down

bash

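# List targets that Prometheus reports as down
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
# Check network reachability of the target
nc -zv <target_ip> <port>
# Inspect the exporter's logs
docker logs node-exporter
# Fetch the metrics page directly
curl http://<target>:9100/metrics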

Problem 2: Missing Metrics

bash

# Search the known metric names
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i <metric>
# Query the metric directly
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=<metric_name>_total'
# Inspect the labels attached to each target
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'

Problem 3: Alerts Not Firing

bash

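# Show the alerting rules and their current state
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="alerting")'
# Alertmanager status and active silences (v2 API; the v1 API is deprecated)
curl -s http://localhost:9093/api/v2/status
curl http://localhost:9093/api/v2/silences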

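Command / query                          Purpose
up{job="xxx"}                            Confirm target status
rate(x[5m]) > 0                          Verify that the metric exists
ALERTS{alertname="xxx"}                  Check alert state
promtool check config prometheus.yml     Validate the configuration file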

9. Best Practices

Naming Conventions

yaml

# Metric name: <domain>_<subsystem>_<name>_<unit>
node_memory_MemAvailable_bytes
http_request_duration_seconds

# Labels: app_name, env, region, cluster, instance
# Avoid high-cardinality labels (user_id, ip, etc.)

Federation

yaml

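- job_name: 'federate'
  honor_labels: true          # keep the original job/instance labels of federated series
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~".+"}'    # matches every series; narrow this selector in production
  static_configs:
    - targets:
        - 'prometheus-prod:9090'
        - 'prometheus-prod2:9090'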

High Availability

plaintext

┌─────────────┐    ┌─────────────┐
│ Prometheus  │    │ Prometheus  │   # dual write
│   Primary   │    │   Replica   │
└──────┬──────┘    └──────┬──────┘
       └────────┬─────────┘
                ▼
        ┌───────────────┐
        │Thanos Receiver│   # unified storage
        └───────────────┘

Remote Storage

yaml

remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
    queue_config:
      capacity: 10000
      max_shards: 30

remote_read:
  # The endpoint must implement the Prometheus remote-read API; verify this for your backend
  - url: http://thanos-query:10912/api/v1/read
    read_recent: true

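Performance Tuning

  1. Control label cardinality: avoid more than 100,000 label combinations.
  2. Scrape interval: 5s for high-frequency targets, 60s for low-frequency ones.
  3. Recording rules: pre-aggregate complex queries.
  4. Storage cleanup: set a sensible retention period.
  5. Federation partitioning: split Prometheus instances by service domain.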
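As an illustration of the recording-rules item, the sketch below precomputes the per-instance CPU usage expression used earlier, so dashboards and alerts can read one cheap series instead of re-evaluating the full query. The file name `recording.yml`, the group name, and the rule name `instance:node_cpu_usage:percent` are assumptions; the file is written into the `rules/` directory so that the existing `rules/*.yml` glob picks it up.

```shell
RULES_DIR="${RULES_DIR:-./prometheus/rules}"
mkdir -p "$RULES_DIR"

cat > "$RULES_DIR/recording.yml" <<'EOF'
groups:
  - name: node_recording
    rules:
      # Precompute per-instance CPU usage (%) at every evaluation interval.
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
EOF

# Validate the file when promtool is installed, then reload Prometheus, for example:
#   curl -X POST http://localhost:9090/-/reload
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_DIR/recording.yml"
fi
```

Once loaded, an alert such as HighCPU could be written as `instance:node_cpu_usage:percent > 80`, which keeps the expensive aggregation in one place.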