Prometheus High-Availability Cluster Deployment: Evolving a Monitoring System from a Single Instance to Multiple Replicas

news 2026/6/26 21:14:39

1. The fatal risk of a single monitoring instance: when Prometheus goes down, you are blind

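Prometheus is the de facto standard for cloud-native monitoring, and its default deployment mode is a single instance. That is adequate for a test environment but a serious weakness in production: when Prometheus goes down, the whole monitoring system goes blind at once. Alerting rules stop being evaluated and historical data cannot be queried. Worse, these blind spots tend to appear during a system failure, which is exactly when monitoring data is needed most.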

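The second hazard of a single Prometheus is storage. The TSDB's local storage grows with the number of time series, and the memory and disk I/O of one instance eventually hit a ceiling. As a rough guide (the exact threshold depends on hardware and query patterns), once scrape targets exceed about 5,000 and time series exceed about 2 million, query latency on a single instance can degrade from milliseconds to seconds, which slows down troubleshooting considerably. Building a highly available Prometheus cluster is the key step in moving monitoring from "usable" to "reliable".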

2. Prometheus high-availability architecture: two replicas + Thanos long-term storage

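Prometheus itself does not support multiple replicas writing to the same data set (it has no distributed consensus mechanism), so the high-availability design is "two replicas scraping independently + deduplication at query time". Each Prometheus replica scrapes and stores data on its own, a Thanos Sidecar uploads that data to object storage, and the Thanos Query layer handles cross-replica queries and deduplication.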

flowchart TD
  subgraph collect [Collection layer]
    P1[Prometheus Replica-1] --> S1[Thanos Sidecar-1]
    P2[Prometheus Replica-2] --> S2[Thanos Sidecar-2]
  end
  subgraph storage [Storage layer]
    S1 --> OSS["Object storage (S3/OSS)"]
    S2 --> OSS
    OSS --> ST[Thanos Store Gateway]
  end
  subgraph query [Query layer]
    QR[Thanos Query Frontend] --> TQ[Thanos Query]
    TQ --> S1
    TQ --> S2
    TQ --> ST
  end
  subgraph alert [Alerting layer]
    P1 --> AM1[Alertmanager-1]
    P1 --> AM2[Alertmanager-2]
    P2 --> AM1
    P2 --> AM2
    AM1 --> AMG[Alertmanager Cluster]
    AM2 --> AMG
    AMG --> NF["Notification channels: DingTalk, email, PagerDuty"]
  end
  subgraph rules [Rule layer]
    TR[Thanos Ruler] --> TQ
    TR --> OSS
  end
  GW[Grafana] --> QR

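Two replicas scraping independently: both Prometheus instances are configured with exactly the same scrape targets, and each pulls metrics on its own. When one goes down, the other still holds complete monitoring data. The cost is that scrape traffic doubles, and so does the pull load on every monitored service.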

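Thanos Sidecar: runs in the same Pod as Prometheus and uploads each completed TSDB block (Prometheus cuts one roughly every two hours) to object storage (S3/OSS). It also acts as a gRPC StoreAPI endpoint, giving Thanos Query near-real-time access to the data still held by Prometheus.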
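
The sidecar needs to know where the bucket is. A minimal sketch of that configuration, assuming it is delivered as the Kubernetes Secret `thanos-objstore-config` that the StatefulSet in section 3.1 mounts at /etc/thanos; the bucket name, endpoint, region, and credentials are placeholders, not values from the article:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
stringData:
  # File name must match --objstore.config-file=/etc/thanos/objstore.yml
  objstore.yml: |
    type: S3
    config:
      bucket: thanos-metrics              # hypothetical bucket name
      endpoint: s3.us-east-1.amazonaws.com  # assumed S3-compatible endpoint
      region: us-east-1
      access_key: REPLACE_ME
      secret_key: REPLACE_ME
```

The same file can be mounted into the Store Gateway so that both components read and write one bucket.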

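Thanos Store Gateway: reads historical blocks from object storage and exposes them to Thanos Query through the StoreAPI. It pushes the query's time range down to block selection, so only the relevant blocks are read instead of scanning everything.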
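
The article does not list a Store Gateway manifest. The following is a minimal sketch under stated assumptions: the name `thanos-store` is chosen to match the address that Thanos Query resolves in section 3.2, the cache uses an `emptyDir` (a PVC would avoid re-downloading index data after a restart), and resource limits are left out.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-store
  namespace: monitoring
spec:
  clusterIP: None            # headless, so SRV records are published per Pod
  selector:
    app: thanos-store
  ports:
    - name: grpc             # the port name is what _grpc._tcp SRV lookups match
      port: 10901
      targetPort: 10901
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  namespace: monitoring
spec:
  replicas: 1
  serviceName: thanos-store
  selector:
    matchLabels:
      app: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
    spec:
      containers:
        - name: thanos-store
          image: thanosio/thanos:v0.35.0
          args:
            - "store"
            - "--data-dir=/var/thanos/store"   # local cache of block metadata and index headers
            - "--objstore.config-file=/etc/thanos/objstore.yml"
            - "--grpc-address=0.0.0.0:10901"
            - "--http-address=0.0.0.0:10902"
          ports:
            - name: grpc
              containerPort: 10901
            - name: http
              containerPort: 10902
          volumeMounts:
            - name: thanos-config
              mountPath: /etc/thanos
            - name: cache
              mountPath: /var/thanos/store
      volumes:
        - name: thanos-config
          secret:
            secretName: thanos-objstore-config
        - name: cache
          emptyDir: {}
```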

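Thanos Query deduplication: when a query hits multiple replicas, Thanos Query uses the --query.replica-label parameter to recognize series that differ only in the replica label, merges them into one series, and returns a single result set.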
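
For this to work, each Prometheus replica has to attach a distinct value of that label to everything it exposes. External labels are set in prometheus.yml rather than on the command line. A minimal sketch, assuming the Pod name is available as the `POD_NAME` environment variable and that Prometheus 2.x is started with `--enable-feature=expand-external-labels` so the variable is expanded; the `cluster` label and the scrape interval are illustrative values:

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 30s          # illustrative value
  external_labels:
    cluster: prod-a             # hypothetical cluster name, identical on both replicas
    replica: ${POD_NAME}        # differs per replica: prometheus-0, prometheus-1
```

Every label except `replica` should be identical on both replicas. Otherwise Thanos Query treats the two copies as different series and does not merge them.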

3. Production-grade Prometheus high-availability deployment configuration

3.1 Two-replica Prometheus StatefulSet

# prometheus-statefulset.yaml
# Why a StatefulSet rather than a Deployment: Prometheus needs a stable network
# identity and persistent storage; a StatefulSet provides ordered Pod names and
# stable PVC bindings.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2  # two replicas for high availability
  serviceName: prometheus-headless
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      # Anti-affinity: force the two replicas onto different nodes, because
      # two replicas on one node are lost together when that node fails.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: prometheus
              topologyKey: kubernetes.io/hostname
      initContainers:
        - name: config-init
          image: alpine:3.18
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Fix data directory ownership so the TSDB does not fail to start on permission errors
              chown -R 65534:65534 /prometheus
          volumeMounts:
            - name: data
              mountPath: /prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.52.0
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention.time=15d"
            # Why 15 days locally: the sidecar ships every completed block to object
            # storage, so local disk only needs recent hot data for fast queries.
            - "--storage.tsdb.retention.size=80GB"
            # The Thanos sidecar requires local compaction to be disabled when it
            # uploads blocks, which means equal min and max block durations.
            - "--storage.tsdb.min-block-duration=2h"
            - "--storage.tsdb.max-block-duration=2h"
            - "--web.enable-lifecycle"
            # Replica label: Prometheus has no command-line flag for external labels.
            # Set global.external_labels.replica: ${POD_NAME} in prometheus.yml; this
            # feature flag makes Prometheus 2.x expand the environment variable.
            - "--enable-feature=expand-external-labels"
            - "--web.console.templates=/usr/share/prometheus/consoles"
            - "--web.console.libraries=/usr/share/prometheus/console_libraries"
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: http
              containerPort: 9090
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
          # Liveness probe: /-/healthy reports that the Prometheus process is up
          # and serving HTTP, so a hung process gets restarted.
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
        - name: thanos-sidecar
          image: thanosio/thanos:v0.35.0
          args:
            - "sidecar"
            - "--tsdb.path=/prometheus"
            - "--prometheus.url=http://localhost:9090"
            - "--objstore.config-file=/etc/thanos/objstore.yml"
            # The sidecar uploads each completed 2h block as it appears; blocks that
            # Prometheus has already compacted are skipped by default.
          ports:
            - name: http
              containerPort: 10902  # Thanos Sidecar HTTP
            - name: grpc
              containerPort: 10901  # Thanos gRPC StoreAPI
          volumeMounts:
            - name: data
              mountPath: /prometheus
            - name: thanos-config
              mountPath: /etc/thanos
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: thanos-config
          secret:
            secretName: thanos-objstore-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ssd-storage
        resources:
          requests:
            storage: 100Gi

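The StatefulSet refers to a governing Service named `prometheus-headless`, which the article does not show. A minimal sketch of what it might look like follows. The port name `grpc` is an assumption that matters: the `dnssrv+_grpc._tcp.prometheus-headless...` address used by Thanos Query resolves SRV records, and Kubernetes only publishes those for named ports.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-headless
  namespace: monitoring
spec:
  clusterIP: None              # headless: one DNS record per Pod, no load balancing
  selector:
    app: prometheus
  ports:
    - name: http               # Prometheus web UI and API
      port: 9090
      targetPort: 9090
    - name: grpc               # Thanos Sidecar StoreAPI, discovered via SRV lookup
      port: 10901
      targetPort: 10901
```
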
3.2 Thanos Query deduplicating query configuration

# thanos-query.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos-query
          image: thanosio/thanos:v0.35.0
          args:
            - "query"
            - "--http-address=0.0.0.0:10902"
            - "--grpc-address=0.0.0.0:10901"
            # gRPC endpoints of the two Prometheus sidecars
            # (--endpoint supersedes the deprecated --store flag)
            - "--endpoint=dnssrv+_grpc._tcp.prometheus-headless.monitoring.svc.cluster.local"
            # Store Gateway, for historical data
            - "--endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local"
            # Deduplication label: must match the replica key in Prometheus's external_labels.
            # Without it, every series is returned once per replica.
            - "--query.replica-label=replica"
            # Query timeout: keeps slow queries from dragging down the whole query layer
            - "--query.timeout=2m"
            # Store response timeout: a StoreAPI that sends nothing within this time is ignored
            - "--store.response-timeout=30s"
          ports:
            - containerPort: 10902
            - containerPort: 10901
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
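
Because Thanos Query speaks the Prometheus HTTP API, Grafana can use it as an ordinary Prometheus data source. A sketch of a Grafana provisioning file, assuming a ClusterIP Service named `thanos-query` (not defined in the article) in front of the Deployment above; if a Query Frontend is deployed, the URL would point there instead:

```yaml
# grafana/provisioning/datasources/thanos.yaml
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus            # Thanos Query is API-compatible with Prometheus
    access: proxy
    url: http://thanos-query.monitoring.svc:10902   # hypothetical Service name
    isDefault: true
```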

3.3 Alertmanager cluster configuration

# alertmanager-statefulset.yaml
# Why Alertmanager needs high availability too: if a single Alertmanager goes down,
# all notifications stop even though Prometheus is still evaluating alerting rules.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 3  # peers gossip their state; there is no Raft quorum, so any count >= 2 works
  serviceName: alertmanager-headless
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: alertmanager
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.27.0
          args:
            - "--config.file=/etc/alertmanager/config.yml"
            - "--storage.path=/alertmanager"
            - "--data.retention=120h"
            # Cluster mode: the --cluster.* flags form a gossip mesh.
            # Why peers are addressed by DNS name: no hard-coded IPs, so a Pod
            # that is recreated with a new IP can still rejoin the cluster.
            - "--cluster.listen-address=0.0.0.0:9094"
            - "--cluster.advertise-address=$(POD_IP):9094"
            - "--cluster.peer=alertmanager-0.alertmanager-headless:9094"
            - "--cluster.peer=alertmanager-1.alertmanager-headless:9094"
            - "--cluster.peer=alertmanager-2.alertmanager-headless:9094"
            # Every Alertmanager receives the same alerts; deduplication of
            # notifications is built into cluster mode (peers gossip the notification log).
            # settle-timeout is the maximum wait for cluster connections to settle
            # before notifications are evaluated.
            - "--cluster.settle-timeout=30s"
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          ports:
            - containerPort: 9093
            - containerPort: 9094
          volumeMounts:
            - name: config
              mountPath: /etc/alertmanager
            - name: data
              mountPath: /alertmanager
      volumes:
        - name: config
          configMap:
            name: alertmanager-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
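
On the Prometheus side, each replica should send its alerts to every Alertmanager instance directly rather than through a load-balanced Service, so the cluster can deduplicate them. A sketch of the matching prometheus.yml fragment, assuming the Pod DNS names produced by the StatefulSet above and a replica label named `replica`:

```yaml
# prometheus.yml (fragment)
alerting:
  # Drop the replica label before sending, so the same alert fired by both
  # Prometheus replicas has identical labels and is treated as one alert.
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0.alertmanager-headless.monitoring.svc:9093
            - alertmanager-1.alertmanager-headless.monitoring.svc:9093
            - alertmanager-2.alertmanager-headless.monitoring.svc:9093
```

Without the `labeldrop`, the two replicas would send alerts that differ only in `replica`, and Alertmanager would notify twice.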

4. The hidden costs of highly available monitoring: resource overhead and operational complexity

A Prometheus high-availability setup is not free, and these costs should be quantified before committing to it.

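Scrape traffic doubles: with two replicas, every target is scraped twice. For a cluster with 1,000 scrape targets, the article's estimate is that network traffic rises from about 50 MB/s with one replica to about 100 MB/s. Load on each monitored service's /metrics endpoint doubles as well, which can become a new bottleneck for exporters that run as resource-constrained sidecars.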

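Storage cost grows: object storage is billed by data volume and request frequency. For a medium-sized cluster (2 million time series) the article estimates about 2-3 GB of data per day, though this depends heavily on the scrape interval, and an annual storage cost of roughly 500-1,000 RMB. Each replica uploads its own blocks, so the bucket holds about twice the volume of a single replica. In addition, Thanos Store Gateway downloads block data from object storage to answer queries, so frequent wide-range queries can generate significant API request charges.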
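
The daily volume can be sanity-checked with a back-of-the-envelope calculation. This is a sketch under assumptions that are not in the article: N is the number of active series, Δt the scrape interval, and b the compressed size per sample, for which the Prometheus documentation gives a rule of thumb of about 1-2 bytes.

```latex
\begin{align*}
\text{samples/day} &= N \cdot \frac{86400}{\Delta t} \\
\text{bytes/day}   &\approx N \cdot \frac{86400}{\Delta t} \cdot b \\[4pt]
\Delta t = 60\,\text{s},\ b = 1\,\text{B}: \quad
  & 2\times10^{6} \cdot 1440 \cdot 1 \approx 2.9\ \text{GB/day} \\
\Delta t = 15\,\text{s},\ b = 1.5\,\text{B}: \quad
  & 2\times10^{6} \cdot 5760 \cdot 1.5 \approx 17.3\ \text{GB/day}
\end{align*}
```

Under these assumptions the 2-3 GB figure corresponds to roughly a 60-second scrape interval. A 15-second interval lands several times higher, so the scrape interval is worth fixing before estimating cost.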

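Deduplication accuracy: Thanos Query's default deduplication merges replicas heuristically. It follows one replica's samples and switches to another replica only when it detects a gap. Because the two replicas scrape the same target at slightly different moments (the article cites offsets of 1-2 seconds), the merged series can occasionally show a small jump where the switch happens. For scenarios that need per-second precision, such as SLI calculation, this deviation cannot be ignored.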

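Operational complexity rises sharply: moving from a single instance to two replicas plus the Thanos components increases the number of things to operate from 1 to at least 7 (2 Prometheus + 2 Sidecar + 1 Store Gateway + 1 Query + 1 Ruler). Each component has its own configuration, logs, and failure modes, so troubleshooting becomes several times harder.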

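Where it fits: two replicas + Thanos suits medium-to-large clusters with high monitoring availability requirements and long retention periods (more than 30 days). For small clusters (fewer than 500 scrape targets), a single replica with Remote Write to a backend such as VictoriaMetrics is more economical.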
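
For the small-cluster alternative, the Prometheus side is a single `remote_write` block. A sketch assuming a single-node VictoriaMetrics reachable at a hypothetical in-cluster Service name on its default port 8428; the queue setting is an illustrative tuning value, not a recommendation from the article:

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: http://victoriametrics.monitoring.svc:8428/api/v1/write   # hypothetical Service name
    queue_config:
      max_samples_per_send: 10000   # illustrative batch size
```

Local retention can then be kept short, since long-term history lives in the remote store.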

5. Summary

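The core idea of a highly available Prometheus architecture is "two replicas scraping independently + deduplication at query time", with the Thanos components providing long-term storage and cross-replica queries. Two replicas keep monitoring running through a single failure, Thanos Sidecar moves historical data into object storage where Thanos Store Gateway serves it, and Thanos Query handles deduplication and acts as the single query entry point. The price is doubled scrape traffic, higher storage cost, and more operational complexity.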

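Suggested rollout path: first identify the bottleneck of the current single instance. Is it availability risk or storage pressure? If it is availability risk, deploy two replicas with anti-affinity scheduling first. If it is storage pressure, introduce Thanos Sidecar + object storage first. Only when both apply should you move to a full Thanos Query + Store Gateway deployment. Validate query performance and deduplication accuracy at every step, so that the pursuit of high availability does not introduce new points of failure.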
