Stop Memorizing! Use This Hands-On Demo to Understand Prometheus's Four Core Metric Types in 5 Minutes

news 2026/6/15 5:14:03

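Every time an interviewer asks about Prometheus's four metric types, do you still recite definitions like "a Counter only goes up, a Gauge can go up and down"? I have been there, and I know how painful it is. Then one day I ran a few simple examples on my own server, and everything became clear. Today we will spend five minutes building a runnable demo that turns the abstract concepts into curves you can actually see.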

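Picture this: when the interviewer follows up with "How do Histogram and Summary differ in memory consumption?", you can state the theoretical difference and also pull up graphs on your own machine for a live demonstration. That is the essential difference between hands-on understanding and rote memorization. We will use Node Exporter and the official Prometheus Go client library to build a complete example that covers every metric type.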

1. Environment setup: deploying the three components quickly

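Before starting, prepare the following components (the commands below assume Linux on amd64; on macOS, download the darwin builds instead):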

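    # Download Prometheus and Node Exporter
    wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
    wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz

    # Extract the archives
    tar xzf prometheus-2.47.0.linux-amd64.tar.gz
    tar xzf node_exporter-1.6.1.linux-amd64.tar.gz

    # Start Node Exporter (node-level metrics)
    ./node_exporter-1.6.1.linux-amd64/node_exporter &

    # Start the Prometheus server
    (cd prometheus-2.47.0.linux-amd64 && ./prometheus --config.file=./prometheus.yml) &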

Also prepare a Go example program, demo_metrics.go, which we will use to demonstrate custom metrics:

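    package main

    import (
        "log"
        "math"
        "math/rand"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        // Define the four metric types here (see the following sections).
        // Go rejects unused imports, so the file compiles only once the
        // snippets that use prometheus, time, math, and rand are added.

        // Expose all registered metrics on /metrics.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }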

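Tip: make sure the firewall allows ports 9090 (Prometheus), 9100 (Node Exporter), and 8080 (the example program). Also add localhost:9100 and localhost:8080 as scrape targets in prometheus.yml, since the default configuration scrapes only Prometheus itself.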

2. Counter: more than just a counter

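A Counter is often described simply as "a counter that can only go up", but its real value lies in rate-of-change analysis. Add the following code to the example program: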

    requestCounter := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "demo_http_requests_total",
        Help: "Total number of HTTP requests",
    })
    prometheus.MustRegister(requestCounter)

    // Simulate request handling: one request every two seconds.
    go func() {
        for {
            requestCounter.Inc()
            time.Sleep(time.Second * 2)
        }
    }()

After starting the program, enter this in the Prometheus expression browser:

rate(demo_http_requests_total[1m])

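You will see a nearly flat line at roughly 0.5, because the demo increments the counter once every two seconds. This is the essence of a Counter: it is best consumed through rate(), which computes the average per-second increase over the time window. Common interview questions:

Why does a Counter go back to zero after the service restarts? Does that affect the rate() calculation?
How would you design a Counter that does not lose its data across restarts? (Hint: use external storage.)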
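The first question is easier to answer after seeing how a reset can be absorbed. The following is a minimal sketch, using only the Go standard library, of the idea behind reset handling: any drop in a counter's value is read as a restart from zero. The function name and the sample values are hypothetical, and the real rate() and increase() additionally extrapolate to the window boundaries, which this sketch leaves out.

```go
package main

import "fmt"

// increaseWithResets sums the growth of a counter across raw samples.
// A drop in value is treated as a restart from zero, so the sample
// after the drop contributes its full value.
// Simplified: no extrapolation to the edges of the query window.
func increaseWithResets(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	total := 0.0
	prev := samples[0]
	for _, v := range samples[1:] {
		if v < prev {
			total += v // counter reset: counted from 0 up to v
		} else {
			total += v - prev
		}
		prev = v
	}
	return total
}

func main() {
	// Hypothetical scraped values; the process restarts after the third sample.
	samples := []float64{10, 12, 14, 1, 3}
	fmt.Println(increaseWithResets(samples)) // 2 + 2 + 1 + 2 = 7
}
```

Only the increments that happened between the last scrape before the restart and the restart itself are lost, which is why a restart usually shows up as a small dip rather than a large negative spike.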

3. Gauge: the system's instantaneous pulse

If a Counter is a "cumulative value", a Gauge is the "current value". Add a simulated memory reading to the example:

    memoryUsage := prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "demo_memory_usage_bytes",
        Help: "Current memory usage in bytes",
    })
    prometheus.MustRegister(memoryUsage)

    // Simulate memory fluctuation: a sine wave between 50 and 150
    // (requires the "math" import).
    go func() {
        for {
            memoryUsage.Set(100 + 50*math.Sin(float64(time.Now().Unix())/10))
            time.Sleep(time.Second)
        }
    }()

Query it in Prometheus:

demo_memory_usage_bytes

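The key differences are:

A Gauge is suited to querying the current value directly (for example CPU temperature or memory usage).
A Gauge supports Set(), Inc(), Dec(), Add(), and Sub(), whereas a Counter offers only Inc() and Add() with non-negative values.
Interview trap: "Can a Gauge be used to implement a Counter?" (Technically yes, but rate() and increase() treat every decrease as a counter reset, so their results become meaningless as soon as the value goes down.)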
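The interview trap above can be made concrete with a small, standard-library-only sketch. It compares the net change of a fluctuating series with what counter semantics would report for the same samples. The function names and readings are hypothetical, and counterIncrease is a simplified stand-in for PromQL's increase(), without extrapolation.

```go
package main

import "fmt"

// delta is the net change between the first and last sample,
// which is the natural question to ask of a gauge.
func delta(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	return samples[len(samples)-1] - samples[0]
}

// counterIncrease applies counter semantics: every drop is read as a
// restart from zero, so the value after the drop is added in full.
func counterIncrease(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			total += samples[i]
		} else {
			total += samples[i] - samples[i-1]
		}
	}
	return total
}

func main() {
	// Hypothetical gauge readings that rise, fall, and rise again.
	usage := []float64{100, 120, 90, 110}
	fmt.Println(delta(usage))           // 110 - 100 = 10
	fmt.Println(counterIncrease(usage)) // 20 + 90 + 20 = 130
}
```

The drop from 120 to 90 is misread as a restart, so the counter-style result is 130 instead of a net change of 10.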

4. Histogram vs Summary: two ways to get quantiles

This is the pair of concepts most easily confused. First, implement both in the program:

    // Histogram
    responseTimeHist := prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "demo_api_response_time_seconds",
        Help:    "API response time distribution",
        Buckets: []float64{0.1, 0.5, 1, 2, 5}, // custom bucket upper bounds
    })
    prometheus.MustRegister(responseTimeHist)

    // Summary
    responseTimeSummary := prometheus.NewSummary(prometheus.SummaryOpts{
        Name: "demo_api_response_time_summary_seconds",
        Help: "API response time summary",
        // 50th and 90th percentiles, each with its allowed error
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01},
    })
    prometheus.MustRegister(responseTimeSummary)

    // Simulate response times between 0.05s and 2.05s
    // (requires the "math/rand" import).
    go func() {
        for {
            latency := 0.05 + rand.Float64()*2
            responseTimeHist.Observe(latency)
            responseTimeSummary.Observe(latency)
            time.Sleep(time.Millisecond * 300)
        }
    }()

Query both and compare the results:

    # Cumulative counts of the Histogram's buckets
    demo_api_response_time_seconds_bucket{le="0.1"}
    demo_api_response_time_seconds_bucket{le="0.5"}

    # Precomputed quantiles of the Summary
    demo_api_response_time_summary_seconds{quantile="0.5"}
    demo_api_response_time_summary_seconds{quantile="0.9"}

The core differences are summarized in the table below:

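Feature	Histogram	Summary
Configuration	Buckets must be defined up front; any quantile can then be estimated at query time	Quantiles (Objectives) are fixed in client code; other quantiles cannot be computed later
Client-side resource cost	Low (increments counters only)	High (computes quantiles as observations stream in)
Server-side aggregation	Supported (bucket definitions must match)	Not supported
Typical use case	Response-time monitoring that needs aggregation	Monitoring a standalone service that needs no aggregation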

5. Practical tips: from monitoring to alerting

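Now that we have a full set of metrics, how do we define useful alerting rules? Reference a rule file from the Prometheus configuration file, then put the rules in that file: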

    # prometheus.yml
    rule_files:
      - 'alert.rules'

    # Example contents of alert.rules
    groups:
      - name: demo-alerts
        rules:
          - alert: HighLatency
            expr: histogram_quantile(0.9, rate(demo_api_response_time_seconds_bucket[5m])) > 1
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "High API latency detected"
              description: "90th percentile latency is {{ $value }}s"

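Key tips:

For a Counter, apply rate() first, then compare.
For a Histogram, compute quantiles with histogram_quantile().
For a Gauge, use the current value directly.
For a Summary, query the precomputed quantiles directly.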

6. Frequently asked interview questions, in depth

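With this demo in hand, when asked "How do you choose among the four metric types?", you can give a structured answer like this:

Counting: choose Counter
- Suited to: request counts, error counts, completed-task counts
- Key operations: rate(), increase()
- Example: rate(http_requests_total[5m]) > 100
Instantaneous values: choose Gauge
- Suited to: temperature, memory usage, concurrent connections
- Key operations: direct queries, prediction functions such as predict_linear()
- Example: predict_linear(node_memory_free_bytes[1h], 3600) < 0
Distribution analysis (aggregation needed): choose Histogram
- Suited to: response-time analysis across instances
- Key operations: histogram_quantile() combined with rate()
- Example: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Distribution analysis (quantiles computed on the client): choose Summary
- Suited to: key metrics of a single service
- Key operations: querying the quantile label directly
- Example: rpc_duration_seconds{quantile="0.9"} > 1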

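Finally, a real case: during a performance-tuning effort we monitored the same API with both a Histogram and a Summary, and found that the P99 shown by the Histogram was 15% lower than the Summary's. After investigating, we found that the client was using a different algorithm to compute the Summary. That made me truly understand what "monitoring data can lie" means: only by experimenting with the different metric types yourself can you really understand their characteristics and limitations.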
