Kubernetes Monitoring and Observability Best Practices
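Introduction
Monitoring and observability are core capabilities for operating Kubernetes clusters. This article examines the architecture of a Kubernetes monitoring stack, tool selection, and best practices.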
1. Observability Architecture
1.1 Observability Layers
```
┌─────────────────────────────────────────────────────────────┐
│                 Observability Architecture                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Metrics layer                                       │    │
│  │ - Prometheus / Metrics                              │    │
│  │ - Resource usage / performance metrics              │    │
│  └─────────────────────────────────────────────────────┘    │
│                            │                                │
│                            ▼                                │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Logs layer                                          │    │
│  │ - Loki / Elasticsearch                              │    │
│  │ - Application logs / system logs                    │    │
│  └─────────────────────────────────────────────────────┘    │
│                            │                                │
│                            ▼                                │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Tracing layer                                       │    │
│  │ - Jaeger / Zipkin                                   │    │
│  │ - Distributed tracing / request paths               │    │
│  └─────────────────────────────────────────────────────┘    │
│                            │                                │
│                            ▼                                │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Visualization layer                                 │    │
│  │ - Grafana / Kibana                                  │    │
│  │ - Dashboards / alerting                             │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
1.2 The Three Pillars of Observability
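| Pillar | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus | Numeric data, used for monitoring and alerting |
| Logs | Loki/ELK | Text logs, used for troubleshooting |
| Traces | Jaeger | Distributed tracing, used for request analysis |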
2. Prometheus Deployment and Configuration
2.1 Installing Prometheus
```bash
# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace \
  --set alertmanager.persistentVolume.enabled=true \
  --set server.persistentVolume.enabled=true
```
2.2 Prometheus Configuration
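The `kubernetes-apiservers` job in this subsection's configuration relies on a `keep` relabel rule to narrow the discovered endpoints down to the API server. As a rough illustration of how that rule is evaluated, the sketch below emulates it in Python: the source label values are joined with `;` and the result must match the fully anchored regex. The function name `relabel_keep` and the sample targets are hypothetical; this is not Prometheus code, only an approximation of its documented behaviour.

```python
import re


def relabel_keep(target_labels, source_labels, regex, separator=";"):
    """Return True if a discovered target survives a `keep` relabel rule."""
    # The values of the source labels are joined with the separator;
    # a label that is absent contributes an empty string.
    joined = separator.join(target_labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, which fullmatch mirrors.
    return re.fullmatch(regex, joined) is not None


SOURCE_LABELS = [
    "__meta_kubernetes_namespace",
    "__meta_kubernetes_service_name",
    "__meta_kubernetes_endpoint_port_name",
]
KEEP_REGEX = "default;kubernetes;https"

# Hypothetical discovered endpoints, for illustration only.
api_server = {
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_service_name": "kubernetes",
    "__meta_kubernetes_endpoint_port_name": "https",
}
some_app = {
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_service_name": "my-app",
    "__meta_kubernetes_endpoint_port_name": "http",
}

print(relabel_keep(api_server, SOURCE_LABELS, KEEP_REGEX))  # True
print(relabel_keep(some_app, SOURCE_LABELS, KEEP_REGEX))    # False
```

Because the regex is anchored, a Service named `kubernetes-dashboard` in the `default` namespace would not slip through, even though its name starts with `kubernetes`.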
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      # Scrape the Kubernetes API server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # bearer_token_file belongs to the scrape config, not to tls_config
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Scrape each node and copy its Kubernetes labels onto the series
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
```
2.3 ServiceMonitor Configuration
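A ServiceMonitor such as the one in this subsection only tells Prometheus where to scrape; the target application still has to serve the Prometheus text exposition format at `/metrics`. The sketch below shows roughly what such an endpoint could look like using only the Python standard library. The metric name `my_app_http_requests_total`, the `REQUEST_COUNTS` store, and port 8080 are assumptions for illustration; a real service would more likely use the official `prometheus_client` library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter store: HTTP method -> requests handled.
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP my_app_http_requests_total Total HTTP requests handled.",
        "# TYPE my_app_http_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'my_app_http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        # Content type used by the text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. For the ServiceMonitor to find it, the Service in front of the pod would need a port named `http` that targets this listener.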
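```yaml
# ServiceMonitor is a Prometheus Operator custom resource;
# the operator must be installed for it to take effect
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default
```
3. Grafana Deployment and Configuration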
3.1 Installing Grafana
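```bash
# Add the Grafana Helm repository (skip if it is already configured)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Grafana; replace the placeholder password before use
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set adminPassword='my-secure-password'
```
3.2 Grafana Dashboard Configuration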
```json
{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "id": null,
    "tags": ["kubernetes", "cluster"],
    "style": "dark",
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 0,
    "panels": [
      {
        "type": "stat",
        "title": "Running Pods",
        "targets": [
          {
            "expr": "sum(kube_pod_status_phase{phase='Running'})",
            "legendFormat": "Running"
          }
        ],
        "gridPos": { "h": 3, "w": 4, "x": 0, "y": 0 }
      },
      {
        "type": "timeseries",
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100",
            "legendFormat": "CPU usage %"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 3 }
      }
    ],
    "time": { "from": "now-6h", "to": "now" }
  }
}
```
4. Alerting Rule Configuration
4.1 PrometheusRule Configuration
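The alert rules in this subsection each carry a `for` duration, which delays firing until the expression has been true continuously for that long. The following sketch models that behaviour as a small state machine so the effect of `for: 5m` is easier to reason about. The function `alert_states` and its inputs are hypothetical, and the model ignores details such as per-series state and resolved notifications.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    condition_results: one bool per evaluation (True = expression returned data)
    eval_interval_s:   seconds between evaluations
    for_s:             the rule's `for` duration in seconds
    """
    states = []
    active_since = None  # time at which the condition last became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # Any gap resets the timer, so short spikes never fire.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# CPU above threshold for six consecutive 1-minute evaluations, `for: 5m`.
print(alert_states([True] * 6, eval_interval_s=60, for_s=300))
# A two-minute spike followed by recovery never leaves 'pending'.
print(alert_states([True, True, False, True], eval_interval_s=60, for_s=300))
```

In the first call only the last evaluation is `firing`; in the second, the gap resets the timer. This is why a longer `for` value trades detection latency for fewer alerts on transient spikes.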
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.rules
      rules:
        # Aggregate per instance so that {{ $labels.instance }} is populated
        - alert: HighCPUUsage
          expr: sum by (instance) (rate(node_cpu_seconds_total{mode!='idle'}[5m])) / sum by (instance) (rate(node_cpu_seconds_total[5m])) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node CPU usage is high"
            description: "CPU usage on node {{ $labels.instance }} is above 80%"
        - alert: HighMemoryUsage
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node memory usage is high"
            description: "Memory usage on node {{ $labels.instance }} is above 85%"
        - alert: PodNotReady
          expr: kube_pod_status_ready{condition="false"} > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is not ready"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has not been ready for more than 10 minutes"
```
4.2 Alertmanager Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  config.yml: |
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'webhook'
    receivers:
      - name: 'webhook'
        webhook_configs:
          - url: 'https://alertmanager.example.com/webhook'
            send_resolved: true
```
5. Log Management
5.1 Deploying Loki
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=true \
  --set promtail.enabled=true
```
5.2 Promtail Configuration
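```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
data:
  config.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml
    clients:
      # Loki's push endpoint lives under /loki/api/v1/push
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          # Tell Promtail which files to tail; assumes the node's
          # /var/log/pods directory is mounted into the Promtail pod
          - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log
```
6. Distributed Tracing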
6.1 Deploying Jaeger
```bash
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set storage.type=elasticsearch
```
6.2 Application Integration
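```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          env:
            # Send spans to the agent on the same node
            # (assumes the Jaeger agent runs on every node, e.g. as a DaemonSet)
            - name: JAEGER_AGENT_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: JAEGER_AGENT_PORT
              value: "6831"
            - name: JAEGER_SERVICE_NAME
              value: "my-app"
```
7. Monitoring Best Practices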
7.1 Metric Collection Strategy
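```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
spec:
  # selector is required; it picks the Services to scrape
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      interval: 30s
      scrapeTimeout: 10s   # must not exceed the interval
      honorLabels: true
```
7.2 Dashboard Design Principles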
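| Principle | Description |
|---|---|
| Simplicity | Show only the key metrics |
| Layered views | Drill down from overview to detail |
| Timely refresh | Set a reasonable refresh interval |
| Alert linkage | Tie panels to the alerting rules |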
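7.3 Alert Severity Levels
| Level | Description | Response time |
|---|---|---|
| Critical | Service unavailable | Immediate |
| Warning | Degraded performance | Within 15 minutes |
| Info | Informational notice | As needed |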
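8. Common Problems and Solutions
8.1 Metric Collection Failures
Likely causes:
- The service endpoint is unreachable
- Insufficient permissions
- Misconfiguration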
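Of the causes listed above, misconfiguration is often a mismatch between a ServiceMonitor and the Service it is meant to select. The sketch below checks the three fields that most commonly disagree: labels, namespace, and port name. It operates on plain dictionaries shaped like the manifests; the `diagnose` helper and the sample objects are hypothetical, and only `matchLabels` selectors are handled (not `matchExpressions`).

```python
def diagnose(monitor, service):
    """Return reasons why `monitor` (a ServiceMonitor) would not select `service`."""
    problems = []
    spec = monitor["spec"]
    svc_meta = service["metadata"]

    # 1. Every matchLabels entry must be present on the Service itself.
    svc_labels = svc_meta.get("labels", {})
    for key, want in spec.get("selector", {}).get("matchLabels", {}).items():
        if svc_labels.get(key) != want:
            problems.append(f"Service lacks label {key}={want}")

    # 2. Without a namespaceSelector only the ServiceMonitor's own
    #    namespace is searched.
    ns_selector = spec.get("namespaceSelector", {})
    if not ns_selector.get("any", False):
        allowed = ns_selector.get("matchNames") or [monitor["metadata"]["namespace"]]
        if svc_meta["namespace"] not in allowed:
            problems.append(f"namespace {svc_meta['namespace']} is not selected")

    # 3. endpoints[].port refers to a named Service port, not a port number.
    port_names = {p.get("name") for p in service["spec"].get("ports", [])}
    for endpoint in spec.get("endpoints", []):
        if endpoint.get("port") not in port_names:
            problems.append(f"Service has no port named {endpoint.get('port')}")
    return problems


monitor = {
    "metadata": {"name": "my-app-monitor", "namespace": "monitoring"},
    "spec": {
        "selector": {"matchLabels": {"app": "my-app"}},
        "namespaceSelector": {"matchNames": ["default"]},
        "endpoints": [{"port": "http", "path": "/metrics"}],
    },
}
# Hypothetical Service whose port is named "web" instead of "http".
service = {
    "metadata": {"name": "my-app", "namespace": "default", "labels": {"app": "my-app"}},
    "spec": {"ports": [{"name": "web", "port": 8080}]},
}

print(diagnose(monitor, service))  # ['Service has no port named http']
```

In practice the two dictionaries could be loaded from `kubectl get ... -o json` output; an empty result means the selector, namespace, and port name line up and the cause lies elsewhere (reachability or permissions).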
Solution:
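```bash
# Check the Service
kubectl get svc my-app -n default

# Test the metrics endpoint (run from inside the cluster, where the Service name resolves)
curl http://my-app:8080/metrics

# Inspect the ServiceMonitor
kubectl get servicemonitor my-app-monitor -n monitoring -o yaml
```
8.2 Too Many Alerts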
Likely causes:
- Thresholds are too sensitive
- Missing inhibition rules
- Duplicate alerts
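To make the effect of a missing inhibition rule concrete, the sketch below approximates how Alertmanager decides whether one alert mutes another: a firing source alert suppresses a target alert when both match their respective label sets and agree on every label listed under `equal`. The `is_inhibited` function and the sample alerts are hypothetical; the rule dictionary simply mirrors the shape of an `inhibit_rules` entry.

```python
def is_inhibited(target, alerts, rule):
    """Return True if some other firing alert suppresses `target` under `rule`."""
    # The rule only applies to alerts matching target_match.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in alerts:
        if source is target:
            continue  # an alert does not inhibit itself
        if any(source.get(k) != v for k, v in rule["source_match"].items()):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}

# Hypothetical firing alerts, represented by their label sets.
alerts = [
    {"alertname": "HighCPUUsage", "instance": "node-1", "severity": "critical"},
    {"alertname": "HighCPUUsage", "instance": "node-1", "severity": "warning"},
    {"alertname": "HighCPUUsage", "instance": "node-2", "severity": "warning"},
]

delivered = [a for a in alerts if not is_inhibited(a, alerts, rule)]
print(len(delivered))  # 2: the node-1 warning is muted by the node-1 critical
```

Without the rule all three notifications would be sent; with it, the warning for `node-1` is redundant and dropped, while the unrelated warning for `node-2` still goes out.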
Solution:
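```yaml
# Alertmanager inhibition rule: while a critical alert is firing, mute
# warning alerts that share the same alertname and instance.
# Alertmanager 0.22 and later prefer source_matchers / target_matchers.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```
8.3 Insufficient Storage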
Likely causes:
- Metric retention period is too long
- Scrape frequency is too high
- Insufficient storage capacity
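The three factors above combine in the rough capacity formula from the Prometheus documentation: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, where a sample typically takes 1 to 2 bytes on disk. The sketch below applies it; the figure of 500,000 active series is a made-up example, and approximating the ingestion rate as series count divided by scrape interval is a simplifying assumption.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough Prometheus TSDB size estimate in bytes."""
    # Each active series yields one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


GIB = 1024 ** 3

# Hypothetical cluster: 500,000 active series, 15s interval, 15d retention.
size = estimate_disk_bytes(500_000, scrape_interval_s=15, retention_days=15)
print(f"{size / GIB:.1f} GiB")  # about 80.5 GiB

# Doubling the scrape interval halves the estimate.
relaxed = estimate_disk_bytes(500_000, scrape_interval_s=30, retention_days=15)
print(f"{relaxed / GIB:.1f} GiB")  # about 40.2 GiB
```

Under these assumptions a 100Gi volume with 15 days of retention leaves only modest headroom, so shortening retention or lengthening the scrape interval is worth considering before the volume fills up.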
Solution:
```yaml
# Prometheus storage settings
# (fields of the Prometheus Operator's Prometheus resource spec)
storage:
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 100Gi
retention: 15d
```
Conclusion
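Monitoring and observability are core capabilities for operating Kubernetes clusters. With Prometheus, Grafana, Loki, and Jaeger configured appropriately, you can build a complete observability stack. Following the practices above helps keep the monitoring system reliable and effective, so that operations teams can detect and resolve problems promptly.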
