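A Complete Approach to Full-Stack Grafana IaC with the Terraform Grafana Provider
This guide describes how to manage the entire Grafana stack as code with the Terraform Grafana Provider, covering the implementation details from architecture design through production rollout.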
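1. Architecture Overview and Core Design Principles
1.1 Why Terraform
Grafana officially offers several as-code tools (Terraform, Ansible, the Operator, Crossplane). Of these, the Terraform provider has the broadest resource coverage: dashboards, data sources, alerting, SLOs, Synthetic Monitoring, IAM, and nearly every other Grafana resource.
When it fits
- You already manage cloud resources (AWS/GCP/Azure/K8s) through a Terraform workflow
- You need unified management of dashboards, alerts, SLOs, data sources, and permissions
- You have strict consistency requirements across environments (dev/staging/prod)
- The team already knows HCL
1.2 Architecture layers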

    ┌──────────────────────────────────────────────────────────────┐
    │                        Git Repository                        │
    │  ┌────────────┐ ┌─────────────┐ ┌──────────┐ ┌───────────┐   │
    │  │ dashboards │ │ datasources │ │ alerting │ │ iam/teams │   │
    │  │  (.json)   │ │    (.tf)    │ │  (.tf)   │ │   (.tf)   │   │
    │  └────────────┘ └─────────────┘ └──────────┘ └───────────┘   │
    └────────────────────┬─────────────────────────────────────────┘
                         │ PR Review / CI Validation
                         ▼
    ┌──────────────────────────────────────────────────────────────┐
    │          CI/CD Pipeline (GitHub Actions/GitLab CI)           │
    │ ┌────────────────┐ ┌────────────────┐ ┌────────────────────┐ │
    │ │ terraform fmt  │ │ terraform plan │ │  terraform apply   │ │
    │ │    validate    │ │  (review req)  │ │   (auto/staging)   │ │
    │ └────────────────┘ └────────────────┘ └────────────────────┘ │
    └────────────────────┬─────────────────────────────────────────┘
                         │ State Backend (S3 + DynamoDB / Terraform Cloud)
                         ▼
    ┌──────────────────────────────────────────────────────────────┐
    │                     Grafana Instance(s)                      │
    │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐  │
    │ │   OSS    │ │  Cloud   │ │   AWS    │ │   Multi-tenant   │  │
    │ │   Self   │ │  Stack   │ │ Managed  │ │  (prod/staging)  │  │
    │ │  Hosted  │ │          │ │ Grafana  │ │                  │  │
    │ └──────────┘ └──────────┘ └──────────┘ └──────────────────┘  │
    └──────────────────────────────────────────────────────────────┘

2. Provider Configuration and Authentication
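2.1 Basic provider configuration
The examples here pin the Terraform Grafana Provider to `~> 2.0`, which supports both Grafana OSS and Grafana Cloud. Newer major versions have been released since, so check the Terraform Registry and the provider changelog before choosing a constraint.

    # versions.tf
    terraform {
      required_version = ">= 1.5.0"

      required_providers {
        grafana = {
          source  = "grafana/grafana"
          version = "~> 2.0" # or a newer major version once you have validated it
        }
      }
    }

    # provider.tf
    provider "grafana" {
      url  = var.grafana_url
      auth = var.grafana_auth # a Service Account token is recommended
    }

Authentication methods, in order of preference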
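- Service Account token (recommended for production): create a Service Account in Grafana, assign it the Viewer, Editor, or Admin role, then generate a token
- API key (being phased out in favor of Service Accounts)
- Basic auth: `admin:password` (bootstrap or local testing only)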
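
The provider block in this section reads `var.grafana_url` and `var.grafana_auth` without showing their declarations. A minimal sketch of what they might look like follows; the descriptions are illustrative, and marking the token `sensitive` is an assumption about how you want it handled rather than something the provider requires.

```hcl
# variables.tf (sketch): inputs consumed by the provider block

variable "grafana_url" {
  description = "Base URL of the Grafana instance"
  type        = string
}

variable "grafana_auth" {
  description = "Service Account token, or user:password for basic auth"
  type        = string
  sensitive   = true # redacted in plan and apply output
}
```

With this in place the token can be supplied from the environment, for example as `TF_VAR_grafana_auth`, so it never needs to be written into a `.tfvars` file.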
2.2 Managing multiple instances (provider aliases)
To manage several Grafana environments (for example a production Grafana Cloud stack plus a dev OSS instance):

    provider "grafana" {
      alias = "production"
      url   = "https://my-stack.grafana.net/"
      auth  = var.grafana_prod_token
    }

    provider "grafana" {
      alias = "staging"
      url   = "https://staging.grafana.local/"
      auth  = var.grafana_staging_token
    }

    # Usage
    resource "grafana_folder" "prod_infra" {
      provider = grafana.production
      title    = "Infrastructure"
    }

    resource "grafana_folder" "staging_infra" {
      provider = grafana.staging
      title    = "Infrastructure"
    }

2.3 Grafana Cloud-specific configuration
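Grafana Cloud needs an additional Cloud Access Policy token to manage Stacks, Synthetic Monitoring, and other cloud-level resources:

    provider "grafana" {
      alias = "cloud"

      # Cloud-level resources authenticate with a Cloud Access Policy token,
      # passed through its own argument rather than url/auth
      cloud_access_policy_token = var.grafana_cloud_api_key

      # Synthetic Monitoring only
      sm_access_token = var.grafana_sm_token
    }

3. Managing Dashboards in Depth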
Dashboards are the most complex resource type in Grafana. Terraform takes the complete dashboard model JSON through the `config_json` argument.
3.1 Directory layout and file organization

    grafana-terraform/
    ├── modules/
    │   └── dashboard-stack/
    │       ├── main.tf
    │       ├── variables.tf
    │       └── outputs.tf
    ├── dashboards/
    │   ├── platform/
    │   │   ├── cluster-overview.json
    │   │   └── node-exporter.json
    │   ├── application/
    │   │   ├── api-gateway.json
    │   │   └── payment-service.json
    │   └── templates/
    │       └── service-overview.json.tpl
    ├── environments/
    │   ├── production/
    │   │   ├── main.tf
    │   │   └── terraform.tfvars
    │   └── staging/
    │       ├── main.tf
    │       └── terraform.tfvars
    └── global/
        ├── folders.tf
        ├── datasources.tf
        └── permissions.tf

3.2 Importing dashboard JSON in bulk
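Use `for_each` with `fileset` to manage dashboards in bulk instead of repeating a resource block per dashboard:

    # dashboards.tf
    locals {
      # Map each dashboards/<dir>/ subdirectory to its Grafana folder
      dashboard_folders = {
        "platform"    = grafana_folder.platform.id
        "application" = grafana_folder.application.id
      }
    }

    resource "grafana_dashboard" "all" {
      # One instance per file; the parent directory selects the folder.
      # (Pairing every folder with every file via setproduct would deploy each
      # dashboard into every folder and collide on uid.)
      for_each = {
        for f in fileset("${path.module}/dashboards", "*/*.json") :
        "${dirname(f)}-${trimsuffix(basename(f), ".json")}" => {
          folder = local.dashboard_folders[dirname(f)]
          path   = "${path.module}/dashboards/${f}"
        }
        if contains(keys(local.dashboard_folders), dirname(f))
      }

      folder      = each.value.folder
      config_json = file(each.value.path)
      overwrite   = true
    }

3.3 Preparing exported dashboard JSON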
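JSON exported from the Grafana UI has to be cleaned up before Terraform can use it:

    # Cleanup: drop id and version; uid is left untouched
    jq 'del(.id, .version)' exported.json > clean.json

Key fields
- `id`: must be removed; Grafana assigns it automatically
- `version`: must be removed to avoid version conflicts
- `uid`: must be kept and held fixed; it identifies the dashboard across updates
- `datasource.uid`: prefer referencing the Terraform data source resource over hard-coding a value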
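
These field rules can also be enforced at plan time so that an uncleaned export fails early. The following is a sketch only, assuming Terraform 1.4 or later (for `terraform_data`) and the `dashboards/<folder>/<name>.json` layout used in this guide; the names `dashboard_models` and `dashboard_lint` are hypothetical.

```hcl
locals {
  dashboard_dir = "${path.module}/dashboards"

  # Decode every dashboards/<folder>/<name>.json into an object
  dashboard_models = {
    for f in fileset(local.dashboard_dir, "*/*.json") :
    f => jsondecode(file("${local.dashboard_dir}/${f}"))
  }
}

# Creates nothing in Grafana; it exists only to carry the checks
resource "terraform_data" "dashboard_lint" {
  for_each = local.dashboard_models

  lifecycle {
    precondition {
      # uid must be present and non-empty
      condition     = try(length(each.value.uid) > 0, false)
      error_message = "Dashboard ${each.key} has no fixed uid."
    }

    precondition {
      # id and version must be absent (or null)
      condition     = try(each.value.id, null) == null && try(each.value.version, null) == null
      error_message = "Dashboard ${each.key} still contains id or version; clean it with jq first."
    }
  }
}
```

A failed precondition stops `terraform plan` with the message above, which keeps the check inside the same workflow as the deployment.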
3.4 Parameterizing with templatefile
For dashboards that share a structure but differ in metrics (for example a uniform view per microservice), use a Terraform template.
templates/service-overview.json.tpl:

    {
      "title": "${service_name} Overview",
      "uid": "svc-${service_name}",
      "panels": [
        {
          "title": "Request Rate",
          "targets": [
            {
              "expr": "rate(http_requests_total{service=\"${service_name}\"}[$__rate_interval])"
            }
          ]
        }
      ]
    }

main.tf:

    resource "grafana_dashboard" "services" {
      for_each = toset(["api-gateway", "web-frontend", "worker", "billing"])

      folder = grafana_folder.application.id
      config_json = templatefile("${path.module}/templates/service-overview.json.tpl", {
        service_name = each.key
      })
    }

3.5 Grafonnet + Terraform hybrid workflow
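Hand-written JSON becomes hard to maintain for complex dashboards. A good option is to generate the JSON with Grafonnet (Jsonnet) and let Terraform handle deployment:

    # Workflow
    dashboards/*.jsonnet --[jsonnet]--> output/*.json --[terraform]--> Grafana

Jsonnet example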
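
    // dashboards/cluster-overview.jsonnet
    // Written against the current Grafonnet library (installed with jb);
    // the older grafonnet-lib exposes a different API.
    local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

    g.dashboard.new('Kubernetes Cluster Overview')
    + g.dashboard.withUid('k8s-cluster-overview')
    + g.dashboard.withTimezone('utc')
    + g.dashboard.withPanels([
      g.panel.timeSeries.new('CPU Usage')
      + g.panel.timeSeries.queryOptions.withTargets([
        g.query.prometheus.new(
          'prometheus',
          'sum(rate(container_cpu_usage_seconds_total[$__rate_interval])) by (namespace)'
        ),
      ])
      + g.panel.timeSeries.gridPos.withX(0)
      + g.panel.timeSeries.gridPos.withY(0)
      + g.panel.timeSeries.gridPos.withW(12)
      + g.panel.timeSeries.gridPos.withH(8),
    ])

CI integration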

    # .github/workflows/dashboards.yml
    - name: Generate Dashboards
      run: |
        jb install   # jsonnet-bundler installs the dependencies
        mkdir -p output
        for f in dashboards/*.jsonnet; do
          jsonnet -J vendor "$f" > "output/$(basename "$f" .jsonnet).json"
        done

    - name: Validate & Deploy
      run: |
        terraform init
        terraform plan
        terraform apply -auto-approve

4. Data Sources and Folders
4.1 Configuring data sources of every type
The `grafana_data_source` resource works with Prometheus, Elasticsearch, CloudWatch, Jaeger, Loki, Tempo, and dozens of other data source types.

    # datasources.tf
    resource "grafana_data_source" "prometheus" {
      type       = "prometheus"
      name       = "Prometheus"
      uid        = "prometheus-main" # fixed UID, referenced from dashboards
      url        = "http://prometheus.monitoring.svc:9090"
      is_default = true

      json_data_encoded = jsonencode({
        httpMethod        = "POST"
        manageAlerts      = true
        prometheusType    = "Prometheus"
        prometheusVersion = "2.40.0"
      })
    }

    resource "grafana_data_source" "cloudwatch" {
      type = "cloudwatch"
      name = "AWS CloudWatch"
      uid  = "cloudwatch-main"

      json_data_encoded = jsonencode({
        defaultRegion = "us-east-1"
        authType      = "default" # use the EC2 IAM role
      })
    }

    resource "grafana_data_source" "elasticsearch" {
      type          = "elasticsearch"
      name          = "Application Logs"
      uid           = "es-logs"
      url           = "https://es.example.com:9200"
      database_name = "[logs-]YYYY.MM.DD"

      json_data_encoded = jsonencode({
        esVersion                  = "8.0.0"
        timeField                  = "@timestamp"
        maxConcurrentShardRequests = 256
        logMessageField            = "message"
        logLevelField              = "level"
      })
    }

Key points
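- Always set `uid` explicitly; dashboards reference it through `${grafana_data_source.prometheus.uid}`
- Use `json_data_encoded` rather than the legacy `json_data` block to avoid provider version compatibility problems
- Data sources that sign requests with AWS SigV4 (for example on AWS Managed Grafana) need the SigV4 settings, such as `sigV4Auth` inside `json_data_encoded` (named `sigv4_auth` in the legacy `json_data` block)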
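
To make the SigV4 point concrete, here is a sketch of a Prometheus-type data source that signs its requests, as you might use for Amazon Managed Service for Prometheus. The variable `amp_workspace_url`, the UID, and the region are hypothetical, and the `sigV4*` keys are the Grafana data source JSON settings as I understand them; verify them against your Grafana version before relying on this.

```hcl
variable "amp_workspace_url" {
  description = "Query endpoint of the Prometheus-compatible workspace (hypothetical input)"
  type        = string
}

resource "grafana_data_source" "amp" {
  type = "prometheus"
  name = "Amazon Managed Prometheus"
  uid  = "amp-main" # fixed UID, as recommended above
  url  = var.amp_workspace_url

  json_data_encoded = jsonencode({
    httpMethod    = "POST"
    sigV4Auth     = true        # sign outgoing requests with SigV4
    sigV4AuthType = "default"   # use the AWS SDK default credential chain
    sigV4Region   = "us-east-1" # assumed region
  })
}
```

On self-hosted Grafana, SigV4 signing usually also has to be enabled in the server configuration, so treat the resource above as only one half of the setup.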
4.2 Folders and permissions

    # folders.tf
    resource "grafana_folder" "platform" {
      title = "Platform Engineering"
      uid   = "platform"
    }

    resource "grafana_folder" "application" {
      title = "Application Teams"
      uid   = "application"
    }

    # permissions.tf - folder-level permissions
    resource "grafana_folder_permission" "platform" {
      folder_uid = grafana_folder.platform.uid

      permissions {
        role       = "Viewer"
        permission = "View"
      }
      permissions {
        team_id    = grafana_team.sre.id
        permission = "Edit"
      }
      permissions {
        team_id    = grafana_team.platform.id
        permission = "Admin"
      }
    }

    # Fine-grained dashboard-level permissions
    resource "grafana_dashboard_permission" "sensitive" {
      dashboard_uid = grafana_dashboard.security_overview.uid

      permissions {
        team_id    = grafana_team.security.id
        permission = "View"
      }
    }

5. Alerting as Code
Grafana Alerting is the most involved part to manage with Terraform. It spans five resource types: contact points, notification policies, alert rules, mute timings, and message templates.
5.1 Contact points

    # alerting/contact-points.tf
    resource "grafana_contact_point" "email_ops" {
      name = "Operations Email"

      email {
        addresses    = ["ops@company.com", "sre@company.com"]
        single_email = true
        message      = "{{ template \"default.message\" . }}"
      }
    }

    resource "grafana_contact_point" "slack_alerts" {
      name = "Slack Alerts"

      slack {
        url       = var.slack_webhook_url
        recipient = "#alerts"
        title     = "{{ template \"default.title\" . }}"
        text      = "{{ template \"default.message\" . }}"
      }
    }

    resource "grafana_contact_point" "pagerduty_critical" {
      name = "PagerDuty Critical"

      pagerduty {
        integration_key = var.pagerduty_key
        severity        = "critical"
      }
    }

5.2 Message templates

    resource "grafana_message_template" "custom" {
      name     = "custom_alerts"
      template = <<EOT
    {{ define "custom_email.message" }}
    Alert: {{ .CommonLabels.alertname }}
    Severity: {{ .CommonLabels.severity }}
    Summary: {{ .CommonAnnotations.summary }}
    Runbook: {{ .CommonAnnotations.runbook_url }}
    {{ end }}
    EOT
    }

    # Reference the template from a contact point
    resource "grafana_contact_point" "email_custom" {
      name = "Custom Email"

      email {
        addresses = ["oncall@company.com"]
        message   = "{{ template \"custom_email.message\" . }}"
      }
    }

5.3 Mute timings

    resource "grafana_mute_timing" "weekends" {
      name = "No Weekends"

      intervals {
        weekdays = ["saturday", "sunday"]
      }
    }

    resource "grafana_mute_timing" "maintenance" {
      name = "Maintenance Window"

      intervals {
        weekdays = ["monday"]
        times {
          start = "02:00"
          end   = "04:00"
        }
      }
    }

5.4 Notification policy tree
⚠️ Critical warning: `grafana_notification_policy` is a singleton resource. Applying it overwrites the entire notification policy tree, so every policy must be fully defined in code.

    resource "grafana_notification_policy" "main" {
      group_by        = ["alertname", "grafana_folder", "severity"]
      contact_point   = grafana_contact_point.email_ops.name
      group_wait      = "30s"
      group_interval  = "5m"
      repeat_interval = "4h"

      # Critical alerts -> PagerDuty
      policy {
        matcher {
          label = "severity"
          match = "="
          value = "critical"
        }
        contact_point = grafana_contact_point.pagerduty_critical.name
        group_wait    = "10s"
        continue      = true # keep matching sibling policies
      }

      # Warnings -> Slack
      policy {
        matcher {
          label = "severity"
          match = "="
          value = "warning"
        }
        contact_point = grafana_contact_point.slack_alerts.name
      }

      # Development alerts -> muted on weekends
      policy {
        matcher {
          label = "environment"
          match = "="
          value = "development"
        }
        contact_point = grafana_contact_point.slack_alerts.name
        mute_timings  = [grafana_mute_timing.weekends.name]
      }
    }

5.5 Alert rule groups

    resource "grafana_rule_group" "platform" {
      name             = "platform_alerts"
      folder_uid       = grafana_folder.platform.uid
      interval_seconds = 60 # evaluate every 60s

      rule {
        name      = "High CPU Usage"
        condition = "B"

        data {
          ref_id = "A"
          relative_time_range {
            from = 300
            to   = 0
          }
          datasource_uid = grafana_data_source.prometheus.uid
          model = jsonencode({
            # The "> 80" comparison is left to the threshold expression B
            expr    = "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
            instant = true # a threshold needs a single value per series
            refId   = "A"
          })
        }

        data {
          ref_id = "B"
          relative_time_range {
            from = 0
            to   = 0
          }
          datasource_uid = "__expr__"
          model = jsonencode({
            type       = "threshold"
            expression = "A"
            conditions = [{
              evaluator = {
                type   = "gt"
                params = [80]
              }
            }]
          })
        }

        annotations = {
          summary     = "CPU usage above 80% on {{ $labels.instance }}"
          description = "Instance {{ $labels.instance }} has CPU usage of {{ $value }}%"
          runbook_url = "https://wiki.internal/runbooks/high-cpu"
        }

        labels = {
          severity = "critical"
          team     = "sre"
        }
      }
    }

Alert rule design notes
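- All rules in a `rule_group` share one evaluation interval and are evaluated together as a group
- Use `for_each` (or a `dynamic "rule"` block inside a group) to create similar alerts in bulk
- Point `datasource_uid` at the Terraform data source resource rather than hard-coding it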
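
The bulk-creation point can be sketched with a `dynamic "rule"` block that stamps out one rule per map entry. This is an illustration under assumptions: the map `threshold_alerts`, its PromQL expressions, and the thresholds are hypothetical, while `grafana_folder.platform` and `grafana_data_source.prometheus` refer to the resources defined earlier in this guide.

```hcl
locals {
  # Hypothetical alert definitions; each key becomes a rule name
  threshold_alerts = {
    "High CPU Usage" = {
      expr      = "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
      threshold = 80
    }
    "High Memory Usage" = {
      expr      = "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      threshold = 90
    }
  }
}

resource "grafana_rule_group" "node_thresholds" {
  name             = "node_thresholds"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60

  dynamic "rule" {
    for_each = local.threshold_alerts

    content {
      name      = rule.key
      condition = "B"
      for       = "5m"

      # Query A: one instant value per series
      data {
        ref_id         = "A"
        datasource_uid = grafana_data_source.prometheus.uid
        relative_time_range {
          from = 300
          to   = 0
        }
        model = jsonencode({
          refId   = "A"
          expr    = rule.value.expr
          instant = true
        })
      }

      # Expression B: fire when A exceeds the per-rule threshold
      data {
        ref_id         = "B"
        datasource_uid = "__expr__"
        relative_time_range {
          from = 0
          to   = 0
        }
        model = jsonencode({
          refId      = "B"
          type       = "threshold"
          expression = "A"
          conditions = [{
            evaluator = { type = "gt", params = [rule.value.threshold] }
          }]
        })
      }

      labels = {
        severity = "warning"
        team     = "sre"
      }
    }
  }
}
```

Adding an alert then means adding one map entry, and every rule in the group keeps the same structure and labels.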
6. SLOs and Synthetic Monitoring
6.1 SLO as code
Grafana Cloud SLOs can be managed through Terraform. After an SLO is created, the system automatically generates the associated recording rules, dashboard, and alerts.

    resource "grafana_slo" "api_availability" {
      name        = "API Availability"
      description = "99.9% availability target for API gateway"

      query {
        type = "ratio"
        ratio {
          # Ratio queries take counter metric selectors, not rate() expressions
          success_metric = "http_requests_total{status!~\"5..\"}"
          total_metric   = "http_requests_total"
        }
      }

      objectives {
        value  = 0.999
        window = "30d"
      }

      alerting {
        fastburn {
          annotation {
            key   = "severity"
            value = "critical"
          }
          label {
            key   = "team"
            value = "sre"
          }
        }
        slowburn {
          annotation {
            key   = "severity"
            value = "warning"
          }
        }
      }

      # Optional: place the SLO in a specific folder
      folder_uid = grafana_folder.slo.uid
    }

⚠️ Initialization pitfall: on a newly created Grafana Cloud stack the SLO feature has to be initialized manually first (open it once in the UI); otherwise the first Terraform apply fails. A `time_sleep` resource that delays creation, or an initialization script run beforehand, can work around this.

    resource "time_sleep" "wait_for_slo_init" {
      create_duration = "60s"
      depends_on      = [grafana_cloud_stack.main]
    }

6.2 Synthetic Monitoring

    data "grafana_synthetic_monitoring_probes" "main" {}

    resource "grafana_synthetic_monitoring_check" "homepage" {
      job       = "homepage"
      target    = "https://example.com"
      enabled   = true
      frequency = 60000 # 60s
      timeout   = 5000

      probes = [
        # probes is a map of probe name => ID, so take the first value
        values(data.grafana_synthetic_monitoring_probes.main.probes)[0],
      ]

      settings {
        http {
          method              = "GET"
          valid_status_codes  = [200]
          valid_http_versions = ["HTTP/1.1", "HTTP/2"]
        }
      }
    }

7. IAM and Organization Structure
7.1 Users and teams

    # iam.tf
    # Note: grafana_user goes through Grafana's admin API, which requires
    # basic auth rather than a Service Account token.
    resource "grafana_user" "developers" {
      for_each = toset([
        "alice@company.com",
        "bob@company.com",
        "charlie@company.com",
      ])

      email    = each.value
      login    = split("@", each.value)[0]
      password = random_password.user_passwords[each.value].result
      is_admin = false
    }

    resource "random_password" "user_passwords" {
      for_each = toset(["alice@company.com", "bob@company.com", "charlie@company.com"])

      length  = 16
      special = true
    }

    resource "grafana_team" "sre" {
      name  = "SRE Team"
      email = "sre@company.com"
      members = [
        grafana_user.developers["alice@company.com"].email,
        grafana_user.developers["bob@company.com"].email,
      ]
    }

    resource "grafana_team" "platform" {
      name  = "Platform Team"
      email = "platform@company.com"
      members = [
        grafana_user.developers["charlie@company.com"].email,
      ]
    }

7.2 Organizations and multi-tenancy

    resource "grafana_organization" "engineering" {
      name = "Engineering"
    }

    provider "grafana" {
      alias  = "engineering"
      url    = var.grafana_url
      org_id = grafana_organization.engineering.org_id
      auth   = var.grafana_auth
    }

    resource "grafana_folder" "eng_infra" {
      provider = grafana.engineering
      title    = "Infrastructure"
    }

8. Multi-Environment Strategy
8.1 Terraform workspaces
Use Terraform workspaces to isolate per-environment state:

    terraform workspace new production
    terraform workspace new staging
    terraform workspace new development

    # Select the provider settings by workspace
    locals {
      env = terraform.workspace

      # Add a "development" entry before using that workspace
      grafana_configs = {
        production = {
          url   = "https://my-stack.grafana.net/"
          token = var.grafana_prod_token
        }
        staging = {
          url   = "https://staging.grafana.local/"
          token = var.grafana_staging_token
        }
      }
    }

    provider "grafana" {
      url  = local.grafana_configs[local.env].url
      auth = local.grafana_configs[local.env].token
    }

8.2 Per-environment differences

    locals {
      environment_tags = {
        production = ["prod", "critical"]
        staging    = ["staging", "non-critical"]
      }
    }

    resource "grafana_dashboard" "overview" {
      folder = grafana_folder.main.id
      config_json = templatefile("${path.module}/dashboards/overview.json.tpl", {
        environment = local.env
        tags        = local.environment_tags[local.env]
        datasource  = grafana_data_source.prometheus.uid
      })
    }

8.3 Reusable modules

    # modules/monitoring-stack/main.tf
    variable "environment" {
      type = string
    }

    variable "prometheus_url" {
      type = string
    }

    resource "grafana_folder" "main" {
      title = "${var.environment} Monitoring"
    }

    resource "grafana_data_source" "prometheus" {
      type = "prometheus"
      name = "Prometheus ${var.environment}"
      url  = var.prometheus_url
    }

    resource "grafana_dashboard" "overview" {
      folder      = grafana_folder.main.id
      config_json = file("${path.module}/dashboards/overview.json")
    }

    output "folder_id" {
      value = grafana_folder.main.id
    }

    # environments/production/main.tf
    module "prod_monitoring" {
      source         = "../../modules/monitoring-stack"
      environment    = "Production"
      prometheus_url = "http://prometheus-prod.monitoring.svc:9090"
    }

9. State Management and Collaboration
9.1 Remote backend

    # backend.tf
    terraform {
      backend "s3" {
        bucket         = "mycompany-terraform-state"
        key            = "grafana/production/terraform.tfstate"
        region         = "us-east-1"
        encrypt        = true
        dynamodb_table = "terraform-locks"
      }
    }

9.2 Resource import strategy
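
On Terraform 1.5 and later, adopting an existing dashboard can also be declared in configuration with an `import` block, which lets the import show up in `terraform plan` and go through code review. A minimal sketch, assuming a dashboard whose UID in Grafana is `my-dashboard` and a folder resource named `grafana_folder.main`:

```hcl
# Declarative import: Terraform adopts the existing dashboard on the next apply
import {
  to = grafana_dashboard.my_dashboard
  id = "my-dashboard" # the dashboard UID in Grafana (assumed value)
}

resource "grafana_dashboard" "my_dashboard" {
  folder      = grafana_folder.main.id
  config_json = file("${path.module}/dashboards/my-dashboard.json")
}
```

Once the apply has succeeded the `import` block can be deleted; the resource stays in state.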
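A workflow for migrating a Grafana instance that is currently managed through the UI to Terraform:

    # 1. Export the dashboard JSON and clean it
    curl -s -H "Authorization: Bearer $TOKEN" \
      "$URL/api/dashboards/uid/my-dashboard" | \
      jq '.dashboard | del(.id, .version)' > dashboards/my-dashboard.json

    # 2. Write the Terraform resource
    resource "grafana_dashboard" "my_dashboard" {
      folder      = grafana_folder.main.id
      config_json = file("${path.module}/dashboards/my-dashboard.json")
    }

    # 3. Import it into Terraform state
    terraform import grafana_dashboard.my_dashboard <uid>
    terraform plan   # compare the diff and fill in the code

Bulk import script

    #!/bin/bash
    # import-all.sh
    # Assumes a resource "grafana_dashboard" "<uid>" block already exists for
    # every UID and that each UID is a valid Terraform identifier.
    uids=$(curl -s -H "Authorization: Bearer $TOKEN" \
      "$URL/api/search?type=dash-db&limit=1000" | jq -r '.[].uid')

    for uid in $uids; do
      echo "Importing dashboard: $uid"
      terraform import "grafana_dashboard.$uid" "$uid" 2>/dev/null || echo "Skipped $uid"
    done

10. The Complete CI/CD Pipeline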
10.1 GitHub Actions workflow

    # .github/workflows/grafana-terraform.yml
    name: Grafana Infrastructure as Code

    on:
      push:
        branches: [main]
        paths:
          - 'terraform/grafana/**'
          - 'dashboards/**'
      pull_request:
        paths:
          - 'terraform/grafana/**'
          - 'dashboards/**'

    env:
      TF_VAR_grafana_auth: ${{ secrets.GRAFANA_SERVICE_ACCOUNT_TOKEN }}

    jobs:
      validate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Setup Terraform
            uses: hashicorp/setup-terraform@v3
            with:
              terraform_version: "1.7.0"

          - name: Terraform Format Check
            working-directory: terraform/grafana
            run: terraform fmt -check -recursive

          - name: Terraform Init
            working-directory: terraform/grafana
            run: terraform init

          - name: Terraform Validate
            working-directory: terraform/grafana
            run: terraform validate

          - name: Generate Dashboards (Jsonnet)
            if: hashFiles('dashboards/**/*.jsonnet') != ''
            run: |
              go install github.com/google/go-jsonnet/cmd/jsonnet@latest
              go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
              jb install
              mkdir -p output
              for f in dashboards/*.jsonnet; do
                jsonnet -J vendor "$f" > "output/$(basename "$f" .jsonnet).json"
              done

          - name: Validate Dashboard JSON
            run: |
              shopt -s globstar nullglob   # let ** recurse; skip patterns with no match
              for f in output/*.json dashboards/**/*.json; do
                jq empty "$f"
              done

      plan:
        needs: validate
        if: github.event_name == 'pull_request'
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: hashicorp/setup-terraform@v3

          - name: Terraform Init & Plan
            working-directory: terraform/grafana
            run: |
              terraform init
              terraform plan -no-color -out=tfplan
              # Save a readable copy for the PR comment step
              terraform show -no-color tfplan > tfplan.stdout

          - name: Post Plan to PR
            uses: actions/github-script@v7
            with:
              script: |
                const fs = require('fs');
                const plan = fs.readFileSync('terraform/grafana/tfplan.stdout', 'utf8');
                github.rest.issues.createComment({
                  issue_number: context.issue.number,
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  body: `### Terraform Plan\n\`\`\`\n${plan}\n\`\`\``
                });

      deploy:
        needs: validate
        if: github.ref == 'refs/heads/main'
        runs-on: ubuntu-latest
        environment: production   # requires approval
        steps:
          - uses: actions/checkout@v4
          - uses: hashicorp/setup-terraform@v3

          - name: Terraform Init & Apply
            working-directory: terraform/grafana
            run: |
              terraform init
              terraform apply -auto-approve

10.2 Approval and rollback
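- Plan stage: runs automatically on every PR, and the result is posted as a PR comment
- Apply stage: triggered after merging to `main`, gated by manual approval through GitHub Environment protection rules
- Rollback: restore a previous state version, or use Git revert followed by a re-apply
- Dashboard-only changes: the workflow triggers only when watched paths such as `dashboards/**` change, which cuts unrelated builds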
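11. Best Practices and Common Pitfalls
11.1 Core best practices
| Practice | Description |
|---|---|
| Fixed UIDs | Set `uid` explicitly on dashboards, folders, and data sources to avoid duplicate creation |
| Remove id/version | Delete the `id` and `version` fields when importing JSON |
| Lock UI edits | In production keep `disable_provenance = false` (the default) on alerting resources so Terraform remains the single source of truth |
| Isolate secrets | Declare webhook URLs, PagerDuty keys, and passwords as `sensitive = true` variables and inject them through environment variables |
| Branch protection | Block direct pushes to `main`; require a PR plus code review |
| State locking | Use DynamoDB or Terraform Cloud to prevent concurrent operations |
| Module reuse | Package the common monitoring stack as a module and reuse it across environments |
| UTC timezone | Set `timezone: "utc"` on every dashboard to avoid timezone confusion |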
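
The state-locking row mentions Terraform Cloud, while the backend shown in this guide uses S3 with DynamoDB. For comparison, a sketch of the Terraform Cloud variant follows; the organization and workspace names are hypothetical placeholders.

```hcl
# backend.tf (alternative): state storage and locking handled by Terraform Cloud
terraform {
  cloud {
    organization = "my-company" # hypothetical organization name

    workspaces {
      name = "grafana-production" # hypothetical workspace name
    }
  }
}
```

A configuration can have either a `backend` block or a `cloud` block, not both, so this replaces the S3 backend rather than sitting next to it.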
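11.2 Common pitfalls and fixes
| Problem | Cause | Fix |
|---|---|---|
| Contact point deletion fails with 409 | It is still referenced by the notification policy | Update the policy to remove the reference first, then delete the contact point; or design the tree to avoid circular dependencies |
| Data source reference breaks | A hard-coded UID does not match the environment | Reference `grafana_data_source.xxx.uid` dynamically |
| SLO creation fails the first time | The Grafana Cloud SLO feature has not been initialized | Initialize it once manually in the UI, or delay with `time_sleep` |
| Duplicate dashboards | UID conflict, or no UID set | Give every dashboard a fixed UID |
| Alert rule evaluation errors | Misconfigured `__expr__` data source | Follow the `ref_id` and `datasource_uid = "__expr__"` conventions exactly |
| Terraform plan drifts frequently | Manual changes in the UI | Keep `disable_provenance = false` so provisioned alerting resources cannot be edited in the UI |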
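
As an illustration of the provenance setting, the sketch below declares one contact point that stays locked to Terraform and one that deliberately opts out. Both resources and their addresses are hypothetical; `disable_provenance` is set explicitly only to make the default visible.

```hcl
# Locked: Grafana marks it as provisioned, so the UI cannot change it
resource "grafana_contact_point" "locked_email" {
  name               = "Oncall Email (Terraform-managed)"
  disable_provenance = false # the default, written out for clarity

  email {
    addresses = ["oncall@company.com"]
  }
}

# Opt-out: UI edits are allowed, at the cost of possible drift in later plans
resource "grafana_contact_point" "sandbox_email" {
  name               = "Sandbox Email (UI-editable)"
  disable_provenance = true

  email {
    addresses = ["sandbox@company.com"]
  }
}
```

Keeping the opt-out limited to clearly named sandbox resources makes any remaining plan drift easy to trace.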
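11.3 Monitoring Terraform itself
It is worth bringing Terraform-driven changes into the audit trail as well:

    # Record deployment information from Terraform
    resource "grafana_annotation" "deployment" {
      # timestamp() changes on every run, so this resource shows a diff in every plan
      text          = "Terraform apply: ${timestamp()}"
      dashboard_uid = grafana_dashboard.overview.uid
      tags          = ["terraform", "deployment"]
    }

12. Complete Project Layout Example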

    grafana-infrastructure/
    ├── README.md
    ├── .github/
    │   └── workflows/
    │       └── grafana-terraform.yml
    ├── modules/
    │   ├── monitoring-stack/
    │   │   ├── main.tf
    │   │   ├── variables.tf
    │   │   ├── outputs.tf
    │   │   └── dashboards/
    │   │       └── overview.json
    │   └── alerting-policy/
    │       ├── main.tf
    │       ├── variables.tf
    │       └── outputs.tf
    ├── environments/
    │   ├── production/
    │   │   ├── main.tf
    │   │   ├── backend.tf
    │   │   └── terraform.tfvars
    │   └── staging/
    │       ├── main.tf
    │       ├── backend.tf
    │       └── terraform.tfvars
    ├── global/
    │   ├── providers.tf
    │   ├── versions.tf
    │   ├── variables.tf
    │   ├── folders.tf
    │   ├── datasources.tf
    │   ├── permissions.tf
    │   ├── iam.tf
    │   └── alerting/
    │       ├── contact-points.tf
    │       ├── notification-policy.tf
    │       ├── mute-timings.tf
    │       ├── templates.tf
    │       └── rule-groups.tf
    ├── dashboards/
    │   ├── jsonnet/
    │   │   ├── lib/
    │   │   ├── cluster-overview.jsonnet
    │   │   └── service-detail.jsonnet
    │   └── json/            # final JSON, CI-generated or hand-written
    │       ├── platform/
    │       └── application/
    └── scripts/
        ├── import-dashboards.sh
        └── validate-json.sh

13. Summary
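The Terraform route to Grafana IaC is the best fit for teams that already run Terraform workflows. Its core value lies in:
- Full resource coverage: dashboards, data sources, alerts, SLOs, Synthetic Monitoring, and IAM under unified management
- Environment consistency: workspaces plus modules replicate the setup across environments
- Auditable changes: Git history plus Terraform plans give a complete review trail
- Disaster recovery: the whole Grafana configuration can be rebuilt from Git and state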
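Suggested rollout
- Week 1: set up the provider and import existing data sources and folders
- Weeks 2-3: bulk-import dashboards and establish the Jsonnet/Terraform hybrid workflow
- Week 4: migrate alerting (contact points → policy → rule groups)
- Week 5: bring in SLOs, Synthetic Monitoring, and IAM
- Week 6: finish CI/CD, state locking, the approval flow, and documentation
This approach turns Grafana from a hand-configured UI tool into a version-controlled, reviewable, automatable infrastructure component, closing the GitOps loop for the monitoring stack.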
