# N9E Alert Rule Tiering and Optimization Recommendations

| Item | Value |
|------|-------|
| Data source | N9E API `GET /api/n9e/busi-group/{id}/alert-rules` |
| N9E address | https://n9e.icbc.com |
| Total rules | **222** (197 enabled / 25 disabled) |
| Business groups | 17 |

## Table of Contents

- [Summary Statistics](#summary-statistics)
- [P0-Critical (116 rules)](#p0-critical-116-rules)
- [P1-Warning (86 rules)](#p1-warning-86-rules)
- [P2-Info (20 rules)](#p2-info-20-rules)
- [Coverage Gap Analysis](#coverage-gap-analysis)

---

## Summary Statistics

### By Severity

| Severity | Enabled | Disabled | Total | Share |
|------|------|------|------|------|
| **P0-Critical** | 106 | 10 | 116 | 52% |
| **P1-Warning** | 73 | 13 | 86 | 39% |
| **P2-Info** | 18 | 2 | 20 | 9% |
| **Total** | **197** | **25** | **222** | 100% |

### By Business Group

| Business Group | P0 | P1 | P2 | Total | Enabled | Disabled |
|--------|----|----|----|----|------|------|
| **AM** | 4 | 0 | 0 | 4 | 3 | 1 |
| **DMA** | 1 | 0 | 0 | 1 | 0 | 1 |
| **DataCenter** | 5 | 0 | 0 | 5 | 5 | 0 |
| **Infra/AccessLog** | 1 | 1 | 0 | 2 | 2 | 0 |
| **Infra/DevOps** | 4 | 2 | 0 | 6 | 6 | 0 |
| **Infra/EC2** | 14 | 20 | 9 | 43 | 35 | 8 |
| **Infra/K8S** | 13 | 16 | 6 | 35 | 34 | 1 |
| **Infra/Kafka** | 5 | 11 | 0 | 16 | 16 | 0 |
| **Infra/Monitoring** | 14 | 14 | 0 | 28 | 20 | 8 |
| **Infra/RDS** | 7 | 8 | 1 | 16 | 16 | 0 |
| **Infra/Redis** | 5 | 7 | 1 | 13 | 13 | 0 |
| **OTC** | 2 | 0 | 0 | 2 | 2 | 0 |
| **Prime/Custody** | 0 | 1 | 0 | 1 | 1 | 0 |
| **Prime/EMS/mds** | 6 | 2 | 3 | 11 | 8 | 3 |
| **Prime/EMS/rapidtrade** | 12 | 0 | 0 | 12 | 11 | 1 |
| **Prime/OMS** | 21 | 4 | 0 | 25 | 23 | 2 |
| **Security** | 2 | 0 | 0 | 2 | 2 | 0 |

### By Monitoring Type

| Monitoring Type | Count |
|---------|------|
| AWS CloudWatch metrics | 49 |
| Host metrics (node_exporter) | 48 |
| Custom application metrics | 40 |
| Probe/connectivity checks | 27 |
| K8s/container metrics | 25 |
| Log alerts (Loki LogQL) | 23 |
| Kafka metrics (JMX) | 5 |
| Database metrics | 3 |
| Other / no PromQL | 2 |

---

## P0-Critical (116 rules)

### AM (4 rules, 3 enabled)

#### 1. Logs-AM ALERTERROR

- **Level**: P0-Critical | **Status**: Disabled | **ID**: 160
- **Config**: eval interval: 15s | type: loki
- **Note**: Log alert for AM-related services; keyword ALERTERROR

**LogQL**:

```logql
sum by (app, truncated_message) (
  count_over_time(
    {app=~"(ltp-am|ltp-am-mgt|bouncebit-am|bitu|am|ltp-spec-acc|am-client-service)",env="prod"}
      |= "ERROR"
      !~ "(NacosServiceDiscovery|NettyConnectionClient|NettyConnectionHandler)"
      | pattern `<message>`
      | line_format "{{substring .message 0 100}}"
      | label_format truncated_message="{{__line__}}"
    [5m]
  )
)
```

---

#### 2. Logs-AM ALERTERROR日志告警

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 202
- **Config**: eval interval: 15s | duration: 120s | type: loki
- **Note**: Log alert for AM-related services; keyword ALERTERROR

**LogQL**:

```logql
sum by (app, error_snippet) (
  count_over_time(
    {app=~"(ltp-am|ltp-am-mgt|bouncebit-am|bitu|am|ltp-spec-acc|am-client-service)",env="prod"}
      |= "ALERTERROR"
      !~ "(NacosServiceDiscovery|NettyConnectionClient|NettyConnectionHandler)"
      | pattern `<message>`
      | line_format "{{.message}}"
      | regexp `(?P<error_snippet>(?s:ALERTERROR.*?\n.*?\n.*?))`
      | line_format "{{substring .error_snippet 0 300}}"
    [5m]
  )
)
```

---

#### 3. Logs-AM-电话告警

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 204
- **Config**: eval interval: 15s | type: loki
- **Note**: Phone alert on: flexible-savings subscription failure; flexible-savings redemption failure; scheduled job calcHourDividend; CashAccount update failure after a successful Funding transfer-in response; CashAccount unfreeze failure after a successful Funding transfer-out response; transferCallBack processing failure

**LogQL**:

```logql
sum(
  count_over_time(
    {app=~"(ltp-am|ltp-am-mgt|bouncebit-am|bitu|am|ltp-spec-acc|am-client-service)",env="prod"}
      |~ "(PhoneAlert|活期申购失败|活期赎回失败|定时任务:calcHourDividend|Funding划转,转入响应成功后,CashAccount变动处理失败|Funding划转,转出响应成功后,CashAccount释放冻结失败|transferCallBack 处理失败)"
      !~ "(NacosServiceDiscovery|NettyConnectionClient|NettyConnectionHandler)"
      | pattern `<message>`
      | line_format "{{substring .message 0 100}}"
      | label_format truncated_message="{{__line__}}"
    [5m]
  )
) > 1
```

---

#### 4. XXLJOB任务执行失败-AM

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 269
- **Config**: eval interval: 15s | type: mysql
- **Note**: XXLJOB task execution failure - AM - prod - dedicated asset-management accounts

> No PromQL (likely an N9E built-in or event-type rule)

---

### DMA (1 rule, 0 enabled)

#### 5. DMA-Lighter Runbot process is Down

- **Level**: P0-Critical | **Status**: Disabled | **ID**: 192
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Process is down, or fewer than 2 lighter processes

**PromQL**:

```promql
up{job="lighter_runbot_process"} == 0
or namedprocess_namegroup_num_procs{groupname="lighter"} < 2
```

---

### DataCenter (5 rules, 5 enabled)

#### 6. Datacenter-Kafka is Down

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 184
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Datacenter Kafka is down, please check

**PromQL**:

```promql
aws_kafka_memory_free_average{dimension_Cluster_Name="aws-jp-prod-ltp-datacenter-kafka"} == 0
```

---

#### 7. Datacenter-RDS CPU波动率大于30%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 199
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Datacenter RDS CPU > 90% and the 15-minute swing is greater than 30%

**PromQL**:

```promql
aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier="ltp-data-prod-rds"} >= 90
and
(
  aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier="ltp-data-prod-rds"}
  - aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier="ltp-data-prod-rds"} offset 15m
) > 30
```

---

#### 8. Datacenter-RedShift CPU max 大于 99%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 284
- **Config**: eval interval: 15s | type: prometheus
- **Note**: `aws_redshift_cpuutilization_maximum` above 99% for 30 consecutive minutes

**PromQL**:

```promql
avg_over_time(aws_redshift_cpuutilization_maximum{dimension_ClusterIdentifier="ltp-prod-bigdata-redshift"}[30m]) > 99
and min_over_time(aws_redshift_cpuutilization_maximum{dimension_ClusterIdentifier="ltp-prod-bigdata-redshift"}[30m]) > 99
```

---

#### 9. Datacenter-Redis is Down

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 183
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Datacenter Redis is down, please check

**PromQL**:

```promql
aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheClusterId="aws-jp-prod-ltp-datacenter-redis-001"} == 0
```

---

#### 10. Datacenter-Redshift HealthStatus is Down

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 161
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Redshift is down, please check

**PromQL**:

```promql
aws_redshift_health_status_average{dimension_ClusterIdentifier="ltp-prod-bigdata-redshift"} == 0
```

---

### Infra/AccessLog (1 rule, 1 enabled)

#### 11. 生产环境-Nginx Status Code 499/5xx 过去 5min 大于 100

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 162
- **Config**: eval interval: 15s | duration: 200s | type: loki
- **Note**: Prod Nginx 499/5xx status-code count over the last 5 min above threshold

**LogQL**:

```logql
count by (env, site, status) (
  count_over_time({job="nginx",status=~"499|50+",env!~"dev|sit|uat|fat|qa|mirror|nonprod"}[5m])
) > 10
```

---

### Infra/DevOps (4 rules, 4 enabled)

#### 12. JVM Heap Usage 大于 95%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 263
- **Config**: eval interval: 15s | duration: 120s | type: prometheus

**PromQL**:

```promql
jvm_memory_used_bytes{env=~"prod.*"} / jvm_memory_max_bytes{env=~"prod.*"} * 100 > 95
```

---

#### 13. cam.liquiditytech.com ping is unhealth

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 215
- **Config**: eval interval: 15s | duration: 120s | type: prometheus
- **Note**: https://cam.liquiditytech.com/api/v1/httpmisc/ping is unhealthy

**PromQL**:

```promql
up{job='cam-ping-check'} == 0
```

---

#### 14. cam.liquiditytech.com site is unhealth

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 214
- **Config**: eval interval: 15s | duration: 120s | type: prometheus
- **Note**: https://cam.liquiditytech.com is unhealthy

**PromQL**:

```promql
up{job="cam"} == 0
```

---

#### 15. 生产日志 Erorr 计数短时间突增-S1

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 289
- **Config**: eval interval: 15s | duration: 120s | type: prometheus

**PromQL**:

```promql
prod_logs_error_count_1m{app!~`(loki|litellm|dolphinscheduler)`} > 1000
and (
  prod_logs_error_count_1m{app!~`(loki|litellm|dolphinscheduler)`}
  / clamp_min(avg_over_time(prod_logs_error_count_1m{app!~`(loki|litellm|dolphinscheduler)`}[15m]), 1)
) > 6
```

---

### Infra/EC2 (14 rules, 12 enabled)

#### 16. AWS EC2 P0 维护事件

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 125
- **Config**: eval interval: 15s | duration: 180s | type: prometheus

**PromQL**:

```promql
avg_over_time(aws_health_event_info{exported_service="EC2",status="upcoming",event_code=~"AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED|AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED|AWS_EC2_INSTANCE_REBOOT_FLEXIBLE_MAINTENANCE_SCHEDULED|AWS_EC2_INSTANCE_STOP_SCHEDULED"}[1h]) > 0
```

---

#### 17. AWS_EC2_INSTANCE_REBOOT_FLEXIBLE_MAINTENANCE_SCHEDULED

- **Level**: P0-Critical | **Status**: Disabled | **ID**: 126
- **Config**: eval interval: 15s | type: prometheus
- **Note**: AWS_EC2_INSTANCE_REBOOT_FLEXIBLE_MAINTENANCE_SCHEDULED > 0

**PromQL**:

```promql
avg_over_time(aws_health_event_info{event_code="AWS_EC2_INSTANCE_REBOOT_FLEXIBLE_MAINTENANCE_SCHEDULED",service="EC2",status="upcoming"}[1h]) > 0
```

---

#### 18. AWS_EC2_INSTANCE_STOP_SCHEDULED

- **Level**: P0-Critical | **Status**: Disabled | **ID**: 127
- **Config**: eval interval: 15s | type: prometheus
- **Note**: `avg_over_time(aws_health_event_info{env="prod",event_code=~"AWS_EC2_INSTANCE_STOP_SCHEDULED",service="EC2",status="upcoming"}[1h]) > 0`

**PromQL**:

```promql
avg_over_time(aws_health_event_info{event_code=~"AWS_EC2_INSTANCE_STOP_SCHEDULED",service="EC2",status="upcoming"}[1h]) > 0
```

---

#### 19.
**EC2-Available Memory 使用率大于:95%**

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 29
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 instance memory usage > 95%

**PromQL**:

```promql
(1 - (node_memory_MemAvailable_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) * 100 >= 95
and
(1 - (node_memory_MemAvailable_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) * 100 <= 100
```

---

#### 20. EC2-CPU负载变化量大于:40%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 242
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 CPU load changed by more than 40%

**PromQL**:

```promql
100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 >= 80
and
(
  (100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100)
  -
  (100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m] offset 5m)) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m] offset 5m)) by (Region,instance,Name,env,PrivateIpAddress) * 100)
) >= 40
```

---

#### 21. EC2-CPU负载大于:95%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 232
- **Config**: eval interval: 15s | type: prometheus
- **Note**: CPU load in the 95%-100% range

**PromQL**:

```promql
100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 >= 95
and
100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 < 100
```

---

#### 22. EC2-Connect Limit 连接限制大于:90%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 234
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 conntrack table usage > 90%; connection count approaching the limit

**PromQL**:

```promql
node_nf_conntrack_entries{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}
/ node_nf_conntrack_entries_limit{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"} > 0.9
```

---

#### 23. EC2-Disk avail_bytes-Prod 磁盘使用率大于:90%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 27
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 instance disk usage above 90%

**PromQL**:

```promql
(1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",env!~"dev|sit|uat|fat|mirror|qa"} / node_filesystem_size_bytes{env!~"dev|sit|uat|fat|mirror|qa"})) >= 0.90
and (1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",env!~"dev|sit|uat|fat|mirror|qa"} / node_filesystem_size_bytes{env!~"dev|sit|uat|fat|mirror|qa"})) <= 1
and on (instance, device, mountpoint) node_filesystem_readonly == 0
```

---

#### 24. EC2-MEM利用率变化量大于:40%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 244
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 instance memory utilization changed by more than 40%

**PromQL**:

```promql
(1 - (node_memory_MemAvailable_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) * 100 >= 80
and
(
  (1 - (node_memory_MemAvailable_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) * 100
  - (1 - (node_memory_MemAvailable_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} offset 1m / node_memory_MemTotal_bytes{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} offset 1m)) * 100
) >= 40
```

---

#### 25. EC2-NetworkReceiveErrors 主机网络接收异常大于:1%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 16
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 host network receive errors

**PromQL**:

```promql
(rate(node_network_receive_errs_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[2m])
/ rate(node_network_receive_packets_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[2m]) > 0.01)
```

---

#### 26. EC2-NetworkTransmitErrors 主机网络传输错误大于:1%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 15
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 host network transmit errors

**PromQL**:

```promql
(rate(node_network_transmit_errs_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[2m])
/ rate(node_network_transmit_packets_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[2m]) > 0.01)
```

---

#### 27. EC2-OomKillDetected-Prod 检测到OOM终止

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 14
- **Config**: eval interval: 15s | duration: 120s | type: prometheus
- **Note**: OOM kill detected on an EC2 instance; investigate immediately

**PromQL**:

```promql
(increase(node_vmstat_oom_kill{InstanceType!="",env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""}[2m])) >= 1
```

---

#### 28. EC2-OutOf available inodes使用率大于:95%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 33
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Disk is almost running out of available inodes (< 10% left)

**PromQL**:

```promql
(1 - (node_filesystem_files_free{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_filesystem_files{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) >= 0.95
and (1 - (node_filesystem_files_free{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_filesystem_files{env!~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) <= 1
and on (instance, device, mountpoint) node_filesystem_readonly == 0
```

---

#### 29. EC2-node_exporter is Down-Prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 190
- **Config**: eval interval: 15s | type: prometheus
- **Note**: EC2 host offline or unresponsive for the last 2 minutes

**PromQL**:

```promql
max_over_time(up{job="aws-ec2-nodes",PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[2h]) == 1
and min_over_time(up{job="aws-ec2-nodes",PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[2m]) == 0
```

---

### Infra/K8S (13 rules, 12 enabled)

#### 30. Container-HighCpuUtilization CPU使用率大于:95%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 223
- **Config**: eval interval: 15s | duration: 300s | type: prometheus
- **Note**: Container CPU utilization above 95%

**PromQL**:

```promql
(sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,namespace,pod,container)
/ sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100 >= 95
and
(sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,namespace,pod,container)
/ sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100 < 100
```

---

#### 31.
**Container-HighLowChangeCpuUsage CPU波动率大于:40%-prod**

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 44
- **Config**: eval interval: 15s | duration: 300s | type: prometheus
- **Note**: Watches the absolute change in CPU usage over a 5m window and alerts when the change exceeds 40%

**PromQL**:

```promql
(sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,namespace,pod,container)
/ sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100 >= 80
and
(
  (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,namespace,pod,container)
  / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100
  -
  (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m] offset 5m)) by (cluster,namespace,pod,container)
  / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100
) >= 40
```

---

#### 32. Container-HighMemoryUsage 内存使用率大于: 95%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 225
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Container memory usage in the 95%-100% range

**PromQL**:

```promql
round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:]))/100 >= 95
and
round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:]))/100 <= 100
```

---

#### 33. Container-HighMemoryUsage 内存波动率大于: 40%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 246
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Container memory swing greater than 40%

**PromQL**:

```promql
round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:]))/100 >= 80
and
(
  round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:]))/100
  -
  round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} offset 5m) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:] offset 5m))/100
) >= 40
```

---

#### 34. Container-容器因OOM终止,请检查内存使用量

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 59
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Container killed by OOM; check usage and whether limits need adjusting

**PromQL**:

```promql
kube_pod_container_status_terminated_reason{reason='OOMKilled',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} >= 1
```

---

#### 35. Container-容器异常等待

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 60
- **Config**: eval interval: 15s | duration: 120s | type: prometheus
- **Note**: Container stuck in waiting state: {{ $labels.reason }}

**PromQL**:

```promql
kube_pod_container_status_waiting_reason{reason!~"ContainerCreating|PodInitializing",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} == 1
```

---

#### 36. Container-容器除非运行状态

- **Level**: P0-Critical | **Status**: Disabled | **ID**: 187
- **Config**: eval interval: 15s | duration: 120s | type: prometheus
- **Note**: Current phase: {{ $labels.phase }}

**PromQL**:

```promql
kube_pod_status_phase{phase!~"Running|Succeeded",cluster!~"LTP-EKS-informal|ltp-eks-uat|ltp-nonprod-eks",namespace!="sec-prod"} == 1
```

---

#### 37. K8s-Node NotReady 节点状态异常

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 108
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Prod K8s node in an abnormal (NotReady/unknown) state

**PromQL**:

```promql
(kube_node_status_condition{job!='aws-ec2-nodes',container="kube-state-metrics",condition="Ready",status="unknown",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) == 1
```

---

#### 38. K8s-Node-DiskUsed. 磁盘使用率大于:95%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 222
- **Config**: eval interval: 15s | duration: 180s | type: prometheus
- **Note**: K8s node disk utilization in the 95%-100% range

**PromQL**:

```promql
((node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} - node_filesystem_free_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) * 100 >= 95
and
((node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} - node_filesystem_free_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay",cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) * 100 <= 100
```

---

#### 39. K8s-Node-HighCpuUtilization CPU使用率大于:95%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 209
- **Config**: eval interval: 15s | duration: 600s | type: prometheus
- **Note**: Prod K8s node CPU utilization above 95%

**PromQL**:

```promql
100 - (avg(rate(node_cpu_seconds_total{job="node-exporter",mode='idle',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,instance,namespace,job) * 100) >= 95
and
100 - (avg(rate(node_cpu_seconds_total{job="node-exporter",mode='idle',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,instance,namespace,job) * 100) <= 100
```

---

#### 40. K8s-Node-MemUsed. 内存使用率大于 95%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 208
- **Config**: eval interval: 15s | duration: 120s | type: prometheus
- **Note**: Prod K8s node memory utilization in the 95%-100% range

**PromQL**:

```promql
100 * (1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes{job='node-exporter',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} + node_memory_Cached_bytes{job='node-exporter',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / node_memory_MemTotal_bytes{job='node-exporter',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) >= 95
and
100 * (1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes{job='node-exporter',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} + node_memory_Cached_bytes{job='node-exporter',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / node_memory_MemTotal_bytes{job='node-exporter',cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) <= 100
```

---

#### 41. PVC-persistent VolumeClaim 大于90%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 237
- **Config**: eval interval: 15s | duration: 180s | type: prometheus
- **Note**: PVC storage usage > 90%

**PromQL**:

```promql
sum by (namespace, persistentvolumeclaim,cluster) (kubelet_volume_stats_used_bytes{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} / kubelet_volume_stats_capacity_bytes{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) * 100 >= 90
```

---

#### 42. Prod EKS-node-exporter is Down

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 271
- **Config**: eval interval: 15s | duration: 180s | type: prometheus
- **Note**: Prod node-exporter is down. Please check

**PromQL**:

```promql
up{job=~"node_exporter|node-exporter",cluster=~"ltp-eks-prod|aws-jp-prod-ltp-infra-eks"} == 0
```

---

### Infra/Kafka (5 rules, 5 enabled)

#### 43. Kafka broker Down Instance大于2

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 274
- **Config**: eval interval: 15s | type: prometheus
- **Note**: 2 or more Kafka broker instances down

**PromQL**:

```promql
count by (env,cluster_name) (probe_success{job=~"ltp_prod_kafka_tcp"} == 0) >= 2
```

---

#### 44.
**Kafka-memory free is 0**

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 197
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Kafka is almost down, please check

**PromQL**:

```promql
aws_kafka_memory_free_average{tag_env!~"dev|sit|uat|fat|qa|mirror"} == 0
```

---

#### 45. Prod-Kafaka memory_free可用内存小于:80M

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 80
- **Config**: eval interval: 15s | type: prometheus
- **Note**: `aws_kafka_memory_free_average{tag_env=~"prod.*"} / 2 ^ 20 < 80`

**PromQL**:

```promql
aws_kafka_memory_free_average{tag_env=~"prod.*"} / 2 ^ 20 < 80
```

---

#### 46. Prod-Kafaka-cpu_user_average使用率大于:85%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 79
- **Config**: eval interval: 15s | type: prometheus
- **Note**: `aws_kafka_cpu_user_average{tag_env=~"prod.*"}`

**PromQL**:

```promql
aws_kafka_cpu_user_average{tag_env=~"prod.*"} >= 85
and aws_kafka_cpu_user_average{tag_env=~"prod.*"} <= 100
```

---

#### 47. Prod-Kafaka-data dish used使用率大于:90%-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 77
- **Config**: eval interval: 15s | type: prometheus
- **Note**: `aws_kafka_data_logs_disk_used_average{tag_env=~"prod.*"} >= 90`

**PromQL**:

```promql
aws_kafka_data_logs_disk_used_average{tag_env=~"prod.*"} >= 90
and aws_kafka_data_logs_disk_used_average{tag_env=~"prod.*"} <= 100
```

---

### Infra/Monitoring (14 rules, 11 enabled)

#### 48. ELB HTTP Code 5xx 大于 50

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 130
- **Config**: eval interval: 15s | duration: 150s | type: prometheus
- **Note**: ELB HTTP code 5xx > 50

**PromQL**:

```promql
avg_over_time(aws_applicationelb_httpcode_elb_5_xx_count_sum{dimension_AvailabilityZone!~".+",tag_env=~"prod.*"}[10m]) > 50
```

---

#### 49. ELB Target HTTP Code 5xx 大于:50

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 128
- **Config**: eval interval: 15s | type: prometheus
- **Note**: ELB target HTTP code 5xx > 50

**PromQL**:

```promql
avg_over_time(aws_applicationelb_httpcode_target_5_xx_count_sum{dimension_AvailabilityZone!~".+",dimension_TargetGroup=~".+",tag_env=~"prod.*",dimension_TargetGroup!="targetgroup/ltp-pb-prod-pb-api-public/8cf10e95aedaf08f"}[1m]) > 50
```

---

#### 50. ELB Target HTTP Code 5xx 大于 50 for 15min

- **Level**: P0-Critical | **Status**: Disabled | **ID**: 129
- **Config**: eval interval: 15s | type: prometheus
- **Note**: ELB target HTTP code 5xx > 50 for 15 min

**PromQL**:

```promql
avg_over_time(aws_applicationelb_httpcode_target_5_xx_count_sum{dimension_AvailabilityZone!~".+",dimension_TargetGroup=~".+",tag_env=~"prod.*"}[1m]) > 50
```

---

#### 51. ELB Target connection error count 大于 50

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 133
- **Config**: eval interval: 15s | type: prometheus
- **Note**: ELB target connection error count > 50

**PromQL**:

```promql
avg_over_time(aws_applicationelb_target_connection_error_count_sum{dimension_AvailabilityZone!~".+",dimension_TargetGroup=~".+",tag_env=~"prod.*"}[10m]) > 50
```

---

#### 52. ELB Target response time 大于 500ms for 20min

- **Level**: P0-Critical | **Status**: Disabled | **ID**: 132
- **Config**: eval interval: 15s | type: prometheus
- **Note**: ELB target response time > 500ms for 20 min

**PromQL**:

```promql
aws_applicationelb_target_response_time_average{dimension_AvailabilityZone!~".+",dimension_TargetGroup=~".+",tag_env=~"prod.*"} * 1000 > 200
```

---

#### 53. ELB Target response time 大于 500ms 超过 10min

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 259
- **Config**: eval interval: 15s | duration: 600s | type: prometheus
- **Note**: ELB target response time > 500ms

**PromQL**:

```promql
aws_applicationelb_target_response_time_average{dimension_AvailabilityZone!~".+",dimension_TargetGroup=~".+",dimension_TargetGroup!~".*metrics.*",tag_env=~"prod.*"} * 1000 > 500
```

---

#### 54. Prod Site-probe success is Unhealth

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 216
- **Config**: eval interval: 15s | duration: 60s | type: prometheus
- **Note**: The site is down. Please check

**PromQL**:

```promql
probe_success{env!~'dev|sit|uat|fat|mirror|informal',job!~".*mysql.*|.*redis.*|.*kafka.*"} == 0
```

---

#### 55. Prod-Node is Down

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 134
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Prod node is down. Please check

**PromQL**:

```promql
up{env!~'dev|sit|uat|uathk|fat|mirror|informal',job!~".*mysql.*|.*redis.*|.*kafka.*|aws-ec2-nodes|kubelet|kube-proxy|apiserver|rapidmarket-latency|.*metrics.*|.*tcp.*",cluster!~'ltp-eks-uat|ltp-nonprod-eks|LTP-EKS-informal',container="",instance!~".*pre.*|.*qa.*|.*:.*|.*cloudwatch.*|.*Dev.*|.*dev.*"} == 0
```

---

#### 56. Prometheus-remote write mimir 写入失败率大于:5%

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 71
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Prometheus remote write to Mimir failing, please check

**PromQL**:

```promql
((rate(prometheus_remote_storage_failed_samples_total{namespace="monitoring",cluster="ltp-eks-prod"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{namespace="monitoring",cluster="ltp-eks-prod"}[5m]))
/
((rate(prometheus_remote_storage_failed_samples_total{namespace="monitoring",cluster="ltp-eks-prod"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{namespace="monitoring",cluster="ltp-eks-prod"}[5m]))
+ (rate(prometheus_remote_storage_succeeded_samples_total{namespace="monitoring",cluster="ltp-eks-prod"}[5m]) or rate(prometheus_remote_storage_samples_total{namespace="monitoring",cluster="ltp-eks-prod"}[5m])))) * 100 >= 5
```

---

#### 57. Pushgateway-flink Target is Null

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 21
- **Config**: eval interval: 15s | duration: 300s | type: prometheus
- **Note**: `prometheus_sd_discovered_targets{cluster="aws-jp-prod-ltp-infra-eks", config="flink-pushgateway"}`

**PromQL**:

```promql
prometheus_sd_discovered_targets{cluster="aws-jp-prod-ltp-infra-eks", config="flink-pushgateway"} == 0
```

---

#### 58. Service is Down-infra

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 272
- **Config**: eval interval: 15s | duration: 30s | type: prometheus
- **Note**: Service is down, please check

**PromQL**:

```promql
up{env!~'dev|sit|uat|uathk|fat|informal|mirror|qa',job!~"kube-state-metrics|aws-ec2-nodes|kubelet|kube-proxy||apiserver|kube-prom-stack-kube-prome-operator|monitoring-kube-prometheus-operator|kube-prom-stack-kube-prome-prometheus|.*node.*exporter.*|coredns|.*mysql.*|.*redis.*|.*kafka.*|mimir/gateway",instance!="yet-another-cloudwatch-exporter",cluster=~'aws-jp-prod-ltp-infra-eks',endpoint!="http"} == 0
```

---

#### 59. Service is Down-prod

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 189
- **Config**: eval interval: 15s | duration: 35s | type: prometheus
- **Note**: Service is down, please check

**PromQL**:

```promql
up{env!~'dev|sit|uat|uathk|fat|informal|mirror|qa',job!~"kube-state-metrics|aws-ec2-nodes|kubelet|kube-proxy||apiserver|kube-prom-stack-kube-prome-operator|monitoring-kube-prometheus-operator|kube-prom-stack-kube-prome-prometheus|.*node.*exporter.*|coredns|.*mysql.*|.*redis.*|.*kafka.*",instance!="yet-another-cloudwatch-exporter",cluster!~'ltp-eks-uat|ltp-nonprod-eks|LTP-EKS-informal|aws-jp-prod-ltp-infra-eks',endpoint!="http"} == 0
```

---

#### 60. nitro enclaves Down

- **Level**: P0-Critical | **Status**: Enabled | **ID**: 120
- **Config**: eval interval: 15s | type: prometheus
- **Note**: Nitro enclaves down. Please check

**PromQL**:

```promql
probe_success{instance="http://sec.prod.internal.liquiditytech.com/health"} == 0
```

---

#### 61.
rapidx service down- **级别**: P0-Critical | **状态**: Disabled | **ID**: 92 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: rapidx service down**PromQL**:```promql up{env=~"prod|prodjp",job!="aws-ec2-nodes"} ==0 ```---### Infra/RDS (7 条, 启用 7)#### 62. RDS-Aws Mysql CPU波动率大于:30%- **级别**: P0-Critical | **状态**: Enabled | **ID**: 196 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: AWS RDS CPU>70% 且 波动率>30%(15min),请尽快查看**PromQL**:```promql aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier!~".*token.*|ltp-data-prod-rds"} >70and(aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier!~".*token.*|ltp-data-prod-rds"}- aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier!~".*token.*|ltp-data-prod-rds"} offset 15m) >30 ```---#### 63. RDS-Prime Mysql CPU 大于:90%- **级别**: P0-Critical | **状态**: Enabled | **ID**: 194 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: RDS-Prime Mysql CPU>=90% ,请尽快查看**PromQL**:```promql aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier=~".*prime.*"}>=90 ```---#### 64. RDS-aws connections_average 大于:1500- **级别**: P0-Critical | **状态**: Enabled | **ID**: 69 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: aws_rds_database_connections_average>=1500 连接较多,请检查是否有连接未释放**PromQL**:```promql aws_rds_database_connections_average{tag_env!~"dev|sit|uat|fat|qa|mirror"} >= 1500 ```---#### 65. RDS-aws cpuutilization maximum 大于:90%- **级别**: P0-Critical | **状态**: Enabled | **ID**: 68 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: aws_rds_cpuutilization_maximum数据库cpu使用率大于:90 -95%**PromQL**:```promql aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror"} >= 90 and aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror"} <=99 ```---#### 66. 
#### 66. RDS-aws cpuutilization maximum 大于:99%

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 210
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: aws_rds_cpuutilization_maximum数据库cpu使用率大于:99 -100%

**PromQL**:
```promql
aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror"} >= 99 and aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror"} <=100
```

---

#### 67. RDS-free memory小于: 500M

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 238
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: RDS-free memory小于: 500M

**PromQL**:
```promql
aws_rds_freeable_memory_average{tag_env!~"dev|sit|uat|fat|qa|mirror"} / 2^30 < 0.5
```

---

#### 68. RDS-实例连接异常-prod

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 103
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: MySQL 实例连接异常, 网络或者RDS故障

**PromQL**:
```promql
probe_success{job=~"ltp_prod_mysql_tcp",env!~"dev|sit|uat|fat|qa|mirror"}==0
```

---

### Infra/Redis (5 条, 启用 5)

#### 69. Prod Redis Command 执行大于 3s

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 121
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: Redis command 执行大于 3s

**PromQL**:
```promql
lettuce_command_firstresponse_seconds_max{command!="BLPOP", env=~"prod.*"} > 3
```

---

#### 70. Redis connection is down-prod

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 265
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: Redis connection is down

**PromQL**:
```promql
probe_success{job=~"ltp_prod_redis_tcp"} == 0
```

---

#### 71. Redis-evictions_average 被驱逐次数大于:1

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 81
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: aws_elasticache_evictions_average{dimension_CacheNodeId=~".+"} >=1

**PromQL**:
```promql
aws_elasticache_evictions_average{dimension_CacheNodeId=~".+",tag_env=~"prod|prodjp|prodhk"} >=1
```

---
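第 67 条规则用 `/ 2^30` 把 CloudWatch 上报的字节数换算为 GiB,再与 0.5(即 500M 阈值)比较。换算逻辑的 Python 示意(函数名为假设):

```python
def rds_memory_low(freeable_bytes, threshold_gib=0.5):
    """对应 `aws_rds_freeable_memory_average / 2^30 < 0.5`:
    2**30 字节 = 1 GiB,低于 0.5 GiB(约 500 MiB)即告警。"""
    gib = freeable_bytes / 2**30
    return gib < threshold_gib
```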
#### 72. Redis-memory_usage_percentage内存使用率大于:90%

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 84
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} >=90

**PromQL**:
```promql
aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+",tag_env=~"prod|prodjp|prodhk"} >= 90 and aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+",tag_env=~"prod|prodjp|prodhk"} <=100
```

---

#### 73. redis-CPU利用率大于:90%-prod

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 240
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: redis-CPU利用率大于:90%

**PromQL**:
```promql
aws_elasticache_cpuutilization_average{dimension_CacheNodeId=~".+",tag_env=~"prod|prodjp|prodhk"} >= 90
```

---

### OTC (2 条, 启用 2)

#### 74. Container-OTC服务状态异常

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 172
- **配置**: 执行间隔: 15s | 类型: prometheus

**PromQL**:
```promql
increase(kube_pod_container_status_restarts_total{namespace='otc-prod'}[5m]) >=1
```

---

#### 75. Logs-OTC服务日志异常

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 166
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: count_over_time({app="ltp-otc-rfq", env="prod"} |~ `error|warning` [5m]) >=1

**PromQL**:
```promql
count_over_time({app="ltp-otc-rfq", env="prod"} |~ `error|warning`|pattern `<message>`[5m]) >=1
```

---

### Prime/EMS/mds (6 条, 启用 6)

#### 76. EC2-DISK 磁盘使用率大于:80%

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 112
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: DISK使用率>=80%,请及时关注EC2磁盘情况

**PromQL**:
```promql
(1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",Name=~"ltp-rapidx-prod-mdsengine-.*|aws-jp-prod-mds-algo-01|aws-jp-prod-mds-algo-02|aws-jp-prod-mds-connex-.*|aws-jp-prod-mds-quote-.*|aws-jp-prod-mdsengine-05-edx|aws-jp-prod-mds-web-01|aws-jp-prod-mds-query-.*|aws-jp-prod-mds-api-.*|aws-jp-prod-mds-onezero-01"} / node_filesystem_size_bytes{Name=~"ltp-rapidx-prod-mdsengine-.*|aws-jp-prod-mds-algo-01|aws-jp-prod-mds-algo-02|aws-jp-prod-mds-connex-.*|aws-jp-prod-mds-quote-.*|aws-jp-prod-mdsengine-05-edx|aws-jp-prod-mds-web-01|aws-jp-prod-mds-query-.*|aws-jp-prod-mds-api-.*|aws-jp-prod-mds-onezero-01"})) >=0.8
```

---

#### 77. EC2-mds CPU 5min 负载大于:80%

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 111
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: cpu近5min负载>=80%,请及时关注EC2中程序情况

**PromQL**:
```promql
1 - avg by (Name,PrivateIpAddress,dept,env,instance,cluster) (rate(node_cpu_seconds_total{mode="idle", Name=~"ltp-rapidx-prod-mdsengine-.*|aws-jp-prod-mds-algo-01|aws-jp-prod-mds-algo-02|aws-jp-prod-mds-connex-.*|aws-jp-prod-mds-quote-.*|aws-jp-prod-mdsengine-05-edx|aws-jp-prod-mds-web-01|aws-jp-prod-mds-query-.*|aws-jp-prod-mds-api-.*|aws-jp-prod-mds-onezero-01"}[5m])) >80
```

---

#### 78. Logs-rapidtrade_server-生产环境日志告警

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 185
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: rapidtrade_server.log find quote not found for symbol

**PromQL**:
```promql
(count_over_time({filename=~"/data/api/rapidtrade-server10/logs/rapidtrade_server.log|/data/api/rapidtrade-server9/logs/rapidtrade_server.log",env="prodjp"}|~`quote not found for symbol`|pattern `<message>`[5m])) >=1
```

---
#### 79. api行情-OnlineConnection连接数为:0

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 178
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus
- **备注**: api connection 30min : 0

**PromQL**:
```promql
sum by (team,cluster,job,Connection) (sum without (instance) (avg_over_time(light_connect_server_connection{Connection="OnlineConnection",instance=~"prod-mds-api:.*|aws-jp-prod-mds-api-02:.*"}[30m])) ) <=0
```

---

#### 80. contex行情-OnlineConnection连接数为:0

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 179
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus
- **备注**: argo connection 30min : 0

**PromQL**:
```promql
sum by (team,cluster,job,Connection) (sum without (instance) (avg_over_time(light_connect_server_connection{Connection="OnlineConnection",instance=~"aws-jp-prod-mds-connex-.*:.*"}[30m])) ) <= 0
```

---

#### 81. 统一行情-OnlineConnection连接数为:0

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 176
- **配置**: 执行间隔: 15s | 持续时间: 30s | 类型: prometheus
- **备注**: OnlineConnection过去半小时连接数异常

**PromQL**:
```promql
sum by (team,cluster,job,Connection) (sum without (instance) (avg_over_time(light_connect_server_connection{Connection="OnlineConnection",instance=~"aws-jp-prod-mds-connect-03:.*|aws-jp-prod-mds-connect-04:.*|aws-jp-prod-mds-quote-03:.*|aws-jp-prod-mds-quote-04:.*"}[30m])) ) <=0
```

---

### Prime/EMS/rapidtrade (12 条, 启用 11)

#### 82. EMS-Rapidtrade 交易所返回ERROR Code 1m大于5

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 117
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
increase(exchange_error_counter{Exchange=~"^HTTP.*", instance=~"rapidtrade.*", code=~"50*",env="prodjp"}[1m]) >5
```

---

#### 83. Logs-rapidtrade-data-push-Invalid API-key

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 255
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: rapidtrade-data-push-Invalid API-key

**PromQL**:
```promql
count_over_time({app=~"rapidtrade-data-push-03", filename=~"/data/api/data-push-server.*/logs/exchange.log",env="prod"} |= `Invalid API-key` [5m]) >= 1
```

---
#### 84. Logs-rapidtrade-data-push-Too many visits

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 253
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: rapidtrade-data-push-Too many visits

**PromQL**:
```promql
count_over_time({app="rapidtrade-data-push-03", filename="/data/api/data-push-server7/logs/exchange.log",env="prod"} |= `Too many visits` | pattern `<message>`[5m]) >= 1
```

---

#### 85. Logs-rapidtrade-data-push-server10-TOO_MANY_REQUESTS

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 260
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: rapidtrade-data-push-Too Many Requests

**PromQL**:
```promql
count_over_time({app="rapidtrade-data-push-04", filename="/data/api/data-push-server10/logs/exchange.log",env="prod"} |= `TOO_MANY_REQUESTS` | pattern `<message>`[5m]) >= 1
```

---

#### 86. Logs-rapidtrade-data-push-server9-Too Many Requests

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 254
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: rapidtrade-data-push-Too Many Requests

**PromQL**:
```promql
count_over_time({app="rapidtrade-data-push-04", filename="/data/api/data-push-server9/logs/exchange.log",env="prod"} |= `Too Many Requests` | pattern `<message>`[5m]) >= 1
```

---

#### 87. Rapidtrade-server newline tcp is Down

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 211
- **配置**: 执行间隔: 15s | 持续时间: 120s | 类型: prometheus
- **备注**: 专线tcp is down

**PromQL**:
```promql
up{job="ltp-rapidtrade-server-newline-tcp"}==0
```

---

#### 88. rapidmarket-DL-icmp-probe 行情专线网络异常

- **级别**: P0-Critical | **状态**: Disabled | **ID**: 212
- **配置**: 执行间隔: 15s | 持续时间: 120s | 类型: prometheus
- **备注**: rapidmarket-DL-icmp-probe 行情专线

**PromQL**:
```promql
probe_success{job="rapidmarket-DL-icmp-probe"}==0
```

---

#### 89. rapidtrade-data-push每秒限频错误大于1

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 139
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
sum by (Exchange, code) (rate(exchange_error_counter{instance=~"aws-jp-prod-ltp-prime-rapidtrade-data-push.*", code=~"429"}[1m]) ) > 1
```

---
#### 90. rapidtrade-server code 429

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 198
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus
- **备注**: rapidtrade-server code 429

**PromQL**:
```promql
rate(exchange_error_counter{env=~"prodjp", instance=~"rapidtrade-server.*", Exchange=~"HTTP_.*",code="429"}[1m]) >0
```

---

#### 91. rapidtrade-server 近 5min handler error 大于 20

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 282
- **配置**: 执行间隔: 15s | 类型: loki

**PromQL**:
```promql
sum by (env,app,host,filename) (count_over_time({env="prodjp",app="rapidx-rapidtrade-server"}|="handleError"[5m])) > 50
```

---

#### 92. rapidtrade-server-DL-icmp-probe 交易专线网络异常

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 213
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: rapidtrade-server-DL-icmp-probe 交易专线

**PromQL**:
```promql
probe_success{job="rapidtrade-server-DL-icmp-probe"} ==0
```

---

#### 93. rapidtrade-serverip大于5000

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 118
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
sum by (weight_type) (label_replace(label_replace(exchange_gauge{env="prodjp", orderState=~"BN.*"},"weight_type", "($1)-($2)-$3", "orderState", "(.*)-([0-9.]+):(.*)"),"weight_type", "($1)-()-($2)", "orderState", "(.*)-:(.*)") ) > 5000
```

---

### Prime/OMS (21 条, 启用 20)

#### 94. Logs-Rapidx-Engine-DataRecoveryError

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 168
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: loki
- **备注**: [LoadDumpRunner] recoverData error

**PromQL**:
```promql
(count_over_time({app="rapidx-engine",env="prodjp"}|= `[LoadDumpRunner] recoverData error`| pattern `<message>`[5m])) >= 1
```

---
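第 93 条规则通过两次嵌套的 label_replace,把形如 `BN_SPOT-2087.0:ORDER` 的 orderState 改写成 weight_type 再聚合。label_replace 的正则是对整个标签值锚定匹配的。下面用 Python 的 re 近似还原这一改写;示例输入与函数名均为假设:

```python
import re

def weight_type(order_state):
    """近似第 93 条规则中两次 label_replace 的效果:
    先尝试 `(.*)-([0-9.]+):(.*)`(带数值权重),
    再尝试 `(.*)-:(.*)`(权重为空)。fullmatch 模拟 PromQL 的锚定语义。"""
    m = re.fullmatch(r"(.*)-([0-9.]+):(.*)", order_state)
    if m:
        return f"({m.group(1)})-({m.group(2)})-{m.group(3)}"
    m = re.fullmatch(r"(.*)-:(.*)", order_state)
    if m:
        return f"({m.group(1)})-()-({m.group(2)})"
    return order_state  # 都不匹配时 label_replace 不改写
```

这里用 if/elif 顺序近似两次链式调用对常见输入的净效果;极端输入下两条正则可能同时命中,行为会与真实 PromQL 略有差异。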
#### 95. Logs-Rapidx-Engine-KafkaPublishError

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 167
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: loki
- **备注**: [PublishEvent] sendKafka {} {} for portfolioId {} error

**PromQL**:
```promql
(count_over_time( {app="rapidx-engine",env="prodjp"}|= `[PublishEvent] sendKafka {} {} for portfolioId {} error` | pattern `<message>`[5m] )) >= 1
```

---

#### 96. Logs-Rapidx-Engine-PlaceOrderRpcError

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 169
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: loki
- **备注**: placeOrder RPC调用失败

**PromQL**:
```promql
(count_over_time({app="rapidx-engine",env="prodjp"}|= `placeOrder rpc error`| pattern `<message>`[5m])) >= 1
```

---

#### 97. Logs-Rapidx-Engine-orderStateMonitor ems not exist

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 287
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: loki
- **备注**: orderStateMonitor ems not exist

**PromQL**:
```promql
count_over_time({app="pb-trading-engine", env="prodjp"} |= `orderStateMonitor ems not exist` != `FILLED` != `CANCELLED REJECT FAIL` | pattern `<message>`[5m]) >= 1
```

---

#### 98. Logs-RapidxEngine-CompOrderRpcError

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 171
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: loki
- **备注**: compOrder RPC调用失败

**PromQL**:
```promql
(count_over_time({app="rapidx-engine",env="prodjp"}|= `compOrder rpc error`| pattern `<message>`[5m])) >= 1
```

---

#### 99. Logs-RapidxEngine-ReplaceOrderRpcError

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 170
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: loki
- **备注**: replaceOrder RPC调用失败

**PromQL**:
```promql
(count_over_time({app="rapidx-engine",env="prodjp"}|= `replaceOrder rpc error`| pattern `<message>`[5m])) >= 1
```

---
#### 100. Logs-query-persistent AlgoOrderRequest process error

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 173
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: query-persistent AlgoOrderRequest process error

**PromQL**:
```promql
(count_over_time({app="rapidx-query-persistent",env=~"prod.*"}|= `AlgoOrderRequest process error `| pattern `<message>`[5m])) >=1
```

---

#### 101. P50 RapidX v0.5 - 下单 Extra Latency 大于 1s

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 278
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
(order_state_transition_time_seconds{quantile="0.50", strategy="RapidX PROD PM1.5 内网Rest撤下延迟 ETH", transition="PENDING_TO_OPEN"} - ignoring(strategy, transition) order_state_transition_time_seconds{quantile="0.50", strategy="BN PM1.5 Rest撤下延迟 ETH", transition="NEW_TO_OPEN"}) * 1000 > 1000
```

---

#### 102. P50 RapidX v1.0 - 下单 Extra Latency 大于 1s

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 277
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
(order_state_transition_time_seconds{quantile="0.50", strategy="RapidX Prod PM1.0 内网Rest撤下延迟 ETH", transition=~"PENDING_TO_OPEN"} - ignoring(strategy, transition) order_state_transition_time_seconds{quantile="0.50", strategy="BN经典 Rest撤下延迟 ETH", transition="NEW_TO_OPEN"}) * 1000 > 1000
```

---

#### 103. P50 RapidX v1.0 - 下单 Extra Latency Shard5 大于 1s

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 279
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
(order_state_transition_time_seconds{quantile="0.50", strategy="RapidX Prod PM1.0 Shard5 内网Rest撤下延迟 ETH", transition="PENDING_TO_OPEN"} - ignoring(strategy,transition) order_state_transition_time_seconds{quantile="0.50", strategy="BN经典 Rest撤下延迟 ETH", transition="NEW_TO_OPEN"}) * 1000 > 1000
```

---
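第 101–103 条规则的 Extra Latency 定义为:RapidX 链路的 P50 状态迁移耗时减去对应 Binance 基线的 P50,乘 1000 转成毫秒后与 1000ms 比较。判定逻辑的 Python 示意(函数名与参数为假设):

```python
def extra_latency_exceeds(rapidx_p50_s, baseline_p50_s, limit_ms=1000.0):
    """对应 `(rapidx - ignoring(...) baseline) * 1000 > 1000`:
    两个分位数均以秒为单位,差值换算为毫秒后比较阈值。"""
    return (rapidx_p50_s - baseline_p50_s) * 1000.0 > limit_ms
```

用"差值"而非绝对延迟做阈值,可以把交易所侧的共同抖动从告警条件中扣除。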
#### 104. P50 RapidX v1.0 - 下单 Extra Latency Shard6 大于 1000

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 280
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
(order_state_transition_time_seconds{quantile="0.50", strategy="RapidX Prod PM1.0 Shard6 内网Rest撤下延迟 ETH", transition="PENDING_TO_OPEN"} - ignoring(strategy,transition) order_state_transition_time_seconds{quantile="0.50", strategy="BN经典 Rest撤下延迟 ETH", transition="NEW_TO_OPEN"}) * 1000 > 1000
```

---

#### 105. RapidX 应用异常请求 5xx 占比超过 20%

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 241
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
sum by (java_application, uri,env, status) (rate(http_server_requests_seconds_count{env="prodjp",java_application=~"pb-trading.*|rapidx.*",java_application!="pb-trading-dump", uri != "/**",status=~"5.."}[5m]) ) / ignoring(status) group_left sum by (java_application, uri, env) (rate(http_server_requests_seconds_count{env="prodjp",java_application=~"pb-trading.*|rapidx.*",java_application!="pb-trading-dump",uri != "/**"}[5m]) ) * 100 > 20
```

---

#### 106. RapidX 服务 SQL 处理平均延迟 大于 3s 超过 5min

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 248
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus

**PromQL**:
```promql
rate(mybatis_sql_timer_seconds_sum{env=~"prod.*"}[1m])>0 / rate(mybatis_sql_timer_seconds_count{env=~"prod.*"}[1m])>0 >3
```

---

#### 107. Rapidx Engine Master 数量大于 1

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 252
- **配置**: 执行间隔: 15s | 类型: prometheus

**PromQL**:
```promql
aws_applicationelb_healthy_host_count_average{tag_env=~"prod.*",dimension_AvailabilityZone=~".+",dimension_TargetGroup=~".+engine.+",dimension_TargetGroup!~".+gateway.+"} > 1
```

---

#### 108. STP takeover function paused

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 283
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: Logs-Rapidx-Engine-stp ems takerover function paused

**PromQL**:
```promql
(count_over_time({env="prodjp", app="pb-trading-engine"} |= `不再接管`[5m])) >= 1
```

---
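第 105 条规则计算某 URI 上 5xx 请求速率占总请求速率的百分比。下面是该占比逻辑的 Python 示意(函数名为假设;对无流量 URI 直接不告警,是对 PromQL 中 0/0 得 NaN、与阈值比较不成立这一行为的简化):

```python
def error_ratio_alert(rate_5xx, rate_total, threshold_pct=20.0):
    """对应 `sum(rate(5xx)) / ignoring(status) group_left sum(rate(total)) * 100 > 20`。
    rate_5xx / rate_total 为每秒速率,乘 100 得到百分比。"""
    if rate_total == 0:
        return False  # 简化处理:无流量时不告警(PromQL 中 0/0 = NaN,同样不触发)
    return rate_5xx / rate_total * 100.0 > threshold_pct
```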
#### 109. [DisrupterEventHandler] Error occurred while processing event

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 174
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: [DisrupterEventHandler] Error occurred while processing event

**PromQL**:
```promql
count by (app) (count_over_time({app=~".*persistent",env=~"prod.*"}|= `[DisrupterEventHandler] Error occurred while processing event`|pattern `<message>`[5m])) > 0
```

---

#### 110. engine NEW order 大于:100

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 135
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: engine NEW order > 100

**PromQL**:
```promql
engine_order_count{env=~"prod.*",type="NEW"} >=100
```

---

#### 111. engine REPLACE_NEW order 大于:100

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 136
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: engine REPLACE_NEW order > 100

**PromQL**:
```promql
engine_order_count{env=~"prod.*",type="REPLACE_NEW"} > 100
```

---

#### 112. engine async error 数量增加大于 30

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 285
- **配置**: 执行间隔: 15s | 类型: prometheus

**PromQL**:
```promql
sum by (subType) (increase(engine_async_errors_total{env=~"prod.*",java_application="pb-trading-engine",subType=~"400010|400011|401046|401118|401010|500"}[2m])) > 30
```

---

#### 113. oms send order ems rpc error 大于10次

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 273
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: oms发送下改撤单调用ems的rpc错误预警 1分钟>10次

**PromQL**:
```promql
(count_over_time({app="rapidx-engine",env="prodjp"} |~ `placeOrder rpc no channel | placeOrder rpc error | cancelOrder rpc error | replaceOrder rpc no channel`| pattern `<message>`[1m])) >= 10
```

---

#### 114. xchange_binance_limit_rate-S1 大于:5000

- **级别**: P0-Critical | **状态**: Disabled | **ID**: 137
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus
- **备注**: xchange_binance_limit_rate > 5000

**PromQL**:
```promql
xchange_binance_limit_rate{env=~"prod.*"} > 5000
```

---

### Security (2 条, 启用 2)

#### 115. Security-容器处于非运行状态

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 200
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: 目前状态:{{ $labels.phase! }}

**PromQL**:
```promql
kube_pod_status_phase{phase!~"Running|Succeeded",cluster=~"aws-jp-prod-ltp-infra-eks",namespace="sec-prod"}==1
```

---

#### 116. Security-容器服务状态异常-

- **级别**: P0-Critical | **状态**: Enabled | **ID**: 205
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: 目前容器状态:{{ $labels.reason }}

**PromQL**:
```promql
kube_pod_container_status_waiting_reason{reason!='ContainerCreating',cluster="aws-jp-prod-ltp-infra-eks",namespace="sec-prod"} and kube_pod_container_status_waiting_reason{reason!='PodInitializing',cluster="aws-jp-prod-ltp-infra-eks",namespace="sec-prod"}
```

---

## P1-Warning (86 条)

### Infra/AccessLog (1 条, 启用 1)

#### 117. 非生产环境-Nginx-Status Code 499/5xx 过去 5min 大于 10

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 217
- **配置**: 执行间隔: 15s | 类型: loki
- **备注**: 非生产环境-Nginx-Status Code 499/5xx 过去 5min 大于 10

**PromQL**:
```promql
count by (env, site, status) (count_over_time({job="nginx",status=~"499|50+",env!~"prod|prodjp"}[180m]))>10
```

---

### Infra/DevOps (2 条, 启用 2)

#### 118. JVM Heap Usage 大于 80%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 262
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus

**PromQL**:
```promql
jvm_memory_used_bytes{env=~"prod.*"} / jvm_memory_max_bytes{env=~"prod.*"} * 100 > 80
```

---

#### 119. 生产日志 Erorr 计数短时间突增-S2

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 286
- **配置**: 执行间隔: 15s | 持续时间: 120s | 类型: prometheus

**PromQL**:
```promql
prod_logs_error_count_1m{app!~`(loki|litellm|dolphinscheduler)`} > 100
and
(prod_logs_error_count_1m{app!~`(loki|litellm|dolphinscheduler)`}/clamp_min(avg_over_time(prod_logs_error_count_1m{app!~`(loki|litellm|dolphinscheduler)`}[15m]), 1) ) > 3
```

---

### Infra/EC2 (20 条, 启用 15)

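上面第 119 条规则同时要求绝对量(每分钟 Error 数 > 100)和相对 15 分钟均值的突增倍数(> 3),clamp_min 把分母下限钳到 1,避免安静应用因均值接近 0 而被放大。判定逻辑的 Python 示意(函数名与参数为假设):

```python
def error_burst(count_1m, avg_15m, floor=1.0, min_count=100, ratio=3.0):
    """对应 `count > 100 and count / clamp_min(avg_over_time(count[15m]), 1) > 3`:
    max(avg_15m, floor) 复现 clamp_min 的下限钳制。"""
    return count_1m > min_count and count_1m / max(avg_15m, floor) > ratio
```

绝对量与相对倍数的组合,使持续高错误率的服务(均值本来就高)不会被当作"突增"反复触发。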
#### 120. AWS_EC2_INSTANCE_REBOOT_FLEXIBLE_MAINTENANCE_SCHEDULED-S2

- **级别**: P1-Warning | **状态**: Disabled | **ID**: 147
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: AWS_EC2_INSTANCE_REBOOT_FLEXIBLE_MAINTENANCE_SCHEDULED>0

**PromQL**:
```promql
avg_over_time(aws_health_event_info{event_code="AWS_EC2_INSTANCE_REBOOT_FLEXIBLE_MAINTENANCE_SCHEDULED",service="EC2",status="upcoming"}[1h]) > 0
```

---

#### 121. AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED-S2

- **级别**: P1-Warning | **状态**: Disabled | **ID**: 146
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED>0

**PromQL**:
```promql
avg_over_time(aws_health_event_info{env="prod",event_code=~"AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED|AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED",service="EC2",status="upcoming"}[1h]) > 0
```

---

#### 122. AWS_EC2_INSTANCE_STOP_SCHEDULED-S2

- **级别**: P1-Warning | **状态**: Disabled | **ID**: 145
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: avg_over_time(aws_health_event_info{env="prod",event_code=~"AWS_EC2_INSTANCE_STOP_SCHEDULED",service="EC2",status="upcoming"}[1h]) > 0

**PromQL**:
```promql
avg_over_time(aws_health_event_info{event_code=~"AWS_EC2_INSTANCE_STOP_SCHEDULED",service="EC2",status="upcoming"}[1h]) > 0
```

---

#### 123. EC2-Available Memory 使用率大于:90%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 28
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2实例内存使用率>90%

**PromQL**:
```promql
(1 - (node_memory_MemAvailable_bytes{PrivateIpAddress!=""} / node_memory_MemTotal_bytes{PrivateIpAddress!=""})) * 100 >= 90 and (1 - (node_memory_MemAvailable_bytes{PrivateIpAddress!=""} / node_memory_MemTotal_bytes{PrivateIpAddress!=""})) * 100 < 95
```

---
#### 124. EC2-Available Memory-S2 使用率大于:95%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 140
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2实例内存使用率>95%

**PromQL**:
```promql
(1 - (node_memory_MemAvailable_bytes{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""})) * 100 >= 95 and (1 - (node_memory_MemAvailable_bytes{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""})) * 100 <=100
```

---

#### 125. EC2-CPU负载变化量大于:40%-nonprod

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 243
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2-CPU负载变化量大于:40%

**PromQL**:
```promql
100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 >= 80
and
((100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100)-(100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[5m] offset 5m)) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[5m] offset 5m)) by (Region,instance,Name,env,PrivateIpAddress) * 100) ) >= 40
```

---
#### 126. EC2-CPU负载大于:90%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 31
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: CPU负载>90%-100%

**PromQL**:
```promql
100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 >= 90
and
100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 < 100
```

---

#### 127. EC2-Connect Limit 连接限制大于:80%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 40
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2实例内存使用率>80%-89% 连接数量接近极限

**PromQL**:
```promql
node_nf_conntrack_entries{PrivateIpAddress!=""} / node_nf_conntrack_entries_limit{PrivateIpAddress!=""} > 0.8
```

---

#### 128. EC2-DISK IO使用率大于:90%-S2

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 233
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: IO Usage>90%。检查存储问题或提高IOPS能力。检查存储器中的问题

**PromQL**:
```promql
rate(node_disk_io_time_seconds_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[5m]) >= 0.9
```

---

#### 129. EC2-DISK IO使用率大于:90%-prod

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 41
- **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus
- **备注**: IO Usage>90%。检查存储问题或提高IOPS能力。检查存储器中的问题

**PromQL**:
```promql
rate(node_disk_io_time_seconds_total{PrivateIpAddress!="",env!~"dev|sit|uat|fat|mirror|qa"}[5m]) >= 0.9
```

---
#### 130. EC2-Disk avail_bytes 磁盘使用率大于:90%-S2

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 7
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: 磁盘使用率高于 >90%

**PromQL**:
```promql
(1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_filesystem_size_bytes{env!~"prod|prodjk|prodhk"})) >= 0.90
and
(1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_filesystem_size_bytes{env!~"prod|prodjk|prodhk"})) < 0.95
and on (instance, device, mountpoint) node_filesystem_readonly == 0
```

---

#### 131. EC2-Disk avail_bytes-S2 磁盘使用率大于:95%

- **级别**: P1-Warning | **状态**: Disabled | **ID**: 141
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2实例磁盘使用率高于>95%

**PromQL**:
```promql
(1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_filesystem_size_bytes{env!~"prod|prodjk|prodhk"})) >=0.95
and
(1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_filesystem_size_bytes{env!~"prod|prodjk|prodhk"})) <= 1
and on (instance, device, mountpoint) node_filesystem_readonly == 0
```

---
#### 132. EC2-MEM利用率变化量大于:40%-nonprod

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 245
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2实例MEM利用率变化量>40%

**PromQL**:
```promql
(1 - (node_memory_MemAvailable_bytes{env=~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env=~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) * 100 >= 80
and
((1 - (node_memory_MemAvailable_bytes{env=~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} / node_memory_MemTotal_bytes{env=~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""})) * 100 -(1 - (node_memory_MemAvailable_bytes{env=~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} offset 1m / node_memory_MemTotal_bytes{env=~"dev|sit|uat|fat|mirror|qa",PrivateIpAddress!=""} offset 1m)) * 100 ) >= 40
```

---

#### 133. EC2-NetworkReceiveErrors 主机网络接收异常大于:1%-S2

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 235
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2-NetworkReceiveErrors 主机网络接收异常

**PromQL**:
```promql
(rate(node_network_receive_errs_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[2m]) / rate(node_network_receive_packets_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[2m]) > 0.01)
```

---

#### 134. EC2-NetworkTransmitErrors 主机网络传输错误大于:1%-S2

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 236
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: EC2-NetworkTransmitErrors 主机网络传输错误

**PromQL**:
```promql
(rate(node_network_transmit_errs_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[2m]) / rate(node_network_transmit_packets_total{PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[2m]) > 0.01)
```

---

#### 135. EC2-OomKillDetected-S2 检测到OOM终止

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 142
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: 检测到EC2实例OOM终止,请立刻查看

**PromQL**:
```promql
(increase(node_vmstat_oom_kill{InstanceType!="",env!~"prod|prodjk|prodhk",PrivateIpAddress!=""}[2m])) >=1
```

---
#### 136. EC2-OutOf available inodes-S2使用率大于:95%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 143
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: Disk is almost running out of available inodes (< 10% left)

**PromQL**:
```promql
(1 - (node_filesystem_files_free{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_filesystem_files{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""})) >= 0.95
and
(1 - (node_filesystem_files_free{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} / node_filesystem_files{env!~"prod|prodjk|prodhk",PrivateIpAddress!=""})) <= 1
and on (instance, device, mountpoint) node_filesystem_readonly == 0
```

---

#### 137. EC2-OutOf available inodes使用率大于:90%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 32
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: Disk is almost running out of available inodes (< 10% left)

**PromQL**:
```promql
(1 - (node_filesystem_files_free{PrivateIpAddress!=""} / node_filesystem_files{PrivateIpAddress!=""})) >= 0.90
and
(1 - (node_filesystem_files_free{PrivateIpAddress!=""} / node_filesystem_files{PrivateIpAddress!=""})) < 0.95
and on (instance, device, mountpoint) node_filesystem_readonly == 0
```

---

#### 138. EC2-SystemdServiceCrashed-S2 主机系统奔溃

- **级别**: P1-Warning | **状态**: Disabled | **ID**: 144
- **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus
- **备注**: 主机系统崩溃

**PromQL**:
```promql
(node_systemd_unit_state{state="failed",env!~"prod|prodjk|prodhk",PrivateIpAddress!=""} == 1)
```

---

#### 139. EC2-node_exporter is Down-S2

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 275
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: ec2服务器近2分钟无响应,可能故障离线-S2

**PromQL**:
```promql
max_over_time(up{job="aws-ec2-nodes",PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[2h] ) == 1 and min_over_time(up{job="aws-ec2-nodes",PrivateIpAddress!="",env=~"dev|sit|uat|fat|mirror|qa"}[2m] ) == 0
```

---

### Infra/K8S (16 条, 启用 16)

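第 139 条规则用 `max_over_time(up[2h])==1 and min_over_time(up[2m])==0` 区分"刚刚掉线"与"早已下线":只有近 2 小时内出现过 up=1 的目标,才会在最近 2 分钟出现 up=0 时告警,已下架的主机保持静默。逻辑示意(以采样值列表代替区间查询,函数名为简化假设):

```python
def node_exporter_down(up_samples_2h, up_samples_2m):
    """对应 `max_over_time(up[2h]) == 1 and min_over_time(up[2m]) == 0`:
    2h 窗口内见过 up=1(主机近期还活着),且 2m 窗口内出现过 up=0(当前掉线)。"""
    return max(up_samples_2h) == 1 and min(up_samples_2m) == 0
```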
#### 140. Container-HighCpuUtilization CPU使用率大于:90%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 2
- **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus
- **备注**: 容器-CPU利用率超过90%

**PromQL**:
```promql
(sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])) by (cluster,namespace,pod,container) / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu"})) * 100 >= 90
and
(sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])) by (cluster,namespace,pod,container) / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu"})) * 100 < 100
```

---

#### 141. Container-HighLowChangeCpuUsage CPU波动率大于:40%-nonprod

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 45
- **配置**: 执行间隔: 15s | 持续时间: 600s | 类型: prometheus
- **备注**: 监控5m时间窗口内CPU使用情况的绝对变化,并在变化超过40%时触发警报

**PromQL**:
```promql
(sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,namespace,pod,container) / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100 >= 80
and
((sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m])) by (cluster,namespace,pod,container) / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100-(sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m] offset 5m)) by (cluster,namespace,pod,container) / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"})) * 100 ) >= 40
```

---
#### 142. Container-HighMemoryUsage 内存使用率大于: 90%

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 47
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: 容器内存使用率达到90%-100%

**PromQL**:
```promql
round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory"}) ) [5m:]))/100 >=90
and
round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory"}) ) [5m:]))/100 <=100
```

---

#### 143. Container-HighMemoryUsage 内存波动率大于: 40%-nonprod

- **级别**: P1-Warning | **状态**: Enabled | **ID**: 247
- **配置**: 执行间隔: 15s | 类型: prometheus
- **备注**: 容器内存波动率大于: 40%

**PromQL**:
```promql
round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:]))/100 >= 80
and
(round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:]))/100-round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} offset 5m) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ) [5m:] offset 5m))/100 ) >=40
```

---
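第 142/143 条规则里的 `round(10000*x)/100` 是 PromQL 中保留两位小数的常用技巧:round() 只能取整,先放大 10000 倍取整,再除以 100 得到带两位小数的百分比。换算示意(函数名为假设):

```python
def mem_usage_pct(working_set_bytes, limit_bytes):
    """对应 `round(10000 * working_set / limit) / 100`:
    比值先放大 10000 倍取整,再除以 100,得到精确到 0.01 的百分比。"""
    return round(10000 * working_set_bytes / limit_bytes) / 100
```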
Container-daemonset not scheduled- **级别**: P1-Warning | **状态**: Enabled | **ID**: 101 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: daemonset not scheduled**PromQL**:```promql kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} >0 ```---#### 145. Container-threads 线程数量大于:2000- **级别**: P1-Warning | **状态**: Enabled | **ID**: 58 - **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus - **备注**: Container-threads 线程数量较大异常**PromQL**:```promql sum by (cluster,namespace,pod,container) (container_threads{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) >= 2000 ```---#### 146. Container-容器5分钟内发生过重启- **级别**: P1-Warning | **状态**: Enabled | **ID**: 219 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: 容器在5分钟内重启,请检查是否存在崩溃循环**PromQL**:```promql increase(kube_pod_container_status_restarts_total{cluster!~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m]) >= 1 ```---#### 147. Container-容器5分钟内发生过重启-s2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 62 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: 容器在5分钟内重启,请检查是否存在崩溃循环**PromQL**:```promql increase(kube_pod_container_status_restarts_total{cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}[5m]) >= 1 ```---#### 148. Container-容器因OOM终止-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 148 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: 容器因内存溢出被kill,请检查使用量及配置是否需调整**PromQL**:```promql kube_pod_container_status_terminated_reason{reason='OOMKilled',cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} >=1 ```---#### 149. Container-容器异常等待-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 149 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: 目前容器异常等待:{{ $labels.reason }}**PromQL**:```promql kube_pod_container_status_waiting_reason{reason!~"ContainerCreating|PodInitializing",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} == 1 ```---#### 150. 
K8s-Node NotReady 节点状态异常-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 150 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: k8s节点-K8s-Node NotReady 节点状态异常**PromQL**:```promql (kube_node_status_condition{job!='aws-ec2-nodes',container="kube-state-metrics",condition="Ready",status="unknown",cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) ==1 ```---#### 151. K8s-Node-DiskUsed. 磁盘使用率大于:90%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 221 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: k8s节点-磁盘利用率到达90%-100%**PromQL**:```promql ((node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} - node_filesystem_free_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"}) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} ) * 100 >= 90 and ((node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} - node_filesystem_free_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"}) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} ) * 100 <= 100 ```---#### 152. K8s-Node-HighCpuUtilization CPU使用率大于:90%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 105 - **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus - **备注**: k8s节点-CPU利用率超过90%**PromQL**:```promql 100 - (avg(rate(node_cpu_seconds_total{job="node-exporter",mode='idle'}[5m])) by (cluster,instance,namespace,job) * 100) >=90 and 100 - (avg(rate(node_cpu_seconds_total{job="node-exporter",mode='idle'}[5m])) by (cluster,instance,namespace,job) * 100) <=100 ```---#### 153. K8s-Node-MemUsed. 
内存使用率大于90%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 188 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: k8s节点-内存利用率 90%-100%**PromQL**:```promql 100 * (1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes{job='node-exporter'} + node_memory_Cached_bytes{job='node-exporter'}) / node_memory_MemTotal_bytes{job='node-exporter'}) >=90 and 100 * (1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes{job='node-exporter'} + node_memory_Cached_bytes{job='node-exporter'}) / node_memory_MemTotal_bytes{job='node-exporter'}) <100 ```---#### 154. NonProd EKS-Node is Down- **级别**: P1-Warning | **状态**: Enabled | **ID**: 270 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: NonProd EKS node is Down. Please checkout**PromQL**:```promql up{job=~"node_exporter|node-exporter",cluster!~"ltp-eks-prod|aws-jp-prod-ltp-infra-eks"}==0 ```---#### 155. PVC-persistent VolumeClaim 大于90%-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 99 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: pvc存储使用率>90%**PromQL**:```promql sum by (namespace, persistentvolumeclaim,cluster) (kubelet_volume_stats_used_bytes{cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"} / kubelet_volume_stats_capacity_bytes{cluster=~"LTP-EKS-informal|ltp-nonprod-eks|ltp-eks-uat"}) * 100 >=90 ```---### Infra/Kafka (11 条, 启用 11)#### 156. Kafaka memory_free-S2 可用内存小于:80M- **级别**: P1-Warning | **状态**: Enabled | **ID**: 151 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_kafka_memory_free_average{tag_env=~"prod.*"} / 2 ^ 20 < 80**PromQL**:```promql aws_kafka_memory_free_average{tag_env!~"prod.*"} / 2 ^ 20 < 80 ```---#### 157. Kafka connection is Down-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 266 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Kafka connection is Down**PromQL**:```promql probe_success{job=~"ltp_nonprod_kafka_tcp|ltp_uat_kafka_tcp"} == 0 ```---#### 158. 
Kafka connection is Down-prod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 267 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Kafka connection is Down**PromQL**:```promql probe_success{job=~"ltp_prod_kafka_tcp"} == 0 ```---#### 159. Prod-Kafaka-cpu_user_average-S2使用率大于:85%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 152 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_kafka_cpu_user_average{tag_env=~"prod.*"}**PromQL**:```promql aws_kafka_cpu_user_average{tag_env!~"prod.*"} >=85 and aws_kafka_cpu_user_average{tag_env!~"prod.*"} <=100 ```---#### 160. Prod-Kafaka-cpu_user_average使用率大于:70%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 78 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_kafka_cpu_user_average{tag_env=~"prod.*"}**PromQL**:```promql aws_kafka_cpu_user_average{} >=70 and aws_kafka_cpu_user_average{} <85 ```---#### 161. Prod-Kafaka-data disk used使用率大于:80%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 76 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_kafka_data_logs_disk_used_average >= 80**PromQL**:```promql aws_kafka_data_logs_disk_used_average{} >=80 ```---#### 162. 自建-Kafka Offline Partitions 分区离线- **级别**: P1-Warning | **状态**: Enabled | **ID**: 290 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus - **备注**: 离线分区完全不可读写,影响消息生产和消费**PromQL**:```promql sum by (namespace, cluster) (kafka_controller_kafkacontroller_offlinepartitionscount_value{namespace=~"kafka-.*"}) > 0 ```---#### 163. 自建-KafkaBrokerCountDown- **级别**: P1-Warning | **状态**: Enabled | **ID**: 292 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus - **备注**: broker 下线导致分区副本不足,影响高可用**PromQL**:```promql min by (namespace, cluster) (kafka_controller_kafkacontroller_activebrokercount_value{namespace=~"kafka-.*"}) < 3 ```---#### 164. 
自建-KafkaNoActiveController控制器数量异常- **级别**: P1-Warning | **状态**: Enabled | **ID**: 291 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus - **备注**: 无控制器,集群无法进行 Leader 选举和分区重分配**PromQL**:```promql sum by (namespace, cluster) (kafka_controller_kafkacontroller_activecontrollercount_value{namespace=~"kafka-.*"}) !=1 ```---#### 165. 自建-KafkaRequestQueueFull请求队列堆积大于50- **级别**: P1-Warning | **状态**: Enabled | **ID**: 294 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus - **备注**: 请求排队过多会导致生产/消费延迟增加**PromQL**:```promql kafka_network_requestchannel_requestqueuesize_value{namespace=~"kafka-.*"} > 50 ```---#### 166. 自建-KafkaUncleanLeaderElection非干净 Leader 选举- **级别**: P1-Warning | **状态**: Enabled | **ID**: 293 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus - **备注**: 非干净 Leader 选举,可能导致数据丢失**PromQL**:```promql sum by (namespace, cluster) (rate(kafka_controller_controllerstats_uncleanleaderelectionspersec_count{namespace=~"kafka-.*"}[5m])) > 0 ```---### Infra/Monitoring (14 条, 启用 9)#### 167. ELB Target HTTP Code 5xx 大于:50-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 281 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: ELB Target HTTP Code 5xx > 50-S2**PromQL**:```promql avg_over_time(aws_applicationelb_httpcode_target_5_xx_count_sum{dimension_AvailabilityZone!~".+",dimension_TargetGroup=~".+",tag_env=~"prod.*",dimension_TargetGroup="targetgroup/ltp-pb-prod-pb-api-public/8cf10e95aedaf08f"}[1m])> 50 ```---#### 168. ELB Target response time 大于 200ms- **级别**: P1-Warning | **状态**: Enabled | **ID**: 131 - **配置**: 执行间隔: 15s | 持续时间: 150s | 类型: prometheus - **备注**: ELB Target response time > 200ms**PromQL**:```promql aws_applicationelb_target_response_time_average{dimension_AvailabilityZone!~".+",dimension_TargetGroup=~".+",dimension_TargetGroup!~".*metrics.*",tag_env=~"prod.*"} * 1000 > 200 ```---#### 169. 
Nonprod Prometheus-remote write mimir 写入失败率大于:10%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 119 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: nonprod Prometheus remote write to mimir fail ,please check**PromQL**:```promql ((rate(prometheus_remote_storage_failed_samples_total{namespace="monitoring",cluster!="ltp-eks-prod"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{namespace="monitoring",cluster!="ltp-eks-prod"}[5m])) / ((rate(prometheus_remote_storage_failed_samples_total{namespace="monitoring",cluster!="ltp-eks-prod"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{namespace="monitoring",cluster!="ltp-eks-prod"}[5m])) + (rate(prometheus_remote_storage_succeeded_samples_total{namespace="monitoring",cluster!="ltp-eks-prod"}[5m]) or rate(prometheus_remote_storage_samples_total{namespace="monitoring",cluster!="ltp-eks-prod"}[5m])))) * 100 >=10 ```---#### 170. Prometheus is Lost-nonprod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 226 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Prometheus is Lost, Please check target Prometheus-nonprod**PromQL**:```promql count by (prometheus) (up{cluster!~"ltp-eks-prod"})==0 ```---#### 171. Prometheus is Lost-prod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 23 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Prometheus is Lost, Please check target Prometheus-prod**PromQL**:```promql count by (prometheus) (up{cluster=~"ltp-eks-prod"})==0 ```---#### 172. Prometheus- sd discovered EC2 targes is Null- **级别**: P1-Warning | **状态**: Disabled | **ID**: 22 - **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus - **备注**: prometheus_sd_discovered_targets{cluster="aws-jp-prod-ltp-infra-eks", config="aws-ec2-nodes"} == 0**PromQL**:```promql prometheus_sd_discovered_targets{cluster="aws-jp-prod-ltp-infra-eks", config="aws-ec2-nodes"} ==0 ```---#### 173. 
Prometheus-config last reload filed- **级别**: P1-Warning | **状态**: Disabled | **ID**: 100 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: max_over_time(prometheus_config_last_reload_successful[5m])==0**PromQL**:```promql max_over_time(prometheus_config_last_reload_successful[5m]) ==0 ```---#### 174. Prometheus-remote write mimir-S2写入失败率大于:10%- **级别**: P1-Warning | **状态**: Disabled | **ID**: 153 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Prometheus remote write to mimir fail ,please check**PromQL**:```promql ((rate(prometheus_remote_storage_failed_samples_total{namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{namespace="monitoring"}[5m])) / ((rate(prometheus_remote_storage_failed_samples_total{namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{namespace="monitoring"}[5m])) + (rate(prometheus_remote_storage_succeeded_samples_total{namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_total{namespace="monitoring"}[5m])))) * 100 >= 10 ```---#### 175. Prometheus-sd discovered Discovered Targes is Null-nonprod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 227 - **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus - **备注**: Prometheus-sd discovered Discovered Targes is Null-nonprod**PromQL**:```promql prometheus_sd_discovered_targets{cluster!~'ltp-eks-prod'} == 0 ```---#### 176. Prometheus-sd discovered Discovered Targes is Null-prod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 75 - **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus - **备注**: Prometheus-sd discovered Discovered Targes is Null-prod**PromQL**:```promql prometheus_sd_discovered_targets{cluster=~'ltp-eks-prod'} == 0 ```---#### 177. 
SSL-Earliest Cert Expiry 证书过期时间小于15天- **级别**: P1-Warning | **状态**: Enabled | **ID**: 94 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 15**PromQL**:```promql (probe_ssl_earliest_cert_expiry - time()) / 86400 < 15 ```---#### 178. Service is Down-mirror- **级别**: P1-Warning | **状态**: Enabled | **ID**: 229 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Service is Down , Please checkout-mirror**PromQL**:```promql up{env=~'mirror',job!~"kube-state-metrics|aws-ec2-nodes|kubelet|kube-proxy||apiserver|kube-prom-stack-kube-prome-operator|monitoring-kube-prometheus-operator|kube-prom-stack-kube-prome-prometheus|.*node.*exporter.*|coredns",instance!="yet-another-cloudwatch-exporter",cluster!~'ltp-eks-uat|ltp-nonprod-eks|LTP-EKS-informal',endpoint!="http" }==0 ```---#### 179. Service is Down-sit fault drill- **级别**: P1-Warning | **状态**: Disabled | **ID**: 261 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus - **备注**: Service is Down-sit fault drill**PromQL**:```promql up{job="sit-fault-drill-service"} == 0 ```---#### 180. XXL-JOB-EXCUTE-FAILED- **级别**: P1-Warning | **状态**: Disabled | **ID**: 268 - **配置**: 执行间隔: 15s | 类型: mysql - **备注**: XXL-JOB-EXCUTE-FAILED> 无 PromQL (可能为 N9E 内置规则或事件型告警)---### Infra/RDS (8 条, 启用 8)#### 181. RDS-Prime Mysql CPU 大于:60%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 195 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws Prime Cpu使用率>=60%,请尽快查看**PromQL**:```promql aws_rds_cpuutilization_maximum{tag_env!~"dev|sit|uat|fat|qa|mirror",dimension_DBInstanceIdentifier=~".*prime.*"} >=60 ```---#### 182. RDS-Too many Slow Queries 大于:50-nonprod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 258 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: RDS-Too many Slow Queries 大于:50-nonprod**PromQL**:```promql sum by (job, instance,cluster) (rate(mysql_global_status_slow_queries{job=~"nonprod-aws-mysql-metrics|uat-aws-mysql-metrics"}[30m])) * 100>=50 ```---#### 183. 
RDS-Too many Slow Queries 大于:50-prod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 95 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: sum by (job, instance,cluster) (rate(mysql_global_status_slow_queries[30m])) * 100>=50**PromQL**:```promql sum by (job, instance,cluster) (rate(mysql_global_status_slow_queries{job="Prod_mysql_exporter"}[30m])) * 100>=50 ```---#### 184. RDS-aws connections_average 大于:1000- **级别**: P1-Warning | **状态**: Enabled | **ID**: 157 - **配置**: 执行间隔: 15s | 持续时间: 360s | 类型: prometheus - **备注**: aws_rds_database_connections_average>=1000 连接较多,请检查是否有连接未释放**PromQL**:```promql aws_rds_database_connections_average >= 1000 ```---#### 185. RDS-aws cpuutilization maximum-S2大于:90%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 159 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_rds_cpuutilization_maximum数据库cpu使用率大于:90 -100%**PromQL**:```promql aws_rds_cpuutilization_maximum{tag_env!~"prod|prodjp|prodhk"} >= 90 and aws_rds_cpuutilization_maximum{tag_env!~"prod|prodjp|prodhk"} <=100 ```---#### 186. RDS-free memory小于: 500M-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 249 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: RDS-free memory小于: 500M**PromQL**:```promql aws_rds_freeable_memory_average{tag_env!~"prod|prodjp|prodhk"} / 2^30 < 0.5 ```---#### 187. RDS-实例连接异常-nonprod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 256 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: RDS-实例连接异常-nonprod**PromQL**:```promql mysql_up{job=~"nonprod-aws-mysql-metrics|uat-aws-mysql-metrics"} == 0 ```---#### 188. RDS-连接利用率大于80%-nonprod- **级别**: P1-Warning | **状态**: Enabled | **ID**: 257 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: RDS-连接利用率大于80%-nonprod**PromQL**:```promql mysql_global_status_threads_running{job=~"nonprod-aws-mysql-metrics|uat-aws-mysql-metrics"} / mysql_global_variables_max_connections{job=~"nonprod-aws-mysql-metrics|uat-aws-mysql-metrics"} > 0.8 ```---### Infra/Redis (7 条, 启用 7)#### 189. 
Redis command执行大于3s-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 250 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Redis command执行大于3s**PromQL**:```promql lettuce_command_firstresponse_seconds_max{command!="BLPOP", env!~"prod.*"} > 3 ```---#### 190. Redis connection is down-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 264 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Redis connection is down**PromQL**:```promql probe_success{job=~"ltp_nonprod_redis_tcp|ltp_uat_redis_tcp"} == 0 ```---#### 191. Redis-evictions_average-S2被驱逐次数大于:1- **级别**: P1-Warning | **状态**: Enabled | **ID**: 155 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_elasticache_evictions_average{dimension_CacheNodeId=~".+"} >=1**PromQL**:```promql aws_elasticache_evictions_average{dimension_CacheNodeId=~".+",tag_env!~"prod|prodjp|prodhk"} >=1 ```---#### 192. Redis-memory_usage_percentage-S2内存使用率大于:90%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 156 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} >=90**PromQL**:```promql aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+",tag_env!~"prod|prodjp|prodhk"} >= 90 and aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+",tag_env!~"prod|prodjp|prodhk"} <=100 ```---#### 193. Redis-memory_usage_percentage内存使用率大于:80%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 83 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} >=80**PromQL**:```promql aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} >= 80 and aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} <90 ```---#### 194. 
redis-CPU利用率大于:80%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 239 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: redis-CPU利用率大于:80%**PromQL**:```promql aws_elasticache_cpuutilization_average{dimension_CacheNodeId=~".+"} >= 80 and aws_elasticache_cpuutilization_average{dimension_CacheNodeId=~".+"} < 90 ```---#### 195. redis-CPU利用率大于:90%-S2- **级别**: P1-Warning | **状态**: Enabled | **ID**: 251 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: redis-CPU利用率大于:90%**PromQL**:```promql aws_elasticache_cpuutilization_average{dimension_CacheNodeId=~".+",tag_env!~"prod|prodjp|prodhk"} >= 90 ```---### Prime/Custody (1 条, 启用 1)#### 196. Logs-DMA 生产日志告警- **级别**: P1-Warning | **状态**: Enabled | **ID**: 175 - **配置**: 执行间隔: 15s | 类型: loki - **备注**: DMA生产日志告警**PromQL**:```promql sum by (app, message) (count_over_time({app=~"(api-gateway|asset|backend-ltp-internalapi-project|base-fund|exchange|finance|ltp-config|ltp-data-custody-project|openapi|permission|pigeon-core|pigeon-gw|sec-crypt-service|security|siteapi|transfer|tsf-job|user)",env="prod"}|~"ALERTERROR"!~"(updateSumsubWebhookStatus|batchUpdateSumsubStatus|TO_FB_WITHDRAW_VAULT_TO_CHECK|SumsubController|ServiceListRequest|getPriceFromCamIndex)|DefaultSerializeClassChecker|DefaultSerializeClassChecker|UserLimiterAspect|ltp_secret_key"|pattern `<message>`[5m])) ```---### Prime/EMS/mds (2 条, 启用 0)#### 197. api行情-SLA 小于100%- **级别**: P1-Warning | **状态**: Disabled | **ID**: 181 - **配置**: 执行间隔: 15s | 持续时间: 1800s | 类型: prometheus - **备注**: api行情-SLA 小于100%**PromQL**:```promql clamp_min(sum by (cluster,env,job) (light_connect_server_connection{Connection="successfulConnection",instance=~"aws-jp-prod-mds-api-02:.*"} * 100),0) / ignoring(Connection) clamp_min(sum by (cluster,env,job) (light_connect_server_connection{Connection="OnlineConnection", instance=~"aws-jp-prod-mds-api-02:.*"}),1) <100 ```---#### 198. 
统一行情-SLA 小于100%- **级别**: P1-Warning | **状态**: Disabled | **ID**: 180 - **配置**: 执行间隔: 15s | 持续时间: 1800s | 类型: prometheus - **备注**: 统一行情-SLA 小于100%**PromQL**:```promql clamp_min(sum by (cluster,env,job) (light_connect_server_connection{Connection="successfulConnection",instance=~"aws-jp-prod-mds-connect-01:.*|aws-jp-prod-mds-connect-02:.*|aws-jp-prod-mds-quote-01:.*|aws-jp-prod-mds-quote-02:.*"} * 100),0) / ignoring(Connection) clamp_min(sum by (cluster,env,job) (light_connect_server_connection{Connection="OnlineConnection", instance=~"aws-jp-prod-mds-connect-01:.*|aws-jp-prod-mds-connect-02:.*|aws-jp-prod-mds-quote-01:.*|aws-jp-prod-mds-quote-02:.*"}),1) <100 ```---### Prime/OMS (4 条, 启用 3)#### 199. RapidX 应用异常请求 429 占比超过 50%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 276 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus**PromQL**:```promql sum by (java_application, uri,env, status) (rate(http_server_requests_seconds_count{env="prodjp",java_application=~"pb-trading.*|rapidx.*",java_application!="pb-trading-dump", uri != "/**",status=~"429"}[5m]) ) / ignoring(status) group_left sum by (java_application, uri, env) (rate(http_server_requests_seconds_count{env="prodjp",java_application=~"pb-trading.*|rapidx.*",java_application!="pb-trading-dump",uri != "/**"}[5m]) ) * 100 > 95 ```---#### 200. RapidX 应用异常请求 499/5xx 占比超过 10%- **级别**: P1-Warning | **状态**: Enabled | **ID**: 231 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus**PromQL**:```promql sum by (java_application, uri,env, status) (rate(http_server_requests_seconds_count{env="prodjp",java_application=~"pb-trading.*|rapidx.*",java_application!="pb-trading-dump", uri != "/**",status=~"5..|499"}[5m]) ) / ignoring(status) group_left sum by (java_application, uri, env) (rate(http_server_requests_seconds_count{env="prodjp",java_application=~"pb-trading.*|rapidx.*",java_application!="pb-trading-dump",uri != "/**"}[5m]) ) * 100 > 10 ```---#### 201. 
RapidX 服务 SQL 处理平均延迟 大于 3s 超过 1min- **级别**: P1-Warning | **状态**: Enabled | **ID**: 193 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus**PromQL**:```promql rate(mybatis_sql_timer_seconds_sum{env=~"prod.*"}[1m])>0 / rate(mybatis_sql_timer_seconds_count{env=~"prod.*"}[1m])>0 >3 ```---#### 202. xchange_binance_limit_rate -S2 大于:5000- **级别**: P1-Warning | **状态**: Disabled | **ID**: 158 - **配置**: 执行间隔: 15s | 持续时间: 60s | 类型: prometheus - **备注**: xchange_binance_limit_rate > 5000**PromQL**:```promql xchange_binance_limit_rate >5000 ```---## P2-Info (20 条)### Infra/EC2 (9 条, 启用 8)#### 203. AWS 维护事件- **级别**: P2-Info | **状态**: Enabled | **ID**: 288 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus**PromQL**:```promql aws_health_event_info > 0 ```---#### 204. EC2-Available Memory 使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 25 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: EC2实例内存使用率>80%-89%**PromQL**:```promql (1 - (node_memory_MemAvailable_bytes{PrivateIpAddress!=""} / node_memory_MemTotal_bytes{PrivateIpAddress!=""})) * 100 >= 80 and (1 - (node_memory_MemAvailable_bytes{PrivateIpAddress!=""} / node_memory_MemTotal_bytes{PrivateIpAddress!=""})) * 100 < 90 ```---#### 205. EC2-CPU负载大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 11 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: CPU负载>80%-90%**PromQL**:```promql 100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 >= 80 and 100 - sum(rate(node_cpu_seconds_total{mode="idle", PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) / sum(rate(node_cpu_seconds_total{PrivateIpAddress!=""}[5m])) by (Region,instance,Name,env,PrivateIpAddress) * 100 < 90 ```---#### 206. 
EC2-DISK IO使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 12 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: 磁盘IO使用率>80%,请检查存储压力或提升IOPS能力**PromQL**:```promql rate(node_disk_io_time_seconds_total{PrivateIpAddress!=""}[5m]) >= 0.8 and rate(node_disk_io_time_seconds_total{PrivateIpAddress!=""}[5m]) < 0.9 ```---#### 207. EC2-DISK 写入延迟- **级别**: P2-Info | **状态**: Enabled | **ID**: 37 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: EC2-DISK 写入延迟情况**PromQL**:```promql (rate(node_disk_write_time_seconds_total{PrivateIpAddress!=""}[1m]) / rate(node_disk_writes_completed_total{PrivateIpAddress!=""}[1m]) > 0.1 and rate(node_disk_writes_completed_total{PrivateIpAddress!=""}[1m]) > 0) ```---#### 208. EC2-DISK 读取延迟- **级别**: P2-Info | **状态**: Enabled | **ID**: 34 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: EC2-DISK 读取延迟**PromQL**:```promql (rate(node_disk_read_time_seconds_total{PrivateIpAddress!=""}[1m]) / rate(node_disk_reads_completed_total{PrivateIpAddress!=""}[1m])) > 0.1 and (rate(node_disk_reads_completed_total{PrivateIpAddress!=""}[1m]))>0 ```---#### 209. EC2-Disk IO wait大于:80%- **级别**: P2-Info | **状态**: Disabled | **ID**: 6 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Disk is too busy (IO wait > 80%)**PromQL**:```promql rate(node_disk_io_time_seconds_total{PrivateIpAddress!=""}[5m]) > 0.80 ```---#### 210. EC2-Disk avail_bytes 磁盘使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 30 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Disk使用率>80%**PromQL**:```promql (1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",PrivateIpAddress!=""} / node_filesystem_size_bytes{PrivateIpAddress!=""})) >= 0.80 and (1 - (node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)",PrivateIpAddress!=""} / node_filesystem_size_bytes{PrivateIpAddress!=""})) < 0.90 and on (instance, device, mountpoint) node_filesystem_readonly == 0 ```---#### 211. 
EC2-OutOf available inodes使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 8 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: Disk is almost running out of available inodes (< 10% left)**PromQL**:```promql (1 - (node_filesystem_files_free{PrivateIpAddress!=""} / node_filesystem_files{PrivateIpAddress!=""})) >= 0.80 and (1 - (node_filesystem_files_free{PrivateIpAddress!=""} / node_filesystem_files{PrivateIpAddress!=""})) < 0.90 and on (instance, device, mountpoint) node_filesystem_readonly == 0 ```---### Infra/K8S (6 条, 启用 6)#### 212. Container-HighCpuUtilization CPU使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 42 - **配置**: 执行间隔: 15s | 持续时间: 300s | 类型: prometheus - **备注**: 容器-CPU利用率超过80%**PromQL**:```promql (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])) by (cluster,namespace,pod,container) / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu"})) * 100 >= 80 and (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])) by (cluster,namespace,pod,container) / sum by (cluster,namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu"})) * 100 < 90 ```---#### 213. Container-HighMemoryUsage 内存使用率大于: 80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 46 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: 容器内存使用率达到80%-90%**PromQL**:```promql round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory"}) ) [5m:]))/100 >=80 and round(10000*avg_over_time((avg by (container,cluster,namespace,pod) (container_memory_working_set_bytes{}) / avg by (container,cluster,namespace,pod) (kube_pod_container_resource_limits{resource="memory"}) ) [5m:]))/100 <90 ```---#### 214. K8s-Node-DiskUsed. 
磁盘使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 220 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: k8s节点-磁盘利用率到达80%-90%**PromQL**:```promql ((node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} - node_filesystem_free_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"}) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} ) * 100 >= 80 and ((node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} - node_filesystem_free_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"}) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!~"tmpfs|rootfs|overlay"} ) * 100 < 90 ```---#### 215. K8s-Node-HighCpuUtilization CPU使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 104 - **配置**: 执行间隔: 15s | 类型: prometheus - **备注**: k8s节点-CPU利用率超过80%**PromQL**:```promql 100 - (avg(rate(node_cpu_seconds_total{job="node-exporter",mode='idle'}[5m])) by (cluster,instance,namespace,job) * 100) >=80 and 100 - (avg(rate(node_cpu_seconds_total{job="node-exporter",mode='idle'}[5m])) by (cluster,instance,namespace,job) * 100) <90 ```---#### 216. K8s-Node-MemUsed. 内存使用率大于:80%- **级别**: P2-Info | **状态**: Enabled | **ID**: 106 - **配置**: 执行间隔: 15s | 持续时间: 180s | 类型: prometheus - **备注**: k8s节点-内存利用率到达80%-90%**PromQL**:```promql 100 * (1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes{job='node-exporter'} + node_memory_Cached_bytes{job='node-exporter'}) / node_memory_MemTotal_bytes{job='node-exporter'}) >=80 and 100 * (1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes{job='node-exporter'} + node_memory_Cached_bytes{job='node-exporter'}) / node_memory_MemTotal_bytes{job='node-exporter'}) <90 ```---#### 217. 
PVC-PersistentVolumeClaim usage above 80%

- **Level**: P2-Info | **Status**: Enabled | **ID**: 98
- **Config**: eval interval: 15s | type: prometheus
- **Note**: PVC storage usage > 80%

**PromQL**:

```promql
sum by (namespace, persistentvolumeclaim, cluster) (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 >= 80
and
sum by (namespace, persistentvolumeclaim, cluster) (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 90
```

---

### Infra/RDS (1 rule, 1 enabled)

#### 218. RDS-aws cpuutilization maximum above 80%

- **Level**: P2-Info | **Status**: Enabled | **ID**: 67
- **Config**: eval interval: 15s | type: prometheus
- **Note**: aws_rds_cpuutilization_maximum: database CPU usage between 80% and 90%

**PromQL**:

```promql
aws_rds_cpuutilization_maximum >= 80 and aws_rds_cpuutilization_maximum < 90
```

---

### Infra/Redis (1 rule, 1 enabled)

#### 219. Redis-memory_usage_percentage memory usage above 70%

- **Level**: P2-Info | **Status**: Enabled | **ID**: 82
- **Config**: eval interval: 15s | type: prometheus
- **Note**: aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} > 70

**PromQL**:

```promql
aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} >= 70
and
aws_elasticache_database_memory_usage_percentage_maximum{dimension_CacheNodeId=~".+"} < 80
```

---

### Prime/EMS/mds (3 rules, 2 enabled)

#### 220. EC2-mds Memory 5min memory usage above 80%

- **Level**: P2-Info | **Status**: Enabled | **ID**: 114
- **Config**: eval interval: 15s | type: prometheus
- **Note**: memory load >= 80% over the past 5 min; check application resource usage and logs promptly

**PromQL**:

```promql
(1 - (
  node_memory_MemAvailable_bytes{Name=~"ltp-rapidx-prod-mdsengine-.*|aws-jp-prod-mds-algo-01|aws-jp-prod-mds-algo-02|aws-jp-prod-mds-connex-.*|aws-jp-prod-mds-quote-.*|aws-jp-prod-mdsengine-05-edx|aws-jp-prod-mds-web-01|aws-jp-prod-mds-query-.*|aws-jp-prod-mds-api-.*|aws-jp-prod-mds-onezero-01"}
  /
  node_memory_MemTotal_bytes{Name=~"ltp-rapidx-prod-mdsengine-.*|aws-jp-prod-mds-algo-01|aws-jp-prod-mds-algo-02|aws-jp-prod-mds-connex-.*|aws-jp-prod-mds-quote-.*|aws-jp-prod-mdsengine-05-edx|aws-jp-prod-mds-web-01|aws-jp-prod-mds-query-.*|aws-jp-prod-mds-api-.*|aws-jp-prod-mds-onezero-01"}
)) * 100 > 80
```

---

#### 221. argo market data-OnlineConnection count is 0

- **Level**: P2-Info | **Status**: Enabled | **ID**: 177
- **Config**: eval interval: 15s | duration: 300s | type: prometheus
- **Note**: argo connection 30min : 0

**PromQL**:

```promql
sum by (team, cluster, job, Connection) (
  sum without (instance) (
    avg_over_time(light_connect_server_connection{Connection="OnlineConnection",instance=~"aws-jp-prod-mds-algo-01:.*|aws-jp-prod-mds-algo-02:.*"}[30m])
  )
) <= 0
```

---

#### 222. connex market data-SLA below 100%

- **Level**: P2-Info | **Status**: Disabled | **ID**: 182
- **Config**: eval interval: 15s | duration: 1800s | type: prometheus
- **Note**: connex market data SLA below 100%

**PromQL**:

```promql
clamp_min(sum by (cluster, env, job) (light_connect_server_connection{Connection="successfulConnection",instance=~"aws-jp-prod-mds-connex-02:.*"} * 100), 0)
/ ignoring(Connection)
clamp_min(sum by (cluster, env, job) (light_connect_server_connection{Connection="OnlineConnection",instance=~"aws-jp-prod-mds-connex-02:.*"}), 1)
< 100
```

---

## Coverage Gap Analysis

### Covered monitoring dimensions

| Dimension | Coverage | Business group |
|------|---------|--------|
| EC2 hosts (CPU/MEM/Disk/Network/OOM) | Complete; tiered prod/nonprod | Infra/EC2 |
| K8s containers (CPU/MEM/OOM/Restart/Wait) | Complete; tiered prod/nonprod | Infra/K8S |
| K8s nodes (CPU/MEM/Disk/NotReady) | Complete | Infra/K8S |
| RDS MySQL (CPU/connections/slow queries/memory) | Complete; tiered prod/nonprod | Infra/RDS |
| Redis (CPU/memory/evictions/connections/command latency) | Complete; tiered prod/nonprod | Infra/Redis |
| AWS Kafka (CPU/memory/disk/connections) | Largely covered | Infra/Kafka |
| Self-managed Kafka (Controller/Broker/Partition/Queue/Election) | Core metrics covered | Infra/Kafka |
| ELB (5xx/response time/connection errors) | Complete | Infra/Monitoring |
| Prometheus itself (remote write/target/service discovery) | Complete | Infra/Monitoring |
| SSL certificate expiry | Covered (< 15 days) | Infra/Monitoring |
| Nginx status codes (499/5xx) | Covered | Infra/AccessLog |
| JVM heap | Covered (80%/95%) | Infra/DevOps |
| RapidX/OMS business (latency/errors/engine status) | Covered in detail | Prime/OMS |
| EMS/trading (exchange errors/rate limits/leased lines) | Covered in detail | Prime/EMS/* |

### Suggested additional alerts

| Dimension | Missing coverage | Suggested level | Notes |
|------|---------|---------|------|
| Self-managed Kafka | Consumer lag | P1-Warning | Requires kafka-exporter; the current JMX metrics expose no lag |
| Self-managed Kafka | Throughput surge or drop | P1-Warning | Already drafted in kafka-alerting-rules.yaml, pending deployment |
| Self-managed Kafka | JMX exporter scrape failure | P1-Warning | jmx_scrape_error > 0; detects monitoring blind spots |
| Self-managed Kafka | Log segment disk growth | P2-Info | Monitor the growth rate of kafka_log_log_size |
| Mimir | Ingester ring unhealthy | P0-Critical | Blind spot exposed by the recent incident |
| Mimir | Distributor write failure rate | P1-Warning | Add alerts for Mimir's own write path |
| Mimir | Compactor stopped | P1-Warning | Affects historical data queries |
| PVC | Generic PVC usage > 80% | P1-Warning | Currently only the K8s group has this; recommend global coverage |
| DNS | CoreDNS latency/failure rate | P1-Warning | DNS failures affect service discovery cluster-wide |
| Ingress | Ingress Controller 5xx/latency | P1-Warning | ELB-level alerts exist, but the Ingress layer is uncovered |
| CronJob | K8s CronJob execution failure | P1-Warning | Failed scheduled jobs currently raise no alert |
| HPA | HPA at max replicas | P2-Info | Hitting the autoscaling ceiling needs attention |

### Optimization recommendations

| Issue | Affected rules | Recommendation |
|------|---------|------|
| P0 share too high (52%) | 116 Critical rules | Review which rules can be downgraded to Warning to reduce alert fatigue |
| 25 rules disabled | Scattered across groups | Clean up periodically; confirm whether each is temporarily silenced or deprecated |
| Duplicate prod/nonprod tiers | EC2/K8S/RDS/Redis | Use N9E variables or templates to reduce rule count |
| Some rules lack PromQL | XXLJOB, etc. | Confirm whether these are event-based rules and document accordingly |
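To make a few of the suggested additions concrete, the consumer-lag, CronJob, and HPA gaps could be sketched roughly as below. These are starting points, not rules taken from the current set: the metric names assume the exporters proposed above (`kafka_consumergroup_lag` from kafka-exporter; `kube_job_status_failed` and the `kube_horizontalpodautoscaler_*` series from kube-state-metrics v2 naming), and the lag threshold is a placeholder to tune per topic.

```promql
# Self-managed Kafka consumer lag (P1-Warning); requires kafka-exporter.
# The 1000-message threshold is a placeholder, tune per topic/consumer group.
sum by (consumergroup, topic) (kafka_consumergroup_lag) > 1000

# K8s CronJob execution failure (P1-Warning); assumes kube-state-metrics.
# Fires while any Job reports failed pods.
kube_job_status_failed > 0

# HPA pinned at max replicas (P2-Info); assumes kube-state-metrics v2 names.
kube_horizontalpodautoscaler_status_current_replicas
  == kube_horizontalpodautoscaler_spec_max_replicas
```

Each expression would be registered as a separate N9E prometheus-type rule at the level shown in its comment, with an appropriate duration to suppress transient spikes.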
