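Background
Our business application runs on ECS and is being migrated to a k8s container cluster. The application is written in Java and runs on JDK version 1.8.333.
Symptoms
With the same code, the same startup command, and pod CPU and memory limits identical to the ECS instance configuration, the pod turned yellow after the migration while still reporting Running (the health check was failing, but not yet often enough for the pod to be marked NotReady). The developers reported that the application was processing transactions and responding more slowly, and the logs showed Redis queries timing out.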

Investigation
The JVM monitoring shows that GC is taking a huge amount of time.

Heap usage is very high.

The GC log:
2026-04-24T17:59:48.003+0800: 84612.407: [Full GC (Allocation Failure) 2026-04-24T17:59:48.003+0800: 84612.407: [CMS: 4194301K->4194279K(4194304K), 6.4063180 secs] 6081789K->5351315K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 6.4064413 secs] [Times: user=6.34 sys=0.02, real=6.41 secs]
2026-04-24T17:59:54.410+0800: 84618.814: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194279K(4194304K)] 5353033K(6081792K), 0.6205127 secs] [Times: user=0.62 sys=0.00, real=0.62 secs]
2026-04-24T17:59:55.030+0800: 84619.434: [CMS-concurrent-mark-start]
2026-04-24T17:59:57.630+0800: 84622.034: [Full GC (Allocation Failure) 2026-04-24T17:59:57.630+0800: 84622.034: [CMS2026-04-24T17:59:58.323+0800: 84622.727: [CMS-concurrent-mark: 3.286/3.293 secs] [Times: user=8.94 sys=1.03, real=3.29 secs] (concurrent mode failure): 4194288K->4194246K(4194304K), 7.4057876 secs] 6081776K->5336664K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 7.4059258 secs] [Times: user=7.29 sys=0.05, real=7.41 secs]
2026-04-24T18:00:06.727+0800: 84631.131: [Full GC (Allocation Failure) 2026-04-24T18:00:06.728+0800: 84631.131: [CMS: 4194297K->4194255K(4194304K), 6.6118649 secs] 6081785K->5345113K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 6.6120142 secs] [Times: user=6.53 sys=0.03, real=6.61 secs]
2026-04-24T18:00:13.341+0800: 84637.745: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194255K(4194304K)] 5352379K(6081792K), 0.6283735 secs] [Times: user=0.63 sys=0.00, real=0.63 secs]
2026-04-24T18:00:13.969+0800: 84638.373: [CMS-concurrent-mark-start]
2026-04-24T18:00:16.238+0800: 84640.641: [Full GC (Allocation Failure) 2026-04-24T18:00:16.238+0800: 84640.642: [CMS2026-04-24T18:00:17.436+0800: 84641.840: [CMS-concurrent-mark: 3.454/3.467 secs] [Times: user=8.60 sys=1.79, real=3.47 secs] (concurrent mode failure): 4194303K->4194269K(4194304K), 7.7652063 secs] 6081791K->5431139K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 7.7654367 secs] [Times: user=7.69 sys=0.02, real=7.76 secs]
2026-04-24T18:00:26.025+0800: 84650.429: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194269K(4194304K)] 6057347K(6081792K), 0.6756669 secs] [Times: user=0.93 sys=0.00, real=0.67 secs]
2026-04-24T18:00:26.701+0800: 84651.105: [CMS-concurrent-mark-start]
2026-04-24T18:00:26.939+0800: 84651.343: [Full GC (Allocation Failure) 2026-04-24T18:00:26.939+0800: 84651.343: [CMS2026-04-24T18:00:29.289+0800: 84653.693: [CMS-concurrent-mark: 2.584/2.588 secs] [Times: user=3.24 sys=0.14, real=2.59 secs] (concurrent mode failure): 4194291K->4194256K(4194304K), 8.9050760 secs] 6081779K->5451447K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 8.9052878 secs] [Times: user=8.81 sys=0.03, real=8.90 secs]
2026-04-24T18:00:37.636+0800: 84662.040: [Full GC (Allocation Failure) 2026-04-24T18:00:37.636+0800: 84662.040: [CMS: 4194292K->4194257K(4194304K), 6.5953778 secs] 6081780K->5429931K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 6.5955893 secs] [Times: user=6.50 sys=0.04, real=6.59 secs]
2026-04-24T18:00:44.233+0800: 84668.637: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194257K(4194304K)] 5435054K(6081792K), 0.7153459 secs] [Times: user=0.72 sys=0.00, real=0.72 secs]
2026-04-24T18:00:44.949+0800: 84669.353: [CMS-concurrent-mark-start]
2026-04-24T18:00:47.068+0800: 84671.472: [Full GC (Allocation Failure) 2026-04-24T18:00:47.068+0800: 84671.472: [CMS2026-04-24T18:00:48.383+0800: 84672.787: [CMS-concurrent-mark: 3.428/3.434 secs] [Times: user=9.10 sys=0.84, real=3.43 secs] (concurrent mode failure): 4194303K->4194301K(4194304K), 8.4859177 secs] 6081791K->5431366K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 8.4861286 secs] [Times: user=8.40 sys=0.03, real=8.48 secs]
2026-04-24T18:00:57.555+0800: 84681.959: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194301K(4194304K)] 6003921K(6081792K), 0.6812176 secs] [Times: user=0.91 sys=0.00, real=0.68 secs]
2026-04-24T18:00:58.236+0800: 84682.640: [CMS-concurrent-mark-start]
2026-04-24T18:00:58.874+0800: 84683.278: [Full GC (Allocation Failure) 2026-04-24T18:00:58.874+0800: 84683.278: [CMS2026-04-24T18:01:00.976+0800: 84685.380: [CMS-concurrent-mark: 2.737/2.740 secs] [Times: user=4.40 sys=0.40, real=2.74 secs] (concurrent mode failure): 4194304K->4194295K(4194304K), 9.2351389 secs] 6081791K->5361251K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 9.2353161 secs] [Times: user=9.11 sys=0.03, real=9.24 secs]
2026-04-24T18:01:10.124+0800: 84694.528: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194295K(4194304K)] 5909578K(6081792K), 0.6292803 secs] [Times: user=0.85 sys=0.00, real=0.63 secs]
2026-04-24T18:01:10.754+0800: 84695.158: [CMS-concurrent-mark-start]
2026-04-24T18:01:12.256+0800: 84696.660: [Full GC (Allocation Failure) 2026-04-24T18:01:12.256+0800: 84696.660: [CMS2026-04-24T18:01:13.842+0800: 84698.246: [CMS-concurrent-mark: 3.083/3.088 secs] [Times: user=6.97 sys=0.77, real=3.09 secs] (concurrent mode failure): 4194304K->4194297K(4194304K), 8.6715846 secs] 6081791K->5333166K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 8.6717326 secs] [Times: user=8.56 sys=0.04, real=8.67 secs]
2026-04-24T18:01:22.764+0800: 84707.168: [Full GC (Allocation Failure) 2026-04-24T18:01:22.764+0800: 84707.168: [CMS: 4194303K->4194296K(4194304K), 6.4223515 secs] 6081791K->5349743K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 6.4225309 secs] [Times: user=6.36 sys=0.03, real=6.43 secs]
2026-04-24T18:01:29.188+0800: 84713.591: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194296K(4194304K)] 5357633K(6081792K), 0.6392615 secs] [Times: user=0.64 sys=0.00, real=0.64 secs]
2026-04-24T18:01:29.827+0800: 84714.231: [CMS-concurrent-mark-start]
2026-04-24T18:01:31.533+0800: 84715.936: [Full GC (Allocation Failure) 2026-04-24T18:01:31.533+0800: 84715.937: [CMS2026-04-24T18:01:33.002+0800: 84717.406: [CMS-concurrent-mark: 3.169/3.175 secs] [Times: user=7.33 sys=0.98, real=3.17 secs] (concurrent mode failure): 4194303K->4194295K(4194304K), 7.9170215 secs] 6081791K->5355085K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 7.9172085 secs] [Times: user=7.83 sys=0.03, real=7.92 secs]
2026-04-24T18:01:41.450+0800: 84725.854: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194295K(4194304K)] 6029196K(6081792K), 0.6419288 secs] [Times: user=0.95 sys=0.01, real=0.64 secs]
2026-04-24T18:01:42.092+0800: 84726.496: [CMS-concurrent-mark-start]
2026-04-24T18:01:42.471+0800: 84726.874: [Full GC (Allocation Failure) 2026-04-24T18:01:42.471+0800: 84726.875: [CMS2026-04-24T18:01:44.798+0800: 84729.202: [CMS-concurrent-mark: 2.703/2.706 secs] [Times: user=3.72 sys=0.25, real=2.71 secs] (concurrent mode failure): 4194303K->4194281K(4194304K), 8.8586955 secs] 6081791K->5343373K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 8.8588622 secs] [Times: user=8.76 sys=0.03, real=8.86 secs]
2026-04-24T18:01:53.330+0800: 84737.734: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194281K(4194304K)] 5926812K(6081792K), 0.6300073 secs] [Times: user=0.87 sys=0.00, real=0.63 secs]
2026-04-24T18:01:53.960+0800: 84738.364: [CMS-concurrent-mark-start]
2026-04-24T18:01:55.264+0800: 84739.668: [Full GC (Allocation Failure) 2026-04-24T18:01:55.264+0800: 84739.668: [CMS2026-04-24T18:01:56.828+0800: 84741.232: [CMS-concurrent-mark: 2.862/2.868 secs] [Times: user=6.39 sys=0.58, real=2.87 secs] (concurrent mode failure): 4194298K->4194303K(4194304K), 8.1292896 secs] 6081786K->5320677K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 8.1294378 secs] [Times: user=8.03 sys=0.03, real=8.13 secs]
2026-04-24T18:02:04.788+0800: 84749.192: [Full GC (Allocation Failure) 2026-04-24T18:02:04.788+0800: 84749.192: [CMS: 4194303K->4194295K(4194304K), 6.3928626 secs] 6081791K->5335401K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 6.3930074 secs] [Times: user=6.33 sys=0.02, real=6.39 secs]
2026-04-24T18:02:11.182+0800: 84755.586: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194295K(4194304K)] 5337539K(6081792K), 0.6297399 secs] [Times: user=0.63 sys=0.00, real=0.63 secs]
2026-04-24T18:02:11.812+0800: 84756.216: [CMS-concurrent-mark-start]
2026-04-24T18:02:12.870+0800: 84757.274: [Full GC (Allocation Failure) 2026-04-24T18:02:12.870+0800: 84757.274: [CMS
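🔴 Root cause: CMS concurrent mode failure
Old generation capacity: 4,194,304K = 4 GB (completely full)
Old generation occupancy after each Full GC: 4,194,246K to 4,194,303K (barely moves)
CMS concurrent mark: ~3 s (but the old generation fills up again within those 3 s)
→ concurrent marking has not finished yet
→ a Full GC (STW) is triggered
→ the application pauses for 6 to 9 s
→ and the cycle repeats endlessly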
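The pattern summarized above can be checked against the raw log with a small parser. This is a minimal sketch rather than part of the original investigation: it assumes the JDK 8 CMS log format shown above with one record per line, and the names `OLD_GEN`, `parse_full_gcs` and the dictionary keys are hypothetical. The two sample records are copied from the GC log.

```python
import re

# Two Full GC records copied from the GC log (JDK 8, -XX:+PrintGCDetails format).
SAMPLE = """\
2026-04-24T17:59:48.003+0800: 84612.407: [Full GC (Allocation Failure) 2026-04-24T17:59:48.003+0800: 84612.407: [CMS: 4194301K->4194279K(4194304K), 6.4063180 secs] 6081789K->5351315K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 6.4064413 secs] [Times: user=6.34 sys=0.02, real=6.41 secs]
2026-04-24T17:59:57.630+0800: 84622.034: [Full GC (Allocation Failure) 2026-04-24T17:59:57.630+0800: 84622.034: [CMS2026-04-24T17:59:58.323+0800: 84622.727: [CMS-concurrent-mark: 3.286/3.293 secs] [Times: user=8.94 sys=1.03, real=3.29 secs] (concurrent mode failure): 4194288K->4194246K(4194304K), 7.4057876 secs] 6081776K->5336664K(6081792K), [Metaspace: 143127K->143127K(1185792K)], 7.4059258 secs] [Times: user=7.29 sys=0.05, real=7.41 secs]
"""

# Old-generation transition: "before->after(capacity), N secs]".
# It follows either "CMS: " or "(concurrent mode failure): ".
OLD_GEN = re.compile(
    r"(?:CMS|\(concurrent mode failure\)): "
    r"(\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]"
)


def parse_full_gcs(log_text):
    """Return one dict per Full GC record found in log_text."""
    records = []
    for line in log_text.splitlines():
        if "[Full GC" not in line:
            continue
        match = OLD_GEN.search(line)
        if match is None:
            continue  # truncated or unexpected record: skip it
        records.append({
            "old_before_kb": int(match.group(1)),
            "old_after_kb": int(match.group(2)),
            "old_capacity_kb": int(match.group(3)),
            "old_gc_secs": float(match.group(4)),
            "concurrent_mode_failure": "(concurrent mode failure)" in line,
        })
    return records


for rec in parse_full_gcs(SAMPLE):
    reclaimed = rec["old_before_kb"] - rec["old_after_kb"]
    occupancy = rec["old_after_kb"] / rec["old_capacity_kb"]
    print(f"reclaimed {reclaimed}K in {rec['old_gc_secs']:.2f}s, "
          f"old gen still {occupancy:.4%} full, "
          f"cmf={rec['concurrent_mode_failure']}")
```

For these two records the old generation gives back only 22K and 42K after 6.4 s and 7.4 s of collection, which is the "barely moves" behaviour listed above.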
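Analysis
Normally, young-generation collections are enough to keep the JVM supplied with memory. When they are not, an old-generation Full GC is triggered. A Full GC scans the whole heap and usually takes several seconds, and it is a stop-the-world (STW) event for its entire duration.
In other words, every application thread is suspended and the service stops responding. An interruption of several seconds has a huge impact on the business, and at the k8s level the probes fail and the pod turns yellow.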
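Locating the fault
On ECS, the Java logs normally show only young-generation GCs and no Full GC. So what makes Full GC so frequent in the container environment? Put differently, why is garbage collection so inefficient in the container that it falls back to Full GC?
The startup parameters are: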
-Dfile.encoding=UTF-8 -XX:MaxRAMPercentage=90.0 -XX:+UseContainerSupport -Xloggc:/debug/gc.log -XX:NewRatio=2 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/debug/ -Xloggc:/debug/gc.log -Xmx6g -Xms6g -XX:MetaspaceSize=512M -XX:MaxMetaspaceSize=512M -XX:NewRatio=2 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/debug/ -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Denv=pro -Didc=3c
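The 4 GB old generation seen in the GC log follows from these parameters. The sketch below works through the arithmetic under two assumptions that are mine, not the author's: that the explicit `-Xmx6g` is what sizes the heap (rather than `-XX:MaxRAMPercentage`), and that the young generation is simply heap / (NewRatio + 1). The helper names `parse_size` and `heap_layout` are hypothetical.

```python
def parse_size(value):
    """Convert a JVM size string such as '6g' or '512M' to bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = value[-1].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)


def heap_layout(jvm_opts):
    """Return (young_bytes, old_bytes) implied by -Xmx and -XX:NewRatio."""
    max_heap = None
    new_ratio = None
    for opt in jvm_opts.split():
        if opt.startswith("-Xmx"):
            max_heap = parse_size(opt[len("-Xmx"):])
        elif opt.startswith("-XX:NewRatio="):
            new_ratio = int(opt.split("=", 1)[1])
    if max_heap is None or new_ratio is None:
        raise ValueError("this sketch needs both -Xmx and -XX:NewRatio")
    # NewRatio=N means old:young = N:1, so young is 1/(N+1) of the heap.
    young = max_heap // (new_ratio + 1)
    return young, max_heap - young


young, old = heap_layout("-Xmx6g -Xms6g -XX:NewRatio=2 -XX:+UseConcMarkSweepGC")
print(f"young = {young // 1024}K, old = {old // 1024}K")
```

With `-Xmx6g` and `-XX:NewRatio=2` this gives an old generation of 4194304K, the same capacity the CMS lines in the GC log report.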
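The startup parameters are identical on both sides, so where do the two environments differ? In the container's cgroup limits, which are derived from the pod's resource limits.
The real cause:
Container CFS scheduling vs. the concurrent threads of the JVM's CMS collector (the startup parameter -XX:+UseConcMarkSweepGC selects the CMS collector)
How CFS fair scheduling works
Linux CFS (Completely Fair Scheduler) hands out CPU to threads in time slices. Inside a container with a CPU limit of 4 cores (assuming cfs_quota_us/cfs_period_us = 400000/100000):
Within one scheduling period (100 ms), the container can use at most 400 ms of CPU time
All threads compete for those 400 ms; once they are used up, every thread in the container is throttled until the next period starts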
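The quota arithmetic above can be made concrete with a small model. This is a simplified sketch under stated assumptions: every busy thread runs on its own core at full speed until the quota is gone, and the thread counts used here are illustrative, not measured from the application. The function name `cfs_budget` is hypothetical.

```python
def cfs_budget(quota_us, period_us, busy_threads):
    """Model one CFS period.

    Returns (cpu_limit_cores, run_ms, throttled_ms): how long the
    container runs before its quota is spent, and how long it then
    sits throttled until the next period begins.
    """
    if busy_threads < 1:
        raise ValueError("busy_threads must be at least 1")
    limit_cores = quota_us / period_us
    period_ms = period_us / 1000
    quota_ms = quota_us / 1000
    # N busy threads on N cores burn N ms of quota per ms of wall time.
    run_ms = min(period_ms, quota_ms / busy_threads)
    return limit_cores, run_ms, period_ms - run_ms


for threads in (4, 8, 16):
    cores, run_ms, throttled_ms = cfs_budget(400000, 100000, threads)
    print(f"{threads:>2} busy threads on a {cores:.0f}-core limit: "
          f"run {run_ms:.0f} ms, throttled {throttled_ms:.0f} ms per period")
```

Under this model, 4 busy threads fit inside the quota, while 8 busy threads (for example application threads plus GC threads) spend the 400 ms budget in 50 ms and leave the whole container frozen for the remaining 50 ms of each period.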
This is where the CMS concurrent threads get into trouble


cat /sys/fs/cgroup/cpu/cpu.stat
nr_periods 884164
nr_throttled 86841
throttled_time 116831289349089
current_bw 326101028
nr_burst 0
burst_time 0
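A quick way to read these counters is the share of scheduling periods in which the container was throttled. The sketch below parses the key/value text and computes that ratio; the function names `parse_cpu_stat` and `throttle_ratio` are hypothetical, and the input is the three standard counters from the output above.

```python
# Counters copied from the cpu.stat output above.
CPU_STAT = """\
nr_periods 884164
nr_throttled 86841
throttled_time 116831289349089
"""


def parse_cpu_stat(text):
    """Parse 'key value' lines from cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats):
    """Fraction of CFS periods in which the cgroup hit its quota."""
    if stats["nr_periods"] == 0:
        return 0.0
    return stats["nr_throttled"] / stats["nr_periods"]


stats = parse_cpu_stat(CPU_STAT)
print(f"throttled in {throttle_ratio(stats):.1%} of periods")
```

For the numbers above the ratio comes out at roughly 9.8%, meaning the container ran out of CPU quota in about one period out of ten.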




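Solution
Switch to the G1 garbage collector, which is better suited to container environments, in place of -XX:+UseConcMarkSweepGC:
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
Explanation:
G1 collects incrementally. After being interrupted by throttling it simply carries on with the next Region, so the problem of "a whole-heap scan that never finishes" does not arise. After switching to G1, the Full GC frequency should drop sharply.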
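Because the JVM accepts only one collector selection, the G1 flags have to replace the CMS flag rather than be appended next to it. The following sketch shows that substitution on a list of options; the helper name `switch_to_g1` is hypothetical, and the option list is a shortened excerpt of the startup parameters, not the full command line.

```python
def switch_to_g1(opts, max_pause_ms=200):
    """Return a copy of opts with the CMS selection replaced by G1 flags."""
    kept = [
        opt for opt in opts
        if opt not in ("-XX:+UseConcMarkSweepGC", "-XX:+UseG1GC")
        and not opt.startswith("-XX:MaxGCPauseMillis=")
    ]
    # Append the G1 selection and pause-time goal exactly once.
    return kept + ["-XX:+UseG1GC", f"-XX:MaxGCPauseMillis={max_pause_ms}"]


before = ["-Xmx6g", "-Xms6g", "-XX:+UseConcMarkSweepGC", "-Denv=pro"]
print(" ".join(switch_to_g1(before)))
```

The pause goal of 200 ms is the value given in the solution above; it is a target that G1 tries to meet, not a guarantee.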
