Finding the hot process from perf hotspots (6) - 小镇

news 2026/5/9 19:10:10

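OK. As requested, this walks through a standard system-level troubleshooting loop in order: vmstat → top → perf → sched → causal chain → conclusion. No steps are skipped and nothing is guessed from a PID. Every step follows the same pattern:

command → what the output means → basis for the judgment → conclusion → next step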

🚨 I. vmstat: first check whether this is a CPU-type problem

🧪 Command

vmstat 1

Key output you provided:

r  b   us sy id wa
102     57 20 24 0
cs: 400k+
in: 400k+

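🔍 Analysis

1️⃣ r = 102

👉 The run queue is very long. r counts tasks that are running or waiting for a CPU.

Metric	Meaning
r > number of CPU cores (usually a few dozen)	Severe CPU queueing

✔ Conclusion:

Tasks are queueing for CPU time; this is not a lightly loaded system

2️⃣ cs = 400k+

👉 Context switches are exploding (vmstat reports cs per second)

Possible causes:

A very large number of threads
Or severe lock contention
Or scheduler jitter inside the runtime

3️⃣ wa = 0

👉 No CPU time is spent waiting on I/O, so I/O is not the bottleneck

🚨 vmstat conclusion

❗This is not an I/O problem; it is a CPU scheduling / thread problem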
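
The vmstat reasoning above can be sketched as a small check. This is only an illustration: the function name, the 5% I/O-wait threshold, and the CPU count of 32 are assumptions (the article never states how many cores the machine has); only the r, us, sy, id and wa values come from the output above.

```python
def classify_vmstat(sample, ncpu, iowait_pct=5):
    """Return a list of findings for one vmstat sample.

    sample: dict with the vmstat columns r, us, sy, id, wa.
    ncpu: number of online CPUs (assumption: not given in the article).
    iowait_pct: hypothetical threshold above which I/O wait is called significant.
    """
    findings = []
    ratio = sample["r"] / ncpu
    if ratio > 1.0:
        # More runnable tasks than CPUs means tasks are waiting their turn.
        findings.append(f"cpu queueing: r is {ratio:.1f}x the CPU count")
    if sample["wa"] >= iowait_pct:
        findings.append("io wait is significant")
    else:
        findings.append("io wait is negligible")
    return findings


if __name__ == "__main__":
    # Values from the article; ncpu=32 is a made-up example.
    article_sample = {"r": 102, "us": 57, "sy": 20, "id": 24, "wa": 0}
    for finding in classify_vmstat(article_sample, ncpu=32):
        print(finding)
```

On the real machine, os.cpu_count() would replace the made-up core count.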

🚨 II. top: confirm who is generating the load

🧪 Command

top -H

Key output:

load average: 67+
tokio-runtime-w (791 threads)
haproxy
kernel softirq

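🔍 Analysis

1️⃣ load average 67+

👉 Linux load = runnable + uninterruptible tasks

This means:

On average about 67 tasks are running, waiting for a CPU, or in uninterruptible sleep; unless the machine has more cores than that, the CPUs and the scheduler are saturated

2️⃣ tokio-runtime-w: 791 threads

👉 This is the key finding. The name is tokio-runtime-worker, Tokio's default thread name, cut off at the kernel's 15-character thread-name limit.

Possible explanations:

A single runtime has started 700+ threads
Or several runtimes add up to that number
Or task spawning has exploded

🚨 top conclusion

❗The load comes from a tokio runtime thread storm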
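
To tell whether the 791 threads belong to one process or to several, one option is to group threads by owning PID using /proc. The sketch below is a minimal illustration, not part of the original workflow; the function name and the proc_root parameter are assumptions made for the example.

```python
import os
from collections import Counter


def threads_by_process(comm_prefix, proc_root="/proc"):
    """Count threads whose name starts with comm_prefix, grouped by owning PID.

    Reads <proc_root>/<pid>/task/<tid>/comm. proc_root is a parameter only so
    the function can be pointed at a test directory.
    """
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # no procfs available (for example, not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "net"
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "comm")) as f:
                    name = f.read().strip()
            except OSError:
                continue  # the thread exited while we were scanning
            if name.startswith(comm_prefix):
                counts[int(entry)] += 1
    return counts


if __name__ == "__main__":
    for pid, n in threads_by_process("tokio-runtime-w").most_common():
        print(pid, n)
```

A single PID holding most of the threads points at one runtime; many PIDs with a handful each points at many separate processes.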

🚨 III. perf top: see what the CPUs are actually doing

🧪 Command

perf top -a

What you saw:

swapper
tokio-runtime-w
futex
schedule
net_rx_action

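🔍 Analysis

1️⃣ futex

👉 Time in the kernel futex path, which backs user-space locks and thread parking

Typical sources:

mutex
channel
queue contention

2️⃣ schedule

👉 Threads are being switched off the CPU very often, either because they block or because they are preempted

3️⃣ net_rx_action

👉 Network receive softirq

This means:

Packet-receive pressure on the NIC
Or a flood of HAProxy traffic

🚨 perf top conclusion

❗CPU time is going to lock contention + scheduling + network softirq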
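
Seeing schedule in the profile does not by itself say whether threads are blocking (for example in futex waits) or being preempted. One way to check is the voluntary_ctxt_switches and nonvoluntary_ctxt_switches counters in /proc/<pid>/task/<tid>/status. The sketch below parses that text; the function names are illustrative and the sample numbers in the usage part are made up.

```python
def parse_ctxt_switches(status_text):
    """Return (voluntary, nonvoluntary) context-switch counters.

    status_text is the content of /proc/<pid>/task/<tid>/status.
    Missing counters are reported as 0.
    """
    counters = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key.endswith("ctxt_switches"):
            counters[key] = int(value.strip())
    return (counters.get("voluntary_ctxt_switches", 0),
            counters.get("nonvoluntary_ctxt_switches", 0))


def dominant_switch_kind(status_text):
    """Label a thread by which kind of context switch dominates."""
    voluntary, nonvoluntary = parse_ctxt_switches(status_text)
    if voluntary == 0 and nonvoluntary == 0:
        return "none"
    # Voluntary switches come from blocking (locks, I/O, parking);
    # nonvoluntary switches come from the scheduler preempting the thread.
    return "blocking" if voluntary >= nonvoluntary else "preemption"


if __name__ == "__main__":
    # Made-up sample, not taken from the article.
    sample = ("Name:\ttokio-runtime-w\n"
              "voluntary_ctxt_switches:\t9000\n"
              "nonvoluntary_ctxt_switches:\t120\n")
    print(parse_ctxt_switches(sample), dominant_switch_kind(sample))
```

Mostly voluntary switches fit the lock-contention reading; mostly nonvoluntary switches fit the too-many-runnable-threads reading.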

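🚨 IV. perf sched latency: check whether scheduling is out of control

🧪 Command

perf sched latency

This reads a trace captured beforehand with perf sched record.

Key output: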

tokio-runtime-w:
switches: 555,241
max delay: 77ms
avg delay: 0.12ms

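🔍 Analysis

1️⃣ switches: 555k+

👉 Very high, although how abnormal it is depends on how long the trace ran

This means:

The scheduler is being flooded with task wakeups

2️⃣ max delay 77ms

👉 In the worst case a runnable thread waited 77ms for a CPU, which is CPU starvation, even though the average delay is only 0.12ms

🚨 sched conclusion

❗The Linux scheduler is being overwhelmed by the tokio runtime threads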

🚨 V. perf sched timehist: see who is stalling whom

You provided:

tokio-runtime-w
sleep 6000ms
run-time 80–900ms
preempt 20–300ms

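🔍 Key reading

1️⃣ sleep 6000ms

👉 The threads spent about 6000ms blocked

Possible causes:

backlog
Or reactor delay

2️⃣ run-time is small

👉 At 80–900ms, run time is small next to the 6000ms of sleep; the threads get little sustained time on the CPU

3️⃣ preempt is large

👉 The threads spent 20–300ms runnable but pushed off the CPU, so they are preempted often

🚨 sched timehist conclusion

❗The threads are not slow; they are interrupted so often that they cannot make progress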
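
To put the three figures side by side, it helps to turn them into shares of the observed time. The sketch below assumes that sleep, run-time and preempt are per-thread totals over the same recording window; the article does not state that, and the function name is illustrative.

```python
def time_shares(run_ms, sleep_ms, preempt_ms):
    """Split a thread's observed time into run / sleep / preempt fractions.

    Assumption: the three inputs are totals for one thread over the same
    recording window.
    """
    total = run_ms + sleep_ms + preempt_ms
    if total <= 0:
        raise ValueError("total time must be positive")
    wanted_cpu = run_ms + preempt_ms  # time the thread was running or ready to run
    return {
        "run": run_ms / total,
        "sleep": sleep_ms / total,
        "preempt": preempt_ms / total,
        # Fraction of the time the thread wanted a CPU but had been preempted.
        "lost_to_preemption": preempt_ms / wanted_cpu if wanted_cpu else 0.0,
    }


if __name__ == "__main__":
    # Low and high ends of the ranges quoted in the article.
    for run, sleep, preempt in [(80, 6000, 20), (900, 6000, 300)]:
        shares = time_shares(run, sleep, preempt)
        print({name: round(value, 3) for name, value in shares.items()})
```

Under that assumption, the lost_to_preemption value is the more direct measure of how much the threads are being pushed off the CPU, while the sleep share shows how much of the window they spent blocked.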

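🚨 VI. System-level causal chain (the most important part)

Putting all the data together:

🔴 Step 1: external pressure source

This could be:

A surge of requests through HAProxy
A sudden jump in network traffic
A slow upstream

👉 Evidence:

net_rx_action
softirq

🔴 Step 2: pressure on tokio runtime polling

The task backlog grows
The reactor loop slows down

🔴 Step 3: task wakeup explosion

sched-in 8000+
switches 555k+

🔴 Step 4: futex lock contention

__x64_sys_futex
schedule

🔴 Step 5: CPU scheduling breaks down

run queue 100+
load 67+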

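🚨 VII. Final root cause (engineering conclusion)

❗Not a problem with one PID

❗Not a problem with a single service

❗Not a shortage of CPU

🚨 The actual root causes:

👉 ① tokio runtime task storm (primary cause)

Symptoms:

700+ threads
wakeup storm
backlog explosion

👉 ② futex lock contention (secondary cause)

Symptoms:

schedule + futex
context switch explosion

👉 ③ network softirq (amplifier)

Symptoms:

net_rx_action
ksoftirqd CPU usage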

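🚨 VIII. Why is it normal that you "cannot find a PID"?

Because:

❗The problem is not in one process; it is at the runtime layer

What you saw:

tokio-runtime-w[31232]
tokio-runtime-w[32512]
tokio-runtime-w[16641]

👉 These are just:

Different threads showing the same system-wide problem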

🚨 IX. Engineering-level diagnosis

❗Current system state:

The Linux scheduler is out of control at the runtime level

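🚨 X. What you must do next (critical)

✅ 1: Capture where the wakeups come from

Record for a bounded window (10 seconds here, an arbitrary choice), then print task states and wakeup events:

perf sched record -a -- sleep 10
perf sched timehist --state --wakeups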

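✅ 2: Capture call chains

Call chains have to be recorded first; timehist then prints them when they are present:

perf sched record -g -- sleep 10
perf sched timehist -g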

✅ 3: Capture kernel wakeup events

perf record -e sched:sched_wakeup,sched:sched_switch -a
perf script
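
The perf script output can be summarized by counting which task wakes which. The sketch below is a rough illustration: the regular expression assumes the common default tracepoint layout (comm, tid, [cpu], timestamp, event name, then comm=... pid=...), which can differ between perf versions, and the sample lines in the usage part are made up rather than taken from the article.

```python
import re
from collections import Counter

# Assumed layout of a sched_wakeup line in default `perf script` output:
#   <waker comm> <tid> [<cpu>] <time>: sched:sched_wakeup: comm=<wakee> pid=<pid> ...
WAKEUP_RE = re.compile(
    r"^\s*(?P<waker>.+?)\s+(?P<waker_tid>\d+)\s+\[\d+\]\s+[\d.]+:\s+"
    r"sched:sched_wakeup:\s+.*?comm=(?P<wakee>.+?)\s+pid=(?P<wakee_pid>\d+)"
)


def count_wakeups(lines):
    """Count (waker comm, wakee comm) pairs over perf script output lines."""
    pairs = Counter()
    for line in lines:
        match = WAKEUP_RE.match(line)
        if match:  # lines for other events (e.g. sched_switch) are skipped
            pairs[(match.group("waker"), match.group("wakee"))] += 1
    return pairs


if __name__ == "__main__":
    # Hypothetical sample lines for illustration only.
    sample = [
        "tokio-runtime-w 31232 [003] 100.000001: sched:sched_wakeup: "
        "comm=tokio-runtime-w pid=32512 prio=120 target_cpu=005",
        "swapper     0 [005] 100.000050: sched:sched_wakeup: "
        "comm=haproxy pid=2211 prio=120 target_cpu=005",
    ]
    for (waker, wakee), n in count_wakeups(sample).most_common():
        print(waker, "->", wakee, n)
```

In practice the input would be the lines of perf script piped into the program. A waker of swapper usually means the wakeup happened in interrupt context on an idle CPU, such as network receive.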

✅ 4: Check softirq

cat /proc/net/softnet_stat
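
The values in /proc/net/softnet_stat are hexadecimal, one row per CPU, so they are easier to read after decoding. In the sketch below the first three columns are taken as packets processed, packets dropped, and time_squeeze (the softirq ran out of budget with work left); later columns vary by kernel version and are ignored. The function name and the sample values in the usage part are illustrative.

```python
def parse_softnet_stat(text):
    """Decode the first three hex columns of /proc/net/softnet_stat.

    Returns one dict per row: processed, dropped and time_squeeze counts.
    Rows are numbered in file order, which normally follows CPU order.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        rows.append({
            "row": len(rows),
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


if __name__ == "__main__":
    # Made-up sample; on a real system read /proc/net/softnet_stat instead.
    sample = ("0000ffff 00000000 00000002 00000000\n"
              "00000010 00000003 00000000 00000000\n")
    for row in parse_softnet_stat(sample):
        print(row)
```

Counters are cumulative since boot, so two readings taken a few seconds apart are more informative than one. Growing dropped or time_squeeze values would support the softirq-pressure reading.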

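🚨 XI. One-sentence summary

❗Your problem is not that you "cannot find the PID"; it is this:

The tokio runtime is in a system-level runaway state: task storm + futex contention + scheduler overload

If your next step is to paste the output of:

perf sched timehist -g

or:

perf script

I can help you narrow it down to:

👉 which HTTP request / socket / handler / code path is overwhelming tokio

That step takes the investigation to code-level root cause analysis.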