141. "PLEG is not healthy" issue
- A node in your environment is responding erratically, flapping between "active" and "unavailable" states and reporting a NotReady status with a "PLEG is not healthy" message.
- The kubelet logs contain many messages like this:
E0830 10:36:49.162629 3137 kubelet.go:2040] "Skipping pod synchronization" err="PLEG is not healthy: pleg was last seen active 3m3.978055897s ago; threshold is 3m0s"
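When scanning large kubelet logs, it can help to extract how far past the threshold each PLEG report was. The following is a minimal Python sketch, assuming the messages look like the sample line quoted above; the regular expression and the helper names (`parse_go_duration`, `pleg_overrun_seconds`) are illustrative and not part of any Kubernetes tooling.

```python
import re

# Pattern derived from the single sample message in this article; this is an
# assumption about the log wording, not a documented kubelet log format.
PLEG_RE = re.compile(
    r"pleg was last seen active (?P<elapsed>\S+) ago; "
    r"threshold is (?P<threshold>[^\s\"]+)"
)

# Go-style duration components such as "3m3.978055897s" or "3m0s".
# "ms" is listed before "m" so that milliseconds are not read as minutes.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(h|ms|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_go_duration(text: str) -> float:
    """Convert a Go duration string (h, m, s, ms units only) to seconds."""
    parts = _DURATION_PART.findall(text)
    # Reject strings that contain anything besides the recognised components.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def pleg_overrun_seconds(line: str):
    """Return seconds past the threshold, or None if the line does not match."""
    match = PLEG_RE.search(line)
    if match is None:
        return None
    elapsed = parse_go_duration(match["elapsed"])
    threshold = parse_go_duration(match["threshold"])
    return elapsed - threshold
```

Feeding each log line to `pleg_overrun_seconds` and keeping the non-`None` results gives a rough picture of how long the node stayed stuck each time.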
A quick fix that may make the node available again is to restart the affected services: the kubelet, the container runtime, or even the whole node.
However, this is only a temporary solution: the root of the issue is still present and can cause the error again in the future. Because the "PLEG is not healthy" issue can have multiple origins, it needs a root cause analysis to understand what exactly triggered it in the first place.
As a lack of resources is the most common trigger, there are a few recommended actions that can help to avoid this problem:
- Set up a memory reservation for the kubelet and the operating system at the cluster level. The steps are described in the SUSE KB article on system-reserved and kube-reserved resource reservations.
- Make sure there are enough worker nodes to host all application workloads.
- Optional but recommended: set resource requests and limits on workloads.
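To get a feel for how reservations shrink the memory available to pods, the sketch below applies the relation given in the Kubernetes documentation on reserving resources for system daemons: allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold. The function names and every numeric value are illustrative assumptions, not sizing recommendations.

```python
def parse_mem(quantity: str) -> int:
    """Parse a memory quantity such as '16Gi' or '500Mi' into bytes.

    Only the Ki/Mi/Gi suffixes and plain byte counts are handled here.
    """
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def allocatable_memory(capacity: str, kube_reserved: str,
                       system_reserved: str, eviction_hard: str) -> int:
    """Return the bytes left for pods after all reservations are subtracted."""
    result = (
        parse_mem(capacity)
        - parse_mem(kube_reserved)
        - parse_mem(system_reserved)
        - parse_mem(eviction_hard)
    )
    if result <= 0:
        raise ValueError("reservations exceed node capacity")
    return result


if __name__ == "__main__":
    # Hypothetical 16Gi node with 1Gi reserved for each of the kubelet and
    # the operating system, plus a 100Mi hard eviction threshold.
    free = allocatable_memory("16Gi", "1Gi", "1Gi", "100Mi")
    print(f"allocatable: {free / 1024**3:.2f} GiB")
```

Comparing this figure with the sum of the memory requests of the workloads on the node indicates whether more workers are needed.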
A "PLEG is not healthy" error in Kubernetes indicates that the kubelet's Pod Lifecycle Event Generator (PLEG) on a node cannot get a timely response from the container runtime (such as containerd or Docker), which disrupts pod lifecycle management on that node.
There may be multiple reasons behind an unhealthy PLEG error, but the most common are:
- High system load: excessive CPU, memory, or disk I/O usage on the node makes the container runtime unresponsive.
- Security software interference: host-based firewalls or IDS/IPS might block communication between the kubelet and the container runtime socket.
- Bugs in certain older versions of the container runtime or the kubelet (less common).
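Since high system load is the most common cause, a first check on the affected node can be as simple as comparing the load average with the number of CPUs. The following is a rough sketch using only the Python standard library; the threshold of 1.0 per CPU and the function names are assumptions chosen for illustration, and a high value only hints at CPU or I/O pressure rather than proving it.

```python
import os


def load_per_cpu(load1: float, cpus: int) -> float:
    """Normalise the 1-minute load average by the number of CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return load1 / cpus


def node_looks_overloaded(load1: float, cpus: int, threshold: float = 1.0) -> bool:
    """Return True when the per-CPU load exceeds the (assumed) threshold."""
    return load_per_cpu(load1, cpus) > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load1, _load5, _load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    state = "overloaded" if node_looks_overloaded(load1, cpus) else "ok"
    print(f"load1={load1:.2f} cpus={cpus} -> {state}")
```

Running this on the node while the error is occurring, and again when the node is healthy, gives a simple baseline for comparison.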
- Kubernetes documentation on resource reservations for system daemons
- SUSE KB article on system-reserved and kube-reserved resource reservations
This applies to a Kubernetes cluster running any distribution (e.g., RKE2, k3s).
Visit the blog of the Rancher-K8S solutions author and enterprise partner:
https://blog.csdn.net/lidw2009
