139. Restore fails because the Rancher directories are unmounted
An attempt to restore an RKE2 cluster fails because the rke2-killall.sh script unmounts the Rancher directories.
After the restore process is initiated, the job expects "/var/lib/rancher" to be mounted, but rke2-killall.sh explicitly unmounts it, because the unmount command is hardcoded in the script itself in Kubernetes v1.27.12.
The job then tries to run "[Applyinator] Command touch [/var/lib/rancher/rke2/server/db/etcd/tombstone]", which fails.
This leaves the cluster in a broken state, and even a cluster reset will not help in this case.
Some error messages (symptoms) can be seen in the logs, such as the following:
level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3 2>/dev/null] finished with err: <nil> and exit code: 127"
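To confirm that a node is in this state, one option is to check whether the directory is still listed as a mount point. The following is a minimal sketch, assuming a Linux node where "/var/lib/rancher" (or a subdirectory of it) is a separate filesystem. The helper name is_mount_point is hypothetical and is not part of RKE2 or Rancher.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given path appears as a
# mount point in /proc/mounts, fail (exit 1) otherwise.
# Assumption: the path contains no spaces (/proc/mounts escapes them).
is_mount_point() {
    awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' /proc/mounts
}

# Example use on a control plane node (the path is the one named in this article):
#   is_mount_point /var/lib/rancher || echo "/var/lib/rancher is not mounted"
```

If the check fails right after a failed restore, the unmount described above is the likely cause.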
It is strongly recommended to upgrade to at least Kubernetes v1.27.16, as the issue has been addressed starting from that version.
Alternatively, if you are still on v1.27.12, apply the following workaround steps in sequence:
When the restore first fails:
1. Log on to each control plane node and comment out the following single line in the rke2-killall script.
The script should be under /usr/local/bin.
#do_unmount_and_remove '/var/lib/rancher/rke2'
2. Execute "mount -a" on each control plane node (as the mount was removed by the script).
3. Execute "systemctl restart rancher-system-agent" on each node.
This causes the agent to fetch the machine plan and use the script already present on the node, so the restore can run or proceed successfully.
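The three steps above can be sketched as a small shell script for one node. This is an illustration, not an official procedure. It assumes the script is installed as /usr/local/bin/rke2-killall.sh and that the line appears exactly as shown in step 1. The function names comment_out_unmount and apply_workaround are hypothetical.

```shell
#!/bin/sh
# Sketch of the workaround for a single node; run as root.
# Assumption: the killall script is /usr/local/bin/rke2-killall.sh.

# Step 1: comment out the hardcoded unmount line in the given script.
# A backup is written next to it with a .bak suffix. A line that is
# already commented out does not match and is left unchanged.
comment_out_unmount() {
    script="$1"
    sed -i.bak "s|^\([[:space:]]*\)\(do_unmount_and_remove '/var/lib/rancher/rke2'\)|\1#\2|" "$script"
}

# Steps 1 to 3 in order; stops at the first command that fails.
apply_workaround() {
    comment_out_unmount "${1:-/usr/local/bin/rke2-killall.sh}" &&
    mount -a &&
    systemctl restart rancher-system-agent
}

# Example use on a control plane node:
#   apply_workaround
```

Keeping the edit in its own function makes it possible to try it on a copy of the script first and compare the result with the .bak file before touching the real one.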
The rke2-killall.sh script unmounts the Rancher directories.
https://github.com/harvester/harvester/issues/4695
https://github.com/rancher/rancher/issues/40624
Rancher v2.8.5 and earlier
RKE2 v1.27 and earlier
