Kubernetes – Troubleshooting Crash Loop Back-Off After Server Restart

kubeadm, kubernetes

I have installed Kubernetes on a single-node server using kubeadm. I have found that by performing the installation in a specific order, everything comes up and runs fine: the main kube-system components (API server, proxy, scheduler, etcd, etc.) are spun up and running first (except coredns), after which the Cilium CNI plugin is installed using Helm for networking, at which point coredns starts working too.
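For reference, a rough sketch of that order (the CIDR, taint name, and Helm flags below are placeholders, not my exact commands):

```bash
# bring up the control plane first; coredns stays Pending until a CNI is installed
sudo kubeadm init --pod-network-cidr=10.217.0.0/16
export KUBECONFIG=/etc/kubernetes/admin.conf

# single-node cluster, so allow workloads on the control-plane node
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

# install the Cilium CNI plugin with Helm; coredns becomes Ready afterwards
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --namespace kube-system
```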

After this installation I could then install and use other services like NVIDIA's GPU operator and Kubeflow successfully. However, after a reboot I find that the different components enter a crash loop back-off (CrashLoopBackOff) state.

I think this has something to do with the order in which the different resources are initialized and the length of the various timeouts. Is there a way to set an order for the different components to start in on boot? Or to configure an appropriate default timeout or crash behavior for the pods?

Likely there is a domino effect going on here, where one component crashing causes another to crash, leading to ever-increasing back-off times and ultimately an unresponsive cluster.

For additional information, I have persisted the iptables rules so they are applied on boot. Also, typically all the main components enter the running state except the kube-scheduler, which stays in crash loop back-off; then the other components start to crash as well in what appears to be a domino effect.
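The persistence itself is just the stock mechanism, roughly like this (shown for a Debian/Ubuntu-style host; other distros differ):

```bash
# save the current rules so netfilter-persistent restores them on boot
sudo apt-get install -y iptables-persistent
sudo iptables-save  | sudo tee /etc/iptables/rules.v4 >/dev/null
sudo ip6tables-save | sudo tee /etc/iptables/rules.v6 >/dev/null
```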

So far, I've mainly tried updating the manifest files under /etc/kubernetes/manifests/ to use various timeouts and failureThresholds for the Kubernetes system pods; however, this didn't resolve the issue. I can reproduce the post-reboot behavior by stopping the kubelet service, destroying all running pods/containers, and then restarting kubelet.
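For example, a tweak of this kind in the kube-scheduler static pod manifest (the numbers are purely illustrative values I experimented with, not recommendations):

```yaml
# excerpt from /etc/kubernetes/manifests/kube-scheduler.yaml
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 60   # give the scheduler longer to come up after boot
      timeoutSeconds: 30
      periodSeconds: 10
      failureThreshold: 24      # tolerate a longer outage before kubelet restarts it
```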

Best Answer

It turns out that this problem is very similar to the one described in this GitHub issue. Essentially, without SystemdCgroup = true set for my containerd CRI runtime (which was the NVIDIA container runtime rather than the default runc), the pods would keep inexplicably crashing. This happens because containerd's default cgroup driver is cgroupfs, which isn't interoperable with systemd, so running both violates the single-writer rule of cgroups. See the containerd documentation for more detail; it also includes a full specification of the CRI plugin configuration here.
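As a sketch, the relevant part of /etc/containerd/config.toml ends up looking something like this (assuming a version 2 config and the "nvidia" runtime name that the GPU operator / nvidia-container-toolkit typically registers; the binary path and names may differ on your system):

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
      SystemdCgroup = true   # the missing setting; delegates cgroup management to systemd

  # if a runc runtime entry is also present, keep its cgroup driver consistent
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
```

After editing, restart containerd (sudo systemctl restart containerd). The kubelet side should already be using the systemd cgroup driver (cgroupDriver: systemd is the kubeadm default on recent versions), which is exactly why a cgroupfs-driven runtime ends up fighting it.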

Other relevant containerd resources are:
