Today I ran into yet another error in my Kubernetes cluster that was not immediately obvious. I run an OKD (OpenShift) 4.5 cluster with 3 masters and 2 workers.
After a number of nodes became unavailable, I rebooted the entire cluster. Unfortunately, the reboot took a long time on some nodes, so I hard-reset those individual nodes without further ado.
After all nodes had restarted and reported “Ready” status, one of the nodes could not start any pods.
So it was time to troubleshoot on the node and in the cluster. First, I check which pods cannot be started:
oc get pod --all-namespaces -o wide
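On a cluster with many pods this list gets long. As a minimal sketch, the output can be tallied per node with a small helper. The function name `stuck_pods_by_node` is made up for illustration, and it assumes the `-o wide --no-headers` column layout in which STATUS is column 4 and NODE is column 8.

```shell
# Hypothetical helper: count pods that are neither Running nor Completed, per node.
# Reads `oc get pod --all-namespaces -o wide --no-headers` output on stdin.
# Assumes STATUS is column 4 and NODE is column 8 in that layout.
stuck_pods_by_node() {
    awk '$4 != "Running" && $4 != "Completed" { n[$8]++ }
         END { for (node in n) print n[node], node }' | sort -rn
}

# Usage on a live cluster:
#   oc get pod --all-namespaces -o wide --no-headers | stuck_pods_by_node
```

Once a suspicious node stands out, `oc get pod --all-namespaces -o wide --field-selector spec.nodeName=<node>` narrows the listing to that host.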
I see that all of the failing pods are on the same host. So I take a closer look at one of them:
oc describe pod -n <pod-namespace> <podname>
And I see the following error:
Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "XXXX": layer not known
That doesn’t get me much further. So I SSH into the node and check whether all services are running:
systemctl
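A plain `systemctl` prints every unit, so the failed one is easy to miss. As a rough sketch, the listing can be limited to failed units and reduced to their names. The helper name `failed_unit_names` is made up for illustration, and it assumes the `--plain --no-legend` layout in which the unit name is column 1 and the ACTIVE state is column 3.

```shell
# Hypothetical helper: print only the names of failed units.
# Reads `systemctl list-units --plain --no-legend` output on stdin.
# Assumes UNIT is column 1 and ACTIVE is column 3 in that layout.
failed_unit_names() {
    awk '$3 == "failed" { print $1 }'
}

# Usage on the node:
#   systemctl list-units --state=failed --plain --no-legend | failed_unit_names
```

For a quick look without any filtering, `systemctl --failed` shows the same units in the usual table form.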
I see that the service “nodeip-configuration.service” has failed. So I take a closer look:
systemctl status nodeip-configuration.service
And here the following error message appears:
podman[11387]: Error: error creating container storage: layer not known
Now it is at least clearer which “layer” is meant: it must be a container storage layer. So I asked Google again and found this entry:
https://bugzilla.redhat.com/show_bug.cgi?id=1857224
OK. Apparently the node took the hard reset badly, and the contents of “/var/lib/containers” are no longer in a consistent state. So I follow the recommendation:
systemctl stop kubelet
systemctl stop crio
rm -rf /var/lib/containers/
systemctl start crio
systemctl start kubelet
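Because the `rm -rf` step throws away all locally stored images and containers on the node (they have to be pulled again afterwards), it can help to rehearse the sequence first. The following is only a sketch of the same five steps behind a dry-run guard. The names `run`, `reset_container_storage`, and the `DRY_RUN` variable are my own and not part of the Bugzilla recommendation.

```shell
# Hypothetical wrapper: print each command unless DRY_RUN is explicitly set to 0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# The same steps as above, stopping at the first command that fails.
reset_container_storage() {
    run systemctl stop kubelet &&
    run systemctl stop crio &&
    run rm -rf /var/lib/containers/ &&
    run systemctl start crio &&
    run systemctl start kubelet
}

# Rehearse (prints the commands only):
#   reset_container_storage
# Execute for real, as root on the affected node:
#   DRY_RUN=0 reset_container_storage
```

The default is the dry run, so nothing is deleted unless `DRY_RUN=0` is set on purpose.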
And voila, the pods on the node can start again!

