I'm facing a problem with a Kubernetes deployment. This is probably going to be a low-quality question, I'm sorry about that: I'm new to server management and I'm not able to contact the person who set up the servers in the first place, so I'm having a hard time.
My configuration (in the testing environment) consists of one master node and two worker nodes, each hosting one replica of a pod (two pods in total) running a Docker image with a WildFly server.
I was messing around with the testing environment because we used to experience a problem: sometimes after deployment (randomly a few minutes or a few hours later) the pods would fail (liveness probe timeout) and go into CrashLoopBackOff. I added a line in the code to log an Info message every time the liveness probe was called, to see if it was called at all, and I re-deployed (deploy configuration unchanged). Since the problem presents itself randomly, I spent the afternoon re-deploying every hour or so (without changing anything) and monitoring the logs. No luck.
So, here's the part where something went wrong:
After deploying for the n-th time, I started seeing FailedScheduling events. Looking at the pod status, I can see that one of the two pods from the old ReplicaSet is stuck in Terminating, and the pod that is supposed to take its place is stuck in Pending. I can work around the problem by running kubectl delete pod --force --grace-period=0 [pod name]
, but this happens again every time I deploy, so of course it's not ideal. I haven't yet tried deploying to the production environment.
Here are the logs:
Pod status: https://pastebin.com/MHuWV2dM
Events: https://pastebin.com/8hvpg9n5
Describe pods: https://pastebin.com/QFmkUQB3
Thank you in advance for any help you can provide.
Best Answer
This is caused by the combination of limited CPU resources and your liveness probe configuration.
The new pod fails to schedule because there are not enough free CPUs on the nodes, so it is stuck in the Pending state:
Liveness probe:
which is set up as an http-get check with only a 1s timeout, starting 45s after deployment. Your pods fail that check almost right away and are eventually terminated.
Pod 1:
Pod 2:
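For reference, a liveness probe with a more forgiving timeout might look like this in the container spec. This is a sketch, not your actual configuration: the path, port, and numbers are assumptions you should adjust to match your deployment.

```yaml
livenessProbe:
  httpGet:
    path: /health      # verify this; some apps expose /healthz instead
    port: 8080
  initialDelaySeconds: 45
  timeoutSeconds: 5    # raised from 1s so a busy WildFly isn't killed prematurely
  periodSeconds: 10
  failureThreshold: 3  # allow a few consecutive failures before restarting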
1. Make sure the URL for the liveness probe health check is correct. Use curl from the nodes to see if it's accessible. Some applications use the suffix "healthz" instead of "health" for the health check URL. If the health check still times out on the correct URL, increase the 1s timeout.
2. Make sure you don't run out of available resources. Run
kubectl top nodes
to see resource usage.
3. Check the kubelet logs. See How to view kubelet logs, or the Amazon EKS case.
4. Look into the Kubernetes termination lifecycle, which can affect how long it takes for pods to be terminated. Read this guide.
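The checks above can be sketched as follows. Pod/node names, the port, and the health path are placeholders; substitute your own values, and note that `kubectl top` requires metrics-server to be installed.

```shell
# 1. Test the health endpoint directly from a node (path/port are assumptions)
curl -v --max-time 1 http://<pod-ip>:8080/health

# 2. Check CPU/memory headroom on the nodes (requires metrics-server)
kubectl top nodes

# 3. Inspect kubelet logs on a systemd-based node
journalctl -u kubelet --since "1 hour ago"

# 4. See why the Pending pod cannot be scheduled (check the Events section)
kubectl describe pod <pod-name>
```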
UPDATE:
According to the Kubernetes documentation, a force deletion does not wait for confirmation from the kubelet that the pod's processes have been terminated.
One or more nodes might have been affected by this and need to be restarted. Running
kubectl delete pod --force --grace-period=0 [pod name]
might have caused that.
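If a node does need a restart, a typical sequence is to drain it first so its pods are rescheduled cleanly, reboot, then re-enable scheduling. The node name below is a placeholder.

```shell
# Evict pods from the node (DaemonSet pods are left in place)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ...reboot the node, then allow scheduling on it again:
kubectl uncordon <node-name>
```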