I'm facing a problem with a Kubernetes deployment. This is probably going to be a low-quality question, I'm sorry about that: I'm new to server management and I'm not able to contact the person who set up the servers in the first place, so I'm having a hard time.
My configuration (in the testing environment) consists of one master node and two worker nodes, each hosting one replica of a pod (two pods in total) running a Docker image with a WildFly server.
I was messing around with the testing environment because we used to experience a problem: sometimes after deployment (randomly a few minutes or a few hours later) the pods would fail (liveness probe timeout) and go into CrashLoopBackOff. I added a line in the code to log an Info message every time the liveness probe was called, to see if it was called at all, and I re-deployed (deploy configuration unchanged). Since the problem presents itself randomly, I spent the afternoon re-deploying every hour or so (without changing anything) and monitoring the logs. No luck.
So, here's the part where something went wrong:
After deploying for the n-th time, I started seeing FailedScheduling events. Looking at the pod status, I can see that one of the two pods from the old ReplicaSet is stuck in Terminating, and the pod that is supposed to take its place is stuck in Pending. I can work around the problem by running kubectl delete pod --force --grace-period=0 [pod name]
, but this happens again every time I deploy, so of course it's not ideal. I haven't yet tried deploying to the production environment.
Here are the logs:
Pod status: https://pastebin.com/MHuWV2dM
Events: https://pastebin.com/8hvpg9n5
Describe pods: https://pastebin.com/QFmkUQB3
Thank you in advance for any help you can provide.
Best Answer
This is caused by the combination of limited CPU resources and your liveness probe configuration.
The new pod fails to schedule because there are not enough free CPUs on the nodes, so it is stuck in the Pending state:
Liveness probe:
which is set up as an http-get check with only a 1s timeout, starting 45s after deployment. Your pods fail that check almost right away and are eventually terminated.
Pod 1:
Pod 2:
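For reference, a liveness probe with a more forgiving timeout might look like this in the container spec. This is a sketch, not your actual configuration: the path, port, and numbers are assumptions you should adjust to match your deployment.

```yaml
livenessProbe:
  httpGet:
    path: /health      # verify this; some apps expose /healthz instead
    port: 8080
  initialDelaySeconds: 45
  timeoutSeconds: 5    # raised from 1s so a busy WildFly isn't killed prematurely
  periodSeconds: 10
  failureThreshold: 3  # allow a few consecutive failures before restarting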
1. Make sure the URL for the liveness probe health check is correct. Use curl from the nodes to see if it's accessible. Some applications use the suffix "healthz" instead of "health" for the health check URL. If the health check still times out on the correct URL, increase the 1s timeout.
2. Make sure you don't run out of available resources. Run
kubectl top nodes
to see resource usage.
3. Check the kubelet logs. See How to view kubelet logs, or the Amazon EKS case.
4. Look into the Kubernetes termination lifecycle, which can affect how long it takes for pods to be terminated. Read this guide.
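The checks above can be sketched as follows. Pod/node names, the port, and the health path are placeholders; substitute your own values, and note that `kubectl top` requires metrics-server to be installed.

```shell
# 1. Test the health endpoint directly from a node (path/port are assumptions)
curl -v --max-time 1 http://<pod-ip>:8080/health

# 2. Check CPU/memory headroom on the nodes (requires metrics-server)
kubectl top nodes

# 3. Inspect kubelet logs on a systemd-based node
journalctl -u kubelet --since "1 hour ago"

# 4. See why the Pending pod cannot be scheduled (check the Events section)
kubectl describe pod <pod-name>
```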
UPDATE:
According to the Kubernetes documentation, a force deletion does not wait for confirmation from the kubelet that the pod's processes have been terminated.
One or more nodes might have been affected by this and need to be restarted. Running
kubectl delete pod --force --grace-period=0 [pod name]
might have caused that.
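If a node does need a restart, a typical sequence is to drain it first so its pods are rescheduled cleanly, reboot, then re-enable scheduling. The node name below is a placeholder.

```shell
# Evict pods from the node (DaemonSet pods are left in place)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ...reboot the node, then allow scheduling on it again:
kubectl uncordon <node-name>
```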