Kubernetes services timing out when accessing pods on different workers

amazon-web-services, kubeadm, kubernetes

I'm trying to stand up a pair of kubernetes workers on EC2 instances, and running into a problem where the service does not appear to "see" all of the pods that it should be able to see.

My exact environment is a pair of AWS Snowballs, Red and Blue, and my cluster looks like control, worker-red, and worker-blue [1]. I'm deploying a dummy python server that waits for a GET on port 8080, and replies with the local hostname. I've set it up with enough replicas that both worker-red and worker-blue have at least one pod each. Finally, I've created a service, the spec of which looks like

spec:
    type: NodePort
    selector:
        app: hello-server
    ports:
        - port: 8080
          targetPort: 8080
          nodePort: 30080
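
The server itself is nothing special; a minimal sketch of roughly what it does (not the exact code, just the idea: answer a GET on 8080 with the local hostname) would be:

# Minimal stand-in for the dummy server: replies to GET on :8080 with the pod's hostname.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Inside a pod, gethostname() returns the pod name, so the reply identifies the replica.
        body = f"greetings from {socket.gethostname()}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HelloHandler).serve_forever()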

I can now check that my pods are up

kubectl get pods -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
hello-world-deployment-587468bdb7-hf4dq   1/1     Running   0          27m   192.168.1.116   worker.red    <none>           <none>
hello-world-deployment-587468bdb7-mclhm   1/1     Running   0          27m   192.168.1.126   worker.blue   <none>           <none>

Now I can try to curl them

curl worker-red:30080
greetings from hello-world-deployment-587468bdb7-hf4dq
curl worker-blue:30080
greetings from hello-world-deployment-587468bdb7-mclhm

That's what happens about half the time. The other half of the time, the curl fails with a timeout error. Specifically: curling worker-red will ONLY yield a response from hf4dq, and curling worker-blue will ONLY yield a response from mclhm. If I cordon and drain worker-blue so both of my pods are running on worker-red, there is never a timeout, and both pods will respond.

It seems like the NodePort service is not reaching pods that are not on the host I am curling. As I understand them, this isn't how services are supposed to work. What am I missing?

[1] If I set things up so that I have two workers both on Red, the same problem I'm describing happens, but the two-Snowball setup is my primary use case, so it's the one I'll concentrate on.

Best Answer

It is hard to say exactly what is wrong here, but there are some steps you can take to troubleshoot the issue:

  1. Debug the Pods; in particular, check whether there is anything suspicious in the logs:
  • kubectl logs ${POD_NAME} ${CONTAINER_NAME}

  • kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}

  2. Debug the Service, for example by checking the points below (a few example commands follow this list):
  • Does the Service exist?

  • Does the Service work by DNS name?

  • Does the Service work by IP?

  • Is the Service defined correctly?

  • Does the Service have any Endpoints?

  • Is the kube-proxy working?
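
For example, those checks map roughly onto commands like the following. I'm assuming the Service is named hello-server to match your selector, since its actual name isn't shown, so substitute your own names where needed:

# Does the Service exist, and is it defined correctly (selector, ports, nodePort)?
kubectl get service hello-server -o yaml

# Does the Service have any Endpoints? You should see one Pod IP per replica,
# including the Pod running on the other worker.
kubectl get endpoints hello-server

# Does the Service work by DNS name? (run from inside a Pod whose image has nslookup)
kubectl exec -it hello-world-deployment-587468bdb7-hf4dq -- nslookup hello-server

# Does the Service work by cluster IP? (take the IP from `kubectl get service`)
curl http://<cluster-ip>:8080/

# Is kube-proxy running on every node, and is there anything odd in its logs?
# (on a kubeadm cluster it runs as a DaemonSet in the kube-system namespace)
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
kubectl logs -n kube-system <kube-proxy-pod-name>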

Going through those steps will help you find the cause of your issue and also better understand the mechanics behind Services.
