Kubernetes connection refused during deployment

google-cloud-platform, google-kubernetes-engine, kubernetes

I'm trying to achieve a zero-downtime deployment using Kubernetes, but during my tests the service doesn't load balance well and drops connections while the rollout is in progress.

My kubernetes manifest is:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapp-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: myapp
        version: "0.2"
    spec:
      containers:
      - name: myapp-container
        image: gcr.io/google-samples/hello-app:1.0
        imagePullPolicy: Always
        ports:
          - containerPort: 8080
            protocol: TCP
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1

---

apiVersion: v1
kind: Service
metadata:
  name: myapp-lb
  labels:
    app: myapp
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: myapp

If I loop requests against the service's external IP, which is:

$ kubectl get services
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)        AGE
kubernetes   ClusterIP      10.35.240.1    <none>           443/TCP        1h
myapp-lb     LoadBalancer   10.35.252.91   35.205.100.174   80:30549/TCP   22m

using the bash script:

while true
    do
        curl 35.205.100.174
        sleep 0.2
    done

I receive some connection refused errors during the deployment:

curl: (7) Failed to connect to 35.205.100.174 port 80: Connection refused

The application is the default hello-app sample provided by Google Cloud Platform, listening on port 8080.
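A rolling update can be triggered, for example, by bumping the image tag and watching the rollout status (hello-app also ships a 2.0 tag):

# trigger a rolling update by switching to another published hello-app tag
kubectl set image deployment/myapp-deployment myapp-container=gcr.io/google-samples/hello-app:2.0
# wait for the rollout to finish while the curl loop runs in another terminal
kubectl rollout status deployment/myapp-deployment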

Cluster information:

  • Kubernetes version: 1.8.8
  • Google cloud platform
  • Machine type: g1-small

Best Answer

I ran into the same problem and dug a bit deeper into the GKE network setup for this kind of load balancing.

My suspicion is that the iptables rules on the node that runs the container are updated too early. I increased the timeouts a bit in your example to better pinpoint the stage at which the requests fail.

My changes on your deployment:

spec:
...
  replicas: 1         # easier to track the state of the system
  minReadySeconds: 30 # give the load-balancer time to pick up the new node
...
  template:
    spec:
      containers:
      - name: myapp-container
        command: ["sh", "-c", "./hello-app"] # run via a shell so SIGTERM is not forwarded and the app keeps serving requests for the 30s grace period

Everything works well until the old pod switches from state Running to Terminating. I tested with a kubectl port-forward on the terminating pod and my requests were served without timeouts.
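Roughly, that port-forward check looks like this (the pod name below is a placeholder for whichever replica is currently terminating):

# forward a local port straight to the terminating pod, bypassing the service and the load balancer
kubectl port-forward <terminating-pod-name> 8080:8080

# in another terminal: requests against the pod itself still succeed
curl localhost:8080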

The following things happen during the change from Running to Terminating (each can be observed with the commands sketched after this list):

  • Pod-IP is removed from the service
  • Health check on the node returns 503 with "localEndpoints": 0
  • iptables rules are changed on that node and traffic for this service is dropped (--comment "default/myapp-lb: has no local endpoints" -j KUBE-MARK-DROP)
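A rough sketch of how to observe each step while the pod is terminating, assuming the resource names from the question:

# 1. watch the pod IP disappear from the service's endpoints
kubectl get endpoints myapp-lb -w

# 2. query the kube-proxy health check that the load balancer uses for
#    externalTrafficPolicy: Local; the port is the service's healthCheckNodePort
#    (visible in kubectl get svc myapp-lb -o yaml) and the body reports "localEndpoints"
curl -i http://<node-ip>:<healthCheckNodePort>/

# 3. on the node itself: inspect the rules kube-proxy programs for the service
sudo iptables-save | grep myapp-lb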

The load balancer's default health check runs every 2 seconds and needs 5 failures before it removes the node, which means packets are dropped for at least 10 seconds. After I changed the interval to 1 second and the unhealthy threshold to 1 failure, the number of dropped packets decreased.
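Those health check settings live on the GCP side, not in Kubernetes. Assuming the cluster uses the legacy HTTP health checks that GKE creates for this kind of network load balancer (the check name is auto-generated, so look it up first), the change looks roughly like:

# find the health check that belongs to the service's load balancer
gcloud compute http-health-checks list

# tighten the probing: check every second and remove the node after a single failure
gcloud compute http-health-checks update <health-check-name> \
    --check-interval 1s --unhealthy-threshold 1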

If you are not interested in the source IP of the client, you could remove the line:

externalTrafficPolicy: Local

from your service definition, and the deployments will complete without connection errors.
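For completeness, the service would then look like this; without that line it falls back to the default externalTrafficPolicy: Cluster, which routes traffic through any node (at the cost of losing the client source IP):

apiVersion: v1
kind: Service
metadata:
  name: myapp-lb
  labels:
    app: myapp
spec:
  type: LoadBalancer   # defaults to externalTrafficPolicy: Cluster
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: myapp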

Tested on a GKE cluster with 4 nodes running version v1.9.7-gke.1.
