I'm trying to achieve a zero-downtime deployment using Kubernetes, but during my tests the service doesn't load-balance well: some requests fail while a deployment is rolling out.
My kubernetes manifest is:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapp-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: myapp
        version: "0.2"
    spec:
      containers:
      - name: myapp-container
        image: gcr.io/google-samples/hello-app:1.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-lb
  labels:
    app: myapp
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: myapp
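As a sanity check on the rollout settings above: with replicas: 3, maxUnavailable: 0 and maxSurge: 1, Kubernetes should keep at least 3 pods available and run at most 4 pods at any point during the update, so in theory no request should ever find zero ready pods. A quick sketch of that arithmetic:

```shell
# Pod-count bounds during a RollingUpdate, using the values from the manifest above
replicas=3
max_unavailable=0
max_surge=1
echo "min available pods during rollout: $(( replicas - max_unavailable ))"  # 3
echo "max total pods during rollout:     $(( replicas + max_surge ))"        # 4
```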
If I poll the service on its external IP in a loop, say:
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.35.240.1 <none> 443/TCP 1h
myapp-lb LoadBalancer 10.35.252.91 35.205.100.174 80:30549/TCP 22m
using the bash script:
while true
do
  curl 35.205.100.174
  sleep 0.2
done
I receive some connection-refused errors during the deployment:
curl: (7) Failed to connect to 35.205.100.174 port 80: Connection refused
The application is the default hello-app sample provided by Google Cloud Platform, listening on port 8080.
Cluster information:
- Kubernetes version: 1.8.8
- Google cloud platform
- Machine type: g1-small
Best Answer
I ran into the same problem and tried to dig a bit deeper into the GKE network setup for this kind of load balancing.

My suspicion is that the iptables rules on the node that runs the container are updated too early. I increased the timeouts a bit in your example to better pinpoint the stage at which the requests start failing.
My changes to your deployment:
Everything works well until the old pod switches from state Running to Terminating. I tested with a kubectl port-forward on the terminating pod, and my requests were served without timeouts.

The following things happen during the change from Running to Terminating:

- The pod is removed from the service's endpoints, so kube-proxy on that node sees "localEndpoints": 0.
- Because of externalTrafficPolicy: Local, an iptables rule is installed on that node which drops incoming packets: --comment "default/myapp-lb: has no local endpoints" -j KUBE-MARK-DROP
The load balancer's default health check probes every 2 seconds and requires 5 failures before it removes the node. This means packets are dropped for at least 10 seconds. After I changed the interval to 1 second and the threshold to 1 failure, the number of dropped packets decreased.
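The blackout window is roughly the health-check interval multiplied by the failure threshold (treat the default numbers as the ones observed above; they may differ between GCP load balancer types):

```shell
# Worst-case window during which a node with no local endpoints keeps
# receiving (and dropping) traffic before the LB marks it unhealthy
interval_s=2   # health-check interval in seconds (default observed above)
failures=5     # consecutive failures before the node is removed
echo "default window: $(( interval_s * failures ))s"   # 10s

interval_s=1
failures=1
echo "tuned window:   $(( interval_s * failures ))s"   # 1s
```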
If you are not interested in the source IP of the client, you could remove the line

externalTrafficPolicy: Local

from your service definition, and the deployments complete without connection timeouts.
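For reference, a sketch of the Service spec with that line removed: omitting externalTrafficPolicy makes it fall back to the default Cluster policy, where a node without a local endpoint forwards traffic to a pod on another node instead of dropping it, at the cost of losing the client source IP:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-lb
  labels:
    app: myapp
spec:
  type: LoadBalancer
  # externalTrafficPolicy omitted -> defaults to "Cluster":
  # nodes without a local endpoint forward traffic rather than drop it,
  # but the client source IP is not preserved
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: myapp
```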
Tested on a GKE cluster with 4 nodes, version v1.9.7-gke.1.