Intermittent DNS failures in Google Container Engine

google-compute-engine · google-kubernetes-engine · kubernetes

[Question rewritten with details of findings.]

I am running a Google Container Engine cluster with about 100 containers that perform roughly 100,000 API calls a day. Some of the pods started seeing about a 50% failure rate in DNS resolution. I dug into this and it only happens for pods on nodes that are running kube-dns. I also noticed that it only happens just before a node in the cluster gets shut down for being out of memory.

The background Resque jobs connect to Google APIs and then upload data to S3. The jobs that fail do so with "Temporary failure in name resolution." This happens for both "accounts.google.com" and "s3.amazonaws.com".

When I log into the node and try to resolve these (or other hosts) with host, nslookup, or dig, it works just fine. When I open a Rails console and run the same code that's failing in the queues, I can't get a failure to happen. However, as I said, the background failures are intermittent (about 50% of the time for workers running on nodes that also run kube-dns).
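A single host or dig run will usually succeed when the failure rate is around 50%, so it helps to sample resolution in a loop and count failures. Below is a minimal sketch (script name, attempt count, and sleep interval are arbitrary) that repeatedly resolves the two hostnames from the failing jobs; run it via kubectl exec inside one of the affected pods so the lookups go through the same kube-dns resolver the workers use.

```python
# dns_probe.py -- rough sketch for quantifying intermittent DNS failures.
import socket
import time

HOSTS = ["accounts.google.com", "s3.amazonaws.com"]  # hosts from the failing jobs
ATTEMPTS = 200                                       # arbitrary sample size

failures = {host: 0 for host in HOSTS}

for i in range(ATTEMPTS):
    for host in HOSTS:
        try:
            socket.getaddrinfo(host, 443)
        except socket.gaierror as err:
            # EAI_AGAIN surfaces as "Temporary failure in name resolution"
            failures[host] += 1
            print(f"attempt {i}: {host}: {err}")
    time.sleep(0.5)

for host, count in failures.items():
    pct = 100.0 * count / ATTEMPTS
    print(f"{host}: {count}/{ATTEMPTS} lookups failed ({pct:.0f}%)")
```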

So far, my interim fix has been to delete the failing pods, let Kubernetes reschedule them, and repeat until they land on a node that is not running kube-dns.
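Here is a rough sketch of that workaround as a script, assuming kubectl is configured for the cluster, that kube-dns runs in the kube-system namespace with the usual k8s-app=kube-dns label, and that the worker pods carry a label like app=resque-worker (that selector is a placeholder for whatever your pods actually use).

```python
# reschedule_colocated.py -- delete worker pods that share a node with kube-dns,
# so their controller reschedules them (possibly onto a different node).
import json
import subprocess

def kubectl_json(*args):
    """Run kubectl and parse its JSON output."""
    return json.loads(subprocess.check_output(["kubectl", *args]))

# Nodes currently hosting a kube-dns pod.
dns_pods = kubectl_json("get", "pods", "-n", "kube-system",
                        "-l", "k8s-app=kube-dns", "-o", "json")
dns_nodes = {p["spec"]["nodeName"] for p in dns_pods["items"]}
print("kube-dns nodes:", dns_nodes)

# Worker pods that landed on those nodes.
workers = kubectl_json("get", "pods", "-l", "app=resque-worker", "-o", "json")
for pod in workers["items"]:
    if pod["spec"].get("nodeName") in dns_nodes:
        name = pod["metadata"]["name"]
        print("deleting", name)
        subprocess.check_call(["kubectl", "delete", "pod", name])
```

There is no guarantee the replacement pods avoid the kube-dns nodes, so the script may need to be run more than once.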

Incidentally, removing the failing node did not resolve this. It just caused Kubernetes to reschedule everything onto other nodes and moved the problem with it.

Best Answer

I solved this by upgrading to Kubernetes 1.4.

The 1.4 release included several fixes to keep Kubernetes from crashing under out-of-memory conditions. I think this reduced the likelihood of hitting the issue, although I'm not convinced the core problem was fixed (unless the problem was that one of the kube-dns instances had crashed or become unresponsive because the Kubernetes system was unstable when a node hit OOM).
