This is an odd one.
I've set up my K8s cluster with 1 master and 1 worker. It uses Calico as the CNI, and everything looks to be working as expected (I'm able to deploy pods, services, etc.). I'm able to reach my pods/services via IP, but when I try to reach them by DNS name, e.g. myservice.default.svc, they are unreachable. So I started digging and troubleshooting DNS resolution, until I finally came to the conclusion that my kube-dns pods themselves are not reachable.
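For context, the quick repro I keep running is just a throwaway busybox pod doing a lookup against the cluster's own API service (the pod name is arbitrary; the 1.28 tag is commonly suggested because nslookup misbehaves in some newer busybox builds):
kubectl --kubeconfig cluster run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
On a healthy cluster that returns the ClusterIP of the kubernetes service almost immediately; here it just times out.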
Here's a bit of information:
DNS pods running:
kubectl --kubeconfig mycluster get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-jsqp9   1/1     Running   0          20h
coredns-f9fd979d6-tppbt   1/1     Running   0          20h
DNS Service running:
kubectl --kubeconfig cluster get svc --namespace=kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   21h
DNS Endpoints exposed:
kubectl --kubeconfig cluster get endpoints kube-dns --namespace=kube-system
NAME       ENDPOINTS                                                  AGE
kube-dns   10.45.83.1:53,10.45.83.2:53,10.45.83.1:9153 + 3 more...    21h
From a busybox pod, I'm able to access other services – for example a database:
/ # ping 10.36.12.13
PING 10.36.12.13 (10.36.12.13): 56 data bytes
64 bytes from 10.36.12.13: seq=0 ttl=63 time=0.213 ms
64 bytes from 10.36.12.13: seq=1 ttl=63 time=0.091 ms
# telnet 10.36.12.13 3306
Connected to 10.36.12.13
/etc/resolv.conf looks to be set up as expected:
cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
However, if I try to do a DNS lookup, it times out and reports that no servers could be reached:
nslookup backend.default.svc.cluster.local
;; connection timed out; no servers could be reached
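One thing that helps narrow it down: busybox's nslookup accepts the DNS server as a second argument, so you can point it straight at one of the CoreDNS pod IPs from the endpoints list and take the kube-dns ClusterIP (and therefore kube-proxy) out of the equation:
/ # nslookup backend.default.svc.cluster.local 10.45.83.1
Given the telnet/ping results below, this unsurprisingly times out in the same way.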
If I try to telnet or ping the CoreDNS pods directly, it fails:
telnet 10.45.83.1 53
^C
ping 10.45.83.1
PING 10.45.83.1 (10.45.83.1): 56 data bytes
^C
--- 10.45.83.1 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
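It's also worth noting where everything is actually scheduled; -o wide adds NODE and IP columns, so you can see whether the busybox pod and the CoreDNS pods landed on the same node or not (adjust the grep to your own pod names):
kubectl --kubeconfig cluster get pods -o wide --all-namespaces | grep -E 'coredns|busybox'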
Logs on both DNS pods look fine:
kubectl --kubeconfig cluster logs --namespace=kube-system -l k8s-app=kube-dns
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 127.0.0.1:40656 - 48819 "HINFO IN 1796540929503221175.488499616278261636. udp 56 false 512" NXDOMAIN qr,rd,ra 131 0.015421704s
Any ideas on what to check would be appreciated. I'd be happy to add any further info.
Best Answer
So, the root cause was that pods weren't able to reach pods running on the other host. Both DNS pods ended up being scheduled on host1, so everything worked from host1, but pods on host2 (which couldn't see anything on host1) couldn't resolve any DNS queries.
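In case it helps anyone hitting the same wall, the simplest way I know to confirm this kind of cross-node breakage is to pin one pod to each node and ping between them. A minimal sketch (pod names are arbitrary, and the nodeName values have to match your actual node names):
apiVersion: v1
kind: Pod
metadata:
  name: ping-host1
spec:
  nodeName: host1              # schedule directly onto the first node
  containers:
  - name: busybox
    image: busybox:1.28
    command: ["sleep", "3600"]
---
apiVersion: v1
kind: Pod
metadata:
  name: ping-host2
spec:
  nodeName: host2              # schedule directly onto the second node
  containers:
  - name: busybox
    image: busybox:1.28
    command: ["sleep", "3600"]
Apply it and ping the second pod from the first:
kubectl --kubeconfig cluster apply -f ping-pods.yaml
kubectl --kubeconfig cluster exec ping-host1 -- ping -c 3 "$(kubectl --kubeconfig cluster get pod ping-host2 -o jsonpath='{.status.podIP}')"
When cross-node pod traffic is broken, that ping never comes back, which is exactly what was happening here.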
This got resolved by changing the CNI to Weave instead of Calico. I spent well over a week troubleshooting Calico and then gave up; the pods simply couldn't get from one node to the other. I checked the BGP configuration, confirmed the required network ports were open and working, etc., but
calicoctl node status
kept reporting that the peer connection wasn't established. At this point I still don't know what was causing it. The one thing I noticed is that a weird virtual interface with a very odd CIDR got created every time I installed Calico, and that CIDR didn't match any of my networking needs. I decided it wasn't worth the effort, as I have no hard requirement for Calico. Thanks to everyone who checked!
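For anyone who'd rather keep digging on the Calico side instead of switching: to me, the combination of a BGP peer that never establishes and Calico latching onto a strange interface/CIDR looks like the node IP autodetection picking the wrong interface, and/or the BGP port being blocked between the nodes. Both are easy to check; the interface name below is only an example, and the namespace may be calico-system rather than kube-system depending on how Calico was installed:
# Calico's node-to-node BGP runs over TCP 179; verify it's reachable from each node to the other
nc -zv <other-node-ip> 179
# See which address Calico picked for this node, and compare with the node's real IPs
calicoctl node status
kubectl --kubeconfig cluster get nodes -o wide
# Tell Calico explicitly which interface to use for its node IP (IP_AUTODETECTION_METHOD is a documented calico-node setting)
kubectl --kubeconfig cluster -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=eth0
I never got to verify this on the broken cluster since I'd already moved to Weave, so treat it as a pointer rather than a confirmed fix.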