Kubernetes – Troubleshooting Network Issues in Cluster


I have a Kubernetes cluster with one master and three nodes inside a VPN, and all nodes show Ready status. It was built using kubeadm and flannel. The VPN network has the range 192.168.1.0/16.

$ kubectl get nodes -o wide

NAME        STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8-master   Ready    master   144d   v1.17.0   192.168.1.132   <none>        Ubuntu 18.04.3 LTS   4.15.0-72-generic   docker://18.9.7
k8-n1       Ready    <none>   144d   v1.17.0   192.168.1.133   <none>        Ubuntu 18.04.3 LTS   4.15.0-72-generic   docker://18.9.7
k8-n2       Ready    <none>   144d   v1.17.0   192.168.1.134   <none>        Ubuntu 18.04.3 LTS   4.15.0-72-generic   docker://18.9.7
k8-n3       Ready    <none>   144d   v1.17.0   192.168.1.135   <none>        Ubuntu 18.04.3 LTS   4.15.0-72-generic   docker://18.9.7

I can reach the nodes.

$ ping 192.168.1.133

PING 192.168.1.133 (192.168.1.133) 56(84) bytes of data.
64 bytes from 192.168.1.133: icmp_seq=1 ttl=64 time=0.219 ms
64 bytes from 192.168.1.133: icmp_seq=2 ttl=64 time=0.246 ms
64 bytes from 192.168.1.133: icmp_seq=3 ttl=64 time=0.199 ms
64 bytes from 192.168.1.133: icmp_seq=4 ttl=64 time=0.209 ms
^C
--- 192.168.1.133 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3071ms
rtt min/avg/max/mdev = 0.199/0.218/0.246/0.020 ms

$ ping 192.168.1.134

PING 192.168.1.134 (192.168.1.134) 56(84) bytes of data.
64 bytes from 192.168.1.134: icmp_seq=1 ttl=64 time=0.288 ms
64 bytes from 192.168.1.134: icmp_seq=2 ttl=64 time=0.272 ms
64 bytes from 192.168.1.134: icmp_seq=3 ttl=64 time=0.268 ms
^C
--- 192.168.1.134 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2032ms
rtt min/avg/max/mdev = 0.268/0.276/0.288/0.008 ms

$ ping 192.168.1.135

PING 192.168.1.135 (192.168.1.135) 56(84) bytes of data.
64 bytes from 192.168.1.135: icmp_seq=1 ttl=64 time=0.278 ms
64 bytes from 192.168.1.135: icmp_seq=2 ttl=64 time=0.221 ms
64 bytes from 192.168.1.135: icmp_seq=3 ttl=64 time=0.181 ms
^C
--- 192.168.1.135 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2030ms

To test whether networking worked, I created an nginx Deployment with two pods:

$ kubectl get pods -o wide

NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
nginx-deployment-574b87c764-2gz8t   1/1     Running   0          25m   192.168.2.12   k8-n2   <none>           <none>
nginx-deployment-574b87c764-rst8x   1/1     Running   0          25m   192.168.1.17   k8-n1   <none>           <none>

$ kubectl get svc

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes         ClusterIP   10.96.0.1       <none>        443/TCP        3d17h
nginx-deployment   NodePort    10.96.211.211   <none>        80:31577/TCP   13s
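
For reference, a NodePort Service matching the listing above could be expressed roughly like this (a sketch; the `app: nginx` selector label is an assumption, not shown in the question):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-deployment
spec:
  type: NodePort
  selector:
    app: nginx          # assumed label on the Deployment's pods
  ports:
    - port: 80          # ClusterIP port (10.96.211.211:80)
      targetPort: 80    # container port inside the pods
      nodePort: 31577   # exposed on every node's IP
```

With such a Service, `curl <any-node-ip>:31577` and `curl 10.96.211.211:80` should both reach the nginx pods, which is exactly what fails below.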

But I can't connect to it:

$ curl k8-n1:31577
curl: (7) Failed to connect to k8-n1 port 31577: Connection refused
$ curl k8-n2:31577
curl: (7) Failed to connect to k8-n2 port 31577: Connection refused
$ curl k8-n3:31577
curl: (7) Failed to connect to k8-n3 port 31577: Connection refused
$ curl 10.96.211.211:80
curl: (7) Failed to connect to 10.96.211.211 port 80: Connection refused
$ curl 192.168.1.17:80
curl: (7) Failed to connect to 192.168.1.17 port 80: No route to host
$ curl 192.168.1.17:31577
curl: (7) Failed to connect to 192.168.1.17 port 31577: No route to host
$ curl 192.168.1.133:31577
curl: (7) Failed to connect to 192.168.1.133 port 31577: Connection refused
$ curl 192.168.1.133:6443
curl: (7) Failed to connect to 192.168.1.133 port 6443: Connection refused

I had changed the defaults when setting up the cluster. It was initialized with:

sudo kubeadm init --pod-network-cidr=192.168.1.0/16 --apiserver-advertise-address=192.168.1.132

and I also changed the flannel network to 192.168.1.0/16, by editing its ConfigMap:

kubectl edit cm -n kube-system kube-flannel-cfg
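
For comparison, the stock kube-flannel manifest defines the pod network in the `net-conf.json` key of that ConfigMap, with a default of 10.244.0.0/16 (abridged sketch below). Whatever value is used here must match kubeadm's `--pod-network-cidr` and must not overlap the node subnet:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
data:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
```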

The coredns pod's events after a restart:

  Normal   Scheduled  109s                default-scheduler  Successfully assigned kube-system/coredns-6955765f44-vwqgm to k8-n1
  Normal   Pulled     106s                kubelet, k8-n1     Container image "k8s.gcr.io/coredns:1.6.5" already present on machine
  Normal   Created    105s                kubelet, k8-n1     Created container coredns
  Normal   Started    105s                kubelet, k8-n1     Started container coredns
  Warning  Unhealthy  3s (x11 over 103s)  kubelet, k8-n1     Readiness probe failed: Get http://192.168.1.19:8181/ready: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  1s (x5 over 41s)    kubelet, k8-n1     Liveness probe failed: Get http://192.168.1.19:8080/health: dial tcp 192.168.1.19:8080: connect: no route to host
  Normal   Killing    1s                  kubelet, k8-n1     Container coredns failed liveness probe, will be restarted

I would appreciate any help or ask for more information.

Best Answer

While checking the question I noticed that the OP initialized the cluster with the pod network CIDR 192.168.1.0/16, which overlaps the nodes' own IP addresses (192.168.1.132–135). That is why pods received addresses such as 192.168.1.17 that collide with hosts on the VPN, why curl reports "No route to host" for pod IPs, and why the coredns readiness and liveness probes fail.

Re-initializing the cluster with a different, non-overlapping CIDR solved the issue.
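
The overlap can be demonstrated with Python's standard `ipaddress` module (a minimal sketch; the addresses are taken from the question, and 10.244.0.0/16 is flannel's default pod CIDR):

```python
# Check whether the chosen pod-network CIDR overlaps the node subnet.
import ipaddress

node_ip = ipaddress.ip_address("192.168.1.132")  # k8-master INTERNAL-IP

# strict=False is required here: 192.168.1.0/16 has host bits set, which is
# already a hint the CIDR was malformed (a clean /16 would be 192.168.0.0/16).
bad_pod_cidr = ipaddress.ip_network("192.168.1.0/16", strict=False)
good_pod_cidr = ipaddress.ip_network("10.244.0.0/16")

print(node_ip in bad_pod_cidr)   # True  -> pod IPs collide with node/VPN IPs
print(node_ip in good_pod_cidr)  # False -> no overlap, safe choice
```

Any CIDR for which the first check prints False for every node IP would avoid the collision.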
