Pods stuck in ContainerCreating status in a self-managed Kubernetes cluster on Google Compute Engine (GCE) with an external kube node

google-compute-engine kubernetes

I have a Kubernetes cluster with 5 nodes: 4 Google Compute Engine VMs (one controller and 3 worker nodes) and one bare-metal local machine at my home (a kube worker node).
The cluster is up and running and all nodes are in Ready status.

  1. Self-managed cluster configured based on: https://docs.projectcalico.org/getting-started/kubernetes/self-managed-public-cloud/gce
  2. Firewall rules for Ingress and Egress are added for all IPs (0.0.0.0/0) and any ports.
  3. I advertise the kube master node with the **--control-plane-endpoint IP:PORT** flag set to the public IP of the master node and join the worker nodes based on that (a sketch of the commands follows this list).
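
For reference, a minimal sketch of what the init/join commands look like in this setup (the endpoint IP, token, and hash are placeholders, not the exact values used):

# on the controller VM: advertise the cluster on the master's public IP
sudo kubeadm init --control-plane-endpoint "PUBLIC_IP:6443"

# on each worker (the GCE VMs and the home machine): join against that endpoint
sudo kubeadm join PUBLIC_IP:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>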

Problem: When I deploy an application, all pods on the local worker node get stuck in ContainerCreating status, while containers on the GCE VM workers deploy correctly.
Does anyone know what the problem with this setup is and how I can solve it?

  • This is the Events output from kubectl describe pod for one of my pods:

Events:
Successfully assigned social-network/home-timeline-redis-6f4c5d55fc-tql2l to volatile

Warning  FailedCreatePodSandBox  3m14s  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "32b64e6efcaff6401b7b0a6936f005a00a53c19a2061b0a14906b8bc3a81bf20" network for pod "home-timeline-redis-6f4c5d55fc-tql2l": networkPlugin cni failed to set up pod "home-timeline-redis-6f4c5d55fc-tql2l_social-network" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Is the agent running?
  
Warning  FailedCreatePodSandBox  102s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1e95fa10d49abf5edc8693345256b91e88c31d1b6414761de80e6038cd7696a4" network for pod "home-timeline-redis-6f4c5d55fc-tql2l": networkPlugin cni failed to set up pod "home-timeline-redis-6f4c5d55fc-tql2l_social-network" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Is the agent running?
  
Normal   SandboxChanged          11s (x3 over 3m14s)  kubelet  Pod sandbox changed, it will be killed and re-created.
 
Warning  FailedCreatePodSandBox  11s                  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8f5959966e4c25f94bd49b82e1fa6da33a114b1680eae8898ba6685f22e7d37f" network for pod "home-timeline-redis-6f4c5d55fc-tql2l": networkPlugin cni failed to set up pod "home-timeline-redis-6f4c5d55fc-tql2l_social-network" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Is the agent running?

UPDATE

I reset kubeadm on all nodes, removed Cilium, and recreated the Calico CNI. I also changed the pod network CIDR with sudo kubeadm init --pod-network-cidr=20.96.0.0/12 --control-plane-endpoint "34.89.7.120:6443", and it seems that solved the conflict with the host CIDR. But pods on volatile (the local machine) are still stuck in ContainerCreating (a sketch of the reset sequence follows below):
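
For context, a rough sketch of the reset and re-init sequence described above (the commands are assumptions reconstructed from the description, not the exact ones run; the Calico manifest URL is the standard one from the docs linked earlier):

# on every node: tear down the old kubeadm state and leftover CNI config
sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d/*

# on the controller: re-initialise with the new pod CIDR and the public endpoint
sudo kubeadm init --pod-network-cidr=20.96.0.0/12 --control-plane-endpoint "34.89.7.120:6443"

# install Calico once the control plane is up
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml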


> root@controller:~# kubectl get pods -n kube-system -o wide
NAME                                       READY   STATUS             RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
calico-kube-controllers-744cfdf676-bh2nc   1/1     Running            0          12m   20.109.133.129   worker-2     <none>           <none>
calico-node-frv5r                          1/1     Running            0          12m   10.240.0.11      controller   <none>           <none>
calico-node-lplx6                          1/1     Running            0          12m   10.240.0.20      worker-0     <none>           <none>
calico-node-lwrdr                          1/1     Running            0          12m   10.240.0.21      worker-1     <none>           <none>
calico-node-ppczn                          0/1     CrashLoopBackOff   7          12m   130.239.41.206   volatile     <none>           <none>
calico-node-zplwx                          1/1     Running            0          12m   10.240.0.22      worker-2     <none>           <none>
coredns-74ff55c5b-69mn2                    1/1     Running            0          14m   20.105.55.194    controller   <none>           <none>
coredns-74ff55c5b-djczf                    1/1     Running            0          14m   20.105.55.193    controller   <none>           <none>
etcd-controller                            1/1     Running            0          14m   10.240.0.11      controller   <none>           <none>
kube-apiserver-controller                  1/1     Running            0          14m   10.240.0.11      controller   <none>           <none>
kube-controller-manager-controller         1/1     Running            0          14m   10.240.0.11      controller   <none>           <none>
kube-proxy-5vzdf                           1/1     Running            0          13m   10.240.0.20      worker-0     <none>           <none>
kube-proxy-d22q4                           1/1     Running            0          13m   10.240.0.22      worker-2     <none>           <none>
kube-proxy-hml5c                           1/1     Running            0          14m   10.240.0.11      controller   <none>           <none>
kube-proxy-hw8kl                           1/1     Running            0          13m   10.240.0.21      worker-1     <none>           <none>
kube-proxy-zb6t7                           1/1     Running            0          13m   130.239.41.206   volatile     <none>           <none>
kube-scheduler-controller                  1/1     Running            0          14m   10.240.0.11      controller   <none>           <none>

Describe output for the calico-node-ppczn pod:

   > root@controller:~# kubectl describe pod calico-node-ppczn -n kube-system
Name:                 calico-node-ppczn
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 volatile/130.239.41.206
Start Time:           Mon, 04 Jan 2021 13:01:36 +0000
Labels:               controller-revision-hash=89c447898
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   130.239.41.206
IPs:
  IP:           130.239.41.206
Controlled By:  DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://27f988847a484c5f74e000c4b8f473895b71ed49f27e0bf4fab4b425940951dc
    Image:         docker.io/calico/cni:v3.17.1
    Image ID:      docker-pullable://calico/cni@sha256:3dc2506632843491864ce73a6e73d5bba7d0dc25ec0df00c1baa91d17549b068
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 04 Jan 2021 13:01:37 +0000
      Finished:     Mon, 04 Jan 2021 13:01:38 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-8r94c (ro)
  install-cni:
    Container ID:  docker://5629f6984cfe545864d187112a0c1f65e7bdb7dbfae9b4971579f420ab55b77b
    Image:         docker.io/calico/cni:v3.17.1
    Image ID:      docker-pullable://calico/cni@sha256:3dc2506632843491864ce73a6e73d5bba7d0dc25ec0df00c1baa91d17549b068
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 04 Jan 2021 13:01:39 +0000
      Finished:     Mon, 04 Jan 2021 13:01:41 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-8r94c (ro)
  flexvol-driver:
    Container ID:   docker://3a4bf307a347926893aeb956717d84049af601fd4cc4aa7add6e182c85dc4e7c
    Image:          docker.io/calico/pod2daemon-flexvol:v3.17.1
    Image ID:       docker-pullable://calico/pod2daemon-flexvol@sha256:48f277d41c35dae051d7dd6f0ec8f64ac7ee6650e27102a41b0203a0c2ce6c6b
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 04 Jan 2021 13:01:43 +0000
      Finished:     Mon, 04 Jan 2021 13:01:43 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-8r94c (ro)
Containers:
  calico-node:
    Container ID:   docker://2576b2426c2a3fc4b6a972839a94872160c7ac5efa5b1159817be8d4ad4ddf60
    Image:          docker.io/calico/node:v3.17.1
    Image ID:       docker-pullable://calico/node@sha256:25e0b0495c0df3a7a06b6f9e92203c53e5b56c143ac1c885885ee84bf86285ff
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 04 Jan 2021 13:18:48 +0000
      Finished:     Mon, 04 Jan 2021 13:19:57 +0000
    Ready:          False
    Restart Count:  9
    Requests:
      cpu:      250m
    Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      CALICO_IPV4POOL_VXLAN:              Never
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_VXLANMTU:                     <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_WIREGUARDMTU:                 <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /sys/fs/ from sysfs (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (ro)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-8r94c (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  sysfs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/
    HostPathType:  DirectoryOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  cni-log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/calico/cni
    HostPathType:  
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:  
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-8r94c:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-8r94c
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Normal   Scheduled         22m                  default-scheduler  Successfully assigned kube-system/calico-node-ppczn to volatile
  Normal   Pulled            22m                  kubelet            Container image "docker.io/calico/cni:v3.17.1" already present on machine
  Normal   Created           22m                  kubelet            Created container upgrade-ipam
  Normal   Started           22m                  kubelet            Started container upgrade-ipam
  Normal   Pulled            21m                  kubelet            Container image "docker.io/calico/cni:v3.17.1" already present on machine
  Normal   Started           21m                  kubelet            Started container install-cni
  Normal   Created           21m                  kubelet            Created container install-cni
  Normal   Pulled            21m                  kubelet            Container image "docker.io/calico/pod2daemon-flexvol:v3.17.1" already present on machine
  Normal   Created           21m                  kubelet            Created container flexvol-driver
  Normal   Started           21m                  kubelet            Started container flexvol-driver
  Normal   Pulled            21m                  kubelet            Container image "docker.io/calico/node:v3.17.1" already present on machine
  Normal   Created           21m                  kubelet            Created container calico-node
  Normal   Started           21m                  kubelet            Started container calico-node
  Warning  Unhealthy         21m (x2 over 21m)    kubelet            Liveness probe failed: calico/node is not ready: Felix is not live: Get "http://localhost:9099/liveness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy         11m (x51 over 21m)   kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Failed to stat() nodename file: stat /var/lib/calico/nodename: no such file or directory
  Warning  DNSConfigForming  115s (x78 over 22m)  kubelet            Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 130.239.40.2 130.239.40.3 2001:6b0:e:4040::2

calico-node-ppczn logs:

> root@controller:~# kubectl logs  calico-node-ppczn -n kube-system
2021-01-04 13:17:38.010 [INFO][8] startup/startup.go 379: Early log level set to info
2021-01-04 13:17:38.010 [INFO][8] startup/startup.go 395: Using NODENAME environment for node name
2021-01-04 13:17:38.010 [INFO][8] startup/startup.go 407: Determined node name: volatile
2021-01-04 13:17:38.011 [INFO][8] startup/startup.go 439: Checking datastore connection
2021-01-04 13:18:08.011 [INFO][8] startup/startup.go 454: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: i/o timeout

On the local machine:

 > root@volatile:~# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
39efaf54f558        k8s.gcr.io/pause:3.2   "/pause"                 19 minutes ago      Up 19 minutes                           k8s_POD_calico-node-ppczn_kube-system_7e98eb90-f581-4dbc-b877-da25bc2868f9_0
05bd9fa182e5        e3f6fcd87756           "/usr/local/bin/kube…"   20 minutes ago      Up 20 minutes                           k8s_kube-proxy_kube-proxy-zb6t7_kube-system_90529aeb-d226-4061-a87f-d5b303207a2f_0
ae11c77897b0        k8s.gcr.io/pause:3.2   "/pause"                 20 minutes ago      Up 20 minutes                           k8s_POD_kube-proxy-zb6t7_kube-system_90529aeb-d226-4061-a87f-d5b303207a2f_0
> root@volatile:~# docker logs 39efaf54f558
> root@volatile:~# docker logs 05bd9fa182e5
I0104 13:00:51.131737       1 node.go:172] Successfully retrieved node IP: 130.239.41.206
I0104 13:00:51.132027       1 server_others.go:142] kube-proxy node IP is an IPv4 address (130.239.41.206), assume IPv4 operation
W0104 13:00:51.162536       1 server_others.go:578] Unknown proxy mode "", assuming iptables proxy
I0104 13:00:51.162615       1 server_others.go:185] Using iptables Proxier.
I0104 13:00:51.162797       1 server.go:650] Version: v1.20.1
I0104 13:00:51.163080       1 conntrack.go:52] Setting nf_conntrack_max to 262144
I0104 13:00:51.163289       1 config.go:315] Starting service config controller
I0104 13:00:51.163300       1 config.go:224] Starting endpoint slice config controller
I0104 13:00:51.163304       1 shared_informer.go:240] Waiting for caches to sync for service config
I0104 13:00:51.163311       1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0104 13:00:51.263469       1 shared_informer.go:247] Caches are synced for endpoint slice config 
I0104 13:00:51.263487       1 shared_informer.go:247] Caches are synced for service config 
> root@volatile:~# docker logs ae11c77897b0
root@volatile:~# ls /etc/cni/net.d/
10-calico.conflist  calico-kubeconfig
root@volatile:~# ls /var/lib/calico/
root@volatile:~# 

Best Answer

On the host volatile you appear to have Cilium configured in /etc/cni/net.d/*.conf. Cilium is a networking plugin, one of many available for Kubernetes. One of these files probably contains something like:

{
    "name": "cilium",
    "type": "cilium-cni"
}

If this is accidental, remove that file. You appear to already be running a competing networking plugin from Project Calico, which should be sufficient. So re-create the calico-kube-controllers pod in the kube-system namespace, let it succeed, then re-create the other pods.
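
A minimal sketch of that cleanup (the Cilium config filename is an assumption, so check the ls output first; the pod name and label are taken from the outputs above and the standard Calico manifest):

# on volatile: list the CNI configs and remove the leftover Cilium one
ls /etc/cni/net.d/
sudo rm /etc/cni/net.d/05-cilium.conf   # assumed filename; use whatever ls actually shows

# from the controller: delete the affected pods so their controllers re-create them
kubectl -n kube-system delete pod -l k8s-app=calico-kube-controllers
kubectl -n kube-system delete pod calico-node-ppczn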

If you intend to use Cilium on that host, go back to the Cilium installation guide. If you re-do it, you'll probably see that /var/run/cilium/cilium.sock has been created for you.
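
A quick way to verify, assuming Cilium is deployed as a DaemonSet in kube-system with the usual k8s-app=cilium label:

# check that the agent pod is running on the node
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# on the host itself, confirm the agent socket exists
ls -l /var/run/cilium/cilium.sock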
