Unfortunately, there is no way to skip these steps entirely: Kubernetes has to perform a number of actions before it can restart the pods from a failed node. However, it is possible to improve the reaction time.
For example, you can reduce the value of node-monitor-grace-period (the default is 40 seconds).
This decreases the time between the actual failure of a node and the change of its status to NotReady.
You can find more details about these options in the kube-controller-manager reference documentation.
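For illustration, on a kubeadm-based cluster this flag can be set in the kube-controller-manager static pod manifest; the path below is the kubeadm default, and the 20-second value is only an example, not a recommendation:
# edit /etc/kubernetes/manifests/kube-controller-manager.yaml
# and add this flag to the kube-controller-manager command list:
    - --node-monitor-grace-period=20s
The kubelet watches this manifest and restarts kube-controller-manager automatically after the change.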
So far, I have found three problems:
docker version
In my first tries, I used docker.io from the default Ubuntu repositories (17.12.1-ce). In the tutorial at https://computingforgeeks.com/how-to-setup-3-node-kubernetes-cluster-on-ubuntu-18-04-with-weave-net-cni/, I discovered that they recommend something different:
apt-get --purge remove docker docker-engine docker.io
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install docker-ce
This installs version 18.6.1, which also no longer triggers a warning in the kubeadm preflight check.
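To keep a later apt upgrade from moving Docker past a version that kubeadm accepts, the package can additionally be pinned; this is my own suggestion, not part of the tutorial:
apt-mark hold docker-ce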
cleanup
I used kubeadm reset and deleted some directories when resetting my VMs to an unconfigured state. After reading some bug reports, I decided to extend the list of directories to remove. This is what I do now:
kubeadm reset
rm -rf /var/lib/cni/ /var/lib/calico/ /var/lib/kubelet/ /var/lib/etcd/ /etc/kubernetes/ /etc/cni/
reboot
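Note that kubeadm reset does not flush iptables rules or IPVS tables; its own output points this out. If stale kube-proxy rules cause trouble after re-initialization, they can be cleared manually as well (adapt to your setup):
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X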
Calico setup
With the above changes, I was immediately able to init a fully working setup (all pods "Running" and curl working). I followed the "Variant with extra etcd" instructions.
All this worked until the first reboot; then I again got the
calico-kube-controllers-f4dcbf48b-qrqnc CreateContainerConfigError
Digging into this problem showed me:
$ kubectl -n kube-system describe pod/calico-kube-controllers-f4dcbf48b-dp6n9
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 4m32s (x10 over 9m) kubelet, node1 Error: Couldn't find key etcd_endpoints in ConfigMap kube-system/calico-config
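To see which datastore variant the installed ConfigMap actually belongs to, it can be inspected directly; this is a diagnostic step I would add, not part of the original instructions:
kubectl -n kube-system get configmap calico-config -o yaml
The etcd-backed Calico manifests define an etcd_endpoints key in this ConfigMap, while the Kubernetes-datastore variant does not, which matches the error above.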
Then I realized that I had run two sets of installation instructions one after the other, although only one set was meant to be applied:
kubectl apply -f https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
curl https://docs.projectcalico.org/v3.2/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml -O
cp -p calico.yaml calico.yaml_orig
sed -i 's/192.168.0.0/10.10.0.0/' calico.yaml
kubectl apply -f calico.yaml
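Since calico.yaml_orig was kept as a backup above, the sed change can be verified before applying, to make sure only the pod network CIDR was touched; this check is my own addition:
diff calico.yaml_orig calico.yaml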
Result
$ kubectl get pod,svc,nodes --all-namespaces -owide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
default pod/www1 1/1 Running 2 71m 10.10.3.4 node1 <none>
default pod/www2 1/1 Running 2 71m 10.10.4.4 node2 <none>
kube-system pod/calico-node-45sjp 2/2 Running 4 74m 192.168.1.213 node1 <none>
kube-system pod/calico-node-bprml 2/2 Running 4 74m 192.168.1.211 master1 <none>
kube-system pod/calico-node-hqdsd 2/2 Running 4 74m 192.168.1.212 master2 <none>
kube-system pod/calico-node-p8fgq 2/2 Running 4 74m 192.168.1.214 node2 <none>
kube-system pod/coredns-576cbf47c7-f2l7l 1/1 Running 2 84m 10.10.2.7 master2 <none>
kube-system pod/coredns-576cbf47c7-frq5x 1/1 Running 2 84m 10.10.2.6 master2 <none>
kube-system pod/etcd-master1 1/1 Running 2 83m 192.168.1.211 master1 <none>
kube-system pod/kube-apiserver-master1 1/1 Running 2 83m 192.168.1.211 master1 <none>
kube-system pod/kube-controller-manager-master1 1/1 Running 2 83m 192.168.1.211 master1 <none>
kube-system pod/kube-proxy-9jmsk 1/1 Running 2 80m 192.168.1.213 node1 <none>
kube-system pod/kube-proxy-gtzvz 1/1 Running 2 80m 192.168.1.214 node2 <none>
kube-system pod/kube-proxy-str87 1/1 Running 2 84m 192.168.1.211 master1 <none>
kube-system pod/kube-proxy-tps6d 1/1 Running 2 80m 192.168.1.212 master2 <none>
kube-system pod/kube-scheduler-master1 1/1 Running 2 83m 192.168.1.211 master1 <none>
kube-system pod/kubernetes-dashboard-77fd78f978-9vdqz 1/1 Running 0 24m 10.10.3.5 node1 <none>
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 84m <none>
default service/www-np NodePort 10.107.205.119 <none> 8080:30333/TCP 71m service=testwww
kube-system service/calico-typha ClusterIP 10.99.187.161 <none> 5473/TCP 74m k8s-app=calico-typha
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 84m k8s-app=kube-dns
kube-system service/kubernetes-dashboard ClusterIP 10.96.168.213 <none> 443/TCP 24m k8s-app=kubernetes-dashboard
NAMESPACE NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node/master1 Ready master 84m v1.12.1 192.168.1.211 <none> Ubuntu 18.04 LTS 4.15.0-20-generic docker://18.6.1
node/master2 Ready <none> 80m v1.12.1 192.168.1.212 <none> Ubuntu 18.04 LTS 4.15.0-20-generic docker://18.6.1
node/node1 Ready <none> 80m v1.12.1 192.168.1.213 <none> Ubuntu 18.04 LTS 4.15.0-20-generic docker://18.6.1
node/node2 Ready <none> 80m v1.12.1 192.168.1.214 <none> Ubuntu 18.04 LTS 4.15.0-20-generic docker://18.6.1
192.168.1.211 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
192.168.1.212 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
192.168.1.213 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
192.168.1.214 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Best Answer
You will need to stop the control-plane components on the master: kube-apiserver, kube-controller-manager, and kube-scheduler.
If you have federation, also stop federation-apiserver.
Run a backup (snapshot) of etcd, and stop etcd when done.
On each node, stop the kubelet.
etcd is as robust as Consul; what do you mean by "unstable"?! As for restoring: even though you have the etcd data, it is not valid immediately ... you should read up on backups in Kubernetes.
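For the snapshot step, a minimal sketch using etcdctl, assuming a kubeadm-style local etcd with the default certificate paths (the endpoint, target file, and paths are assumptions to adapt):
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key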