I have a three-node multi-master Kubernetes (1.17.3) cluster (stacked control plane and etcd nodes):
11.11.11.1 - master1
11.11.11.2 - master2
11.11.11.3 - master3
Before going to production, I am testing possible failures and performed the steps below.
Graceful Removal of Master Nodes
- Run `kubectl drain 11.11.11.3` on master3
- Run `kubeadm reset` on master3
- Run `kubectl delete node 11.11.11.3` on master3
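For reference, the same sequence as one hedged script; the drain flags are my assumption (a kubeadm master usually needs `--ignore-daemonsets` before it will drain cleanly):

```bash
NODE=11.11.11.3
# Evict workloads from the node; DaemonSet pods cannot be evicted, so skip them
kubectl drain "$NODE" --ignore-daemonsets --delete-local-data
# On master3 itself: tear down the control-plane components and local etcd state
sudo kubeadm reset
# Finally, remove the Node object from the API
kubectl delete node "$NODE"
```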
By applying the above steps, all pods run on masters 1 and 2, and the entries for master3 are removed from the kubeadm-config ConfigMap and from etcd. In fact, I ran the above steps on master2 as well, and still one master is up and running and I can use kubectl.
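A quick way to verify both removals, assuming kubeadm's default cert paths and that the etcd pod is named after the node (`etcd-master1` here is my assumption):

```bash
# The ClusterStatus in the kubeadm-config ConfigMap should no longer list master3
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -A 10 apiEndpoints
# Neither should the etcd member list (stacked etcd, queried via the etcd pod)
kubectl -n kube-system exec etcd-master1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
```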
Non-Graceful Removal of Master Nodes
- I shut down master3 but don't face any issue; the two remaining masters are accessible and I can run kubectl and do administration.
- As soon as I shut down master2, I have no access to kubectl and it says the apiserver is not accessible. How can I recover master1 in this situation?
It can happen in production that two nodes have hardware issues at the same time. From my searching it looks like an etcd issue, but how can I access etcd and remove master2 and master3? I thought to do `docker ps` and `docker exec <etcd container>`, but `docker ps` is not showing the etcd container.
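One likely explanation (my assumption): without quorum the etcd container keeps crashing, and a crashed container does not appear in plain `docker ps`. A sketch of how to look anyway; the image tag and cert paths are kubeadm 1.17 defaults and may differ on your hosts:

```bash
# Include exited containers; kubelet-managed container names contain "etcd"
docker ps -a | grep etcd
# See why it keeps dying
docker logs $(docker ps -a -q --filter name=etcd | head -n 1)
# etcdctl can also run from a throwaway container against the host's certs
docker run --rm --network host \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  k8s.gcr.io/etcd:3.4.3-0 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
```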
Best Answer
A topic near and dear to my heart.
The short version is:

- force a new cluster from master1's etcd member (sketched below)
- `rm -rf /var/lib/etcd` on the other masters, and delete the server and peer certs (unless you have used the same CA for the disposable cluster -- something I highly recommend, but it may not be possible for a variety of reasons)
- join the other masters back in while master1 remains in your "cluster of one"
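A minimal sketch of that first step on a kubeadm stacked-etcd master, assuming the default manifest and cert locations (verify them on your host, and back up first):

```bash
# On master1, the sole survivor
sudo cp -a /var/lib/etcd /var/lib/etcd.bak
# Add --force-new-cluster to the etcd static pod; on restart, etcd drops the
# dead peers from its member list and comes back as a one-member cluster
sudo sed -i '/^    - etcd$/a\    - --force-new-cluster' /etc/kubernetes/manifests/etcd.yaml
# kubelet restarts the pod by itself; once it is healthy, confirm:
kubectl -n kube-system exec etcd-master1 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
# IMPORTANT: remove the flag again; it must not survive the next restart
sudo sed -i '/force-new-cluster/d' /etc/kubernetes/manifests/etcd.yaml
```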
I have had great success using etcdadm to automate all of those steps I just described, with the bad news being that you have to build the `etcdadm` binary yourself, because they don't -- as of this message -- attach built artifacts to their releases.
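In case it helps, the etcdadm flow is roughly this; I'm quoting the subcommands from memory, so check the project README before relying on them:

```bash
# Build the binary yourself (no release artifacts attached, as noted above)
git clone https://github.com/kubernetes-sigs/etcdadm && cd etcdadm && make
# On the seed node: create a new single-member cluster
sudo ./etcdadm init
# On each additional node: join via the seed member's client URL
sudo ./etcdadm join https://11.11.11.1:2379
```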
In the future, you'll want to include `etcdctl member remove $my_own_member_id`
from any orderly master teardown process, since if a member just disappears from an etcd cluster, that's damn near fatal to the cluster. There is an etcd issue speaking to the fact that etcd really is fragile, and you need a bigger team running it than you do kubernetes itself :-(
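For illustration, a hedged sketch of that teardown step; `$my_own_member_id` is resolved by matching the member name against the hostname (a kubeadm convention), and the cert paths are again the defaults:

```bash
# Run on the master being decommissioned, before `kubeadm reset`
CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"
# member list prints: ID, status, name, peer URLs, client URLs, ...
my_own_member_id=$(etcdctl --endpoints=https://127.0.0.1:2379 $CERTS member list \
  | awk -F', ' -v h="$(hostname)" '$3 == h {print $1}')
etcdctl --endpoints=https://127.0.0.1:2379 $CERTS member remove "$my_own_member_id"
```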