How to remove pacemaker, corosync and pcs thoroughly on CentOS 7

centos7 cluster corosync pacemaker remove

Using two nodes:

node1: 192.168.0.1
node2: 192.168.0.2

Installed HA tools on both servers:

yum install pacemaker pcs

(This also installs corosync as a dependency.)

On both servers:

passwd hacluster

Set the same password for the hacluster user on both nodes.

On both servers:

systemctl enable pcsd.service
systemctl start pcsd.service

Authenticating the cluster nodes:

node1# pcs cluster auth 192.168.0.1 192.168.0.2

Both nodes authenticated successfully.

Generating the corosync configuration:

node1# pcs cluster setup --name mycluster 192.168.0.1 192.168.0.2

Starting the cluster:

node1# pcs cluster start --all

Success.

Confirm status:

pcs status corosync

Output
Membership information
----------------------
    Nodeid      Votes Name
         2          1 192.168.0.2
         1          1 192.168.0.1 (local)
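A small check along these lines could confirm that both nodes show up as members. It is only a sketch: count_members is a hypothetical helper name, and it assumes the membership table keeps the layout shown above (a numeric Nodeid followed by a numeric vote count).

```shell
#!/usr/bin/env bash
# Count the members listed by `pcs status corosync` (reads that output on stdin).
# A member line is assumed to start with a numeric Nodeid and a numeric vote count.
count_members() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Sample input copied from the output above; on a live node use:
#   pcs status corosync | count_members
count_members <<'EOF'
Membership information
----------------------
    Nodeid      Votes Name
         2          1 192.168.0.2
         1          1 192.168.0.1 (local)
EOF
```

For the sample above this prints 2. A result of 1 on each node would match the problem described later in this question.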

Get more information about the current status of the cluster:

pcs cluster status

Output
Cluster Status:
 ...
 Stack: corosync
 ...
 2 nodes and 0 resources configured
 Online: [ node1 node2 ]

PCSD Status:
  node1 (192.168.0.1): Online
  node2 (192.168.0.2): Online

Enable the corosync and pacemaker services on both servers:

systemctl enable corosync.service
systemctl enable pacemaker.service

Disabling STONITH

node1# pcs property set stonith-enabled=false

After creating a floating IP and adding it as a pcs resource, I tested failover.

On node1:

reboot

Then the trouble started. After node1 rebooted, I ran pcs cluster status again and it showed:

  Cluster Status:
   Stack: corosync
   Current DC: centos7lb1 (version 1.1.15-11.el7_3.5-e174ec8) - partition WITHOUT quorum
   Last updated: Sun Jul 23 23:47:53 2017         Last change: Fri Jul 21 05:56:32 2017 by hacluster via crmd on node1
   1 node and 0 resources configured

  PCSD Status:
    node1 (192.168.0.1): Online
    *Unknown* (192.168.0.2): Online

Running pcs status on node1 showed:

    Cluster name: mycluster
    WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
    Stack: corosync
    Current DC: node1 (version 1.1.15-11.el7_3.5-e174ec8) - partition WITHOUT quorum
    Last updated: Sun Jul 23 23:58:22 2017          Last change: Fri Jul 21 05:56:32 2017 by hacluster via crmd on node1

    1 node and 0 resources configured

    Online: [ node1 ]

    No resources


    Daemon Status:
      corosync: active/disabled
      pacemaker: active/disabled
      pcsd: active/enabled

node2 is missing from the cluster. Checking the status on node2 at the same time showed only one node (node2) as well: just like node1, it could not see the other node in the cluster.

I tried removing pacemaker, corosync and pcs so that I could start over. I removed them with:

yum remove pacemaker pcs

Then, after setting the nodes up again, I tried to authenticate them:

pcs cluster auth node1 node2

pcs reported that both nodes were already authorized.

At this point, how can I join the two nodes into the cluster correctly again? And how do I remove everything cleanly so that I can start from scratch?
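For context, yum remove alone leaves cluster state behind, which is probably why the nodes still show as authorized. A fuller cleanup might look like the sketch below, run on each node. The run and cleanup_node names are hypothetical helpers, the script only prints the commands unless DRY_RUN=0 is set, and the leftover paths are assumptions for pcs 0.9 on CentOS 7 that should be checked before deleting anything.

```shell
#!/usr/bin/env bash
# Sketch of a fuller cleanup on ONE node; repeat on the other node.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cleanup_node() {
    # Stop the cluster services and delete the local cluster configuration.
    # This has to happen while pcs is still installed.
    run pcs cluster destroy
    # Remove the packages (corosync was pulled in as a dependency).
    run yum -y remove pacemaker pcs corosync
    # Leftover state that yum does not remove (assumed paths, verify first).
    # /var/lib/pcsd holds the pcsd auth tokens behind "Already authorized".
    run rm -rf /etc/corosync/corosync.conf /var/lib/pacemaker /var/lib/corosync /var/lib/pcsd
}

cleanup_node
```

If the only concern is the "Already authorized" message, pcs cluster auth node1 node2 --force should re-authenticate without removing anything.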

Best Answer

The reason was the firewall.

Because Corosync uses UDP transport on ports 5404 and 5405, I added:

iptables -I INPUT -m state --state NEW -p udp -m multiport --dports 5404,5405 -j ACCEPT
iptables -I OUTPUT -m state --state NEW -p udp -m multiport --sports 5404,5405 -j ACCEPT
service iptables save

Then I stopped and started the whole cluster:

pcs cluster stop --all
pcs cluster start --all

I also ran:

service corosync restart

The cluster works now. All the nodes can be seen and all of them are online.
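On a stock CentOS 7 install, firewalld is the default rather than the iptables service, so the same ports may need to be opened there instead. The sketch below assumes firewalld is running and uses its predefined high-availability service, which covers the Corosync UDP ports along with the pcsd and Pacemaker ports. The run and open_cluster_ports names are hypothetical helpers, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# firewalld variant of the fix above; run on both nodes.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

open_cluster_ports() {
    # Permanently allow the cluster traffic, then reload to apply the change
    run firewall-cmd --permanent --add-service=high-availability
    run firewall-cmd --reload
}

open_cluster_ports
```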

Related Solutions

How to setup STONITH in a 2-node active/passive linux HA pacemaker cluster

This is a slightly older question but the problem presented here is based on a misconception on how and when failover in clusters, especially two-node clusters, works.

The gist is: you cannot do failover testing by disabling communication between the two nodes. Doing so will result in exactly what you are seeing: a split-brain scenario with additional, mutual STONITH. If you want to test the fencing capabilities, a simple killall -9 corosync on the active node will do. Other ways are crm node fence or stonith_admin -F.
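The three fencing tests mentioned above could be wrapped roughly as follows. This is a sketch, not a tested procedure: node2 is a placeholder node name, run and test_fencing are hypothetical helpers, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Ways to trigger fencing without cutting cluster communication.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# test_fencing TARGET [METHOD], where METHOD is kill, crm or stonith_admin
test_fencing() {
    target="$1"
    method="${2:-kill}"
    case "$method" in
        kill)
            # Run this on the active node itself: it simulates a crashed node
            run killall -9 corosync
            ;;
        crm)
            # Ask the cluster (via crmsh) to fence the target node
            run crm node fence "$target"
            ;;
        stonith_admin)
            # Ask the fencing daemon directly to fence the target node
            run stonith_admin -F "$target"
            ;;
        *)
            echo "unknown method: $method" >&2
            return 1
            ;;
    esac
}

test_fencing node2 stonith_admin
```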

From the not quite complete description of your cluster (where is the output of crm configure show and cat /etc/corosync/corosync.conf?) it seems you are using the 10.10.10.xx addresses for messaging, i.e. Corosync/cluster communication. The 172.10.10.xx addresses are your regular/service network addresses and you would access a given node, for example using SSH, by its 172.10.10.xx address. DNS also seems to resolve a node hostname like node1 to 172.10.10.1.

You have STONITH configured to use SSH, which is not a very good idea in itself, but you are probably just testing. I haven't used it myself but I assume the SSH STONITH agent logs into the other node and issues a shutdown command, like ssh root@node2 "shutdown -h now" or something equivalent.

Now, what happens when you cut cluster communication between the nodes? The nodes no longer see each other as alive and well, because there is no more communication between them. Thus each node assumes it is the only survivor of some unfortunate event and tries to become (or remain) the active or primary node. This is the classic and dreaded split-brain scenario.

Part of this is to make sure the other, obviously and presumably failed node is down for good, which is where STONITH comes in. Keep in mind that both nodes are now playing the same game: trying to become (or stay) active and take over all cluster resources, as well as shooting the other node in the head.

You can probably guess what happens now. node1 does ssh root@node2 "shutdown -h now" and node2 does ssh root@node1 "shutdown -h now". This doesn't use the cluster communication network 10.10.10.xx but the service network 172.10.10.xx. Since both nodes are in fact alive and well, they have no problem issuing commands or receiving SSH connections, so both nodes shoot each other at the same time. This kills both nodes.

If you don't use STONITH then a split-brain could have even worse consequences, especially in case of DRBD, where you could end up with both nodes becoming Primary. Data corruption is likely to happen and the split-brain must be resolved manually.

I recommend reading the material on http://www.hastexo.com/resources/hints-and-kinks which is written and maintained by the guys who contributed (and still contribute) a large chunk of what we today call "the Linux HA stack".

TL;DR: If you are cutting cluster communication between your nodes in order to test your fencing setup, you are doing it wrong. Use killall -9 corosync, crm node fence or stonith_admin -F instead. Cutting cluster communication will only result in a split-brain scenario, which can and will lead to data corruption.

Drbd corosync cluster second node is trying to be primary all the times

This happens because you don't have cluster fencing (STONITH) configured, so your cluster is now in split-brain.

You now have a cluster with two DCs, and each node is trying to start the resources.
