Resources Are UNCLEAN When Unplugging Preferred Node – Fixing Failover Cluster Issues

failovercluster, linux-networking, pacemaker

I'm a complete beginner at Linux network configuration.

I configured a Linux cluster with Pacemaker + Corosync + STONITH via SSH + DRBD + nginx across 3 nodes.
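Roughly, the resources were created like this (the IP address, DRBD resource name, mount point, and preference score below are placeholders rather than my exact values, and the colocation/ordering constraints are omitted):

    # floating service IP that follows the active node
    pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.0.120 cidr_netmask=24 op monitor interval=30s
    # nginx serving the site from the shared filesystem
    pcs resource create WebSite ocf:heartbeat:nginx configfile=/etc/nginx/nginx.conf op monitor interval=1min
    # DRBD-replicated data, promoted to Master on one node at a time
    pcs resource create WebData ocf:linbit:drbd drbd_resource=wwwdata op monitor interval=60s
    pcs resource master WebDataClone WebData master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
    # mount of the DRBD device on the current Master
    pcs resource create WebFS ocf:heartbeat:Filesystem device=/dev/drbd1 directory=/var/www/html fstype=ext4
    # the preference that makes main-node the favored host for WebSite
    pcs constraint location WebSite prefers main-node=50
    # SSH-based fencing covering all three nodes
    pcs stonith create ssh-fencing ssh hostlist="main-node second-node third-node"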

pcs status:


3 nodes configured
7 resources configured

Online: [ main-node second-node third-node ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started main-node
 WebSite        (ocf::heartbeat:nginx): Started main-node
 Master/Slave Set: WebDataClone [WebData]
     Masters: [ main-node ]
     Slaves: [ second-node third-node ]
 WebFS  (ocf::heartbeat:Filesystem):    Started main-node
 ssh-fencing    (stonith:ssh):  Started third-node

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I'm testing STONITH on those machines by simply unplugging the network cable. This works fine: STONITH kills the unplugged machine once it is plugged back in, and the other machines keep the cluster running in the meantime.
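(Fencing can also be triggered by hand from a surviving node, which exercises the same agent without touching any cables:

    # ask the cluster to fence a specific node on demand
    pcs stonith fence second-node

)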

The problem appears when I unplug the machine that is preferred for the WebSite resource. Then pcs status on the other (still connected) machines looks like this:

3 nodes configured
7 resources configured

Node main-node: UNCLEAN (offline)
Online: [ second-node third-node ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started main-node (UNCLEAN)
 WebSite        (ocf::heartbeat:nginx): Started main-node (UNCLEAN)
 Master/Slave Set: WebDataClone [WebData]
     WebData    (ocf::linbit:drbd):     Master main-node (UNCLEAN)
     Slaves: [ second-node third-node ]
 WebFS  (ocf::heartbeat:Filesystem):    Started main-node (UNCLEAN)
 ssh-fencing    (stonith:ssh):  Started third-node

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

And the website is down. Why is that? Shouldn't the other nodes take over the resources?

Best Answer

SSH STONITH isn't real fencing, and it shouldn't be used in production unless you accept that it can leave the cluster hung in certain failure modes, exactly as you're seeing in your testing.

When you unplug a node's network cable, the cluster will try to STONITH the node that disappeared from the cluster/network. But the SSH STONITH agent uses the very network you've unplugged to attempt to power off the missing node, so it cannot succeed until the network is restored (the cable is plugged back in). Since the cluster will not take any action (failover) until the STONITH agent has confirmed the missing node is powered off, you're left with UNCLEAN (hung) services.

You will have the same problem if you pull the power on the primary node, since you cannot SSH into a system that has no power.

In short, this is the expected behavior when using SSH STONITH, and proper fencing devices are required to recover from the scenario you are testing.
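For example, if these machines have out-of-band management controllers (IPMI/BMC), the SSH agent could be swapped for fence_ipmilan. A minimal sketch, assuming one fence device per node; the BMC address and credentials below are placeholders:

    # remove the SSH pseudo-fencing
    pcs stonith delete ssh-fencing

    # one IPMI fence device per node (repeat for second-node and third-node)
    pcs stonith create fence-main fence_ipmilan pcmk_host_list="main-node" \
        ipaddr="10.0.0.11" login="admin" passwd="secret" op monitor interval=60s

    # never run a fence device on the node it is meant to kill
    pcs constraint location fence-main avoids main-node

Because a BMC is powered and networked independently of the host operating system, a device like this can typically still power the node off when its cluster NIC is unplugged or the OS is dead, which is exactly what the SSH agent cannot do.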