Pacemaker – Cluster does not pass to another node after a disconnected interface

cluster, corosync, high-availability, load balancing, pacemaker

I have the following scenario with Corosync + Pacemaker:

Node1:

eth0: 10.143.0.21/24

eth1: 10.10.10.1/30 (Corosync communication)

eth2: 192.168.5.2/24

Node2:

eth0: 10.143.0.22/24

eth1: 10.10.10.2/30 (Corosync communication)

eth2: 192.168.5.3/24

Floating IPs:

eth0: 10.143.0.23/24

eth2: 192.168.5.1/24

The eth1 interface is used only for Corosync communication.

For example, when I disconnect the network cable from eth0, nothing happens. When I disconnect the cable from eth2, I get the same result. But when I disconnect the cable from eth1 (Corosync communication), the floating IPs move to the other node.

How can I make the resources move to the other node when any interface is disconnected?

Regards

UPDATE

I tested with the following settings:

crm configure primitive PING-WAN ocf:pacemaker:ping params host_list="10.143.0.1" multiplier="1000" dampen="1s" op monitor interval="1s"
crm configure primitive Failover-WAN ocf:heartbeat:IPaddr2 params ip=10.143.0.23 nic=eth0 op monitor interval=10s meta is-managed=true
crm configure primitive Failover-LAN ocf:heartbeat:IPaddr2 params ip=192.168.5.1 nic=eth2 op monitor interval=10s meta is-managed=true
crm configure group Cluster Failover-WAN Failover-LAN
crm configure location Best_Connectivity Cluster rule pingd: defined pingd

This works: when I disconnect the network cable from eth0 and the node loses ping to 10.143.0.1 (the gateway), the resources move to the other node. But my scenario has three interfaces, so I added a second ping test:

crm configure primitive PING-LAN ocf:pacemaker:ping params host_list="192.168.5.4" multiplier="1000" dampen="1s" op monitor interval="1s"

But now the node has to lose the connection to both hosts (10.143.0.1 and 192.168.5.4) before the resources move to the other node.

I have been looking for information, but I cannot make the following scenario work:

If the node loses connectivity to any one host in the ping tests, the resources should move to the other node, without having to lose connectivity to all ping targets at the same time.

Best Answer

You need to tell Pacemaker you care about interfaces failing. Look at the ocf:pacemaker:ping resource. You can use that resource-agent to ping other host lists on the different interfaces' networks, and Pacemaker will react if those pings fail.

If you group the ocf:pacemaker:ping resources with whatever else you're managing in Pacemaker, or use constraints to relate them, they'll all move together.
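
As a rough sketch of the constraint approach, assuming the PING-WAN primitive and the Cluster group from the question, and assuming the ping agent writes its result to the default node attribute pingd. The clone ID PING-WAN-CLONE and the constraint ID Require_WAN are made up for illustration. The clone runs the ping test on every node, and the location rule keeps the group off any node where the attribute is missing or zero.

```shell
# Run the existing ping primitive on all nodes (clone ID is hypothetical).
crm configure clone PING-WAN-CLONE PING-WAN

# Ban the Cluster group from any node where pingd is undefined or 0
# (constraint ID is hypothetical).
crm configure location Require_WAN Cluster rule -inf: not_defined pingd or pingd lte 0
```

Unlike a rule that only adds the pingd value as a preference score, a -inf rule acts as a hard ban, so the group cannot stay on a node that has lost its ping target.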

Also, I would bet that when you unplugged eth1 in your previous tests, the IP wasn't "moving", but rather was being started on BOTH cluster nodes at the same time; each cluster node thought that its peer had gone missing. You were essentially testing what would happen if the cluster partitioned.

On that note, you should definitely configure a second redundant ring in your Corosync config as suggested in another answer, but that isn't going to have the effect you were looking for.

UPDATE 0: You should add both IPs to the same ping primitive's host_list rather than adding an additional ping primitive, and set a failure_score on that primitive to whatever is acceptable.

From the ocf:pacemaker:ping resource agent (# crm ra info ocf:pacemaker:ping):

...
failure_score (integer):
Resource is failed if the score is less than failure_score.
Default never fails.

host_list* (string): Host list
A space separated list of ping nodes to count.
...

Something like:

crm configure primitive PING-O-DOOM ocf:pacemaker:ping params host_list="10.143.0.1 192.168.5.4" failure_score="2" op monitor interval="10s"
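
To see why failure_score="2" trips as soon as one of the two hosts is lost, here is a small bash sketch of the arithmetic. It does not call the resource agent; it only mimics the comparison described in the help text above, assuming the score is the number of reachable hosts times the multiplier, and assuming a multiplier of 1 since the command above does not set one. The function names ping_score and ping_failed are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: mimics the score check, does not run ocf:pacemaker:ping.

# ping_score REACHABLE MULTIPLIER -> prints REACHABLE * MULTIPLIER
ping_score() {
    echo $(( $1 * $2 ))
}

# ping_failed SCORE FAILURE_SCORE -> succeeds when SCORE < FAILURE_SCORE
ping_failed() {
    [ "$1" -lt "$2" ]
}

multiplier=1      # assumed: the example command does not set a multiplier
failure_score=2   # value from the example command

# Walk through 2, 1 and 0 reachable hosts out of the two in host_list.
for reachable in 2 1 0; do
    score=$(ping_score "$reachable" "$multiplier")
    if ping_failed "$score" "$failure_score"; then
        echo "reachable=$reachable score=$score -> failed"
    else
        echo "reachable=$reachable score=$score -> ok"
    fi
done
```

With both hosts reachable the score is 2 and the resource stays healthy; with one or none reachable the score drops below 2 and the resource is reported as failed. If you set a multiplier, failure_score has to scale with it (for example, multiplier 1000 with two hosts would need failure_score 2000 for the same behaviour).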