This is a slightly older question, but the problem presented here is based on a misconception about how and when failover works in clusters, especially two-node clusters.
The gist is: you cannot do failover testing by disabling communication between the two nodes. Doing so will result in exactly what you are seeing: a split-brain scenario with additional, mutual STONITH. If you want to test the fencing capabilities, a simple killall -9 corosync on the active node will do. Other ways are crm node fence or stonith_admin -F.
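For example, from a shell on the active node, any of the following should trigger fencing (node2 is a placeholder for the node name; crm node fence assumes the crmsh tooling):

# Hard-kill cluster messaging; the surviving node should detect the
# failure and fence this node via STONITH:
killall -9 corosync

# Ask the cluster to fence a specific node (crmsh):
crm node fence node2

# Or use the low-level fencing tool (-F fences/powers off the named node):
stonith_admin -F node2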
From the not quite complete description of your cluster (where is the output of crm configure show and cat /etc/corosync/corosync.conf?) it seems you are using the 10.10.10.xx addresses for messaging, i.e. Corosync/cluster communication. The 172.10.10.xx addresses are your regular/service network addresses, and you would access a given node, for example using SSH, by its 172.10.10.xx address. DNS also seems to resolve a node hostname like node1 to 172.10.10.1.
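For reference, a dedicated messaging network in corosync.conf typically looks something like this (a minimal sketch in corosync 2.x syntax; the cluster name and node IDs are assumptions, the addresses are taken from your description):

totem {
    version: 2
    cluster_name: mycluster        # assumed name
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.10.0    # cluster/messaging network only
    }
}

nodelist {
    node {
        ring0_addr: 10.10.10.1
        nodeid: 1
    }
    node {
        ring0_addr: 10.10.10.2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1    # two-node mode: quorum is kept with one node up
}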
You have STONITH configured to use SSH, which is not a very good idea in itself, but you are probably just testing. I haven't used it myself, but I assume the SSH STONITH agent logs into the other node and issues a shutdown command, like ssh root@node2 "shutdown -h now" or something equivalent.
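For completeness, an SSH-based fencing setup for testing is usually configured along these lines (a sketch using the external/ssh agent from cluster-glue; the resource names are made up):

# One fencing resource that can shoot either node, cloned so it
# runs everywhere -- for lab/testing use only, never in production:
primitive st_ssh stonith:external/ssh \
    params hostlist="node1 node2"
clone fencing st_ssh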
Now, what happens when you cut cluster communication between the nodes? The nodes no longer see each other as alive and well, because there is no more communication between them. Thus each node assumes it is the only survivor of some unfortunate event and tries to become (or remain) the active or primary node. This is the classic and dreaded split-brain scenario.
Part of this is to make sure the other, presumably failed, node is down for good, which is where STONITH comes in. Keep in mind that both nodes are now playing the same game: trying to become (or stay) active and take over all cluster resources, as well as shooting the other node in the head.
You can probably guess what happens now: node1 does ssh root@node2 "shutdown -h now" and node2 does ssh root@node1 "shutdown -h now". This doesn't use the cluster communication network (10.10.10.xx) but the service network (172.10.10.xx). Since both nodes are in fact alive and well, they have no problem issuing commands or receiving SSH connections, so both nodes shoot each other at the same time. This kills both nodes.
If you don't use STONITH, then a split-brain can have even worse consequences, especially with DRBD, where you could end up with both nodes becoming Primary. Data corruption is likely to happen, and the split-brain must be resolved manually.
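Resolving such a DRBD split-brain manually means picking a victim whose changes are discarded; the usual sequence looks roughly like this (DRBD 8.4 syntax; r0 is a placeholder resource name, so double-check against the DRBD documentation before running anything):

# On the node whose data you are willing to throw away:
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the surviving node (only needed if it is in StandAlone state):
drbdadm connect r0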
I recommend reading the material at http://www.hastexo.com/resources/hints-and-kinks, which is written and maintained by the guys who contributed (and still contribute) a large chunk of what we today call "the Linux HA stack".
TL;DR: If you are cutting cluster communication between your nodes in order to test your fencing setup, you are doing it wrong. Use killall -9 corosync, crm node fence or stonith_admin -F instead. Cutting cluster communication will only result in a split-brain scenario, which can and will lead to data corruption.
Your cluster architecture confuses me, as it seems you are running services that should be cluster-managed (like Varnish) standalone on two nodes at the same time, and letting the cluster resource manager (CRM) just juggle IP addresses around.
What is it you want to achieve with your cluster setup? Fault tolerance? Load balancing? Both? Mind you, I am talking about the cluster resources (Varnish, IP addresses, etc), not the backend servers to which Varnish distributes the load.
To me it sounds like you want an active-passive two-node cluster, which provides fault tolerance. One node is active and runs Varnish, the virtual IP addresses and possibly other resources, and the other node is passive and does nothing until the cluster resource manager moves resources over to the passive node, at which point it becomes active. This is a tried-and-true architecture that is as old as time itself. But for it to work you need to give the CRM full control over the resources. I recommend following Clusters from Scratch and modelling your cluster after that.
Edit after your updated question: your CIB looks good, and once you have patched the Varnish init script so that repeated calls to "start" return 0 (as the LSB spec requires), you should be able to add the following primitive (adjust the timeouts and intervals to your liking):
primitive p_varnish lsb:varnish \
op monitor interval="10s" timeout="15s" \
op start interval="0" timeout="10s" \
op stop interval="0" timeout="10s"
Don't forget to add it to the balancer group (the last element in the list):
group balancer eth0_gateway eth1_iceman_slider eth1_iceman_slider_ts \
eth1_iceman_slider_pm eth1_iceman_slider_jy eth1_iceman eth1_slider \
eth1_viper eth1_jester p_varnish
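If you would rather not edit the whole group definition by hand, recent crmsh versions can append a resource to an existing group (a sketch; verify that your crmsh has the modgroup command):

crm configure modgroup balancer add p_varnish
crm_mon -1    # one-shot status: p_varnish should start on the active node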
Edit 2: To decrease the migration threshold, add a resource defaults section at the end of your CIB and set the migration-threshold property to a low number. Setting it to 1 means the resource will be migrated after a single failure. It is also a good idea to set resource stickiness, so that a resource that has been migrated because of node failure (reboot or shutdown) does not automatically get migrated back later when the node is available again.
rsc_defaults $id="rsc-options" \
resource-stickiness="100" \
migration-threshold="1"
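The same can be set non-interactively from the shell (a crmsh one-liner, as a sketch):

crm configure rsc_defaults resource-stickiness=100 migration-threshold=1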
With Pacemaker you don't have 'an active node' and 'a passive node'. All cluster nodes (there may be more than two) can run services equally, and the rules in the CIB database tell the CRM (Pacemaker's resource manager) which node can run which services.
If you configured the service to run as a single instance that can run on both nodes with no constraints, then you cannot tell which node will run it. If you have two such services, you may end up with one running on one node and the other on the other node: for the first service node1 will be active, and for the second, node2.
You define the actual preference by declaring constraints, for example: 'run service 1 on the node where service 2 is running' or 'always prefer node 1 for both services'.
Usually you have one service that defines 'a logical master' (it can be an IP address or a DRBD volume in the Primary state); all other services depend on it, and you choose the 'master' by setting preferences for that primary service.
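For illustration, such constraints in crm shell syntax could look like this (a sketch; p_ip and p_service are hypothetical resource names):

# Keep the service on whatever node holds the IP ('logical master'):
colocation svc_with_ip inf: p_service p_ip
# Start the IP before the service:
order ip_before_svc inf: p_ip p_service
# Prefer node1 for the IP (and, via the colocation, for the service):
location prefer_node1 p_ip 100: node1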
See the Pacemaker documentation for details on setting the constraints.
Heartbeat itself, when used with Pacemaker, doesn't make any decisions about master/slave states or about which resources are running.