Pacemaker: clear failed actions automatically

drbd pacemaker

I have created an active/passive cluster using Pacemaker / Corosync / DRBD and have "simulated" an Apache failure with pkill httpd. Although Pacemaker recovered from the "failure" and restarted httpd, pcs status now shows:

Failed Actions:
* apache_monitor_60000 on server1 'not running' (7): call=39, status=complete, exitreason='none',
    last-rc-change='Wed May  9 09:55:45 2018', queued=0ms, exec=0ms

Why does Pacemaker not clear the failed action after a successful recovery? Is there any way to clear the failed action other than manually?

Best Answer

That is by design. Some admins, myself included, like to see the error so that we know when it occurred and can investigate. Additionally, pacemaker needs to track these errors so that it can decide where best to start a resource.

Pacemaker does, though, have a way to clear failures after a specified time if no new failures have occurred. This is the failure-timeout option. It can be configured per resource, but below is how you would set it as a cluster-wide resource default with the crm shell. I would expect pcs to have an equivalent.

crm configure rsc_defaults failure-timeout=15m
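For pcs users, the commands below are a rough equivalent. This is a sketch rather than a tested recipe: the resource name apache is inferred from the apache_monitor_60000 entry in the question, and the exact syntax depends on the pcs version installed.

```shell
# Cluster-wide resource default (older pcs syntax; newer releases
# expect "pcs resource defaults update failure-timeout=15m").
pcs resource defaults failure-timeout=15m

# Per-resource setting; "apache" is the resource name assumed from the question.
pcs resource meta apache failure-timeout=15m

# Manual alternative: clear the recorded failure for that resource right away.
pcs resource cleanup apache
```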

Please note that failure-timeout is only evaluated when the cluster rechecks its state, which happens every cluster-recheck-interval (15 minutes by default). With a failure-timeout of 15m, depending on exactly when the failure occurred, it can take up to 29 minutes 59 seconds for the failure to clear.
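The 29 minutes 59 seconds figure comes from adding the two intervals: if a recheck runs one second before the failure expires, the expiry is only noticed a full recheck interval later. A minimal sketch of that arithmetic, assuming whole-second granularity (the function name worst_case_clear is made up for illustration):

```shell
# Worst-case delay, in seconds, before an expired failure is cleared.
# $1 = failure-timeout in seconds, $2 = cluster-recheck-interval in seconds.
worst_case_clear() {
    # A recheck that lands 1 second before expiry misses it, so the
    # failure is cleared one full recheck interval after that check.
    echo $(( $1 + $2 - 1 ))
}

# Both values at 15 minutes, as in the answer above.
secs=$(worst_case_clear $((15 * 60)) $((15 * 60)))
printf '%d minutes %d seconds\n' $((secs / 60)) $((secs % 60))
```

Lowering either value shortens the delay, since both terms contribute equally to the worst case.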

Related Solutions

Corosync/Pacemaker + Haproxy Failed Actions: insufficient privileges

I think the problem is in the init script: it does not respect the LSB spec.

If you look at the function haproxy_stop, in file /etc/init.d/haproxy:

haproxy_stop()
{
    if [ ! -f $PIDFILE ] ; then
        # This is a success according to LSB
        return 0
    fi
    for pid in $(cat $PIDFILE) ; do
        /bin/kill $pid || return 4
    done
    rm -f $PIDFILE
    return 0
}

In particular, look at the line /bin/kill $pid || return 4. Whenever kill fails, for example because the PID file is stale and the process is already gone, the function returns 4, which according to the spec means "user had insufficient privilege". That is not correct.

In case of an error while processing any init-script action except for status, the init script shall print an error message and exit with a non-zero status code:
1 generic or unspecified error (current practice)
2 invalid or excess argument(s)
3 unimplemented feature (for example, "reload")
4 user had insufficient privilege
5 program is not installed
6 program is not configured
7 program is not running
8-99  reserved for future LSB use
100-149   reserved for distribution use
150-199   reserved for application use
200-254   reserved
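When reading init-script failures in the cluster status, it can help to turn the numeric code back into the wording of the table above. The helper below is only an illustration: the name lsb_exit_meaning is made up, and the strings are copied from the quoted LSB list.

```shell
# Map an LSB init-script exit status (non-"status" actions) to its meaning.
# Hypothetical helper; the text comes from the LSB table quoted above.
lsb_exit_meaning() {
    case "$1" in
        0) echo "success" ;;
        1) echo "generic or unspecified error" ;;
        2) echo "invalid or excess argument(s)" ;;
        3) echo "unimplemented feature" ;;
        4) echo "user had insufficient privilege" ;;
        5) echo "program is not installed" ;;
        6) echo "program is not configured" ;;
        7) echo "program is not running" ;;
        *) echo "reserved range (see the LSB table)" ;;
    esac
}
```

For example, lsb_exit_meaning 4 prints the same "insufficient privilege" wording that the question title refers to.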

You can try changing the kill line in haproxy_stop to:

/bin/kill $pid || return 7

The correct way is to stop the daemon with killproc(8); if this fails, killproc sets the return value according to LSB.

For example:

/sbin/killproc -p $PIDFILE $HAPROXY

This sends SIGTERM to the PID found in $PIDFILE if and only if that PID belongs to $HAPROXY. If the named $PIDFILE does not exist, killproc assumes that the $HAPROXY daemon is not running. The exit status is 0 if the default signals SIGTERM and SIGKILL were delivered successfully, or 7 if the program was not running. It is also considered successful if no signal was specified and there was no program to terminate because it had already exited.
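If killproc is not available on the system, the stale-PID case can also be handled inside the function. The version below is a sketch, not the packaged init script. It assumes a Linux /proc filesystem and that PIDFILE is set by the caller, as in the original. It also treats an already-gone process as a successful stop, which is how the LSB spec classifies stopping a service that is not running.

```shell
# Sketch of a revised haproxy_stop; PIDFILE is assumed to be set by the caller.
haproxy_stop()
{
    if [ ! -f "$PIDFILE" ] ; then
        # No PID file: nothing to stop, which LSB counts as success.
        return 0
    fi
    for pid in $(cat "$PIDFILE") ; do
        if [ ! -d "/proc/$pid" ] ; then
            # Stale entry: the process is already gone (Linux-specific check).
            continue
        fi
        # The process exists, so a failing kill is a real error.
        # Return the generic code instead of guessing at the cause.
        kill "$pid" || return 1
    done
    rm -f "$PIDFILE"
    return 0
}
```

There is still a small race if the process exits between the /proc check and the kill, so this narrows the misreporting rather than eliminating it.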

Split brain on DRBD and Pacemaker cluster

1- Is my assumption correct that on event 3, the returning node can be automatically joined to the cluster?

Yes, this can be done. DRBD should not go Primary on its own unless told to in the resource configuration; check that the 'startup { become-primary-on }' definition is not set in the resource configs.

2- If it can be done, please tell how.

Check that the following are true:

a. 'drbd' is not set to start at boot ('chkconfig drbd off' in RHEL, 'update-rc.d drbd disable' in Debian).

b. DRBD is not configured to become primary on its own (as mentioned above).

The DRBD user's guide has a section on configuring DRBD for use with Pacemaker that might help if my answer above doesn't do the trick: https://drbd.linbit.com/users-guide/ch-pacemaker.html
