Pacemaker: clear failed actions automatically

drbd pacemaker

I have created an active/passive cluster using Pacemaker, Corosync, and DRBD, and simulated an Apache failure with pkill httpd. Although Pacemaker recovered from the failure and restarted httpd, pcs status now shows:

Failed Actions:
* apache_monitor_60000 on server1 'not running' (7): call=39, status=complete, exitreason='none',
    last-rc-change='Wed May  9 09:55:45 2018', queued=0ms, exec=0ms

Why does Pacemaker not clear the failed action after a successful recovery? Is there any way to clear it other than manually?

Best Answer

That is by design. Some admins, myself included, like to see the error so that we know when it occurred and can investigate. Additionally, Pacemaker needs to track these failures so that it can decide where best to start a resource.

Pacemaker does, however, have a way to clear failures after a specified time, provided no new failures have occurred in the meantime. This is the failure-timeout option. It can be configured per resource, but below is how you would set it as a cluster-wide resource default with the crm shell. I would expect pcs to have a way to define it as well.

crm configure rsc_defaults failure-timeout=15m
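
Since the question uses pcs rather than the crm shell, here is a rough sketch of what the equivalent might look like there. The resource name apache is an assumption inferred from the apache_monitor_60000 entry in the status output, and the exact syntax depends on the pcs version, so check pcs resource --help on your system before relying on it.

```shell
# Cluster-wide default, the pcs counterpart of the crm command above.
# Assumption: older pcs syntax; newer pcs releases expect
# "pcs resource defaults update failure-timeout=15m" instead.
pcs resource defaults failure-timeout=15m

# Per-resource alternative: set the meta attribute on one resource only.
# "apache" is the assumed resource name.
pcs resource meta apache failure-timeout=15m

# Manual option: clear the recorded failure for that resource right away.
pcs resource cleanup apache
```

The cleanup command is the manual method the question refers to; the two failure-timeout variants only make the failure expire on its own.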

Please note that expiry is only evaluated at each cluster-recheck-interval, which by default is every 15 minutes. With a failure-timeout of 15m, depending on exactly when the failure occurred, it can take up to 29 minutes 59 seconds for the failure to clear.
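
The 29 minutes 59 seconds figure comes from the worst case where a recheck runs one second before the failure expires, so the expired failure has to wait almost a full interval for the next recheck. A small sketch of that arithmetic, with variable names that are only illustrative:

```shell
#!/bin/sh
# Worst-case delay before an expired failure is cleared, in seconds.
failure_timeout=$((15 * 60))    # failure-timeout=15m
recheck_interval=$((15 * 60))   # cluster-recheck-interval, 15 minutes by default

# A recheck fires 1 second before expiry, so the next one is a full
# interval later: timeout + interval - 1 second.
worst_case=$((failure_timeout + recheck_interval - 1))

printf '%dm %ds\n' $((worst_case / 60)) $((worst_case % 60))
```

Lowering cluster-recheck-interval shrinks the second term, at the cost of the cluster re-evaluating its state more often.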
