I have created an active/passive cluster using Pacemaker, Corosync and DRBD, and "simulated" an Apache failure with pkill httpd.
Although Pacemaker recovered from the "failure" and restarted httpd, pcs status now shows:
Failed Actions:
* apache_monitor_60000 on server1 'not running' (7): call=39, status=complete, exitreason='none',
last-rc-change='Wed May 9 09:55:45 2018', queued=0ms, exec=0ms
Why does Pacemaker not clear the failed action after a successful recovery? And is there any way to clear it other than doing so manually?
Best Answer
That is by design. Some admins, myself included, like to see the error so that we know when it occurred and can investigate. Additionally, pacemaker needs to track these errors so that it can decide where best to start a resource.
Pacemaker does, however, have a way to clear failures after a specified time if no new failures have occurred. This is the failure-timeout option. It can be configured per resource or as a cluster-wide resource default (rsc_defaults in the crm shell). I would expect pcs to have a way to define it as well.
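With the crm shell, a cluster-wide default of 15 minutes can be set roughly as shown here. The pcs lines are a sketch under assumptions: pcs syntax differs between releases, and the resource name apache is inferred from the failed action in the question, so check both against your own cluster before running anything.

```shell
# crm shell: set failure-timeout as a cluster-wide resource default
crm configure rsc_defaults failure-timeout=15m

# pcs equivalent (assumed for older pcs; newer releases expect
# "pcs resource defaults update failure-timeout=15m")
pcs resource defaults failure-timeout=15m

# Per-resource alternative; "apache" is the resource name assumed from the question
pcs resource update apache meta failure-timeout=15m

# Manual clearing of the recorded failure, for comparison
pcs resource cleanup apache
```

The per-resource form is useful when only some resources should forget failures automatically while others keep them visible until an admin has looked.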
Please note that the failure-timeout is only evaluated at each cluster-recheck-interval, which defaults to 15 minutes. With a failure-timeout of 15m, depending on exactly when the failure occurred, it can take up to 29 minutes 59 seconds for the failure to clear.
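The 29 minutes 59 seconds figure follows from simple arithmetic, sketched here as a small shell function. This is a simplified model and not Pacemaker's actual logic: it assumes rechecks fire at exact multiples of the interval and that a failure is cleared at the first recheck where its age has reached the timeout. The function name worst_case_wait is made up for illustration.

```shell
#!/bin/sh
# worst_case_wait RECHECK_INTERVAL FAILURE_TIMEOUT FAILURE_AT
# All arguments are in seconds; FAILURE_AT is measured from a recheck at t=0.
# Prints how long the failure stays visible under the simplified model above.
worst_case_wait() {
    interval=$1
    timeout=$2
    failed_at=$3

    # Step through recheck ticks until the failure is old enough to expire.
    tick=$interval
    while [ $((tick - failed_at)) -lt "$timeout" ]; do
        tick=$((tick + interval))
    done

    waited=$((tick - failed_at))
    printf '%dm %ds\n' $((waited / 60)) $((waited % 60))
}

# Failure 1 second after a recheck, 15m interval, 15m timeout
worst_case_wait 900 900 1   # prints "29m 59s"
```

At the first recheck the failure is only 14m 59s old, so it survives until the second recheck, 29m 59s after it occurred.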