Linux – Pacemaker failure-timeout doesn't reset failcount

corosync failover high-availability linux pacemaker

I'm using Pacemaker 1.1.13 and Corosync 2.3.4 on CentOS 7.

I have a problem with a Master/Slave resource. These are the meta attributes for my resource:

migration-threshold=1

failure-timeout=10s

but when the resource goes down, there is only one attempt to start it. The documentation says that the attribute failure-timeout=10s should reset the failcount every 10 seconds, but that does not happen, so the resource never starts.

Do you know anything about this problem? Maybe I'm doing something wrong? My 'pcs status' output is below:

Cluster Name: webcluster
Corosync Nodes:
 10.121.100.101 10.121.100.102
Pacemaker Nodes:
 pm-node1 pm-node2

Resources:
 Master: Services-master
  Meta Attrs: failure-timeout=10s
  Group: Services
   Meta Attrs: migration-threshold=1
   Resource: Test (class=ocf provider=scooty type=test)
    Operations: start interval=0s timeout=20 (Test-start-interval-0s)
                stop interval=0s timeout=20 (Test-stop-interval-0s)
                monitor interval=10 role=Master timeout=20 (Test-monitor-interval-10)
                monitor interval=11 role=Slave timeout=20 (Test-monitor-interval-11)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 migration-threshold: 1
 failure-timeout: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: webcluster
 dc-version: 1.1.13-10.el7_2.4-44eb2dd
 have-watchdog: false
 last-lrm-refresh: 1475145002
 no-quorum-policy: ignore
 start-failure-is-fatal: false
 stonith-enabled: false

Best Answer

Depending on the type of failure, failure-timeout might not be enough to clean it up. Failures of start and stop operations are considered "fatal" and are not automatically cleaned up by failure-timeout.
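
To tell which kind of failure was recorded, it can help to look at the fail counts and the failed actions before changing anything. A minimal sketch, assuming the resource ID Test from the question and the pcs 0.9 command syntax that ships with CentOS 7 (newer pcs releases word the failcount subcommand differently). The run and show_failures helpers and the DRY_RUN variable are hypothetical, added here so that the commands are only printed unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Hypothetical helper: print each command instead of running it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the failures recorded for one resource (ID assumed from the question: Test).
show_failures() {
    rsc="${1:-Test}"
    # One-shot cluster status that also lists per-resource fail counts.
    run crm_mon -1 --failcounts
    # Fail count of the given resource on each node (pcs 0.9 syntax).
    run pcs resource failcount show "$rsc"
}

show_failures Test
```

On a real cluster, run it with DRY_RUN=0. The failed actions listed by crm_mon name the operation (start, stop, or monitor) that failed, which is what determines whether failure-timeout can clear it.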

If a start operation is failing, you can set the cluster property start-failure-is-fatal=false so that start failures are no longer treated as fatal. For a stop failure, fencing (STONITH) is the only way to recover.
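
The property change and a manual reset of the fail count could be applied with pcs roughly as follows. This is a sketch under assumptions: the resource ID Test is taken from the question, the syntax is that of pcs 0.9, and the run and apply_recovery_settings helpers with their DRY_RUN variable are hypothetical and only print the commands unless DRY_RUN=0 is set. The cluster-recheck-interval line is also an assumption about this setup: as far as I know, Pacemaker 1.1 only notices an expired failure-timeout when the policy engine runs, which on an otherwise idle cluster happens every cluster-recheck-interval (15 minutes by default), so a 10s timeout may appear to do nothing for much longer than 10 seconds.

```shell
#!/bin/sh
# Hypothetical helper: print each command instead of running it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Apply the settings discussed above (resource ID assumed from the question: Test).
apply_recovery_settings() {
    rsc="${1:-Test}"
    # Count start failures toward migration-threshold instead of treating them as fatal.
    run pcs property set start-failure-is-fatal=false
    # Assumption: re-evaluate the cluster more often so expired failures are noticed sooner.
    run pcs property set cluster-recheck-interval=1min
    # Clear the recorded failures by hand so the resource may be started again.
    run pcs resource cleanup "$rsc"
}

apply_recovery_settings Test
```

The cleanup step is the manual equivalent of what failure-timeout is expected to do automatically; if the resource starts after a cleanup but not on its own, the timeout is most likely not being evaluated when expected.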

Hope that helps.