I'm using Pacemaker 1.1.13 and Corosync 2.3.4 on CentOS 7.
I have a problem with a Master/Slave resource. These are the meta attributes on my resource:
migration-threshold=1
failure-timeout=10s
but when the resource goes down, there is only one attempt to start it. The documentation says that failure-timeout=10s should expire the fail count after 10 seconds, but that does not happen, so the resource never starts.
Do you know anything about this problem? Maybe I'm doing something wrong? I'm including my 'pcs config' output below:
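For context, meta attributes like these would typically be set with pcs roughly as follows (a sketch only; the resource and group names match my configuration below, and the syntax assumes the pcs version shipped with CentOS 7):

```shell
# Limit the group to a single failure before it migrates away
# (hypothetical reconstruction of how the config below was created)
pcs resource meta Services migration-threshold=1

# Ask the cluster to expire recorded failures after 10 seconds
pcs resource meta Services-master failure-timeout=10s

# Inspect the resulting resource configuration
pcs resource show Services-master
```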
Cluster Name: webcluster
Corosync Nodes:
 10.121.100.101 10.121.100.102
Pacemaker Nodes:
 pm-node1 pm-node2
Resources:
 Master: Services-master
  Meta Attrs: failure-timeout=10s
  Group: Services
   Meta Attrs: migration-threshold=1
   Resource: Test (class=ocf provider=scooty type=test)
    Operations: start interval=0s timeout=20 (Test-start-interval-0s)
                stop interval=0s timeout=20 (Test-stop-interval-0s)
                monitor interval=10 role=Master timeout=20 (Test-monitor-interval-10)
                monitor interval=11 role=Slave timeout=20 (Test-monitor-interval-11)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Resources Defaults:
 migration-threshold: 1
 failure-timeout: 10
Operations Defaults:
 No defaults set
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: webcluster
 dc-version: 1.1.13-10.el7_2.4-44eb2dd
 have-watchdog: false
 last-lrm-refresh: 1475145002
 no-quorum-policy: ignore
 start-failure-is-fatal: false
 stonith-enabled: false
Best Answer
Depending on the type of failure, failure-timeout might not be enough to clean it up. Start and stop operation failures are considered "fatal" and will not be automatically cleaned up by failure-timeout. If you're having issues with a start operation failing, you can set the cluster property start-failure-is-fatal=false. Fencing/STONITH devices are the only way to recover from a stop failure. Hope that helps.
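A sketch of the commands involved (command names assume the pcs version on CentOS 7; adjust for your environment):

```shell
# Make failed start operations non-fatal, so failure-timeout can expire them
pcs property set start-failure-is-fatal=false

# Check the recorded fail count for the resource
pcs resource failcount show Test

# If the resource is still blocked, clear its failure history manually
pcs resource cleanup Test
```

Note that until a failed start is non-fatal (or cleaned up manually), the fail count is effectively infinite, so failure-timeout alone never unblocks the resource.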