Corosync resource keeps moving after failed node is online again

corosync pacemaker

I have a 2-node corosync cluster managing a virtual IP and an asterisk resource. When I shut down one of the nodes (server2) intentionally (as a disaster recovery test), the first node (server1) takes over asterisk instantly.

However, once server2 has booted again, the asterisk instance is no longer running on server1, and it isn't running on server2 either. I would prefer that it always stays on the node where it is currently running. The virtual_ip resource didn't move, which is fine.

I've tried setting stickiness parameters on both nodes (same value), but that doesn't seem to help:

pcs resource meta asterisk resource-stickiness=100

and

pcs resource meta asterisk resource-stickiness=INFINITY
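
To verify that the meta attribute actually took effect, something like the following should work (the exact pcs subcommand depends on the version; pcs resource show on 0.9.x, pcs resource config on newer releases):

pcs resource show asterisk
crm_resource --resource asterisk --get-parameter resource-stickiness --meta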

Also, the property "start-failure-is-fatal" is set to false, so that the cluster keeps retrying whatever is keeping server2 from starting asterisk, but that doesn't have any effect either. Setting quorum-related properties also has no effect:

pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs property set start-failure-is-fatal=false
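
The effective cluster properties can be double-checked with pcs property show (or pcs property list --all to also see the defaults):

pcs property show
pcs property list --all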

Here is my corosync configuration:

totem {
  version: 2
  cluster_name: astcluster
  secauth: off
  join: 30
  consensus: 300
  vsftype: none
  max_messages: 20
  rrp_mode: none
  interface {
    member {
      memberaddr: 192.168.83.133
    }
    member {
      memberaddr: 192.168.83.135
    }
    ringnumber: 0
    bindnetaddr: 192.168.83.0
    mcastport: 5405
  }
  transport: udpu
}

nodelist {
  node {
    ring0_addr: astp5.internal.uzgent.be
    nodeid: 1
    quorum_votes: 1
  }
  node {
    ring0_addr: astp6.internal.uzgent.be
    nodeid: 2
    quorum_votes: 1
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
  wait_for_all: 0
  expected_votes: 1
}


logging {
  to_logfile: yes
  logfile: /var/log/cluster/corosync.log
  to_syslog: no
  debug: off
  timestamp: on
}

Can someone tell me how to handle this?

EDIT: I've attached the Pacemaker configuration (CIB).

<cib crm_feature_set="3.0.10" validate-with="pacemaker-2.3" epoch="43" num_updates="0" admin_epoch="0" cib-last-written="Thu Feb 23 14:56:07 2017" update-origin="server2" update-client="crm_attribute" update-user="root" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.13-10.el7-44eb2dd"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="astcluster"/>
        <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1487858117"/>
        <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="server1"/>
      <node id="2" uname="server2">
        <instance_attributes id="nodes-2"/>
      </node>
    </nodes>
    <resources>
      <primitive class="ocf" id="virtual_ip" provider="heartbeat" type="IPaddr2">
        <instance_attributes id="virtual_ip-instance_attributes">
          <nvpair id="virtual_ip-instance_attributes-ip" name="ip" value="192.168.83.137"/>
          <nvpair id="virtual_ip-instance_attributes-cidr_netmask" name="cidr_netmask" value="32"/>
        </instance_attributes>
        <operations>
          <op id="virtual_ip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
          <op id="virtual_ip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
          <op id="virtual_ip-monitor-interval-30s" interval="30s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="ocf" id="asterisk" provider="heartbeat" type="asterisk">
        <instance_attributes id="asterisk-instance_attributes">
          <nvpair id="asterisk-instance_attributes-user" name="user" value="asterisk"/>
          <nvpair id="asterisk-instance_attributes-group" name="group" value="asterisk"/>
        </instance_attributes>
        <meta_attributes id="asterisk-meta_attributes">
          <nvpair id="asterisk-meta_attributes-is-managed" name="is-managed" value="true"/>
          <nvpair id="asterisk-meta_attributes-expected-quorum-votes" name="expected-quorum-votes" value="1"/>
          <nvpair id="asterisk-meta_attributes-resource-stickiness" name="resource-stickiness" value="INFINITY"/>
          <nvpair id="asterisk-meta_attributes-default-resource-stickiness" name="default-resource-stickiness" value="1000"/>
        </meta_attributes>
        <operations>
          <op id="asterisk-start-interval-0s" interval="0s" name="start" timeout="20"/>
          <op id="asterisk-stop-interval-0s" interval="0s" name="stop" timeout="20"/>
          <op id="asterisk-monitor-interval-60s" interval="60s" name="monitor" timeout="30"/>
        </operations>
      </primitive>
    </resources>
    <constraints/>
  </configuration>
  <status>
    <node_state id="2" uname="server2" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <transient_attributes id="2">
        <instance_attributes id="status-2">
          <nvpair id="status-2-shutdown" name="shutdown" value="0"/>
          <nvpair id="status-2-probe_complete" name="probe_complete" value="true"/>
          <nvpair id="status-2-last-failure-asterisk" name="last-failure-asterisk" value="1487845098"/>
        </instance_attributes>
      </transient_attributes>
      <lrm id="2">
        <lrm_resources>
          <lrm_resource id="virtual_ip" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="virtual_ip_last_0" operation_key="virtual_ip_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="7:59:7:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" transition-magic="0:7;7:59:7:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" on_node="server2" call-id="5" rc-code="7" op-status="0" interval="0" last-run="1487845098" last-rc-change="1487845098" exec-time="68" queue-time="0" op-digest="7ea42b08d9415fb0dbbde15977130035"/>
          </lrm_resource>
          <lrm_resource id="asterisk" type="asterisk" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="asterisk_last_failure_0" operation_key="asterisk_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="6:79:7:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" transition-magic="0:0;6:79:7:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" on_node="server2" call-id="22" rc-code="0" op-status="0" interval="0" last-run="1487858116" last-rc-change="1487858116" exec-time="47" queue-time="0" op-digest="337a6295a6acbbd18616daf0206c3394"/>
            <lrm_rsc_op id="asterisk_last_0" operation_key="asterisk_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="9:82:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" transition-magic="0:0;9:82:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" on_node="server2" call-id="25" rc-code="0" op-status="0" interval="0" last-run="1487858128" last-rc-change="1487858128" exec-time="1036" queue-time="0" op-digest="337a6295a6acbbd18616daf0206c3394" op-secure-params=" user " op-secure-digest="cf2187fe855553314a7a6bc14ff18918"/>
            <lrm_rsc_op id="asterisk_monitor_60000" operation_key="asterisk_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="10:80:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" transition-magic="0:0;10:80:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" on_node="server2" call-id="23" rc-code="0" op-status="0" interval="60000" last-rc-change="1487858116" exec-time="47" queue-time="0" op-digest="ce41237c2113b12d51aaed8af6b8a09f" op-secure-params=" user " op-secure-digest="cf2187fe855553314a7a6bc14ff18918"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
    <node_state id="1" uname="server1" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
          <nvpair id="status-1-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
      <lrm id="1">
        <lrm_resources>
          <lrm_resource id="virtual_ip" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="virtual_ip_last_0" operation_key="virtual_ip_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="7:6:0:b7b79be6-bb63-4f56-b425-fc84e90ef38b" transition-magic="0:0;7:6:0:b7b79be6-bb63-4f56-b425-fc84e90ef38b" on_node="server1" call-id="10" rc-code="0" op-status="0" interval="0" last-run="1487838677" last-rc-change="1487838677" exec-time="47" queue-time="0" op-digest="7ea42b08d9415fb0dbbde15977130035"/>
            <lrm_rsc_op id="virtual_ip_monitor_30000" operation_key="virtual_ip_monitor_30000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="7:7:0:b7b79be6-bb63-4f56-b425-fc84e90ef38b" transition-magic="0:0;7:7:0:b7b79be6-bb63-4f56-b425-fc84e90ef38b" on_node="server1" call-id="12" rc-code="0" op-status="0" interval="30000" last-rc-change="1487838679" exec-time="34" queue-time="0" op-digest="e81e10104a53c2ccab94a6935229ae08"/>
          </lrm_resource>
          <lrm_resource id="asterisk" type="asterisk" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="asterisk_last_0" operation_key="asterisk_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="10:82:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" transition-magic="0:0;10:82:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" on_node="server1" call-id="77" rc-code="0" op-status="0" interval="0" last-run="1487858129" last-rc-change="1487858129" exec-time="2517" queue-time="0" op-digest="337a6295a6acbbd18616daf0206c3394" op-secure-params=" user " op-secure-digest="cf2187fe855553314a7a6bc14ff18918"/>
            <lrm_rsc_op id="asterisk_monitor_60000" operation_key="asterisk_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="11:82:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" transition-magic="0:0;11:82:0:8e6dd4d3-49ed-4e78-92b9-ec440e36f949" on_node="server1" call-id="78" rc-code="0" op-status="0" interval="60000" last-rc-change="1487858132" exec-time="46" queue-time="0" op-digest="ce41237c2113b12d51aaed8af6b8a09f" op-secure-params=" user " op-secure-digest="cf2187fe855553314a7a6bc14ff18918"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
  </status>
</cib>

EDIT: I also tried adding colocation and ordering constraints:

[root@server1]# pcs constraint show
Location Constraints:
Ordering Constraints:
  Resource Sets:
    set virtual_ip asterisk
Colocation Constraints:
  virtual_ip with asterisk (score:INFINITY)
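
For reference, constraints equivalent to the ones shown above would be created with something along these lines (assuming pcs; the ordering resource set corresponds to constraint order set):

pcs constraint colocation add virtual_ip with asterisk INFINITY
pcs constraint order set virtual_ip asterisk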

EDIT: Found a solution! I had to add the following parameter to the asterisk resource: on-fail=fence
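
For reference, on-fail is an operation option; set on the monitor operation it would look roughly like this with pcs (note that on-fail=fence normally requires fencing/STONITH to be configured in order to take effect):

pcs resource update asterisk op monitor interval=60s timeout=30 on-fail=fence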

Best Answer

Did you happen to move the resource to that node before you stopped it? If you move a resource manually, the cluster resource manager creates a location constraint behind the scenes that pins the resource to that node. I don't see such a constraint in your configuration XML, but it could be that you captured the configuration before you moved the resource.

I had this very same problem. After I used 'crm resource unmove <resource>' to give control back to the cluster resource manager, recovering a node no longer resulted in the resource being moved back to its original node.
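
With pcs, the equivalent is to list the constraints with their ids and remove the one the move created; such constraints typically get an id like cli-prefer-<resource> or cli-ban-<resource>-on-<node> (the actual id will appear in the output). Roughly:

pcs constraint show --full
pcs constraint remove <constraint-id>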

See the following documentation for details.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/_manually_moving_resources_around_the_cluster.html