Linux – Pinging Virtual IP for Linux HA cluster from a different subnet does not work

heartbeat high-availability linux pacemaker

I have set up a Linux cluster with Corosync/Pacemaker. The two cluster nodes are on the same subnet and share a virtual IP. Machines on that subnet can ping the virtual IP "135.121.192.104" successfully.

However, when I try to ping the virtual IP "135.121.192.104" from a machine on a different subnet, I get no response. The other machines reside on the subnet 135.121.196.x.

On my machines, I have the following subnet mask in my ifcfg-eth0 file:

NETMASK=255.255.254.0
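
For reference, a /23 mask (255.255.254.0) puts the virtual IP in the range 135.121.192.0 – 135.121.193.255, so the 135.121.196.x machines really are on a different network and have to go through the gateway. This can be double-checked with ipcalc (assuming the RHEL initscripts version of ipcalc; output format may differ on other distributions), which should print something like:

ipcalc -n -b 135.121.192.104 255.255.254.0
NETWORK=135.121.192.0
BROADCAST=135.121.193.255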

Below is the output of crm configure show:

[root@h-008 crm]# crm configure show
node h-008 \
        attributes standby="off"
node h-009 \
        attributes standby="off"
primitive GAXClusterIP ocf:heartbeat:IPaddr2 \
        params ip="135.121.192.104" cidr_netmask="23" \
        op monitor interval="30s" clusterip_hash="sourceip"
clone GAXClusterIP2 GAXClusterIP \
        meta globally-unique="true" clone-node-max="2"
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

and here is the crm_mon status output:

[root@h-009 crm]# crm_mon status --one-shot
non-option ARGV-elements: status
============
Last updated: Thu Jun 23 08:12:21 2011
Stack: openais
Current DC: h-008 - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ h-008 h-009 ]

 Clone Set: GAXClusterIP2 (unique)
     GAXClusterIP:0     (ocf::heartbeat:IPaddr2):       Started h-008
     GAXClusterIP:1     (ocf::heartbeat:IPaddr2):       Started h-009

I am new to Linux HA cluster setup and have been unable to find the root cause of this issue. Is there any configuration I can check to diagnose the problem?

Additional comments:

Below is the output of "route -n":

[root@h-008 crm]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
135.121.192.0   0.0.0.0         255.255.254.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
0.0.0.0         135.121.192.1   0.0.0.0         UG    0      0        0 eth0

and below is the traceroute output from the cluster machine to the machine outside the cluster:

[root@h-008 crm]# traceroute 135.121.196.122
traceroute to 135.121.196.122 (135.121.196.122), 30 hops max, 40 byte packets
 1  135.121.192.1 (135.121.192.1)  6.750 ms  6.967 ms  7.634 ms
 2  135.121.205.225 (135.121.205.225)  12.296 ms  14.385 ms  16.101 ms
 3  s2h-003.hpe.test.com (135.121.196.122)  0.172 ms  0.170 ms  0.170 ms

and below is the traceroute output from the machine outside the cluster to the virtual IP 135.121.192.104:

[root@s2h-003 ~]# traceroute 135.121.192.104
traceroute to 135.121.192.104 (135.121.192.104), 30 hops max, 40 byte packets
 1  135.121.196.1 (135.121.196.1)  10.558 ms  10.895 ms  11.556 ms
 2  135.121.205.226 (135.121.205.226)  11.016 ms  12.797 ms  14.152 ms
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  *

but when I run a traceroute to the real IP address of one of the cluster nodes, it succeeds:

[root@s2h-003 ~]# traceroute 135.121.192.102
traceroute to 135.121.192.102 (135.121.192.102), 30 hops max, 40 byte packets
 1  135.121.196.1 (135.121.196.1)  4.994 ms  5.315 ms  5.951 ms
 2  135.121.205.226 (135.121.205.226)  3.816 ms  6.016 ms  7.158 ms
 3  h-009.msite.pr.hpe.test.com (135.121.192.102)  0.236 ms  0.229 ms  0.216 ms

Best Answer

You're making the mistake of assuming your cluster configuration has anything to do with the issue just because it is new territory for you. All the cluster software does is manage (and monitor) resources, in this case an IP address that it configures on one of the hosts in the cluster. You could just as easily remove the whole cluster configuration, bring the IP address up manually on one of the nodes, and you would see exactly the same problem.
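
As a quick way to prove that to yourself (a sketch only; the resource name comes from your config, and I am assuming eth0 carries the 135.121.192.x network on that node), stop the clone and plumb the address by hand:

# Stop the clone so Pacemaker is out of the picture:
crm resource stop GAXClusterIP2

# Add the virtual IP manually on one node, then retest from a 135.121.196.x host:
ip addr add 135.121.192.104/23 dev eth0

# Clean up afterwards and hand the address back to the cluster:
ip addr del 135.121.192.104/23 dev eth0
crm resource start GAXClusterIP2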

Clearly, if you can reach the IP from the same network but not from another, there is a routing problem. Check your router configuration.
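
One way to narrow it down (again a sketch, assuming tcpdump is installed and eth0 is the cluster-facing interface) is to watch the node that currently holds the virtual IP while pinging from the 135.121.196.x machine:

# On the cluster node holding 135.121.192.104:
tcpdump -ni eth0 icmp and host 135.121.196.122

# Meanwhile, from the 135.121.196.x machine:
ping 135.121.192.104

If no echo requests show up, the packets are being dropped before they reach the node (router/ARP side); if requests arrive but no replies go out, the problem is on the node itself (for example a local firewall or reverse-path filtering).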

BTW, disabling STONITH in a cluster is a one-way ticket to data loss or corruption. I hope you have only disabled it for testing.
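
When this goes anywhere near production, turning it back on might look roughly like the following (a sketch only; the external/ipmi agent and its parameter values are placeholders, not from your setup, so use whatever fencing device your hardware actually provides):

# Hypothetical fencing resource for h-008 via its IPMI/BMC interface:
crm configure primitive st-h-008 stonith:external/ipmi \
        params hostname="h-008" ipaddr="192.0.2.10" userid="admin" passwd="secret" interface="lan"

# Re-enable STONITH cluster-wide:
crm configure property stonith-enabled="true"

You would normally add a second fencing resource for h-009 and location constraints so a node never runs its own fencing agent.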