Odd one-way ping issue that I can't wrap my head around

networking troubleshooting windows-server-2008-r2

Long time lurker, but today I encountered an odd problem that will bug me until resolution 🙂

It seems to be presenting as a one-way ping issue from one server to a failover cluster.

All machines are running Windows Server 2008 R2 with IPv6 disabled. The Windows Firewall service is disabled.

Lay of the land:

Report Server – VMware virtual machine using an E1000 NIC. Nothing special – IP, subnet, gateway and routing table all appear sane.

SQL 2008 R2 Active/Passive failover cluster – Each node has 7 configured NICs: 3 for iSCSI, and the remaining 4 bound to 2 IPs with BACS. One NIC team is used for local traffic and the other as part of the failover cluster. The failover cluster has a VIP.

Problem:

All was working fine last week. All machines are on the same subnet. Today, the report server couldn't ping the VIP of the failover cluster. It could ping both nodes without issue, using both non-storage IP addresses.

The SQL failover cluster could ping the report server without any issue.

I can ping the SQL VIP from any other machine, which vindicates the VIP itself in my mind.
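For anyone retracing the symptoms above, here is a minimal Python sketch of the same reachability check, run from each machine in turn. The helper names (`build_ping_command`, `can_ping`, `reachability`) and the addresses in the usage comment are hypothetical placeholders, not values from this environment. It only looks at the exit status of the system `ping`, which is a rough signal at best.

```python
import platform
import subprocess


def build_ping_command(host, system=None):
    """Return a single-echo ping command for the given OS name."""
    system = system or platform.system()
    if system == "Windows":
        # -n 1: send one echo request, -w 1000: wait up to 1000 ms for a reply
        return ["ping", "-n", "1", "-w", "1000", host]
    # Linux iputils ping: -c 1: one echo request, -W 1: wait up to 1 second
    return ["ping", "-c", "1", "-W", "1", host]


def can_ping(host, run=subprocess.run):
    """True if a single ping to host exits with status 0.

    Caveat: on Windows the exit status can be 0 even when the reply is
    "Destination host unreachable", so treat this as a first-pass check.
    """
    result = run(build_ping_command(host), capture_output=True)
    return result.returncode == 0


def reachability(targets, probe=can_ping):
    """Map each label to True/False using the supplied probe function."""
    return {label: probe(address) for label, address in targets.items()}


# Example usage (placeholder addresses, replace with the VIP and node IPs):
# print(reachability({"sql-vip": "192.0.2.50", "node1": "192.0.2.11"}))
```

Running the same script on the report server, on a cluster node, and on an unrelated machine gives a small table of who can reach whom, which makes a one-way failure like this one easy to spot.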

The Band-Aid

I tried rebooting the report server in case TCP/IP was misbehaving, to no avail. What ended up working was changing the report server's IP address. As far as I know, there are no host rules in place on the switch (Catalyst 3750).

What could cause this one? I'd say the ARP table was cleared after the report server rebooted, and the IP address shouldn't have become stale on the DB cluster… looking for someone with more networking know-how than I 🙂

Best Answer

Facepalm.

I know what caused it, although I may need help with the explanation. In troubleshooting tonight, I spun up another server and had it assume the report server's IP address. This brand new server running Windows Server 2008 R2 could NOT ping the VIP.

Well, that's strange. And again, it could ping either of the nodes by name. I looked at the ARP tables, and they seemed sane. I hopped on the active DB node to check the MAC address and noticed that the checkbox for IPv6 was ticked. I unchecked it, and it instantly resolved the problem.
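The ARP comparison described above (what MAC a machine has cached for the VIP versus the MAC on the active node) can be scripted. This is only a sketch: it assumes the usual Windows `arp -a` layout of dash-separated MACs in three columns, and `parse_arp_table` and the `SAMPLE` text are made up for illustration, not captured from these servers.

```python
import re

# One table row: IPv4 address, dash-separated MAC, entry type (dynamic/static)
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+(\w+)\s*$"
)


def parse_arp_table(text):
    """Return {ip: (mac_lowercase, entry_type)} from `arp -a` style text."""
    entries = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line)
        if match:  # header and "Interface:" lines do not match and are skipped
            ip, mac, kind = match.groups()
            entries[ip] = (mac.lower(), kind)
    return entries


# Invented sample in the assumed Windows `arp -a` format (placeholder values)
SAMPLE = """\
Interface: 192.0.2.20 --- 0xb
  Internet Address      Physical Address      Type
  192.0.2.1             00-1a-2b-3c-4d-5e     dynamic
  192.0.2.50            00-50-56-AA-BB-CC     dynamic
  192.0.2.255           ff-ff-ff-ff-ff-ff     static
"""

table = parse_arp_table(SAMPLE)
print(table.get("192.0.2.50"))
```

Feeding it the real `arp -a` output from the report server and from a machine that can reach the VIP, then comparing the entry for the VIP against the MAC shown on the active node, would show whether both machines resolve the VIP to the same adapter.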

Question becomes - why? I missed the IPv6 in the configuration of the cluster, that's for sure... but this cluster has been in production for 3+ months with no apparent issues before today. This node has been the active node for more than 3 weeks.

Does anyone have experience or an explanation of how something so good became so bad? :-)
