Server with slow ping response

icmp loopback networking windows-server-2008

Two boxes with identical loads serving the same sites tend to slow down and stop responding to ping. The slow (or intermittent) ping causes our load balancer to think the servers are offline and disable them. There is a third server with identical content that does not have the issue, so I'm fairly confident it's not the sites.

OS is Windows Server 2008. Configuration is a little special: since we're using the Barracuda Networks load balancer in Direct Server Return mode, we've had to configure a number of loopback adapters which "fake" the IP as described here. The physical adapter has forwarding set to enabled as required by 2008 to get the loopback adapters functioning.

Symptoms:

  • When it occurs, ping either times out or drops packets.
  • Fixes seem to be one or more of the following:
    • Logging in via remote desktop.
    • Clearing the DNS cache or the ARP cache (not sure which).
    • Restarting.
  • After one or more of the above, the server seems fine for about 4 hours before acting up again.

Question:

What possible reasons are there for this? What should I try to diagnose this? I haven't ruled anything out. Switch configuration, domain/DNS server: all ideas are welcome.

Sadly, I have very little knowledge of good network administration, so obvious answers are welcome too.

EDIT:

In answer to some of the questions posed.

I have contacted Barracuda and they seem to be of the opinion that the problem is related to the network. I think I agree at this point.

The IP is assigned to a physical interface, not shared between servers. Pinging is done from within the same subnet.

The third box handles all the site load when the other two go down and hasn't had much problem with it, but occasionally it too has trouble. I haven't found a pattern with that one yet.

This evening I sat down with another (more experienced) network guy to look through some of the domain and server configurations. One of the things he found was a bad DNS setup on the domain controllers. They were configured with external DNS servers as their alternates rather than the other DC. We switched them to reference each other for DNS, and added forwarding to the DNS service. We also removed external DNS references from all the web servers.

EDIT 2:

With Wireshark I was able to examine the ICMP traffic during one period of down time. I began this test because I could not reach a shared folder on box 2 from box 1.

Test:

  1. Started capturing traffic on box 2.
  2. Observed that box 2 was seeing and replying to pings from the Barracuda Load Balancer.
  3. Logged into box 1 and pinged box 2.
  4. Observed that box 2 saw but DID NOT reply to pings from box 1.
  5. Observed that box 2 saw but DID NOT reply to pings from the LB for a period of 100 seconds after the first ping from box 1.

So somehow traffic between the two boxes is causing box 2 to crap out on ICMP for a period of time.
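To put numbers on blackouts like the 100-second one in step 5, one option is to log timestamped probe results and extract the stretches with no reply. The following is only a minimal sketch: the function name `outage_windows` and the `(timestamp, replied)` input format are assumptions made for illustration, not something Wireshark or the load balancer produces directly.

```python
from typing import Iterable, List, Tuple


def outage_windows(probes: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of consecutive missed replies.

    `probes` is a time-ordered sequence of (timestamp_seconds, replied).
    `start` is the time of the first missed probe in a run; `end` is the
    time of the first reply after it, or of the last missed probe if the
    log ends while the host is still silent.
    """
    windows: List[Tuple[float, float]] = []
    start = None  # timestamp of the first missed probe in the current run
    last = None   # timestamp of the most recent probe seen
    for ts, replied in probes:
        if not replied and start is None:
            start = ts                    # a new outage begins
        elif replied and start is not None:
            windows.append((start, ts))   # outage ended with this reply
            start = None
        last = ts
    if start is not None:
        windows.append((start, last))     # log ended during an outage
    return windows
```

Comparing the start times of these windows against the times box 1 contacted box 2 would show whether every blackout follows traffic between the two boxes, or only some of them.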

I should note that box 1 was working fine throughout this test, but did not see any requests from box 2. While pinging box 1 from box 2, Wireshark on box 2 showed a message "Destination unreachable (Communication administratively filtered)" from a source IP I did not recognize.

Best Answer

Do you need to use ICMP ping for your server testing? HTTP requests are supported by most load balancers, and are usually a better idea, as your web server can be down while your network card is still up.
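As a rough illustration of what an HTTP check tests compared with a ping, here is a minimal sketch using only the Python standard library. The function name `http_health_check` and the choice to treat any 2xx response as healthy are assumptions for this example; the load balancer's own health-check feature would normally do this for you.

```python
import http.client
import urllib.error
import urllib.request


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, DNS failures,
        # HTTP 4xx/5xx responses, and malformed URLs.
        return False
```

Unlike an ICMP echo, this only succeeds if the web server process itself accepts the connection and returns a page, which is the condition the load balancer actually cares about.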
