XenServer – Bonded NICs cause random packet drops

networking xenserver

I'm running an HP server with XenServer 5.6.

I've bonded 2 of my 4 NICs together (NIC0 and NIC1).

Now I'm seeing large bursts of packet loss at random (usually 10-15 dropped packets, but sometimes there are no ping replies at all until I pull out both cables).

Neither of the NICs seems broken, because if I connect only one of the two cables it works fine. No loss at all.

64 bytes from 192.168.110.20: icmp_seq=9191 ttl=64 time=7.685 ms
64 bytes from 192.168.110.20: icmp_seq=9192 ttl=64 time=6.681 ms
64 bytes from 192.168.110.20: icmp_seq=9193 ttl=64 time=1.053 ms
Request timeout for icmp_seq 9194
Request timeout for icmp_seq 9195
Request timeout for icmp_seq 9196
Request timeout for icmp_seq 9197
Request timeout for icmp_seq 9198
Request timeout for icmp_seq 9199
Request timeout for icmp_seq 9200
Request timeout for icmp_seq 9201
Request timeout for icmp_seq 9202
Request timeout for icmp_seq 9203
64 bytes from 192.168.110.20: icmp_seq=9204 ttl=64 time=14.665 ms
64 bytes from 192.168.110.20: icmp_seq=9205 ttl=64 time=1.275 ms
64 bytes from 192.168.110.20: icmp_seq=9206 ttl=64 time=3.090 ms

And not long after:

Request timeout for icmp_seq 9252
Request timeout for icmp_seq 9253
Request timeout for icmp_seq 9254
Request timeout for icmp_seq 9255
Request timeout for icmp_seq 9256
Request timeout for icmp_seq 9257
Request timeout for icmp_seq 9258
Request timeout for icmp_seq 9259
Request timeout for icmp_seq 9260
Request timeout for icmp_seq 9261
Request timeout for icmp_seq 9262

(repeated about 50 times)

And now it's not even coming up again. I only get loss.

I did not pull out any cable. I did not touch the machine…

NIC lights keep flashing and Xen reports both NICs (and BOND0+1) as connected.
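To cross-check what Xen reports, the per-NIC link state can be read directly on the host. The sketch below assumes the bond is backed by the Linux bonding driver and exposes its status at `/proc/net/bonding/bond0`; both the path and the bond name are assumptions and may differ on a given XenServer install. It pulls out only the `Slave Interface`, `MII Status` and `Link Failure Count` fields.

```python
import sys


def parse_bond_status(text):
    """Parse bonding-driver status text into {slave: {"mii": ..., "failures": ...}}."""
    slaves = {}
    current = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            # Start of a new per-NIC section.
            current = value
            slaves[current] = {"mii": None, "failures": None}
        elif current is not None and key == "MII Status":
            slaves[current]["mii"] = value
        elif current is not None and key == "Link Failure Count":
            slaves[current]["failures"] = int(value)
    return slaves


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python bondcheck.py /proc/net/bonding/bond0
    with open(sys.argv[1]) as fh:
        for name, info in parse_bond_status(fh.read()).items():
            print(name, info["mii"], "link failures:", info["failures"])
```

If the link failure count stays at zero while pings are being lost, the NICs and cables are less likely to be the cause, and the switch side becomes the more interesting place to look.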

Unplugging either of the two cables (or both) doesn't seem to solve my problem either. It keeps dropping a lot of packets until, all of a sudden, it replies to pings again.

Any clue what's happening?

Odd thing is it can run fine for 15-30 minutes, then all of a sudden I get these huge packet loss 'phases'.
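To put numbers on these loss phases, a saved ping log can be scanned for runs of consecutive timeouts. This is a minimal sketch, assuming the log uses the same line format as the output above (`icmp_seq=N` for replies, `Request timeout for icmp_seq N` for losses); the function name and the log file argument are hypothetical.

```python
import re
import sys

# Line formats taken from the ping output shown above.
REPLY_RE = re.compile(r"icmp_seq=(\d+)")
TIMEOUT_RE = re.compile(r"Request timeout for icmp_seq (\d+)")


def loss_bursts(lines):
    """Return (first_seq, last_seq, length) for each run of consecutive timeouts."""
    bursts = []
    start = prev = None
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            seq = int(m.group(1))
            if start is None:
                start = seq
            prev = seq
        elif REPLY_RE.search(line) and start is not None:
            # A reply ends the current burst.
            bursts.append((start, prev, prev - start + 1))
            start = prev = None
    if start is not None:
        # The log ended in the middle of a burst.
        bursts.append((start, prev, prev - start + 1))
    return bursts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python bursts.py ping.log
    with open(sys.argv[1]) as fh:
        for first, last, length in loss_bursts(fh):
            print(f"lost seq {first}-{last} ({length} packets)")
```

With one ping per second, the gap between the sequence numbers of successive bursts gives the interval between loss phases in seconds, which makes it easier to tell whether they recur on a fixed timer.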

By the way, during this testing phase both NICs are connected to the same switch.

And yes, other services go down too, not only ICMP.

Kind regards,
Yeri

Best Answer

It seems the Cisco switch was causing the issues (perhaps some MAC address security feature was turned on).

I'm now using two HP ProCurve switches (after moving from the office to the datacenter), and it seems to be working fine.