Packet Loss – Fix Massive Packet Loss When Servers Go Online

cisco, linux, packetloss, ubuntu

This is a continuation of "ubuntu server, ssh, write failed: broken pipe". I'm starting a new question because I don't believe the issue is isolated to SSH or Ubuntu.

I've got two brand-new servers (Dell PowerEdge R715 and R210) with Ubuntu Server 10.04 64-bit installed on them. I'm running a stack of Cisco 3750 switches with two Juniper SRX240 firewall/routers. The setup is basically router on a stick with 3 VLANs: one internal, one DMZ, and one storage network (iSCSI), all on the same stack. No layer 3 switching is being done on the Cisco stack, and DMZ is completely isolated from the stack on a different switch.

There are about 10 other Dell PowerEdge servers on this same network (and stack) that have been running for years without an issue. Most of them are running SLES 10 or openSUSE, but one is running Ubuntu Server 10.04 64-bit. I've unplugged all NICs on these new servers except those going to our internal VLAN.

If I boot either (or both) machines and let them sit for about ten minutes, we start getting up to 20% packet loss from other machines on the network and up to 40-50% packet loss from the offending servers.

Does anyone have an idea as to why this might be happening or what I can do to troubleshoot the issue? I don't mind wiping these boxes if I have to; there isn't any production data on them yet.

Best Answer

I'd start by looking at the switch log buffers (or the syslog you're exporting them to, if you have one).

I've seen problems in the past with multi-NIC Linux machines responding to ARP inappropriately (as in "not on the expected interface"), and even more problems with blades in a blade-server chassis where there were multiple VLANs attached to the switch but no (working) way of imposing VLANs on the actual blade switch. This ought to show up as MAC-related complaints in the logs.
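If the switch logs are exported to a syslog file, a quick filter can pull out the MAC-related lines. The following is only a rough sketch: the keyword list is an assumption (Catalyst switches commonly log MAC moves with a `MACFLAP` mnemonic, but the exact wording depends on platform and IOS version), and `mac_related_lines` is a hypothetical helper name.

```python
import re
from typing import Iterable, List

# Assumed keywords; adjust them to whatever your switches actually emit.
MAC_KEYWORDS = re.compile(r"MACFLAP|flapping|duplicate", re.IGNORECASE)


def mac_related_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention any of the assumed MAC keywords."""
    return [line.rstrip("\n") for line in lines if MAC_KEYWORDS.search(line)]
```

Feeding it an open file object for the exported log (the path is site-specific) gives the matching lines, which can then be compared against the MAC addresses of the new servers.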

As a second step, do things get better if you enable arp_filter on all interfaces on your new servers?
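On Linux, `arp_filter` is a per-interface setting exposed under `/proc/sys/net/ipv4/conf/<interface>/arp_filter`. A minimal sketch for checking and enabling it is below; the function names are hypothetical, writing to `/proc` needs root, and the `conf_dir` parameter exists only so the code can be tried against a scratch directory first.

```python
from pathlib import Path
from typing import Dict

# Standard location of per-interface IPv4 settings on Linux.
CONF_DIR = Path("/proc/sys/net/ipv4/conf")


def read_arp_filter(conf_dir: Path = CONF_DIR) -> Dict[str, str]:
    """Map each interface name (plus 'all' and 'default') to its arp_filter value."""
    values = {}
    for entry in sorted(conf_dir.iterdir()):
        setting = entry / "arp_filter"
        if setting.is_file():
            values[entry.name] = setting.read_text().strip()
    return values


def enable_arp_filter(conf_dir: Path = CONF_DIR) -> None:
    """Write 1 to every arp_filter file found (requires root on a real system)."""
    for entry in conf_dir.iterdir():
        setting = entry / "arp_filter"
        if setting.is_file():
            setting.write_text("1\n")
```

Calling `read_arp_filter()` before and after `enable_arp_filter()` shows whether the change took effect. Values written this way do not survive a reboot; for a persistent change, entries such as `net.ipv4.conf.all.arp_filter = 1` would go in `/etc/sysctl.conf`.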