Networking – Mysterious ‘Fragmentation Required’ Rejections from Gateway VM

Tags: gateway, linux-networking, networking, vmware-esxi

I've been troubleshooting a severe WAN speed issue. I fixed it, but for the benefit of others:

Using Wireshark, logging, and a simplified config, I narrowed it down to some strange behaviour from a gateway doing DNAT to servers on the internal network. The gateway (a CentOS box) and the servers are all running on the same VMware ESXi 5 host (and this turns out to be significant).

Here is the sequence of events that happened – quite consistently – when I attempted to download a file from an HTTP server behind the DNAT, using a test client connected directly to the WAN side of the gateway (bypassing the actual Internet connection normally used here):

  1. The usual TCP connection establishment (SYN, SYN ACK, ACK) proceeds normally; the gateway remaps the server's IP correctly both ways.

  2. The client sends a single TCP segment with the HTTP GET and this is also DNATted correctly to the target server.

  3. The server sends a 1460 byte TCP segment with the 200 response and part of the file, via the gateway. The size of the frame on the wire is 1514 bytes – 1500 in payload. This segment should cross the gateway but doesn't.

  4. The server sends a second 1460 byte TCP segment, continuing the file, via the gateway. Again, the link payload is 1500 bytes. This segment doesn't cross the gateway either and is never accounted for.

  5. The gateway sends an ICMP Type 3 Code 4 (destination unreachable – fragmentation needed) packet back to the server, citing the packet sent in Event 3. The ICMP packet indicates the next hop MTU is 1500. This appears to be nonsensical, as the network is 1500-byte clean and the link payloads in 3 and 4 already were within the stated 1500 byte limit. The server understandably ignores this response. (Originally, ICMP had been dropped by an overzealous firewall, but this was fixed.)

  6. After a considerable delay (and in some configurations, duplicate ACKs from the server), the server decides to resend the segment from Event 3, this time alone. Apart from the IP identification field and checksum, the frame is identical to the one in Event 3. They are the same length and the new one still has the Don't Fragment flag set. However, this time, the gateway happily passes the segment on to the client – in one piece – instead of sending an ICMP reject.

  7. The client ACKs this, and the transfer continues, albeit excruciatingly slowly, since subsequent segments go through roughly the same pattern of being rejected, timing out, being resent and then getting through.
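The sizes quoted in Events 3 to 5 can be checked with a little arithmetic. This is a minimal Python sketch, assuming IPv4 and TCP headers without options (20 bytes each) and an untagged Ethernet II header (14 bytes, frame check sequence not captured); the function names are made up for illustration:

```python
# Assumed header sizes: no IP or TCP options, untagged Ethernet II, FCS not captured.
ETH_HEADER = 14
IPV4_HEADER = 20
TCP_HEADER = 20


def sizes(tcp_payload):
    """Return (IP packet length, on-the-wire frame length) for a TCP payload size."""
    ip_packet = tcp_payload + TCP_HEADER + IPV4_HEADER
    frame = ip_packet + ETH_HEADER
    return ip_packet, frame


def needs_fragmentation(ip_packet_len, next_hop_mtu):
    """A router only has to fragment (or reject a DF packet) if it exceeds the MTU."""
    return ip_packet_len > next_hop_mtu


ip_len, frame_len = sizes(1460)
print(ip_len, frame_len)                  # 1500 1514
print(needs_fragmentation(ip_len, 1500))  # False
```

Under these assumptions a 1460-byte segment gives exactly a 1500-byte IP packet, which fits a 1500-byte MTU, so a "fragmentation needed" reply quoting an MTU of 1500 does not match what the capture shows on the wire.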

The client and server work together normally if the client is moved to the LAN so as to access the server directly.

This strange behaviour varies unpredictably based on seemingly irrelevant details of the target server.

For instance, on Server 2003 R2, the 7 MB test file would take over 7 hours to transmit if Windows Firewall was enabled (even if it allowed HTTP and all ICMP). With Windows Firewall disabled, the issue would not appear at all and, paradoxically, the gateway would never send the rejection in the first place. On Server 2008 R2, by contrast, disabling Windows Firewall had no effect whatsoever, but the transfer, while still impaired, was much faster than on Server 2003 R2 with the firewall enabled. (I think this is because 2008 R2 uses smarter timeout heuristics and TCP fast retransmit.)

Even more strangely, the problem would disappear if Wireshark was installed on the target server. As such, to diagnose the issue I had to install Wireshark on a separate VM to watch the LAN-side network traffic (probably a better idea anyway, for other reasons).

The ESXi host is version 5.0 U2.

Best Answer

You can't drop ICMP fragmentation required messages. They're required for pMTU discovery, which is required for TCP to work properly. Please LART the firewall administrator.
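To show what such a message carries, here is a rough Python sketch that pulls the next-hop MTU out of an ICMP Type 3 Code 4 header, using the layout from RFC 1191 (type, code, checksum, two unused bytes, then a 16-bit next-hop MTU). The function name and the sample bytes are illustrative only; the checksum is a zero placeholder and is not verified:

```python
import struct

ICMP_DEST_UNREACHABLE = 3
ICMP_CODE_FRAG_NEEDED = 4


def parse_frag_needed(icmp):
    """Return the next-hop MTU from an ICMP 'fragmentation needed' message, else None."""
    if len(icmp) < 8:
        return None
    # RFC 1191 layout: type, code, checksum, unused (16 bits), next-hop MTU (16 bits)
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type != ICMP_DEST_UNREACHABLE or code != ICMP_CODE_FRAG_NEEDED:
        return None
    return mtu


# Hand-built sample header (checksum left as 0 for illustration, not a real capture).
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1500)
print(parse_frag_needed(sample))  # 1500
```

A real message also includes the IP header and first 8 bytes of the offending packet after these 8 bytes, which is how the sender matches the error to a connection.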

"By the transparency rule, a packet-filtering router acting as a firewall which permits outgoing IP packets with the Don't Fragment (DF) bit set MUST NOT block incoming ICMP Destination Unreachable / Fragmentation Needed errors sent in response to the outbound packets from reaching hosts inside the firewall, as this would break the standards-compliant usage of Path MTU discovery by hosts generating legitimate traffic."
-- Firewall Requirements - RFC 2979 (emphasis in original)

This is a configuration that has been recognized as fundamentally broken for more than a decade. ICMP is not optional.
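Why dropping these messages breaks transfers can be shown with a toy model of Path MTU discovery. This is a deliberately simplified Python sketch, not how any real TCP stack is implemented; the function name, the hop MTU values, and the retry limit are all assumptions chosen for illustration:

```python
def send_with_pmtud(packet_size, hop_mtus, icmp_delivered, max_tries=5):
    """Return the packet size that got through, or None if the path is a black hole.

    Models a sender that always sets Don't Fragment and only shrinks its
    packets when it actually receives an ICMP Type 3 Code 4 message.
    """
    for _ in range(max_tries):
        # First hop whose MTU is too small for the packet, if any.
        bottleneck = next((mtu for mtu in hop_mtus if mtu < packet_size), None)
        if bottleneck is None:
            return packet_size   # every hop forwards it unfragmented
        if not icmp_delivered:
            continue             # error is filtered; sender retransmits the same size
        packet_size = bottleneck  # sender adopts the reported next-hop MTU
    return None


print(send_with_pmtud(1500, [1500, 1400, 1500], icmp_delivered=True))   # 1400
print(send_with_pmtud(1500, [1500, 1400, 1500], icmp_delivered=False))  # None
```

With the ICMP error delivered, the sender converges on a size that fits; with it filtered, the sender keeps retransmitting packets that can never get through.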
