Should TCP Checksum Offload Be Disabled?

checksum hardware

In a work sheet for PCs that we deliver to customers, I found instructions to always "Disable TCP Checksum Offload" on the NICs. Via those NICs, the PCs are connected to the customer's LAN, and it is imperative that they work without problems.

The explanation by a seasoned colleague for why this instruction is needed was that "NICs are always buggy and are causing lots of problems because of this."

Now, I could just imagine that they've had such problems a couple of years ago and have been doing things this way ever since.

But since the offload is usually on by default, I could also imagine that the NICs we get these days have improved somewhat. Also, we only choose robust and well-made PCs made by known-good manufacturers.

Does it make sense to always "Disable TCP Checksum Offload" on the NICs of a PC I could buy today? Or could I strike the above work instruction from the work sheet without having to fear outages on the customer's site?

Best Answer

I've just spent about two days debugging, and figuring out how to work around, a problem that seems to come down to buggy TCP offloading with Linux and the Intel Ethernet drivers/chipsets in Intel NUCs. Simple Google searches reveal thousands of reports for this one driver (e1000e) alone, going back over 10 years. Many of them appear to be unfixed but can be worked around by disabling one offload feature or another on the NIC. (Of course, some of these reports are really hardware issues such as broken cables or bad switches. But many people have ruled those out, and broken cables or bad switches wouldn't be fixed by disabling offload, so there is definitely a large number of real problems in there.)
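For reference, on Linux the workaround of disabling offload features is usually applied with `ethtool -K`. The following is only a minimal sketch: the helper names `offload_off_command` and `disable_offload` and the interface name `eth0` are made up for illustration, and which features actually need to be turned off (`rx` and `tx` checksumming here) is an assumption that depends on the NIC and driver.

```python
import shlex
import subprocess


def offload_off_command(interface, features=("rx", "tx")):
    """Build the ethtool argument list that switches the given features off.

    "rx" and "tx" are ethtool's short names for receive and transmit
    checksum offload; other features (e.g. "tso", "gso", "gro") can be
    passed the same way.
    """
    cmd = ["ethtool", "-K", interface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offload(interface, features=("rx", "tx"), dry_run=True):
    """Print the command (dry run) or execute it; executing needs root."""
    cmd = offload_off_command(interface, features)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "eth0" is a placeholder interface name (assumption).
    disable_offload("eth0")
```

A setting applied this way does not survive a reboot, so it would have to be reapplied by whatever mechanism the distribution uses for network configuration.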

The problem manifested itself like this: about once a day, the network card would pause for 30 seconds or so, and some (but not all) ongoing TCP sessions would be aborted due to corrupt packets. In many large companies I've experienced, that falls into the "really annoying, but hardly worth the bother of the user calling the helpdesk, and very low chance of anyone ever investigating and figuring out the cause" category.

So the answer is that modern-day NICs still suck, and I had previously viewed Intel as a 'known-good' Ethernet NIC vendor...

It honestly leaves me greatly puzzled: why is Intel not fixing these problems, and why does even bleeding-edge Linux (which I tried, and which didn't fix the problem) default to enabling checksum offload on this driver/chipset when it's known to be the source of many problems?

The benefit of checksum offload on modern processors on gigabit Ethernet connections is minimal.
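To give a feel for how little work is being offloaded, here is a sketch of the 16-bit ones' complement Internet checksum (RFC 1071) that TCP uses. It is an illustration only: the function name `internet_checksum` is made up, and a real TCP checksum also covers a pseudo-header with the IP addresses, which is left out here.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of data (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold the carries back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Example bytes taken from RFC 1071.
sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
print(hex(internet_checksum(sample)))  # 0x220d
```

The whole computation is one addition per 16 bits of payload plus a final fold and complement.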

So overall I'd definitely side with disabling offload by default. The benefits are at best minimal in the majority of non-server situations, and the downsides are pretty bad. Imagine 1,000 desktop users all losing 30 seconds a day. Even taking the low-end figure of a minimum-wage worker, ignoring opportunity cost, and assuming they can immediately resume exactly where they were at the end of the 30 seconds, you're looking at about 100 USD of wasted productivity a day, or about 25K USD a year. In reality the loss is much higher, as this is the most optimistic figure.
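The back-of-the-envelope figure above can be reproduced as follows. The hourly wage of 12 USD and the 250 working days per year are assumptions, back-derived so that the result matches the 100 USD/day and 25K USD/year quoted in the answer. The answer itself does not state them.

```python
USERS = 1_000
SECONDS_LOST_PER_USER_PER_DAY = 30
HOURLY_WAGE_USD = 12.0        # assumption: low-end wage implied by 100 USD/day
WORKING_DAYS_PER_YEAR = 250   # assumption: implied by 25K USD/year


def daily_cost_usd(users, seconds_lost, hourly_wage):
    """Cost of the lost time per day, counting wages only."""
    hours_lost = users * seconds_lost / 3600
    return hours_lost * hourly_wage


daily = daily_cost_usd(USERS, SECONDS_LOST_PER_USER_PER_DAY, HOURLY_WAGE_USD)
yearly = daily * WORKING_DAYS_PER_YEAR
print(f"{daily:.0f} USD/day, {yearly:.0f} USD/year")  # 100 USD/day, 25000 USD/year
```

1,000 users times 30 seconds is about 8.3 person-hours a day, which is where the 100 USD comes from.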