Very high TCP retransmit rate to the Internet from devices inside the LAN

kvm-virtualization networking performance pfsense tcp

I work with Linux systems a lot as a developer and have a moderate level of networking knowledge, but the following situation is a real baffling mystery:

I'm running pfSense 2.4.3 (the current release) in a KVM virtual machine on a Linux host. I intend to use it for my Internet routing, NAT, and firewall. The VM has a macvtap-based interface that gives it essentially direct access to my cable modem, through the physical interface on the host that the modem's Ethernet is connected to. It has a second virtual interface that is a member of a network bridge on the Linux host; this provides the LAN side of the Internet connection to the Linux host itself and to other devices on the LAN. The LAN network is 192.168.123.0/24, and pfSense's LAN address is statically assigned as 192.168.123.2.

This setup seemed to work fine for the first four or five days, but then I noticed that machines on the LAN were getting very poor throughput, both upload and download, for TCP traffic destined for the Internet: around 1 Mbps. A packet capture shows TCP retransmit rates of around 16%.

I've observed the following additional things:

  • All machines on the LAN are able to communicate with each other without apparent issue. In particular, if I transfer a file hosted on the pfSense instance to a machine on the LAN, the transfer proceeds quickly (200+ Mbps), and a packet capture shows 1 retransmit in 77k packets.
  • TCP throughput to the Internet when running e.g. curl or wget directly on the pfSense instance reaches the rated speed of my Internet connection, around 60 Mbps. I've reproduced this multiple times.
  • The poor performance is the same whether using NATted IPv4 or IPv6.
  • My switch reports a receive error rate of about 1.2 percent on the port that the Linux host / pfSense bridge is connected to. I'm not seeing high dropped or errored Ethernet frame statistics on any of the hosts.
  • Packet captures taken by both pfSense and machines on the LAN show that when a packet is retransmitted, the earlier packet with the same sequence number had already been received. In the handful of cases I diffed, the two packets contained identical TCP segment payloads, which seems to suggest that the retransmission was unnecessary.

I then also noticed that the Linux host running the pfSense VM had lost its ability to open outbound TCP connections at all, but:

  • It can still ping hosts out on the Internet just fine.
  • The pfSense firewall logs show that my outbound connection attempts are not being blocked by some erroneous rule.
  • The apparently flawless LAN connectivity extends to the link between the Linux host and the pfSense VM it runs: the host can open TCP sessions to its default gateway, 192.168.123.2, and transfer data to and from it quickly.
  • A packet capture taken by pfSense of its WAN interface shows TCP SYN packets going out to the macvtap interface when the Linux host tries to make outbound connections, as expected. They are never acknowledged.

Clues? Things I could try? All-out guesses?

Best Answer

I traced both of my problems to incorrectly computed TCP segment checksums. The checksums were always wrong for traffic originating from the Linux host running the pfSense VM, and intermittently wrong for other traffic.

This seems to be because pfSense does not compute these checksums itself by default, offloading them to the network driver / hardware instead. You can set pfSense to compute them in software under System / Advanced / Networking by ticking "Disable hardware checksum offload".
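
For anyone who wants to check a captured segment by hand, here is a minimal Python sketch of how an IPv4 TCP checksum can be verified: the standard Internet checksum (RFC 1071) is taken over a pseudo-header plus the TCP segment. This is not what pfSense runs internally. The function names, addresses, and ports are made up for illustration, and the raw TCP bytes are assumed to have already been extracted from the capture.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """IPv4 pseudo-header that is included in the TCP checksum."""
    return (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, tcp_length)
    )


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """True if the checksum carried inside the segment is valid.

    Summing the pseudo-header and the whole segment, checksum field
    included, gives 0 when the checksum is correct.
    """
    return internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0


def build_segment(src_ip: str, dst_ip: str, payload: bytes) -> bytes:
    """Hypothetical helper: build a 20-byte TCP header plus payload with a valid checksum."""
    # src port, dst port, seq, ack, data offset, flags (PSH+ACK), window, checksum, urgent
    header = struct.pack("!HHIIBBHHH", 12345, 80, 1, 0, 5 << 4, 0x18, 65535, 0, 0)
    segment = header + payload
    csum = internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return segment[:16] + struct.pack("!H", csum) + segment[18:]


if __name__ == "__main__":
    # Illustrative addresses only: a LAN host and a documentation-range server.
    good = build_segment("192.168.123.10", "203.0.113.5", b"hello")
    bad = good[:-1] + bytes([good[-1] ^ 0xFF])  # corrupt the last payload byte
    print(tcp_checksum_ok("192.168.123.10", "203.0.113.5", good))
    print(tcp_checksum_ok("192.168.123.10", "203.0.113.5", bad))
```

One caveat when applying this to real captures: with checksum offload enabled, packets captured on the sending machine itself often show an unfilled or wrong checksum, because the NIC or driver fills it in after the capture point. A capture taken on the receiving side is the more reliable indicator.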