Windows Server 2016 Dropping Packet Retransmissions

amazon-web-services networking tcp windows-server-2016

We've come across an issue where we're unable to narrow down the root cause, and we're hoping the collective here can provide insight.

We have an AWS EC2 instance, a c4.8xlarge, that is dropping TCP retransmits, causing the application to break. Has anyone seen anything like this happen?

We're unable to narrow down what's causing this to happen.

We have an application on a system at a remote site that sends data to the EC2 instance, and the EC2 instance then sends that data back. During the course of these exchanges, a packet is lost in transit. TCP, by nature, then attempts to recover.

The EC2 instance sends a Fast Retransmission, but this retransmit never makes it out of the virtual NIC of the EC2 instance.

We've gotten AWS to perform a packet capture right off the NIC of the EC2 instance, and they don't see the retransmit hitting the wire. A packet capture on the EC2 instance itself shows the retransmit, but again, it never actually makes it to the virtual NIC.

The EC2 instance then attempts 5 more retransmissions that also don't make it out, ending with the EC2 instance issuing a TCP reset.

Ping and mtr both look normal. We can readily reproduce the problem by running these jobs, which eventually fail because of the lost retransmits.

Any insight would be helpful please!

Edit: We've attempted to duplicate the issue by simulating the traffic (HTTP download/upload, scp transfer), but it seems we're only able to reproduce it with the original application.

Final update: We were unable to determine the root cause. The team has rebuilt the servers using a new AMI, and at this time everything is working on the new EC2 instances.

Best Answer

I ran into exactly the same issue. I enabled jumbo packet support on both the client and the server with the PowerShell command Set-NetAdapterAdvancedProperty -Name "Ethernet 2" -RegistryKeyword '*JumboPacket' -RegistryValue '9014'. After that, the issue was gone and the retransmits made it through.
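
Before changing the setting, it may help to check what the adapter currently reports. The following is only a rough sketch, assuming the adapter is named "Ethernet 2" as in the command above (Get-NetAdapter lists the actual names). The accepted jumbo packet values depend on the NIC driver, so 9014 may not be valid on every adapter.

```powershell
# Assumption: the adapter is named "Ethernet 2"; list the real names with Get-NetAdapter.
$adapter = 'Ethernet 2'

# Show the current jumbo packet setting and the values the driver reports as valid.
Get-NetAdapterAdvancedProperty -Name $adapter -RegistryKeyword '*JumboPacket' |
    Format-List DisplayName, RegistryValue, ValidRegistryValues

# Apply the value used above. By default this restarts the adapter.
Set-NetAdapterAdvancedProperty -Name $adapter -RegistryKeyword '*JumboPacket' -RegistryValue '9014'

# Check the resulting IPv4 MTU on each interface.
Get-NetIPInterface -AddressFamily IPv4 | Format-Table InterfaceAlias, NlMtu
```

Because the adapter restarts when the property changes, a remote session to the server can drop for a moment, so it is probably safer to run this in a maintenance window.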