Debugging why network packets are being dropped

ethernet networking virtual-machines vmware-workstation wireshark

Preface:

I have an application that I am currently testing that runs on RHEL 6. In my test setup, the application is installed on an embedded device, which is connected by an Ethernet cable to a PC. The PC hosts a virtual machine (on VMware Workstation) running Linux, and the embedded device communicates with that virtual machine. The virtual machine and the embedded device both have static IP addresses, since they need to communicate with each other over the Ethernet cable.

The application needs to communicate using a pub-sub tool, in this case RTI DDS. This has been tested in a wireless environment and in another wired environment with a different PC but the same virtual machine, and pub-sub worked in both.

Problem:

When testing pub-sub on the current setup, we can see in Wireshark that all the fragmented packets sent by the embedded device are delivered to the PC's main operating system (Windows in this case). However, when the fragmented packets are passed from the main operating system to the virtual machine's operating system, Wireshark shows the virtual machine receiving only the last packet; the rest are dropped.

So far we have tried disabling the firewalls and pinging the devices from each other. Both worked without issues, so they gave us no insight into why packets are being dropped.

How can we debug how and why network packets are being dropped? Is it possible with Wireshark, since we are already using that tool?

Best Answer

In a general sense I suspect MTU (frame size) is the root of the problem. I have a few reasons and a few suggestions.

First, this behavior varies by L2 (it only happens with the wired traffic as opposed to wireless). That in itself is suspicious and suggests that there is a problem at the interface level.

Second, packet fragmentation is a symptom of MTU misalignment. Packet fragmentation is not a problem per se but it is not optimal as it creates overhead and additional points of failure.
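To make the link between MTU and fragmentation concrete, here is a small sketch of the arithmetic for a single UDP datagram over IPv4. It assumes a 20-byte IPv4 header with no options; the 8192-byte sample size in the usage note is purely illustrative and is not taken from your setup.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def fragment_count(udp_payload: int, mtu: int = 1500) -> int:
    """Return how many IPv4 fragments one UDP datagram needs at a given MTU."""
    # The UDP header travels inside the IP payload, so it counts toward the total.
    ip_payload = udp_payload + UDP_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)
```

For example, fragment_count(8192, 1500) gives 6 and fragment_count(8192, 1380) gives 7, while anything up to 1472 bytes fits in a single unfragmented packet at an MTU of 1500.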

Third, the Linux guest VM receiving only "the very last packet" is a known issue with certain VMware NICs and versions.

Now, since the host is receiving the packets in any case, and since MTU size only affects packets sent, you cannot change the MTU on the VM and expect anything different. You can, however, do the following:

Suggestions

Determine if MTU is a problem

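From the Windows host, run ping -f -l (payload size) your.guest.ip.or.name. Here -f sets the Don't Fragment flag and -l sets the ICMP payload size, which excludes the 20-byte IP header and the 8-byte ICMP header. So use your host VM adapter's MTU minus 28: for an MTU of 1500, that is ping -f -l 1472 myguest.

If it works at the value matching your current MTU, then I am wrong and you can ignore this. Otherwise, keep lowering the -l value until it does respond, then set your host virtual adapter's MTU to that value plus 28. See http://www.thincomputing.net/2011/06/28/mtu-size-mismatch-a-major-cause-of-disconnections/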
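If stepping the -l value down by hand gets tedious, the search can be automated. The following is only a sketch, assuming it runs on the Windows host, that a reply can be recognized by "TTL=" in the ping output, and that larger payloads never succeed where smaller ones fail. The function names ping_df and largest_payload are made up for this example.

```python
import subprocess
from typing import Callable

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_df(host: str, payload: int) -> bool:
    """Send one Windows ping with Don't Fragment set; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-n", "1", "-f", "-l", str(payload), host],
        capture_output=True,
        text=True,
    )
    # Assumption: an echo reply line contains "TTL=" in the ping output.
    return result.returncode == 0 and "TTL=" in result.stdout


def largest_payload(probe: Callable[[int], bool], low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest payload for which probe(payload) is True.

    Returns -1 if even the smallest payload fails.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Calling largest_payload(lambda n: ping_df("myguest", n)) returns the biggest payload that gets through unfragmented; adding ICMP_OVERHEAD to it gives the MTU to try on the host virtual adapter.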

Use a different vNIC driver in VMware Workstation

There are known issues with certain combinations of OS, vNIC, and hypervisor. I include some research on known VMware issues below, but the short version is to try a different vNIC driver on the guest. If you are using E1000, try one of the newer ones. If you are using vmxnet3, try vmxnet2 or E1000, and so on. If this fixes it, you can either keep the new driver or look up the specific driver you had before to find out from VMware how to fix it.

Experiment with a lower MTU on your host

Lower the MTU on your host from wherever it is now (probably about 1500) to somewhere around 1380. If the problem goes away, raise it in steps until the problem returns or you reach about 1468, then leave it at the highest value that still works.
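On a Windows host the MTU can be changed with netsh from an elevated prompt. The sketch below only builds the command so you can inspect it before running it. The interface name "Ethernet" in the usage note is a placeholder (list the real names with netsh interface ipv4 show subinterfaces), and the 576 to 1500 range check is this sketch's own sanity guard, not a netsh rule.

```python
import subprocess
from typing import List


def set_mtu_command(interface: str, mtu: int, persistent: bool = True) -> List[str]:
    """Build the netsh command that sets the IPv4 MTU on a Windows interface."""
    # Guard chosen for this experiment only; adjust it if you use jumbo frames.
    if not 576 <= mtu <= 1500:
        raise ValueError(f"MTU {mtu} is outside the range this sketch allows")
    store = "persistent" if persistent else "active"
    return [
        "netsh", "interface", "ipv4", "set", "subinterface",
        interface, f"mtu={mtu}", f"store={store}",
    ]


def apply_mtu(interface: str, mtu: int) -> None:
    """Run the command; requires an elevated (administrator) prompt."""
    subprocess.run(set_mtu_command(interface, mtu), check=True)
```

For instance, set_mtu_command("Ethernet", 1380) produces the arguments for netsh interface ipv4 set subinterface Ethernet mtu=1380 store=persistent. Passing persistent=False uses store=active, which does not survive a reboot and is convenient while experimenting.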