Linux – Multi-host VM/Docker network communication is SLOW, any best practices?


VM-1 on host-1 <[cable]> network router <[cable]> host-2 with VM-2

If I understand correctly, when a file is transferred from an application on VM-1 to the same application on VM-2, the data goes through the following journey:

  • VM-1 application file read to memory buffer
    • programming language related calls
    • operating system level calls
    • seccomp/apparmor logic
    • file system permissions logic
    • operating system file handling and buffer
  • VM-1 application data sent to network socket buffer
    • operating system calls
    • seccomp/apparmor logic
  • VM-1 operating system network stack
    • routing tables
    • firewall logic
  • Host-1 hypervisor virtual network stack
    • virtual switch
    • routing tables
  • Host-1 operating system network stack
    • routing tables
    • firewall logic
  • Host-1 physical network card buffer
  • Network router
  • almost the same stack, mirrored, for VM-2 on host-2

Assuming the file is large, the steps related to seccomp/apparmor, routing and firewall will be cached or skipped once the file is already open and the transfer is in progress.

But with frequent communication between virtual machines, where messages are small enough to fit into 1-2 TCP packets, we have a problem.

Every call and every piece of logic processing needs several hundred CPU ticks, so the oversized stack described above puts a significant load on the CPU and contributes to latency.


  1. Will a pre-opened communication socket between VMs omit any steps in the list described above?
  2. Does SDN somehow mitigate such problems, or does it add even more overlays and extra headers to packets?
  3. Do I really need the described process to communicate between VM-1 and VM-2, or is there a special Linux "less-secure-more-performance-use-on-your-own-risk" build?
  4. Do I have to stick with Linux at all? Are there any faster *BSD-like systems with Docker support?
  5. What are the best practices to mitigate that bottleneck and, as a result, fit more VMs with micro-services on the same host?
  6. Do solutions like Project Calico help, or is this more about a lower level?

Best Answer

Will a pre-opened communication socket between VMs omit any steps in the list described above?

A pre-opened socket between VMs/containers will do the trick, because it avoids the TCP handshake overhead on every exchange, and even more so if TLS is involved.

Although handshake overhead is generally accepted to be negligibly small, it starts to play a significant role when we are talking about frequent communication.

Going to the extreme of keeping M x N connections open in a mesh-like container network is not very wise. A simple keep-alive scheme with a TTL based on your own containers' communication statistics will be better.
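To make the keep-alive-with-TTL idea concrete, here is a minimal sketch in Python, not a production pool. The class name `IdleTtlPool`, the `connect` factory and the injectable `clock` are all hypothetical names introduced for illustration; the TTL value itself is something you would derive from your own traffic statistics.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class IdleTtlPool:
    """Reuse one connection per peer; drop it after `ttl` seconds of inactivity."""

    def __init__(self, connect: Callable[[Hashable], object], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._connect = connect  # hypothetical factory: peer -> open connection
        self._ttl = ttl          # assumed to come from your own traffic statistics
        self._clock = clock      # injectable so the eviction logic can be tested
        self._conns: Dict[Hashable, Tuple[object, float]] = {}  # peer -> (conn, last_used)

    def get(self, peer: Hashable) -> object:
        """Return a connection to `peer`, opening a new one only when needed."""
        self.evict_idle()
        entry = self._conns.get(peer)
        conn = entry[0] if entry else self._connect(peer)
        self._conns[peer] = (conn, self._clock())  # refresh the idle timer
        return conn

    def evict_idle(self) -> None:
        """Close and forget connections that have been idle longer than the TTL."""
        now = self._clock()
        for peer, (conn, last_used) in list(self._conns.items()):
            if now - last_used > self._ttl:
                conn.close()
                del self._conns[peer]

    def __len__(self) -> int:
        return len(self._conns)
```

With real sockets, the factory could be as simple as `lambda peer: socket.create_connection(peer)`, where `peer` is a `(host, port)` tuple. Peers that talk often keep their connection, while rarely used ones are closed instead of accumulating towards the M x N worst case.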

Keep in mind that too many threads keeping TCP connections alive will cause overhead of their own, so make sure that you use epoll.
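As a rough illustration of the epoll point, the sketch below handles many open connections from a single thread instead of one thread per connection. It uses Python's standard `selectors` module, whose `DefaultSelector` picks epoll on Linux when it is available. The function name `serve_ready` and the echo behaviour are assumptions made for the example, and the sockets are left in blocking mode for brevity; a real server would use non-blocking sockets and buffer its writes.

```python
import selectors
import socket


def serve_ready(sel: selectors.BaseSelector, timeout: float = 1.0) -> int:
    """Echo data back on every registered socket that is readable right now.

    Returns the number of messages handled in this pass.
    """
    handled = 0
    for key, _events in sel.select(timeout):
        conn = key.fileobj
        data = conn.recv(4096)  # will not block: the selector reported readiness
        if data:
            conn.sendall(data)
            handled += 1
        else:
            # An empty read means the peer closed the connection.
            sel.unregister(conn)
            conn.close()
    return handled
```

Each kept-alive connection is registered once with `sel.register(conn, selectors.EVENT_READ)`, and a single loop calling `serve_ready(sel)` then services all of them.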

Does SDN somehow mitigate such problems or does it add even more overlays and extra headers to packets?

It does add more overlays, and many of them are vendor-locked, but there is at least one SDN-related solution for Docker environments, pipework, which is described below.

Do I really need the described process to communicate between VM-1 and VM-2, or is there a special Linux "less-secure-more-performance-use-on-your-own-risk" build?

I didn't find a "special" Linux build with a community trustworthy enough and with update support, but problems with the slow Linux TCP stack are not new, and there are many options for kernel bypass. Cloudflare does that.

From the articles I found, the slow Linux TCP stack is a well-known issue, and there is no way to just drop in a Linux server and win: you have to fine-tune that child of Torvalds to solve your own problem, one way or another, every time.

Do I have to stick with Linux at all? Are there any faster *BSD-like systems with Docker support?

I have found no evidence that Windows, macOS or any *BSD-like system has better networking than the latest Linux with its slow TCP stack and kernel bypass applied.

What are the best practices to mitigate that bottleneck and, as a result, fit more VMs with micro-services on the same host?

So, there are two bottlenecks: the guest Linux and the host Linux.

For the host Linux, if it is used for more than just hosting containers, there is the kernel bypass strategy, with a wide variety of options ranging from those described in the Cloudflare blog and the "Why do we use the Linux kernel's TCP stack?" article to writing your own application-focused TCP stack.

For the guest Linux, Macvlan may be used to bypass Layer 3 and connect a Docker container directly to the NIC with its own MAC address. It is much better than a bridge, because it mitigates a lot of both guest and host Linux network bottlenecks, but make sure you are ready to flood your router's MAC address table with another hundred or thousand records - most likely you will have to segment your network.
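For reference, a Macvlan network is created with Docker's `macvlan` driver. The small Python helper below only assembles the `docker network create` command line so it can be reviewed or handed to `subprocess.run`; the helper name `macvlan_network_cmd`, the interface `eth0`, the network name and the addresses are placeholders, not values from this setup.

```python
import shlex
from typing import List


def macvlan_network_cmd(name: str, parent: str, subnet: str, gateway: str) -> List[str]:
    """Build the argument list for creating a Docker Macvlan network.

    `parent` is the host NIC that the containers attach to.
    """
    return [
        "docker", "network", "create",
        "-d", "macvlan",
        "--subnet", subnet,
        "--gateway", gateway,
        "-o", f"parent={parent}",
        name,
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own NIC and addressing plan.
    cmd = macvlan_network_cmd("macnet", "eth0", "192.168.50.0/24", "192.168.50.1")
    print(shlex.join(cmd))
```

A container started with `--network macnet` then gets its own MAC address on the parent interface's segment, which is exactly where the extra MAC table entries mentioned above come from.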

Also, as per the "Virtual switching technologies and Linux bridge" presentation, there is an SR-IOV option, which is even better than Macvlan. It is available for Docker 1.9+ as a plugin for Mellanox Ethernet adapters, is included as a mode in pipework SDN, and has a dedicated SRIOV plugin from Clear Containers - more than enough to start digging into an application-focused solution.

Do solutions like Project Calico help, or is this more about a lower level?

It operates at a totally different level and will not have a significant impact compared with SRIOV and Macvlan, but such solutions help simplify network management with almost no overhead on top of the bypass solution you choose.

And yes...

Monitor your Docker closely, as it may do unexpected things. For example, it modprobes nf_nat and xt_conntrack even though there is no reason for them with Macvlan turned on, which leads to extra CPU ticks being spent.
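One cheap way to keep an eye on this is to check which kernel modules are loaded. The sketch below parses text in the format of `/proc/modules`, where each line starts with the module name, and reports which of the watched modules are present. The function names and the `WATCHED` tuple are made up for this example.

```python
from typing import List, Sequence, Set

# Modules mentioned above that should not be needed with Macvlan alone.
WATCHED = ("nf_nat", "xt_conntrack")


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return the module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def unexpected_modules(proc_modules_text: str,
                       watched: Sequence[str] = WATCHED) -> List[str]:
    """Return the watched modules that are currently loaded, in watch order."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in watched if name in loaded]
```

On a Docker host you would pass it the contents of `/proc/modules`, for example `unexpected_modules(open("/proc/modules").read())`, and alert when the result is not empty.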