Poor OpenVPN NFS performance

Tags: nfs, openvpn

I have EC2 application servers behind an elastic load balancer. All of the application servers have access to a shared storage server, notably for centralized cache files, logging, etc. The shared storage server implements NFS over OpenVPN to do its job. All of the servers are in the same availability zone and talk to each other over the internal network. However, the shared storage solution is leading to abnormally high CPU usage and latency that do not exist when storage is 100% local. A slight performance decrease with this setup is expected, but in this case CPU usage has gone up and throughput has slowed by 2-3x.

So, this is a 2 part question:

  1. What can I do to better understand where the culprit is? I know it's the combination of OpenVPN and NFS, but can I identify the specific files being read and written the most? Or can I easily tell where the bottleneck is?

  2. Does anybody have advice based purely on the information above? Is there a better way to share files cross-server in a cloud-based environment? (Please don't respond with "use S3", as that is not a good fit for instant/temporary file requirements)

Thanks!

Best Answer

Make sure the link MTU for the OpenVPN tunnel is set accurately so that you don't get fragmentation twice (once at the host from 8192 bytes to 1500 bytes, and once at OpenVPN from 1500 bytes to 1400 bytes or whatever). OpenVPN handles setting the interface MTU very, very poorly (it actively hinders attempts to set the MTU correctly).
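For reference, pinning the tunnel MTU explicitly in the OpenVPN config might look like the following sketch. The value 1400 is an assumption for illustration; derive yours from actual path-MTU measurements, not from this snippet.

```
# both server and client config; 1400 is illustrative, measure your own path
tun-mtu 1400
```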

Check latency for different packet sizes going through and around the tunnel. Plot the results and look for anomalies.
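A minimal sketch of such a sweep, as a helper that emits one ping command per payload size. The target 10.8.0.1 is an assumed tunnel-side peer address; substitute your own. On Linux, -M do sets the DF bit so oversized probes fail loudly instead of silently fragmenting.

```shell
# generate_probes TARGET SIZE...: emit one ping command per payload size
generate_probes() {
    target="$1"
    shift
    for size in "$@"; do
        # -c 5: five probes; -s: ICMP payload size; -M do: forbid fragmentation
        echo "ping -c 5 -s $size -M do $target"
    done
}
```

Pipe the output to sh to actually run it, e.g. `generate_probes 10.8.0.1 64 512 1400 1472 2000 8000 | sh`, then repeat against the server's non-VPN address and compare the averages.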

Set up a test NFS export outside the tunnel and run the same performance measurements against it, to isolate whether OpenVPN is the problem.
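A rough way to compare the two paths is a sequential-write timing with dd (GNU dd assumed for conv=fdatasync, which makes the timing include the flush to the server). The mount points below are hypothetical; point them at your own mounts.

```shell
# bench_write DIR [MB]: time a sequential write of MB megabytes into DIR
bench_write() {
    dir="$1"
    mb="${2:-64}"
    # conv=fdatasync forces the data out before dd reports its rate
    dd if=/dev/zero of="$dir/nfs-bench.$$" bs=1M count="$mb" conv=fdatasync 2>&1 | tail -n 1
    rm -f "$dir/nfs-bench.$$"
}
```

Run it once against the mount that goes through the tunnel and once against a direct mount of the same export, e.g. `bench_write /mnt/nfs-via-vpn` vs `bench_write /mnt/nfs-direct`; a large gap points at OpenVPN rather than NFS itself.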

Try different versions of NFS, over both TCP and UDP.

Try enabling aggressive caching and async I/O.
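The last two suggestions can be tried from the client with mount options alone. The export path 10.8.0.1:/srv/share is an assumption, and the option values are starting points to vary, not tuned recommendations.

```
# NFSv3 over TCP vs UDP, with different transfer sizes:
mount -t nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768 10.8.0.1:/srv/share /mnt/test
mount -t nfs -o vers=3,proto=udp,rsize=8192,wsize=8192   10.8.0.1:/srv/share /mnt/test

# Aggressive client-side attribute caching and async writes:
mount -t nfs -o vers=3,proto=tcp,noatime,actimeo=60,async 10.8.0.1:/srv/share /mnt/test
```

Note that actimeo=60 trades cache-file freshness across servers for fewer GETATTR round trips, which may or may not be acceptable for your shared cache.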


The following is a rant about the "active hindering" of openvpn WRT MTU (added by "request")

Yes. Setting tun-mtu causes openvpn to warn: "WARNING: normally if you use --mssfix and/or --fragment, you should also set --tun-mtu 1500 (currently it is 1413)" — even though I don't use --mssfix or --fragment.

Additionally, setting a static MTU in the configuration is stupid and error-prone; it needs to be dynamic. So, you use "mtu-disc yes", right? Well of course, except the value it passes into the startup script is off by 14 (though I am using TAP for IPv6 support, which might mysteriously confuse it). So I need to run /sbin/ifconfig $1 mtu $(($2-14)) up in the up script to get the proper value (proper meaning a value that causes the tunnel packets to not be fragmented or dropped for being too big).
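The bookkeeping behind that one-liner can be written out as arithmetic. Both offsets are assumptions to verify against your own captures: 14 bytes of Ethernet header on a TAP device, and 20 (IP) + 8 (ICMP) bytes of header for an unfragmented ping payload.

```shell
# usable interface MTU, from the value openvpn hands the up script ($2)
iface_mtu() { echo $(($1 - 14)); }

# largest ping -s payload that fits in a given MTU without IP fragmentation
max_ping_payload() { echo $(($1 - 28)); }
```

So `iface_mtu 1514` prints 1500, and `max_ping_payload 1500` prints 1472 — the classic largest DF-safe ping size on a 1500-byte link.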

I have difficulty imagining a circumstance where setting the interface MTU to the correct value is not a good idea, at least if you don't have fragment set (and you should never have fragment set, lest your network sins come back to haunt you). It should also dynamically change the interface MTU if it later receives fragmentation-needed errors due to network changes after initialization.

MSS is entirely the wrong network layer at which to do this work. If you have the interface MTU correctly configured, Path-MTU discovery, MSS, and everything else simply works. If you don't, some things may sort of work, others will not, and which ones work depends on whether the real packet is sent from the openvpn host or from some other host. Oh, and don't get me started on what can fail if the MTU is asymmetric (not entirely uncommon).

I think OpenVPN was written by people without much network and sysadmin experience, and their poor choices got hardcoded into the configuration and data structures/API. They really didn't do a very good job with flexible and sane network configuration and integration (this is just one of several examples). The sad thing is that it is hundreds of times better than the other solutions out there, which makes me an OpenVPN supporter. The Cisco VPN is a blight, for example.