Should the UDP Timeout Value Be Increased on a VPN Router?

router, udp, vpn

We have a SonicWALL NSA 2400. We have numerous (100+) personnel in different offices in different physical locations that establish VPN sessions successfully every day. Some of these client computers are behind routers. One of these offices is reporting that, with four office computers, they sometimes cannot get some or all of their four VPN sessions to connect.

When I check our SonicWALL logs, I see that our NSA 2400 is receiving packets from their office router. Our NSA then tries to acknowledge these packets, but this results in a time-out message for the ack. (Unfortunately, I don't have the specific error messages.)

I checked our settings and our UDP timeout value for this connection (LAN->VPN) is 30 seconds. This seems low and I would like to increase this in an attempt to remedy the problem with the timeout ack messages, but I don't know what effect this will have on the rest of the connecting VPN sessions. Our TCP timeout value is 900 minutes, by the way, for reference. I suspect some previous administrator changed the TCP settings but forgot/ignored the UDP settings.

The NSA logs also have messages about UDP packets being dropped, both incoming and outgoing. The messages are for various port numbers and services (DNS, IKE (Traversal), etc.), so changing the UDP settings might help in removing these messages.

My fear is that changing these settings will affect the existing VPN connections. However, the NSA has 512 MB of RAM and our peak connection usage has been around 6200 connections out of 32000 max. We run about 1500 connections on average.

I have researched this on the web, but have only found general troubleshooting recommendations to "increase the UDP timeout and see if that fixes it".

So will increasing the UDP timeout to 300 seconds negatively affect the other existing VPN connections? It doesn't look like we are close to our connection limit.

Best Answer

NAT Traversal is the standard solution for what you are describing. Specifically, you need to verify the NAT Traversal keepalive setting. Let me explain.

More than likely, you are using IPsec in ESP mode. In a non-NAT traversal scenario, your data packets will look like this (after negotiation):

[Outer IP header][ESP Header][Encrypted/Authenticated Data][ESP Trailer]

The outer IP header will include the source/destination IPs of your VPN gateway device (your NSA 2400) and whatever WAN IP the remote office is using.

In your case, you have multiple clients sharing a single source IP, which is a very common PAT configuration. As such, the remote router needs to rewrite the source IP of each of your clients, as well as the source port. If that router were to encounter a packet like the one above (with no L4 header between the L3 header and the data), it would have no port to rewrite and would drop the packet as malformed.

NAT Traversal inserts an additional L4 header in the packet. So it would look like this:

[Outer IP header][NAT-T UDP 4500 header][ESP Header][Encrypted/Authenticated Data][ESP Trailer]

This header starts out with UDP source and destination port 4500 (the default; this is configurable). The destination port typically does not change on the way TO your VPN head end device. The source port typically does get changed by the intermediate NAT device (in this case, the remote office router).
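
To make the layout above concrete, here is a minimal Python sketch of the UDP and ESP headers and of what a PAT router does to them. The function names, the SPI value, and the payload are hypothetical; the outer IP header and ESP trailer are left out, and the UDP checksum is simply left at zero.

```python
import struct

NAT_T_PORT = 4500  # default NAT-T port described above


def build_nat_t_esp(src_port, dst_port, spi, seq, payload):
    """Return UDP header + ESP header + payload (no IP header, no ESP trailer)."""
    esp = struct.pack("!II", spi, seq) + payload      # ESP header: SPI, sequence number
    udp_length = 8 + len(esp)                         # the UDP header itself is 8 bytes
    udp = struct.pack("!HHHH", src_port, dst_port, udp_length, 0)  # checksum left at 0
    return udp + esp


def parse_nat_t_esp(packet):
    """Decode the fields written by build_nat_t_esp."""
    src_port, dst_port, length, _checksum = struct.unpack("!HHHH", packet[:8])
    spi, seq = struct.unpack("!II", packet[8:16])
    return {"src_port": src_port, "dst_port": dst_port,
            "length": length, "spi": spi, "seq": seq}


def pat_rewrite_source_port(packet, new_src_port):
    """Illustrative PAT step: only the UDP source port (first 2 bytes) changes."""
    return struct.pack("!H", new_src_port) + packet[2:]


# Hypothetical example values.
sent = build_nat_t_esp(NAT_T_PORT, NAT_T_PORT, 0x1000, 1, b"encrypted-bytes")
seen_by_head_end = parse_nat_t_esp(pat_rewrite_source_port(sent, 2222))
```

In this sketch the head end sees source port 2222 and destination port 4500, while the ESP fields pass through untouched. That is the point of the extra header: the router has a port it can rewrite without needing to understand ESP.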

Now, MOST stateful firewalls or routers have a very low UDP connection timeout. UDP is, by nature, connectionless, so it is very common for UDP connection timeouts to be less than a minute. Cisco ASAs use 30 seconds, if I recall (don't quote me, it's from memory). By comparison, TCP timeouts are inherently high, because the stateful device is usually looking for a RST or FIN to purge an entry from its connection table. A TCP timeout is just a stopgap in case both ends of the communication exploded simultaneously and weren't able to send out a RST/FIN. So a higher TCP timeout than UDP timeout is normal, and even recommended.

With that said, we can describe what is probably going on with your issue...

Suppose that the first time a remote office client connects, its NAT-T header's source port gets rewritten to 2222. The VPN head end device is then going to expect all future communication for that particular VPN connection to arrive with a source port of 2222. It will also encapsulate and send all return traffic back to the remote router's IP, with a destination port of 2222. This will work just fine, so long as this particular NAT translation stays active in the remote router.

But IF the remote router purges the entry from its NAT translation table, the next packet sent from client to server will get a new UDP source port, say 3333. When that gets to the VPN head end device, it will be treated as a new "connection", and a full renegotiation of IPsec will be expected (which, from a security standpoint, is the right thing to do).

Additionally, if the remote router purges the entry and the next packet sent is from server to client (that is, destined to port 2222), the remote router won't have any record of an outbound translation for that port, so the packet will simply be dropped. Remember, PAT is unidirectional: a translation can only be created by outbound traffic.
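
The two failure cases above can be reproduced with a toy model of the remote router's translation table. This is only a sketch under stated assumptions: the `PatTable` class, the port allocation scheme (2222, then 3333, to mirror the example), and the rule that only outbound packets refresh the idle timer are all invented for illustration, and real routers differ.

```python
class PatTable:
    """Toy PAT table: maps an inside (ip, port) to a public UDP source port."""

    def __init__(self, udp_timeout, first_port=2222, port_step=1111):
        self.udp_timeout = udp_timeout   # seconds of idleness before purge
        self._next_port = first_port     # hypothetical allocation scheme
        self._port_step = port_step
        self._entries = {}               # client -> (public_port, last_seen)

    def _purge_idle(self, now):
        # Drop every translation that has been idle for udp_timeout or longer.
        self._entries = {
            client: (port, seen)
            for client, (port, seen) in self._entries.items()
            if now - seen < self.udp_timeout
        }

    def outbound(self, client, now):
        """Client -> server packet: reuse or create a translation."""
        self._purge_idle(now)
        if client in self._entries:
            public_port = self._entries[client][0]
        else:
            public_port = self._next_port
            self._next_port += self._port_step
        self._entries[client] = (public_port, now)   # refresh idle timer
        return public_port

    def inbound(self, public_port, now):
        """Server -> client packet: deliver only if a translation exists."""
        self._purge_idle(now)
        for client, (port, _seen) in self._entries.items():
            if port == public_port:
                return client
        return None   # no translation: the packet is dropped


nat = PatTable(udp_timeout=30)
client = ("192.168.1.10", 4500)
first_port = nat.outbound(client, now=0)      # translation created
early_reply = nat.inbound(first_port, now=10) # still within the timeout
late_reply = nat.inbound(first_port, now=45)  # idle for 45 s: purged, dropped
second_port = nat.outbound(client, now=50)    # new translation, new port
```

With a 30 second timeout, the reply at 10 seconds is delivered, the reply at 45 seconds is dropped, and the client's next outbound packet leaves with a different source port, which the head end would treat as a new connection.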

All that said, it is common and perfectly normal for VPN connections to go silent, so there is a function built into NAT Traversal called the "NAT Traversal Keepalive". It is an essentially empty packet (RFC 3948 specifies a single 0xFF byte of payload) that is sent every 10-15 seconds (this differs by implementation) from client to server. The sole purpose of this packet is to keep the NAT translation alive in intermediate NAT devices, like the remote router.
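
As a rough check of whether a given keepalive interval is short enough, the relationship can be sketched as follows. The function name and the example numbers are hypothetical, and the model ignores packet loss and timer jitter, so a real deployment would want some margin below the router's timeout.

```python
NAT_KEEPALIVE_PAYLOAD = b"\xff"  # one-byte keepalive payload per RFC 3948


def mapping_survives_idle(idle_seconds, nat_udp_timeout, keepalive_interval=None):
    """Return True if the NAT translation outlives a period with no VPN data.

    Without keepalives, the longest gap between packets is the idle period
    itself. With keepalives, the gap is capped at the keepalive interval.
    """
    if keepalive_interval is None:
        longest_gap = idle_seconds
    else:
        longest_gap = min(idle_seconds, keepalive_interval)
    return longest_gap < nat_udp_timeout


# Hypothetical numbers: a tunnel that is silent for two minutes behind a
# router with a 30 second UDP timeout.
no_keepalive = mapping_survives_idle(120, 30)
fast_keepalive = mapping_survives_idle(120, 30, keepalive_interval=15)
slow_keepalive = mapping_survives_idle(120, 30, keepalive_interval=60)
```

Here only the 15 second keepalive keeps the translation alive; a 60 second keepalive is as ineffective as none at all, because the router forgets the translation between keepalives.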

So it sounds to me like you have your NAT Traversal Keepalive interval set too high. I would start by decreasing it, because it's a software change you can make on your head end device that should affect all your clients, without forcing you to replace hardware on the other end.

Lastly, to reiterate what Ricky mentioned: if any timeout needs to be adjusted, it will be on the remote router, not on the VPN head end device.

Related Solutions

Router – Cause of high CPU load on Juniper peering router’s routing engine

There might be some helpful information for you at the Juniper Knowledge Center.

If RPD is consuming high CPU, then perform the following checks and verify the following parameters:

Check the interfaces: Check if any interfaces are flapping on the router. This can be verified by looking at the output of the show log messages and show interfaces ge-x/y/z extensive commands. Troubleshoot why they are flapping; if possible you can consider enabling the hold-time for link up and link down.
Check if there are syslog error messages related to interfaces or any FPC/PIC, by looking at the output of show log messages.
Check the routes: Verify the total number of routes that are learned by the router by looking at the output of show route summary. Check if it has reached the maximum limit.

Check the RPD tasks: Identify what is keeping the process busy. This can be checked by first enabling set task accounting on. Important: This itself might increase the load on CPU and its utilization; so do not forget to turn it off when you are done with the required output collection. Then run show task accounting and look for the thread with the high CPU time:

user@router> show task accounting
Task                       Started    User Time  System Time  Longest Run
Scheduler                   146051        1.085        0.090        0.000
Memory                           1        0.000            0        0.000  <omit>
BGP.128.0.0.4+179              268       13.975        0.087        0.328
BGP.0.0.0.0+179      18375163 1w5d 23:16:57.823    48:52.877        0.142
BGP RT Background              134        8.826        0.023        0.099

Find out why a thread, which is related to a particular prefix or a protocol, is taking high CPU.

You can also verify whether routes are oscillating (route churn) by looking at the output of the shell command: % rtsockmon -t
Check RPD memory. Sometimes high memory utilization can indirectly lead to high CPU.

VPN Connection Between Two SonicWall Devices

According to your last log, not even Phase 1 is established, because both sides of the tunnel got a timeout.

I would suggest taking a packet capture to find where the packets are stopping. On each device, filter the capture by public IP. There are three possible scenarios:

The (returning) traffic is dropped at the firewall. You will see a red line with a drop code.
There is no returning traffic. Some device in the middle is dropping IKE packets.
You see normal returning traffic. There is another problem in your tunnel config.
