Performance Test and TCP tuning

net tcp windows-server-2008

We are performance testing an application that receives TCP requests and converts them to SOAP requests (WCF, HTTP binding), which other services then process.

The server is Windows Server 2008 R2.
The TCP requests are received by a TcpListener instance (.NET, C#).
There are 3 WCF services with HTTP bindings running on the same server.

We have built a performance test client whose goal is to simulate multiple concurrent requests (each request has to be different and recognizable by the application).

We built a test that runs 150 requests at the same time (on 150 different threads), and we noticed straight away that some requests are slow to get a TCP connection, but once they have it, they complete quickly.

A single request writes twice on the same connection: the request itself and an application ack.

Although a single request plus ack takes about 150 ms, the 150-request test takes about 7 seconds.

The Problem

When we run this test from 2 different computers, we lose requests. Some client requests fail with:

no connection was made because the target machine actively refused it

So I got here and became convinced the backlog was the cause.
I changed the TcpListener parameters and applied the AFD backlog registry changes described here,
but it still didn't work. I then applied all of the suggested TCP tuning, plus some netsh commands that were recommended, but nothing changed: we still get that error.

Is there anything else I need to know? Are there any other solutions?

Best Answer

This is an unfortunate Windows annoyance. When a Windows server gets overwhelmed, it actively refuses connections (responding with a RST) rather than just not responding to them (not sending a SYN-ACK). To prevent this from being interpreted as failure, Windows clients typically retry connections (sending another SYN) even when they are actively refused (responded to with a RST).

You should retry the connection, even if it is actively refused. Obviously, if you attempt more connections per second than the server can handle, they can't all succeed. How you want to handle this case is up to you.

So the short answer is: what do you want to happen if you get more connection attempts than the server can handle? Do you want them to just go really, really slow? Or do you want them to fail? If the former, retry forever, waiting longer and longer between retries. If the latter, give up after some number of attempts.
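As a rough illustration of those two policies, here is a minimal client-side sketch. It is written in Python for brevity rather than in the C# of the test client, and the function name, parameters, and delay values are all assumptions made up for this example. Passing `max_attempts=None` retries forever with growing waits; a finite value gives up and re-raises the refusal.

```python
import socket
import time


def connect_with_retry(host, port, max_attempts=None,
                       base_delay=0.05, max_delay=5.0, sleep=time.sleep):
    """Connect to (host, port), retrying when the connection is refused.

    max_attempts=None retries forever (the "go slow" policy); a finite
    value gives up after that many attempts (the "fail" policy).
    All names and default values here are illustrative assumptions.
    """
    attempt = 0
    delay = base_delay
    while True:
        attempt += 1
        try:
            # Raises ConnectionRefusedError when the server answers with a RST.
            return socket.create_connection((host, port), timeout=10)
        except ConnectionRefusedError:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # "fail" policy: surface the refusal to the caller
            sleep(delay)                       # wait before the next SYN
            delay = min(delay * 2, max_delay)  # wait longer each time, up to a cap
```

In a load test with many threads retrying at once, adding some random jitter to the delay would keep the retries from arriving in synchronized waves; that is left out here to keep the sketch short.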

Increasing the backlog will help keep short bursts of load from triggering this error. But if the connection rate exceeds the rate the server can handle, you cannot make it work. No finite number of backed up connections would do because the connection rate exceeds the work completion rate. So at some point, you have to start failing unless the clients are specifically coded to retry forever.
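The arithmetic behind that claim can be checked with a toy model. The sketch below is a deliberate simplification, not a description of how the Windows TCP stack actually schedules work: it steps one second at a time, and the rates and backlog sizes are hypothetical numbers picked for illustration.

```python
def simulate_backlog(arrival_rate, service_rate, backlog_limit, seconds):
    """Toy model: return (refused, queued) after `seconds` one-second steps.

    arrival_rate  - connection attempts per second (hypothetical)
    service_rate  - connections the server can accept per second (hypothetical)
    backlog_limit - maximum number of pending connections
    """
    queued = 0
    refused = 0
    for _ in range(seconds):
        queued += arrival_rate                # new attempts arrive
        queued -= min(queued, service_rate)   # server drains what it can
        if queued > backlog_limit:            # overflow is actively refused
            refused += queued - backlog_limit
            queued = backlog_limit
    return refused, queued


# Arrivals exceed capacity by 50/s: a backlog of 200 fills after 4 seconds.
print(simulate_backlog(150, 100, 200, 10))    # (300, 200)
# A 5x larger backlog only postpones the refusals.
print(simulate_backlog(150, 100, 1000, 10))   # (0, 500)
print(simulate_backlog(150, 100, 1000, 30))   # (500, 1000)
# Below capacity, nothing queues and nothing is refused.
print(simulate_backlog(90, 100, 200, 10))     # (0, 0)
```

In this model a larger backlog changes only when the first refusal happens, not whether it happens, as long as the arrival rate stays above the service rate.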