How to improve Intel X520-DA2 10Gb NIC throughput without Jumbo packets

Tags: central-processing-unit, intel, nic, windows-server-2008-r2

Here's what I have done so far:

Increasing the Rx/Tx buffers gives the biggest boost over the defaults. I set RSS Queues to 4 on each adapter, and set the starting RSS CPU on the second port to something other than 0 (it's 16 on this PC, which has 16 cores / 32 hyperthreads).
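
For reference, here is a minimal Python sketch that dumps those advanced properties straight from the registry so I can confirm what the driver actually stored. It assumes the standard NDIS keyword names (*ReceiveBuffers, *TransmitBuffers, *NumRssQueues, *RssBaseProcNumber); some drivers rename them, and the "X520" match on DriverDesc is just my guess at the description string, so adjust as needed. Run it elevated.

```python
# Sketch: dump the per-adapter advanced settings discussed above
# (*ReceiveBuffers, *TransmitBuffers, *NumRssQueues, *RssBaseProcNumber).
# Assumes the standard NDIS keyword names; some drivers use vendor-specific ones.
import winreg

NIC_CLASS = r"SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}"
KEYWORDS = ["*ReceiveBuffers", "*TransmitBuffers", "*NumRssQueues", "*RssBaseProcNumber"]

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NIC_CLASS) as cls:
    for i in range(winreg.QueryInfoKey(cls)[0]):          # iterate 0000, 0001, ...
        sub = winreg.EnumKey(cls, i)
        try:
            with winreg.OpenKey(cls, sub) as dev:
                desc = winreg.QueryValueEx(dev, "DriverDesc")[0]
                if "X520" not in desc:                     # adjust to your DriverDesc
                    continue
                print(desc)
                for kw in KEYWORDS:
                    try:
                        print(" ", kw, "=", winreg.QueryValueEx(dev, kw)[0])
                    except FileNotFoundError:
                        print(" ", kw, "= (not exposed by this driver)")
        except OSError:
            pass                                           # inaccessible subkey, skip
```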

From watching Process Explorer, I appear to be limited by the CPU's ability to handle the large number of incoming interrupts, even with RSS enabled. I am using a PCIe x8 (electrical) slot in 2.x mode; each of the two adapters connects over a 5 GT/sec x8 bus.
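
A quick sanity check (my arithmetic, not a measurement) that the slot itself is not the bottleneck:

```python
# PCIe 2.x uses 8b/10b encoding, so 5 GT/s per lane carries 4 Gbit/s of payload.
lanes = 8
per_lane_gbit = 5.0 * 8 / 10            # 4 Gbit/s usable per lane
slot_gbit = lanes * per_lane_gbit       # 32 Gbit/s per x8 Gen2 slot
card_need_gbit = 2 * 10                 # dual-port X520-DA2 at line rate
print(f"slot: {slot_gbit:.0f} Gbit/s, card needs: {card_need_gbit} Gbit/s")
# -> ~32 Gbit/s available vs 20 Gbit/s needed, so the bus is fine and the
#    interrupt/CPU load is the more likely limit.
```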

OS responsiveness does not matter; I/O throughput does. I am limited by the clients' inability to process Jumbo packets.

What settings should I try next?

Details: Dual Xeon E5-2665, 32 GB RAM, eight SSDs in RAID 0 (a RAM drive is used for NIC performance validation); 1 TB of data to be moved via IIS/FTP from 400 clients, ASAP.

In response to comments:

Actual read throughput is 650 MB/sec over a teamed pair of 10 Gb/sec links, into the RAM drive.
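
For scale, here is how far that is from the teamed links' theoretical ceiling (a rough calculation that ignores Ethernet/TCP framing overhead, so it is slightly optimistic):

```python
# Compare observed throughput against the raw capacity of two teamed 10 Gb links.
team_gbit = 2 * 10                      # two 10 Gbit/s links teamed
ceiling_mb = team_gbit * 1000 / 8       # ~2500 MB/s raw
observed_mb = 650
print(f"utilization: {observed_mb / ceiling_mb:.0%}")   # -> ~26%
```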

Antivirus and firewall are off, AFAICT. (I have fairly good control over what's installed on the PC, in this case. How can I be sure that no filters are reducing performance? I will have to follow up, good point.)

In Process Explorer, I see spells of time where the CPU keeps going (red, kernel time) but network and disk I/O are stopped.

Max RSS processors is at its default value, 16

Message-signaled interrupts are supported on both instances of the X520-DA2 device, with MessageNumberLimit set to 18. Here's what I see on my lowly desktop card:

[Screenshot: a way to check MSI support]

[Screenshot: Process Explorer summary]
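
For completeness, the same MSI details can be read programmatically instead of eyeballing regedit. This is a sketch that walks the PCI enum tree and reads the documented MSISupported / MessageNumberLimit values; the "X520" substring match on DeviceDesc is an assumption on my part, so adjust it to whatever your DeviceDesc actually says, and run it elevated.

```python
# Sketch: read MSISupported and MessageNumberLimit for the X520 ports from
# HKLM\SYSTEM\CurrentControlSet\Enum\PCI\...\Device Parameters\Interrupt Management.
import winreg

HKLM = winreg.HKEY_LOCAL_MACHINE
PCI = r"SYSTEM\CurrentControlSet\Enum\PCI"
MSI = r"Device Parameters\Interrupt Management\MessageSignaledInterruptProperties"

def subkeys(path):
    with winreg.OpenKey(HKLM, path) as k:
        for i in range(winreg.QueryInfoKey(k)[0]):
            yield path + "\\" + winreg.EnumKey(k, i)

for hwid in subkeys(PCI):
    try:
        for inst in subkeys(hwid):
            with winreg.OpenKey(HKLM, inst) as dev:
                desc = winreg.QueryValueEx(dev, "DeviceDesc")[0]
            if "X520" not in desc:                 # adjust to your DeviceDesc
                continue
            with winreg.OpenKey(HKLM, inst + "\\" + MSI) as msi:
                print(desc)
                print("  MSISupported       =", winreg.QueryValueEx(msi, "MSISupported")[0])
                print("  MessageNumberLimit =", winreg.QueryValueEx(msi, "MessageNumberLimit")[0])
    except OSError:
        pass    # device without the MSI key, or no read permission
```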

Best Answer

One of the problems with high-performance NICs is that the modern PC architecture has a bit of trouble keeping up. But in your case, this isn't so much the problem. Let me explain.

The CPU has to do a lot of work processing TCP packets. This affects the throughput. What's limiting things in your case is not the network hardware, but the ability of the server to saturate the network links.

In more recent times, we've seen processing move from the CPU to the NIC, such as checksum offload. Intel has also added features to help reduce the load further. That's cool, and I'm sure all of those offload features are turned on.

As you've alluded to, jumbo frames would help throughput somewhat, but not as much as RDMA.

Most 10GBit Ethernet hardware has a very nice, underutilized feature called RDMA, or remote direct memory access. It allows the NIC to do memory-to-memory copies over the network without the intervention of the CPU (well, OK, the CPU tells the NIC what to do and then the NIC does the rest). The trouble is, it's not used much yet, but it's getting there. Apparently, the most recent version, Microsoft Windows Server 2012, has something called SMB Direct, which uses RDMA. So, if you want to increase throughput, you want to use that.

Are you able to put together some test hardware, install Server 2012 on it, and see how it performs?
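
If you do spin up a 2012 test box, here is a quick sketch (my own, not anything official) for checking whether RDMA is actually visible end-to-end. It just shells out from Python to the Server 2012 cmdlets; neither cmdlet exists on Windows Server 2008 R2.

```python
# Sketch: call the Server 2012 cmdlets that report RDMA state.
# Get-NetAdapterRdma shows per-adapter RDMA capability/enablement;
# Get-SmbServerNetworkInterface shows whether SMB sees the interface as RDMA capable.
import subprocess

for cmd in ("Get-NetAdapterRdma", "Get-SmbServerNetworkInterface"):
    print("###", cmd)
    subprocess.run(["powershell.exe", "-NoProfile", "-Command", cmd], check=False)
```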

By the way, I'm not sure how much of a difference you will see at 10 Gbit, but fast RAM helps with RDMA, especially with 56 Gbit InfiniBand. In general, it's best to use the fastest RAM your server supports.

Also note this comment on the SMB Direct link I put above:

You should not team RDMA-capable network adapters if you intend to use the RDMA capability of the network adapters. When teamed, the network adapters will not support RDMA.


Update: It looks like not ALL 10GBit NICs support RDMA for some reason, so check your model's features first.

Another thought I had: the type of protocol being used for your testing may be affecting the results, i.e. protocol overhead on top of TCP overhead. I suggest you look into using something that can test without touching the hard drive at all, such as iperf. There is a Windows port of it somewhere.
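
If you want something self-contained in the meantime, here is a minimal memory-to-memory TCP test along the same lines. It is only a crude stand-in for iperf: a single Python stream will not saturate 10 GbE, but it is enough to compare settings against each other without the disk or FTP stack in the way. Port 5201 and the 10-second duration are arbitrary choices of mine.

```python
# Minimal memory-to-memory TCP throughput test, so the disk never gets touched.
import socket, sys, time

PORT, DURATION, CHUNK = 5201, 10, 1 << 20   # 1 MiB per send

def receive():
    # Server side: accept one connection and count received bytes per second.
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen(1)
        conn, peer = srv.accept()
        with conn:
            total, start = 0, time.time()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                total += len(data)
            secs = time.time() - start
            print(f"{peer[0]}: {total / secs / 1e6:.0f} MB/s over {secs:.1f} s")

def send(host):
    # Client side: blast zero-filled buffers at the server for DURATION seconds.
    buf = b"\0" * CHUNK
    with socket.create_connection((host, PORT)) as conn:
        end = time.time() + DURATION
        while time.time() < end:
            conn.sendall(buf)

if __name__ == "__main__":
    receive() if len(sys.argv) == 1 else send(sys.argv[1])
```

Run it with no arguments on the server and with the server's IP as the only argument on a client; for real numbers, iperf with several parallel streams is still the better tool.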
