The main principle behind interrupt moderation is to generate fewer than one interrupt per received frame (or per transmit frame completion), reducing the OS overhead incurred when servicing interrupts. The BCM5709 controller supports a couple of hardware methods for coalescing interrupts, including:
- Generate an interrupt after receiving X frames (rx-frames in ethtool)
- Generate an interrupt when no more frames are received after X usecs (rx-usecs in ethtool)
The problem with using these hardware methods is that you have to tune them for either throughput or latency; you can't have both. Generating one interrupt for each received frame (rx-frames = 1) minimizes latency, but it does so at a high cost in interrupt service overhead. Setting a larger value (say rx-frames = 10) reduces the number of CPU cycles consumed, because only one interrupt is generated for every ten frames received, but the first frames in each group of ten will see higher latency.
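To make the trade-off concrete, here is a rough sketch of the two hardware rules as described above. It is a toy model, not the BCM5709's actual logic: the function name `simulate` and the arrival pattern are made up for illustration, and the timer is modelled as "fire once no new frame has arrived for rx_usecs".

```python
def simulate(arrivals_us, rx_frames, rx_usecs):
    """Return (interrupt_count, worst_frame_latency_us) for a toy coalescer.

    An interrupt fires when rx_frames frames are pending, or when rx_usecs
    have passed since the most recent frame without a new one arriving.
    """
    interrupts = 0
    worst = 0
    batch = []  # arrival times of frames still waiting for an interrupt

    def fire(now):
        nonlocal interrupts, worst
        interrupts += 1
        # The oldest pending frame has waited the longest.
        worst = max(worst, now - batch[0])
        batch.clear()

    for t in arrivals_us:
        # Idle timer expired before this frame showed up.
        if batch and t - batch[-1] > rx_usecs:
            fire(batch[-1] + rx_usecs)
        batch.append(t)
        # Frame-count threshold reached.
        if len(batch) >= rx_frames:
            fire(t)
    if batch:
        # Trailing frames are flushed by the idle timer.
        fire(batch[-1] + rx_usecs)
    return interrupts, worst


# Hypothetical traffic: 20 frames spaced 10 usecs apart.
arrivals = list(range(0, 200, 10))
print(simulate(arrivals, rx_frames=1, rx_usecs=50))   # (20, 0)
print(simulate(arrivals, rx_frames=10, rx_usecs=50))  # (2, 90)
```

With rx-frames = 1 every frame is delivered immediately at the cost of 20 interrupts; with rx-frames = 10 there are only 2 interrupts, but the first frame of each group waits 90 usecs in this model.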
The NAPI implementation attempts to leverage the fact that traffic comes in bursts: you generate an interrupt immediately on the first frame received, then immediately switch into polling mode (i.e., disable interrupts) because more traffic will be close behind. After the driver has polled for some number of frames (16 or 64 in your question) or for some time interval, it re-enables interrupts and starts over.
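The interrupt-then-poll cycle can be sketched roughly as follows. This is a simplified simulation, not kernel code: `napi_receive` and the burst sizes are hypothetical, and the exit rule assumed here is that polling stops when a poll round handles fewer frames than the budget (i.e. the ring has been drained).

```python
from collections import deque


def napi_receive(bursts, budget=64):
    """Count interrupts and poll rounds for a list of burst sizes (toy model)."""
    interrupts = 0
    polls = 0
    for burst in bursts:
        ring = deque(range(burst))  # frames sitting in the RX ring
        if not ring:
            continue
        interrupts += 1             # the first frame raises the interrupt
        irq_enabled = False         # handler masks RX interrupts and starts polling
        while not irq_enabled:
            polls += 1
            work_done = 0
            while ring and work_done < budget:
                ring.popleft()      # "process" one frame
                work_done += 1
            if work_done < budget:  # ring drained: leave polling mode
                irq_enabled = True
    return interrupts, polls


# Hypothetical bursts of 1, 100 and 64 frames with a budget of 64.
print(napi_receive([1, 100, 64], budget=64))  # (3, 5)
```

In this model 165 frames cost 3 interrupts and 5 poll rounds, versus 165 interrupts with rx-frames = 1 and no polling.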
If you have a predictable workload, you can pick fixed values for any of the above (NAPI, rx-frames, rx-usecs) that give you the right trade-off, but most workloads vary and you end up making some sacrifices. This is where adaptive-rx/adaptive-tx come into play. The idea is that the driver constantly monitors the workload (frames received per second, frame size, etc.) and tunes the hardware interrupt coalescing scheme to optimize for latency in low-traffic situations and for throughput in high-traffic situations. It's a cool theory but may be difficult to implement in practice. Only a few drivers implement it (see http://fxr.watson.org/fxr/search?v=linux-2.6&string=use_adaptive_rx_coalesce), and the bnx2/e1000 drivers aren't on that list.
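As an illustration of the adaptive idea only, a driver-style policy could look something like the sketch below. The function name, the rate thresholds, and the usecs range are all invented for this example; they are not taken from bnx2, e1000, or any other driver.

```python
def adapt_rx_usecs(frames_per_sec, low_rate=10_000, high_rate=100_000,
                   min_usecs=0, max_usecs=100):
    """Pick an rx-usecs value from the observed frame rate (hypothetical policy).

    Low traffic  -> small delay (favor latency).
    High traffic -> large delay (favor throughput).
    In between   -> linear interpolation.
    """
    if frames_per_sec <= low_rate:
        return min_usecs
    if frames_per_sec >= high_rate:
        return max_usecs
    span = high_rate - low_rate
    return min_usecs + (max_usecs - min_usecs) * (frames_per_sec - low_rate) // span


for rate in (5_000, 55_000, 200_000):
    print(rate, adapt_rx_usecs(rate))
```

A real implementation would also have to decide how often to sample the rate and how to avoid oscillating between settings, which is part of why this is hard to get right in practice.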
For a good description of how each ethtool coalescing field is supposed to work, have a look at the definitions for the ethtool_coalesce structure at the following address:
http://fxr.watson.org/fxr/source/include/linux/ethtool.h?v=linux-2.6#L111
For your particular situation (~400Mb/s throughput) I'd suggest tuning the rx-frames and rx-usecs values to find the best settings for your workload. Look at both the overhead of the ISR and the sensitivity of your application (httpd? etc.) to latency.
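If you want to step through a few candidate settings, a small helper like the one below can generate the `ethtool -C` command lines to try. The interface name `eth0` and the value pairs are placeholders, not recommendations; the script only prints the commands rather than running them.

```python
def coalesce_command(ifname, rx_usecs, rx_frames):
    """Build the argv for setting RX coalescing with ethtool -C."""
    return ["ethtool", "-C", ifname,
            "rx-usecs", str(rx_usecs),
            "rx-frames", str(rx_frames)]


# Placeholder candidates: (rx-usecs, rx-frames). Measure ISR overhead and
# application latency after applying each one.
candidates = [(0, 1), (20, 5), (50, 10)]
for usecs, frames in candidates:
    print(" ".join(coalesce_command("eth0", usecs, frames)))
```

Running `ethtool -c eth0` (lowercase c) afterwards shows the values the driver actually accepted.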
Dave
hifn-based crypto accelerator hardware has been in use in BSD for a while now; a quick Google search shows Linux drivers as well. The Express DX 1845 card boasts 25Gbps throughput in its brochure, but YMMV, and obviously I'd want to talk to a product/sales engineer first to see if it would work for your purposes.
Best Answer
In case someone else is trying to find out how to make the Linux TCP/IP networking stack scale across multiple CPU cores...
MSI can be exploited by two underlying NIC technologies, RSS and VMDq, to distribute packets across multiple queues. Each NIC queue is serviced by its own interrupt on a dedicated CPU core to achieve scalability.
The problem with RSS is that it always uses the source IP to generate the hash, and the hash determines which queue a packet goes to. This means you can't control which packets go to which queue unless you also control the source IPs.
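A rough sketch of the hash-to-queue idea follows. It is only an illustration: real NICs typically use a Toeplitz hash plus an indirection table, whereas this example swaps in `zlib.crc32` over the source address as a stand-in, and `rss_queue` is a made-up name.

```python
import ipaddress
import zlib


def rss_queue(src_ip, num_queues):
    """Map a source IP to a queue index (CRC32 stand-in, NOT the real RSS hash)."""
    key = ipaddress.ip_address(src_ip).packed  # raw address bytes
    return zlib.crc32(key) % num_queues


# The same source always lands on the same queue; the receiver has no say in it.
for ip in ("192.0.2.10", "192.0.2.11", "192.0.2.10"):
    print(ip, "-> queue", rss_queue(ip, num_queues=4))
```

The point is that the queue is a pure function of fields set by the sender, so steering traffic means changing what the sender puts on the wire.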
VMDq seems to be more appropriate to my problem, because it distributes packets by destination MAC address. It could be as simple as assigning two different IP addresses to the same interface.
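For comparison, queue selection by destination MAC boils down to a table lookup, roughly as sketched here. The table contents, the `vmdq_queue` name, and the default-queue fallback are assumptions for illustration; how the table is actually programmed depends on the NIC and driver.

```python
def vmdq_queue(dst_mac, mac_table, default_queue=0):
    """Pick a queue from the destination MAC (toy lookup, hypothetical table)."""
    return mac_table.get(dst_mac.lower(), default_queue)


# Hypothetical filter table: one queue per configured MAC address.
mac_table = {
    "02:00:00:00:00:01": 1,
    "02:00:00:00:00:02": 2,
}

print(vmdq_queue("02:00:00:00:00:01", mac_table))  # 1
print(vmdq_queue("02:00:00:00:00:99", mac_table))  # 0 (no match, default queue)
```

Unlike the hash case, the receiver owns this table, so it decides which traffic goes to which queue.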
Source: