DPDK Performance – Advances in Hardware That Increased DPDK Performance on Packet Processing


I've heard that the DPDK library will allow us to use more software switches/routers (and maybe even replace hardware ones completely), because it somehow dramatically increases packet-processing performance on x86 CPUs.

I've tried to do a little research, and most of what I've been able to understand is this: DPDK allows an application to bypass the Linux kernel, reducing some copying and the number of IRQs. Is that all? Or are there also hardware advances in the newest x86 CPUs? Maybe some new instructions? Maybe some dedicated hardware blocks? Or other architectural optimisations?
What changed in hardware?

P.S. If bypassing the Linux kernel is all that DPDK does, I don't understand what prevented these optimisations earlier, and why vendors needed to spend so much money on building ASICs and other stuff, when they could just write a new operating system or a library such as DPDK…

Best Answer

I've tried to do a little research, and most of what I've been able to understand is this: DPDK allows an application to bypass the Linux kernel, reducing some copying and the number of IRQs. Is that all?

DPDK is a software library. As @Zac67 mentioned, it has an EAL (Environment Abstraction Layer) which allows an application to send and receive IP packets without having to know about the hardware doing the actual packet forwarding. The kernel also provides this service to applications, but DPDK does it more efficiently.
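To make that concrete, here is a minimal sketch of how a DPDK application initialises the EAL at startup (the two API calls are real DPDK functions from recent releases; everything else, including error handling, is trimmed down for illustration):

    #include <stdio.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>

    int main(int argc, char **argv)
    {
        /* The EAL probes the NICs bound to DPDK, maps hugepage memory
         * and abstracts the hardware away from the application. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* From here the application talks to generic "ports" via the
         * ethdev API, regardless of the underlying NIC model. */
        printf("%u DPDK ports available\n", rte_eth_dev_count_avail());
        return 0;
    }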

The two major efficiency gains that DPDK offers over sending packets through the Linux kernel are:

  1. It avoids copying the same packet data multiple times, which is very inefficient.

  2. It operates on batches of packets to be cache efficient.

1: This makes a massive difference. Take a typical application on Linux that sends a packet: when the application calls the send() syscall, the packet is copied from user-space memory into kernel memory (into an skbuff). The skbuff is then copied by the kernel into another region of memory that is accessible to the NIC. The kernel signals to the NIC that there is a packet waiting to be sent, and the NIC copies the packet from this region of memory (via DMA transfer - Direct Memory Access) into its hardware tx-ring buffer. So to get a packet from an application into the NIC tx-ring buffer, the same packet is copied three times. DPDK allows the NIC to DMA the packet directly from the application's memory space (this works by DPDK disconnecting the NIC from the kernel and mapping the DMA memory space into the user-land memory space the application is using, hence "kernel bypass").
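As a rough sketch of how few steps the DPDK transmit path involves, it looks something like the snippet below (assuming a port and an mbuf mempool were set up elsewhere; the function name is illustrative, the rte_* calls are real DPDK):

    #include <rte_mbuf.h>
    #include <rte_ethdev.h>
    #include <rte_memcpy.h>

    void send_one(uint16_t port_id, struct rte_mempool *pool,
                  const void *payload, uint16_t len)
    {
        /* The mbuf lives in hugepage memory that is mapped into the
         * application AND visible to the NIC for DMA. */
        struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
        if (m == NULL)
            return;

        /* The packet data is written exactly once... */
        rte_memcpy(rte_pktmbuf_mtod(m, void *), payload, len);
        m->data_len = len;
        m->pkt_len  = len;

        /* ...and the NIC DMAs it straight from this buffer:
         * no syscall, no skbuff, no kernel copies. */
        if (rte_eth_tx_burst(port_id, 0, &m, 1) == 0)
            rte_pktmbuf_free(m);   /* tx queue full, don't leak the mbuf */
    }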

2: By processing packets in batches DPDK can make very good use of the CPU cache. The first packet to be processed will incur an instruction cache miss while the i-cache warms up. All subsequent packets within that "batch" will be processed very quickly using the same instructions, which are now "hot" in the i-cache.
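This is why DPDK applications are typically built around a poll-mode burst loop, a simplified sketch of which (modelled loosely on DPDK's l2fwd-style examples, initialisation omitted) looks like this:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    void lcore_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Pull up to 32 packets from the NIC rx-ring in one call. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

            /* Every packet in the burst runs the same code path, so after
             * the first one the instructions are hot in the i-cache. */
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... process bufs[i] ... */
            }

            /* Push the whole burst back out in one call. */
            uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);
            while (nb_tx < nb_rx)
                rte_pktmbuf_free(bufs[nb_tx++]);  /* free unsent packets */
        }
    }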

There are many other software improvements in DPDK over the standard kernel IP stack; however, these two have the biggest impact on DPDK's performance.

Or are there also hardware advances in the newest x86 CPUs? Maybe some new instructions? Maybe some dedicated hardware blocks? Or other architectural optimisations? What changed in hardware?

Yes, there are hardware improvements, such as new copy instructions on x86 CPUs which are more efficient, and DPDK supports these (e.g. the SSE3 instruction set). The major benefits come from how efficient the software is, though, and how well the software makes use of good hardware (e.g. making good use of the fast CPU L1/L2 i-cache and d-cache). DPDK has support for various IPsec and crypto features, so if you have an Intel CPU with the AES-NI instruction set ("Advanced Encryption Standard - New Instructions"), DPDK will use those instructions for faster AES encryption/decryption.
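DPDK selects these optimised code paths by probing the CPU at runtime. A sketch of the kind of check involved (the flag-query API and RTE_CPUFLAG_* names are from DPDK's x86 build; the printout is purely illustrative):

    #include <stdio.h>
    #include <rte_cpuflags.h>

    void report_cpu_features(void)
    {
        if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES))
            printf("AES-NI present: hardware-accelerated AES available\n");

        if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
            printf("AVX2 present: wide vector copies available\n");
    }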

P.S. If bypassing the Linux kernel is all that DPDK does, I don't understand what prevented these optimisations earlier,

The kernel could be sending a packet over an Ethernet link, so it will build Ethernet headers for the IP packet, or the packet could go over PPP, L2TP, IPsec, GRE, MPLS, ATM, dial-up, etc. The Linux kernel is very flexible and can do many things, which means it has to make many checks and burn CPU cycles accommodating the many different networking scenarios it supports (a Linux host could be a router, firewall, switch, proxy, etc.). DPDK supports IPv4 and IPv6 over Ethernet and that's about it (OK, IPsec too, but not ATM, for example!). DPDK provides only a subset of the features the kernel does, so the code is more focused on its task. As a side note, the process for getting updates approved into the Linux kernel is trickier, but improvements are being made to the Linux kernel all the time (see below).

and why vendors needed to spend so much money on building ASICs and other stuff, when they could just write a new operating system or a library such as DPDK...

ASICs are hardware that is tuned/dedicated to a specific task (as per the name: Application-Specific Integrated Circuits). DPDK will never be as fast as an ASIC (strictly, that sentence doesn't quite make sense, as it compares hardware to software). ASICs are designed and built to do one specific task as efficiently as possible. DPDK runs on general-purpose x86 CPUs, which have many weird and wonderful features because they could be used for any task/workload. DPDK tries to make the best use of a general-purpose x86 processor for the specific task of network processing, whereas an ASIC is pre-built to be efficient at network processing.

Side note on Linux networking speed:
I've been writing this tool (still a work in progress) to compare the different Kernel methods for sending packets: https://github.com/jwbensley/EtherateMT
When using one CPU core and a frame size of 64 bytes, I get the following frames-per-second rates for the different kernel methods:
send() 1.00Mfps
sendmsg() 1.01Mfps
sendmmsg() 1.2Mfps
TPACKET_V2 1.3Mfps
TPACKET_V3 1.3Mfps

To put those numbers in perspective, 10Gbps line rate with 64-byte frames is 14.88Mpps (each 64-byte frame carries 20 bytes of preamble and inter-frame gap overhead, so 84 bytes = 672 bits on the wire; 10^10 / 672 ≈ 14.88M frames per second), which DPDK can achieve on a single CPU core. So the native kernel path is about an order of magnitude slower. However, TPACKET_V4 has just been released, which includes a zero-copy option and should bring major speed increases.

EDIT: I forgot to say, there are downsides to DPDK. You need a NIC that is explicitly supported by DPDK, for example, because DPDK uses its own user-space drivers. Hard work is being done to grow the list of supported NICs all the time. Linux, by contrast, supports thousands more NICs, because it is so flexible and has a standard method of driver implementation, so any NIC with a compliant driver should work. This is another trade-off for the speed gained by using DPDK.
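For reference, handing a NIC over to DPDK is done with the dpdk-devbind.py tool shipped with DPDK (the PCI address below is just an example):

    # Unbind the NIC from its kernel driver and bind it to vfio-pci,
    # which exposes the device to DPDK's user-space poll-mode driver
    dpdk-devbind.py --bind=vfio-pci 0000:01:00.0

    # Confirm which devices are kernel-managed vs. DPDK-managed
    dpdk-devbind.py --status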

EDIT 2:

But doesn't this "batch approach" to packet processing lead to increased latency?

The Linux kernel has batch processing for packets built in. You can turn this off, so that for every packet received by the NIC a hardware interrupt is generated and the kernel is signalled to read and process the packet. This has a high processing overhead but, as you say, it will reduce latency. It will also reduce throughput: because of the additional overhead of processing every packet separately, fewer packets per second can be processed.

One can use interrupt coalescing on Linux to batch-process packets only when needed. Interrupt coalescing automatically tunes the frequency at which hardware interrupts are generated; hardware interrupts can be rate-limited to fire only once every 30µs, for example, when there are one or more packets received by the NIC waiting for the kernel to process them. This increases throughput, but also latency. Interrupt coalescing can raise or lower the interrupt rate limit automatically depending on how many packets are arriving at the NIC waiting to be processed. If you need very low latency, as financial traders do, you might turn interrupt coalescing off, sacrificing throughput to reduce latency.
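On Linux, interrupt coalescing is tuned per NIC with ethtool; for example (the device name eth0 is just a placeholder):

    # Fire at most one rx interrupt every 30 microseconds
    ethtool -C eth0 rx-usecs 30

    # Let the driver adapt the coalescing rate to the traffic level
    ethtool -C eth0 adaptive-rx on

    # Low-latency setup: one interrupt per packet (coalescing effectively off)
    ethtool -C eth0 adaptive-rx off rx-usecs 0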

When comparing batch processing vs. per-packet processing within the same packet-processing framework, such as the Linux kernel, there is a latency-vs-throughput trade-off, as I have just described. However, in relation to this NE question, the batch processing in DPDK is so efficient that, when comparing DPDK's batch packet processing to both batch and per-packet processing in the Linux kernel, DPDK is still about 10x faster with lower latency. The key point is that DPDK is purpose-written high-performance code, whereas the Linux kernel runs on millions of machines across the world, meaning it is very flexible in exchange for performance.