Ethernet – Serialization Times in 40G/10G and 100G/25G Ethernet


I've recently been involved in discussions about lowest-latency requirements for a Leaf/Spine (or CLOS) network to host an OpenStack platform.

System architects are striving for the lowest possible RTT for their transactions (block storage and future RDMA scenarios), and the claim was that 100G/25G offers greatly reduced serialization delays compared to 40G/10G. Everyone involved is aware that there are many more factors in the end-to-end game (any of which can hurt or help RTT) than just the serialization delays of the NICs and switch ports. Still, the topic of serialization delays keeps popping up, as they are one thing that is difficult to optimize without jumping a possibly very costly technology gap.

A bit over-simplified (leaving out the encoding schemes), serialization time can be calculated as number-of-bits / bit rate, which puts us at ~1.2μs for a full-size frame at 10G (also see wiki.geant.org).

For a 1518-byte frame, i.e. 12,144 bits:
at 10G (assuming 10×10^9 bit/s), this gives us ~1.2μs
at 25G (assuming 25×10^9 bit/s), this is reduced to ~0.48μs
at 40G (assuming 40×10^9 bit/s), one might expect ~0.3μs
at 100G (assuming 100×10^9 bit/s), one might expect ~0.12μs
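The numbers above follow directly from the naive formula; a minimal sketch (ignoring 64b/66b line-code overhead, preamble, and inter-frame gap):

```python
def serialization_time_us(frame_bytes: int, rate_gbps: float) -> float:
    """Time to clock a frame onto the wire: bits / bit rate, in microseconds.

    Ignores line-code overhead (64b/66b), preamble, and inter-frame gap.
    """
    return frame_bytes * 8 / (rate_gbps * 1e9) * 1e6

# 1518-byte frame at the four line rates discussed above
for rate in (10, 25, 40, 100):
    print(f"{rate:>3}G: {serialization_time_us(1518, rate):.3f} us")
# -> 10G: 1.214 us, 25G: 0.486 us, 40G: 0.304 us, 100G: 0.121 us
```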

Now for the interesting bit. At the physical layer, 40G is commonly implemented as 4 lanes of 10G and 100G as 4 lanes of 25G. Depending on the QSFP+ or QSFP28 variant, this is sometimes done with 4 pairs of fibre strands, and sometimes split by lambdas on a single fibre pair, where the QSFP module does some xWDM of its own. I do know that there are specs for 1× 40G, 2× 50G, or even 1× 100G lanes, but let's leave those aside for the moment.

To estimate serialization delays in the context of multi-lane 40G or 100G, one needs to know how 100G and 40G NICs and switch ports actually "distribute the bits to the (set of) wire(s)", so to speak. What is being done here?

Is it a bit like Etherchannel/LAG? The NIC/switch ports send frames of one "flow" (read: same hashing result of whatever hashing algorithm is used across whatever scope of the frame) across one given channel? In that case, we'd expect serialization delays like 10G and 25G, respectively. But essentially, that would make a 40G link just a LAG of 4×10G, reducing single-flow throughput to 1×10G.

Is it something like bit-wise round-robin? Each bit is round-robin distributed across the 4 (sub)channels? That might actually result in lower serialization delays because of parallelization, but raises some questions about in-order-delivery.

Is it something like frame-wise round-robin? Entire ethernet frames (or other suitably sized chunks of bits) are sent over the 4 channels, distributed in round robin fashion?

Is it something else entirely, such as…

Thanks for your comments and pointers.

Best Answer

The part that does the division into multiple lanes is called the Physical Coding Sublayer in the IEEE 802.3ba standard. This presentation by Gary Nicholl gives a good overview of it.

The short explanation is that the data is divided across the lanes in blocks of 64 bits each (encoded on the wire as 66 bits for clock recovery). Therefore, as soon as the packet size exceeds N×64 bits (= 32 bytes for 4 lanes), it can fully utilize all lanes. There will be some additional delay from the encoding, but that is probably implementation-specific.
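The block distribution can be sketched as a simple round-robin over those 64-bit blocks. This is a simplification of the actual 802.3ba PCS, which also inserts periodic per-lane alignment markers so the receiver can deskew and reorder the lanes; those are omitted here:

```python
# Simplified sketch of the 802.3ba PCS lane distribution: the encoded
# stream is cut into blocks (64 data bits, sent as 66 bits on the wire)
# and dealt round-robin across the physical lanes. Alignment markers,
# which the real PCS inserts per lane for deskew, are omitted.

def distribute_blocks(blocks, num_lanes=4):
    """Deal blocks round-robin across num_lanes lanes."""
    lanes = [[] for _ in range(num_lanes)]
    for i, block in enumerate(blocks):
        lanes[i % num_lanes].append(block)
    return lanes

# A 1518-byte frame is 12,144 bits -> 190 blocks of 64 data bits
# (the last block partially filled), spread nearly evenly over 4 lanes.
frame_bits = 1518 * 8
num_blocks = -(-frame_bits // 64)  # ceiling division -> 190
lanes = distribute_blocks(range(num_blocks))
print([len(lane) for lane in lanes])
# -> [48, 48, 47, 47]
```

This is why a single flow keeps the full aggregate rate: any frame longer than 32 bytes occupies all four lanes in parallel, unlike a LAG's per-flow hashing.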

This diagram is from the presentation linked above: Physical Coding Sublayer function