Ethernet – How can a TCP window size be larger than the maximum size of an Ethernet packet?

ethernet · layer4 · protocol-theory · tcp · transport-protocol

I know that TCP window sizes can be scaled to over 64 KB, but looking at an Ethernet packet diagram, such as this one:

[Diagram: Network Layer model]

it looks like a layer 2 frame is limited to a much smaller size. How does ACKing work at the TCP layer if a single TCP request requires several network packets to be assembled at the receiver?

Best Answer

The TCP window size is generally independent of the maximum segment size, which depends on the maximum transmission unit, which in turn depends on the maximum frame size.

Let's start low.

The maximum frame size is the largest frame a network (segment) can transport. For Ethernet, this is 1518 bytes by definition.

The frame encapsulates an IP packet, so the largest packet - the maximum transmission unit (MTU) - is the maximum frame size minus the frame overhead. For Ethernet, that's 1518 - 18 = 1500 bytes.

The IP packet encapsulates a TCP segment, so the maximum segment size (MSS) is the MTU minus the IP overhead and the TCP overhead (the MSS doesn't include the TCP header). For Ethernet and TCP over IPv4 without options, this is 1500 - 20 (IPv4 overhead) - 20 (TCP overhead) = 1460 bytes.
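To sanity-check that arithmetic, here's a quick worked example in Python - the constants are exactly the values quoted above, for untagged Ethernet with option-less IPv4 and TCP headers:

```python
# Sizes for classic untagged Ethernet with option-less IPv4 and TCP headers.
MAX_FRAME = 1518       # maximum Ethernet frame size in bytes
FRAME_OVERHEAD = 18    # 14-byte Ethernet header + 4-byte FCS

MTU = MAX_FRAME - FRAME_OVERHEAD   # 1500 bytes
MSS = MTU - 20 - 20                # minus IPv4 header, minus TCP header

print(f"MTU = {MTU} bytes, MSS = {MSS} bytes")  # MTU = 1500 bytes, MSS = 1460 bytes
```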

Now, TCP is a transport protocol that presents itself as a stream socket to the application. That means an application can transmit any arbitrarily sized amount of data across that socket - in a single flow or in any number of chunks. For that, TCP splits the data stream into said segments (0 to MSS bytes long {1}), transmits each segment over IP, and puts them back together at the destination. On the application level, the data stream looks exactly the same at the source and at the destination. All segmentation, packetization, fragmentation, etc. are completely transparent.
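A toy sketch of that segmentation step - the helper `segment` below is purely illustrative; real stacks do this inside the kernel's TCP implementation:

```python
def segment(stream: bytes, mss: int = 1460) -> list[bytes]:
    """Split an application byte stream into at-most-MSS-sized chunks."""
    return [stream[i:i + mss] for i in range(0, len(stream), mss)]

data = b"x" * 5000                      # a single 5000-byte application write
print([len(s) for s in segment(data)])  # [1460, 1460, 1460, 620]
```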

TCP segments are acknowledged by the destination to guarantee delivery. Now imagine the source node sent only a single segment, waited for its acknowledgment, and then sent the next segment (send window = MSS). Regardless of the actual bandwidth, the throughput of this TCP connection would be limited to one packet per round-trip time (RTT, the time it takes for a packet to travel from source to destination and back again).

So, if you had a 1 Gbit/s connection between two nodes with an RTT of 10 ms, you could effectively send 1460 bytes every 10 ms, or 146 kB/s. That's not very satisfying.
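That figure falls straight out of the numbers - one MSS-sized segment per round trip:

```python
MSS = 1460    # bytes per segment
RTT = 0.010   # 10 ms round-trip time, in seconds

throughput = MSS / RTT                  # one segment per RTT
print(f"{throughput / 1000:.0f} kB/s")  # 146 kB/s
```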

TCP therefore uses a send window - multiple segments that can be "in flight" at the same time, sent out and awaiting acknowledgment. It's also called a sliding window because it advances each time the segment at the front of the window is acknowledged, triggering the sending of the next segment the window advances to. This way, the segment size doesn't matter; only the send window size does. With the traditional window of 64 KiB, we can have that amount in flight and accordingly transport 64 KiB every 10 ms ≈ 6.5 MB/s. Better, but still not really satisfying for a gigabit connection.
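The same back-of-the-envelope calculation gives the windowed throughput, and also shows how large the window would need to be to actually fill that gigabit link (the bandwidth-delay product):

```python
WINDOW = 64 * 1024   # 64 KiB in flight
RTT = 0.010          # seconds

print(f"{WINDOW / RTT / 1e6:.2f} MB/s")  # 6.55 MB/s - the ~6.5 MB/s above

# To keep a 1 Gbit/s link busy, the send window must cover the
# bandwidth-delay product:
bdp = (1e9 / 8) * RTT
print(f"required window: {bdp / 1e6:.2f} MB")  # 1.25 MB
```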

Modern TCP uses the window scale option, which can increase the send window exponentially to up to 64 KiB * 2^14 = 1 GiB, providing for some future growth.
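The scaling itself is just a bit shift negotiated during the connection handshake; a minimal sketch, treating the 16-bit window field as a full 64 KiB as the text above does:

```python
MAX_SHIFT = 14               # largest scale factor the option allows
base = 64 * 1024             # 16-bit window field, ~64 KiB
scaled = base << MAX_SHIFT   # 64 KiB * 2**14

print(f"{scaled / 2**30:.0f} GiB")  # 1 GiB
```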

But why isn't all data just sent at once and why do we need this send window?

If you sent everything as fast as you - locally - could, and there were (very likely) a slower link somewhere along the path to the destination, significant amounts of data would need to be queued. No switch or router is able to buffer more than a few MB (if at all), so the excess traffic would need to be dropped. Failing acknowledgment, it would need to be resent, with the excess being dropped again. This would be highly inefficient and would severely congest the network. TCP handles this problem with its congestion control, adjusting the window size according to the (estimated!) effective bandwidth and current round-trip time in a complex algorithm.
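The actual algorithms (Reno, CUBIC, BBR, ...) are far more involved, but a heavily simplified additive-increase/multiplicative-decrease step shows the basic shape - grow the window gradually while acknowledgments arrive, halve it on loss:

```python
def aimd_step(cwnd: float, mss: float, acked: bool) -> float:
    """One heavily simplified congestion-avoidance step (AIMD)."""
    if acked:
        return cwnd + mss * mss / cwnd  # grow by ~1 MSS per RTT's worth of ACKs
    return max(cwnd / 2, mss)           # halve on loss, floor at 1 MSS

cwnd = 1460.0
for lost in (False, False, False, True, False):
    cwnd = aimd_step(cwnd, 1460.0, acked=not lost)
    print(f"cwnd = {cwnd:.0f} bytes")
```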

{1} Empty segments can be used to prevent connection timeouts, using the keepalive option. Thanks to Deduplicator.