I think you are misinterpreting the initial ACK. That first segment is 4150 bytes long, which is far more than fits into a standard segment. This typically indicates that the host is doing TCP segmentation offload (also called large send offload): everything above the NIC, including pcap, sees one large segment, but the NIC actually splits it into smaller segments that fit the MSS/MTU limits.
So here, instead of sending one segment with 4096 bytes of data, you are probably sending three: two of 1460 bytes and one of 1176 bytes (or something like that). So frame 238 in this capture is really the 4th segment sent on the wire.
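To make the arithmetic concrete, here is a minimal sketch of how one large send would be carved up. The function name is made up for illustration, and the MSS of 1460 bytes is an assumption (the usual value for a 1500-byte Ethernet MTU without TCP options); the real split is done by the NIC and may differ.

```python
def split_into_segments(payload_len, mss=1460):
    """Return the payload sizes of the on-wire segments for one large send.

    Hypothetical helper: mss=1460 is an assumed value, not read from the capture.
    """
    sizes = []
    remaining = payload_len
    while remaining > 0:
        size = min(mss, remaining)  # each segment carries at most one MSS
        sizes.append(size)
        remaining -= size
    return sizes


print(split_into_segments(4096))  # [1460, 1460, 1176]
```

The 4150 bytes shown in the capture would then be the 4096 bytes of data plus 54 bytes of Ethernet, IP and TCP headers, assuming no TCP options.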
The series of ACKs looks bizarre, as if packets were being delivered out of order in this network. It would have seemed more logical if you had first received segments 243 & 244 (you got the segment for bytes 1461 to 5557, so you SACK that), then the first packet arrived and got ACKed (packet 239), and then at packet 245 everything gets caught up.
I honestly don't see how you can end up with this particular sequence of segments, unless:
- There are multiple paths of different latency being taken by packets from the client to the server, and
- There is something which is fiddling with the initial ACK, sending it even though the server hasn't actually received that first segment.
Could there be a WAN optimization device such as a Riverbed somewhere between client and server? They have mechanisms which try to ACK some segments as quickly as possible in order to accelerate the session's throughput ramp-up.
I may be completely wrong here; that's a really unusual sequence!
Assuming this is a client-side capture, you need to take into account that there is a round-trip delay between your client and server, so these timestamps show when the client sent/received packets, not when the server received/sent them.
E.g. if the client sends an ACK at T=0ms and it arrives at the server at T=50ms, then even if the server responds in less than 1ms, the response is sent at about T=50ms and arrives at the client at e.g. T=110ms (since the delay may not be constant and can also vary per direction). So in your client-side capture you would see the ACK at T=0ms and the next data at T=110ms, but that doesn't mean the server waited 110ms to respond. It simply means that the total time (transmission of ACK + processing on server + transmission of data) was 110ms.
So to know for sure, you would need to capture on the server (preferably at the same time as a capture on the client). Without a server-side capture we can only guess, but it seems quite likely that you have a delay of around 20ms between your client and your server, and hence an RTT (round-trip time) of around 40ms (but again, the delay can be different for each packet).
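The same example written out as a small sketch. The function and parameter names are made up, and the 50ms/60ms one-way delays are just the illustrative numbers from above, not measurements.

```python
def client_side_timeline(ack_sent_ms, uplink_ms, server_processing_ms, downlink_ms):
    """Return when each event happens, all on the client's clock (hypothetical helper)."""
    ack_arrives = ack_sent_ms + uplink_ms            # ACK reaches the server
    data_sent = ack_arrives + server_processing_ms   # server reacts
    data_arrives = data_sent + downlink_ms           # data reaches the client
    return {
        "ack_sent": ack_sent_ms,
        "ack_arrives_at_server": ack_arrives,
        "data_sent_by_server": data_sent,
        "data_arrives_at_client": data_arrives,
        # this gap is all a client-side capture can show
        "gap_seen_by_client": data_arrives - ack_sent_ms,
    }


timeline = client_side_timeline(0, uplink_ms=50, server_processing_ms=0, downlink_ms=60)
print(timeline["gap_seen_by_client"])  # 110
```

A client-side capture only exposes `ack_sent` and `data_arrives_at_client`; the two values in between are invisible without a capture on the server.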
Edit: the above values of 20/40ms were based on the timestamps you mentioned, 15:40:31.864 to 15:40:31.901 (which is actually only 37ms). But really you need to look at the first ACK (sent at 15:40:31.862, frame 33) and the data packet in frame 47, received at 15:40:31.922, so the RTT there was 60ms.
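If you want to check such gaps without doing the subtraction by hand, a sketch like the following works. The helper name is made up, and it assumes the timestamps are written with a dot before the milliseconds.

```python
from datetime import datetime, timedelta


def gap_ms(start, end, fmt="%H:%M:%S.%f"):
    """Milliseconds between two same-day capture timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta // timedelta(milliseconds=1)


print(gap_ms("15:40:31.864", "15:40:31.901"))  # 37
print(gap_ms("15:40:31.862", "15:40:31.922"))  # 60 (frame 33 to frame 47)
```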
Edit after Update 1 & 2
As already mentioned in the comments, I don't think that the server sent 8 segments and then paused. I think it sent 10 segments (frames 31-32, 34-35, 37-38, 40-41, 43-44), then paused until it received frame 33 (the ACK to the first 2 segments); after it received this ACK it sent frames 47 & 48, etc. Again, this is an assumption and a server-side capture would be required to confirm it, but this seems like quite normal, expected behaviour to me.
As for what happens later on in the connection, without looking at actual captures it seems reasonable to assume that because of TCP Slow Start the congestion window (cwnd) will increase in size, so the server can send more segments before waiting for an ACK. At some point the cwnd is large enough for the server to keep sending data without waiting for ACKs (as long as the ACKs keep being received at a relatively steady rate).
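As a rough illustration of that growth, here is a very simplified sketch. It assumes an initial window of 10 segments (matching the 10 segments counted above), one ACK per segment, no loss, and that the slow start threshold is never reached, so the window doubles every round trip. A real stack with delayed ACKs grows more slowly, and the function name is made up.

```python
def slow_start_windows(initial_segments=10, round_trips=5):
    """Congestion window (in segments) at the start of each round trip.

    Idealized model: no loss, no delayed ACKs, ssthresh never reached.
    """
    cwnd = initial_segments
    history = []
    for _ in range(round_trips):
        history.append(cwnd)
        cwnd *= 2  # every ACKed segment adds one segment to the window
    return history


print(slow_start_windows())  # [10, 20, 40, 80, 160]
```

With an RTT of around 60ms, the pauses between bursts would under these assumptions disappear once the window covers everything the path can carry in one round trip.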
Best Answer
Ok, after some investigation I learnt that
1) data in the SYN handshake is legal, and
2) the weird segments are duplicate acks and receive window updates rolled into one: the first chunk of 256 bytes of payload was in fact dropped somewhere in the Linux kernel between receiving the frame and delivering it further up the stack.
Because of the window update, this does not trigger fast retransmit in our software, and the expected behaviour (since we don't implement SACK) is timeout retransmit. Alas, I commented out the segment retransmission function while debugging it and forgot to revert the changes.
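For readers wondering why a window update matters here: as far as I understand RFC 5681, an ACK only counts as a duplicate for fast retransmit if, among other things, the advertised window is unchanged. The sketch below spells out that check. It is not the code from our stack, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AckInfo:
    """Fields of an incoming segment relevant to the check (hypothetical type)."""
    ack_no: int
    window: int
    payload_len: int = 0
    syn: bool = False
    fin: bool = False


def is_duplicate_ack(seg, last_ack_no, last_window, outstanding_bytes):
    """Duplicate-ACK test following the RFC 5681 definition."""
    return (
        outstanding_bytes > 0            # we have unacknowledged data in flight
        and seg.payload_len == 0         # the segment carries no data
        and not seg.syn and not seg.fin  # SYN and FIN are both off
        and seg.ack_no == last_ack_no    # same ACK number as the last ACK
        and seg.window == last_window    # advertised window did not change
    )


# Same ACK number but a different window: treated as a window update,
# so it does not count towards the three duplicates needed for fast retransmit.
print(is_duplicate_ack(AckInfo(ack_no=1000, window=6000), 1000, 5000, 256))  # False
print(is_duplicate_ack(AckInfo(ack_no=1000, window=5000), 1000, 5000, 256))  # True
```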
I hope these findings are of some use to other people who stumble upon similar issues.