Well. Since you've captured packets showing that your Trading Server is sending TCP ACKs with a window size of 0, you at least know the problem is definitely on your side. That is actually a good thing, because you are in a position to fix it. (There is one thing that might be a problem on their end instead; I'll talk about that later.)
You've also traced the issue to happening during times of increased throughput, also a good thing.
You said the CPU/RAM usage on your Trading Server looked normal. Is the application you are using by chance configured to use a limited amount of RAM on the host OS? Maybe a limited percentage? If so, it would stand to reason that as you had more connections and more throughput, there was less RAM available to the application, and therefore fewer resources available for TCP.
Either way, what OS is your Trading Server using? If you haven't already, you should look into tuning the OS to dedicate more RAM to TCP. In Windows, there are Registry values you can modify. In Linux, there are config files you can edit.
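As a rough illustration for the Linux case: the kernel exposes its per-socket TCP receive buffer limits in `/proc/sys/net/ipv4/tcp_rmem` as three numbers (minimum, default, maximum, in bytes). The sketch below only reads and parses that file so you can see what you are starting from; the helper names `parse_tcp_rmem` and `read_tcp_rmem` are made up for this example, and it does not change any setting.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class TcpRmem(NamedTuple):
    """Receive buffer limits in bytes, as listed in tcp_rmem."""
    min_bytes: int
    default_bytes: int
    max_bytes: int


def parse_tcp_rmem(text: str) -> TcpRmem:
    """Parse the 'min default max' triple found in tcp_rmem."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    return TcpRmem(*(int(f) for f in fields))


def read_tcp_rmem(path: str = "/proc/sys/net/ipv4/tcp_rmem") -> Optional[TcpRmem]:
    """Return the current limits, or None if the file is absent (non-Linux host)."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_tcp_rmem(p.read_text())


if __name__ == "__main__":
    print(read_tcp_rmem())
```

If the maximum turns out to be small compared to the traffic bursts you see, that is a hint the receive buffers are worth tuning.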
It would also be wise to make sure that neither your Firewall nor anything else in between is trying to proxy your TCP sessions. That way you know you are dealing with the full "client to server" TCP connection, and not something in between.
The last thing I can offer is to study the TCP packets being sent from the Stock Exchange to your server just before your server sends a Window Size of 0. In particular, look for incoming packets with the value 11 in the IP Header's ECN field (Explicit Congestion Notification: the last two bits of what used to be the Type of Service byte, after the six DSCP bits; bits 14 and 15 if you're looking at an IP Header). There is a chance that if both the Client and Server in the communication supported ECN, and a router in transit detected congestion, it turned these bits on to tell the client and server to slow down their transfers. (This is the thing I said might be a problem on their end.)
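If you export raw packet bytes and want to check the ECN bits programmatically, something like the following would do it. This is a minimal sketch assuming you already have the IPv4 header as a `bytes` object; the function names are made up for the example, and the codepoint names follow RFC 3168.

```python
# ECN codepoints as defined in RFC 3168.
ECN_NAMES = {
    0b00: "Not-ECT",  # sender does not support ECN
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # Congestion Experienced, set by a router in transit
}


def ecn_codepoint(ipv4_header: bytes) -> int:
    """Return the 2-bit ECN value from the second byte of an IPv4 header."""
    if len(ipv4_header) < 2:
        raise ValueError("need at least the first two header bytes")
    # Byte 1 holds DSCP in its upper six bits and ECN in its lower two bits.
    return ipv4_header[1] & 0b11


def is_congestion_experienced(ipv4_header: bytes) -> bool:
    """True when the ECN field is 11, i.e. a router marked congestion."""
    return ecn_codepoint(ipv4_header) == 0b11


if __name__ == "__main__":
    sample = bytes([0x45, 0x03]) + bytes(18)  # made-up 20-byte header with ECN = 11
    print(ECN_NAMES[ecn_codepoint(sample)], is_congestion_experienced(sample))
```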
I think that answers (or tries to answer) questions 0, 1 and 3. I'll have to dig around a bit more to give you a reliable answer for 2, but I'm pretty confident there is a way.
I teach TCP, and I often run into people who were mis-taught that the ACK is only sent when the Window Size is reached. This is not true. (To be really transparent, I taught this incorrectly too before I knew better, so I completely understand the mistake.)
NOTE, I'll be using Receiver/Sender to describe it, but keep in mind TCP is bidirectional, and both parties maintain a Window Size.
The Window Size (that the Receiver sets) is a hard limit on how many bytes the Sender can send without being forced to stop to wait for an acknowledgement.
The Window Size does not determine how often the Receiver should be sending acknowledgements. Originally, the TCP protocol called for an acknowledgement to be sent after each segment was received. Later, TCP was optimized to allow the Receiver to skip ACKs and send an acknowledgement for every other segment (or even less often).
The goal of TCP then, is for the Sender to continually be sending packets, without delay or interruption, because it continually receives ACKnowledgements, such that the count of "bytes in transit" is always less than the Window Size. If at any time, the Sender has sent a count of bytes equal to the window size without receiving an ACK, it is forced to pause sending and wait.
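The sender-side bookkeeping described above can be sketched in a few lines. This is only an illustration of the rule "bytes in transit may not exceed the Window Size", not how any real TCP stack is written; the function name and the byte counts are made up for the example, and congestion control is ignored entirely.

```python
def sendable_bytes(last_byte_sent: int, last_byte_acked: int,
                   advertised_window: int) -> int:
    """How many more bytes the Sender may put on the wire right now.

    Bytes in transit are those sent but not yet acknowledged. The Sender
    must stop once that count reaches the Receiver's advertised window.
    """
    in_transit = last_byte_sent - last_byte_acked
    return max(0, advertised_window - in_transit)


if __name__ == "__main__":
    window = 11680  # hypothetical window: eight 1460-byte segments

    # Eight segments sent, nothing acknowledged yet: the Sender must pause.
    print(sendable_bytes(11680, 0, window))

    # An ACK covering two segments arrives: room for two more segments.
    print(sendable_bytes(11680, 2920, window))

    # The Receiver advertises a window of 0: nothing may be sent at all.
    print(sendable_bytes(11680, 2920, 0))
```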
The important thing to consider in all this is the Round Trip Time. Often, when you are studying TCP in Wireshark, you are only seeing the perspective of one party in the TCP conversation, which makes it hard to infer, or truly "see", the effect of the RTT. To illustrate the effect of RTT, take a look at these two captures. They both capture the same conversation, a 2MB file download over HTTP, but one is from the perspective of the Client, and the other is from the perspective of the Server.
Note: it's easier to analyse TCP if you turn off the Wireshark feature "Allow subdissector to reassemble TCP streams"
Notice from the Server side capture (the Server is the sender of the file) that the Server sends 8 full sized packets in a row (packet#'s 6-13) before receiving the first ACK in packet# 14. If you drill down in that ACK, notice the Client's acknowledgement is for the segment sent in Packet# 7. And the ACK the Client sent in packet# 20 is for the segment sent in Packet# 9.
See how the Client is indeed acknowledging every other packet. It almost seems like it is acknowledging them "late", but in fact this is just the effect of Round Trip Time. The Sender is able to send about 7 segments in the time it takes for the first segment to reach the Client and for the Client's ACK to reach the Server. If you take a look at the capture from the Client's perspective, it looks very 'clean', which is to say that for every second packet it receives, it sends out an ACK.
Notice also what happens at Packet# 23. The Server has sent all it can, because the "bytes in transit" has reached the Window Size, so it is forced to stop sending until the next ACK arrives. Since the ACKs are coming in for every other segment received, each ACK allows the Sender to send two new segments before the Window is full again and the Server is again forced to pause. This happens up until Packet# 51, when the Client (Receiver) increases the Window Size significantly, allowing the Server (Sender) to start transmitting data uninhibited again... at least until Packet# 175, when the new Window fills up.
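One way to put numbers on this: while the Sender is limited by the Window, it can deliver at most one Window's worth of data per Round Trip Time. The sketch below works that out. The function names are made up for the example and the figures are hypothetical, not taken from the captures; it also ignores congestion control and packet loss.

```python
def segments_in_window(window_bytes: int, segment_bytes: int) -> int:
    """Number of full-sized segments that fit in one Window."""
    return window_bytes // segment_bytes


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in bits per second for a window-limited Sender.

    At most one full Window can be in transit per Round Trip Time.
    """
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    # Hypothetical values: a 64 KiB window over a 250 ms round trip.
    print(segments_in_window(65536, 1460))
    print(max_throughput_bps(65536, 0.25))
```

This is why a small Window hurts much more on a long path than on a LAN: the same Window divided by a larger RTT gives a lower ceiling.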
Best Answer
Well, I think you are in luck, as there is a Wireshark forum thread that addresses this completely and describes your situation.
https://ask.wireshark.org/questions/2365/tcp-window-size-and-scaling
Basically, it's not the network, it's more likely the server your traders are all trying to access. The server can't process the packets it's getting at the rate it's getting them (i.e., drinking from the firehose) and thus that message is the result.
Asking about tools, well, wireshark ;)
You need network monitoring for long-term bandwidth and error graphing/tracking. Cacti and Nagios (or Icinga) are free applications that run on any unix/linux platform. For Windows, you're better off buying something from a vendor such as SolarWinds (you work for a trading firm, so they should be able to spend a small amount on this). WhatsUp Gold (from Ipswitch rather than SolarWinds) would be a decent starting point, and PRTG is also very good.
For immediate tools, read more on ping and traceroute in Windows. The defaults for those commands are absolutely horrible, and tweaking them can give you better and faster results. If you have a unix/linux system available, mtr is the trace/path-performance tool-of-choice for network administrators.