Sql-server – TCP packet loss against SQL-server

sql-server tcp

Where I work, we have had some issues with connections to the database. We get a lot of failures which appear as "recv failed" on the client side. I have used Wireshark to try to debug the problem and gotten a bit further, but now I'm pretty much stuck.

First, a bit about the infrastructure:

  • The internal IT-systems are run as virtual machines (servers) on 1-2 physical machines. They consist of several applications on Glassfish, as well as a SQL-server.
  • In addition, there are some servers at an external provider, including another SQL-server.
  • Between the internal systems and the external ones there is a firewall that routes the traffic.

The problem is the communication between the applications in the internal zone and the database in the external zone. It only occurs on some of the virtual servers (at the moment actually just one), and it is not application-specific: it happens both through Glassfish and JDBC connection pools and through SQL clients like SQuirreL.

And it gets stranger: small SQL statements run fine, but once they reach a certain length nothing happens until the connection is closed on the client side (with "recv failed").

Here is what I found with Wireshark:

  • On the servers that work, a small SQL query appears in Wireshark as a single TDS packet, with a direct TDS packet response (no separate TCP ACKs and such show up). Large queries are typically first sent as TDS, then resent part by part as TCP, ACKed correctly, and then the result is returned. (In one case I saw that it first tried a 2400-byte TDS packet, then a 1514-byte TCP packet, and then a 590-byte TCP packet, and got ACKs for the last ones.)
  • On the server that doesn't work, small queries are sent as a TDS packet (and this works). For large queries, on the other hand, it first tries to send the entire query as a TCP packet, then gets an ACK with seq and ack = 1 (indicating no data received?), then tries to send a 1514-byte TCP packet and gets no ACK back. It tries this a few more times before the connection is broken.
  • If the maximum result option is ticked in SQuirreL, it first sends a small TDS packet stating the restrictions on the result, then tries to send the query, but only receives a resend of the ACK of the first one. Resends of the query receive no ACKs.
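For reference when reading the packet sizes above: Wireshark reports the length of the whole captured frame, so a 1514-byte TCP packet normally corresponds to a 1500-byte IP packet plus a 14-byte Ethernet header. A minimal sketch of that arithmetic follows. The header sizes are assumptions (untagged Ethernet, IPv4 and TCP without options) and the function names are made up for illustration.

```python
# Assumed header sizes; real captures may differ (VLAN tags, IP or TCP options).
ETHERNET_HEADER = 14  # bytes, untagged Ethernet II, FCS not included in the capture
IPV4_HEADER = 20      # bytes, no IP options
TCP_HEADER = 20       # bytes, no TCP options (timestamps would add 12)


def tcp_payload_from_frame(frame_len: int) -> int:
    """TCP payload bytes carried by a captured frame of frame_len bytes."""
    return frame_len - ETHERNET_HEADER - IPV4_HEADER - TCP_HEADER


def frame_len_for_mtu(mtu: int) -> int:
    """Largest captured frame length for a given IP MTU."""
    return mtu + ETHERNET_HEADER


print(tcp_payload_from_frame(1514))  # 1460
print(frame_len_for_mtu(1500))       # 1514
print(frame_len_for_mtu(9000))       # 9014
```

Under these assumptions, the 1514-byte frames in the captures are full-size segments for a standard 1500-byte MTU.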

I'm pretty much in the dark as to where to look next. Does anyone have any pointers?

Update: Doing a restart on the virtual server solved the problem – for now at least..

Update #2: .. and now the problem is back..

Best Answer

We had a network expert look at this, and it turned out that erroneous MTU settings caused it. From what I understand, most of the network infrastructure supported jumbo frames, so the MTU was set to 9000. But there was one component (a VPN tunnel) that didn't support this, so frames over a certain size were truncated. Changing all MTU settings to 1500/1460 fixed the problem.
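One way to track down this kind of mismatch is to search for the largest packet size that still gets through the path. The sketch below is not what the network expert ran; it only illustrates the idea with a binary search over a probe function. `largest_passing_size` and `fake_probe` are hypothetical names, and the 1400-byte limit in the stand-in probe is a made-up number. A real probe would have to send a packet of the given size with the Don't Fragment bit set and report whether a reply came back.

```python
from typing import Callable


def largest_passing_size(probe: Callable[[int], bool],
                         low: int = 576, high: int = 9000) -> int:
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes the probe is monotonic: if a size passes, every smaller size passes.
    """
    if not probe(low):
        raise ValueError(f"even the smallest size ({low}) does not pass")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # mid passes, so the answer is mid or larger
        else:
            high = mid - 1  # mid fails, so the answer is smaller
    return low


def fake_probe(size: int) -> bool:
    """Stand-in for a real probe: simulates a path limited to 1400 bytes."""
    return size <= 1400


path_mtu = largest_passing_size(fake_probe)
print(path_mtu)       # 1400
print(path_mtu - 40)  # 1360, the TCP MSS assuming 20-byte IPv4 and TCP headers
```

The same subtraction explains the pair of numbers in the answer: with a 1500-byte MTU and no IP or TCP options, the largest TCP payload per packet is 1500 - 40 = 1460 bytes, so 1460 is presumably the matching MSS.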
