All connections from this network get stuck in SYN_RECV state; connections from home or my phone properly reach ESTABLISHED

apache-2.2 debugging networking tcp

My server (a Linode VPS) suddenly started timing out on every request yesterday.

I'm pretty inexperienced in networking and would love to learn a process for debugging these connectivity issues.

What confuses me is that yesterday some people (my phone, me at home, friends at home) could consistently access the site, and netstat showed those connections as ESTABLISHED. I disabled firewalls and set iptables to accept all connections to rule out any automatic rules blacklisting our IP. I'm not sure if it's relevant, but a traceroute from the local network times out, while traceroutes from some machines outside reach my server.

I've confirmed various settings are correct by comparing to the settings on my development server which is functioning properly.

The following files match my dev environment (except for their respective ip addresses):

/etc/hosts 
/etc/hosts.allow
/etc/hosts.deny
/etc/network/interfaces
ifconfig

Apache is listening on port 80 and the setup looks exactly the same as my functioning server.

# server that doesn't work:
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      22008/apache2
tcp        0      0 69.164.201.172:80       71.56.137.10:57487      SYN_RECV    -

# server that does work
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      3334/apache2
tcp        0      0 72.14.189.46:80         71.56.137.10:57490      ESTABLISHED 20931/apache2

My attempt at understanding

Each time I load the page, netstat -an | grep :80 shows every connection from this network in the SYN_RECV state.

tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 69.164.201.172:80       71.56.137.10:56657      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56669      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56671      SYN_RECV

As I understand it, SYN_RECV means the server has received the client's SYN, replied with a SYN-ACK, and is now waiting for the client's final ACK.
How do I find out whether that ACK is ever sent? How do I debug where this exchange is failing?
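To make the netstat observation easier to track over repeated page loads, here is a minimal Python sketch that tallies TCP states per remote IP. It assumes the six-column layout shown above (proto, Recv-Q, Send-Q, local, remote, state); the function name count_states is just an illustrative choice, and the sample text is copied from the output above.

```python
from collections import Counter

# Sample lines copied from the netstat output in the question.
NETSTAT_OUTPUT = """\
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 69.164.201.172:80       71.56.137.10:56657      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56669      SYN_RECV
tcp        0      0 69.164.201.172:80       71.56.137.10:56671      SYN_RECV
"""


def count_states(netstat_text):
    """Count (remote IP, TCP state) pairs in `netstat -an` style text."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        remote_ip = fields[4].rsplit(":", 1)[0]  # strip the port
        counts[(remote_ip, fields[5])] += 1
    return counts


states = count_states(NETSTAT_OUTPUT)
for (remote_ip, state), n in sorted(states.items()):
    print(f"{remote_ip:15} {state:12} {n}")
```

If one remote IP only ever shows SYN_RECV while others reach ESTABLISHED, that points at the path to that particular network rather than at Apache itself.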

Here's what a tcpdump looks like when I attempt to load the page once.

In the paste below, my server is constantly sending packets to the client and not getting a response.

What does this mean? That the client isn't getting the response? Or perhaps I'm swallowing the response somewhere in the server? How do I know to narrow down the culprit further?

tcpdump -i eth0 -n -tttt port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
2011-05-25 20:12:54.627417 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, options [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0
2011-05-25 20:12:54.627512 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:54.814463 IP 69.164.201.172.80 > 71.56.137.10.57157: Flags [S.], seq 604630211, ack 496040070, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:55.214482 IP 69.164.201.172.80 > 71.56.137.10.57158: Flags [S.], seq 998358186, ack 2224730755, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:57.624737 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, options [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0
2011-05-25 20:12:57.624793 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:12:59.014477 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:13:03.618790 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960, win 8192, options [mss 1460,nop,nop,sackOK], length 0
2011-05-25 20:13:03.618866 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:13:05.014514 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
2011-05-25 20:13:17.014504 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505, ack 382527961, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
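One way to read a capture like this is to count, per client port, how many SYNs arrived, how many SYN-ACKs the server sent, and whether any bare ACK ever came back. The Python sketch below does that on the lines above, trimmed after the seq field for brevity. The names summarize_handshakes and diagnose are hypothetical helpers, and the interpretation is a rule of thumb rather than a definitive diagnosis: a client that keeps retransmitting the same SYN most likely never received the SYN-ACK.

```python
import re
from collections import defaultdict

SERVER_PORT = 80

# Lines from the capture above, trimmed after the seq field.
CAPTURE = """\
2011-05-25 20:12:54.627417 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960
2011-05-25 20:12:54.627512 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505
2011-05-25 20:12:54.814463 IP 69.164.201.172.80 > 71.56.137.10.57157: Flags [S.], seq 604630211
2011-05-25 20:12:55.214482 IP 69.164.201.172.80 > 71.56.137.10.57158: Flags [S.], seq 998358186
2011-05-25 20:12:57.624737 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960
2011-05-25 20:12:57.624793 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505
2011-05-25 20:12:59.014477 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505
2011-05-25 20:13:03.618790 IP 71.56.137.10.57160 > 69.164.201.172.80: Flags [S], seq 382527960
2011-05-25 20:13:03.618866 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505
2011-05-25 20:13:05.014514 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505
2011-05-25 20:13:17.014504 IP 69.164.201.172.80 > 71.56.137.10.57160: Flags [S.], seq 1330600505
"""

LINE_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): "
    r"Flags \[([^\]]*)\]"
)


def summarize_handshakes(capture_text, server_port=SERVER_PORT):
    """Count SYN, SYN-ACK and bare-ACK packets per (client IP, client port)."""
    flows = defaultdict(lambda: {"syn": 0, "synack": 0, "ack": 0})
    for line in capture_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        src_ip, src_port, dst_ip, dst_port, flags = m.groups()
        if int(dst_port) == server_port:        # client -> server
            key = (src_ip, int(src_port))
            if flags == "S":
                flows[key]["syn"] += 1
            elif "." in flags and "S" not in flags:
                flows[key]["ack"] += 1
        elif int(src_port) == server_port:      # server -> client
            key = (dst_ip, int(dst_port))
            if flags == "S.":
                flows[key]["synack"] += 1
    return dict(flows)


def diagnose(counts):
    """Rule-of-thumb reading of one flow's counters (an assumption, not proof)."""
    if counts["ack"] > 0:
        return "handshake completed"
    if counts["synack"] > 0:
        note = "SYN-ACK sent but no ACK seen"
        if counts["syn"] > 1:
            note += "; client retransmitted its SYN"
        return note
    return "no SYN-ACK sent by the server"


flows = summarize_handshakes(CAPTURE)
for (ip, port), counts in sorted(flows.items()):
    print(f"{ip}:{port} {counts} -> {diagnose(counts)}")
```

The server side of the handshake looks healthy here (it answers every SYN), so the next step would be capturing on the client at the same time to see whether the SYN-ACKs ever arrive.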

tcpdump for functional server

Looking at the tcpdump for my functional server, I do see back-and-forth communication between the server and the client.

00:00:00.000000 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [S], seq 34114118s [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0
00:00:00.000110 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [S.], seq 2454858 win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
00:00:00.061827 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [.], ack 1, win 1
00:00:00.004292 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [P.], seq 1:597, ngth 596
00:00:00.000074 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [.], ack 597, win
00:00:00.493990 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [.], seq 1:2921, ngth 2920
00:00:00.000024 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [P.], seq 2921:30, length 98
00:00:00.065135 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [.], ack 3019, wi
00:00:00.034766 IP 71.56.137.10.57260 > 72.14.189.46.80: Flags [P.], seq 597:12925, length 699
00:00:00.000035 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [.], ack 1296, wi
00:00:00.000457 IP 72.14.189.46.80 > 71.56.137.10.57260: Flags [P.], seq 3019:328, length 211
00:00:00.019196 IP 71.56.137.10.57262 > 72.14.189.46.80: Flags [S], seq 10674886s [mss 1460,nop,wscale 2,nop,nop,sackOK], length 0

Any suggestions, explanations, or comments would be hugely appreciated so that I can understand TCP a little more and hopefully be a little more useful next time I need to debug a problem like this.

Thank you!

Best Answer

To this jaded eye, it looks like there is some kind of routing issue close to the server in question. Packets come in along one path but seem to depart along a different one, and something stateful on that return path is dropping the unexpected SYN-ACK packets because it never saw the matching SYN.

I had this happen to me once. What ended up being the case was that the server had a bad network mask, so when traffic from off the subnet came in, the server treated the remote address as local and issued an ARP request for its MAC address instead of sending the reply to the gateway. Unfortunately for me, both the router and our load-balancer were enabled for Proxy-ARP, and the load-balancer was a bit faster on the trigger than the router. So the SYN packets came in via the router, but the replies were attempting to leave the subnet via the load-balancer. As the LB didn't have a connection for that ACK packet, it dropped it on the floor.
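To illustrate the netmask part of that story, here is a small Python sketch using the standard ipaddress module. The addresses and the helper name next_hop_kind are hypothetical (they are not from my setup or from the question); the point is only that a mask that is too wide makes an off-subnet client look like a neighbour, so the host ARPs for it directly instead of handing the reply to its gateway.

```python
import ipaddress


def next_hop_kind(interface_cidr, destination):
    """Say whether a host would treat `destination` as on-link or routed.

    interface_cidr is the host's own address with its configured prefix,
    e.g. "10.1.2.3/24". Addresses inside that network are reached by ARPing
    for them directly; everything else goes to the default gateway.
    """
    network = ipaddress.ip_interface(interface_cidr).network
    if ipaddress.ip_address(destination) in network:
        return "on-link"
    return "via-gateway"


# Hypothetical addresses for illustration only.
CLIENT = "10.200.0.5"

correct = next_hop_kind("10.1.2.3/24", CLIENT)  # intended mask
too_wide = next_hop_kind("10.1.2.3/8", CLIENT)  # mistyped, too-wide mask

print("correct mask :", correct)   # via-gateway
print("too-wide mask:", too_wide)  # on-link
```

With the too-wide mask the reply never goes to the gateway on purpose; whichever Proxy-ARP device answers first gets it, which is how the return path can end up differing from the inbound one.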

In your case, some judicious traceroutes may illuminate the network-path issues. From the affected server, traceroute out to the IPs that have the problem, and traceroute from those same IPs back to the server. If the two directions take different paths, that may be where the problem lies.
