Random CONNECTION_RESET on apache2.4 debian 9

apache-2.4 connection debian-stretch reset

My server is behaving strangely and I just can't find the cause. I've been looking everywhere.

I will pay 200$ worth of bitcoin to anyone who can figure this out.

The problem:

When requesting any resource from apache (page, image, css, js), it sometimes takes a very long time to respond. About half of the time, the connection gets reset. (on Chrome: net::ERR_CONNECTION_RESET)
This happens rarely and at random; it is completely unpredictable.
More confusingly, while one request hangs, I can make additional requests that work perfectly.

About the server:

I'm running Apache 2.4 (mpm-prefork) with PHP 7.0 on Debian 9. The Apache setup uses mod_rewrite and an SSL certificate from certbot. On some occasions, PHP invokes Inkscape to render SVGs to PNG.

The server load is very low (0.02) and nothing but apache runs on it.

Things checked:

  • checked all server logs. (syslog, apache log) – nothing
  • increased the apache mpm-prefork limits – nope
  • checked for possible DNS problems – nothing
  • I even moved to a completely new root server (on a different provider) – still the same

I went on to analyze the TCP traffic with Wireshark, and there is some suspicious behavior. When the connection freezes, there are some "TCP Out-of-Order", "Retransmission" and "ACKed unseen segment" packets… but I don't have the necessary low-level knowledge to tell what's going on.

Any hints would be greatly appreciated!

EDIT:

This is the mpm_prefork config:

<IfModule mpm_prefork_module>
    StartServers            10
    MinSpareServers         10
    MaxSpareServers         50
    MaxRequestWorkers       300
    MaxConnectionsPerChild  0
</IfModule>

EDIT EDIT:

I got lucky and had a TCP sniffer running on both server and client when it happened once again.
Here are the pcap files, cropped to the last ~30 seconds.

serverside.pcap

clientside.pcap

If anyone with the knowledge could take a quick look at it and tell me what's going on, I'd be thrilled.

EDIT EDIT EDIT:

I managed to make the error reproducible, at least with KeepAlive on.
When a request is finished and the content is served, the TCP connection closes with a FIN-ACK after 5 seconds. When I make another request in the window of 5-12 seconds after the FIN-ACK, the connection freezes.
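
A rough way to script this reproduction in Python, assuming the 5-second close is the server's keep-alive timeout (Apache's default KeepAliveTimeout is 5 seconds). The sketch finishes one request, leaves that connection idle, then times a request on a new connection. The name probe_after_idle and the connect factory argument are made up for illustration, and whether a script triggers the freeze the same way a browser does is untested.

```python
import http.client
import time


def probe_after_idle(connect, idle_seconds, path="/",
                     sleep=time.sleep, clock=time.monotonic):
    """Return (outcome, seconds) for a request sent idle_seconds after a first one.

    connect is a zero-argument factory returning an http.client-style connection.
    """
    first = connect()
    first.request("GET", path)
    first.getresponse().read()      # finish the first request, keep the socket open
    sleep(idle_seconds)             # the server is expected to close its side meanwhile
    second = connect()              # fresh connection for the follow-up request
    start = clock()
    try:
        second.request("GET", path)
        response = second.getresponse()
        response.read()
        outcome = str(response.status)
    except (OSError, http.client.HTTPException) as exc:
        # Connection resets and socket timeouts are OSError subclasses.
        outcome = type(exc).__name__
    finally:
        second.close()
        first.close()
    return outcome, clock() - start
```

To use it against a real host, pass something like `lambda: http.client.HTTPSConnection("your.server.example", timeout=30)` as connect (the hostname is a placeholder) and try idle values on both sides of the suspected window, for example 3, 8, 11, 14, 17 and 20 seconds. A long elapsed time or an outcome such as ConnectionResetError at 10-17 seconds of idle would match the behaviour described above.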

With KeepAlive off, however, this doesn't happen anymore, though the error then occurs even more often when loading multiple resources at the same time. In that case it's no longer reproducible.

Best Answer

I would check the size of the TCP packets going between the server and client. If they are close to 1500 bytes, they may be getting dropped, for a couple of possible reasons:

  1. If the DF (Don't Fragment) bit is set on a packet and the packet would need to be fragmented somewhere along the path, it can be dropped instead.

  2. If the MTU is set to 1500 and the packets pass through tunnels, encryption, etc. that add extra headers, the packets can end up too large and get dropped. Try setting the MTU on the interfaces you are using, on both ends, to something lower than 1500, possibly 1420 or even lower.
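
To make the numbers concrete, here is a small Python sketch of the arithmetic behind those MTU values. It assumes IPv4 and TCP headers without options (20 bytes each) and an 8-byte ICMP echo header; the function names are only illustrative.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20    # bytes, TCP header without options
ICMP_HEADER = 8    # bytes, ICMP echo request header


def tcp_mss(mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


for mtu in (1500, 1420):
    print(f"MTU {mtu}: TCP MSS {tcp_mss(mtu)}, max ping payload {ping_payload(mtu)}")
```

With an MTU of 1500 this gives a TCP MSS of 1460 and a maximum ping payload of 1472 bytes. On Linux, `ping -M do -s 1472 <host>` sends a packet of exactly that size with DF set; if that fails while a smaller size gets through, something on the path has a lower MTU.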