Nginx/php-fpm timeouts

Tags: fastcgi, nginx, php-fpm, timeout

I have a setup that consists of a load balancer, two web servers running nginx/PHP-FPM 7.1, and a database server running MariaDB.

For the past few months I have been struggling to work out the cause of, and resolve, irregular timeouts, and am finally asking here for thoughts. As far as I'm aware, nothing had changed around the time this started occurring. Additionally, I have seen php-fpm fail completely, requiring a service restart.

I'm seeing errors like the following and am receiving alerts throughout the day via Xymon:

2018/07/11 14:27:23 [error] 13461#13461: *920760 upstream timed out (110: Connection timed out) while reading response header from upstream, client: *.*.*.*, server: www.something.com, request: "GET /something/something HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/something.com.sock", host: "www.something.com"
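When errors like this come in throughout the day, a quick first step is to tally them per upstream socket, to see whether one php-fpm pool is disproportionately affected or the timeouts are spread across all sites. A small helper sketch (the log path and function name are my own, adjust as needed):

```shell
#!/bin/sh
# Count "upstream timed out" errors per upstream socket in an nginx error log,
# so you can see which php-fpm pool is timing out most often.
# Usage: count_upstream_timeouts /var/log/nginx/error.log
count_upstream_timeouts() {
    grep 'upstream timed out' "$1" \
        | grep -o 'upstream: "[^"]*"' \
        | sort | uniq -c | sort -rn
}
```

Correlating the counts with site traffic can also hint at whether this is load-related or a per-site application problem.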

There are about five sites hosted in this setup. Only one runs through the load balancer; all the others point directly at web server 1. Since I receive alerts for all the sites, I am looking only at web server 1.

The general nginx conf all the sites use is as follows:

worker_processes        2;

user    nginx www-data;
pid     /run/nginx.pid;
worker_rlimit_nofile     100000;

events {
    worker_connections  1024;
    multi_accept        on;
    use                 epoll;
}

http {
    include             mime.types;
    default_type        application/octet-stream;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 200;
    client_max_body_size 16m;
    client_body_timeout 32;
    client_header_timeout 32;
    reset_timedout_connection   on;
    send_timeout   600;
    proxy_connect_timeout 600;
    proxy_send_timeout 600;
    proxy_read_timeout 600;

    fastcgi_buffers 8 128k;
    fastcgi_buffer_size 256k;

    open_file_cache max=10000 inactive=30s;
    open_file_cache_valid 60s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;
}

Additionally, I have a location block similar to this in each vhost:

    location ~ \.php$ {
        try_files $uri =404;
        fastcgi_pass unix:/run/php-fpm/something.com.sock;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_read_timeout 30s;
        include fastcgi_params;
    }
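One detail worth flagging: the `proxy_*` timeouts in the http block apply only to `proxy_pass`, not to `fastcgi_pass`. For FastCGI upstreams the relevant directives are the `fastcgi_*` timeouts, so the effective read timeout here is the 30s set in this location, and that is the limit the `upstream timed out ... while reading response header` error is hitting. A sketch of the FastCGI equivalents (values illustrative, not recommendations):

```nginx
    location ~ \.php$ {
        try_files $uri =404;
        fastcgi_pass unix:/run/php-fpm/something.com.sock;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        # fastcgi_* timeouts govern fastcgi_pass; the proxy_* timeouts do not apply here.
        fastcgi_connect_timeout 60s;
        fastcgi_send_timeout 60s;
        fastcgi_read_timeout 60s;   # raise this only if legitimate requests exceed 30s
        include fastcgi_params;
    }
```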

And each site has its own FPM pool, all of which have the following changes:

pm = ondemand
pm.max_children = 12
pm.start_servers = 4
pm.min_spare_servers = 4
pm.max_spare_servers = 8
pm.max_requests = 15000
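A note on the pool settings above: under `pm = ondemand`, php-fpm ignores `pm.start_servers`, `pm.min_spare_servers`, and `pm.max_spare_servers`; those directives only apply to `pm = dynamic`. The knob that matters for ondemand is `pm.process_idle_timeout`. A minimal sketch of what an ondemand pool actually uses (values illustrative):

```ini
; Only these pm.* settings take effect with pm = ondemand;
; the start/spare-server values above are silently ignored.
pm = ondemand
pm.max_children = 12
pm.process_idle_timeout = 10s
pm.max_requests = 15000
```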

The main site that runs through the load balancer has the following changes to its FPM pool:

pm = dynamic
pm.max_children = 100
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 8
pm.max_requests = 15000

Everything I have attempted has resulted in no change, including updating all yum packages and rebooting. As it stands, there is no high load on these machines, although load spikes can occur.

Any thoughts or help on how to debug this further would be very useful!

Update

The PHP-FPM slow log does report entries like this:

[11-Jul-2018 14:53:12] WARNING: [pool something.com] child 53001, script '/var/www/something.com/index.php' (request: "GET /index.php?q=/404.html&") executing too slow (11.267915 sec), logging
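Since the slow log is already firing, it may help to confirm the pool's slow-log settings so each event captures a backtrace pointing at the blocking call (often a database query). A hypothetical pool fragment, with illustrative paths and thresholds:

```ini
; Illustrative slow-log settings for a pool; adjust paths/thresholds to taste.
slowlog = /var/log/php-fpm/something.com-slow.log
request_slowlog_timeout = 10s
; Optionally kill runaway requests before nginx's fastcgi timeout fires:
request_terminate_timeout = 30s
```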

Possibly more related to the MariaDB server, then?
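One way to test that theory is to enable MariaDB's slow query log and see whether the slow PHP requests line up with slow queries. A hypothetical my.cnf fragment (path and threshold are illustrative):

```ini
# Log queries slower than 1s, plus queries that scan without indexes.
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1
log_queries_not_using_indexes = 1
```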

Best Answer

The thing that has most likely changed over time is the size of the database.

This, combined with possibly inefficient SQL statements or database structure, can cause queries to take too long, which in turn triggers the timeout.