Nginx – Recurrent 502, 504 errors despite low cpu and memory usage

502504nginxphp-fpmWordpress

I'm struggling with this problem since quite a while. I have t2.site instance on AWS hosting a very low traffic personal website built with wordpress, nothing extraordinary. I first used Apache2, and after a few months running perfectly I suddenly started to got some 502, 504 errors. At that time I decided to move to Nginx/php5-fpm, things went well no more 502, 504 errors for a while. And since yesterday it just all happened again, I noticed that if I restart php5-fpm or mysql the site is accessible again, but only for 5-10 minutes before giving a 502 or 504 errors again.

I have followed multiple threads with similar problems nothing really worked, so I surrender and ask for help here. I changed multiple time the nginx and php5-fpm configuration without any durable change. I can see in the htop that there is a lot of mysql process and start suspecting that something (wordpress or ?) is creating a bunch of unnecessary connections to mysql but I don't know how to investigate further.

Here first my htop when things go wrong:

enter image description here

My nginx config :

server {
        listen 80;
        root /var/www/html/mysite;
        index index.php index.html index.htm;
        server_name my-site.com;

        location / {
                try_files $uri $uri/ /index.php?q=$uri&$args;
        }

        location ~ \.php$ {
                try_files $uri =404;
                #fastcgi_pass unix:/var/run/php5-fpm.sock;
                fastcgi_read_timeout 150;
                fastcgi_pass 127.0.0.1:9000;
                fastcgi_buffer_size 16k;
                fastcgi_buffers 4 16k;
                fastcgi_index index.php;
                include fastcgi_params;
        }

        location = /favicon.ico {
                log_not_found off;
                access_log off;
        }

        location = /robots.txt {
                allow all;
                log_not_found off;
                access_log off;
        }

        location ~ /\. {
                deny all;
        }

        location ~* /(?:uploads|files)/.*\.php$ {
                deny all;
        }
}

My /etc/php5/fpm/pool.d/www.conf file for php5-fpm configuration

; Start a new pool named 'www'.
; the variable $pool can we used in any directive and will be replaced by the
; pool name ('www' here)
[www]

; Per pool prefix
; It only applies on the following directives:
; - 'slowlog'
; - 'listen' (unixsocket)
; - 'chroot'
; - 'chdir'
; - 'php_values'
; - 'php_admin_values'
; When not set, the global prefix (or /usr) applies instead.
; Note: This directive can also be relative to the global prefix.
; Default Value: none
;prefix = /path/to/pools/$pool

; Unix user/group of processes
; Note: The user is mandatory. If the group is not set, the default user's group
;       will be used.
user = www-data
group = www-data

; The address on which to accept FastCGI requests.
; Valid syntaxes are:
;   'ip.add.re.ss:port'    - to listen on a TCP socket to a specific address on
;                            a specific port;
;   'port'                 - to listen on a TCP socket to all addresses on a
;                            specific port;
;   '/path/to/unix/socket' - to listen on a unix socket.
; Note: This value is mandatory.
listen = 127.0.0.1:9000

; Set listen(2) backlog.
; Default Value: 65535 (-1 on FreeBSD and OpenBSD)
;listen.backlog = 65535

; Set permissions for unix socket, if one is used. In Linux, read/write
; permissions must be set in order to allow connections from a web server. Many
; BSD-derived systems allow connections regardless of permissions.
; Default Values: user and group are set as the running user
;                 mode is set to 0660
listen.owner = www-data
listen.group = www-data
;listen.mode = 0660

; List of ipv4 addresses of FastCGI clients which are allowed to connect.
; Equivalent to the FCGI_WEB_SERVER_ADDRS environment variable in the original
; PHP FCGI (5.2.2+). Makes sense only with a tcp listening socket. Each address
; must be separated by a comma. If this value is left blank, connections will be
; accepted from any ip address.
; Default Value: any
;listen.allowed_clients = 127.0.0.1

; Specify the nice(2) priority to apply to the pool processes (only if set)
; The value can vary from -19 (highest priority) to 20 (lower priority)
; Note: - It will only work if the FPM master process is launched as root
;       - The pool processes will inherit the master process priority
;         unless it specified otherwise
; Default Value: no set
; priority = -19

; Choose how the process manager will control the number of child processes.
; Possible Values:
;   static  - a fixed number (pm.max_children) of child processes;
;   dynamic - the number of child processes are set dynamically based on the
;             following directives. With this process management, there will be
;             always at least 1 children.
;             pm.max_children      - the maximum number of children that can
;                                    be alive at the same time.
;             pm.start_servers     - the number of children created on startup.
;             pm.min_spare_servers - the minimum number of children in 'idle'
;                                    state (waiting to process). If the number
;                                    of 'idle' processes is less than this
;                                    number then some children will be created.
;             pm.max_spare_servers - the maximum number of children in 'idle'
;                                    state (waiting to process). If the number
;                                    of 'idle' processes is greater than this
;                                    number then some children will be killed.
;  ondemand - no children are created at startup. Children will be forked when
;             new requests will connect. The following parameter are used:
;             pm.max_children           - the maximum number of children that
;                                         can be alive at the same time.
;             pm.process_idle_timeout   - The number of seconds after which
;                                         an idle process will be killed.
; Note: This value is mandatory.
pm = dynamic

; The number of child processes to be created when pm is set to 'static' and the
; maximum number of child processes when pm is set to 'dynamic' or 'ondemand'.
; This value sets the limit on the number of simultaneous requests that will be
; served. Equivalent to the ApacheMaxClients directive with mpm_prefork.
; Equivalent to the PHP_FCGI_CHILDREN environment variable in the original PHP
; CGI. The below defaults are based on a server without much resources. Don't
; forget to tweak pm.* to fit your needs.
; Note: Used when pm is set to 'static', 'dynamic' or 'ondemand'
; Note: This value is mandatory.
pm.max_children = 5

; The number of child processes created on startup.
; Note: Used only when pm is set to 'dynamic'
; Default Value: min_spare_servers + (max_spare_servers - min_spare_servers) / 2
pm.start_servers = 2

; The desired minimum number of idle server processes.
; Note: Used only when pm is set to 'dynamic'
; Note: Mandatory when pm is set to 'dynamic'
pm.min_spare_servers = 1

; The desired maximum number of idle server processes.
; Note: Used only when pm is set to 'dynamic'
; Note: Mandatory when pm is set to 'dynamic'
pm.max_spare_servers = 3

; The number of seconds after which an idle process will be killed.
; Note: Used only when pm is set to 'ondemand'
; Default Value: 10s
;pm.process_idle_timeout = 10s;

; The number of requests each child process should execute before respawning.
; This can be useful to work around memory leaks in 3rd party libraries. For
; endless request processing specify '0'. Equivalent to PHP_FCGI_MAX_REQUESTS.
; Default Value: 0
pm.max_requests = 500

; The URI to view the FPM status page. If this value is not set, no URI will be
; recognized as a status page. It shows the following informations:
;   pool                 - the name of the pool;
;   process manager      - static, dynamic or ondemand;
;   start time           - the date and time FPM has started;
;   start since          - number of seconds since FPM has started;
;   accepted conn        - the number of request accepted by the pool;
;   listen queue         - the number of request in the queue of pending
;                          connections (see backlog in listen(2));
;   max listen queue     - the maximum number of requests in the queue
;                          of pending connections since FPM has started;
;   listen queue len     - the size of the socket queue of pending connections;
;   idle processes       - the number of idle processes;
;   active processes     - the number of active processes;
;   total processes      - the number of idle + active processes;
;   max active processes - the maximum number of active processes since FPM
;                          has started;
;   max children reached - number of times, the process limit has been reached,
;                          when pm tries to start more children (works only for
;                          pm 'dynamic' and 'ondemand');
; Value are updated in real time.
; Example output:
;   pool:                 www
;   process manager:      static
;   start time:           01/Jul/2011:17:53:49 +0200
;   start since:          62636
;   accepted conn:        190460
;   listen queue:         0
;   max listen queue:     1
;   listen queue len:     42
;   idle processes:       4
;   active processes:     11
;   total processes:      15
;   max active processes: 12
;   max children reached: 0
;
; By default the status page output is formatted as text/plain. Passing either
; 'html', 'xml' or 'json' in the query string will return the corresponding
; output syntax. Example:
;   http://www.foo.bar/status
;   http://www.foo.bar/status?json
;   http://www.foo.bar/status?html
;   http://www.foo.bar/status?xml
;
; By default the status page only outputs short status. Passing 'full' in the
; query string will also return status for each pool process.
; Example:
;   http://www.foo.bar/status?full
;   http://www.foo.bar/status?json&full
;   http://www.foo.bar/status?html&full
;   http://www.foo.bar/status?xml&full
; The Full status returns for each process:
;   pid                  - the PID of the process;
;   state                - the state of the process (Idle, Running, ...);
;   start time           - the date and time the process has started;
;   start since          - the number of seconds since the process has started;
;   requests             - the number of requests the process has served;
;   request duration     - the duration in µs of the requests;
;   request method       - the request method (GET, POST, ...);
;   request URI          - the request URI with the query string;
;   content length       - the content length of the request (only with POST);
;   user                 - the user (PHP_AUTH_USER) (or '-' if not set);
;   script               - the main script called (or '-' if not set);
;   last request cpu     - the %cpu the last request consumed
;                          it's always 0 if the process is not in Idle state
;                          because CPU calculation is done when the request
;                          processing has terminated;
;   last request memory  - the max amount of memory the last request consumed
;                          it's always 0 if the process is not in Idle state
;                          because memory calculation is done when the request
;                          processing has terminated;
; If the process is in Idle state, then informations are related to the
; last request the process has served. Otherwise informations are related to
; the current request being served.
; Example output:
;   ************************
;   pid:                  31330
;   state:                Running
;   start time:           01/Jul/2011:17:53:49 +0200
;   start since:          63087
;   requests:             12808
;   request duration:     1250261
;   request method:       GET
;   request URI:          /test_mem.php?N=10000
;   content length:       0
;   user:                 -
;   script:               /home/fat/web/docs/php/test_mem.php
;   last request cpu:     0.00
;   last request memory:  0
;
; Note: There is a real-time FPM status monitoring sample web page available
;       It's available in: ${prefix}/share/fpm/status.html
;
; Note: The value must start with a leading slash (/). The value can be
;       anything, but it may not be a good idea to use the .php extension or it
;       may conflict with a real PHP file.
; Default Value: not set
;pm.status_path = /status

; The ping URI to call the monitoring page of FPM. If this value is not set, no
; URI will be recognized as a ping page. This could be used to test from outside
; that FPM is alive and responding, or to
; - create a graph of FPM availability (rrd or such);
; - remove a server from a group if it is not responding (load balancing);
; - trigger alerts for the operating team (24/7).
; Note: The value must start with a leading slash (/). The value can be
;       anything, but it may not be a good idea to use the .php extension or it
;       may conflict with a real PHP file.
; Default Value: not set
;ping.path = /ping

; This directive may be used to customize the response of a ping request. The
; response is formatted as text/plain with a 200 response code.
; Default Value: pong
;ping.response = pong

; The access log file
; Default: not set
;access.log = log/$pool.access.log

; The access log format.
; The following syntax is allowed
;  %%: the '%' character
;  %C: %CPU used by the request
;      it can accept the following format:
;      - %{user}C for user CPU only
;      - %{system}C for system CPU only
;      - %{total}C  for user + system CPU (default)
;  %d: time taken to serve the request
;      it can accept the following format:
;      - %{seconds}d (default)
;      - %{miliseconds}d
;      - %{mili}d
;      - %{microseconds}d
;      - %{micro}d
;  %e: an environment variable (same as $_ENV or $_SERVER)
;      it must be associated with embraces to specify the name of the env
;      variable. Some exemples:
;      - server specifics like: %{REQUEST_METHOD}e or %{SERVER_PROTOCOL}e
;      - HTTP headers like: %{HTTP_HOST}e or %{HTTP_USER_AGENT}e
;  %f: script filename
;  %l: content-length of the request (for POST request only)
;  %m: request method
;  %M: peak of memory allocated by PHP
;      it can accept the following format:
;      - %{bytes}M (default)
;      - %{kilobytes}M
;      - %{kilo}M
;      - %{megabytes}M
;      - %{mega}M
;  %n: pool name
;  %o: output header
;      it must be associated with embraces to specify the name of the header:
;      - %{Content-Type}o
;      - %{X-Powered-By}o
;      - %{Transfert-Encoding}o
;      - ....
;  %p: PID of the child that serviced the request
;  %P: PID of the parent of the child that serviced the request
;  %q: the query string
;  %Q: the '?' character if query string exists
;  %r: the request URI (without the query string, see %q and %Q)
;  %R: remote IP address
;  %s: status (response code)
;  %t: server time the request was received
;      it can accept a strftime(3) format:
;      %d/%b/%Y:%H:%M:%S %z (default)
;  %T: time the log has been written (the request has finished)
;      it can accept a strftime(3) format:
;      %d/%b/%Y:%H:%M:%S %z (default)
;  %u: remote user
;
; Default: "%R - %u %t \"%m %r\" %s"
;access.format = "%R - %u %t \"%m %r%Q%q\" %s %f %{mili}d %{kilo}M %C%%"

; The log file for slow requests
; Default Value: not set
; Note: slowlog is mandatory if request_slowlog_timeout is set
;slowlog = /var/log/php5-fpm.log

; The timeout for serving a single request after which a PHP backtrace will be
; dumped to the 'slowlog' file. A value of '0s' means 'off'.
; Available units: s(econds)(default), m(inutes), h(ours), or d(ays)
; Default Value: 0
;request_slowlog_timeout = 10s

; The timeout for serving a single request after which the worker process will
; be killed. This option should be used when the 'max_execution_time' ini option
; does not stop script execution for some reason. A value of '0' means 'off'.
; Available units: s(econds)(default), m(inutes), h(ours), or d(ays)
; Default Value: 0
request_terminate_timeout = 30s

; Set open file descriptor rlimit.
; Default Value: system defined value
;rlimit_files = 1024
; Set max core size rlimit.
; Possible Values: 'unlimited' or an integer greater or equal to 0
; Default Value: system defined value
;rlimit_core = 0

; Chroot to this directory at the start. This value must be defined as an
; absolute path. When this value is not set, chroot is not used.
; Note: you can prefix with '$prefix' to chroot to the pool prefix or one
; of its subdirectories. If the pool prefix is not set, the global prefix
; will be used instead.
; Note: chrooting is a great security feature and should be used whenever
;       possible. However, all PHP paths will be relative to the chroot
;       (error_log, sessions.save_path, ...).
; Default Value: not set
;chroot =

; Chdir to this directory at the start.
; Note: relative path can be used.
; Default Value: current directory or / when chroot
chdir = /

; Redirect worker stdout and stderr into main error log. If not set, stdout and
; stderr will be redirected to /dev/null according to FastCGI specs.
; Note: on highloaded environement, this can cause some delay in the page
; process time (several ms).
; Default Value: no
;catch_workers_output = yes

; Limits the extensions of the main script FPM will allow to parse. This can
; prevent configuration mistakes on the web server side. You should only limit
; FPM to .php extensions to prevent malicious users to use other extensions to
; exectute php code.
; Note: set an empty value to allow all extensions.
; Default Value: .php
;security.limit_extensions = .php .php3 .php4 .php5

; Pass environment variables like LD_LIBRARY_PATH. All $VARIABLEs are taken from
; the current environment.
; Default Value: clean env
;env[HOSTNAME] = $HOSTNAME
;env[PATH] = /usr/local/bin:/usr/bin:/bin
;env[TMP] = /tmp
;env[TMPDIR] = /tmp
;env[TEMP] = /tmp

; Additional php.ini defines, specific to this pool of workers. These settings
; overwrite the values previously defined in the php.ini. The directives are the
; same as the PHP SAPI:
;   php_value/php_flag             - you can set classic ini defines which can
;                                    be overwritten from PHP call 'ini_set'.
;   php_admin_value/php_admin_flag - these directives won't be overwritten by
;                                     PHP call 'ini_set'
; For php_*flag, valid values are on, off, 1, 0, true, false, yes or no.
; Pass environment variables like LD_LIBRARY_PATH. All $VARIABLEs are taken from
; the current environment.
; Default Value: clean env
;env[HOSTNAME] = $HOSTNAME
;env[PATH] = /usr/local/bin:/usr/bin:/bin
;env[TMP] = /tmp
;env[TMPDIR] = /tmp
;env[TEMP] = /tmp

; Additional php.ini defines, specific to this pool of workers. These settings
; overwrite the values previously defined in the php.ini. The directives are the
; same as the PHP SAPI:
;   php_value/php_flag             - you can set classic ini defines which can
;                                    be overwritten from PHP call 'ini_set'.
;   php_admin_value/php_admin_flag - these directives won't be overwritten by
;                                     PHP call 'ini_set'
; For php_*flag, valid values are on, off, 1, 0, true, false, yes or no.

; Defining 'extension' will load the corresponding shared extension from
; extension_dir. Defining 'disable_functions' or 'disable_classes' will not
; overwrite previously defined php.ini values, but will append the new value
; instead.

; Note: path INI options can be relative and will be expanded with the prefix
; (pool, global or /usr)

; Default Value: nothing is defined by default except the values in php.ini and
;                specified at startup with the -d argument
;php_admin_value[sendmail_path] = /usr/sbin/sendmail -t -i -f www@my.domain.com
;php_flag[display_errors] = off
;php_admin_value[error_log] = /var/log/fpm-php.www.log
;php_admin_flag[log_errors] = on
;php_admin_value[memory_limit] = 32M

My /var/log/nginx/error.log

2016/09/15 16:48:11 [error] 29608#0: *62681 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 180.76.15.19, server: my-site.com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.my-site.com"
2016/09/15 16:49:29 [error] 29608#0: *62801 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 180.76.15.141, server: my-site.com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.my-site.com"
2016/09/15 22:46:55 [error] 29607#0: *84028 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 180.76.15.139, server: my-site.com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.my-site.com"
2016/09/15 22:47:23 [error] 29607#0: *84244 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 180.76.15.145, server: my-site.com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.my-site.com"
2016/09/15 23:26:12 [error] 29607#0: *90756 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 66.249.71.17, server: my-site.com, request: "GET /category/distribution/ HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "my-site.com"
2016/09/15 23:49:28 [error] 29608#0: *94579 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 180.76.15.33, server: my-site.com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.my-site.com"
2016/09/15 23:50:50 [error] 29608#0: *94786 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 180.76.15.163, server: my-site.com, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.my-site.com"

EDIT :
After checking php5-fpm log I found this :

My /var/log/upstart/php5-fpm.log

[16-Sep-2016 00:05:02] NOTICE: fpm is running, pid 32006
[16-Sep-2016 00:05:02] NOTICE: ready to handle connections
[16-Sep-2016 00:05:02] NOTICE: systemd monitor interval set to 10000ms
[16-Sep-2016 00:05:07] WARNING: [pool www] server reached pm.max_children setting (5), consider raising it
[16-Sep-2016 01:03:39] NOTICE: Terminating ...
[16-Sep-2016 01:03:39] NOTICE: exiting, bye-bye!
[16-Sep-2016 01:03:39] NOTICE: fpm is running, pid 32699
[16-Sep-2016 01:03:39] NOTICE: ready to handle connections
[16-Sep-2016 01:03:39] NOTICE: systemd monitor interval set to 10000ms
[16-Sep-2016 01:03:43] WARNING: [pool www] server reached pm.max_children setting (5), consider raising it

So I updated the php5-fpm config for the following parameters :
Currently I change the php5-fpm configuration, especially increasing the the pm.max_children value :

pm = dynamic
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 35

The site is running better since a few moments, but I doubt this is a long term solution to the problem.

EDIT 2 :

After a long investigation of mysql.log (all queries), and of the access I found that I was under a brute force attack.
You can find information about this attack here: https://www.digitalocean.com/community/tutorials/how-to-protect-wordpress-from-xml-rpc-attacks-on-ubuntu-14-04

The access log was showing repeatedly this :

191.96.249.75 - - [16/Sep/2016:05:28:52 +0000] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
191.96.249.75 - - [16/Sep/2016:05:28:54 +0000] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
191.96.249.75 - - [16/Sep/2016:05:28:57 +0000] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
191.96.249.75 - - [16/Sep/2016:05:28:58 +0000] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
191.96.249.75 - - [16/Sep/2016:05:29:01 +0000] "POST /xmlrpc.php HTTP/1.0" 499 0 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

Best Answer

Looks like PHP is falling over. Not much you can do about that really, but Monit can help. You need to look in the PHP log to see if it says why it's failing - suggest you edit your question and add those. Could be running out of memory. Check everything up to date of course, sudo yum update.

I have exactly the same t2.micro setup for my websites, but I run four websites including one moderately high volume on a t2.micro with RDS back end. I have a tutorial on how I set it up, including downloadable configurations. Page caching can be really useful to reduce load - the only time my server goes over 2% CPU is during the nightly offsite backups. I sit on full CPU credits 99% of the time.

Update

I note that you have at least 20 mysqld processes showing on the htop output, taking at least 60% of your memory.. I don't know if these are separate MySQL instances (unlikely) or processes/threads that accept connections (likely). I would suggest the first thing to do would be to reduce this down to a smaller number - 5 is probably sufficient, or maybe you can do min/expected/max connections - I'm not a MySQL expert.

Even better, move your server to a t2.nano or t2.micro and use Amazon RDS. RDS is in the free tier for a year, after that it's around $10 a month and means you don't have to run a database yourself. A t2.micro web server and t2.micro RDS server will cost about the same as a t2.small instance doing both.

I have higher volume websites than you and I'm pretty sure I can run web and database on the one server no problems. You just have to ensure you don't go over RAM available. That's pretty easy if you have a mostly read site, you just cache all the pages with Nginx page caching.

I still need to see your PHP logs to see if that's what's failing.

Update2

These lines are interesting. The first is simply information, you should probably follow its advice. The second, the termination, is around an hour later, but it doesn't say why it's terminating. I wonder if you can turn verbose logging on to help work that out.

[16-Sep-2016 00:05:07] WARNING: [pool www] server reached pm.max_children setting (5), consider raising it
[16-Sep-2016 01:03:39] NOTICE: Terminating ...

You could also check CloudWatch. Make sure you've set up the custom metrix "memory in use". Let it run until the problem happens, then look at the MAX value for consumed memory. If you're running out of memory it might show that, but it might not be granular enough. My theory, which to be honest probably isn't correct, is that the process can't allocate enough memory and is shutting down because of that. While it might not be the problem it's worth ruling out.

Do consider reducing MySQL RAM usage as well, like I said previously.

Related Topic