Apache mpm-worker + mod_fcgid + php5-cgi partially and sporadically down

Tags: apache-2.2, mpm-worker, php-fpm

Recently, I changed from Apache mpm-prefork (PHP as an Apache module) to mpm-worker (PHP as FastCGI via mod_fcgid) because of memory issues. I am running a fairly large PHP application that requires ~20-30 MB per prefork process.

Overall, the server is stable and fast. However, from time to time, the page is unavailable to some users for a few minutes.

Working hypothesis 1 (a rough idea) is that one of the processes (there are usually 2, sometimes up to 5 or 6) hangs, and every client assigned to that process (e.g. 50% of the clients) receives an error message.

Working hypothesis 2 is that MaxRequestsPerProcess is responsible: after 500 requests, the process tries to shut down, mod_fcgid fails to kill it gracefully, and while the process is waiting for the kill, further clients are assigned to (and rejected by) it. But I cannot really imagine that Apache would be so stupid.

My problem is that there is nothing in the error logs except the occasional

[warn] mod_fcgid: process ???? graceful kill fail, sending SIGKILL

I am running out of ideas about where to trace the problem. It appears sporadically, and I have not yet managed to provoke it. Server performance (CPU/RAM) should not be an issue, as the overall load has been on the low side in recent weeks.

Thanks for any hints. Do you have any comments on my hypotheses? (They have not led me to a solution yet; I tried disabling MaxRequestsPerProcess but do not yet know whether it helped.) I would greatly appreciate some ideas on how to trace this problem.
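
For what it is worth, the next time it happens I plan to check along these lines to see which php5-cgi worker is stuck and what it is blocked on. This is only a rough sketch: the error log path is the Ubuntu default and may differ on other setups, and the PID in the strace call is a placeholder.

    # Recent graceful-kill failures, with timestamps (assumes the default
    # Ubuntu error log location)
    grep 'graceful kill fail' /var/log/apache2/error.log | tail -n 20

    # Current php5-cgi workers: PID, state, runtime, resident memory
    ps -o pid,stat,etime,rss,args -C php5-cgi

    # Attach to a suspicious worker (state D or an unusually long etime) and
    # see which system call it is blocked in; replace 1234 with the real PID
    strace -tt -p 1234 2>&1 | head -n 50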

Apache configuration

    <Directory /var/www/html>
           ...

            # PHP FCGI
            <FilesMatch \.php$>
                    SetHandler fcgid-script
            </FilesMatch>
            Options +ExecCGI
    </Directory>

    <IfModule mod_fcgid.c>
            FcgidWrapper /var/www/php-fcgi-starter .php
            # Allow request up to 33 MB
            FcgidMaxRequestLen 34603008
            FcgidIOTimeout 300
            FcgidBusyTimeout 3600
            # Set 1200 (>1000) for PHP_FCGI_MAX_REQUESTS to avoid problems
            FcgidMaxRequestsPerProcess 1000
    </IfModule>

Apache module configuration

    <IfModule mod_fcgid.c>
      AddHandler    fcgid-script .fcgi
      FcgidConnectTimeout 20
      FcgidBusyTimeout 7200

      DefaultMinClassProcessCount 0
      IdleTimeout 600
      IdleScanInterval 60
      MaxProcessCount 20

      MaxRequestsPerProcess 500
      PHP_Fix_Pathinfo_Enable 1
    </IfModule>

Note: The timeout was set to 2 hours because, in rare cases, the application may need a long time to run (e.g. the nightly cronjob that performs a database optimization).

Starter script

    #!/bin/sh
    # Keep this above mod_fcgid's MaxRequestsPerProcess so that PHP does not
    # exit on its own in the middle of a batch of requests
    PHP_FCGI_MAX_REQUESTS=1200
    export PHP_FCGI_MAX_REQUESTS

    export PHPRC="/etc/php5/cgi"
    exec /usr/bin/php5-cgi

    # Left unset: mod_fcgid manages its own process pool, so PHP_FCGI_CHILDREN
    # is not needed (and these lines after "exec" would never run anyway)
    #PHP_FCGI_CHILDREN=10
    #export PHP_FCGI_CHILDREN
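
A quick way to confirm that PHP_FCGI_MAX_REQUESTS from this wrapper actually reaches the running workers is to look at their environment. This is a sketch only; it assumes Linux /proc is available and that it is run as root or as the user the workers run as.

    for pid in $(pgrep php5-cgi); do
        printf '%s: ' "$pid"
        # environ is NUL-separated; print only the variable we care about
        tr '\0' '\n' < "/proc/$pid/environ" | grep '^PHP_FCGI_MAX_REQUESTS=' \
            || echo 'PHP_FCGI_MAX_REQUESTS not set'
    done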

Package versions

  • System: Ubuntu 12.04.2 LTS
  • apache2-mpm-worker: 2.2.22-1ubuntu1.4
  • libapache2-mod-fcgid: 1:2.3.6-1.1
  • php5-common: 5.3.10-1ubuntu3.7

Best Answer

I'd regard 20-30 MB per process as quite small. It's all relative, really, but most CMS applications, for example, will require at least 100 MB. Also, your maximum upload size will be constrained by the maximum process size, if that matters.

When your server is unavailable, it's likely that the PHP worker processes are all busy; however, that's only a proximate cause. Something is slowing down your server such that, for a while at least, the PHP processes can't keep up with the incoming requests. What is slowing your server down is hard to judge, but the 'graceful kill fail' makes me think the process that was to be killed was probably waiting on disk.

Have you logged in while this is happening? Does the system feel responsive?

In top, look at the process states and watch for processes in state 'D', which are waiting on I/O. Are there many of these? The 'wa' value in the summary at the top is the total amount of time processes spend waiting on I/O (it is shown as a percentage, likely of a single processor's time). Tools like iotop, atop, and vmstat may also be useful for seeing which processes are disk-bound and to what extent the disk is limiting your overall performance.
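
For example, something along these lines will show the 'D'-state processes and the overall I/O wait (a sketch only; exact columns vary a little between versions, and iotop needs root):

    # Processes currently in uninterruptible sleep ('D'), i.e. waiting on I/O
    ps -eo pid,stat,wchan:30,args | awk '$2 ~ /D/'

    # Ten one-second samples: 'b' is blocked processes, 'wa' is I/O wait
    vmstat 1 10

    # Per-process disk throughput, only showing processes actually doing I/O
    iotop -o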

Your understanding of what happens when a worker process is not available to take new requests is incorrect. New requests will not be assigned to it.

1000 requests before killing the worker is high. I'd suggest dropping it to somewhere between 10 and 50.
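
If you do lower it, you can watch the effect on process turnover and per-process memory with something like this (process name taken from the question's wrapper script):

    # Refresh every 5 seconds: with a lower limit you should see workers being
    # replaced (new PIDs, short etime) while RSS per worker stays bounded
    watch -n 5 'ps -o pid,etime,rss,args -C php5-cgi'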