High-traffic Drupal site apache errors

apache-2.2 drupal

I'm getting a bunch of Apache errors that I'm having trouble tracking down. They're on a RHEL system that runs a very high-volume Drupal website.

[Mon Sep 14 12:48:44 2009] [info] [client xx.xx.xxx.xx] (70007)The timeout specified has expired: core_output_filter: writing data to the network
[Mon Sep 14 12:50:19 2009] [info] [client xx.xxx.xx.xx] (104)Connection reset by peer: core_output_filter: writing data to the network
[Mon Sep 14 12:51:28 2009] [info] [client xx.xxx.xx.xx] (32)Broken pipe: core_output_filter: writing data to the network

Every 24 to 36 hours there is a load spike and the site becomes completely unresponsive. Load average climbs from a normal 1-1.5 to 200. Most of the running httpd processes show state 'D' (uninterruptible sleep, usually meaning they are blocked on I/O), and the only way to get the server back to an interactive state is to three-finger-salute it, or to wait until you get a prompt and killall -9 httpd.

Obviously, the site can't be taken down for me to do a bunch of strace work. I've checked the Apache configuration and, as far as I can tell, EnableMMAP and EnableSendfile are both disabled. The files are on an NFS v3 mount, but neither the NFS server, nor the MySQL server, nor anything else is reporting errors. There is nothing relevant in the system log or dmesg. The site is also under too much load for me to match individual requests to the errors they produce.

At this point, I'm thinking network hardware error and I'd prefer to bring the site up on a second machine. Anyone have any thoughts before I do this?

Best Answer

This is a wild-ass guess, but have you checked how many on-disk temporary tables MySQL is creating for Drupal's queries?

I have seen this cause iowait (load) problems.

mysqladmin -u root -p ext -ri 30 | grep Created_tmp_disk

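Here ext is short for extended-status, -r reports the difference between samples, and -i 30 repeats the command every 30 seconds. The first sample tells you how many on-disk temporary tables have been created since MySQL was last restarted. Each sample after that tells you how many were created in the preceding 30-second window (until you Control-C out of it).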
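A raw count is easier to judge next to the total number of temporary tables. Here is a minimal sketch that reads one extended-status listing and reports what share of temporary tables went to disk. The function name tmp_disk_ratio is made up for illustration, and it assumes the usual pipe-delimited table output from mysqladmin.

```shell
# Hypothetical helper: reads `mysqladmin extended-status` output on stdin
# and reports how many temporary tables were created on disk versus in total.
tmp_disk_ratio() {
    awk -F'|' '
        # Field 2 is the variable name, field 3 is its value.
        $2 ~ /Created_tmp_disk_tables/ { gsub(/ /, "", $3); disk = $3 }
        $2 ~ /Created_tmp_tables/      { gsub(/ /, "", $3); total = $3 }
        END {
            if (total > 0)
                printf "%d of %d temporary tables went to disk (%.1f%%)\n",
                       disk, total, 100 * disk / total
        }
    '
}

# Usage (counters are cumulative since the last MySQL restart):
#   mysqladmin -u root -p extended-status | tmp_disk_ratio
```

A high share is only a hint, not proof, so it is worth comparing against a quiet period before drawing conclusions.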

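The (band-aid) solution is to put MySQL's tmpdir on a RAM-based file system (e.g. tmpfs). That takes the temporary tables off the disk, but it doesn't reduce how many of them the queries create.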
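Roughly, the tmpfs setup could look like the sketch below. The mount point /var/lib/mysql-tmp and the 512m size are assumptions picked for illustration, so size it to fit your largest temporary tables. The helper is_tmpfs_mount is also made up; it only checks that the directory really is a tmpfs mount once you're done.

```shell
# Assumed mount point and size; adjust both for your system.
# Run as root:
#   mkdir -p /var/lib/mysql-tmp
#   mount -t tmpfs -o size=512m,mode=1777 tmpfs /var/lib/mysql-tmp
# Then set this in the [mysqld] section of my.cnf and restart MySQL:
#   tmpdir = /var/lib/mysql-tmp

# Hypothetical helper: succeeds if the given directory is a tmpfs mount point.
# Expects a mount table in /proc/mounts format on stdin.
is_tmpfs_mount() {
    awk -v dir="$1" '$2 == dir && $3 == "tmpfs" { found = 1 }
                     END { exit !found }'
}

# Usage:
#   is_tmpfs_mount /var/lib/mysql-tmp < /proc/mounts && echo "tmpdir is on tmpfs"
```

Keep in mind that tmpfs is emptied on reboot, so the mount needs an /etc/fstab entry to survive one, and the MySQL documentation advises against a memory-based tmpdir on replication slaves.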

I guess what I'm suggesting is that this is what starts the cascade, and the messages you're seeing are just abandoned connections.
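If you can get a shell during the next spike, it may help to see what the stuck processes are actually waiting on before killing them. This is a minimal sketch, assuming the procps version of ps that ships with RHEL; the function names are made up for illustration. The WCHAN column shows the kernel function each process is sleeping in, which can hint at whether the wait is on NFS, local disk, or something else.

```shell
# Hypothetical helper: keeps the header line plus any process whose state
# starts with D (uninterruptible sleep). Reads ps output on stdin, with the
# state in column 2.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Lists PID, state, kernel wait channel, and command name for D-state processes.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}

# Usage during a load spike:
#   list_d_state
```

This only reads the process table, so it is safe to run on the live server without strace.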

Cheers
