Looks OK to me.
You've got 12 cores, across 2x 6-core CPUs, so at 100% utilisation your load average would be 12.
Load average is funny. I do not think it means what you think it means.
Load average is really an indication of how many processes are running (or waiting to run) at any one time, averaged over 1, 5 and 15 minute windows. On Linux it also counts tasks stuck in uninterruptible I/O wait, which is relevant given your iowait figures.
Looks to me like you're a little overcommitted, but not drastically.
Perhaps use http://mysqltuner.pl/mysqltuner.pl to get some idea of how your mysqld settings equate to real usage amounts.
The next logical step, of course, is to separate MySQL and Apache onto different boxes. I'm not sure you're at that level yet, because you've still got a pantload of RAM free for MySQL to suck up. You might find some benefit from making the query caches and key buffers bigger, and it's probably worth a deeper look at MySQL's slow query log to see if you can optimise the tables at all.
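For reference, these are the sort of my.cnf knobs I mean. The values here are placeholders, not recommendations; size them against what mysqltuner tells you about your actual usage:

```ini
[mysqld]
# Example values only - tune against real usage, not these numbers
query_cache_type  = 1
query_cache_size  = 64M
key_buffer_size   = 256M     # MyISAM index cache

# Slow query log (MySQL 5.1+ option names)
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 2      # log anything slower than 2 seconds
```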
There's loads of information about how to read load averages, and really it's more sensible to divide the load average number by the number of cores, so you've got some idea of how utilised the server actually is.
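For example, a quick way to get that per-core figure on Linux (assuming /proc/loadavg and nproc are available):

```shell
#!/bin/sh
# Divide the 1-minute load average by the number of cores.
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f load per core\n", l / c }'
```

On your box, a 1-minute load of 12 over 12 cores works out to 1.00 per core, i.e. fully busy but not queueing.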
I can see now you've got 33% iowait. I suspect that you've got a fairly write-heavy database, and this is causing tables to be locked while you're writing, meaning that concurrent writes cannot happen (MyISAM locks the whole table for every write).
Having had a sniff at my.cnf, it looks like max_connections is quite high. Not a huge concern in itself, but it does mean that if all of them are in use at once, you'll need 27GB of RAM to cover it. Which is loads, but again, not a huge concern.
Consider turning on PHP APC Opcode caching.
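If you go that route, enabling APC is typically just a couple of ini lines. The module path and sizes here are examples; check your distro's packaging:

```ini
; Hypothetical apc.ini fragment - names and sizes are examples
extension = apc.so
apc.enabled = 1
apc.shm_size = 64M    ; shared memory for cached opcodes
```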
**Edit**
Having seen the query log now, I'm inclined to think there are a few things that might benefit the server.
- PHP APC Opcode caching (makes apache more efficient generally)
- Convert all tables to InnoDB unless you've got a really good reason not to. If that reason is fulltext searching, find a better way to do it, and move to InnoDB.
- Buy another server, and make it a dedicated DB host. Fit it with SAS disks, and separate it into partitions so that logging and data are on separate spindles (or rather, RAID arrays).
Without a much deeper look into what the hell is going on, it's difficult to actually say.
Might be worth a trial run with NewRelic for PHP. It's free for a month, and does tend to give good insight into bad code smells.
Alternatively, I am available for consultancy ;)
You have set pm.max_children to 30, which means that only 30 PHP scripts can run concurrently. When more users visit your sites, there aren't any free PHP processes to serve the requests; nginx waits for a while, then returns the 504 Gateway Time-out error.
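For what it's worth, the length of that wait is governed by nginx's fastcgi_read_timeout (default 60s). A sketch, assuming a socket path like this one:

```nginx
location ~ \.php$ {
    fastcgi_pass unix:/var/run/php-fpm.sock;  # assumed socket path
    fastcgi_read_timeout 60s;  # how long nginx waits on PHP before returning a 504
    include fastcgi_params;
}
```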
You seem to have plenty of free memory, as the cached column shows 2.9 GB.
You should check the average memory usage of your PHP processes with the top command; the figure we're interested in is the RES column. Divide 2GB by that number, and you'll get a safe value for the pm.max_children setting.
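As a worked example (the RES figure here is made up; substitute your own from top):

```shell
#!/bin/sh
# Assume each PHP process shows roughly 55 MB in top's RES column,
# and we're budgeting 2 GB of RAM for PHP.
budget_mb=2048
per_process_mb=55
echo $(( budget_mb / per_process_mb ))    # prints 37
```

which would suggest a pm.max_children of 37 rather than 30.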
You should also consider raising the values of pm.start_servers, pm.min_spare_servers and pm.max_spare_servers. Spare servers are processes that are available to serve requests immediately; otherwise the PHP process manager needs to launch a new process first, which takes some time.
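Something along these lines in the pool config. The numbers are illustrative only; derive max_children from the RES division described earlier, and scale the spare-server counts to your traffic:

```ini
; Illustrative php-fpm pool values - not recommendations
pm = dynamic
pm.max_children      = 37   ; ceiling on concurrent PHP processes
pm.start_servers     = 10   ; spawned at startup
pm.min_spare_servers = 5    ; always keep at least this many idle
pm.max_spare_servers = 15   ; kill idle processes beyond this
```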
Best Answer
So what changed? Almost always the answer is that requests are getting bogged down in backend code processing; sometimes it's things like TCP socket exhaustion, and other times it's simple pool or connection limits between Apache and PHP.
Follow this process:
"How do I get nondescript php application to meet x benchmark?" is a very broad question but the process is generally, get your frontend static serving to beat your requirements, then add your backend code to the mix. Either ix slowness in the backend code or eliminate it.
Once you start hitting thousands of concurrent users, you're going to brush up against certain limits (filesystem speed, memory, CPU speed, communication latency, inefficient WordPress code and slow database calls, as well as a ton of context switching on the CPU) that compound and quickly result in exponential degradation of service.