Ubuntu – How to troubleshoot performance issues of PHP, MySQL and generic I/O

performancePHPUbuntuvirtual-machinesWordpress

I have a WordPress based website running on a shared hosting. Its response time is very decent (around 2s to retrieve the HTML page and 5s to load all the resources).

I was planning to move it to a dedicated virtual server (Ubuntu 12.04 LTS), which should theoretically improve things and make them more consistent given its not shared. However I observed severe performance degredation, with the page taking 10seconds to be generated.

I ruled out network issues by editing /etc/hosts on the server and mapping the domain to 127.0.0.1. I used the Apache load tester ab to get the HTML, so JS, CSS and images are all excluded. It still took 10 seconds.

I have Zpanel installed on the server which also uses MySQL, and its pages come up quite fast (1.5s) and also phpMyAdmin. Performing some queries on the wordpress database directly through phpMyAdmin returns them quite fast too, with query times in the 10 to 30 millisecond region.

Memory is also sufficient, with only 800Mb being used of the 1Gb physical memory available, so it doesn't seem to be a swap issue either. I have also installed APC to try to improve the PHP performance, but it didn't have any effect.

What else should I look for? What could be causing this degradation in performance? Could it be some kind of I/O issue since I am running on a cloud based virtual server?

I wish to be able to raise the issue with my provider but without showing actual data from some diagnosis I am afraid he will just blame my application.

UPDATE with sar output (every second) when I did an HTTP request:

02:31:29        CPU     %user     %nice   %system   %iowait    %steal     %idle
02:31:30        all      0.00      0.00      0.00      0.00      0.00    100.00
02:31:31        all      2.22      0.00      2.22      0.00      0.00     95.56
02:31:32        all     41.67      0.00      6.25      0.00      2.08     50.00
02:31:33        all     86.36      0.00     13.64      0.00      0.00      0.00
02:31:34        all     75.00      0.00     25.00      0.00      0.00      0.00
02:31:35        all     93.18      0.00      6.82      0.00      0.00      0.00
02:31:36        all     90.70      0.00      9.30      0.00      0.00      0.00
02:31:37        all     71.05      0.00      0.00      0.00      0.00     28.95
02:31:38        all     14.89      0.00     10.64      0.00      2.13     72.34
02:31:39        all      2.56      0.00      0.00      0.00      0.00     97.44
02:31:40        all      0.00      0.00      0.00      0.00      0.00    100.00
02:31:41        all      0.00      0.00      0.00      0.00      0.00    100.00

UPDATE 2 After josten's suggestions.

I/O:

iotop fails with OSError: Netlink error: No such file or directory (2) and sar -d also fails with Requested activities not available in file /var/log/sysstat/sa14. I think this is because this is a virtual machine, just like iostat also fails. Could it be the reason why %iowait reported by sar 1 10 is always 0%?

CPU Load:

The process that is topping the CPU% in htop is actually apache2. I was expecting this to maybe be the database, but its not. It goes up to 94% for a few seconds when I do a fresh HTTP request. Seems this is the culprit.

I have done an strace -f -t and one summary strace -c -f. There seems to be an awful lot of lstat calls (57786), with 2455 resulting in errors. No idea if this is normal.
Other than that the topmost call was wait4 which I presume is normal (its just waiting), and munmap. Top 5 below.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.06    0.124742         897       139         6 wait4
 14.90    0.036388           1     57786      2455 lstat
  9.67    0.023622          13      1857           munmap
  7.69    0.018790          37       514           brk
  6.70    0.016361         481        34           clone
  2.87    0.006999          74        94        12 select

strace itself slowed down apache by a factor of 2. I am trying to understand the long trace now to see if there is anything indicative of what was causing the CPU to spike for a few seconds.

What is the typical time for lstat for a good performing server? I wish to gather some information so that I can complain in a constructive manner to the provider if it is the storage access fault.

UPDATE Output of fio random read test:

random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.59
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r] [100.0% done] [12185K/0K /s] [2975 /0  iops] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=24264
  read : io=131072KB, bw=10298KB/s, iops=2574 , runt= 12728msec
    clat (usec): min=119 , max=162219 , avg=380.34, stdev=957.37
     lat (usec): min=119 , max=162219 , avg=380.89, stdev=957.40
    bw (KB/s) : min= 7200, max=13424, per=99.89%, avg=10285.72, stdev=1608.68
  cpu          : usr=2.80%, sys=18.65%, ctx=33511, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=32768/0/0, short=0/0/0
     lat (usec): 250=45.57%, 500=37.17%, 750=3.41%, 1000=7.83%
     lat (msec): 2=5.67%, 4=0.27%, 10=0.08%, 20=0.01%, 250=0.01%

Run status group 0 (all jobs):
   READ: io=131072KB, aggrb=10297KB/s, minb=10545KB/s, maxb=10545KB/s, mint=12728msec, maxt=12728msec

The only hint I have now is that the CPU line of the fio output seems to show quite a bit of activity when compared to other systems. I ran it on my local Ubuntu machine and the output was:

cpu          : usr=0.19%, sys=0.59%, ctx=32923, majf=0, minf=23

The usr percentage seems to be a small fraction of what is being reported on my server.

UPDATE Re PHP APC. Yes it is installed.
Output from phpinfo:

APC Support enabled
Version 3.1.7
APC Debugging   Disabled
MMAP Support    Enabled
MMAP File Mask  no value
Locking type    pthread mutex Locks
Serialization Support   php
Revision    $Revision: 307215 $
Build Date  May 2 2011 19:00:42

Is there any specific setting I should check for? These are the settings I have (local value, Master value):

apc.cache_by_default    On  On
apc.canonicalize    On  On
apc.coredump_unmap  Off Off
apc.enable_cli  Off Off
apc.enabled On  On
apc.file_md5    Off Off
apc.file_update_protection  2   2
apc.filters no value    no value
apc.gc_ttl  3600    3600
apc.include_once_override   Off Off
apc.lazy_classes    Off Off
apc.lazy_functions  Off Off
apc.max_file_size   1M  1M
apc.mmap_file_mask  no value    no value
apc.num_files_hint  1000    1000
apc.preload_path    no value    no value
apc.report_autofilter   Off Off
apc.rfc1867 Off Off
apc.rfc1867_freq    0   0
apc.rfc1867_name    APC_UPLOAD_PROGRESS APC_UPLOAD_PROGRESS
apc.rfc1867_prefix  upload_ upload_
apc.rfc1867_ttl 3600    3600
apc.serializer  default default
apc.shm_segments    1   1
apc.shm_size    32M 32M
apc.slam_defense    On  On
apc.stat    On  On
apc.stat_ctime  Off Off
apc.ttl 0   0
apc.use_request_time    On  On
apc.user_entries_hint   4096    4096
apc.user_ttl    0   0
apc.write_lock  On  On

UPDATE Increased apc.shm_size to 96M. Cache full count is now 0, and there are 96.5% hits to the cache after a few refreshes of the website here and there. APC memory usage is 25.4MB free.

It seems to have reduced the loading time by 3 seconds or so, now down to around 4 to 5 seconds if I do a pure wget from the server itself without getting any images etc. Still more than twice slower than the other hosting but definitely was an improvement.

I am still finding it strange why it was taking so long to render those pages when the server is totally idle (I don't have APC installed on my development PC and it doesn't have that kind of behaviour). And its still strange where those extra remaining seconds are being wasted.

Best Answer

This looks like other cases I've seen where Apache is spending a lot of its time compiling PHP. Have you made sure an opcode cache (e.g. APC) is installed? It'll show as a loaded module in the output of phpinfo(), if that helps. Otherwise, to track what Apache's doing within mod_php, your best bet is going to be XHProf.

To anyone other than jbx arriving here via Google: the other answers are excellent, by the way. Go read them. But those answers, and jbx's responses to them, have helped me arrive at this conclusion.

Related Topic