I'm really struggling with a performance issue: when my database server is under "heavy" load, all my database queries take up to 1,000 times longer than normal.
I have a server running Ubuntu 12.04 that hosts my PostgreSQL 9.2 database. The server is at Hetzner.de, an EX6S (8-core Xeon with 32 GB RAM and two 3 TB HDDs in a RAID 1 setup using Ubuntu's software RAID).
When I run a high volume of queries (which I often do at night), I see that almost all CPU usage is spent in CPU I/O wait. I have New Relic monitoring installed, but I can't find any other indication of the root cause of this CPU I/O wait, which is clearly a bottleneck for my performance and throughput.
So the question is: what is this CPU I/O wait, and what is it waiting for?
I added all the overview charts I can get hold of from New Relic below. What am I overlooking? There must be an obvious bottleneck somewhere. Where should I dive in?
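For context, iowait can also be watched outside New Relic with standard tools. A minimal sketch (the `vmstat` line below is illustrative sample output, not from this server; on the server you would simply run `vmstat 1`):

```shell
# vmstat's "wa" column (field 16 in this layout) is the percentage of
# CPU time spent idle while disk requests are outstanding.
# Columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
sample=' 1  0      0 241092 162100 28312440    0    0  1664  6532  900 1500  5  2 48 45  0'
echo "$sample" | awk '{ print "iowait: " $16 "%" }'
# prints: iowait: 45%
```

A sustained high "wa" with low "us"/"sy" is exactly the symptom described above: the CPUs are idle, blocked on disk.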
Database server CPU usage – The evil one that tells me something is wrong
https://rpm.newrelic.com/public/charts/cEdIvvoQZCr
Database server load average
https://rpm.newrelic.com/public/charts/cMNdrYW51QJ
Database server physical memory
https://rpm.newrelic.com/public/charts/c3dZBntNpa1
Database server disk I/O utilization – as you can see, the disk doesn't seem to be utilized much (almost nothing)
https://rpm.newrelic.com/public/charts/9YEVw6RekFG
Database server network I/O (Mb/s) – the network is a gigabit internal network that all communication goes over.
https://rpm.newrelic.com/public/charts/lKiZ0Szmwe7
Top 5 database operations by wall clock time
https://rpm.newrelic.com/public/charts/dCt45YH12FK
Database throughput
https://rpm.newrelic.com/public/charts/bIbtQ1mDzMI
Database response time
https://rpm.newrelic.com/public/charts/fPcNL8WA6xx
UPDATE:
After running `sudo iostat -k 1`, I'm starting to get suspicious. I get a lot of output like this that I don't see in NR:
```
Device:    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda       0.00         0.00         0.00          0          0
sdb       0.00         0.00         0.00          0          0
sdc     546.00      2296.00      6808.00       2296       6808
sdd     593.00      1040.00      7416.00       1040       7416
md1       0.00         0.00         0.00          0          0
md0       0.00         0.00         0.00          0          0
md2    1398.00      3328.00     13064.00       3328      13064
md3       0.00         0.00         0.00          0          0
```
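One way to make sense of this output (a sketch, assuming `md2` is the array backed by `sdc` and `sdd`, which you can confirm with `cat /proc/mdstat` or `lsblk`): the busy member disks roughly account for the array's totals. Summing the kB_wrtn/s figures copied from the sample above:

```shell
# kB_wrtn/s values taken from the iostat sample in the question.
sample='sdc 6808
sdd 7416
md2 13064'
echo "$sample" | awk '
  /^sd/ { disks += $2 }   # sum the physical member disks
  /^md/ { array  = $2 }   # the md array total
  END   { printf "member disks: %d kB/s, md2: %d kB/s\n", disks, array }'
# prints: member disks: 14224 kB/s, md2: 13064 kB/s
```

So the real write load is landing on `sdc`/`sdd` via `md2` — activity that New Relic's disk chart is evidently not reporting.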
Best Answer
In short, you need faster disks.
When processes are in iowait, it means they have issued an IO request and are waiting for the result of that request. With adequate RAM, most of your working set will be cached in RAM, so reads will not contribute much to this IO contention; writes are typically the culprit.
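To confirm which processes are issuing those writes, `sudo pidstat -d 1` (from the sysstat package) or `iotop -o` shows per-process disk throughput. A sketch of filtering for heavy writers — the sample line and its values are illustrative, not real output from this server:

```shell
# pidstat -d line format: Time  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  Command
sample='12:00:01  4242  0.00  6900.00  0.00  postgres'
# Flag any process writing more than 1000 kB/s:
echo "$sample" | awk '$4 > 1000 { print $6 " writes " $4 " kB/s" }'
# prints: postgres writes 6900.00 kB/s
```

If the heavy writer is a PostgreSQL backend or the background writer, that points at the database's write path rather than some other process on the box.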
Regarding your disk IO graph, I suspect that New Relic is, for some reason, not collecting IO metrics correctly. I'd recommend looking into a secondary data collection system like Munin. Short of that, you can watch disk IO in real time with the command you already used: `sudo iostat -k 1`. This will output real-time disk reads and writes in kB/s. I suspect you'll see quite a bit of activity there that New Relic is not seeing.
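Beyond raw throughput, `iostat -x 1` adds latency columns: `await` (average request latency in ms) and `%util` (fraction of time the device was busy). A sketch of reading them — the sample numbers below are illustrative, not measured:

```shell
# A spinning disk near 100% util with await in the tens of ms is saturated.
sample='sdc 80.5 98.7'   # device, await (ms), %util — illustrative values
echo "$sample" | awk '$3 > 90 { print $1 " is saturated: await=" $2 "ms util=" $3 "%" }'
# prints: sdc is saturated: await=80.5ms util=98.7%
```

If `%util` on the member disks sits near 100% during the nightly runs, that confirms the "faster disks" diagnosis above.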