I'm really struggling with a performance issue: when my database server is under "heavy" load, all my database queries take up to 1,000 times longer than normal.
I have a server running Ubuntu 12.04 that hosts my PostgreSQL 9.2 database. The server is at Hetzner.de, an EX6S (8-core Xeon with 32 GB RAM and two 3 TB HDDs in a RAID 1 setup using Ubuntu's software RAID).
When I run a high volume of queries (which I often do at night), I see that almost all CPU usage is spent in CPU I/O wait. I have New Relic monitoring installed, but I can't find any other indication of the root cause of this CPU I/O wait, which is clearly a bottleneck for my performance and throughput.
So the question is: what is this CPU I/O wait, and what is it waiting for?
I added all the overview charts I can get hold of from New Relic below. What am I overlooking? There must be an obvious bottleneck somewhere. Where should I dive in?
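For context, iowait can also be watched outside New Relic with standard tools. A minimal sketch (the `vmstat` line below is illustrative sample output, not from this server; on the server you would simply run `vmstat 1`):

```shell
# vmstat's "wa" column (field 16 in this layout) is the percentage of
# CPU time spent idle while disk requests are outstanding.
# Columns: r b swpd free buff cache si so bi bo in cs us sy id wa st
sample=' 1  0      0 241092 162100 28312440    0    0  1664  6532  900 1500  5  2 48 45  0'
echo "$sample" | awk '{ print "iowait: " $16 "%" }'
# prints: iowait: 45%
```

A sustained high "wa" with low "us"/"sy" is exactly the symptom described above: the CPUs are idle, blocked on disk.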
Database server CPU usage – The evil one that tells me something is wrong
https://rpm.newrelic.com/public/charts/cEdIvvoQZCr
Database server load average
https://rpm.newrelic.com/public/charts/cMNdrYW51QJ
Database server physical memory
https://rpm.newrelic.com/public/charts/c3dZBntNpa1
Database server disk I/O utilization – as you can see, the disk doesn't seem to be utilized much (almost nothing)
https://rpm.newrelic.com/public/charts/9YEVw6RekFG
Database server network I/O (Mb/s) – the network is a gigabit internal network that all communication goes over.
https://rpm.newrelic.com/public/charts/lKiZ0Szmwe7
Top 5 database operations by wall clock time
https://rpm.newrelic.com/public/charts/dCt45YH12FK
Database throughput
https://rpm.newrelic.com/public/charts/bIbtQ1mDzMI
Database response time
https://rpm.newrelic.com/public/charts/fPcNL8WA6xx
UPDATE:
After running `sudo iostat -k 1`, I'm starting to get suspicious. I get a lot of output like this that I don't see in NR:
```
Device:    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda       0.00         0.00         0.00          0          0
sdb       0.00         0.00         0.00          0          0
sdc     546.00      2296.00      6808.00       2296       6808
sdd     593.00      1040.00      7416.00       1040       7416
md1       0.00         0.00         0.00          0          0
md0       0.00         0.00         0.00          0          0
md2    1398.00      3328.00     13064.00       3328      13064
md3       0.00         0.00         0.00          0          0
```
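One way to make sense of this output (a sketch, assuming `md2` is the array backed by `sdc` and `sdd`, which you can confirm with `cat /proc/mdstat` or `lsblk`): the busy member disks roughly account for the array's totals. Summing the kB_wrtn/s figures copied from the sample above:

```shell
# kB_wrtn/s values taken from the iostat sample in the question.
sample='sdc 6808
sdd 7416
md2 13064'
echo "$sample" | awk '
  /^sd/ { disks += $2 }   # sum the physical member disks
  /^md/ { array  = $2 }   # the md array total
  END   { printf "member disks: %d kB/s, md2: %d kB/s\n", disks, array }'
# prints: member disks: 14224 kB/s, md2: 13064 kB/s
```

So the real write load is landing on `sdc`/`sdd` via `md2` — activity that New Relic's disk chart is evidently not reporting.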
Best Answer
In short, you need faster disks.
When processes are in iowait, it means they have issued an IO request and are waiting for the result of that request. With adequate RAM, most of your working set will be cached in RAM, so reads will not contribute much to this IO contention; writes are typically the culprit.
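To confirm which processes are issuing those writes, `sudo pidstat -d 1` (from the sysstat package) or `iotop -o` shows per-process disk throughput. A sketch of filtering for heavy writers — the sample line and its values are illustrative, not real output from this server:

```shell
# pidstat -d line format: Time  PID  kB_rd/s  kB_wr/s  kB_ccwr/s  Command
sample='12:00:01  4242  0.00  6900.00  0.00  postgres'
# Flag any process writing more than 1000 kB/s:
echo "$sample" | awk '$4 > 1000 { print $6 " writes " $4 " kB/s" }'
# prints: postgres writes 6900.00 kB/s
```

If the heavy writer is a PostgreSQL backend or the background writer, that points at the database's write path rather than some other process on the box.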
Regarding your disk IO graph, I suspect that New Relic is, for some reason, not collecting IO metrics correctly. I'd recommend looking into a secondary data collection system like Munin. Short of that, you can watch disk IO in real time with the command you already used: `sudo iostat -k 1`. This will output real-time disk reads and writes in kB/s. I suspect you'll see quite a bit of activity there that New Relic is not seeing.
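Beyond raw throughput, `iostat -x 1` adds latency columns: `await` (average request latency in ms) and `%util` (fraction of time the device was busy). A sketch of reading them — the sample numbers below are illustrative, not measured:

```shell
# A spinning disk near 100% util with await in the tens of ms is saturated.
sample='sdc 80.5 98.7'   # device, await (ms), %util — illustrative values
echo "$sample" | awk '$3 > 90 { print $1 " is saturated: await=" $2 "ms util=" $3 "%" }'
# prints: sdc is saturated: await=80.5ms util=98.7%
```

If `%util` on the member disks sits near 100% during the nightly runs, that confirms the "faster disks" diagnosis above.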