Postgresql – Postgres suddenly slower and higher CPU usage

amazon ec2postgresql

We're running a Postgres DB that's about 38GB, hosted on a 68GB EC2 instance. It's been humming along, with load at ~0.7 (on an 8 core machine) and little CPU usage until about 16 hours ago, when CPU usage ramped up over the course of a couple of hours and is now much higher than before (around 20% of total), and load average has increased accordingly (now between 5 and 8).

The machine sees between 100 and 300 simultaneous connections, according to pg_stat_activity. Our largest table is about 9.0GB including indexes.

Here's what we've ruled out:

  • pg_top shows that there are always a couple of threads running with high CPU percentage, while the rest of the threads below them look fine. There's nothing extraordinary about the SELECTs and INSERTs there.
  • IOWait hasn't increased at all, and it doesn't seem like we're hitting disk more often
  • As far as we can tell, usage patterns haven't changed; in fact, Monday is slower than weekends
  • We increased the size of our memcached instance, to no avail
  • We turned off synchronous_commit, but that also did not help much
  • We've tried both increasing and decreasing shared buffers (currently set to 2GB); neither 4GB nor 1GB improved things
  • Restarting postgres and all our apache servers hasn't helped
  • Restarting pgbouncer, which our Apache/Django boxes use to talk to Postgres, also proved unfruitful
  • All of our major queries have the correct indexes, and we've verified that postgres is using them. We ran a VACUUM ANALYZE just to be safe, and the query planner looks like it's doing the right thing
  • Queries will be fine for a few seconds, then seemingly lock up; an index scan that usually takes 200 ms or less will then take several whole seconds.

Any ideas on how to proceed, or track down this problem?

Best Answer

Turn on full logging for a short period (~10 minutes) if you can. Save the log, and have pgfouine (http://pgfouine.projects.postgresql.org/) analyse it. It's possible you're doing more queries than before, or you're doing a query which isn't very efficient.

Next it's possible you have some slow queries impacting things badly. Take a look at which queries are running right now:

SELECT pg_stat_activity.procpid AS pid, pg_stat_activity.usename AS username, pg_stat_activity.waiting, now() - pg_stat_activity.query_start AS age, pg_stat_activity.current_query AS query FROM pg_stat_activity WHERE pg_stat_activity.current_query <> '<IDLE>'::text ORDER BY now() - pg_stat_activity.query_start DESC;

Look for the queries at the top of the list. Has one been running for a very long time? If you have a transaction which has been open for 16+ hours, it will definitely slow things down a lot.

Related Topic