PostgreSQL – Why is Postgres sitting 95% idle, with no file I/O?

bottleneck, diagnostic, performance, postgis, postgresql

I have a TileMill/PostGIS stack running on an 8 core Ubuntu 12.04 VM on an OpenStack cloud. It's a rebuild of a very similar system which was running nicely on very similar hardware (same cloud, but different physical hardware, I believe) last week. I've attempted to rebuild the stack exactly the same as it was (using some scripts I had built).

Everything runs, but the database is carrying out queries excruciatingly slowly, which ultimately manifests as very slow tile generation. An example query (count the number of pubs within a radius of every town in Australia), which previously took something like 10-20 seconds, is now taking over 10 minutes:

explain (analyze, buffers) update places set pubs = 
(select count(*) from planet_osm_point p where p.amenity = 'pub' and st_dwithin(p.way,places.way,scope)) +
(select count(*) from planet_osm_polygon p where p.amenity = 'pub' and st_dwithin(p.way,places.way,scope)) ;
 Update on places  (cost=0.00..948254806.93 rows=9037 width=160) (actual time=623321.558..623321.558 rows=0 loops=1)
   Buffers: shared hit=132126300
   ->  Seq Scan on places  (cost=0.00..948254806.93 rows=9037 width=160) (actual time=68.130..622931.130 rows=9037 loops=1)
         Buffers: shared hit=132107781
         SubPlan 1
           ->  Aggregate  (cost=12.95..12.96 rows=1 width=0) (actual time=0.187..0.188 rows=1 loops=9037)
                 Buffers: shared hit=158171
                 ->  Index Scan using planet_osm_point_index on planet_osm_point p  (cost=0.00..12.94 rows=1 width=0) (actual time=0.163..0.179 rows=0 loops=9037)
                       Index Cond: (way && st_expand(places.way, (places.scope)::double precision))
                       Filter: ((amenity = 'pub'::text) AND (places.way && st_expand(way, (places.scope)::double precision)) AND _st_dwithin(way, places.way, (places.scope)::double precision))
                       Buffers: shared hit=158171
         SubPlan 2
           ->  Aggregate  (cost=104917.24..104917.25 rows=1 width=0) (actual time=68.727..68.728 rows=1 loops=9037)
                 Buffers: shared hit=131949237
                 ->  Seq Scan on planet_osm_polygon p  (cost=0.00..104917.24 rows=1 width=0) (actual time=68.138..68.716 rows=0 loops=9037)
                       Filter: ((amenity = 'pub'::text) AND (way && st_expand(places.way, (places.scope)::double precision)) AND (places.way && st_expand(way, (places.scope)::double precision)) AND _st_dwithin(way, places.way, (places.scope)::double precision))
                       Buffers: shared hit=131949237
 Total runtime: 623321.801 ms

(I'm including this query as a symptom, not directly the problem to be solved. This particular query only gets run once a week or so.)

The server has 32 GB of RAM, and I've configured Postgres as follows (following advice found on the web):

shared_buffers = 8GB
autovacuum = on
effective_cache_size = 8GB
work_mem = 128MB
maintenance_work_mem = 64MB
wal_buffers = 1MB
checkpoint_segments = 10
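
To rule out a config that simply never got loaded, the running values can be checked from psql (a quick sketch; pg_settings is a standard view):

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem',
               'maintenance_work_mem', 'wal_buffers', 'checkpoint_segments');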

iostat shows nothing being read, a bit of data being written (no idea where or why), and 95% idle CPU:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.40    0.00    0.00    0.11    0.00   94.49

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               0.20         0.00         0.80          0          8
vdb               2.30         0.00        17.58          0        176

Sample output from vmstat:

  procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
...
 1  0      0 18329748 126108 12600436    0    0     0    18  148  140  5  0 95  0
 2  0      0 18329400 126124 12600436    0    0     0     9  173  228  5  0 95  0

Clutching at straws, I moved the Postgres data directory from vda to vdb but of course that made no difference.

So I'm at a loss. Why is Postgres only using 5% of available CPU when it's not waiting for any I/O? I'd welcome any suggestions for further investigation, other tools, random things to try.
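
For what it's worth, these are the checks I know to run from inside Postgres itself (a sketch; column names vary slightly between versions):

SELECT * FROM pg_stat_activity;    -- is the backend actually executing, or waiting on a lock?

SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE relname LIKE 'planet_osm%';  -- are the big tables being hit by index or by sequential scan?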

Update

I snapshotted the server and launched it on a different part of the same cloud (a different availability zone). The results were a bit weird. vmstat on this server reports 12% CPU usage (which I now understand as the expected value for a single Postgres query on an 8 core VM) – although the actual query execution time is virtually identical (630 seconds vs 623).

I now realise that this particular query is probably not a good sample for this reason: it can only use one core, and it's an update (whereas tile rendering is just selects).

I also hadn't noticed in the explain output that planet_osm_polygon apparently isn't using an index. That could well be the cause, so I'll chase that next.
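
A first step in chasing it (a sketch): list what indexes actually exist on that table.

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'planet_osm_polygon';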

Update 2

The problem definitely seems to be that the indexes on planet_osm_polygon aren't being used. There are two (one created by osm2pgsql, one created by me following some random guide):

CREATE INDEX idx_planet_osm_polygon_tags
  ON planet_osm_polygon
  USING gist
  (tags);


CREATE INDEX planet_osm_polygon_pkey
  ON planet_osm_polygon
  USING btree
  (osm_id);

The stats on planet_osm_polygon and planet_osm_point are pretty revealing, I think:

planet_osm_polygon:

Sequential Scans    194204  
Sequential Tuples Read  60981018608 
Index Scans 1574    
Index Tuples Fetched    0

planet_osm_point:

Sequential Scans    1142    
Sequential Tuples Read  12960604    
Index Scans 183454  
Index Tuples Fetched    43427685

If I read that right, Postgres has searched the planet_osm_polygon indexes 1,574 times but never actually found anything, so it has done a ridiculously large number of brute-force sequential scans instead.

The new question: why?

Mystery solved

Thanks to Frederik Ramm's answer, the cause turns out to be fairly simple: for some reason there were no spatial indexes. They were trivial to regenerate:

create index planet_osm_polygon_polygon on planet_osm_polygon using gist(way);
create index planet_osm_polygon_point on planet_osm_point using gist(way);

Running that query now takes 4.6 seconds. Spatial indexes matter! 🙂
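
One follow-up that seems worth doing after index builds this large (not strictly required for the planner to notice the new indexes, but cheap): refresh the table statistics.

analyze planet_osm_point;
analyze planet_osm_polygon;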

Best Answer

Running your EXPLAIN ANALYZE output through explain.depesz.com highlights that the bulk of the slowness comes from this operation:

Seq Scan on planet_osm_polygon p 

Was that indexed before? Can you index it now?
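
That node accounts for roughly 68.7 ms per loop × 9037 loops ≈ 621 of the 623 seconds, i.e. essentially the whole runtime. If it isn't indexed, something along these lines should help (a sketch; the index name is arbitrary, and CONCURRENTLY simply avoids locking the table against writes while the index builds):

create index concurrently planet_osm_polygon_way_idx on planet_osm_polygon using gist(way);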

Searching on that problem area, I also found a related Q&A on an OpenStreetMap site: