PostgreSQL Performance – Why SELECT is Extremely Slow but EXPLAIN ANALYZE is Not

performance, postgresql

Some time ago my team noticed extreme slowness in SELECT queries on our production server. For a table with over 1 million rows, we're talking about a couple of minutes. These are simple queries, e.g.

SELECT id, name, serno FROM public.product_facts;

This, by contrast, is extremely fast:

EXPLAIN ANALYZE SELECT id, name, serno FROM public.product_facts;

The latter returns stats like the following:

Seq Scan on product_facts (cost=0.00..58167.86 rows=1339186 width=24) (actual time=0.012..435.753 rows=1399186 loops=1)
Planning time: 0.076 ms
Execution time: 708.078 ms

Likewise, this is very fast:

SELECT COUNT(id) FROM public.product_facts;

Server memory and CPU do not appear taxed. If I run the bare-bones SELECT in one pgAdmin tab and leave it to do its thing, I can run a separate operation in another tab or elsewhere, and it runs fine. Furthermore, examining processes and system consumption via top, I see no sign of excessive CPU or memory use.

Network lag also appears to be an unlikely culprit; network response times are within normal limits. This odd slowness started a few weeks ago on our production server, then appeared on our UAT instance, and now on our non-production instance as well. The slowness seems proportional to record count but applies across all tables.

The EXPLAIN ANALYZE case is to me the most perplexing part of this whole situation. From my reading of the Postgres docs, I gather that the ANALYZE keyword causes the server to actually execute the command in question, then return the analysis of that execution. So how could a SELECT query take half a second in that case but over one minute in another case, with no indication of memory or CPU overload?

Best Answer

EXPLAIN ANALYZE SELECT ... executes the query, but sends the client only the text of the execution plan, which amounts to a few hundred bytes to ship across the network and process on the client side.

The actual SELECT without the EXPLAIN ANALYZE sends all the results, which are over a million rows here. And when the client receives these results, it has to do something with them.

So it is to be expected that the plain SELECT is slower overall, even if several minutes versus 708 ms is a very significant difference.
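To put rough numbers on that, here is a back-of-envelope sketch in Python. The row count and average row width are taken from the EXPLAIN ANALYZE output above; the "few hundred bytes" figure for the plan text is an assumption, and protocol overhead is ignored.

```python
# Rough estimate of the raw data volume the plain SELECT must ship to the client.
# Numbers taken from the EXPLAIN ANALYZE output above.
rows = 1_399_186   # actual rows reported by the plan
width = 24         # average row width in bytes reported by the plan

payload_bytes = rows * width
print(f"plain SELECT payload: ~{payload_bytes / 1_000_000:.0f} MB")  # roughly 34 MB of row data
print("EXPLAIN ANALYZE payload: a few hundred bytes of plan text")
```

Tens of megabytes versus a few hundred bytes is a difference of five orders of magnitude in what crosses the wire.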

There are two factors to consider: how big the result is relative to how fast the network is, and how costly the per-row client-side processing is. For instance, some graphical clients take a lot of time just laying results out into columns. The psql command-line client, by default, aligns the output inside columns with blank padding, which for some contents can consume a lot of time.

If you're using psql, see how much time it takes with this sequence:

\timing on
\pset format unaligned
SELECT id, name, serno FROM public.product_facts \g /dev/null

For the size of the result set that is sent across the network, this query should give a good ballpark estimate:

SELECT sum(octet_length(concat(id, name, serno))) FROM public.product_facts;
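Once you have that byte count, translating it into wall-clock time on different link speeds shows how the gap can arise. A minimal sketch, assuming the payload is rows × average width from the EXPLAIN ANALYZE output above and using arbitrary example link speeds:

```python
def transfer_seconds(nbytes: int, mbit_per_s: float) -> float:
    """Seconds to push nbytes over a link of the given speed, ignoring protocol overhead."""
    return nbytes * 8 / (mbit_per_s * 1_000_000)

payload = 1_399_186 * 24  # bytes: rows x average width from the plan above

for link in (10, 100, 1000):  # example link speeds in Mbit/s
    print(f"{link:>4} Mbit/s: {transfer_seconds(payload, link):5.1f} s")
```

Raw transfer alone can account for tens of seconds on a slow link; add per-row client-side formatting on top of that and a runtime of minutes becomes plausible.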

So how could a SELECT query take half a second in that case but over one minute in another case, with no indication of memory or CPU overload?

I think you're looking at the server, when what is consuming the time happens on the network and/or on the client machine. The server is probably just pushing results to the network socket as fast as the network accepts them, and that generally does not keep a CPU busy.
