I want a random selection of rows in PostgreSQL, I tried this:
select * from table where random() < 0.01;
But some other recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 Million rows, I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Best Answer
Given your specifications (plus additional info in the comments),
The query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
The only possibly expensive part is the
count(*)
(for huge tables). Given above specifications, you don't need it. An estimate will do just fine, available at almost no cost (detailed explanation here):As long as
ct
isn't much smaller thanid_span
, the query will outperform other approaches.Generate random numbers in the
id
space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.Each
id
can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or useDISTINCT
).Join the
id
s to the big table. This should be very fast with the index in place.Finally trim surplus
id
s that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.Short version
You can simplify this query. The CTE in the query above is just for educational purposes:
Refine with rCTE
Especially if you are not so sure about gaps and estimates.
We can work with a smaller surplus in the base query. If there are too many gaps so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer which defies the purpose of optimizing performance.
Duplicates are eliminated by the
UNION
in the rCTE.The outer
LIMIT
makes the CTE stop as soon as we have enough rows.This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
Wrap into function
For repeated use with varying parameters:
Call:
You could even make this generic to work for any table: Take the name of the PK column and the table as polymorphic type and use
EXECUTE
... But that's beyond the scope of this question. See:Possible alternative
IF your requirements allow identical sets for repeated calls (and we are talking about repeated calls) I would consider a materialized view. Execute above query once and write the result to a table. Users get a quasi random selection at lightening speed. Refresh your random pick at intervals or events of your choosing.
Postgres 9.5 introduces
TABLESAMPLE SYSTEM (n)
Where
n
is a percentage. The manual:Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:
The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:
Related:
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
See Evan's answer for details.
But that's still not exactly random.