There may be a better way to do this - I would be curious to see it too! One way I can think of would be to use an outer join. (I'm not exactly sure about how your data is structured, so forgive the contrived example, but I hope it would translate ok.) How about this?
with
a as (select *
from (values
(1,'cookie_n',10,'cookie_2'),
(2,'cookie_n',11,'cookie_1'),
(3,'cookie_m',12,'cookie_1'),
(4,'cookie_m',12,'cookie_1'),
(5,'cookie_q',13,'cookie_1'),
(6,'cookie_n',13,'cookie_1'),
(7,'cookie_m',14,'cookie_3')
) as db_ids(first_party_id, first_party_type, third_party_id, third_party_type)
),
b as (select first_party_type
from a where third_party_type = 'cookie_2'),
c as (select a.third_party_id, b.first_party_type as exclude_first_party_type
from a left join b on a.first_party_type = b.first_party_type
where a.third_party_type = 'cookie_1')
select count(distinct third_party_id) from c
where exclude_first_party_type is null;
Hope this helps!
Athena is basically managed Presto. Since Presto 311 you can use OFFSET m LIMIT n
syntax or ANSI SQL equivalent: OFFSET m ROWS FETCH NEXT n ROWS ONLY
.
You can read more in Beyond LIMIT, Presto meets OFFSET and TIES.
For older versions (and this includes AWS Athena as of this writing), you can use row_number()
window function to implement OFFSET + LIMIT.
For example, instead of
SELECT * FROM elb_logs
OFFSET 5 LIMIT 5 -- this doesn't work, obviously
You can execute
SELECT * FROM (
SELECT row_number() over() AS rn, * FROM elb_logs)
WHERE rn BETWEEN 5 AND 10;
Note: the execution engine will still need to read offset+limit rows from the underlying table, but this is still much better than sending all these rows back to the client and taking a sublist there.
Warning: see https://stackoverflow.com/a/45114359/65458 for explanation why avoiding OFFSET in queries is generally a good idea.
Best Answer
Using
OFFSET
for pagination is very inefficient, especially for an analytic database like Presto that often has to perform a full table or partition scan. Additionally, the results will not necessarily be consistent between queries, so you can have duplicate or missing results when navigating between pages.In an
OLTP
database like MySQL or PostgreSQL, it's better to use a range query over an index, where you keep track of the last value seen on the previous page.In an
OLAP
database like Presto, it's better to cache the result set and perform pagination using the cached data. You don't want to run an expensive query over billions or trillions of rows each time the user clicks to go to a different page.See these articles for a longer explanation of the problem and the index approach: