Amazon-web-services – OFFSET on AWS Athena

amazon-athenaamazon-web-servicespresto

I would like to run a query on AWS Athena with both a LIMIT and an OFFSET clause. I take it the former is supported while the latter is not. Is there any way of emulating this functionality using other methods?

Best Answer

Using OFFSET for pagination is very inefficient, especially for an analytic database like Presto that often has to perform a full table or partition scan. Additionally, the results will not necessarily be consistent between queries, so you can have duplicate or missing results when navigating between pages.

In an OLTP database like MySQL or PostgreSQL, it's better to use a range query over an index, where you keep track of the last value seen on the previous page.

In an OLAP database like Presto, it's better to cache the result set and perform pagination using the cached data. You don't want to run an expensive query over billions or trillions of rows each time the user clicks to go to a different page.

See these articles for a longer explanation of the problem and the index approach:

Related Solutions

Sql – Nested Query Alternatives in AWS Athena

There may be a better way to do this - I would be curious to see it too! One way I can think of would be to use an outer join. (I'm not exactly sure about how your data is structured, so forgive the contrived example, but I hope it would translate ok.) How about this?

with 
  a as (select * 
       from (values 
       (1,'cookie_n',10,'cookie_2'), 
       (2,'cookie_n',11,'cookie_1'),
       (3,'cookie_m',12,'cookie_1'),
       (4,'cookie_m',12,'cookie_1'),
       (5,'cookie_q',13,'cookie_1'),
       (6,'cookie_n',13,'cookie_1'),
       (7,'cookie_m',14,'cookie_3')
       ) as db_ids(first_party_id, first_party_type, third_party_id, third_party_type)
      ),
  b as (select first_party_type 
        from a where third_party_type = 'cookie_2'),
  c as (select a.third_party_id, b.first_party_type as exclude_first_party_type 
        from a left join b on a.first_party_type = b.first_party_type 
        where a.third_party_type = 'cookie_1')
select count(distinct third_party_id) from c 
       where exclude_first_party_type is null;

Hope this helps!

Sql – AWS Athena (Presto) OFFSET support

Athena is basically managed Presto. Since Presto 311 you can use OFFSET m LIMIT n syntax or ANSI SQL equivalent: OFFSET m ROWS FETCH NEXT n ROWS ONLY.

You can read more in Beyond LIMIT, Presto meets OFFSET and TIES.

For older versions (and this includes AWS Athena as of this writing), you can use row_number() window function to implement OFFSET + LIMIT.

For example, instead of

SELECT * FROM elb_logs
OFFSET 5 LIMIT 5 -- this doesn't work, obviously

You can execute

SELECT * FROM (
    SELECT row_number() over() AS rn, * FROM elb_logs)
WHERE rn BETWEEN 5 AND 10;

Note: the execution engine will still need to read offset+limit rows from the underlying table, but this is still much better than sending all these rows back to the client and taking a sublist there.

Warning: see https://stackoverflow.com/a/45114359/65458 for explanation why avoiding OFFSET in queries is generally a good idea.

Best Answer

Related Solutions

Sql – Nested Query Alternatives in AWS Athena

Sql – AWS Athena (Presto) OFFSET support

Related Topic