Cassandra cql select sorting

cassandracqlcqlsh

If I define a table like this using cql:

CREATE TABLE scores (
 name text,
 age int,
 score int,
 date timestamp,
 PRIMARY KEY (name, age, score)
);

And do a SELECT in cqlsh like this:

select * from mykeyspace.scores;

The result displayed seems to always be sorted by 'age', then 'score' automatically in ascending order regardless of input-data ordering (as expected, return rows are not sorted by the partition key 'name'). I have the following questions:

Does SELECT automatically sort return rows by the clustering keys?
If yes, what's the purpose of using the ORDER BY clause when using SELECT?
If no, how do I get the return rows to sort by the clustering keys since cql doesn't allow ORDER BY on a select *?

Best Answer

Your clustering columns define the order (in your case age then score)

http://cassandra.apache.org/doc/cql3/CQL.html#createTableStmt

On a given physical node, rows for a given partition key are stored in the order induced by the clustering columns, making the retrieval of rows in that clustering order particularly efficient (see SELECT).

http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt

The ORDER BY option allows to select the order of the returned results. It takes as argument a list of column names along with the order for the column (ASC for ascendant and DESC for descendant, omitting the order being equivalent to ASC). Currently the possible orderings are limited (which depends on the table CLUSTERING ORDER):

if the table has been defined without any specific CLUSTERING ORDER, then the allowed orderings are the order induced by the clustering columns and the reverse of that one.

otherwise, the orderings allowed are the order of the CLUSTERING ORDER option and the reversed one.

Related Solutions

Clustering Keys in Cassandra

Suppose your clustering keys are

k1 t1, k2 t2, ..., kn tn

where ki is the ith key name and ti is the ith key type. Then the order data is stored in is lexicographic ordering where each dimension is compared using the comparator for that type.

So (a1, a2, ..., an) < (b1, b2, ..., bn) if a1 < b1 using t1 comparator, or a1=b1 and a2 < b2 using t2 comparator, or (a1=b1 and a2=b2) and a3 < b3 using t3 comparator, etc..

This means that it is efficient to find all rows with a certain k1=a, since the data is stored together. But it is inefficient to find all rows with ki=x for i > 1. In fact, such a query isn't allowed - the only clustering key constraints that are allowed specify zero or more clustering keys, starting from the first with none missing.

For example, consider the schema

create table clustering (
    x text,
    k1 text,
    k2 int,
    k3 timestamp,
    y text,
    primary key (x, k1, k2, k3)
);

If you did the following inserts:

insert into clustering (x, k1, k2, k3, y) values ('x', 'a', 1, '2013-09-10 14:00+0000', '1');
insert into clustering (x, k1, k2, k3, y) values ('x', 'b', 1, '2013-09-10 13:00+0000', '1');
insert into clustering (x, k1, k2, k3, y) values ('x', 'a', 2, '2013-09-10 13:00+0000', '1');
insert into clustering (x, k1, k2, k3, y) values ('x', 'b', 1, '2013-09-10 14:00+0000', '1');

then they are stored in this order on disk (the order select * from clustering where x = 'x' returns):

 x | k1 | k2 | k3                       | y
---+----+----+--------------------------+---
 x |  a |  1 | 2013-09-10 14:00:00+0000 | 1
 x |  a |  2 | 2013-09-10 13:00:00+0000 | 1
 x |  b |  1 | 2013-09-10 13:00:00+0000 | 1
 x |  b |  1 | 2013-09-10 14:00:00+0000 | 1

k1 ordering dominates, then k2, then k3.

Clustering Order using timeuuid CQL

CREATE TABLE example ( a int, b int, c int, d int, PRIMARY KEY (a,b,c)) WITH CLUSTERING ORDER BY (b DESC , c ASC ) ;

Is the correct syntax for ordering with multiple columns.

For your particular application you actually are trying to get results from distinctly different types of queries. In Cassandra it is best to shape each table to be the response to a particular query.

For example (not knowing that much about your application)

select * from demo where oid = 100 and cid = 100 and ts > minTimeuuid('something');
select * from demo where oid = 100 and cid = 10;

Might be better served by a table structure like

create table demo_oct(oid int,cid int,ts timeuuid, body, other ...., PRIMARY KEY ((oid,cid),ts)) WITH CLUSTERING ORDER BY (ts DESC);

This way each set of time-series for a pair of oid and cid data will reside in it's own partiton and be easy to retrieve. This is because i'm using a Parition key made up of both oid and cid. This is why there is an extra set of parenthesizes in the key. The clustering key ts ensures that the data is in the order that you want.

But as you noticed, you cannot perform select * from table oid == 10 on this table because that would require scanning over the entire database (because of the partition structure)

For the queries like

select * from demo where oid = 100;

you need a second table(again not knowing your particular application)

create table demo_ot(oid int,cid int,ts timeuuid, body, other ...., PRIMARY KEY (oid,ts)) WITH CLUSTERING ORDER BY (ts DESC);

This table would keep time series for each OID in a single partiton allowing for extremely fast slices. Here the partition key is just OID while ts is still the clustering key.

On the application side you would be inserting to both of these tables simultaneously.

More info on Datamodeling

Best Answer

Related Solutions

Clustering Keys in Cassandra

Clustering Order using timeuuid CQL

Related Topic