I have a data.table with about 3 million rows and 40 columns. I would like to sort this table by descending order within groups like the following sql mock code:
sort by ascending Year, ascending MemberID, descending Month
Is there an equivalent way in data.table to do this? So far I have to break it down into 2 steps:
setkey(X, Year, MemberID)
This is very fast and takes only a few second.
X <- X[,.SD[order(-Month)],by=list(Year, MemberID)]
This step takes so much longer (5 minutes).
Update:
Someone made a comment to do X <- X[sort(Year, MemberID, -Month)]
and later deleted. This approach seems to be much faster:
user system elapsed
5.560 11.242 66.236
My approach: setkey() then order(-Month)
user system elapsed
816.144 9.648 848.798
My question is now: if I want to summarize by Year, MemberId and Month after sort(Year, MemberID, Month), does data.table recognize the sort order?
Update 2: to response to Matthew Dowle:
After setkey with Year, MemberID and Month, I still have multiple records per group. What I would like is to summarize for each of the groups. What I meant was: if I use X[order(Year, MemberID, Month)], does the summation utilizes binary search functionality of data.table:
monthly.X <- X[, lapply(.SD[], sum), by = list(Year, MemberID, Month)]
Update 3: Matthew D proposed several approaches. Run time for the first approach is faster than order() approach:
user system elapsed
7.910 7.750 53.916
Matthew: what surprised me was converting the sign of Month takes most of the time. Without it, setkey is blazing fast.
Best Answer
Update June 5 2014:
The current development version of data.table v1.9.3 has two new functions implemented, namely:
setorder
andsetorderv
, which does exactly what you require. These functions reorder thedata.table
by reference with the option to choose either ascending or descending order on each column to order by. Check out?setorder
for more info.In addition,
DT[order(.)]
is also by default optimised to usedata.table
's internal fast order instead ofbase:::order
. This, unlikesetorder
, will make an entire copy of the data, and is therefore less memory efficient, but will still be orders of magnitude faster than operating using base's order.Benchmarks:
Here's an illustration on the speed differences using
setorder
, data.table's internal fast order and withbase:::order
:On this data, benchmarks indicate that data.table's order is about ~79x faster than
base:::order
andsetorder
is ~135x faster thanbase:::order
here.data.table
always sorts/orders in C-locale. If you should require to order in another locale, only then do you need to resort to usingDT[base:::order(.)]
.All these new optimisations and functions together constitute FR #2405. bit64::integer64 support also has been added.