Since data.table 1.8.3, you can use .I in the j of a data.table to get the row indices by group:
DT[ , list( yidx = list(.I) ) , by = y ]
# y yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9
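The DT used above is not shown, so here is a minimal, self-contained sketch. The table below is an assumption, built so that y takes the values 1, 3, 6 in rows 1 to 9, which reproduces the indices printed above. The second call is an illustrative extra: it uses .I to pick, per group, the row index of the largest v.

```r
library(data.table)

# Hypothetical table (not from the original answer): y cycles through 1, 3, 6
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)

# One list element per group, holding that group's row numbers in DT
idx <- DT[, list(yidx = list(.I)), by = y]
print(idx)

# .I can also be subset inside j: row index of the largest v within each y
top <- DT[, .I[which.max(v)], by = y]
print(top)
```

Because .I holds row numbers of the original table, the result can be fed back into DT[i] to pull out the matching rows.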
Minor update: Please refer to the new HTML vignettes as well. This issue highlights the other vignettes that we plan to write.
I've updated this answer again (Feb 2016) in light of the new on= feature that allows ad-hoc joins as well. See history for earlier (outdated) answers.
What exactly does setkey(DT, a, b) do?
It does two things:
- reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order.
- marks those columns as key columns by setting an attribute called sorted on DT.
The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated).
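Both effects can be checked directly. The sketch below uses a small made-up table (the columns a, b, v and their values are assumptions for illustration) and inspects the result with key() and attr().

```r
library(data.table)

# Hypothetical unsorted table
DT <- data.table(a = c(3L, 1L, 2L, 1L),
                 b = c("z", "y", "x", "w"),
                 v = 1:4)

# Sorts DT by a, then b, in place (no copy is assigned back)
setkey(DT, a, b)

print(DT)                  # rows are now in increasing order of (a, b)
print(key(DT))             # the key columns
print(attr(DT, "sorted"))  # the attribute that records the key
```

Note that setkey() is called for its side effect: there is no DT <- setkey(...) assignment, which is what "by reference" means in practice.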
When is setkey() required?
For grouping operations, setkey() was never an absolute requirement. That is, we can perform a cold-by or adhoc-by.
## "cold" by
require(data.table)
DT <- data.table(x=rep(1:5, each=2), y=1:10)
DT[, mean(y), by=x] # no key is set, order of groups preserved in result
However, prior to v1.9.6, joins of the form x[i] required a key to be set on x. With the new on= argument from v1.9.6+, this is no longer true, so setting keys is not an absolute requirement here either.
## joins using < v1.9.6
setkey(X, a) # absolutely required
setkey(Y, a) # not absolutely required as long as 'a' is the first column
X[Y]
## joins using v1.9.6+
X[Y, on="a"]
# or if the column names are x_a and y_a respectively
X[Y, on=c("x_a" = "y_a")]
Note that the on= argument can also be specified explicitly for keyed joins.
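X and Y are not defined in the snippets above, so the following is a runnable sketch with two small made-up tables (all column names and values are assumptions). It runs the same join three ways, ad hoc with on=, keyed, and keyed with an explicit on=, and shows the named form of on= for differently named columns.

```r
library(data.table)

# Hypothetical tables
X <- data.table(a = c(3L, 1L, 2L), xv = c("c", "a", "b"))
Y <- data.table(a = c(2L, 3L), yv = c(20, 30))

# Ad-hoc join: no key on either table
adhoc <- X[Y, on = "a"]

# Keyed join on copies, so X and Y themselves stay unkeyed
Xk <- copy(X); setkey(Xk, a)
Yk <- copy(Y); setkey(Yk, a)
keyed <- Xk[Yk]

# on= stated explicitly even though both tables are keyed
explicit <- Xk[Yk, on = "a"]

# Join columns with different names: x_a in the outer table, y_a in the inner one
X2 <- data.table(x_a = c(3L, 1L, 2L), xv = c("c", "a", "b"))
Y2 <- data.table(y_a = c(2L, 3L), yv = c(20, 30))
renamed <- X2[Y2, on = c("x_a" = "y_a")]

print(adhoc)
print(renamed)
```

Stating on= on a keyed join costs nothing and makes the join columns visible to whoever reads the code later.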
The only operation that absolutely requires a key to be set is the foverlaps() function. But we are working on some more features which, when done, would remove this requirement.
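As a rough illustration of that requirement, here is a sketch of an overlap join on two made-up interval tables (the column names start, end, id and all values are assumptions). The lookup table y has to be keyed, with the interval start and end as the last two key columns. Calling foverlaps() on an unkeyed y is expected to fail, so that call is wrapped in tryCatch().

```r
library(data.table)

# Hypothetical interval tables
x <- data.table(start = c(5L, 31L), end = c(8L, 50L))
y <- data.table(start = c(1L, 10L, 30L),
                end   = c(9L, 20L, 40L),
                id    = c("A", "B", "C"))

# Without a key on y, foverlaps() refuses to run
unkeyed <- tryCatch(foverlaps(x, y, type = "any"), error = function(e) e)
print(inherits(unkeyed, "error"))

# Key y on its interval columns, then find the y interval overlapping each x interval
setkey(y, start, end)
res <- foverlaps(x, y, type = "any")
print(res)
```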
This leads to the question, what advantage does keying a data.table have anymore?
Is there an advantage to keying a data.table?
Keying a data.table physically reorders it in RAM based on those column(s). Computing the order is usually not the time-consuming part; the reordering itself is. However, once the data is sorted in RAM, the rows belonging to the same group are contiguous in memory, which makes access to them very cache efficient. It's the sortedness that speeds up operations on keyed data.tables.
It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.
In most cases, therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting a key gives a dramatic improvement in performance that you'd like to exploit.
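One way to judge the trade-off on your own data is to time the pieces separately. The sketch below uses synthetic tables (sizes, column names and the seed are arbitrary assumptions) and reports three timings: the ad-hoc join, the one-off cost of setkey(), and the keyed join. The numbers depend on machine and data, so none are quoted here.

```r
library(data.table)
set.seed(1)

# Hypothetical synthetic data: unique join keys in random order
n <- 1e5L
X <- data.table(a = sample(n), xv = runif(n))
Y <- data.table(a = sample(n, 1e4L), yv = runif(1e4L))

# Ad-hoc join, no key
t_adhoc <- system.time(r1 <- X[Y, on = "a"])

# One-off cost of keying a copy, then the keyed join itself
Xk <- copy(X)
t_setkey <- system.time(setkey(Xk, a))
t_keyed  <- system.time(r2 <- Xk[Y])

print(rbind(adhoc = t_adhoc, setkey = t_setkey, keyed = t_keyed)[, "elapsed"])

# Both joins return the same rows in the order of Y
print(identical(r1$xv, r2$xv))
```

The keying cost is paid once, so it only pays off if the keyed table is reused for many joins or aggregations.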
Question: What do you think the performance would be like, compared to a keyed join, if you use setorder() to reorder the data.table and then use on=? If you've followed thus far, you should be able to figure it out :-).
.SD stands for something like "Subset of Data.table". There's no significance to the initial ".", except that it makes it even more unlikely that there will be a clash with a user-defined column name.
If this is your data.table:
Doing this may help you see what .SD is:
Basically, the by=y statement breaks the original data.table into these two sub-data.tables and operates on them in turn.
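To make the split concrete, here is a minimal sketch on a made-up table (the columns x, y, v and their values are assumptions, chosen so that y has exactly two distinct values). split() shows the two sub-data.tables explicitly, and the grouped call reports what .SD contains for each group.

```r
library(data.table)

# Hypothetical table with two groups in y
DT <- data.table(x = rep(c("a", "b", "c"), each = 2),
                 y = rep(c(1, 3), 3),
                 v = 1:6)

# The two sub-data.tables that by = y effectively creates
subs <- split(DT, by = "y")
print(subs)

# Inside j, .SD is the current group's rows without the grouping column
shape <- DT[, .(n_rows = nrow(.SD),
                cols   = paste(names(.SD), collapse = ",")),
            by = y]
print(shape)
```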
While it is operating on either one, it lets you refer to the current sub-data.table by using the nick-name/handle/symbol .SD. That's very handy, as you can access and operate on the columns just as if you were sitting at the command line working with a single data.table called .SD ... except that here, data.table will carry out those operations on every single sub-data.table defined by combinations of the key, "pasting" them back together and returning the results in a single data.table!
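A short sketch of that "operate on .SD, then paste back together" behaviour, again on a made-up table (x, y, v and their values are assumptions). The first call treats .SD as an ordinary data.table inside j. The second uses the lapply(.SD, ...) idiom with .SDcols to restrict which columns .SD carries.

```r
library(data.table)

# Hypothetical table with two groups in y
DT <- data.table(x = rep(c("a", "b", "c"), each = 2),
                 y = rep(c(1, 3), 3),
                 v = 1:6)

# Query .SD like any data.table: build one string per group from x and v
pasted <- DT[, .SD[, paste(x, v, sep = "", collapse = "_")], by = y]
print(pasted)

# Apply a function to selected columns of each sub-data.table
sums <- DT[, lapply(.SD, sum), by = y, .SDcols = "v"]
print(sums)
```

Each group's result is one row here, and data.table stacks those rows into the single returned table.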