I am not very clear about use of .SD
and by
.
For instance, does the below snippet mean: 'change all the columns in DT
to factor except A
and B
?' It also says in data.table
manual: ".SD
refers to the Subset of the data.table
for each group (excluding the grouping columns)" – so columns A
and B
are excluded?
DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]
However, I also read that by
means like 'group by' in SQL when you do aggregation. For instance, if I would like to sum (like colsum
in SQL) over all the columns except A
and B
do I still use something similar? Or in this case, does the below code mean to take the sum and group by values in columns A
and B
? (take sum and group by A,B
as in SQL)
DT[,lapply(.SD,sum),by=.(A,B)]
Then how do I do a simple colsum
over all the columns except A
and B
?
Best Answer
Just to illustrate the comments above with an example, let's take
Compare the following:
A few notes:
DT
..."The answer is no, and this is very important for
data.table
. The object returned is a newdata.table
, and all of the columns inDT
are exactly as they were before running the code.Referring to the point above again, note that your code (
DT[ , lapply(.SD, as.factor)]
) returns a newdata.table
and does not changeDT
at all. One (incorrect) way to do this, which is done withdata.frame
s inbase
, is to overwrite the olddata.table
with the newdata.table
you've returned, i.e.,DT = DT[ , lapply(.SD, as.factor)]
.This is wasteful because it involves creating copies of
DT
which can be an efficiency killer whenDT
is large. The correctdata.table
approach to this problem is to update the columns by reference using`:=`
, e.g.,DT[ , names(DT) := lapply(.SD, as.factor)]
, which creates no copies of your data. Seedata.table
's reference semantics vignette for more on this.lapply(.SD, sum)
to that ofcolSums
.sum
is internally optimized indata.table
(you can note this is true from the output of adding theverbose = TRUE
argument within[]
); to see this in action, let's beef up yourDT
a bit and run a benchmark:Results: