I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) built by merging several smaller data.frames. I'm new to data.table. The code so far is as follows:
library(data.table)
a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
dt = merge(a,b,by=intersect(names(a),names(b)),all=T)
dt$category = sample(letters[1:3],10,replace=T)
and I wondered if there was a more efficient way than the following to summarize the data.
summ = dt[i=T,j=list(a=sum(a,na.rm=T),b=sum(b,na.rm=T),c=sum(c,na.rm=T),
d=sum(d,na.rm=T),z=sum(z,na.rm=T)),by=category]
I don't really want to type out all 50 column calculations by hand, and an eval(paste(...)) approach seems clunky somehow.
I had a look at the example below, but it seems a bit complicated for my needs. Thanks.
Best Answer
You can use a simple lapply statement with .SD.
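A minimal sketch using the example data from the question (the data are random, so exact numbers will differ per run; the .SDcols and anonymous-function lines anticipate the points below):

```r
library(data.table)

# Rebuild the example data from the question
a  <- data.table(index = 1:5,  a = rnorm(5, 10), b = rnorm(5, 10), z = rnorm(5, 10))
b  <- data.table(index = 6:10, a = rnorm(5, 10), b = rnorm(5, 10),
                 c = rnorm(5, 10), d = rnorm(5, 10))
dt <- merge(a, b, by = intersect(names(a), names(b)), all = TRUE)
dt$category <- sample(letters[1:3], 10, replace = TRUE)

# .SD is the per-group subset of all non-by columns, so one lapply()
# call sums every column (here including index) without naming them
summ <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category]

# Restrict the summary to specific columns with .SDcols
summ2 <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category,
            .SDcols = c("a", "b", "c", "d", "z")]

# Any function works, including anonymous ones
rng <- dt[, lapply(.SD, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE)),
          by = category, .SDcols = c("a", "b")]
```

Note that without .SDcols the index column is summed as well, which is rarely what you want; summ2 reproduces the hand-written version from the question in one line.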
If you only want to summarize over certain columns, you can add the .SDcols argument.

This, of course, is not limited to sum, and you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).

Lastly, there is no need to use i=T and j= <..>. Personally, I think that makes the code less readable, but it is just a style preference.

Documentation
See ?.SD, ?data.table and its .SDcols argument, and the vignette "Using .SD for Data Analysis". Also have a look at data.table FAQ 2.1.