R – determining row indices of data.table group members

data.tabler

The data.table package in R provides the option:

which: ‘TRUE’ returns the integer row numbers of ‘x’ that ‘i’
matches to.

However, I see no way of obtaining, within j, the integer row numbers of 'x' within the groups established using by.

For example, given…

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))

…I would like to know the indices into DT for each value of y.

The value to me is that I am using a data.table in parallel with Another Data Structure (ADS) to which I intend to perform groupwise computations based on the efficiently computed groupings of the data.table.

For example, assuming ADS is a vector with a value for each row in DT:

ADS<-sample(100,nrow(DT))

I can, as a workaround, compute the groupwise mean of ADS determined by DT$y the group if I first add a new sequence column to the data.table.

DT[,seqNum:=seq_len(nrow(DT))]
DT[,mean(ADS[seqNum]),by=y]

Which gives the result I want at the cost of adding a new column.

I realize that in this example I can get the same answer using tapply:

tapply(ADS,DT$y,mean)

However, I will not then get the performance benefit of data.tables efficient grouping (especially when the 'by' columns are indexed).

Perhaps there is some syntax I am overlooking???

Perhaps this is an easy feature to add to data.table and I should request it (wink, wink)???

Proposed syntax: optionally set '.which' to the group indices, allowing to write:

DT[,mean(ADS[.which]),by=y,which=TRUE]

Best Answer

Available since data.table 1.8.3 you can use .I in the j of a data.table to get the row indices by groups...

DT[ , list( yidx = list(.I) ) , by = y ]
#   y  yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9
Related Topic