I am trying to figure out an elegant way to use :=
assignment to replace many columns at once in a data.table
by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub
) to all character columns in a table. It is not difficult to extend the data.frame
way of doing this to a data.table
, but I'm looking for a method consistent with the data.table
way of doing things.
For example:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
I understand it is more efficient to loop over a vector of column names using :=
to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
I don't like this because I don't like reference the data.table
in a j
expression. I also know that I can use :=
to assign with lapply
given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(You could extend this by building an expression with unknown column names.)
Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
Best Answer
Yes, you're right in question here :
Aside: note that the new way of doing that is :
because the
with = FALSE
wasn't easy to read whether it referred to the LHS or the RHS of:=
. End aside.As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.
The problem with
lapply
on the RHS of:=
is that the RHS (thelapply
) is evaluated first; i.e., the result for the 80 columns is created. That's 80 column's worth of new memory which has to be allocated and populated. So you need 80 column's worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.As @Frank pointed to, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the
[.data.table
method starts to add up). To eliminate that overhead that there isdata.table::set
which under?set
is described as a "loopable":=
. I use afor
loop for this type of operation. It's the fastest way and is fairly easy to write and read.Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop
set
over a large number of rows than a large number of columns.) However, loopedset
doesn't solve the problem of the repeated reference to thedt
symbol name that you mentioned in the question :Agreed. So the best I can do is revert to your looping of
:=
but useget
instead.However, I fear that using
get
inj
carry an overhead. Benchmarking made in #1380. Also, perhaps it is confusing to useget()
on the RHS but not on the LHS. To address that we could sugar the LHS and allowget()
as well, #1381 :Also, maybe
value
ofset
could be run within scope ofDT
, #1382.