I have a large data.table with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.
I see two options:
1: Convert to a data.frame, and use something like the sketch below
2: Some kind of cool data.table subsetting command
I'll be happy with a fairly efficient solution of type 1. Converting to a data.frame and then back to a data.table won't take too long.
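For example, a sketch of what I mean by option 1 (with the table called `dt1`):

```r
library(data.table)

# Option 1: round-trip through a data.frame and replace via
# whole-frame logical indexing.
df <- as.data.frame(dt1)
df[is.na(df)] <- 0
dt1 <- as.data.table(df)
```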
Best Answer
Here's a solution using data.table's `:=` operator, building on Andrie and Ramnath's answers.
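The original code isn't reproduced here, so this is a sketch consistent with the description: a small `na.replace` helper applied to each column with `:=` (the `eval(parse())` construction is an assumption; the EDIT below shows a cleaner way):

```r
library(data.table)

# Sketch of f_dowle: recode NA to 0 one column at a time, by reference.
# na.replace copies the column vector it works on, which is the cost the
# profiling discussion below points at.
f_dowle <- function(dt) {
  na.replace <- function(v, value = 0) { v[is.na(v)] <- value; v }
  for (i in names(dt))   # assumes syntactically valid column names
    eval(parse(text = paste0("dt[, ", i, " := na.replace(", i, ")]")))
}
```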
Note that f_dowle updates dt1 by reference. If a local copy is required, an explicit call to the `copy` function is needed to make a local copy of the whole dataset. data.table's `setkey`, `key<-` and `:=` do not copy-on-write.

Next, let's see where f_dowle is spending its time.
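One way to take that measurement is base R's `Rprof` (a sketch; the output and timings will vary by machine and data):

```r
# Profile a run of f_dowle; summaryRprof() reports which calls
# (e.g. na.replace, is.na) dominate the elapsed time.
Rprof(tmp <- tempfile())
f_dowle(dt1)
Rprof(NULL)
summaryRprof(tmp)
```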
There, I would focus on `na.replace` and `is.na`, where there are a few vector copies and vector scans. Those can fairly easily be eliminated by writing a small `na.replace` C function that updates `NA` by reference in the vector. That would at least halve the 20 seconds, I think. Does such a function exist in any R package?

The reason `f_andrie` fails may be because it copies the whole of `dt1`, or creates a logical matrix as big as the whole of `dt1`, a few times. The other two methods work on one column at a time (although I only briefly looked at `NAToUnknown`).

EDIT (more elegant solution, as requested by Ramnath in the comments):
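A sketch of that cleaner form (call it `f_dowle2`; `get(i)` reads the column by name, so no code strings need to be built):

```r
# For each column: find its NA rows with get(i), then assign 0 by
# reference with (i) := 0. No eval(parse()) required.
f_dowle2 <- function(DT) {
  for (i in names(DT))
    DT[is.na(get(i)), (i) := 0]
}
```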
I wish I'd done it that way to start with!
EDIT2 (over 1 year later, now)

There is also `set()`. This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling `[,:=,]` in a loop. `set` is a loopable `:=`. See `?set`.
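A sketch of the `set()` version (here called `f_dowle3`; looping by column number with `seq_along(DT)` instead of by name works the same way):

```r
# set() performs the := assignment without [.data.table's call overhead,
# so looping it over hundreds of columns stays cheap. which() restricts
# the assignment to only the NA rows of each column.
f_dowle3 <- function(DT) {
  for (j in names(DT))
    set(DT, which(is.na(DT[[j]])), j, 0)
}
```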