Edit 2019: This question was asked prior to changes in data.table
in November 2016, see the accepted answer below for both the current and previous methods.
I have a data.table
table with about 2.5 million rows. There are two columns. I want to remove any rows that are duplicated in both columns. Previously for a data.frame I would have done this:
df -> unique(df[,c('V1', 'V2')])
but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE])
but it seems to still only operate on the key of the data.table and not the whole row.
Any suggestions?
Cheers,
Davy
Example
>dt
V1 V2
[1,] A B
[2,] A C
[3,] A D
[4,] A B
[5,] B A
[6,] C D
[7,] C D
[8,] E F
[9,] G G
[10,] A B
in the above data.table where V2
is the table key, only rows 4,7, and 10 would be removed.
> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C",
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F",
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")
Best Answer
For v1.9.8+ (released November 2016)
From
?unique.data.table
By default all columns are being used (which is consistent with?unique.data.frame
)Or using the
by
argument in order to get unique combinations of specific columns (like previously keys were used for)Prior v1.9.8
From
?unique.data.table
, it is clear that callingunique
on a data table only works on the key. This means you have to reset the key to all columns before callingunique
.Calling
unique
with one column as key: