R – Joining data.table with by argument

data.tabler

I have two data.table dx and dy

dx <- data.table(a = c(1,1,1,1,2,2), b = 3:8)
dy <- data.table(a = c(1,1,2), c = 7:9)

I want to join dy to each row of dx, and below is the desired output

data.table(plyr::ddply(dx, c("a", "b"), function(d) merge(d, dy, by = "a")))
    a b c
 1: 1 3 7
 2: 1 3 8
 3: 1 4 7
 4: 1 4 8
 5: 1 5 7
 6: 1 5 8
 7: 1 6 7
 8: 1 6 8
 9: 2 7 9
10: 2 8 9

However, I failed to make the output only using operation inside [] of data.table or merge? I have tired

merge(dx, dy, by = "a", all = TRUE)

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 10 rows; more than 9 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

dy[dx,on="a"]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 10 rows; more than 9 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

dx[, merge(dy, by = "a"), by = c("a", "b")]

Error in is.data.table(y) : argument "y" is missing, with no default

dx[, merge(.SD, dy, by = "a"), by = c("a", "b")]

Error in merge.data.table(.SD, dy, by = "a") :
Elements listed in by must be valid column names in x and y

How can I do it actually?

Thanks!

Best Answer

Because of multiple 1s, the error is coming. In default merge, i.e the implementation for data.frames, allow.cartesian=TRUE by default. But in the data.table implemetation, this is not the case. Hence, if you run the below code it will give you merge output.

merge(dx, dy, by = "a", all = TRUE, allow.cartesian=TRUE)

The above code will give you the output that you want.

Related Topic