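How to handle join keys of different lengths and variable ordering?

Time: 2016-12-20 02:37:59

Tags: r data.table

Consider two data tables with different numbers of key columns: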

library(data.table)
tmp_dt <- data.table(group1 = letters[1:5], group2 = c(1, 1, 2, 2, 2), a = rnorm(5), key = c("group1", "group2"))
tmp_dt2 <- data.table(group2 = c(1, 2, 3), color = c("r", "g", "b"), key = "group2")

I would like to join tmp_dt to tmp_dt2 on group2, but the following fails:

tmp_dt[tmp_dt2]

> tmp_dt[tmp_dt2]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'group1' is a character column being joined to i.'group2' which is type 'double'. Character columns must join to factor or character columns.

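This makes sense, since it tries to join the data tables on the first key variable of each. How can I fix this so that the behavior is the same as dplyr::inner_join, without doubling the cost by resetting the key on tmp_dt?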

> inner_join(tmp_dt, tmp_dt2, by = "group2")
  group1 group2          a color
1      a      1  0.2501413     r
2      b      1  0.6182433     r
3      c      2 -0.1726235     g
4      d      2 -2.2239003     g
5      e      2 -1.2636144     g

2 answers:

Answer 0: (score: 1)

Using lapply:

tmp_dt[,color:=unlist(lapply(.BY, function(x) tmp_dt2[group2==x, color])), by=group2]

As Frank pointed out in the comments, you can use on:

tmp_dt[tmp_dt2, on="group2"]

tmp_dt2[tmp_dt, on="group2"]

Using on is roughly twice as fast as using lapply with .BY. Although the first example returns the ...
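One detail worth checking against the question: tmp_dt2 contains group2 == 3, which has no match in tmp_dt, so tmp_dt[tmp_dt2, on = "group2"] keeps that row with NA values instead of dropping it as dplyr::inner_join would. The sketch below, which reuses the question's tables and adds a fixed seed as an assumption for reproducibility, shows how nomatch = 0L appears to give inner-join behavior while leaving the key on tmp_dt untouched.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, not part of the original question
tmp_dt <- data.table(group1 = letters[1:5], group2 = c(1, 1, 2, 2, 2),
                     a = rnorm(5), key = c("group1", "group2"))
tmp_dt2 <- data.table(group2 = c(1, 2, 3), color = c("r", "g", "b"),
                      key = "group2")

# Default nomatch = NA keeps the unmatched group2 == 3 row from tmp_dt2
right_join <- tmp_dt[tmp_dt2, on = "group2"]

# nomatch = 0L drops unmatched rows, which gives inner-join semantics
inner <- tmp_dt[tmp_dt2, on = "group2", nomatch = 0L]

# The ad hoc join does not change the key that was set on tmp_dt
key(tmp_dt)
```

Here right_join has six rows and inner has five, with columns in the same order as the inner_join output in the question.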

Answer 1: (score: 0)

You should use this code:

tmp_dt2[tmp_dt, on = 'group2']
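Note that this form returns one row per row of tmp_dt, with the columns of tmp_dt2 first. If the column layout of the inner_join output matters, a minimal sketch (reusing the question's tables; the name res is only illustrative) is to reorder the columns by reference afterwards:

```r
library(data.table)

tmp_dt <- data.table(group1 = letters[1:5], group2 = c(1, 1, 2, 2, 2),
                     a = rnorm(5), key = c("group1", "group2"))
tmp_dt2 <- data.table(group2 = c(1, 2, 3), color = c("r", "g", "b"),
                      key = "group2")

# One row per row of tmp_dt; columns come out as group2, color, group1, a
res <- tmp_dt2[tmp_dt, on = "group2"]

# Reorder the columns by reference to match the inner_join layout
setcolorder(res, c("group1", "group2", "a", "color"))
```

Because every group2 value in tmp_dt has a match in tmp_dt2, no NA rows appear here; with unmatched values, adding nomatch = 0L would drop them.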