Question

I have two data frames, as shown below:

d1 <- data.frame(h1 = c("foo","foo","bar","bar"), h2= c("a","b","c","d"), h3=c("x1","x2","x3","x4"))

which prints:

   h1 h2 h3
1 foo  a x1
2 foo  b x2
3 bar  c x3
4 bar  d x4

and

d2 <- data.frame(t1 = c("a","b","c","d"), t2 = c("x1","x2","x3","x4"), val = rnorm(4))

which produces:

   t1 t2      val
1  a x1 -1.183606
2  b x2 -1.358457
3  c x3 -1.512671
4  d x4 -1.253105
# the val column will differ from run to run, since it comes from rnorm()

What I want to do is combine d1 and d2 by matching the h2 and h3 columns of d1 against the t1 and t2 columns of d2, resulting in:

foo  a x1 -1.183606
foo  b x2 -1.358457
bar  c x3 -1.512671
bar  d x4 -1.253105

What is the way to do this?

Answer 1

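merge() accepts multiple key columns, and the column names can differ on each side. In the by.x and by.y arguments, x refers to the first data frame and y to the second: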

merge(d1, d2, by.x=c('h2', 'h3'), by.y=c('t1', 't2'))
##   h2 h3  h1         val
## 1  a x1 foo -0.04356036
## 2  b x2 foo  0.56975774
## 3  c x3 bar  0.03251157
## 4  d x4 bar -0.67823770
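
If the column order from the question (h1, h2, h3, val) matters, the merged result can be reordered afterwards. A minimal sketch, assuming the d1 and d2 from the question; the set.seed() call and the name res are additions for illustration only:

```r
# Rebuild the question's data; set.seed() is added only for reproducibility
set.seed(1)
d1 <- data.frame(h1 = c("foo", "foo", "bar", "bar"),
                 h2 = c("a", "b", "c", "d"),
                 h3 = c("x1", "x2", "x3", "x4"))
d2 <- data.frame(t1 = c("a", "b", "c", "d"),
                 t2 = c("x1", "x2", "x3", "x4"),
                 val = rnorm(4))

# Join on the two column pairs; merge() puts the key columns first
res <- merge(d1, d2, by.x = c("h2", "h3"), by.y = c("t1", "t2"))

# Reorder so h1 leads, as in the desired output
res <- res[, c("h1", "h2", "h3", "val")]
print(res)
```

Note that merge() sorts the result by the key columns by default, so the row order follows h2 and h3 rather than the original row order of d1.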

Answer 2

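I think this should do the trick: create a key in each data frame from its pair of columns, then merge on that key:

# build a composite key from each pair of join columns
d1$key <- paste(d1$h2, d1$h3)
d2$key <- paste(d2$t1, d2$t2)
# "key" is the only column name the two data frames share, so merge on it
merged <- merge(d1, d2, by = "key")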
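
The merged result from this approach carries the key column plus the now redundant t1 and t2, so it usually needs trimming. The sketch below shows one way to do that, assuming the d1 and d2 from the question. The explicit separator "|" is an assumption (it must not occur in the data); it is used here because pasting with the default space can map different pairs such as ("a b", "c") and ("a", "b c") to the same key.

```r
# Rebuild the question's data; set.seed() is added only for reproducibility
set.seed(1)
d1 <- data.frame(h1 = c("foo", "foo", "bar", "bar"),
                 h2 = c("a", "b", "c", "d"),
                 h3 = c("x1", "x2", "x3", "x4"))
d2 <- data.frame(t1 = c("a", "b", "c", "d"),
                 t2 = c("x1", "x2", "x3", "x4"),
                 val = rnorm(4))

# Composite keys; "|" is assumed never to appear in h2, h3, t1 or t2
d1$key <- paste(d1$h2, d1$h3, sep = "|")
d2$key <- paste(d2$t1, d2$t2, sep = "|")

# Merge on the key, then keep only the columns the question asks for
merged <- merge(d1, d2, by = "key")
trimmed <- merged[, c("h1", "h2", "h3", "val")]
print(trimmed)
```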

Answer 3

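Here is another approach, using data tables.

Joins are very efficient with data tables. Even with these tiny datasets, the data table join is about twice as fast, although you won't notice the difference. With larger datasets, the difference is dramatic.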

# data frames with 200,000 rows, same structure as OP's example
df1 <- data.frame(h1=rep(c("foo","foo","bar","bar"),each=50000),
                  h2=rep(letters[1:20],1e4),
                  h3=rep(1:1e4,each=20))
df2 <- data.frame(t1=rep(letters[1:20],1e4),
                  t2=rep(1:1e4,each=20),
                  val=rnorm(2e5))
# time the merge (~8.4 sec)
system.time(df.result <-merge(df1, df2, by.x=c('h2', 'h3'), by.y=c('t1', 't2')))
#  user  system elapsed 
#  8.41    0.02    8.42 

# convert to data tables and set keys
library(data.table)
dt1 <- data.table(df1, key="h2,h3")
dt2 <- data.table(df2, key="t1,t2")
# time the join (~0.2 sec)
system.time(dt.result <- dt1[dt2])
#  user  system elapsed 
#  0.19    0.00    0.18

Bottom line: for large datasets, the data table join is more than 40 times faster.
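
Setting keys is not the only way to name the join columns. Reasonably recent versions of data.table also accept them ad hoc through the on argument, which suits a one-off join where the column names differ on each side. A small sketch on the question's data (not the benchmark above), with set.seed() and the name res added for illustration:

```r
library(data.table)

# The question's data as data tables; set.seed() is added for reproducibility
set.seed(1)
dt1 <- data.table(h1 = c("foo", "foo", "bar", "bar"),
                  h2 = c("a", "b", "c", "d"),
                  h3 = c("x1", "x2", "x3", "x4"))
dt2 <- data.table(t1 = c("a", "b", "c", "d"),
                  t2 = c("x1", "x2", "x3", "x4"),
                  val = rnorm(4))

# For each row of dt2, look up the dt1 row where h2 == t1 and h3 == t2.
# No keys are set; the column pairs are given directly in `on`.
res <- dt1[dt2, on = .(h2 = t1, h3 = t2)]
print(res)
```

The result keeps the column names of dt1 for the join columns and appends val, which gives the h1, h2, h3, val layout asked for in the question.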

How to join data.frames based on two columns
