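R: merge a specific column from one data frame into another based on its reference columns

Time: 2014-07-28 19:46:04

Tags: r merge dataframe

I'm a beginner R user. I have two huge data frames, and I want to add a new column named vaccine to hkdata.2. The values come from another data frame, adherence, and should be matched on hkdata.2's two reference columns (hhID and member). Can anyone help me?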

hkdata.2
hhID    member  T0  delta   X_hh    X_fm    ILI age
1          1    7      0    0       0        0  44
1          2    7      0    0       0        0  36
2          1    8      0    1       0        0  39
2          2    8      0    1       0        0  39

adherence
hhID member mask soap vaccine
1      0      1    0    1   
1      1      1    1    1
1      2      0    0    1
2      0      1    0    0
2      1      0    0    0
2      2      1    0    1

So in the end I would get something like this, with an extra column named vaccine in hkdata.2.

hkdata.2
hhID    member  T0  delta   X_hh    X_fm    ILI age vaccine
1          1    7      0    0       0        0  44    1
1          2    7      0    0       0        0  36    1
2          1    8      0    1       0        0  39    0
2          2    8      0    1       0        0  39    1

3 answers:

Answer 0: (score: 4)

Updated for v1.9.6, which uses the on= syntax. See the edit history for the old code.

require(data.table) # v1.9.6+
setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")]
#    hhID member T0 delta X_hh X_fm ILI age vaccine
# 1:    1      1  7     0    0    0   0  44       1
# 2:    1      2  7     0    0    0   0  36       1
# 3:    2      1  8     0    1    0   0  39       0
# 4:    2      2  8     0    1    0   0  39       1
  1. setDT converts a data.frame to a data.table by reference.

  2. The join is performed on the columns specified by on=. Note that this join is both a) fast and b) memory efficient. a) Fast because these are binary search based joins, and no copy is being made here at all: the vaccine column is added directly to your hkdata.2 data.table. b) Memory efficient because only the vaccine column is brought over in the join, not the other columns (which is really nice for very large data sets).


  3. Here is a benchmark, assuming 100,000 hhIDs with 200 members within each hhID:

    require(data.table) # v1.9.6
    require(dplyr) # 0.4.3.9000
    
    set.seed(98192L)
    N = 40e6 # 40 million rows
    hkdata.2 = data.frame(hhID   = rep(1:1e5, each=200), 
                          member = 1:200,
                          T0     = sample(10),
                          delta  = sample(0:1), 
                          X_hh   = sample(0:1), 
                          X_fm   = sample(0:1), 
                          ILI    = sample(0:1),
                          age    = sample(30:100, N/2, TRUE))
    
    # let's go with 100,000 hhIDs and 400 members here:
    adherence = data.frame(hhID    = rep(1:1e5, each=400), 
                           member  = 1:400,
                           mask    = sample(0:1),
                           soap    = sample(0:1),
                           vaccine = sample(0:1))
    
    ## dplyr timing
    system.time(ans1 <- left_join(hkdata.2, select(adherence, -soap, -mask)))
    #   user  system elapsed
    # 16.977   2.163  19.605
    ## data.table timing
    system.time(setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")])
    #   user  system elapsed
    #  1.186   0.233   1.427
    

    dplyr took 19.6 seconds with a peak memory usage of 4.7GB, whereas data.table took 1.4 seconds with a peak memory usage of 2.2GB.

      

    Summary: data.table is about 14x faster and about 2.1x more memory efficient here.
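
To try the update join without the 40-million-row benchmark, here is a minimal sketch on the sample data from the question. It assumes data.table v1.9.6 or later is installed; the data frames are retyped by hand from the question, so treat the column types as an assumption.

```r
library(data.table)  # assumes v1.9.6+ for the on= argument

# Sample data retyped from the question
hkdata.2 <- data.frame(
  hhID   = c(1L, 1L, 2L, 2L),
  member = c(1L, 2L, 1L, 2L),
  T0     = c(7, 7, 8, 8),
  delta  = c(0, 0, 0, 0),
  X_hh   = c(0, 0, 1, 1),
  X_fm   = c(0, 0, 0, 0),
  ILI    = c(0, 0, 0, 0),
  age    = c(44, 36, 39, 39)
)
adherence <- data.frame(
  hhID    = c(1L, 1L, 1L, 2L, 2L, 2L),
  member  = c(0L, 1L, 2L, 0L, 1L, 2L),
  mask    = c(1, 1, 0, 1, 0, 1),
  soap    = c(0, 1, 0, 0, 0, 0),
  vaccine = c(1, 1, 1, 0, 0, 1)
)

# Convert both to data.table by reference (no copy)
setDT(hkdata.2)
setDT(adherence)

# Update join: rows of adherence are looked up in hkdata.2 on (hhID, member),
# and the matching rows of hkdata.2 receive adherence's vaccine value.
# The i. prefix refers to the column from the table inside [ ], i.e. adherence.
hkdata.2[adherence, vaccine := i.vaccine, on = c("hhID", "member")]

print(hkdata.2)
```

The adherence rows with member 0 have no counterpart in hkdata.2 and are simply skipped. Had a row of hkdata.2 found no match in adherence, its vaccine value would be NA.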

Answer 1: (score: 1)

 library(dplyr)
 left_join(hkdata.2, adherence)
 #    Joining by: c("hhID", "member")
 #  hhID member T0 delta X_hh X_fm ILI age mask soap vaccine
 #1    1      1  7     0    0    0   0  44    1    1       1
 #2    1      2  7     0    0    0   0  36    0    0       1
 #3    2      1  8     0    1    0   0  39    0    0       0
 #4    2      2  8     0    1    0   0  39    1    0       1

If you don't need mask and soap:

  left_join(hkdata.2, adherence) %>% select(-soap, -mask)

Or

  left_join(hkdata.2, adherence[,c("hhID", "member", "vaccine")])
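
The calls above let dplyr pick the join columns from the names the two data frames share. If other columns might happen to share a name later on, it may be safer to spell the keys out with by=. A minimal sketch, using a trimmed-down version of the question's data (only hhID, member and age are kept in hkdata.2 for brevity):

```r
library(dplyr)

# Trimmed-down sample data based on the question (illustrative only)
hkdata.2 <- data.frame(
  hhID   = c(1L, 1L, 2L, 2L),
  member = c(1L, 2L, 1L, 2L),
  age    = c(44, 36, 39, 39)
)
adherence <- data.frame(
  hhID    = c(1L, 1L, 1L, 2L, 2L, 2L),
  member  = c(0L, 1L, 2L, 0L, 1L, 2L),
  mask    = c(1, 1, 0, 1, 0, 1),
  soap    = c(0, 1, 0, 0, 0, 0),
  vaccine = c(1, 1, 1, 0, 0, 1)
)

# Keep only the key columns plus vaccine, then join on explicit keys
result <- left_join(
  hkdata.2,
  select(adherence, hhID, member, vaccine),
  by = c("hhID", "member")
)

print(result)
```

Unlike the data.table approach, this returns a new data frame rather than modifying hkdata.2 in place, so the result has to be assigned.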

Answer 2: (score: 0)

You can use the plyr library to do this.

library(plyr)
new_frame=join(hkdata.2,adherence,by=c('hhID','member'))
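
Note that, like left_join without select, this join brings mask and soap along as well. For comparison, a similar result can be sketched in base R with merge(), with no extra packages. This is an alternative rather than what the answer above proposes, and the data below is a trimmed-down version of the question's sample:

```r
# Trimmed-down sample data based on the question (illustrative only)
hkdata.2 <- data.frame(
  hhID   = c(1L, 1L, 2L, 2L),
  member = c(1L, 2L, 1L, 2L),
  age    = c(44, 36, 39, 39)
)
adherence <- data.frame(
  hhID    = c(1L, 1L, 1L, 2L, 2L, 2L),
  member  = c(0L, 1L, 2L, 0L, 1L, 2L),
  mask    = c(1, 1, 0, 1, 0, 1),
  soap    = c(0, 1, 0, 0, 0, 0),
  vaccine = c(1, 1, 1, 0, 0, 1)
)

# Left join in base R: all.x = TRUE keeps every row of hkdata.2,
# and subsetting adherence first leaves mask and soap out of the result.
new_frame <- merge(
  hkdata.2,
  adherence[, c("hhID", "member", "vaccine")],
  by    = c("hhID", "member"),
  all.x = TRUE
)

print(new_frame)
```

By default merge() sorts the result on the by columns, so the row order can differ from hkdata.2 when the input is not already sorted by hhID and member.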