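I am a beginner R user. I have two huge data frames, and I want to add a new column named vaccine to hkdata.2. The values should be looked up using two reference columns of hkdata.2 (hhID and member), and the data comes from another data frame, adherence. Can someone help me?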
hkdata.2
hhID member T0 delta X_hh X_fm ILI age
1 1 7 0 0 0 0 44
1 2 7 0 0 0 0 36
2 1 8 0 1 0 0 39
2 2 8 0 1 0 0 39
adherence
hhID member mask soap vaccine
1 0 1 0 1
1 1 1 1 1
1 2 0 0 1
2 0 1 0 0
2 1 0 0 0
2 2 1 0 1
So in the end I would get something like this, with an extra column named vaccine in hkdata.2.
hkdata.2
hhID member T0 delta X_hh X_fm ILI age vaccine
1 1 7 0 0 0 0 44 1
1 2 7 0 0 0 0 36 1
2 1 8 0 1 0 0 39 0
2 2 8 0 1 0 0 39 1
Answer 0 (score: 4)
Update for v1.9.6: use the on= syntax. See the edit history for the older code.
require(data.table) # v1.9.6+
setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")]
# hhID member T0 delta X_hh X_fm ILI age vaccine
# 1: 1 1 7 0 0 0 0 44 1
# 2: 1 2 7 0 0 0 0 36 1
# 3: 2 1 8 0 1 0 0 39 0
# 4: 2 2 8 0 1 0 0 39 1
setDT converts a data.frame to a data.table by reference.
The join is performed on the columns specified in on=. Note that this join is both a) fast and b) memory efficient.
a) Fast, because these are binary search based joins and no copy is made here at all. The vaccine column is added directly to your hkdata.2 data.table.
b) Memory efficient, because only the vaccine column is used in the join rather than the other columns as well (which is great on really large data sets).
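To make the update join concrete, here is a minimal sketch on the small sample data from the question. The integer column types are an assumption, since the question only shows printed values; the join call itself is the one from this answer, split into separate steps for readability.

```r
library(data.table)  # on= requires v1.9.6 or later

# Sample data from the question (integer types are assumed)
hkdata.2 <- data.frame(
  hhID   = c(1L, 1L, 2L, 2L),
  member = c(1L, 2L, 1L, 2L),
  T0     = c(7L, 7L, 8L, 8L),
  delta  = 0L,
  X_hh   = c(0L, 0L, 1L, 1L),
  X_fm   = 0L,
  ILI    = 0L,
  age    = c(44L, 36L, 39L, 39L)
)
adherence <- data.frame(
  hhID    = rep(1:2, each = 3),
  member  = rep(0:2, times = 2),
  mask    = c(1L, 1L, 0L, 1L, 0L, 1L),
  soap    = c(0L, 1L, 0L, 0L, 0L, 0L),
  vaccine = c(1L, 1L, 1L, 0L, 0L, 1L)
)

# Convert both objects to data.table by reference (no copy)
setDT(hkdata.2)
setDT(adherence)

# For each adherence row, find the matching hkdata.2 rows on (hhID, member)
# and write adherence's vaccine value into them; i.vaccine refers to the
# vaccine column of the table passed as i (adherence)
hkdata.2[adherence, vaccine := i.vaccine, on = c("hhID", "member")]

print(hkdata.2)
```

Rows of adherence with no match in hkdata.2 (here, member 0) are simply ignored, and any hkdata.2 row without a match would be left as NA.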
Here is a benchmark, assuming 100,000 hhIDs with 200 members within each hhID:
require(data.table) # v1.9.6
require(dplyr) # 0.4.3.9000
set.seed(98192L)
N = 40e6 # 40 million rows in adherence; hkdata.2 gets N/2 = 20 million
hkdata.2 = data.frame(hhID = rep(1:1e5, each=200),
                      member = 1:200,
                      T0 = sample(10),
                      delta = sample(0:1),
                      X_hh = sample(0:1),
                      X_fm = sample(0:1),
                      ILI = sample(0:1),
                      age = sample(30:100, N/2, TRUE))
# let's go with 100,000 hhIDs and 400 members here:
adherence = data.frame(hhID = rep(1:1e5, each=400),
                       member = 1:400,
                       mask = sample(0:1),
                       soap = sample(0:1),
                       vaccine = sample(0:1))
## dplyr timing
system.time(ans1 <- left_join(hkdata.2, select(adherence, -soap, -mask)))
# user system elapsed
# 16.977 2.163 19.605
## data.table timing
system.time(setDT(hkdata.2)[setDT(adherence), vaccine := i.vaccine, on=c("hhID", "member")])
# user system elapsed
# 1.186 0.233 1.427
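dplyr had a peak memory usage of 4.7GB and took 19.6 seconds, whereas data.table took 1.4 seconds with a peak memory usage of 2.2GB.
Summary: data.table is about 14x faster and about 2.1x more memory efficient here.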
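The two ratios in the summary follow directly from the reported numbers. As a quick check, using the elapsed times and peak memory figures quoted in this answer:

```r
# Elapsed times (seconds) and peak memory (GB) as reported in the benchmark
dplyr_elapsed      <- 19.605
datatable_elapsed  <- 1.427
dplyr_peak_gb      <- 4.7
datatable_peak_gb  <- 2.2

speedup   <- dplyr_elapsed / datatable_elapsed   # about 13.7
mem_ratio <- dplyr_peak_gb / datatable_peak_gb   # about 2.14

cat(sprintf("speedup: %.1fx, memory ratio: %.1fx\n", speedup, mem_ratio))
```

These figures come from one machine and specific package versions, so they are best read as an order-of-magnitude indication.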
Answer 1 (score: 1)
library(dplyr)
left_join(hkdata.2, adherence)
# Joining by: c("hhID", "member")
# hhID member T0 delta X_hh X_fm ILI age mask soap vaccine
#1 1 1 7 0 0 0 0 44 1 1 1
#2 1 2 7 0 0 0 0 36 0 0 1
#3 2 1 8 0 1 0 0 39 0 0 0
#4 2 2 8 0 1 0 0 39 1 0 1
If you don't need mask and soap:
left_join(hkdata.2, adherence) %>% select(-soap, -mask)
Or:
left_join(hkdata.2, adherence[,c("hhID", "member", "vaccine")])
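As a self-contained sketch of the same idea, the version below rebuilds the question's sample data (integer types are assumed) and passes by= explicitly, which avoids the "Joining by" message and makes the key columns visible in the code. The name result is only illustrative.

```r
library(dplyr)

# Sample data from the question (integer types are assumed)
hkdata.2 <- data.frame(
  hhID   = c(1L, 1L, 2L, 2L),
  member = c(1L, 2L, 1L, 2L),
  T0     = c(7L, 7L, 8L, 8L),
  delta  = 0L,
  X_hh   = c(0L, 0L, 1L, 1L),
  X_fm   = 0L,
  ILI    = 0L,
  age    = c(44L, 36L, 39L, 39L)
)
adherence <- data.frame(
  hhID    = rep(1:2, each = 3),
  member  = rep(0:2, times = 2),
  mask    = c(1L, 1L, 0L, 1L, 0L, 1L),
  soap    = c(0L, 1L, 0L, 0L, 0L, 0L),
  vaccine = c(1L, 1L, 1L, 0L, 0L, 1L)
)

# Keep only the key columns and vaccine before joining, so mask and soap
# never enter the result
result <- left_join(
  hkdata.2,
  select(adherence, hhID, member, vaccine),
  by = c("hhID", "member")
)

print(result)
```

Unlike the data.table approach, this returns a new data frame, so assign it back to hkdata.2 if you want to replace the original.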
Answer 2 (score: 0)
You can use the plyr library to do this.
library(plyr)
new_frame=join(hkdata.2,adherence,by=c('hhID','member'))
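The call above brings in mask and soap as well. A minimal variation that adds only vaccine, sketched on the question's sample data (integer types are assumed), subsets adherence first. It calls plyr::join with the namespace prefix rather than attaching plyr, which sidesteps name clashes if dplyr is also loaded.

```r
# Sample data from the question (integer types are assumed)
hkdata.2 <- data.frame(
  hhID   = c(1L, 1L, 2L, 2L),
  member = c(1L, 2L, 1L, 2L),
  T0     = c(7L, 7L, 8L, 8L),
  delta  = 0L,
  X_hh   = c(0L, 0L, 1L, 1L),
  X_fm   = 0L,
  ILI    = 0L,
  age    = c(44L, 36L, 39L, 39L)
)
adherence <- data.frame(
  hhID    = rep(1:2, each = 3),
  member  = rep(0:2, times = 2),
  mask    = c(1L, 1L, 0L, 1L, 0L, 1L),
  soap    = c(0L, 1L, 0L, 0L, 0L, 0L),
  vaccine = c(1L, 1L, 1L, 0L, 0L, 1L)
)

# Left join on the two key columns, carrying over only vaccine
new_frame <- plyr::join(
  hkdata.2,
  adherence[, c("hhID", "member", "vaccine")],
  by   = c("hhID", "member"),
  type = "left"
)

print(new_frame)
```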