Suppose I have the following two data.tables:
library(data.table)
mult.year <- data.table(id=c(1,1,1,2,2,2,3,3,3),
                        time=rep(1:3, 3),
                        A=rnorm(9),
                        B=rnorm(9))
setkey(mult.year, id)
single <- data.table(id=c(1,2,3),
                     C.3=rnorm(3))
setkey(single, id)
I want to join these two tables so that the variable C.3 is filled in only for the rows in mult.year[time == 3].
I can do this by assigning a new column:
mult.year[time == 3, C := single[,C.3]]
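But then I lose the join functionality: the assignment only works if every id is present in both datasets. Is there a way to do this while keeping the join behaviour?
Using the tables above, I am trying to get this: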
id time A B C.3
1: 1 1 -1.0460085 0.0896452 NA
2: 1 2 0.2054772 1.5631978 NA
3: 1 3 -1.7574449 0.5661457 0.6495645
4: 2 1 0.4171095 -0.2182779 NA
5: 2 2 -0.9238671 0.8263605 NA
6: 2 3 -0.5452715 -0.5842541 -1.5233764
7: 3 1 0.1793009 1.4399366 NA
8: 3 2 0.3438980 1.7419869 NA
9: 3 3 0.1067989 0.7630496 1.9658157
Answer 0 (score: 4)
If you are willing to add time to the data.table's key, you can do the following:
## Add time ...
setkeyv(mult.year, c("id", "time")) ## ... to mult.year's key
single <- data.table(id=c(1,2,3), time=3, C.3=rnorm(3)) ## ... and to indexing dt
## Which will set up a simple call to [.data.table
mult.year[single, C.3:=C.3]
mult.year
# id time A B C.3
# 1: 1 1 -0.6264538 -0.30538839 NA
# 2: 1 2 0.1836433 1.51178117 NA
# 3: 1 3 -0.8356286 0.38984324 0.61982575
# 4: 2 1 1.5952808 -0.62124058 NA
# 5: 2 2 0.3295078 -2.21469989 NA
# 6: 2 3 -0.8204684 1.12493092 -0.05612874
# 7: 3 1 0.4874291 -0.04493361 NA
# 8: 3 2 0.7383247 -0.01619026 NA
# 9: 3 3 0.5757814 0.94383621 -0.15579551
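The same update join can also be written without touching either table's key. This is a minimal sketch, assuming data.table 1.9.6 or later (where the on= argument is available). The fixed C.3 values and the seed are made up for illustration so the result is easy to check.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
mult.year <- data.table(id   = rep(1:3, each = 3),
                        time = rep(1:3, 3),
                        A    = rnorm(9),
                        B    = rnorm(9))
# Illustrative lookup table: one row per id, all at time 3
single <- data.table(id = 1:3, time = 3L, C.3 = c(10, 20, 30))

# Update join: match rows on (id, time) and copy C.3 from `single`.
# The i. prefix makes it explicit that the value comes from `single`.
mult.year[single, C.3 := i.C.3, on = .(id, time)]
print(mult.year)
```

Rows of mult.year with no match in single are left as NA, and neither table needs a key.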
Alternatively, to keep single and the current key intact, use the approach suggested in mnel's comment above:
mult.year[single, C.3 := ifelse(time==3,C.3,NA)]
mult.year
# id time A B C.3
# 1: 1 1 -0.6264538 -0.30538839 NA
# 2: 1 2 0.1836433 1.51178117 NA
# 3: 1 3 -0.8356286 0.38984324 0.8212212
# 4: 2 1 1.5952808 -0.62124058 NA
# 5: 2 2 0.3295078 -2.21469989 NA
# 6: 2 3 -0.8204684 1.12493092 0.5939013
# 7: 3 1 0.4874291 -0.04493361 NA
# 8: 3 2 0.7383247 -0.01619026 NA
# 9: 3 3 0.5757814 0.94383621 0.9189774
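To check that this variant keeps the join behaviour the question asks about, here is a small sketch in which one id is deliberately missing from the lookup table. The values 10 and 30 are made up, and the on= argument (an assumption: data.table 1.9.6 or later) stands in for the key on id.

```r
library(data.table)

mult.year <- data.table(id   = rep(1:3, each = 3),
                        time = rep(1:3, 3),
                        A    = rnorm(9),
                        B    = rnorm(9))
# Illustrative lookup table: id 2 is missing on purpose, no time column
single <- data.table(id = c(1L, 3L), C.3 = c(10, 30))

# Join on id only, then keep the looked-up value just where time == 3.
# `time` comes from mult.year; i.C.3 comes from `single`.
mult.year[single, C.3 := ifelse(time == 3, i.C.3, NA_real_), on = "id"]
print(mult.year)
```

The rows for id 2 have no match in single, so they are never assigned and stay NA, with no error.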
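Answer 1 (score: 2)
A more efficient solution is to think in terms of join types (SQL style; reading http://en.wikipedia.org/wiki/Join_(SQL) helped me a great deal in making use of data.table's features), and then to look at the data.table FAQ (point 2.16) to see how to implement the join you need.
So what you actually want is for the data.table single to be constructed as follows: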
single <- data.table(id=c(1,2,3),time=3,C.3=rnorm(3))
Right? Then what you want is a left join with respect to mult.year, i.e.:
mult.year<-single[mult.year]
which gives you what you want. This approach is both clear and efficient. Compare:
> system.time(mult.year[single, C.3:=C.3])
user system elapsed
0.02 0.00 0.01
whereas my approach yields:
> system.time(mult.year<-single[mult.year])
user system elapsed
0 0 0
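The only difference is the order in which the columns are returned, but I believe that is a minor issue next to the speed gain once we consider a very large dataset. Hope that helps!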
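Timings on a nine-row table are too small to tell the two approaches apart, so here is a rough sketch for repeating the comparison on a larger table. The names big, lookup and n and the chosen size are hypothetical, and no particular result is implied. Note that X[Y] builds a new table, while := adds the column to the existing table by reference.

```r
library(data.table)

set.seed(1)
n <- 1e5L  # hypothetical number of ids; adjust to taste
big <- data.table(id   = rep(seq_len(n), each = 3L),
                  time = rep(1:3, n),
                  A    = rnorm(3L * n),
                  B    = rnorm(3L * n),
                  key  = c("id", "time"))
lookup <- data.table(id = seq_len(n), time = 3L, C.3 = rnorm(n),
                     key = c("id", "time"))

# Left join first, while `big` has no C.3 column yet: returns a new table
t_join <- system.time(res_join <- lookup[big])

# Update join second: adds C.3 to `big` in place
t_update <- system.time(big[lookup, C.3 := i.C.3])

print(rbind(left_join = t_join, update_join = t_update))
```

Both calls produce the same C.3 values, so the choice comes down to timing on your own data and whether you want a new table or an in-place update.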
Edit: I forgot to mention that you need to set the keys correctly:
mult.year <- data.table(id=...,time=...,A=...,B=...,key=c("id","time"))
single <- data.table(id=...,time=3 ,C.3=...,key=c("id","time"))
The final output is:
> print(mult.year)
id time C.3 A B
1: 1 1 NA 0.02556433 -0.4525380
2: 1 2 NA 0.37282039 -1.5151395
3: 1 3 0.1769263 -1.48347426 0.5536820
4: 2 1 NA 0.85327700 -0.4924897
5: 2 2 NA -1.10516056 0.8360339
6: 2 3 -0.3698935 1.45610643 -0.9189147
7: 3 1 NA -0.53218378 -0.6740748
8: 3 2 NA 0.34124242 -1.1458312
9: 3 3 -1.3997742 0.32009017 0.4333386
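In the output above, C.3 comes before A and B. If the original column order matters, setcolorder can restore it by reference after the join. This is a minimal self-contained sketch. The data are regenerated with an arbitrary seed, and the name joined is only for illustration.

```r
library(data.table)

set.seed(1)  # arbitrary seed for reproducibility
mult.year <- data.table(id = rep(1:3, each = 3), time = rep(1:3, 3),
                        A = rnorm(9), B = rnorm(9),
                        key = c("id", "time"))
single <- data.table(id = 1:3, time = 3L, C.3 = rnorm(3),
                     key = c("id", "time"))

# Left join with respect to mult.year: every row of mult.year is kept
joined <- single[mult.year]
names(joined)  # id, time, C.3, A, B

# Move C.3 back to the end, in place (no copy of the table)
setcolorder(joined, c("id", "time", "A", "B", "C.3"))
print(joined)
```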