data.table: join a "long format" multiple-time-point table with a single-time-point table

Time: 2014-03-10 03:03:31

Tags: r data.table

Suppose I have the following two data.tables:

mult.year <- data.table(id=c(1,1,1,2,2,2,3,3,3), 
                        time=rep(1:3, 3),
                        A=rnorm(9),
                        B=rnorm(9))
setkey(mult.year, id)
single <- data.table(id=c(1,2,3), 
                     C.3=rnorm(3))
setkey(single, id)

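I want to join these two tables so that the variable C.3 appears only in the rows where time == 3, i.e. in mult.year[time == 3].

I can do this by assigning a new column: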

mult.year[time == 3, C := single[,C.3]]

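But then I lose the join functionality: this requires every id to be present in both datasets. Is there a way to do this while keeping the join behaviour?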

Using the tables above, I am trying to get this:

   id time          A          B        C.3
1:  1    1 -1.0460085  0.0896452         NA
2:  1    2  0.2054772  1.5631978         NA
3:  1    3 -1.7574449  0.5661457  0.6495645
4:  2    1  0.4171095 -0.2182779         NA
5:  2    2 -0.9238671  0.8263605         NA
6:  2    3 -0.5452715 -0.5842541 -1.5233764
7:  3    1  0.1793009  1.4399366         NA
8:  3    2  0.3438980  1.7419869         NA
9:  3    3  0.1067989  0.7630496  1.9658157

2 answers:

Answer 0: (score: 4)

If you are willing to include time in the data.tables' keys, you can do the following:

## Add time ...
setkeyv(mult.year, c("id", "time"))                     ## ... to mult.year's key
single <- data.table(id=c(1,2,3), time=3, C.3=rnorm(3)) ## ... and to indexing dt

## Which will set up a simple call to [.data.table
mult.year[single, C.3:=C.3]
mult.year
#    id time          A           B         C.3
# 1:  1    1 -0.6264538 -0.30538839          NA
# 2:  1    2  0.1836433  1.51178117          NA
# 3:  1    3 -0.8356286  0.38984324  0.61982575
# 4:  2    1  1.5952808 -0.62124058          NA
# 5:  2    2  0.3295078 -2.21469989          NA
# 6:  2    3 -0.8204684  1.12493092 -0.05612874
# 7:  3    1  0.4874291 -0.04493361          NA
# 8:  3    2  0.7383247 -0.01619026          NA
# 9:  3    3  0.5757814  0.94383621 -0.15579551
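
The question's concern was ids that are missing from one of the tables. A minimal sketch of how this keyed update join behaves in that case, assuming a hypothetical single with no row for id 2 (the seed, the integer column types, and the explicit i. prefix are illustrative choices, not part of the original answer):

```r
library(data.table)

set.seed(1)  # illustrative seed, not from the original post
mult.year <- data.table(id   = rep(1:3, each = 3),
                        time = rep(1:3, 3),
                        A    = rnorm(9),
                        B    = rnorm(9))
setkeyv(mult.year, c("id", "time"))

# Hypothetical lookup table that has no row for id 2
single <- data.table(id = c(1L, 3L), time = 3L, C.3 = rnorm(2))
setkeyv(single, c("id", "time"))

# Update join: only rows of mult.year matched on (id, time) are assigned;
# i.C.3 refers explicitly to the C.3 column of single
mult.year[single, C.3 := i.C.3]

# Unmatched rows keep NA: 7 of the 9 rows here
mult.year[is.na(C.3), .N]
```

Nothing fails when an id is absent from single; the corresponding rows of mult.year are simply left as NA.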

Alternatively, to leave single and the current keys intact, use the approach suggested in mnel's comment above:

mult.year[single, C.3 := ifelse(time==3,C.3,NA)]
mult.year
#    id time          A           B       C.3
# 1:  1    1 -0.6264538 -0.30538839        NA
# 2:  1    2  0.1836433  1.51178117        NA
# 3:  1    3 -0.8356286  0.38984324 0.8212212
# 4:  2    1  1.5952808 -0.62124058        NA
# 5:  2    2  0.3295078 -2.21469989        NA
# 6:  2    3 -0.8204684  1.12493092 0.5939013
# 7:  3    1  0.4874291 -0.04493361        NA
# 8:  3    2  0.7383247 -0.01619026        NA
# 9:  3    3  0.5757814  0.94383621 0.9189774
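
A self-contained sketch of the same idea, assuming a data.table version that supports the on= argument (1.9.6 or later), so that neither table needs a key. The i. prefix makes explicit that the value comes from single, and NA_real_ keeps the new column numeric; the seed is illustrative and not from the original post.

```r
library(data.table)

set.seed(1)  # illustrative seed
mult.year <- data.table(id   = rep(1:3, each = 3),
                        time = rep(1:3, 3),
                        A    = rnorm(9),
                        B    = rnorm(9))
single <- data.table(id = 1:3, C.3 = rnorm(3))

# Join on id only: each row of single matches three rows of mult.year.
# Inside j, time is mult.year's column and i.C.3 is single's column.
mult.year[single, on = "id",
          C.3 := ifelse(time == 3, i.C.3, NA_real_)]

mult.year
```

Because the join is on id alone, the time == 3 condition has to be applied inside j, which is what ifelse() does here.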

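Answer 1: (score: 2)

A more efficient solution is to think in terms of join types (SQL style; see http://en.wikipedia.org/wiki/Join_(SQL), which helped me enormously in making use of data.table's features), and then to look at the data.table FAQ (point 2.16) for how to implement the join you need.

Then, in fact, what you want is for the data.table single to be built as follows: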

single <- data.table(id=c(1,2,3),time=3,C.3=rnorm(3))

Right? Then what you want is a left join on mult.year, i.e.:

mult.year<-single[mult.year]

which gives you what you want. This approach is both clear and efficient. Compare:

> system.time(mult.year[single, C.3:=C.3])
user  system elapsed 
0.02    0.00    0.01

whereas my approach yields:

> system.time(mult.year<-single[mult.year])
user  system elapsed 
0       0       0 

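The one difference is the order in which the columns are returned, but I believe that is a minor issue compared with the speed gain once we consider a really large chunk of data. Hope that helps!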
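
If the column order matters, it can be restored after the join. A minimal sketch, assuming both tables are keyed on id and time, and using data.table's setcolorder(), which reorders columns by reference. The result name joined, the seed, and the values are illustrative and not from the original answer.

```r
library(data.table)

set.seed(1)  # illustrative seed
mult.year <- data.table(id   = rep(1:3, each = 3),
                        time = rep(1:3, 3),
                        A    = rnorm(9),
                        B    = rnorm(9),
                        key  = c("id", "time"))
single <- data.table(id = 1:3, time = 3L, C.3 = rnorm(3),
                     key = c("id", "time"))

# Left join on mult.year: every row of mult.year is kept
joined <- single[mult.year]
names(joined)  # key columns first, then C.3, then A and B

# Move C.3 back to the end without copying the table
setcolorder(joined, c("id", "time", "A", "B", "C.3"))
joined
```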

Edit: I forgot to mention that you need to set the keys correctly:

mult.year <- data.table(id=...,time=...,A=...,B=...,key=c("id","time"))
single <- data.table(id=...,time=3 ,C.3=...,key=c("id","time"))

The final output is:

> print(mult.year)
   id time        C.3           A          B
1:  1    1         NA  0.02556433 -0.4525380
2:  1    2         NA  0.37282039 -1.5151395
3:  1    3  0.1769263 -1.48347426  0.5536820
4:  2    1         NA  0.85327700 -0.4924897
5:  2    2         NA -1.10516056  0.8360339
6:  2    3 -0.3698935  1.45610643 -0.9189147
7:  3    1         NA -0.53218378 -0.6740748
8:  3    2         NA  0.34124242 -1.1458312
9:  3    3 -1.3997742  0.32009017  0.4333386