data.table - left outer join on multiple tables

Asked: 2015-07-17 16:35:38

Tags: r data.table

Suppose you have data like this:

fruits <- data.table(FruitID=c(1,2,3), Fruit=c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID=c(1,2,3,4,5), FruitID=c(1,1,1,2,3), Color=c("Red","Yellow","Green","Yellow","Red"))
tastes <- data.table(TasteID=c(1,2,3), FruitID=c(1,1,3), Taste=c("Sweeet", "Sour", "Sweet"))

setkey(fruits, "FruitID")
setkey(colors, "ColorID")
setkey(tastes, "TasteID")

fruits
   FruitID      Fruit
1:       1      Apple
2:       2     Banana
3:       3 Strawberry

colors
   ColorID FruitID  Color
1:       1       1    Red
2:       2       1 Yellow
3:       3       1  Green
4:       4       2 Yellow
5:       5       3    Red

tastes
   TasteID FruitID  Taste
1:       1       1 Sweeet
2:       2       1   Sour
3:       3       3  Sweet

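I often need to perform left outer joins on data like this. For example, "give me all fruits and their colors" requires me to write the following (is there perhaps a better way?):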

setkey(colors, "FruitID")
result <- colors[fruits, allow.cartesian=TRUE]
setkey(colors, "ColorID")

Three lines of code seem excessive for such a simple and frequent task, so I wrote a function, myLeftJoin:

myLeftJoin <- function(tbl1, tbl2){
  # Performs a left join using the key in tbl1 (i.e. keeps all rows from tbl1 and only matching rows from tbl2)

  oldkey <- key(tbl2)
  setkeyv(tbl2, key(tbl1))
  result <- tbl2[tbl1, allow.cartesian=TRUE]
  setkeyv(tbl2, oldkey)
  return(result)
}

which I can use like this:

myLeftJoin(fruits, colors)
   ColorID FruitID  Color      Fruit
1:       1       1    Red      Apple
2:       2       1 Yellow      Apple
3:       3       1  Green      Apple
4:       4       2 Yellow     Banana
5:       5       3    Red Strawberry

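How can I extend this function so that I can pass it any number of tables and get the chained left outer join of all of them? Something like myLeftJoin(tbl1, ...).

For example, I would like the result of myLeftJoin(fruits, colors, tastes) to be equivalent to: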

setkey(colors, "FruitID")
setkey(tastes, "FruitID")
result <- tastes[colors[fruits, allow.cartesian=TRUE], allow.cartesian=TRUE]
setkey(tastes, "TasteID")
setkey(colors, "ColorID")

result
   TasteID FruitID  Taste ColorID  Color      Fruit
1:       1       1 Sweeet       1    Red      Apple
2:       2       1   Sour       1    Red      Apple
3:       1       1 Sweeet       2 Yellow      Apple
4:       2       1   Sour       2 Yellow      Apple
5:       1       1 Sweeet       3  Green      Apple
6:       2       1   Sour       3  Green      Apple
7:      NA       2     NA       4 Yellow     Banana
8:       3       3  Sweet       5    Red Strawberry

Perhaps there is an elegant solution using functions from the data.table package that I have missed? Thanks.

(Edit: fixed a mistake in my data)

2 answers:

Answer 0: (score: 9)

I have just committed a new feature in data.table v1.9.5 that lets us join without setting keys (that is, by specifying the columns to join on directly, without first having to use setkey()).

With this, it is simply:

require(data.table) # v1.9.5+
fruits[tastes, on="FruitID"][colors, on="FruitID"] # no setkey required
#    FruitID      Fruit TasteID  Taste ColorID  Color
# 1:       1      Apple       1 Sweeet       1    Red
# 2:       1      Apple       2   Sour       1    Red
# 3:       1      Apple       1 Sweeet       2 Yellow
# 4:       1      Apple       2   Sour       2 Yellow
# 5:       1      Apple       1 Sweeet       3  Green
# 6:       1      Apple       2   Sour       3  Green
# 7:       2         NA      NA     NA       4 Yellow
# 8:       3 Strawberry       3  Sweet       5    Red
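To accept any number of tables, as the question asks, the `on=` join could be folded over a list with `Reduce`. The following is only a minimal sketch: `leftJoinAll` and its `on_cols` parameter are hypothetical names rather than part of data.table, and it assumes a data.table version that supports `on=`. It joins in the `tbl[acc]` direction used by the question's myLeftJoin, so every row of the first table is kept.

```r
library(data.table)  # assumes a version where `on=` is available

fruits <- data.table(FruitID = c(1, 2, 3),
                     Fruit = c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID = c(1, 2, 3, 4, 5), FruitID = c(1, 1, 1, 2, 3),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))
tastes <- data.table(TasteID = c(1, 2, 3), FruitID = c(1, 1, 3),
                     Taste = c("Sweeet", "Sour", "Sweet"))

# Hypothetical helper (not part of data.table): left-joins each table in
# `...` onto `first`, one after another, matching on the column(s) named
# in `on_cols`. tbl[acc, on = ...] keeps every row of `acc`.
leftJoinAll <- function(first, ..., on_cols) {
  Reduce(function(acc, tbl) tbl[acc, on = on_cols, allow.cartesian = TRUE],
         list(...), first)
}

result <- leftJoinAll(fruits, colors, tastes, on_cols = "FruitID")
```

Because no keys are set, the input tables are left unmodified, so there is nothing to restore afterwards.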

Answer 1: (score: 6)

You can use base R's Reduce together with left_join (from dplyr) on a list of data.table objects, provided that you are joining the tables on a common column name and want to avoid setting keys on the data.table objects multiple times:

library(data.table) # <= v1.9.4
library(dplyr) # left_join
Reduce(function(...) left_join(...), list(fruits,colors,tastes))
# Source: local data table [8 x 6]
#
#  FruitID      Fruit ColorID  Color TasteID  Taste
#1       1      Apple       1    Red       1 Sweeet
#2       1      Apple       1    Red       2   Sour
#3       1      Apple       2 Yellow       1 Sweeet
#4       1      Apple       2 Yellow       2   Sour
#5       1      Apple       3  Green       1 Sweeet
#6       1      Apple       3  Green       2   Sour
#7       2     Banana       4 Yellow      NA     NA
#8       3 Strawberry       5    Red       3  Sweet

Another option, in pure data.table, is to chain the joins directly, as @Frank mentioned. Note that this requires setting the key of every data.table object to FruitID.
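The same Reduce pattern can also stay within data.table without setting any keys. This is only a minimal sketch: it swaps in data.table's `merge` method with `all.x = TRUE` rather than a keyed `X[Y]` join, and it assumes every table shares the FruitID column.

```r
library(data.table)

fruits <- data.table(FruitID = c(1, 2, 3),
                     Fruit = c("Apple", "Banana", "Strawberry"))
colors <- data.table(ColorID = c(1, 2, 3, 4, 5), FruitID = c(1, 1, 1, 2, 3),
                     Color = c("Red", "Yellow", "Green", "Yellow", "Red"))
tastes <- data.table(TasteID = c(1, 2, 3), FruitID = c(1, 1, 3),
                     Taste = c("Sweeet", "Sour", "Sweet"))

# merge() on data.table objects dispatches to data.table's merge method.
# all.x = TRUE keeps every row of the left table (a left outer join);
# allow.cartesian = TRUE permits one-to-many matches to multiply rows.
result <- Reduce(
  function(x, y) merge(x, y, by = "FruitID", all.x = TRUE,
                       allow.cartesian = TRUE),
  list(fruits, colors, tastes)
)
```

Since `merge` takes the join column through `by`, the original tables keep whatever keys they already had.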