Subset a data.table on two key columns where either key matches

Asked: 2017-12-10 19:07:06

Tags: r data.table

I'm new to data.table and seem to be missing something obvious. I have a table:

library(data.table)
DT = data.table(A = c("x","y","y","z"), B = c("y","x","x","z"), value = 1:4)
setkey(DT, A, B)

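Now I want to find all rows where A or B is "y" (using binary search; my actual table is much larger and the operation has to be performed millions of times). I can't figure out how to do this in a single statement, because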

DT[.("y", "y"), nomatch=0]

only gives me the rows where (A & B) == "y" (but I want (A | B) == "y"). What I'm currently doing is:

uA <- unique(DT[, A])
rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])

But I feel there must be a more intuitive way.

Thanks for your help!
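One caveat with the rbind workaround above is worth checking before benchmarking it. The sketch below uses a hypothetical table DT2 (not from the question) that contains a row with "y" in both columns. That row matches both lookups, so it appears twice in the combined result. Wrapping the result in unique() is one way to drop the repeat, assuming the rows are otherwise distinct, which the value column guarantees here.

```r
library(data.table)

# Hypothetical table: the third row has "y" in both A and B.
DT2 <- data.table(A = c("x", "y", "y", "z"),
                  B = c("y", "x", "y", "z"),
                  value = 1:4)
setkey(DT2, A, B)

uA <- unique(DT2[, A])

# First lookup: any A combined with B == "y"; second lookup: A == "y".
res <- rbind(DT2[.(uA, "y"), nomatch = 0], DT2[.("y"), nomatch = 0])

# The (y, y) row is returned by both lookups.
dup_count <- sum(res$value == 3)

# Drop repeated rows to get the plain OR result.
res_unique <- unique(res)
```

The question's own example table has no row with "y" in both columns, so the duplication does not show up there.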

Benchmarks

Added code adapted from "Binary search DT with key on two columns using alternative (OR) instead of a conjunction" (from @Frank's comment):
n = 1e6
DT = data.table(A = sample(letters, n, replace = TRUE), 
                B = sample(letters, n, replace = TRUE), value = 1:n)
setkey(DT, A, B)
uA <- unique(DT[, A])

library(microbenchmark)
Union = function(){
   mya = DT[A=="y", which=TRUE]
   myb = DT[B=="y", which=TRUE]
   DT[union(mya,myb)] 
} 
microbenchmark(
    "reduce" = DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]],
    "rbind" = rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0]),
    "union" = Union()
)

Unit: milliseconds
   expr      min        lq      mean    median        uq       max neval
 reduce 9.922728 10.116613 11.422823 10.226871 11.803204 25.453557   100
  rbind 2.596139  2.734751  2.916620  2.850199  3.113995  3.453326   100
  union 5.393815  5.725917  6.221544  5.906222  6.758622 14.019206   100

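1 answer:

Answer 0 (score: 1)

We can use Reduce with | to get a logical vector that flags the rows in which any of the columns specified in .SDcols has the value 'y', and use that vector to subset the rows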

DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
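To make the intermediate step visible, here is a minimal sketch on the four-row table from the question. The name mask is introduced only for illustration. lapply(.SD, `==`, "y") produces one logical vector per column, and Reduce(`|`, ...) ORs them together element by element.

```r
library(data.table)

DT <- data.table(A = c("x", "y", "y", "z"),
                 B = c("y", "x", "x", "z"),
                 value = 1:4)
setkey(DT, A, B)

# One logical vector per column in .SDcols, combined with element-wise OR.
mask <- DT[, Reduce(`|`, lapply(.SD, `==`, "y")), .SDcols = c("A", "B")]

# Subset with the mask; for two columns this selects the same rows as
# writing the condition out by hand.
by_mask <- DT[mask]
by_hand <- DT[A == "y" | B == "y"]
```

For just two columns the hand-written condition is arguably easier to read. The Reduce form mainly pays off when the list of columns is long or only known at run time. Note that this is a vector scan over the columns rather than a keyed binary search.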

Benchmarks

set.seed(24)
DT = data.table(A = sample(letters, 1e7, replace = TRUE), 
                B = sample(letters, 1e7, replace = TRUE), value = 1:1e7)

DT1 <- copy(DT)
system.time({
    setkey(DT, A, B)
    uA <- unique(DT[, A])
    rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])
})
#  user  system elapsed 
#  1.14    0.19    0.87 

system.time({
    DT1[DT1[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
})
#  user  system elapsed 
#  0.17    0.02    0.19
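If join-based lookups are preferred over a vector scan, another option is to look up each column separately with on= and take the union of the matching row numbers. This is only a sketch in the spirit of the Union function in the question, not something timed here; the names ia, ib and either are illustrative.

```r
library(data.table)

DT <- data.table(A = c("x", "y", "y", "z"),
                 B = c("y", "x", "x", "z"),
                 value = 1:4)
setkey(DT, A, B)

# Row numbers where A == "y" and where B == "y", each found via a join.
# nomatch = 0L keeps unmatched lookups from contributing NA row numbers.
ia <- DT[.("y"), on = "A", which = TRUE, nomatch = 0L]
ib <- DT[.("y"), on = "B", which = TRUE, nomatch = 0L]

# union() removes row numbers found by both lookups, so a row with "y"
# in both columns is returned once; sort() keeps the original row order.
either <- DT[sort(union(ia, ib))]
```

Whether this beats the Reduce approach depends on the data and on how often the lookup is repeated, so it would need to be benchmarked on the real table.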