I'm new to data.table and seem to be missing something obvious. I have a table:
library(data.table)
DT = data.table(A = c("x","y","y","z"), B = c("y","x","x","z"), value = 1:4)
setkey(DT, A, B)
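Now I want to look up all rows where A or B is "y" (using binary search; my actual table is much larger and the operation has to be performed millions of times). I can't figure out how to do this in a single statement, because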
DT[.("y", "y"), nomatch=0]
only gives me the rows where (A & B) == "y" (but I want (A | B) == "y"). What I am doing at the moment is:
uA <- unique(DT[, A])
rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])
But I feel there must be a more intuitive way.
Thanks for your help!
n = 1e6
DT = data.table(A = sample(letters, n, replace = TRUE),
B = sample(letters, n, replace = TRUE), value = 1:n)
setkey(DT, A, B)
uA <- unique(DT[, A])
library(microbenchmark)
Union = function() {
  # Row numbers where A or B equals "y"
  mya = DT[A == "y", which = TRUE]
  myb = DT[B == "y", which = TRUE]
  # Subset by the union of both sets of row numbers
  DT[union(mya, myb)]
}
microbenchmark(
"reduce" = DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]],
"rbind" = rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0]),
"union" = Union()
)
Unit: milliseconds
expr min lq mean median uq max neval
reduce 9.922728 10.116613 11.422823 10.226871 11.803204 25.453557 100
rbind 2.596139 2.734751 2.916620 2.850199 3.113995 3.453326 100
union 5.393815 5.725917 6.221544 5.906222 6.758622 14.019206 100
Answer 0 (score: 1)
We can use Reduce with | to get a logical vector that checks, row by row, whether any of the columns named in .SDcols has the value 'y', and then use it to subset the rows:
DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
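To make the intermediate steps visible, here is a minimal sketch on the four-row table from the question. It assumes data.table is installed; the names `eq`, `keep` and `res` are only illustrative and do not appear in the original code.

```r
library(data.table)

# Four-row table from the question
DT <- data.table(A = c("x", "y", "y", "z"),
                 B = c("y", "x", "x", "z"),
                 value = 1:4)

# One logical vector per column: is the entry equal to "y"?
eq <- lapply(DT[, .(A, B)], `==`, "y")

# Combine the per-column vectors element-wise with OR
keep <- Reduce(`|`, eq)

# Keep the rows where at least one of A or B is "y"
res <- DT[keep]
print(res)
```

Note that this is a vector scan over both columns rather than a keyed binary search, so it does not need setkey at all.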
set.seed(24)
DT = data.table(A = sample(letters, 1e7, replace = TRUE),
B = sample(letters, 1e7, replace = TRUE), value = 1:1e7)
DT1 <- copy(DT)
system.time({
setkey(DT, A, B)
uA <- unique(DT[, A])
rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])
})
# user system elapsed
# 1.14 0.19 0.87
system.time({
DT1[DT1[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
})
# user system elapsed
# 0.17 0.02 0.19
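One caveat when comparing the two approaches: they do not necessarily return the same rows. A row with "y" in both A and B matches both keyed lookups, so the rbind version returns it twice, while the Reduce version returns it once. The sketch below illustrates this on a hypothetical table `DT2` (not from the question) that contains such a row.

```r
library(data.table)

# Hypothetical table that includes a row with "y" in both columns
DT2 <- data.table(A = c("x", "y", "y", "z"),
                  B = c("y", "y", "x", "z"),
                  value = 1:4)
setkey(DT2, A, B)

uA <- unique(DT2[, A])

# Two keyed lookups bound together: the (y, y) row matches both
by_rbind <- rbind(DT2[.(uA, "y"), nomatch = 0], DT2[.("y"), nomatch = 0])

# Logical OR over the columns: each matching row appears once
by_reduce <- DT2[DT2[, Reduce(`|`, lapply(.SD, `==`, "y")),
                     .SDcols = c("A", "B")]]

nrow(by_rbind)
nrow(by_reduce)
```

If duplicates matter, collecting row numbers with which = TRUE and taking their union, as in the Union() function from the question's benchmark, is one way to keep each row only once.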