我正在尝试在R中使用data.table
来实现大于和小于这样的有效子集:
library(data.table)
x = runif(10000, min = 1, max = 2)
rowname = seq(10000)
min.x = x - 0.0001
max.x = x + 0.0001
table = data.table(rowname, min.x, max.x)
system.time(x.candidates <- lapply(x, function(x) {table[x > min.x & x < max.x, rowname]}))
# -> user system elapsed
# 4.87 0.00 4.90
table2 = data.table(rowname, min.x, max.x)
setindex(table2, min.x)
setindex(table2, max.x)
system.time(x.candidates2 <- lapply(x, function(x) {table2[x > min.x & x < max.x, rowname]}))
# -> user system elapsed
# 4.90 0.00 4.92
table3 = data.frame(rowname, min.x, max.x)
system.time(x.candidates3 <- lapply(x, function(x) {table3[x > table3$min.x & x < table3$max.x, "rowname"]}))
# -> user system elapsed
# 1.77 0.00 1.78
但是,当设置索引和data.frame更快时,我看不到加速。通常甚至有可能在data.table或R中更高效地编写此代码?
最佳解决方案
@eddi指出,这是使用.EACHI的正确方法:
table4 = data.table(rowname, min.x, max.x)
system.time(x.candidates4 <- table4[data.table(x), on = .(min.x < x, max.x > x), list(rowname = list(rowname)), by = .EACHI])
# user system elapsed
# 0.02 0.00 0.01
答案 0 :(得分:3)
您做错了。 [.data.table
所执行的循环调用lapply
会很慢,因为该函数有很多开销,而对于您执行的微小操作而言,开销不值得。正确的方法是进行非等额联接:
table[data.table(x), on = .(min.x < x, max.x > x), rowname, by = .EACHI]
# min.x max.x rowname
# 1: 1.084668 1.084668 1
# 2: 1.293461 1.293461 7734
# 3: 1.293461 1.293461 739
# 4: 1.293461 1.293461 2
# 5: 1.293461 1.293461 3757
# ---
#30216: 1.324366 1.324366 9999
#30217: 1.324366 1.324366 9635
#30218: 1.869469 1.869469 8740
#30219: 1.869469 1.869469 3302
#30220: 1.869469 1.869469 10000
以上是瞬时的。当前的列命名有点不幸(有一个可解决此问题的FR)-想象一下前两列被命名为x
会增加清晰度。
答案 1 :(得分:1)
或者,您也可以进行非装备 self 加入:
library(data.table)
n <- 10
set.seed(777)
x <- runif(n, min = 1, max = 2)
rowname <- seq(n)
min.x <- x - 0.0001
max.x <- x + 0.0001
DT <- data.table(rowname, x, min.x, max.x)
DT[DT, on = .(min.x < x, max.x > x), .(i.rowname, x.rowname), by = .EACHI][]
min.x max.x i.rowname x.rowname 1: 1.687857 1.687857 1 1 2: 1.492193 1.492193 2 2 3: 1.345116 1.345116 3 7 4: 1.345116 1.345116 3 3 5: 1.995050 1.995050 4 4 6: 1.695267 1.695267 5 5 7: 1.010700 1.010700 6 6 8: 1.345016 1.345016 7 7 9: 1.345016 1.345016 7 3 10: 1.172049 1.172049 8 8 11: 1.949361 1.949361 9 9 12: 1.249193 1.249193 10 10
但是,eddi's answer稍微快一点。
library(bench)
bm <- press(
n = 10^c(2:4),
{
set.seed(123)
x <- runif(n, min = 1, max = 2)
rowname <- seq(n)
min.x <- x - 0.0001
max.x <- x + 0.0001
table <- data.table(rowname, min.x, max.x)
table2 <- data.table(rowname, min.x, max.x)
setindex(table2, min.x)
setindex(table2, max.x)
table3 <- data.frame(rowname, min.x, max.x)
DT <- data.table(rowname, x, min.x, max.x)
mark(
Benni1 = lapply(x, function(x) {table[x > min.x & x < max.x, rowname]}),
Benni2 = lapply(x, function(x) {table2[x > min.x & x < max.x, rowname]}),
Benni3 = lapply(x, function(x) {table3[x > table3$min.x & x < table3$max.x, "rowname"]}),
Eddi = table[data.table(x), on = .(min.x < x, max.x > x), rowname, by = .EACHI],
Uwe = DT[DT, on = .(min.x < x, max.x > x), .(i.rowname, x.rowname), by = .EACHI],
check = FALSE
)
}
)
library(ggplot2)
autoplot(bm)
请注意对数时间刻度。
内存消耗也有巨大差异:
setDT(bm)[n == max(n), 1:11]
expression n min mean median max itr/sec mem_alloc n_gc n_itr total_time 1: Benni1 10000 6.78s 6.78s 6.78s 6.78s 0.1475977 1.5GB 48 1 6.78s 2: Benni2 10000 6.8s 6.8s 6.8s 6.8s 0.1470747 1.5GB 48 1 6.8s 3: Benni3 10000 1.8s 1.8s 1.8s 1.8s 0.5563497 1.49GB 32 1 1.8s 4: Eddi 10000 12.91ms 13.94ms 13.71ms 17.96ms 71.7622591 1.01MB 1 36 501.66ms 5: Uwe 10000 17.21ms 18.42ms 18.11ms 22.54ms 54.2845397 1.04MB 1 28 515.8ms
Benni's approaches分配的内存是Eddi或我的内存的1500倍。