I have a data.table containing daily observations for many points in space:
# day point_id x y var1 var2 var3
# 1: 1 1 0.8179541 0.0220291 0.0903821 0.7306495 0.52508116
# 2: 1 2 0.1798340 0.8267741 0.5634569 0.6738693 0.88823133
# 3: 1 3 0.4204264 0.7223463 0.4948849 0.6911563 0.27390131
# ...
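I wrote a convenience function that groups by point ID and summarizes the values in one column. I use get(col_name) to select the column I want to summarize: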
summarize <- function(dtable, col_name) {
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(get(col_name)),
    mean = mean(get(col_name)),
    max = max(get(col_name))
  ), by = point_id]
}
My function is noticeably slower than specifying the column directly:
system.time(summarize(dtable, "var1"))
# user system elapsed
# 1.140 0.000 1.139
system.time(
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(var1),
    mean = mean(var1),
    max = max(var1)
  ), by = point_id]
)
# user system elapsed
# 0.344 0.000 0.344
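Why is this, and what is the best way to speed up the function?
I could build the expression as a string, substitute in the desired column name, and then parse() and eval() it, but I imagine there is a better way.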
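For reference, the string-building idea mentioned above could look roughly like the sketch below. The helper name summarize_parsed is hypothetical, and the sketch relies on data.table's documented idiom of passing eval() of a stored expression as j. It has not been benchmarked here, so whether it matches the speed of the hard-coded version is an open question.

```r
library(data.table)

# Hypothetical helper (not part of data.table): build the j expression as
# text, parse it once, then let data.table evaluate it per group.
summarize_parsed <- function(dtable, col_name) {
  # Only accept a single existing column name, so that no arbitrary text
  # ever reaches parse().
  stopifnot(
    is.character(col_name),
    length(col_name) == 1L,
    col_name %in% names(dtable)
  )
  # Backticks keep non-syntactic column names (e.g. "my var") parseable.
  j_text <- sprintf(
    ".(x = mean(x), y = mean(y), min = min(`%1$s`), mean = mean(`%1$s`), max = max(`%1$s`))",
    col_name
  )
  j_expr <- parse(text = j_text)[[1L]]
  # After substitution, j refers to the column by its plain name.
  dtable[, eval(j_expr), by = point_id]
}
```

It would be called the same way as the original function, for example summarize_parsed(dtable, "var1").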
Full example:
library(data.table)
# Simulate some data
points <- data.table(
  id = 1:50000,
  x = runif(50000),
  y = runif(50000)
)
dtable <- CJ(day = 1:100, point_id = points$id)[points, on = c(point_id = "id")]
# runif() takes the number of draws; passing nrow() directly avoids building 1:nrow(dtable)
dtable[, var1 := runif(nrow(dtable))]
dtable[, var2 := runif(nrow(dtable))]
dtable[, var3 := runif(nrow(dtable))]
setkey(dtable, day, point_id)
# This is fast
system.time(
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(var1),
    mean = mean(var1),
    max = max(var1)
  ), by = point_id]
)
# user system elapsed
# 0.344 0.000 0.344
# Why is this slower?
summarize <- function(dtable, col_name) {
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(get(col_name)),
    mean = mean(get(col_name)),
    max = max(get(col_name))
  ), by = point_id]
}
system.time(summarize(dtable, "var1"))
# user system elapsed
# 1.140 0.000 1.139