data.table: grouping and using `get` to specify the column to summarize is slow

Asked: 2018-06-03 12:38:29

Tags: r data.table

I have a data.table containing daily observations for many points in space:

#    day point_id         x         y      var1      var2       var3
# 1:   1        1 0.8179541 0.0220291 0.0903821 0.7306495 0.52508116
# 2:   1        2 0.1798340 0.8267741 0.5634569 0.6738693 0.88823133
# 3:   1        3 0.4204264 0.7223463 0.4948849 0.6911563 0.27390131
# ...

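I wrote a convenience function that groups by point ID and summarizes the values in one column. I use get(col_name) to identify the column I want to summarize: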

summarize <- function(dtable, col_name) {
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(get(col_name)),
    mean = mean(get(col_name)),
    max = max(get(col_name))
  ), by = point_id]
}

My function is noticeably slower than specifying the column directly:

system.time(summarize(dtable, "var1"))
#  user  system elapsed
# 1.140   0.000   1.139

system.time(
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(var1),
    mean = mean(var1),
    max = max(var1)
  ), by = point_id]
)
#  user  system elapsed 
# 0.344   0.000   0.344 

Why is this, and what is the best way to speed up the function?

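I could build the expression as a string, substitute in the desired column name, and then parse() and eval() it, but I imagine there is a better way.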
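A minimal sketch of that expression-building idea, using substitute() on a language object instead of pasting strings. The name summarize_expr is hypothetical and not part of the original code. The j expression it builds contains only plain column names, which is presumably what data.table needs in order to apply its internal grouping optimizations, but whether this matches the speed of the direct call is not measured here.

```r
library(data.table)

# Hypothetical variant of summarize(): the column name is substituted into
# the j expression as a symbol, so no get() call remains inside j.
summarize_expr <- function(dtable, col_name) {
  j_expr <- substitute(
    .(
      x = mean(x),
      y = mean(y),
      min = min(col),
      mean = mean(col),
      max = max(col)
    ),
    list(col = as.name(col_name))  # `col` is a placeholder replaced by e.g. var1
  )
  # data.table evaluates eval(j_expr) in j as if the expression were typed directly
  dtable[, eval(j_expr), by = point_id]
}
```

It is called the same way as the original function, e.g. summarize_expr(dtable, "var1"). Passing verbose = TRUE inside the `[` call is one way to check which optimizations data.table applies to each version.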

Full example:

library(data.table)

# Simulate some data
points <- data.table(
  id = 1:50000,
  x = runif(50000),
  y = runif(50000)
)
dtable <- CJ(day = 1:100, point_id = points$id)[points, on = c(point_id = "id")]
dtable[, var1 := runif(nrow(dtable))]
dtable[, var2 := runif(nrow(dtable))]
dtable[, var3 := runif(nrow(dtable))]
setkey(dtable, day, point_id)

# This is fast
system.time(
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(var1),
    mean = mean(var1),
    max = max(var1)
  ), by = point_id]
)
#  user  system elapsed 
# 0.344   0.000   0.344 

# Why is this slower?
summarize <- function(dtable, col_name) {
  dtable[, .(
    x = mean(x),
    y = mean(y),
    min = min(get(col_name)),
    mean = mean(get(col_name)),
    max = max(get(col_name))
  ), by = point_id]
}
system.time(summarize(dtable, "var1"))
#  user  system elapsed
# 1.140   0.000   1.139

0 answers:

No answers yet