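I am using a simulated dataset with many groups (more than 2 million), and for each group I want to count the total number of observations and the number of observations above a threshold (here, 2).
It seems to be much faster when I create a flag variable first, especially with dplyr, and a little faster with data.table.
Why does this happen? How does it work in the background in each case?
See my examples below.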
Simulated dataset
# create an example dataset
set.seed(318)
N = 3000000 # number of rows
dt = data.frame(id = sample(1:5000000, N, replace = TRUE),
                value = runif(N, 0, 10))
Using dplyr
library(dplyr)
# calculate summary variables for each group
t = proc.time()
dt2 = dt %>%
  group_by(id) %>%
  summarise(N = n(),
            N2 = sum(value > 2))
proc.time() - t
# user system elapsed
# 51.70 0.06 52.11
# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt2 = dt %>%
  mutate(flag = ifelse(value > 2, 1, 0)) %>%
  group_by(id) %>%
  summarise(N = n(),
            N2 = sum(flag))
proc.time() - t
# user system elapsed
# 3.40 0.16 3.55
Using data.table
library(data.table)
# set as data table
dt2 = setDT(dt, key = "id")
# calculate summary variables for each group
t = proc.time()
dt3 = dt2[, .(N = .N,
              N2 = sum(value > 2)), by = id]
proc.time() - t
# user system elapsed
# 1.93 0.00 1.94
# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt3 = dt2[, flag := ifelse(value > 2, 1, 0)][, .(N = .N,
                                                 N2 = sum(flag)), by = id]
proc.time() - t
# user system elapsed
# 0.33 0.04 0.39
Answer 0 (score: 1)
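The problem with dplyr is that the sum function is used with an expression and a large number of IDs/groups. From what Arun says in the comments, I guess the issue with data.table is similar.
Consider the code below: I reduced it to the bare minimum needed to illustrate the problem. dplyr is slow when summing an expression, even if the expression involves only the identity function, so the performance issue has nothing to do with the greater-than comparison operator. In contrast, dplyr is fast when summing a vector. An even greater performance gain is achieved by reducing the number of IDs/groups from one million to ten.
The reason is that hybrid evaluation, i.e., evaluation in C++, works only if sum is used with a vector. With an expression as the argument, the evaluation is done in R, which adds overhead for each group. The details are in the linked vignette. From the profile of the code, the overhead seems to come mainly from the tryCatch error-handling function.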
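The per-group overhead described above can also be illustrated outside dplyr. What follows is only a base R analogy, not dplyr's internal mechanism. It assumes a small toy dataset, and the sizes and the names per_group and vectorised are made up for illustration. It compares calling an R function once per group with doing the comparison once on the whole column and then taking a grouped sum with rowsum(). Both give the same counts, but with many groups the first form pays the R-level call cost once per group.

```r
# Toy data: many small groups (sizes chosen only for illustration)
set.seed(318)
n <- 20000
id <- sample(1:5000, n, replace = TRUE)
value <- runif(n, 0, 10)

# Per-group evaluation: the anonymous R function is called once per group,
# so the comparison `v > 2` is re-evaluated for every group separately
per_group <- vapply(split(value, id), function(v) sum(v > 2), integer(1))

# Vectorised flag: the comparison runs once over the whole column,
# then rowsum() adds up the flags within each group in compiled code
flag <- as.integer(value > 2)
vectorised <- rowsum(flag, id)[, 1]

# Both approaches count the same observations in every group
stopifnot(all(per_group[names(vectorised)] == vectorised))
```

This mirrors why creating the flag column first helps: the grouped step then only has to sum an existing vector.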
##########################
### many different IDs ###
##########################
df <- data.frame(id = 1:1e6, value = runif(1e6))
# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
# user system elapsed
# 80.492 0.368 83.251
# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
# user system elapsed
# 1.264 0.004 1.279
#########################
### few different IDs ###
#########################
df$id <- rep(1:10, each = 1e5)
# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
# user system elapsed
# 0.088 0.000 0.093
# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
# user system elapsed
# 0.072 0.004 0.077
#################
### profiling ###
#################
df <- data.frame(id = 1:1e6, value = runif(1e6))
profvis::profvis({ df %>% group_by(id) %>% summarise(sum(identity(value))) })
Profile of the code: [profvis output image not available]