How to optimize sapply in R to compute a running total on a data frame

Time: 2015-03-18 17:03:01

Tags: r performance dataframe sapply processing-efficiency

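I wrote a function in R to compute a cumulative total by month number, but the execution time of my approach grows dramatically (seemingly exponentially) as the dataset gets larger. I am a novice R programmer; can you help me make it more efficient? Here is the function and the way I call it: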

accumulate <- function(recordnum,df){
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}
set.seed(42)
datalength = 10
df <- data.frame(measurement = runif(1:datalength),
                 subject=rep(c("dog","cat"),each =datalength/2),
                 month=rep(seq(datalength/2,1,by=-1)))
system.time(df$cumulative <- sapply(1:datalength,accumulate,df))

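Input data frame (the measurement values shown here differ from those in the output below, so they appear to come from a different random draw):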

> df
   measurement subject month
1    0.4577418     dog     5
2    0.7191123     dog     4
3    0.9346722     dog     3
4    0.2554288     dog     2
5    0.4622928     dog     1
6    0.9400145     cat     5
7    0.9782264     cat     4
8    0.1174874     cat     3
9    0.4749971     cat     2
10   0.5603327     cat     1

Output data frame:

> df
   measurement subject month cumulative
1    0.9148060     dog     5  3.6102141
2    0.9370754     dog     4  2.6954081
3    0.2861395     dog     3  1.7583327
4    0.8304476     dog     2  1.4721931
5    0.6417455     dog     1  0.6417455
6    0.5190959     cat     5  2.7524079
7    0.7365883     cat     4  2.2333120
8    0.1346666     cat     3  1.4967237
9    0.6569923     cat     2  1.3620571
10   0.7050648     cat     1  0.7050648

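Note that the cumulative column shows the running total of all measurements for the subject up to and including the current month. The function does not require the data frame to be sorted. With datalength set to 100 the elapsed time is 0.3 seconds; with 1,000 it is 0.58; with 10,000 it is 27.72. I need this to run on 200K+ records. Thanks!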

4 Answers:

Answer 0 (score: 5)

dplyr makes this easy:

library(dplyr)
df %>%
    group_by(subject) %>%
    arrange(month) %>%
    mutate(cumulative = cumsum(measurement))

Source: local data frame [10 x 4]
Groups: subject

   measurement subject month cumulative
1    0.7050648     cat     1  0.7050648
2    0.6569923     cat     2  1.3620571
3    0.1346666     cat     3  1.4967237
4    0.7365883     cat     4  2.2333120
5    0.5190959     cat     5  2.7524079
6    0.6417455     dog     1  0.6417455
7    0.8304476     dog     2  1.4721931
8    0.2861395     dog     3  1.7583327
9    0.9370754     dog     4  2.6954081
10   0.9148060     dog     5  3.6102141

If you are after raw performance, though, you may want to use data.table:

library(data.table)
setDT(df)[order(month), cumulative := cumsum(measurement), by=subject]    

#     measurement subject month cumulative
#  1:   0.7050648     cat     1  0.7050648
#  2:   0.6569923     cat     2  1.3620571
#  3:   0.1346666     cat     3  1.4967237
#  4:   0.7365883     cat     4  2.2333120
#  5:   0.5190959     cat     5  2.7524079
#  6:   0.6417455     dog     1  0.6417455
#  7:   0.8304476     dog     2  1.4721931
#  8:   0.2861395     dog     3  1.7583327
#  9:   0.9370754     dog     4  2.6954081
# 10:   0.9148060     dog     5  3.6102141

Answer 1 (score: 3)

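This is non-destructive, i.e. the original df is not modified. No packages are used. The original row order of df is preserved; if that does not matter, the [order(o), ] at the end of the last line can be omitted.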

o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]

giving:

   measurement subject month cumulative
1   0.37955924     dog     5  2.2580530
2   0.43577158     dog     4  1.8784938
3   0.03743103     dog     3  1.4427222
4   0.97353991     dog     2  1.4052912
5   0.43175125     dog     1  0.4317512
6   0.95757660     cat     5  4.0751151
7   0.88775491     cat     4  3.1175385
8   0.63997877     cat     3  2.2297836
9   0.97096661     cat     2  1.5898048
10  0.61883821     cat     1  0.6188382
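
As a sanity check, here is a minimal sketch that compares the one-liner above with the question's accumulate function on the question's seeded sample data. The names reference and fast are only illustrative. The two approaches agree when each subject has at most one row per month, as in the sample; with repeated months the question's function gives tied rows the same total, while cumsum does not.

```r
# accumulate() as defined in the question
accumulate <- function(recordnum, df) {
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}

# Same sample data as the question (seed 42, 10 rows)
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))

# Reference result from the slow row-by-row approach
reference <- sapply(seq_len(nrow(df)), accumulate, df)

# Sort, take the grouped cumulative sum, then restore the original row order
o <- order(df$subject, df$month)
fast <- transform(df[o, ],
                  cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]

# TRUE when both approaches produce the same running totals
isTRUE(all.equal(fast$cumulative, reference))
```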

Answer 2 (score: 1)

Why not use the built-in R functions by and cumsum instead of a custom function?

df <- df[order(df$subject,df$month),]
df <- cbind(df,
            cumulative=do.call(what=c,
                               args=by(data=df$measurement,
                               INDICES=df$subject,
                               FUN=cumsum)))
print(df)

   measurement subject month cumulative
10   0.7050648     cat     1  0.7050648
9    0.6569923     cat     2  1.3620571
8    0.1346666     cat     3  1.4967237
7    0.7365883     cat     4  2.2333120
6    0.5190959     cat     5  2.7524079
5    0.6417455     dog     1  0.6417455
4    0.8304476     dog     2  1.4721931
3    0.2861395     dog     3  1.7583327
2    0.9370754     dog     4  2.6954081
1    0.9148060     dog     5  3.6102141

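cumsum computes the cumulative sum, and by lets you process the data group by group (it returns a list; the alternative is aggregate, which gives you a data frame). As long as the data are sorted correctly, this gives the right result.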
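
To illustrate why the sorting matters, here is a small sketch on the question's seeded sample data. It shows that by arranges its groups in sorted subject order rather than in order of first appearance, so the pasted-together values only line up with the rows once df has been sorted by subject and month. The names pieces, group_order, sorted and reference are made up for this example.

```r
# Same sample data as the question (seed 42, 10 rows, dog listed before cat)
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))

# by() returns one cumsum vector per subject, with groups in sorted order
pieces <- by(data = df$measurement, INDICES = df$subject, FUN = cumsum)
group_order <- dimnames(pieces)[[1]]   # cat comes before dog here

# After sorting by subject and month, the concatenated pieces match the rows
sorted <- df[order(df$subject, df$month), ]
sorted$cumulative <- unname(do.call(what = c,
                                    args = by(data = sorted$measurement,
                                              INDICES = sorted$subject,
                                              FUN = cumsum)))

# Reference: total of all measurements for the same subject up to this month
reference <- sapply(seq_len(nrow(sorted)), function(i) {
    keep <- sorted$subject == sorted$subject[i] & sorted$month <= sorted$month[i]
    sum(sorted$measurement[keep])
})
isTRUE(all.equal(sorted$cumulative, reference))
```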

Answer 3 (score: 1)

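This function takes vectors of measurements and months, works out the ordering that sorts the data by month, computes the cumulative sum of the measurements in month order, and then returns the result in the original order (using the fact that (x[o])[order(o)] == x).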

FUN <- function(measure, month) {
    o <- order(month)
    cumsum(measure[o])[order(o)]
}
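
A tiny worked example of that identity, and of what FUN returns, may help. The vectors x, measure and month are arbitrary values chosen for illustration.

```r
# Sorting with o and then indexing with order(o) undoes the sort
x <- c(30, 10, 20, 40)
o <- order(x)                  # 2 3 1 4
sorted <- x[o]                 # 10 20 30 40
restored <- sorted[order(o)]   # back to 30 10 20 40
identical(restored, x)

# FUN from above: running total in month order, reported in input order
FUN <- function(measure, month) {
    o <- order(month)
    cumsum(measure[o])[order(o)]
}
measure <- c(5, 1, 3)
month   <- c(3, 1, 2)
result <- FUN(measure, month)  # 9 1 4: month 3 sees everything, month 1 only itself
```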

So if you split the measurements and the months into lists by subject, you can map each element from its old values to the new ones:

Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))

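Provided the implied "geometry" is consistent, split()<- does the bookkeeping so that the list of values is assigned to the right positions in the vector: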

df$cumulative <- NA_real_   # or add this column to df's construction
split(df$cumulative, df$subject) <- 
    Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
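
Here is a minimal sketch of that bookkeeping on a toy vector, with made-up names (values, group, out). Because both split calls use the same grouping vector, each per-group result lands back in the positions its inputs came from.

```r
values <- c(1, 2, 3, 4, 5)
group  <- c("b", "a", "b", "a", "b")   # interleaved groups, deliberately unsorted

# Pre-allocate the result, then assign per-group cumulative sums in place
out <- rep(NA_real_, length(values))
split(out, group) <- lapply(split(values, group), cumsum)

# Group a (positions 2 and 4) gets 2, 6; group b (positions 1, 3, 5) gets 1, 4, 9
out   # 1 2 4 6 9
```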

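I think this is, so far, the only solution that preserves the original order of the data (though a step could probably be added to the other solutions to do the same...).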

This seems to scale linearly, at least once the number of rows gets large:

f0 <- function(df) {
    split(df$cumulative, df$subject) <- 
        Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
    df
}

df <- lapply(10^(3:6), function(datalength) {
    data.frame(measurement = runif(1:datalength),
               subject=rep(c("dog","cat"),each =datalength/2),
               month=rep(seq(datalength/2,1,by=-1)),
               cumulative=rep(NA_real_, datalength))
})

library(microbenchmark)

Then:

> microbenchmark(f0(df[[1]]), f0(df[[2]]), f0(df[[3]]), f0(df[[4]]))
Unit: microseconds
        expr        min          lq        mean      median          uq        max neval
 f0(df[[1]])    503.076    523.5275    576.4077    574.7825    612.9585    913.945   100
 f0(df[[2]])   2701.103   2769.3830   2869.0045   2847.1190   2922.0120   3296.594   100
 f0(df[[3]])  26673.878  27184.7980  27894.5087  27547.5595  28595.6775  35015.903   100
 f0(df[[4]]) 283416.456 285104.5225 292142.5274 290043.3785 295415.6995 342556.407   100
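
One way to read these numbers is to look at how the median time changes for each tenfold increase in rows. The sketch below simply redoes that arithmetic on the medians reported above; the variable names are illustrative.

```r
# Row counts used in the benchmark and the median timings reported above
n_rows    <- 10^(3:6)
median_us <- c(574.7825, 2847.1190, 27547.5595, 290043.3785)

# Growth factor for each tenfold increase in rows
ratio <- median_us[-1] / median_us[-length(median_us)]
round(ratio, 2)     # 4.95 9.68 10.53

# Median cost per row in microseconds, which levels off for the larger inputs
per_row <- median_us / n_rows
round(per_row, 3)
```

A factor close to 10 per tenfold increase in rows is what linear scaling would look like; the smaller factor at the low end suggests fixed overhead dominates for 1,000 rows.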