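I wrote a function in R to compute a running total by month, but the execution time of my approach grows exponentially as the data set gets larger. I'm a novice R programmer; can you help me make it more efficient?
The function and how I call it: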
accumulate <- function(recordnum, df) {
  sumthese <- (df$subject == df$subject[recordnum]) &
    (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}
set.seed(42)
datalength = 10
df <- data.frame(measurement = runif(1:datalength),
                 subject = rep(c("dog", "cat"), each = datalength/2),
                 month = rep(seq(datalength/2, 1, by = -1)))
system.time(df$cumulative <- sapply(1:datalength, accumulate, df))
Input data frame:
> df
measurement subject month
1 0.4577418 dog 5
2 0.7191123 dog 4
3 0.9346722 dog 3
4 0.2554288 dog 2
5 0.4622928 dog 1
6 0.9400145 cat 5
7 0.9782264 cat 4
8 0.1174874 cat 3
9 0.4749971 cat 2
10 0.5603327 cat 1
Output data frame:
> df
measurement subject month cumulative
1 0.9148060 dog 5 3.6102141
2 0.9370754 dog 4 2.6954081
3 0.2861395 dog 3 1.7583327
4 0.8304476 dog 2 1.4721931
5 0.6417455 dog 1 0.6417455
6 0.5190959 cat 5 2.7524079
7 0.7365883 cat 4 2.2333120
8 0.1346666 cat 3 1.4967237
9 0.6569923 cat 2 1.3620571
10 0.7050648 cat 1 0.7050648
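Note that the cumulative column shows the running total of all measurements up to and including the current month. The function does not require the data frame to be sorted. When datalength is 100, the elapsed time is 0.3; at 1,000 it is 0.58; at 10,000 it is 27.72. I need this to run on 200K+ records. Thanks!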
Answer 0 (score: 5)
dplyr makes this easy:
library(dplyr)
df %>%
group_by(subject) %>%
arrange(month) %>%
mutate(cumulative = cumsum(measurement))
Source: local data frame [10 x 4]
Groups: subject
measurement subject month cumulative
1 0.7050648 cat 1 0.7050648
2 0.6569923 cat 2 1.3620571
3 0.1346666 cat 3 1.4967237
4 0.7365883 cat 4 2.2333120
5 0.5190959 cat 5 2.7524079
6 0.6417455 dog 1 0.6417455
7 0.8304476 dog 2 1.4721931
8 0.2861395 dog 3 1.7583327
9 0.9370754 dog 4 2.6954081
10 0.9148060 dog 5 3.6102141
If you are after raw performance, though, you may want to use data.table:
library(data.table)
setDT(df)[order(month), cumulative := cumsum(measurement), by=subject]
# measurement subject month cumulative
# 1: 0.7050648 cat 1 0.7050648
# 2: 0.6569923 cat 2 1.3620571
# 3: 0.1346666 cat 3 1.4967237
# 4: 0.7365883 cat 4 2.2333120
# 5: 0.5190959 cat 5 2.7524079
# 6: 0.6417455 dog 1 0.6417455
# 7: 0.8304476 dog 2 1.4721931
# 8: 0.2861395 dog 3 1.7583327
# 9: 0.9370754 dog 4 2.6954081
# 10: 0.9148060 dog 5 3.6102141
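Answer 1 (score: 3)
This is non-destructive, i.e. the original df is not modified. No packages are used. The original order of df's rows is preserved; if that does not matter, the [order(o), ] at the end of the last line can be omitted.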
o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]
giving:
measurement subject month cumulative
1 0.37955924 dog 5 2.2580530
2 0.43577158 dog 4 1.8784938
3 0.03743103 dog 3 1.4427222
4 0.97353991 dog 2 1.4052912
5 0.43175125 dog 1 0.4317512
6 0.95757660 cat 5 4.0751151
7 0.88775491 cat 4 3.1175385
8 0.63997877 cat 3 2.2297836
9 0.97096661 cat 2 1.5898048
10 0.61883821 cat 1 0.6188382
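As a sanity check, the sorted-cumsum approach above can be compared against the brute-force function from the question. The following is only a sketch in base R: it assumes each subject has at most one row per month (with tied months the two methods would differ, because the original function includes every tied row in the total), and the variable names slow, fast, agree and same_order are introduced here for illustration.

```r
# Brute-force reference from the question: scans the whole data frame
# once per row, so the total work grows with the square of the row count.
accumulate <- function(recordnum, df) {
  sumthese <- (df$subject == df$subject[recordnum]) &
    (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}

set.seed(42)
n <- 10
df <- data.frame(measurement = runif(n),
                 subject = rep(c("dog", "cat"), each = n / 2),
                 month = rep(seq(n / 2, 1, by = -1), times = 2))

# Row-by-row running totals (slow reference).
slow <- sapply(seq_len(n), accumulate, df)

# Sort once, take a grouped cumsum, then undo the sort with order(o).
o <- order(df$subject, df$month)
fast <- transform(df[o, ],
                  cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]

# Compare with a tolerance, since the sums are accumulated in a different order.
agree <- isTRUE(all.equal(slow, fast$cumulative))
# The rows come back in their original positions.
same_order <- identical(rownames(fast), rownames(df))
```

If both agree and same_order are TRUE, the fast version reproduces the original column without reordering df.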
Answer 2 (score: 1)
Why not use the built-in R functions by and cumsum instead of a custom function?
df <- df[order(df$subject,df$month),]
df <- cbind(df,
cumulative=do.call(what=c,
args=by(data=df$measurement,
INDICES=df$subject,
FUN=cumsum)))
print(df)
measurement subject month cumulative
10 0.7050648 cat 1 0.7050648
9 0.6569923 cat 2 1.3620571
8 0.1346666 cat 3 1.4967237
7 0.7365883 cat 4 2.2333120
6 0.5190959 cat 5 2.7524079
5 0.6417455 dog 1 0.6417455
4 0.8304476 dog 2 1.4721931
3 0.2861395 dog 3 1.7583327
2 0.9370754 dog 4 2.6954081
1 0.9148060 dog 5 3.6102141
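cumsum creates the cumulative sums, and by lets you process the data group by group (it returns a list; the alternative is aggregate, which gives you a data frame). As long as the data is sorted correctly, this gives you the right result.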
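Answer 3 (score: 1)
This function takes vectors of measurements and months, works out the ordering that sorts the data by month, computes the cumulative sum of the measurements in that order, and returns the result in the original order (using the fact that (x[o])[order(o)] == x):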
FUN <- function(measure, month) {
o <- order(month)
cumsum(measure[o])[order(o)]
}
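The identity that FUN relies on can be checked on a small vector. This is a minimal sketch with made-up values (x, measure and month are hypothetical, not taken from the question's data): order(o) is the inverse of the permutation o, so indexing by it puts sorted values back in their original positions.

```r
x <- c(30, 10, 20, 40)        # hypothetical values in arbitrary order
o <- order(x)                 # permutation that sorts x: 2 3 1 4
sorted <- x[o]                # 10 20 30 40
restored <- sorted[order(o)]  # order(o) is the inverse permutation of o
ok <- identical(restored, x)

# The same trick applied to a running total, with FUN as defined in the answer.
FUN <- function(measure, month) {
  o <- order(month)
  cumsum(measure[o])[order(o)]
}
measure <- c(5, 1, 2)         # hypothetical measurements
month   <- c(3, 1, 2)         # months, deliberately unsorted
res <- FUN(measure, month)    # running totals aligned with the input rows
```

Here the row for month 3 receives the total of all three measurements, while the rows for months 1 and 2 receive the totals up to their own month, all without reordering the input.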
So if you split the measurements and months into lists by subject, you can map each element from its old values to the new values:
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
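Provided the implied "geometry" is consistent, split()<- does the bookkeeping so that the list of values is assigned to the correct positions in the vector: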
df$cumulative <- NA_real_ # or add this column to df's construction
split(df$cumulative, df$subject) <-
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
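I think this is, so far, the only solution that preserves the original order of the data (though a step to do so could probably be added to the other solutions...).
It appears to scale linearly, at least once the number of rows gets large: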
f0 <- function(df) {
split(df$cumulative, df$subject) <-
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
df
}
df <- lapply(10^(3:6), function(datalength) {
data.frame(measurement = runif(1:datalength),
subject=rep(c("dog","cat"),each =datalength/2),
month=rep(seq(datalength/2,1,by=-1)),
cumulative=rep(NA_real_, datalength))
})
library(microbenchmark)
and then
> microbenchmark(f0(df[[1]]), f0(df[[2]]), f0(df[[3]]), f0(df[[4]]))
Unit: microseconds
expr min lq mean median uq
f0(df[[1]]) 503.076 523.5275 576.4077 574.7825 612.9585
f0(df[[2]]) 2701.103 2769.3830 2869.0045 2847.1190 2922.0120
f0(df[[3]]) 26673.878 27184.7980 27894.5087 27547.5595 28595.6775
f0(df[[4]]) 283416.456 285104.5225 292142.5274 290043.3785 295415.6995
max neval
913.945 100
3296.594 100
35015.903 100
342556.407 100