I have the following data frame:
id<-c(1,1,1,1,1,3,3,3,3)
spent<-c(10,20,30,40,50,60,70,80,90)
date<-c("11-11-07","11-11-07","23-11-07","12-12-08","17-12-08","11-11-07","23-11-07","23-11-07","16-01-08")
df<-data.frame(id,date,spent)
df$date2<-as.Date(as.character(df$date), format = "%d-%m-%y")
  id     date spent      date2
1  1 11-11-07    10 2007-11-11
2  1 11-11-07    20 2007-11-11
3  1 23-11-07    30 2007-11-23
4  1 12-12-08    40 2008-12-12
5  1 17-12-08    50 2008-12-17
6  3 11-11-07    60 2007-11-11
7  3 23-11-07    70 2007-11-23
8  3 23-11-07    80 2007-11-23
9  3 16-01-08    90 2008-01-16
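I need to calculate the sum of spent per id for each day (a running total within the day) and include it in the data frame as a new column, as follows: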
  id     date spent      date2 sum.spent
1  1 11-11-07    10 2007-11-11        10
2  1 11-11-07    20 2007-11-11        30
3  1 23-11-07    30 2007-11-23        30
4  1 12-12-08    40 2008-12-12        40
5  1 17-12-08    50 2008-12-17        50
6  3 11-11-07    60 2007-11-11        60
7  3 23-11-07    70 2007-11-23        70
8  3 23-11-07    80 2007-11-23       150
9  3 16-01-08    90 2008-01-16        90
The following script works fine (except for the first row, which is not important):
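df$spent2 <- NA
for (a in 2:9) {
  # Same id and same date as the previous row: add the previous row's spent
  if (df[a, 1] == df[a - 1, 1] && df[a, 4] == df[a - 1, 4]) {
    df[a, 5] <- df[a, 3] + df[a - 1, 3]
  } else {
    df[a, 5] <- df[a, 3]
  }
}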
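However, my actual data set has about 1.5 million rows, so the script above takes roughly 5 days to run. I was wondering whether you could suggest a more efficient way to write this code and achieve the same goal.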
Answer 0: (score: 6)
data.table is very fast, particularly on data sets of this size. With 1.5 million records, this should run quickly.
library(data.table)
df <- data.table(df)
# := adds the column by reference, so the result need not be reassigned
df[, sum.spent := cumsum(spent), by = list(id, date2)]
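For anyone who wants to try this on the sample data from the question, here is a minimal self-contained sketch. It assumes the data.table package is installed, and it uses setDT() rather than data.table(df) to convert the data frame in place, which avoids copying a large table. That swap is my own choice, not part of the answer above.

```r
library(data.table)

# Sample data from the question
df <- data.frame(
  id    = c(1, 1, 1, 1, 1, 3, 3, 3, 3),
  date  = c("11-11-07", "11-11-07", "23-11-07", "12-12-08", "17-12-08",
            "11-11-07", "23-11-07", "23-11-07", "16-01-08"),
  spent = c(10, 20, 30, 40, 50, 60, 70, 80, 90)
)
df$date2 <- as.Date(as.character(df$date), format = "%d-%m-%y")

# Convert to a data.table by reference (no copy is made)
setDT(df)

# Running total of spent within each (id, date2) group, added in place
df[, sum.spent := cumsum(spent), by = .(id, date2)]

print(df)
```

Because := updates the table by reference, the original row order is kept and no second copy of the 1.5 million rows is created.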
Answer 1: (score: 3)
Here is a base R solution:
df$sum.spent <- ave(df$spent,df$id,df$date2,FUN=cumsum)
The result I get differs from your expected answer, but I think mine is the correct one?
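One way to check is to run the ave() call on the sample data and compare it with a vectorized version of the loop from the question. The sketch below does both in base R. The names same_group, prev.row and expected are hypothetical, introduced only for illustration, and prev.row fills the first row with its own spent value, whereas the loop leaves it as NA.

```r
# Sample data from the question
df <- data.frame(
  id    = c(1, 1, 1, 1, 1, 3, 3, 3, 3),
  date  = c("11-11-07", "11-11-07", "23-11-07", "12-12-08", "17-12-08",
            "11-11-07", "23-11-07", "23-11-07", "16-01-08"),
  spent = c(10, 20, 30, 40, 50, 60, 70, 80, 90)
)
df$date2 <- as.Date(as.character(df$date), format = "%d-%m-%y")

# Running total within each (id, date2) group, as in the answer above
df$sum.spent <- ave(df$spent, df$id, df$date2, FUN = cumsum)

# Vectorized form of the question's loop: add only the previous row's
# spent when the previous row has the same id and the same date
n <- nrow(df)
same_group <- c(FALSE, df$id[-1] == df$id[-n] & df$date2[-1] == df$date2[-n])
df$prev.row <- df$spent + ifelse(same_group, c(0, df$spent[-n]), 0)

# Expected column as posted in the question
expected <- c(10, 30, 30, 40, 50, 60, 70, 150, 90)
print(all.equal(df$sum.spent, expected))
print(all.equal(df$prev.row, df$sum.spent))
```

On the nine sample rows as posted, both columns agree with the expected sum.spent. The two approaches diverge only when three or more rows share the same id and date: for spent values 10, 20, 30 in one group, cumsum gives 10, 30, 60, while the previous-row logic gives 10, 30, 50. Which of the two is wanted depends on what the total is meant to represent.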