Conditional cumulative sum with a dynamic condition

Time: 2019-03-31 15:55:05

Tags: r

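Good afternoon. I am trying to create a cumulative mean with a "twist": for each row I only want to average the rows dated strictly before that row's date (several rows can share the same date).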

I managed to do this the "dirty way" with a couple of custom functions, but it takes too long and is very inefficient. I am pretty sure there is a better way.

I was thinking of something along these lines:

averages <- DB %>% group_by(field1, field2) %>% mutate(Avg = cummean(??? * value1))

How can I access the current observation inside the cummean function?

The route I took was to build a logical vector for each subset:

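# TRUE for each element dated strictly before the last element of the subset
logicalvector <- logical(length(datevector))
for (i in seq_len(length(datevector) - 1)) {
  logicalvector[i] <- datevector[length(datevector)] > datevector[i]
}
logicalvector[length(datevector)] <- FALSE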

and then use it in another function to compute the mean.

A simple example would be:

df <- data.frame(id=1:5,Date=as.Date(c("2013-08-02","2013-08-02","2013-08-03","2013-08-03","2013-08-04")),Value=c(1,4,5,2,4))

id  Date    Value     accum mean
1  02/08/2013     1         0
2  02/08/2013     4         0
3  03/08/2013     5        2.5
4  03/08/2013     2        2.5
5  04/08/2013     4         3

Explanation:
There are no observations with a prior date for the first 2 observations, so their mean is 0.
The 3rd observation averages the 1st and 2nd, and so does the 4th.
The 5th observation averages all four earlier observations.

2 Answers:

Answer 0: (score: 2)

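This can be implemented as a complex self-join in SQL. The query joins to each row all rows with an earlier Date, then averages Value over the joined rows for each row. coalesce assigns 0 where the average would otherwise be NULL, that is, where no earlier rows exist.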

library(sqldf)

sqldf("select a.*, coalesce(avg(b.Value), 0) as mean
  from df as a 
  left join df as b on b.Date < a.Date
  group by a.rowid")

giving:

  id       Date Value mean
1  1 2013-08-02     1  0.0
2  2 2013-08-02     4  0.0
3  3 2013-08-03     5  2.5
4  4 2013-08-03     2  2.5
5  5 2013-08-04     4  3.0
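
To make the join logic concrete, here is a minimal base R sketch of the same computation, assuming the df from the question. It is not part of the sqldf answer. prior_mean is a hypothetical helper name introduced only for illustration. Like the SQL query, it compares every row with every other row, so it is meant to be readable rather than fast.

```r
df <- data.frame(
  id = 1:5,
  Date = as.Date(c("2013-08-02", "2013-08-02", "2013-08-03",
                   "2013-08-03", "2013-08-04")),
  Value = c(1, 4, 5, 2, 4)
)

# Hypothetical helper: mean of `value` over rows dated strictly before `d`,
# or 0 when there are no such rows (the role coalesce plays in the SQL).
prior_mean <- function(d, date, value) {
  earlier <- date < d
  if (any(earlier)) mean(value[earlier]) else 0
}

# Iterate over row indices so the Date class is preserved for the comparison.
df$mean <- vapply(
  seq_len(nrow(df)),
  function(i) prior_mean(df$Date[i], df$Date, df$Value),
  numeric(1)
)

df
```

The strict comparison date < d is what excludes rows sharing the current row's date.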

Answer 1: (score: 1)

Using data.table and lubridate, one option is:

library(data.table)
library(lubridate)
dt <- data.table(id=c(1:5))
dt$Date <- c("02/08/2013", "02/08/2013", "03/08/2013", "03/08/2013", "04/08/2013")
dt$Value <- c(1,4,5,2,4)
dt$Date <- dmy(dt$Date)

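# Mean of Value over rows dated strictly before d; 0 if there are none.
# Note: this name masks dplyr::cummean if dplyr is attached.
cummean <- function(d){
  if(nrow(dt[Date<d])>0)
    dt[Date<d, sum(Value)/.N]
  else 0
}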

dt[, accuMean:=mapply(cummean,Date)]

#    id    Date    Value accuMean
#1:  1 2013-08-02     1      0.0
#2:  2 2013-08-02     4      0.0
#3:  3 2013-08-03     5      2.5
#4:  4 2013-08-03     2      2.5
#5:  5 2013-08-04     4      3.0
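
Since the question mentions speed, here is a sketch of an alternative that avoids rescanning the table for every row. It is a different technique from the mapply approach above, written in base R and assuming there are no NA values in Date or Value. The idea is to total Value and count rows once per date, take running totals, and shift them by one date so each date only sees strictly earlier dates. All variable names are illustrative.

```r
df <- data.frame(
  id = 1:5,
  Date = as.Date(c("2013-08-02", "2013-08-02", "2013-08-03",
                   "2013-08-03", "2013-08-04")),
  Value = c(1, 4, 5, 2, 4)
)

dates <- sort(unique(df$Date))
idx <- match(df$Date, dates)                      # position of each row's date

day_sum <- as.vector(tapply(df$Value, idx, sum))  # total Value per date
day_n <- tabulate(idx, nbins = length(dates))     # number of rows per date

# Running totals over all dates strictly before each date (shifted by one).
prior_sum <- c(0, head(cumsum(day_sum), -1))
prior_n <- c(0, head(cumsum(day_n), -1))

# The first date has no earlier rows, so its mean is set to 0.
prior_mean <- ifelse(prior_n > 0, prior_sum / prior_n, 0)

# Map the per-date result back onto the original rows.
df$accuMean <- prior_mean[idx]

df
```

Because rows with the same date share one entry in prior_mean, ties are handled without any extra condition.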

Solution when there are several value columns:

library(data.table)
library(lubridate)
dt <- data.table(id=c(1:5))
dt$Date <- c("02/08/2013", "02/08/2013", "03/08/2013", "03/08/2013", "04/08/2013")
dt$Value_1 <- c(1,4,5,2,4)
dt$Value_2 <- c(3,2,0,1,2)
dt$Value_3 <- c(4,9,3,3,3)
dt$Date <- dmy(dt$Date)

cummean <- function(d,Value){
  if(nrow(dt[Date<d])>0)
    sum(dt[Date<d, Value, with=F])/dt[Date<d, .N]
  else 0
}

n <- 3
accuMean <- paste0("accuMean_", (1:n))
for(i in 1:n){
  print(i)
  dt[, (accuMean[i]):=mapply(cummean,Date,MoreArgs = list(paste0("Value_",i)))]
}

This assumes you have n columns named Value_i. In your case there are ten, so just set n = 10.
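
As a hedged alternative for the multi-column case, the sketch below computes the mean of strictly earlier dates from per-date running totals rather than filtering the table once per row and column. It is not the data.table code above but a base R illustration. It assumes the columns follow the Value_i naming pattern and contain no NA values, and the accuMean_i output names simply mirror the ones used above.

```r
df <- data.frame(
  id = 1:5,
  Date = as.Date(c("2013-08-02", "2013-08-02", "2013-08-03",
                   "2013-08-03", "2013-08-04")),
  Value_1 = c(1, 4, 5, 2, 4),
  Value_2 = c(3, 2, 0, 1, 2),
  Value_3 = c(4, 9, 3, 3, 3)
)

# Pick up every Value_i column, however many there are.
value_cols <- grep("^Value_", names(df), value = TRUE)

dates <- sort(unique(df$Date))
idx <- match(df$Date, dates)

# Row counts over strictly earlier dates are the same for every column.
day_n <- tabulate(idx, nbins = length(dates))
prior_n <- c(0, head(cumsum(day_n), -1))

for (col in value_cols) {
  day_sum <- as.vector(tapply(df[[col]], idx, sum))   # total per date
  prior_sum <- c(0, head(cumsum(day_sum), -1))        # totals before each date
  prior_mean <- ifelse(prior_n > 0, prior_sum / prior_n, 0)
  df[[sub("^Value_", "accuMean_", col)]] <- prior_mean[idx]
}

df
```

Selecting the columns by pattern means n does not have to be set by hand.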