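How can I calculate, by group, the cumulative difference from all preceding dates?

Time: 2017-08-06 10:57:10

Tags: r for-loop plyr sapply cumulative-sum

The simplified raw data looks like this: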

Data    group
2016/1/10   1
2016/2/4    1
2016/3/25   1
2016/4/13   1
2016/5/5    1
2016/7/1    2
2016/8/1    2
2016/10/1   2
2016/12/1   2
2016/12/31  2

The final data I want is:

Data    group   cum_diff_preceding
2016/1/10   1   0
2016/2/4    1   25
2016/3/25   1   125
2016/4/13   1   182
2016/5/5    1   270
2016/7/1    2   0
2016/8/1    2   31
2016/10/1   2   153
2016/12/1   2   336
2016/12/31  2   456

It is calculated as follows (differences are in days, and each group starts over):

for row 2016/1/10, cum_diff_preceding is 0
for row 2016/2/4, cum_diff_preceding is (2016/2/4-2016/1/10)
for row 2016/3/25, cum_diff_preceding is (2016/3/25-2016/1/10)+(2016/3/25-2016/2/4)
for row 2016/4/13, cum_diff_preceding is (2016/4/13-2016/1/10)+(2016/4/13-2016/2/4)+(2016/4/13-2016/3/25)
for row 2016/5/5, cum_diff_preceding is (2016/5/5-2016/1/10)+(2016/5/5-2016/2/4)+(2016/5/5-2016/3/25)+(2016/5/5-2016/4/13)
for row 2016/7/1, cum_diff_preceding is 0
for row 2016/8/1, cum_diff_preceding is (2016/8/1-2016/7/1)
for row 2016/10/1, cum_diff_preceding is (2016/10/1-2016/7/1)+(2016/10/1-2016/8/1)
for row 2016/12/1, cum_diff_preceding is (2016/12/1-2016/7/1)+(2016/12/1-2016/8/1)+(2016/12/1-2016/10/1)
for row 2016/12/31, cum_diff_preceding is (2016/12/31-2016/7/1)+(2016/12/31-2016/8/1)+(2016/12/31-2016/10/1)+(2016/12/31-2016/12/1)
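
The rule above amounts to this: for each row, sum the day differences between that row's date and every earlier date in the same group. Here is a minimal base R sketch for group 1 only. The names `dates`, `days` and `cum_diff` are illustrative and not from the original post.

```r
# Dates of group 1, parsed with the slash-separated format used in the data
dates <- as.Date(c("2016/1/10", "2016/2/4", "2016/3/25", "2016/4/13", "2016/5/5"),
                 format = "%Y/%m/%d")
days <- as.numeric(dates)  # days since 1970-01-01

# For row i, sum (days[i] - days[j]) over all earlier rows j < i.
# seq_len(i - 1) is empty for i = 1, so the first row gets 0.
cum_diff <- vapply(seq_along(days),
                   function(i) sum(days[i] - days[seq_len(i - 1)]),
                   numeric(1))
cum_diff
# [1]   0  25 125 182 270
```

These values match the cum_diff_preceding column requested for group 1.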

My main code is as follows:

>as.Date(df$Data,"%Y-%m-%d")
>fun_forcast<-function(df){for(i in 2:nrow(df)){df$cum_diff_preceeding[i]<-sum(df$data[i]-df$data[1:(i-1)])}} 
>ddply(df,.(group),transform,cum_diff_preceding<-fun_forcast)

But it does not work.

Alternatively, when I change the code to

>fun_forcast<-function(df)(df$cum_diff_preceding<-sapply(1:NROW(df), function(i) sum(df$data[i] - df$data[1:(i-1)])))
>ddply(df,.(group),fun_forcast)

it works, but the result comes back in this format:

> ddply(df,.(group),fun_forcast)
  group V1 V2  V3  V4  V5
1     1  0 25 125 182 270
2     2  0 31 153 336 456

I don't know how to get the result back into a cum_diff_preceding column in the original data.frame.

2 Answers:

Answer 0 (score: 1)

We can do this in base R with ave:

df$Data <- as.Date(df$Data, "%Y/%m/%d")
fun_forcast <- function(v1) sapply(seq_along(v1), function(i) sum(v1[i] - v1[1:(i-1)]))
df$cum_diff_preceding <- with(df, ave(as.numeric(Data), group, FUN = fun_forcast))
df$cum_diff_preceding
#[1]   0  25 125 182 270   0  31 153 336 456
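
One detail worth noting: for i = 1, `1:(i-1)` evaluates to `c(1, 0)`, so `v1[1:(i-1)]` is just `v1[1]`, and the first value is 0 only because the element is subtracted from itself. The sketch below is a variant that makes the empty case explicit with `seq_len(i - 1)`. The function name `cum_diff_prev` is hypothetical, and the data frame is rebuilt from the question so the block runs on its own.

```r
# Rebuild the example data from the question
df <- data.frame(
  Data = as.Date(c("2016/1/10", "2016/2/4", "2016/3/25", "2016/4/13", "2016/5/5",
                   "2016/7/1", "2016/8/1", "2016/10/1", "2016/12/1", "2016/12/31"),
                 format = "%Y/%m/%d"),
  group = rep(1:2, each = 5)
)

# Sum of differences between element i and all elements before it.
# seq_len(0) is empty, so the first element of each group yields 0.
cum_diff_prev <- function(v) {
  vapply(seq_along(v), function(i) sum(v[i] - v[seq_len(i - 1)]), numeric(1))
}

# ave() applies the function within each group and returns the values in row order
df$cum_diff_preceding <- ave(as.numeric(df$Data), df$group, FUN = cum_diff_prev)
df$cum_diff_preceding
# [1]   0  25 125 182 270   0  31 153 336 456
```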

Or with dplyr:

library(dplyr)
df %>%
    group_by(group) %>%
    mutate(cum_diff_preceding = fun_forcast(Data))
# A tibble: 10 x 3
# Groups:   group [2]
#         Data group cum_diff_preceding
#       <date> <int>              <dbl>
# 1 2016-01-10     1                  0
# 2 2016-02-04     1                 25
# 3 2016-03-25     1                125
# 4 2016-04-13     1                182
# 5 2016-05-05     1                270
# 6 2016-07-01     2                  0
# 7 2016-08-01     2                 31
# 8 2016-10-01     2                153
# 9 2016-12-01     2                336
#10 2016-12-31     2                456

Answer 1 (score: 1)

Convert the dates to numbers and generalize the formula:

df %>%
  group_by(group) %>%
  mutate(numdata = as.numeric(Data),
         cum_diff_preceding  = (1:n())*numdata-cumsum(numdata)) %>%
  select(-numdata)

# A tibble: 10 x 3
# Groups:   group [2]
#          Data group cum_diff_preceding
#        <date> <int>              <dbl>
#  1 2016-01-10     1                  0
#  2 2016-02-04     1                 25
#  3 2016-03-25     1                125
#  4 2016-04-13     1                182
#  5 2016-05-05     1                270
#  6 2016-07-01     2                  0
#  7 2016-08-01     2                 31
#  8 2016-10-01     2                153
#  9 2016-12-01     2                336
# 10 2016-12-31     2                456
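
The closed form follows from expanding the sum: for the i-th value x_i in a group, the sum over the earlier rows j < i of (x_i - x_j) equals (i - 1) * x_i minus the sum of the earlier values, which is the same as i * x_i - cumsum(x)[i]. As a rough check, the same formula can be sketched in base R with ave. The names `x`, `group`, `closed_form` and `res` are illustrative and not part of the answer.

```r
# Numeric day counts for the example dates, plus their group labels
x <- as.numeric(as.Date(
  c("2016/1/10", "2016/2/4", "2016/3/25", "2016/4/13", "2016/5/5",
    "2016/7/1", "2016/8/1", "2016/10/1", "2016/12/1", "2016/12/31"),
  format = "%Y/%m/%d"))
group <- rep(1:2, each = 5)

# i * x_i - cumsum(x)_i, vectorised over the whole group
closed_form <- function(v) seq_along(v) * v - cumsum(v)

res <- ave(x, group, FUN = closed_form)
res
# [1]   0  25 125 182 270   0  31 153 336 456
```

This version does one pass per group rather than one sum per row, which should matter mainly when groups are large.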