r - Compute a rolling sum by id within a specific time range

Time: 2018-06-20 16:24:44

Tags: r data.table zoo rollapply

I want to count, by ID, the number of rows before the current row that fall within the preceding one-year window.

Here is my data:

library(data.table)

df <- structure(list(
  id = c("1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "2", "2"),
  flag = c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  date = structure(c(15425, 15456, 16613, 16959, 15513, 15513, 15625, 15635,
                     15649, 15663, 15670, 16051, 16052), class = "Date")),
  sorted = "id", class = c("data.table", "data.frame"), row.names = c(NA, -13L))




roll_sum <- c(0, 1, 0, 1, 0, 1, 2, 3, 4, 5, 6, 0, 1)
flag_sum <- c(0, 1, 0, 0, 0, 0, 0, 1, 2, 3, 4, 0, 1)

df_desired <- cbind(df, roll_sum) # roll_sum: number of rows excluding current row in 1 year time frame rolling
df_desired <- cbind(df_desired, flag_sum) # flag_sum: number of rows excluding current row in 1 year time frame rolling where flag was 1

Data:

    id flag       date
 1:  1    1 2012-03-26
 2:  1    1 2012-04-26
 3:  1    0 2015-06-27
 4:  1    1 2016-06-07
 5:  2    0 2012-06-22
 6:  2    0 2012-06-22
 7:  2    1 2012-10-12
 8:  2    1 2012-10-22
 9:  2    1 2012-11-05
10:  2    1 2012-11-19
11:  2    1 2012-11-26
12:  2    1 2013-12-12
13:  2    1 2013-12-13

Output:

df_desired
    id flag       date roll_sum flag_sum
 1:  1    1 2012-03-26        0        0
 2:  1    1 2012-04-26        1        1
 3:  1    0 2015-06-27        0        0
 4:  1    1 2016-06-07        1        0
 5:  2    0 2012-06-22        0        0
 6:  2    0 2012-06-22        1        0
 7:  2    1 2012-10-12        2        0
 8:  2    1 2012-10-22        3        1
 9:  2    1 2012-11-05        4        2
10:  2    1 2012-11-19        5        3
11:  2    1 2012-11-26        6        4
12:  2    1 2013-12-12        0        0
13:  2    1 2013-12-13        1        1

I tried the zoo-based solution given by G. Grothendieck in "Compute rolling sum by id variables, with missing timepoints", but it gave me an error:

  

  Error in merge.zoo(z, g) :
    series cannot be merged with non-unique index entries in a series
  In addition: Warning message:
  In zoo(count, date) :

I used make.index.unique and make.time.unique to make the date column unique.

I would appreciate help with an optimized solution. Thanks.

1 Answer:

Answer 0: (score: 1)

Not sure whether this will help, given the dimensions of your data.

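First, create a running index to handle duplicated dates, because the sum must not include later rows that share the same date. Also create a column holding the date one year earlier (I think 365 is better, but the OP seems to want 366).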

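Then, perform a non-equi self join, making sure that later duplicate dates are not used and that the matched dates fall within one year.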


Code:

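# rn: running row index, used to break ties between duplicated dates
# oneYrAgo: start of the one-year look-back window for each row
df[, c("rn", "oneYrAgo") := .(.I, date - 366)]

# Non-equi self join: for each row, match earlier rows (rn < rn) of the
# same id whose date lies between oneYrAgo and the row's own date
df[df, 
    .(roll_sum=.N, flag_sum=sum(flag, na.rm=TRUE)), 
    on=.(date >= oneYrAgo, rn < rn, id, date <= date), 
    by=.EACHI][, 
        -seq_len(2L)]  # drop the first two columns of the join result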
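
As a cross-check of the window logic described above, here is a minimal base R sketch that applies the same four conditions (same id, earlier row, date not before the window start, date not after the current date) with a plain loop. It is not the data.table join from the answer. The function name `rolling_counts` and the `window_days` parameter are made up for illustration, and the data is rebuilt as an ordinary data.frame so the sketch runs without any packages.

```r
# Sample data from the question, rebuilt as a plain data.frame
df <- data.frame(
  id   = c(rep("1", 4), rep("2", 9)),
  flag = c(1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  date = as.Date(c(15425, 15456, 16613, 16959, 15513, 15513, 15625, 15635,
                   15649, 15663, 15670, 16051, 16052), origin = "1970-01-01"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, count earlier rows of the same id
# whose date falls within the preceding `window_days` days
rolling_counts <- function(d, window_days = 366) {
  rn <- seq_len(nrow(d))  # running index, breaks ties on duplicated dates
  d$roll_sum <- 0
  d$flag_sum <- 0
  for (i in rn) {
    in_window <- d$id == d$id[i] &               # same id
      rn < i &                                   # strictly earlier row
      d$date >= d$date[i] - window_days &        # not before window start
      d$date <= d$date[i]                        # not after current date
    d$roll_sum[i] <- sum(in_window)
    d$flag_sum[i] <- sum(d$flag[in_window])
  }
  d
}

out <- rolling_counts(df)
print(out)
```

The loop compares every row with every other row, so it is only suitable for checking results on small samples; the non-equi join is the approach aimed at larger data. On this sample, `window_days = 365` and `window_days = 366` give identical counts, because no two rows of the same id are exactly 366 days apart.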