Rolling cumulative sum with dplyr and lag is not limited to the lag range

Time: 2017-09-01 08:57:53

Tags: r dplyr lag cumsum

I have a large data frame with data structured as follows:

   name       date val1 val2
1     A 2017-01-01    0    2
2     A 2017-01-02    1    1
3     A 2017-01-03    1    0
4     A 2017-01-04    0    3
5     A 2017-01-05    1    1
6     A 2017-01-06    0    0
7     B 2017-01-01    0    0
8     B 2017-01-02    0    3
9     B 2017-01-03    1    2
10    B 2017-01-04    1    1
11    B 2017-01-05    0    0
12    B 2017-01-06    1    0
13    C 2017-01-01    0    2
14    C 2017-01-02    0    1
15    C 2017-01-03    1    2
16    C 2017-01-04    0    0
17    C 2017-01-05    0    0
18    C 2017-01-06    1    3

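For any date within each name group, I now want to compute the cumsum() of val1 over the last 2 occurrences and of val2 over the last 3 occurrences.

I am attempting this with the following code (based on this answer: https://stackoverflow.com/a/27649238/1162278; the code includes creation of the sample data set):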

library(dplyr)
library(data.table)

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day')

d <- CJ(
  name = c('A', 'B', 'C'),
  date = dates
) %>% 
  left_join(
    data.frame(
      name = c(rep('A',6), rep('B',6), rep('C',6)),
      date = c(rep(dates, 3)),
      val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
      val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
    )
  )


d %>% 
  group_by(name) %>% 
  mutate(
    val1_l2 = dplyr::lag(cumsum(val1), k=2),
    val2_l3 = dplyr::lag(cumsum(val2), k=3)
  )

This produces:

    name       date  val1  val2 val1_l2 val2_l3
   <chr>     <date> <dbl> <dbl>   <dbl>   <dbl>
 1     A 2017-01-01     0     2      NA      NA
 2     A 2017-01-02     1     1       0       2
 3     A 2017-01-03     1     0       1       3
 4     A 2017-01-04     0     3       2       3
 5     A 2017-01-05     1     1       2       6
 6     A 2017-01-06     0     0       3       7
 7     B 2017-01-01     0     0      NA      NA
 8     B 2017-01-02     0     3       0       0
 9     B 2017-01-03     1     2       0       3
10     B 2017-01-04     1     1       1       5
11     B 2017-01-05     0     0       2       6
12     B 2017-01-06     1     0       2       6
13     C 2017-01-01     0     2      NA      NA
14     C 2017-01-02     0     1       0       2
15     C 2017-01-03     1     2       0       3
16     C 2017-01-04     0     0       1       5
17     C 2017-01-05     0     0       1       5
18     C 2017-01-06     1     3       1       5

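However, it seems that cumsum() is always computed over all previous records within the name group, rather than over the rolling ranges of k=2 and k=3 for val1 and val2, respectively.

Example: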

Row   Variable   Calculated   Expected
  5   val1_l2        2           1
  5   val2_l3        6           4
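
The Expected column can be reproduced by hand for group A: row 5 should sum val1 over rows 4 to 5 and val2 over rows 3 to 5. A minimal base R check, using only the values from the sample data (the vector names a_val1 and a_val2 are illustrative, not part of the original code):

```r
# Values of val1 and val2 for name == "A", copied from the sample data
a_val1 <- c(0, 1, 1, 0, 1, 0)
a_val2 <- c(2, 1, 0, 3, 1, 0)

# Window of the last 2 rows ending at row 5
expected_val1_l2 <- sum(a_val1[4:5])
# Window of the last 3 rows ending at row 5
expected_val2_l3 <- sum(a_val2[3:5])

print(c(expected_val1_l2, expected_val2_l3))
```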

What am I doing wrong?

1 Answer:

Answer 0 (score: 0)

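We may not need lag here. We can replace all values with 0 except those in the last two or three rows, and then use cumsum. Here is an example. Note that d2 is the final output. n():(n() - 1) and n():(n() - 2) refer to the last two and the last three rows, respectively. ifelse(row_number() %in% ...) checks whether the row number matches one of those last two or three rows.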
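
To see which rows these expressions select, here is a small base R illustration for a group of six rows. In it, n stands in for n() and seq_len(n) for row_number(); both substitutions are assumptions made so the snippet runs outside of mutate().

```r
n <- 6                       # stand-in for n() in a group of six rows
row_id <- seq_len(n)         # stand-in for row_number()

last_two   <- n:(n - 1)      # counts down from the last row: 6 5
last_three <- n:(n - 2)      # counts down from the last row: 6 5 4

# TRUE only for the rows that keep their original value
keep_two   <- row_id %in% last_two
keep_three <- row_id %in% last_three

print(keep_two)
print(keep_three)
```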

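d2 <- d %>%
  group_by(name) %>% 
  # Keep the original value only in the last 2 (val1) or 3 (val2) rows of each group; set all other rows to 0
  mutate(val1_l2 = ifelse(row_number() %in% n():(n() - 1), val1, 0),
         val2_l3 = ifelse(row_number() %in% n():(n() - 2), val2, 0)) %>%
  # Accumulate the masked values within each group
  mutate(val1_l2 = cumsum(val1_l2), 
         val2_l3 = cumsum(val2_l3))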

d2
# A tibble: 18 x 6
# Groups:   name [3]
    name       date  val1  val2 val1_l2 val2_l3
   <chr>     <date> <dbl> <dbl>   <dbl>   <dbl>
 1     A 2017-01-01     0     2       0       0
 2     A 2017-01-02     1     1       0       0
 3     A 2017-01-03     1     0       0       0
 4     A 2017-01-04     0     3       0       3
 5     A 2017-01-05     1     1       1       4
 6     A 2017-01-06     0     0       1       4
 7     B 2017-01-01     0     0       0       0
 8     B 2017-01-02     0     3       0       0
 9     B 2017-01-03     1     2       0       0
10     B 2017-01-04     1     1       0       1
11     B 2017-01-05     0     0       0       1
12     B 2017-01-06     1     0       1       1
13     C 2017-01-01     0     2       0       0
14     C 2017-01-02     0     1       0       0
15     C 2017-01-03     1     2       0       0
16     C 2017-01-04     0     0       0       0
17     C 2017-01-05     0     0       0       0
18     C 2017-01-06     1     3       1       3
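
Note that in the output above only the last rows of each group contribute, so val1_l2 and val2_l3 are not rolling sums on every row. If a per-row window is what is wanted, as the Expected column in the question suggests, one possible sketch is to subtract a lagged cumulative sum from the current one. This is an alternative to the approach above, not part of the original answer. It assumes rows are ordered by date within each name, treats missing history as 0 (so the first rows of a group sum over a shorter window), and passes the offset through the n argument of dplyr::lag, which has no k argument. The helper name roll_sum and the data frame names d_demo and d_roll are hypothetical.

```r
library(dplyr)

# Sample data rebuilt directly as a data frame (same values as in the question)
dates <- seq(as.Date("2017-01-01"), as.Date("2017-01-06"), by = "1 day")
d_demo <- data.frame(
  name = rep(c("A", "B", "C"), each = 6),
  date = rep(dates, 3),
  val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
  val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
)

# Hypothetical helper: sum of the current value and the (k - 1) values before it.
# cumsum(x) is the total up to the current row; lagging it by k gives the total
# up to the row just before the window, so the difference is the window sum.
roll_sum <- function(x, k) {
  cs <- cumsum(x)
  cs - dplyr::lag(cs, n = k, default = 0)
}

d_roll <- d_demo %>%
  group_by(name) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    val1_l2 = roll_sum(val1, 2),
    val2_l3 = roll_sum(val2, 3)
  ) %>%
  ungroup()

# Row 5 (name A, 2017-01-05) from the question's example table
d_roll %>% filter(name == "A", date == as.Date("2017-01-05"))
```

For that row, this sketch gives val1_l2 = 1 and val2_l3 = 4, matching the Expected column in the question.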

Data

library(dplyr)
library(data.table)

dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day')

d <- CJ(
  name = c('A', 'B', 'C'),
  date = dates
) %>% 
  left_join(
    data.frame(
      name = c(rep('A',6), rep('B',6), rep('C',6)),
      date = c(rep(dates, 3)),
      val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
      val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
    )
  )