I have a large data frame with data structured as follows:
name date val1 val2
1 A 2017-01-01 0 2
2 A 2017-01-02 1 1
3 A 2017-01-03 1 0
4 A 2017-01-04 0 3
5 A 2017-01-05 1 1
6 A 2017-01-06 0 0
7 B 2017-01-01 0 0
8 B 2017-01-02 0 3
9 B 2017-01-03 1 2
10 B 2017-01-04 1 1
11 B 2017-01-05 0 0
12 B 2017-01-06 1 0
13 C 2017-01-01 0 2
14 C 2017-01-02 0 1
15 C 2017-01-03 1 2
16 C 2017-01-04 0 0
17 C 2017-01-05 0 0
18 C 2017-01-06 1 3
For each date within each name group, I now want to compute the cumsum() of val1 over the last 2 occurrences and of val2 over the last 3 occurrences.
I am trying to do this with the following code (based on this answer: https://stackoverflow.com/a/27649238/1162278; it includes creating the sample dataset):
library(dplyr)
library(data.table)
dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day')
d <- CJ(
name = c('A', 'B', 'C'),
date = dates
) %>%
left_join(
data.frame(
name = c(rep('A',6), rep('B',6), rep('C',6)),
date = c(rep(dates, 3)),
val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
)
)
d %>%
group_by(name) %>%
mutate(
val1_l2 = dplyr::lag(cumsum(val1), k=2),
val2_l3 = dplyr::lag(cumsum(val2), k=3)
)
This produces:
name date val1 val2 val1_l2 val2_l3
<chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A 2017-01-01 0 2 NA NA
2 A 2017-01-02 1 1 0 2
3 A 2017-01-03 1 0 1 3
4 A 2017-01-04 0 3 2 3
5 A 2017-01-05 1 1 2 6
6 A 2017-01-06 0 0 3 7
7 B 2017-01-01 0 0 NA NA
8 B 2017-01-02 0 3 0 0
9 B 2017-01-03 1 2 0 3
10 B 2017-01-04 1 1 1 5
11 B 2017-01-05 0 0 2 6
12 B 2017-01-06 1 0 2 6
13 C 2017-01-01 0 2 NA NA
14 C 2017-01-02 0 1 0 2
15 C 2017-01-03 1 2 0 3
16 C 2017-01-04 0 0 1 5
17 C 2017-01-05 0 0 1 5
18 C 2017-01-06 1 3 1 5
However, cumsum() always seems to be computed over all previous records in the name group, rather than over rolling windows of k=2 for val1 and k=3 for val2.
Example:
Row Variable Calculated Expected
5 val1_l2 2 1
5 val2_l3 6 4
What am I doing wrong?
Answer 0 (score: 0)
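We may not need lag here. We can replace every value except those in the last two or three rows of each group with 0, and then apply cumsum. Here is an example; note that d2 is the final output. n():(n() - 1) and n():(n() - 2) give the row numbers of the last two and last three rows, and ifelse(row_number() %in% ...) checks whether the current row number is one of them.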
d2 <- d %>%
group_by(name) %>%
mutate(val1_l2 = ifelse(row_number() %in% n():(n() - 1), val1, 0),
val2_l3 = ifelse(row_number() %in% n():(n() - 2), val2, 0)) %>%
mutate(val1_l2 = cumsum(val1_l2),
val2_l3 = cumsum(val2_l3))
d2
# A tibble: 18 x 6
# Groups: name [3]
name date val1 val2 val1_l2 val2_l3
<chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A 2017-01-01 0 2 0 0
2 A 2017-01-02 1 1 0 0
3 A 2017-01-03 1 0 0 0
4 A 2017-01-04 0 3 0 3
5 A 2017-01-05 1 1 1 4
6 A 2017-01-06 0 0 1 4
7 B 2017-01-01 0 0 0 0
8 B 2017-01-02 0 3 0 0
9 B 2017-01-03 1 2 0 0
10 B 2017-01-04 1 1 0 1
11 B 2017-01-05 0 0 0 1
12 B 2017-01-06 1 0 1 1
13 C 2017-01-01 0 2 0 0
14 C 2017-01-02 0 1 0 0
15 C 2017-01-03 1 2 0 0
16 C 2017-01-04 0 0 0 0
17 C 2017-01-05 0 0 0 0
18 C 2017-01-06 1 3 1 3
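To make the masking step concrete, here is a minimal base-R sketch of the same idea on a single vector. The names x, n, last_two, masked and result are illustrative only; x stands in for group A's val1 column, length(x) plays the role of n(), and seq_along(x) plays the role of row_number().

```r
# Toy vector standing in for one group's val1 column (group A in the example data).
x <- c(0, 1, 1, 0, 1, 0)
n <- length(x)

# n:(n - 1) counts down from n, so it yields the last two positions (6 and 5).
last_two <- n:(n - 1)

# Keep the values at those positions and zero out everything else.
masked <- ifelse(seq_along(x) %in% last_two, x, 0)

# The cumulative sum stays at 0 until the last two rows are reached.
result <- cumsum(masked)
print(result)
```

The result agrees with the val1_l2 column shown for group A above. Only the last rows of each group carry a non-zero contribution, so this is a sum anchored at the end of the group rather than a window that moves with each row.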
Data
library(dplyr)
library(data.table)
dates <- seq(as.Date('2017-01-01'), as.Date('2017-01-06'), by = '1 day')
d <- CJ(
name = c('A', 'B', 'C'),
date = dates
) %>%
left_join(
data.frame(
name = c(rep('A',6), rep('B',6), rep('C',6)),
date = c(rep(dates, 3)),
val1 = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,1),
val2 = c(2,1,0,3,1,0,0,3,2,1,0,0,2,1,2,0,0,3)
)
)
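If the goal is instead a window that moves with each row, as the expected values in the question suggest (1 for val1_l2 and 4 for val2_l3 at row 5), one possible approach is to subtract a shifted cumulative sum from the cumulative sum itself. The sketch below uses base R only and rests on two assumptions: the window includes the current row, and rows are already ordered by date within each name. The helper roll_sum is a hypothetical name introduced here and is not part of dplyr or data.table. As a side note, dplyr::lag() names its offset argument n rather than k, which is consistent with the question's output showing a shift of a single row.

```r
# Hypothetical helper: sum of the current element and the k - 1 elements before it,
# computed as cumsum(x) minus the same cumsum shifted forward by k positions.
roll_sum <- function(x, k) {
  cs <- cumsum(x)
  shifted <- head(c(rep(0, k), cs), length(x))  # cumsum lagged by k, padded with 0
  cs - shifted
}

# Same sample data as above, rebuilt with base R so the sketch is self-contained.
d <- data.frame(
  name = rep(c("A", "B", "C"), each = 6),
  date = rep(seq(as.Date("2017-01-01"), as.Date("2017-01-06"), by = "1 day"), 3),
  val1 = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1),
  val2 = c(2, 1, 0, 3, 1, 0, 0, 3, 2, 1, 0, 0, 2, 1, 2, 0, 0, 3)
)

# ave() applies the function within each name group and keeps the row order.
d$val1_l2 <- ave(d$val1, d$name, FUN = function(x) roll_sum(x, 2))
d$val2_l3 <- ave(d$val2, d$name, FUN = function(x) roll_sum(x, 3))

print(d[d$name == "A", ])
```

For the first rows of each group, where fewer than k rows are available, the window simply covers whatever rows exist. If NA is preferred there, the zero padding in roll_sum could be replaced with NA.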