Say this is my data:
mydat=structure(list(ItemRelation = c(11629L, 11629L, 11629L, 11629L,
11629L, 11629L, 11629L, 11629L, 11629L, 11629L, 11629L, 11629L,
11629L, 11629L, 11629L, 11629L, 11629L, 11629L, 11629L, 11629L,
11629L, 11630L, 11630L, 11630L, 11630L, 11630L, 11630L, 11630L,
11630L, 11630L, 11630L, 11630L, 11630L), exp_date_days = c(5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L
), CustomerName = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("ТС", "ТС1"), class = "factor"),
DocumentNum = c(11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L
), IsPromo = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), CalendarYear = c(2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L), diff = 1:33), .Names = c("ItemRelation",
"exp_date_days", "CustomerName", "DocumentNum", "IsPromo", "CalendarYear",
"diff"), class = "data.frame", row.names = c(NA, -33L))
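I need to aggregate the data by condition for each ItemRelation + CustomerName + DocumentNum + CalendarYear group.
If exp_date_days for the group is <= 5, the diff column must be summed over only the 10 zeros (rows with IsPromo = 0) that come right after the run of IsPromo ones. If there are fewer than 10 such zeros, sum over as many zeros as there are.
If exp_date_days for the group is > 5, the diff column must be summed over only the 15 zeros that come right after the run of IsPromo ones. If there are fewer than 15 such zeros, sum over as many zeros as there are.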
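To make the rule concrete, here is a minimal base R sketch for a single group. The helper name sum_after_promo and its arguments are hypothetical (they do not come from the original post), and it assumes that "after the ones" means the first run of zeros that follows the first promo run.

```r
# Hypothetical helper: sum `diff_values` over the zeros that follow the first
# promo run, capped at 10 rows when exp_date_days <= 5 and at 15 otherwise.
sum_after_promo <- function(is_promo, diff_values, exp_date_days) {
  cap <- if (all(exp_date_days <= 5)) 10L else 15L
  # TRUE at each row where IsPromo drops from 1 to 0
  ended <- c(FALSE, base::diff(is_promo) < 0)
  # rows after the first 1 -> 0 drop (and before the next one) that are zeros
  after <- cumsum(ended) == 1 & is_promo == 0
  idx <- head(which(after), cap)
  sum(diff_values[idx])
}

# Group 11629 from the data above: exp_date_days = 5, promo on rows 13 and 14
promo_11629 <- c(rep(0L, 12), 1L, 1L, rep(0L, 7))
sum_after_promo(promo_11629, 1:21, rep(5L, 21))   # 126 (only 7 zeros follow)

# Group 11630: exp_date_days = 6, promo on row 2
promo_11630 <- c(0L, 1L, rep(0L, 10))
sum_after_promo(promo_11630, 22:33, rep(6L, 12))  # 285 (only 10 zeros follow)
```

A group with no promo rows at all returns 0 under this sketch, since no row follows a 1 -> 0 drop.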
So in this example the output is:
ItemRelation CustomerName DocumentNum CalendarYear diff
11629 ТС 11 2018 126
11630 ТС1 11 2018 285
How can I do this with dplyr or data.table?
ItemRelation exp_date_days CustomerName DocumentNum IsPromo CalendarYear diff
11629 5 ТС 11 0 2018 1
11629 5 ТС 11 0 2018 2
11629 5 ТС 11 0 2018 3
11629 5 ТС 11 0 2018 4
11629 5 ТС 11 0 2018 5
11629 5 ТС 11 0 2018 6
11629 5 ТС 11 0 2018 7
11629 5 ТС 11 0 2018 8
11629 5 ТС 11 0 2018 9
11629 5 ТС 11 0 2018 10
11629 5 ТС 11 0 2018 11
11629 5 ТС 11 0 2018 12
11629 5 ТС 11 1 2018 13
11629 5 ТС 11 1 2018 14
**11629 5 ТС 11 0 2018 15
11629 5 ТС 11 0 2018 16
11629 5 ТС 11 0 2018 17
11629 5 ТС 11 0 2018 18
11629 5 ТС 11 0 2018 19
11629 5 ТС 11 0 2018 20
11629 5 ТС 11 0 2018 21** (sum 126)
ItemRelation exp_date_days CustomerName DocumentNum IsPromo CalendarYear diff
11630 6 ТС1 11 0 2018 22
11630 6 ТС1 11 1 2018 23
**11630 6 ТС1 11 0 2018 24
11630 6 ТС1 11 0 2018 25
11630 6 ТС1 11 0 2018 26
11630 6 ТС1 11 0 2018 27
11630 6 ТС1 11 0 2018 28
11630 6 ТС1 11 0 2018 29
11630 6 ТС1 11 0 2018 30
11630 6 ТС1 11 0 2018 31
11630 6 ТС1 11 0 2018 32
11630 6 ТС1 11 0 2018 33** (sum 285)
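Answer 0 (score: 2)
We can apply filter after group_by and then take the sum of the 'diff' column. The first filter keeps the rows that follow the end of the promo run; the second caps how many of them are kept, depending on exp_date_days.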
library(dplyr)

mydat %>%
  group_by(ItemRelation, CustomerName, DocumentNum, CalendarYear) %>%
  # keep the rows after the first drop of IsPromo from 1 to 0
  filter(cumsum(c(FALSE, diff(IsPromo == 1) < 0)) == 1) %>%
  # cap at 10 rows when exp_date_days <= 5, otherwise at 15
  filter(if (all(exp_date_days <= 5)) row_number() <= 10 else row_number() <= 15) %>%
  summarise(diff = sum(diff))
# A tibble: 2 x 5
# Groups: ItemRelation, CustomerName, DocumentNum [?]
# ItemRelation CustomerName DocumentNum CalendarYear diff
# <int> <fct> <int> <int> <int>
#1 11629 ТС 11 2018 126
#2 11630 ТС1 11 2018 285
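Since the question also asks about data.table, here is a rough equivalent under the same reading of the rule. It is a sketch, not part of the original answer: the compact dt table is rebuilt by hand from the example (with CustomerName as character rather than factor), and the names after and cap are only illustrative.

```r
library(data.table)

# Compact copy of the example data (assumption: same values as mydat above)
dt <- data.table(
  ItemRelation  = rep(c(11629L, 11630L), c(21L, 12L)),
  exp_date_days = rep(c(5L, 6L), c(21L, 12L)),
  CustomerName  = rep(c("ТС", "ТС1"), c(21L, 12L)),
  DocumentNum   = 11L,
  IsPromo       = c(rep(0L, 12), 1L, 1L, rep(0L, 7), 0L, 1L, rep(0L, 10)),
  CalendarYear  = 2018L,
  diff          = 1:33
)

res <- dt[, {
  # zeros located after the first drop of IsPromo from 1 to 0;
  # base::diff is spelled out because `diff` is also a column name here
  after <- cumsum(c(FALSE, base::diff(IsPromo) < 0)) == 1 & IsPromo == 0
  # 10 rows at most when exp_date_days <= 5, otherwise 15
  cap <- if (all(exp_date_days <= 5)) 10L else 15L
  .(diff = sum(head(diff[after], cap)))
}, by = .(ItemRelation, CustomerName, DocumentNum, CalendarYear)]

print(res)  # diff is 126 for 11629 and 285 for 11630
```

Unlike the dplyr pipeline above, this version also drops any IsPromo = 1 rows that might appear later in the same segment, which only matters for groups with more than one promo run.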