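I am trying to compute a cumulative sum over a given window, subject to a condition. I have seen threads that solve the conditional running sum (Calculate a conditional running sum in R for every row in data frame) and the rolling sum (Rolling Sum by Another Variable in R), but I could not find one that combines the two. I also saw in R data.table sliding window that data.table has no rolling-window function, so this problem is quite challenging for me.
In addition, the rolling-sum solution posted by Mike Grahan is beyond my understanding. I am looking for a data.table-based approach, mainly for speed. However, I am open to other approaches as long as they are understandable.
Here is my input data: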
DFI <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011,
2012, 2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010),
Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575,
13575, 13575, 13575, 13575, 13578, 13578, 13578, 13578, 13578,
13578), Product = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "A", "A", "B", "C", "D", "E"), Rev = c(4,
3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)), .Names = c("FY",
"Customer", "Product", "Rev"), row.names = c(NA, 17L), class = "data.frame")
Here is my expected output (created by hand; apologies for any manual errors):
DFO <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2012,
2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), Customer = c(13575,
13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575,
13578, 13578, 13578, 13578, 13578, 13578), Product = c("A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "A", "A", "B", "C", "D",
"E"), Rev = c(4, 3, 3, 1, 2, 3, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2),
cumsum = c(4, 7, 10, 11, 9, 3, 6, 10, 15, 21, 3, 2, 2, 4,
2, 2)), .Names = c("FY", "Customer", "Product", "Rev", "cumsum"
), row.names = c(NA, 16L), class = "data.frame")
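Some comments on the logic:
1) I want to find the rolling sum over a 5-year period. Ideally, this 5-year period would be variable, i.e. something I can specify elsewhere in the code, so that I can change the window later for analysis.
2) The end of the window is based on the maximum year (i.e. FY in the example above). In the example above, the maximum FY in DFI is 2016. Therefore, for all entries in 2016, the starting year of the window is 2016 - 5 + 1 = 2012.
3) The window sum (or running sum) is computed per Customer and per specific Product.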
What I tried:
I wanted to try a few things before posting. Here is my code:
DFI <- data.table::as.data.table(DFI)
#Sort it first
DFI<-DFI[order(Customer,FY),]
#find cumulative sum; remove Rev column; order rows
DFOTest<-DFI[,cumsum := cumsum(Rev),by=.(Customer,Product)][,.SD[which.max(cumsum)],by=.(FY,Customer,Product)][,("Rev"):=NULL][order(Customer,Product,FY)]
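This code computes the cumulative sum, but I cannot work out how to define a 5-year window and then compute the running sum within it. I have two questions:
Question 1) How do I compute the 5-year running sum?
Question 2) Could someone explain Mike's method on this thread? It looks fast, but I am not sure what is going on there. I did see someone ask for comments on it, and I am not sure whether it is meant to be self-explanatory.
Thanks in advance. I have been struggling with this problem for two days.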
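Answer 0 (score: 6)
1) rollapply. Create a Sum function that takes FY and Rev as a 2-column matrix (coercing its input to one if it is not already), and then sums the revenues for the years that fall within k years of the last year in the window. Then convert DFI to a data.table, sum the rows that share the same Customer/Product/FY, and run rollapplyr with Sum for each Customer/Product group.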
library(data.table)
library(zoo)
k <- 5
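# Sum revenue for the years within k years of the window's last year
Sum <- function(x) {
  x <- matrix(x,, 2)               # coerce to a 2-column matrix (FY, Rev)
  FY <- x[, 1]
  Rev <- x[, 2]
  ok <- FY >= tail(FY, 1) - k + 1  # TRUE for years inside the k-year window
  sum(Rev[ok])
}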
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := rollapplyr(.SD, k, Sum, by.column = FALSE, partial = TRUE),
by = c("Customer", "Product"), .SDcols = c("FY", "Rev")]
giving:
> DT
Customer Product FY Rev cumsum
1: 13575 A 2011 4 4
2: 13575 A 2012 3 7
3: 13575 A 2013 3 10
4: 13575 A 2015 1 11
5: 13575 A 2016 2 9
6: 13575 B 2011 3 3
7: 13575 B 2012 3 6
8: 13575 B 2013 4 10
9: 13575 B 2014 5 15
10: 13575 B 2015 6 21
11: 13578 A 2010 3 3
12: 13578 A 2016 2 2
13: 13578 B 2013 2 2
14: 13578 C 2014 4 4
15: 13578 D 2015 2 2
16: 13578 E 2010 2 2
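To see what the Sum helper does to a single window, it can be called directly on a small matrix. The sketch below repeats the answer's Sum definition so that it runs on its own; the matrix `w` and the result names are illustrative assumptions, built from the 13575 / A rows.

```r
k <- 5

# Same helper as in the answer above, repeated so this block is self-contained
Sum <- function(x) {
  x <- matrix(x,, 2)
  FY <- x[, 1]
  Rev <- x[, 2]
  ok <- FY >= tail(FY, 1) - k + 1
  sum(Rev[ok])
}

# One window as a 2-column matrix: FY in column 1, Rev in column 2
w <- cbind(c(2011, 2012, 2013, 2015, 2016), c(4, 3, 3, 1, 2))

full_window    <- Sum(w)         # last year 2016, so 2011 is dropped: 3 + 3 + 1 + 2 = 9
partial_window <- Sum(w[1:2, ])  # only 2011 and 2012 available: 4 + 3 = 7
single_row     <- Sum(w[1, ])    # a plain vector c(2011, 4) is reshaped to 1 x 2: 4
```

Because the test is on the year values rather than on row positions, gaps in FY (such as the missing 2014) do not need to be filled in first.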
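2) data.table only. First sum the rows that share the same Customer/Product/FY. Then, grouping by Customer/Product, for each FY value fy pick out the Rev values whose FY lies between fy-k+1 and fy, and sum them.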
library(data.table)
k <- 5
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := sapply(FY, function(fy) sum(Rev[between(FY, fy-k+1, fy)])),
by = c("Customer", "Product")]
giving:
> DT
Customer Product FY Rev cumsum
1: 13575 A 2011 4 4
2: 13575 A 2012 3 7
3: 13575 A 2013 3 10
4: 13575 A 2015 1 11
5: 13575 A 2016 2 9
6: 13575 B 2011 3 3
7: 13575 B 2012 3 6
8: 13575 B 2013 4 10
9: 13575 B 2014 5 15
10: 13575 B 2015 6 21
11: 13578 A 2010 3 3
12: 13578 A 2016 2 2
13: 13578 B 2013 2 2
14: 13578 C 2014 4 4
15: 13578 D 2015 2 2
16: 13578 E 2010 2 2
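The question asked for the window length to be adjustable. As a rough sketch of the same "between fy-k+1 and fy" logic with k passed as an argument, here is a base R version that needs no packages. The helper name `window_sum` and the `ave()` index trick are illustrative choices, not part of the answer above, and this is not expected to match data.table for speed.

```r
# Hypothetical helper: for each year y, sum rev over the years in [y - k + 1, y]
window_sum <- function(fy, rev, k) {
  vapply(fy, function(y) sum(rev[fy >= y - k + 1 & fy <= y]), numeric(1))
}

# Input data from the question, written out as a plain data frame
DFI <- data.frame(
  FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011, 2012, 2013, 2014, 2015,
         2010, 2016, 2013, 2014, 2015, 2010),
  Customer = c(rep(13575, 11), rep(13578, 6)),
  Product = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
              "A", "A", "B", "C", "D", "E"),
  Rev = c(4, 3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)
)

# Step 1: collapse duplicate Customer/Product/FY rows
agg <- aggregate(Rev ~ FY + Customer + Product, data = DFI, FUN = sum)
agg <- agg[order(agg$Customer, agg$Product, agg$FY), ]

# Step 2: apply the windowed sum within each Customer/Product group;
# ave() hands each group its row indices, which are used to look up FY and Rev
k <- 5
agg$cumsum <- ave(seq_len(nrow(agg)), agg$Customer, agg$Product,
                  FUN = function(i) window_sum(agg$FY[i], agg$Rev[i], k))
```

Changing `k` and re-running the last statement recomputes the column for a different window length.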
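Answer 1 (score: 2)
My solution stays on the tidyverse side; however, if your source data is not excessively large, the performance difference may not be an issue.
I'll start by declaring a function that computes the rolling sum using tibbletime::rollify, then expand the data frame to include the missing FY values. After that, I group and summarise while applying the rolling sum.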
library(tidyr)
library(dplyr)
rollsum_5 <- tibbletime::rollify(sum, window = 5)
DFI %>%
complete(FY, Customer, Product) %>%
replace_na(list(Rev = 0)) %>%
arrange(Customer, Product, FY) %>%
group_by(Customer, Product, FY) %>%
summarise(Rev = sum(Rev)) %>%
mutate(cumsum = rollsum_5(Rev)) %>%
ungroup %>%
filter(Rev != 0)
# # A tibble: 16 x 5
# Customer Product FY Rev cumsum
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 13575 A 2011 4.00 NA
# 2 13575 A 2012 3.00 NA
# 3 13575 A 2013 3.00 NA
# 4 13575 A 2015 1.00 11.0
# 5 13575 A 2016 2.00 9.00
# 6 13575 B 2011 3.00 NA
# 7 13575 B 2012 3.00 NA
# 8 13575 B 2013 4.00 NA
# 9 13575 B 2014 5.00 15.0
# 10 13575 B 2015 6.00 21.0
# 11 13578 A 2010 3.00 NA
# 12 13578 A 2016 2.00 2.00
# 13 13578 B 2013 2.00 NA
# 14 13578 C 2014 4.00 4.00
# 15 13578 D 2015 2.00 2.00
# 16 13578 E 2010 2.00 NA
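N.B. The rolling sum in this case only appears in rows where the window (5 rows) is complete. Presenting a partial value as if it were a five-year sum could be misleading.
Answer 2 (score: 1)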
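The difference between complete-window and partial-window results can be illustrated with a small base R sketch. The function `roll_sum` and its `partial` argument are hypothetical stand-ins written for this illustration; they are not the tibbletime or zoo implementations. The input vector is the 13575 / A series for FY 2010 to 2016 with the missing years filled with 0.

```r
# Hypothetical right-aligned rolling sum; returns NA for incomplete windows
# unless partial = TRUE
roll_sum <- function(x, width, partial = FALSE) {
  vapply(seq_along(x), function(i) {
    if (i < width && !partial) return(NA_real_)
    sum(x[max(1, i - width + 1):i])
  }, numeric(1))
}

# Rev for Customer 13575, Product "A", FY 2010..2016 (2010 and 2014 filled with 0)
rev_a <- c(0, 4, 3, 3, 0, 1, 2)

complete_only <- roll_sum(rev_a, 5)                  # NA NA NA NA 10 11 9
with_partial  <- roll_sum(rev_a, 5, partial = TRUE)  # 0 4 7 10 10 11 9
```

Only the last three positions cover a full five years, which is why the 2011 to 2013 rows show NA in the output above.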
# Load packages
library(dplyr)
library(tidyr)
library(zoo)
# A helper function to convert the rolling cumsum result
cumsum_roll <- function(x){
vec <- c(x[1, ], x[, ncol(x)][-1])
return(vec)
}
DFI2 <- DFI %>%
# Group by FY, Customer, Product
group_by_at(vars(-Rev)) %>%
# Calculate the total Rev of each group
summarise(Rev = sum(Rev)) %>%
ungroup() %>%
group_by(Customer) %>%
# Expand the data frame based on FY and Product
# Fill the Rev to be 0
complete(FY = full_seq(FY, period = 1), Product, fill = list(Rev = 0)) %>%
# Sort the data frame by Customer, FY, and Product
arrange(Customer, Product, FY) %>%
ungroup() %>%
group_by(Customer, Product) %>%
# Apply the rolling cumsum by rollapply. Specify the window as 5.
# cumsum_roll is to transcribe the output of rollapply, a matrix, to a vector
mutate(cumsum = cumsum_roll(rollapply(Rev, 5, FUN = cumsum))) %>%
# Remove Rev = 0
filter(Rev != 0) %>%
# Reorder the columns
select(FY, Customer, Product, Rev, cumsum) %>%
ungroup() %>%
as.data.frame()
DFI2
# FY Customer Product Rev cumsum
# 1 2011 13575 A 4 4
# 2 2012 13575 A 3 7
# 3 2013 13575 A 3 10
# 4 2015 13575 A 1 11
# 5 2016 13575 A 2 9
# 6 2011 13575 B 3 3
# 7 2012 13575 B 3 6
# 8 2013 13575 B 4 10
# 9 2014 13575 B 5 15
# 10 2015 13575 B 6 21
# 11 2010 13578 A 3 3
# 12 2016 13578 A 2 2
# 13 2013 13578 B 2 2
# 14 2014 13578 C 4 4
# 15 2015 13578 D 2 2
# 16 2010 13578 E 2 2
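The cumsum_roll helper is easier to follow with a small worked case. The sketch below emulates in base R the matrix shape that rollapply(Rev, 5, FUN = cumsum) is described as returning (one row of within-window cumulative sums per window); `window_cumsums` is a hypothetical stand-in written for this illustration and assumes the input has at least `width` elements.

```r
# Hypothetical base R stand-in for the matrix of per-window cumulative sums
window_cumsums <- function(x, width) {
  starts <- seq_len(length(x) - width + 1)
  t(vapply(starts, function(i) cumsum(x[i:(i + width - 1)]), numeric(width)))
}

# Same helper as in the answer above
cumsum_roll <- function(x) {
  vec <- c(x[1, ], x[, ncol(x)][-1])
  return(vec)
}

# Rev for Customer 13575, Product "A", FY 2011..2016 (2014 filled with 0)
rev_a <- c(4, 3, 3, 0, 1, 2)

m <- window_cumsums(rev_a, 5)
# Row 1 (2011..2015): 4 7 10 10 11
# Row 2 (2012..2016): 3 6  6  7  9

result <- cumsum_roll(m)   # 4 7 10 10 11 9
```

The first row supplies the partial sums for the first five years, and the last column of each later row supplies the full five-year sum, so the result lines up one value per year.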
Answer 3 (score: 0)
Not a new tidyverse answer, but I think nest helps readability.
library(tidyverse)
library(zoo)
roll_cumsum <- function(df) {
df %>%
complete(FY = full_seq(FY, period=1)) %>%
mutate(roll_cumsum = rollapplyr(Rev, 5, sum, na.rm=TRUE, partial=TRUE))
}
DFI %>%
group_by_at(vars(-Rev)) %>%
summarise(Rev = sum(Rev)) %>%
group_by(Customer, Product) %>%
nest(FY, Rev) %>%
mutate(data = map(data, ~roll_cumsum(.x))) %>%
unnest() %>%
filter(!is.na(Rev)) %>%
arrange(Customer, Product, FY)
# A tibble: 16 x 5
# Customer Product FY Rev roll_cumsum
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 13575 A 2011 4.00 4.00
# 2 13575 A 2012 3.00 7.00
# 3 13575 A 2013 3.00 10.0
# 4 13575 A 2015 1.00 11.0
# 5 13575 A 2016 2.00 9.00
# 6 13575 B 2011 3.00 3.00
# 7 13575 B 2012 3.00 6.00
# 8 13575 B 2013 4.00 10.0
# 9 13575 B 2014 5.00 15.0
# 10 13575 B 2015 6.00 21.0
# 11 13578 A 2010 3.00 3.00
# 12 13578 A 2016 2.00 2.00
# 13 13578 B 2013 2.00 2.00
# 14 13578 C 2014 4.00 4.00
# 15 13578 D 2015 2.00 2.00
# 16 13578 E 2010 2.00 2.00