Conditional rolling sum over a window in R

Time: 2018-01-25 01:13:27

Tags: r dplyr data.table

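I am trying to compute a cumulative sum over a given window, subject to a condition. I have seen threads with solutions for a conditional running sum (Calculate a conditional running sum in R for every row in data frame) and for a rolling sum (Rolling Sum by Another Variable in R), but I could not find one that combines the two. I also saw that data.table has no rolling-window function (R data.table sliding window). So this problem is very challenging for me.

In addition, the rolling-sum solution posted by Mike Grahan is beyond my understanding. I am looking for a data.table-based approach, mainly for speed. However, I am open to other approaches as long as I can understand them.

Here is my input data: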

DFI <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011, 
2012, 2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), 
    Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575, 
    13575, 13575, 13575, 13575, 13578, 13578, 13578, 13578, 13578, 
    13578), Product = c("A", "A", "A", "A", "A", "B", "B", "B", 
    "B", "B", "B", "A", "A", "B", "C", "D", "E"), Rev = c(4, 
    3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)), .Names = c("FY", 
"Customer", "Product", "Rev"), row.names = c(NA, 17L), class = "data.frame")

Here is my expected output (created by hand; apologies for any manual errors):

DFO <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2012, 
2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), Customer = c(13575, 
13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 
13578, 13578, 13578, 13578, 13578, 13578), Product = c("A", "A", 
"A", "A", "A", "B", "B", "B", "B", "B", "A", "A", "B", "C", "D", 
"E"), Rev = c(4, 3, 3, 1, 2, 3, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2), 
    cumsum = c(4, 7, 10, 11, 9, 3, 6, 10, 15, 21, 3, 2, 2, 4, 
    2, 2)), .Names = c("FY", "Customer", "Product", "Rev", "cumsum"
), row.names = c(NA, 16L), class = "data.frame")

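Some comments on the logic:

1) I want a rolling sum over 5 years. Ideally, this 5-year period would be a variable, i.e. something I can specify elsewhere in the code, so that I can change the window later for analysis.

2) The end of the window is based on the maximum year (i.e. FY in the example above). In the example above, the maximum FY in DFI is 2016. So the starting year of the window is 2016 - 5 + 1 = 2012 for all entries in 2016.

3) The window sum (or running sum) is computed by Customer and by specific Product.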
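The three rules above can be sketched in base R for a single Customer/Product group. This is only an illustration: window_sum is a hypothetical helper name (not from any package), the window length k is passed as an argument so it can be changed later, and rows sharing the same FY are assumed to have been summed beforehand.

```r
# Hypothetical helper: for each row, sum Rev over the k fiscal years
# ending at that row's FY (i.e. FY in [y - k + 1, y]).
window_sum <- function(fy, rev, k = 5) {
  vapply(fy, function(y) sum(rev[fy >= y - k + 1 & fy <= y]), numeric(1))
}

# Customer 13575, Product A, taken from DFI
fy  <- c(2011, 2012, 2013, 2015, 2016)
rev <- c(4, 3, 3, 1, 2)

window_sum(fy, rev, k = 5)  # 4 7 10 11 9
window_sum(fy, rev, k = 2)  # 4 7 6 1 3
```

With k = 5 the 2016 row drops the 2011 revenue, which is why its sum is 9 rather than the plain cumulative total of 13.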

What I tried:

I wanted to try something myself before posting. Here is my code:

  DFI <- data.table::as.data.table(DFI)

  #Sort it first
  DFI<-DFI[order(Customer,FY),]

  #find cumulative sum; remove Rev column; order rows
  DFOTest<-DFI[,cumsum := cumsum(Rev),by=.(Customer,Product)][,.SD[which.max(cumsum)],by=.(FY,Customer,Product)][,("Rev"):=NULL][order(Customer,Product,FY)]

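This code computes the cumulative sum, but I cannot work out how to define the 5-year window and then compute the running sum. I have two questions:

Question 1) How do I compute the 5-year running sum?

Question 2) Could someone explain Mike's method on this thread? It looks fast. However, I am not sure what is going on there. I did see that someone asked for some comments, but I am not sure whether it is self-explanatory.

Thanks in advance. I have been struggling with this problem for two days.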

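4 answers:

Answer 0: (score: 6)

1) rollapply. Create a Sum function that takes FY and Rev as a 2-column matrix (or, if the input is not a matrix, turns it into one) and then sums the revenues for those years that fall within k years of the last year. Then convert DFI to a data.table, sum the rows that share the same Customer/Product/FY, and run rollapplyr with Sum for each Customer/Product group.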

library(data.table)
library(zoo)

k <- 5
Sum <- function(x) {
  x <- matrix(x,, 2)
  FY <- x[, 1]
  Rev <- x[, 2]
  ok <- FY >= tail(FY, 1) - k + 1
  sum(Rev[ok])
}
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := rollapplyr(.SD, k, Sum, by.column = FALSE, partial = TRUE),
       by = c("Customer", "Product"), .SDcols = c("FY", "Rev")]

giving:

 > DT
    Customer Product   FY Rev cumsum
 1:    13575       A 2011   4      4
 2:    13575       A 2012   3      7
 3:    13575       A 2013   3     10
 4:    13575       A 2015   1     11
 5:    13575       A 2016   2      9
 6:    13575       B 2011   3      3
 7:    13575       B 2012   3      6
 8:    13575       B 2013   4     10
 9:    13575       B 2014   5     15
10:    13575       B 2015   6     21
11:    13578       A 2010   3      3
12:    13578       A 2016   2      2
13:    13578       B 2013   2      2
14:    13578       C 2014   4      4
15:    13578       D 2015   2      2
16:    13578       E 2010   2      2
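To see what Sum does with a single window, it can be called directly on hand-built input, without zoo. The sketch below repeats the Sum definition from above; the window matrices are assumptions about the kind of input rollapplyr passes when by.column = FALSE, not captured output from it.

```r
k <- 5
Sum <- function(x) {
  x <- matrix(x, , 2)              # coerce a plain vector into a 1-row, 2-column matrix
  FY <- x[, 1]
  Rev <- x[, 2]
  ok <- FY >= tail(FY, 1) - k + 1  # keep only years within k of the window's last year
  sum(Rev[ok])
}

# A full 5-row window for Customer 13575, Product A: 2011 falls outside 2012..2016
w_full <- cbind(FY = c(2011, 2012, 2013, 2015, 2016), Rev = c(4, 3, 3, 1, 2))
Sum(w_full)       # 9

# Two rows that are six years apart: only the last row counts
w_gap <- cbind(FY = c(2010, 2016), Rev = c(3, 2))
Sum(w_gap)        # 2

# A partial window that arrives as a bare vector (one FY, one Rev)
Sum(c(2010, 3))   # 3
```

The year filter is what makes a row-based window of width k behave like a k-year window when some years are missing from a group.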

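2) data.table only

First sum the rows having the same Customer/Product/FY. Then, grouping by Customer/Product, for each FY value fy, pick out the Rev values whose FY values lie between fy-k+1 and fy, and sum them.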

library(data.table)

k <- 5
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := sapply(FY, function(fy) sum(Rev[between(FY, fy-k+1, fy)])),
       by = c("Customer", "Product")]

giving:

> DT
    Customer Product   FY Rev cumsum
 1:    13575       A 2011   4      4
 2:    13575       A 2012   3      7
 3:    13575       A 2013   3     10
 4:    13575       A 2015   1     11
 5:    13575       A 2016   2      9
 6:    13575       B 2011   3      3
 7:    13575       B 2012   3      6
 8:    13575       B 2013   4     10
 9:    13575       B 2014   5     15
10:    13575       B 2015   6     21
11:    13578       A 2010   3      3
12:    13578       A 2016   2      2
13:    13578       B 2013   2      2
14:    13578       C 2014   4      4
15:    13578       D 2015   2      2
16:    13578       E 2010   2      2
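As a possible variation on the sapply approach, the same window can be expressed as a non-equi self-join in data.table, which may scale better on large groups (no timings were measured here). This is a sketch on one Customer/Product group only; the helper column lo is a name introduced for illustration, and rows are assumed to be already summed per Customer/Product/FY.

```r
library(data.table)

k <- 5
DT <- data.table(
  Customer = 13575,
  Product  = "A",
  FY       = c(2011, 2012, 2013, 2015, 2016),
  Rev      = c(4, 3, 3, 1, 2)
)

# Lower bound of each row's window (hypothetical helper column)
DT[, lo := FY - k + 1]

# For every row of DT (as i), match the rows of DT (as x) in the same group
# whose FY lies in [lo, FY], and sum their Rev; by = .EACHI gives one sum per row.
win <- DT[DT, on = .(Customer, Product, FY >= lo, FY <= FY),
          sum(x.Rev), by = .EACHI]

DT[, cumsum := win$V1]
DT[, lo := NULL]
DT$cumsum  # 4 7 10 11 9
```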

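Answer 1: (score: 2)

My solution stays on the tidyverse side; however, unless your source data is very large, the performance difference may not be an issue.

I first declare a function that computes the rolling sum using tibbletime::rollify, then expand the data frame to include the missing FY values, and then group and summarise while applying the rolling sum.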

library(tidyr)
library(dplyr)

rollsum_5 <- tibbletime::rollify(sum, window = 5)

DFI %>%
  complete(FY, Customer, Product) %>%
  replace_na(list(Rev = 0)) %>%
  arrange(Customer, Product, FY) %>%
  group_by(Customer, Product, FY) %>%
  summarise(Rev = sum(Rev)) %>%
  mutate(cumsum = rollsum_5(Rev)) %>%
  ungroup %>%
  filter(Rev != 0)

# # A tibble: 16 x 5
#    Customer Product    FY   Rev cumsum
#       <dbl> <chr>   <dbl> <dbl>  <dbl>
#  1    13575 A        2011  4.00  NA   
#  2    13575 A        2012  3.00  NA   
#  3    13575 A        2013  3.00  NA   
#  4    13575 A        2015  1.00  11.0 
#  5    13575 A        2016  2.00   9.00
#  6    13575 B        2011  3.00  NA   
#  7    13575 B        2012  3.00  NA   
#  8    13575 B        2013  4.00  NA   
#  9    13575 B        2014  5.00  15.0 
# 10    13575 B        2015  6.00  21.0 
# 11    13578 A        2010  3.00  NA   
# 12    13578 A        2016  2.00   2.00
# 13    13578 B        2013  2.00  NA   
# 14    13578 C        2014  4.00   4.00
# 15    13578 D        2015  2.00   2.00
# 16    13578 E        2010  2.00  NA 
  

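N.B. The rolling sum in this case only appears for rows where the window (5 rows) is complete. Presenting a partial value as if it were a five-year sum could be misleading.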
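The difference between complete and partial windows can be sketched in base R on a single series. The vector below is an assumed example (Customer 13575, Product A for 2011 to 2016, with the missing 2014 filled with 0); it does not reproduce the exact output above, where the completed series starts in 2010.

```r
k   <- 5
rev <- c(4, 3, 3, 0, 1, 2)   # 2011..2016, 2014 filled with 0
n   <- length(rev)

# Complete windows only: NA until k rows are available
full <- sapply(seq_len(n), function(i) {
  if (i < k) NA_real_ else sum(rev[(i - k + 1):i])
})

# Partial windows: sum whatever rows are available so far
partial <- sapply(seq_len(n), function(i) sum(rev[max(1, i - k + 1):i]))

full     # NA NA NA NA 11  9
partial  #  4  7 10 10 11  9
```

Which behaviour is appropriate depends on whether a sum over fewer than k years should still be reported as a k-year figure.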

Answer 2: (score: 1)

A solution using dplyr, tidyr, and zoo.

# Load packages
library(dplyr)
library(tidyr)
library(zoo)

# A helper function to convert the rolling cumsum result
cumsum_roll <- function(x){
  vec <- c(x[1, ], x[, ncol(x)][-1])
  return(vec)
}

DFI2 <- DFI %>%
  # Group by FY, Customer, Product
  group_by_at(vars(-Rev)) %>%                 
  # Calculate the total Rev of each group
  summarise(Rev = sum(Rev)) %>%               
  ungroup() %>%
  group_by(Customer) %>%
  # Expand the data frame based on FY and Product
  # Fill the Rev to be 0
  complete(FY = full_seq(FY, period = 1), Product, fill = list(Rev = 0)) %>%
  # Sort the data frame by Customer, FY, and Product
  arrange(Customer, Product, FY) %>%
  ungroup() %>%
  group_by(Customer, Product) %>%
  # Apply the rolling cumsum by rollapply. Specify the window as 5.
  # cumsum_roll is to transcribe the output of rollapply, a matrix, to a vector
  mutate(cumsum = cumsum_roll(rollapply(Rev, 5, FUN = cumsum))) %>%
  # Remove Rev = 0
  filter(Rev != 0) %>%
  # Reorder the columns
  select(FY, Customer, Product, Rev, cumsum) %>%
  ungroup() %>%
  as.data.frame()

DFI2
#      FY Customer Product Rev cumsum
# 1  2011    13575       A   4      4
# 2  2012    13575       A   3      7
# 3  2013    13575       A   3     10
# 4  2015    13575       A   1     11
# 5  2016    13575       A   2      9
# 6  2011    13575       B   3      3
# 7  2012    13575       B   3      6
# 8  2013    13575       B   4     10
# 9  2014    13575       B   5     15
# 10 2015    13575       B   6     21
# 11 2010    13578       A   3      3
# 12 2016    13578       A   2      2
# 13 2013    13578       B   2      2
# 14 2014    13578       C   4      4
# 15 2015    13578       D   2      2
# 16 2010    13578       E   2      2
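The role of the cumsum_roll helper may be easier to see on a hand-built matrix. The sketch below imitates, in base R and under the assumption that rollapply(Rev, 5, FUN = cumsum) returns one row of cumulative sums per complete window, what the helper receives for Customer 13575, Product A after the years 2010 to 2016 have been filled in with 0.

```r
cumsum_roll <- function(x) {
  # First row: partial sums for the first window;
  # last column (minus its first entry): full-window sums for later rows
  c(x[1, ], x[, ncol(x)][-1])
}

k   <- 5
rev <- c(0, 4, 3, 3, 0, 1, 2)   # 2010..2016, missing years filled with 0

# One row per complete window, each row holding cumsum() of that window
m <- t(sapply(seq_len(length(rev) - k + 1),
              function(i) cumsum(rev[i:(i + k - 1)])))
m
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    4    7   10   10
# [2,]    4    7   10   10   11
# [3,]    3    6    6    7    9

res <- cumsum_roll(m)
res             # 0 4 7 10 10 11 9
res[rev != 0]   # 4 7 10 11 9
```

After dropping the filled-in zero rows, the values match the cumsum column for that group in the output above.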

Answer 3: (score: 0)

Not a new answer as far as the tidyverse goes, but I think nest helps with readability.

library(tidyverse)
library(zoo)

roll_cumsum <- function(df) {
                  df %>%
                     complete(FY = full_seq(FY, period=1)) %>%
                     mutate(roll_cumsum = rollapplyr(Rev, 5, sum, na.rm=TRUE, partial=TRUE))
               }

DFI %>%
  group_by_at(vars(-Rev)) %>%
  summarise(Rev = sum(Rev)) %>%
  group_by(Customer, Product) %>%
  nest(FY, Rev) %>%
  mutate(data = map(data, ~roll_cumsum(.x))) %>%
  unnest() %>%
  filter(!is.na(Rev)) %>%
  arrange(Customer, Product, FY)

# A tibble: 16 x 5
   # Customer Product    FY   Rev roll_cumsum
      # <dbl> <chr>   <dbl> <dbl>       <dbl>
 # 1    13575 A        2011  4.00        4.00
 # 2    13575 A        2012  3.00        7.00
 # 3    13575 A        2013  3.00       10.0 
 # 4    13575 A        2015  1.00       11.0 
 # 5    13575 A        2016  2.00        9.00
 # 6    13575 B        2011  3.00        3.00
 # 7    13575 B        2012  3.00        6.00
 # 8    13575 B        2013  4.00       10.0 
 # 9    13575 B        2014  5.00       15.0 
# 10    13575 B        2015  6.00       21.0 
# 11    13578 A        2010  3.00        3.00
# 12    13578 A        2016  2.00        2.00
# 13    13578 B        2013  2.00        2.00
# 14    13578 C        2014  4.00        4.00
# 15    13578 D        2015  2.00        2.00
# 16    13578 E        2010  2.00        2.00