Rolling sum with a conditional window

Date: 2017-12-20 09:05:09

Tags: r dataframe data.table

Here is an example of my data:

> d
   customer       date revenue
1:        A 2016-01-01      32
2:        A 2016-01-03      88
3:        A 2016-01-04      80
4:        A 2016-02-01      38
5:        B 2016-01-13      44
6:        B 2016-01-24      11
7:        B 2016-01-25      50
8:        B 2016-02-26      46
> dput(d)
structure(list(customer = c("A", "A", "A", "A", "B", "B", "B", 
"B"), date = structure(c(16801, 16803, 16804, 16832, 16813, 16824, 
16825, 16857), class = "Date"), revenue = c(32, 88, 80, 38, 44, 
11, 50, 46)), .Names = c("customer", "date", "revenue"), row.names = c(NA, 
-8L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000002a60788>)

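What I want to do is create a column, let's call it roll_sum_3days. This column is a rolling sum of the revenue that occurs afterwards. The window size is conditional on the date column. In this case, roll_sum_3days is the sum of the revenue that occurs after a row's date, but no later than 3 days after it.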
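To make the window rule concrete, here is a minimal base R sketch for a single row. It assumes the window is open on the left and closed on the right, so a row's own revenue is excluded and revenue exactly 3 days later is included. The vectors simply restate customer A's rows from the data above, and the variable names are made up for the illustration.

```r
# Customer A's rows, restated from the example data
dates   <- as.Date(c("2016-01-01", "2016-01-03", "2016-01-04", "2016-02-01"))
revenue <- c(32, 88, 80, 38)

# Window for the first row: strictly after its date, at most 3 days later
start     <- dates[1]
in_window <- dates > start & dates <= start + 3

# Revenue falling inside that window: 88 + 80
first_row_sum <- sum(revenue[in_window])
first_row_sum
```

For the first row this gives 168, because only the rows dated 2016-01-03 and 2016-01-04 fall inside the window.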

The expected result would look like this:

   customer       date revenue roll_sum_3days
1:        A 2016-01-01      32            168
2:        A 2016-01-03      88             80
3:        A 2016-01-04      80              0
4:        A 2016-02-01      38              0
5:        B 2016-01-13      44              0
6:        B 2016-01-24      11             96
7:        B 2016-01-25      50             46
8:        B 2016-01-26      46              0

2 answers:

Answer 0 (score: 3)

A possible solution:

library(data.table)
library(lubridate) # for the %m+% operator

d[, roll_sum_3d := .SD[.SD[, .(date, date2 = date %m+% days(3), revenue)]
                       , on = .(date > date, date <= date2)
                       ][, sum(revenue, na.rm = TRUE), by = date]$V1
  , by = customer][]

which gives:

   customer       date revenue roll_sum_3d
1:        A 2016-01-01      32         168
2:        A 2016-01-03      88          80
3:        A 2016-01-04      80           0
4:        A 2016-02-01      38           0
5:        B 2016-01-13      44           0
6:        B 2016-01-24      11          96
7:        B 2016-01-25      50          46
8:        B 2016-01-26      46           0

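What this does:

  • Group d by customer with by = customer.
  • Add roll_sum_3d by reference with :=.
  • Compute roll_sum_3d for each group by joining .SD (Subset of Data) with a date window for that group, .SD[, .(date, date2 = date %m+% days(3), revenue)], using the non-equi join on = .(date > date, date <= date2); then sum the revenue for each date and return it.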
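The steps above can be sketched for a single customer to make the intermediate join visible. This is an illustration under assumptions, not the answer's exact code: it rebuilds customer A's rows by hand, uses date + 3 instead of %m+% days(3) (equivalent for a whole number of days on a Date column), and the names grp, win, joined and res are made up for the example.

```r
library(data.table)

# Customer A's rows, rebuilt by hand for the illustration
grp <- data.table(
  date    = as.Date(c("2016-01-01", "2016-01-03", "2016-01-04", "2016-02-01")),
  revenue = c(32, 88, 80, 38)
)

# One window per row: strictly after date, up to and including date2
win <- grp[, .(date, date2 = date + 3)]

# Non-equi join: for every window, the rows of grp whose date falls inside it.
# In the result the date column holds the window start taken from win,
# and a window without any match keeps one row with revenue = NA.
joined <- grp[win, on = .(date > date, date <= date2)]

# Sum the matched revenue per window start; na.rm = TRUE turns the
# unmatched windows into 0
res <- joined[, .(roll_sum_3d = sum(revenue, na.rm = TRUE)), by = date]
res
```

Printing joined before aggregating is a quick way to check which rows each window picked up.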

Another option, based on @Arun's comment:

on = .(date > date, date <= date2)
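One way an alternative along these lines might look is to aggregate directly inside the join with by = .EACHI. This is a sketch under assumptions and not necessarily the code from the comment: it rebuilds d from the dput() output in the question and uses date + 3 in place of %m+% days(3).

```r
library(data.table)

# Example data rebuilt from the dput() output in the question
d <- data.table(
  customer = c("A", "A", "A", "A", "B", "B", "B", "B"),
  date     = as.Date(c("2016-01-01", "2016-01-03", "2016-01-04", "2016-02-01",
                       "2016-01-13", "2016-01-24", "2016-01-25", "2016-02-26")),
  revenue  = c(32, 88, 80, 38, 44, 11, 50, 46)
)

# For each customer, join .SD to its own windows and sum inside the join.
# by = .EACHI evaluates j once per window row, in the order of the windows,
# so the sums line up with the rows of the group.
d[, roll_sum_3d := .SD[.SD[, .(date, date2 = date + 3)],
                       on = .(date > date, date <= date2),
                       .(s = sum(revenue, na.rm = TRUE)),
                       by = .EACHI]$s,
  by = customer]
d
```

With the data as given by dput(), where the last row is dated 2016-02-26, rows 6 and 7 come out as 50 and 0. The 96 and 46 in the printed result above correspond to a last row dated 2016-01-26.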

Answer 1 (score: 1)

Hi, I think there is another error in your example: observation 8 does not add to the totals of the two previous observations, because it is from February. Never mind. If you want to use the apply() function, I have a solution:

df <- data.frame(customer = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 date = structure(c(16801, 16803, 16804, 16832, 16813, 16824, 16825, 16857), class = "Date"),
                 revenue = c(32, 88, 80, 38, 44, 11, 50, 46))
df$date <- as.POSIXct(df$date)

calc <- function(x){
  date <- as.POSIXct(unlist(x["date"]), origin = "1970-01-01")
  customer <- unlist(x["customer"])
  # Here you choose what you want to sum (the conditions are: after the day,
  # at most 3 days later, and same customer)
  # 86400 is the number of seconds in a day
  output <- sum(df[df$date > date & df$date <= (date + 86400 * 3) & df$customer == customer, "revenue"])
  return(output)
}

df$sum <- apply(df, 1, calc)

# if you want to go back to your date format
df$date <- as.Date(df$date)
df

  customer       date revenue sum
1        A 2016-01-01      32 168
2        A 2016-01-03      88  80
3        A 2016-01-04      80   0
4        A 2016-02-01      38   0
5        B 2016-01-13      44   0
6        B 2016-01-24      11  50
7        B 2016-01-25      50   0
8        B 2016-02-26      46   0

I could not keep your date format, because the POSIXct() operator cannot be used with it.
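If converting to POSIXct is not a requirement, a possible variant keeps the Date class, since Date values can be compared and have a number of days added to them. This is a minimal base R sketch under that assumption: window_sum is a hypothetical helper name, and the data frame restates the example data.

```r
# Example data, with the date column kept as class Date
df <- data.frame(
  customer = c("A", "A", "A", "A", "B", "B", "B", "B"),
  date     = as.Date(c("2016-01-01", "2016-01-03", "2016-01-04", "2016-02-01",
                       "2016-01-13", "2016-01-24", "2016-01-25", "2016-02-26")),
  revenue  = c(32, 88, 80, 38, 44, 11, 50, 46),
  stringsAsFactors = FALSE
)

# Hypothetical helper: revenue of the same customer that falls strictly
# after row i's date and at most `days` days later
window_sum <- function(i, data, days = 3) {
  in_window <- data$customer == data$customer[i] &
    data$date > data$date[i] &
    data$date <= data$date[i] + days
  sum(data$revenue[in_window])
}

# Apply the helper to every row index; the result is one number per row
df$sum <- vapply(seq_len(nrow(df)), window_sum, numeric(1), data = df)
df
```

Iterating over row indices instead of using apply() avoids the conversion of each row to character, so no date parsing is needed inside the helper.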