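I have a large panel data set (10,000,000 x 53) with about 50 score columns. I have already aggregated each score by group (about 15,000 groups) and date.
Now I want to compute a rolling sum over three values, namely the scores for the two previous dates and the current date, and store it in a new corresponding sum column. The sum should be computed for every score column, by date and within group. For the 1st and 2nd date within a group, fewer than 3 values are allowed.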
     GROUP DATE       LAGGED     SCORE1 SUM1    SCORE2 SUM2    ... SCORE50 SUM50
#1   A     2017-04-01 2017-03-30 1      1|1     2      2|2         4       4|4
#2   A     2017-04-02 2017-03-31 1      1+1|2   3      3+2|5       3       3+4|7
#3   A     2017-04-04 2017-04-02 2      2+1+1|4 4      4+3+2|9     2       2+3+4|9
#5   B     2017-04-02 2017-03-31 2      2|2     3      3|3         1       1|1
#6   B     2017-04-05 2017-04-03 2      2+2|4   2      2+3|5       1       1+1|2
#7   B     2017-04-08 2017-04-06 3      3+2+2|7 1      1+2+3|6     3       3+1+1|5
#8   C     2017-04-02 2017-03-31 3      3|3     1      1|1         1       1|1
#9   C     2017-04-03 2017-04-01 2      2+3|5   3      3+1|4       2       2+1|3
     :     :          :          :      :       :      :       :   :       :
#10M XX    2018-03-30 2018-03-28 2      2       1      1       ... 1       1
David's answer in this post covers most of my question about summing over a rolling window by group, but I am still missing a few pieces.
library(data.table) #v1.10.4

## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
  Date = as.IDate(Date)
  Date2 = Date - 2L
  .(Date, Date2)
}]

## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")),     ## This removes the dupes
      on = .(Group, Date <= Date, Date >= Date2), ## The join condition
      .(Score = sum(Score)),                      ## sum the scores
      keyby = .EACHI]                             ## Run the sum by each row in
                                                  ## unique(input, by = c("Date", "Group"))
My question has two parts:
Answer 0 (score: 3)
A possible solution:
cols <- grep('^SCORE', names(input), value = TRUE)

input[, gsub('SCORE','SUM',cols) := lapply(.SD, cumsum)
      , by = GROUP
      , .SDcols = cols][]
which gives:

   GROUP       DATE     LAGGED SCORE1 SCORE2 SUM1 SUM2
1:     A 2017-04-01 2017-03-30      1      2    1    2
2:     A 2017-04-02 2017-03-31      1      3    2    5
3:     A 2017-04-04 2017-04-02      2      4    4    9
4:     B 2017-04-02 2017-03-31      2      3    2    3
5:     B 2017-04-05 2017-04-03      2      2    4    5
6:     B 2017-04-08 2017-04-06      3      1    7    6
7:     C 2017-04-02 2017-03-31      3      1    3    1
8:     C 2017-04-03 2017-04-01      2      3    5    4
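Note that cumsum is a running total per group, so it matches a three-row window only while a group has at most three rows, as in the sample data. Below is a minimal sketch of a strict three-row window (current row plus the two previous rows), assuming the rows are sorted by date within each group. The table `toy` and the helper `roll3` are hypothetical names used only for illustration; they are not part of the original answer.

```r
library(data.table)

# Hypothetical toy data: one group with four dates, so that a running total
# and a three-row window give different results on the last row.
toy <- data.table(
  GROUP  = rep("A", 4),
  DATE   = as.IDate(c("2017-04-01", "2017-04-02", "2017-04-04", "2017-04-05")),
  SCORE1 = c(1, 1, 2, 5),
  SCORE2 = c(2, 3, 4, 1)
)

cols <- grep("^SCORE", names(toy), value = TRUE)

# The window is row-based, so the order within each group matters.
setorder(toy, GROUP, DATE)

# Current value + previous value + the value before that.
# Lags that do not exist (first and second row of a group) count as 0.
roll3 <- function(x) x + shift(x, 1L, fill = 0) + shift(x, 2L, fill = 0)

toy[, (sub("SCORE", "SUM", cols)) := lapply(.SD, roll3),
    by = GROUP, .SDcols = cols]
toy[]
```

On the fourth row SUM1 is 1 + 2 + 5 = 8, whereas cumsum would give 9. Because the window counts rows rather than calendar days, gaps between dates are ignored.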
If you also want to take a time window into account, you could do the following (assuming LAGGED is the start of the time window):
input[input[input[, .(GROUP, DATE, LAGGED)]
            , on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
            ][, setNames(lapply(.SD, sum), gsub('SCORE','SUM',cols))
              , by = .(GROUP, DATE = DATE.1)
              , .SDcols = cols]
      , on = .(GROUP, DATE)]
which gives:

   GROUP       DATE     LAGGED SCORE1 SCORE2 SUM1 SUM2
1:     A 2017-04-01 2017-03-30      1      2    1    2
2:     A 2017-04-02 2017-03-31      1      3    2    5
3:     A 2017-04-04 2017-04-02      2      4    3    7
4:     B 2017-04-02 2017-03-31      2      3    2    3
5:     B 2017-04-05 2017-04-03      2      2    2    2
6:     B 2017-04-08 2017-04-06      3      1    3    1
7:     C 2017-04-02 2017-03-31      3      1    3    1
8:     C 2017-04-03 2017-04-01      2      3    5    4
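To sanity-check the join on a small sample, the same calendar window (all rows of the group with LAGGED <= DATE <= current DATE) can be computed row by row in base R. This is only a sketch for verification, assuming DATE and LAGGED are stored as Date values. The data frame `check` and the helper `window_sum` are hypothetical names, and the loop is far too slow for 10 million rows.

```r
# Sample data from this answer, stored with real Date columns (an assumption).
check <- data.frame(
  GROUP  = c("A", "A", "A", "B", "B", "B", "C", "C"),
  DATE   = as.Date(c("2017-04-01", "2017-04-02", "2017-04-04", "2017-04-02",
                     "2017-04-05", "2017-04-08", "2017-04-02", "2017-04-03")),
  SCORE1 = c(1, 1, 2, 2, 2, 3, 3, 2),
  SCORE2 = c(2, 3, 4, 3, 2, 1, 1, 3)
)
check$LAGGED <- check$DATE - 2  # window start: two calendar days earlier

# For every row, sum `col` over the rows of the same group whose DATE lies
# between that row's LAGGED and DATE (both inclusive).
window_sum <- function(df, col) {
  vapply(seq_len(nrow(df)), function(i) {
    in_window <- df$GROUP == df$GROUP[i] &
      df$DATE >= df$LAGGED[i] & df$DATE <= df$DATE[i]
    sum(df[[col]][in_window])
  }, numeric(1))
}

check$SUM1 <- window_sum(check, "SCORE1")
check$SUM2 <- window_sum(check, "SCORE2")
check
```

The SUM1 and SUM2 columns should agree with the join output above. For example, row 3 (group A, 2017-04-04) only picks up 2017-04-02 and 2017-04-04, giving 3 and 7.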
Data used:
input <- fread(' GROUP DATE LAGGED SCORE1 SCORE2
A 2017-04-01 2017-03-30 1 2
A 2017-04-02 2017-03-31 1 3
A 2017-04-04 2017-04-02 2 4
B 2017-04-02 2017-03-31 2 3
B 2017-04-05 2017-04-03 2 2
B 2017-04-08 2017-04-06 3 1
C 2017-04-02 2017-03-31 3 1
C 2017-04-03 2017-04-01 2 3')