我的问题最好用一个例子来解释。
library(data.table)
IDs <- 5
samplesPerId <- 100
set.seed(2019)
foo <- data.table(
id = rep(sample(1000000, size = 5, replace = FALSE), each = samplesPerId),
time = sample(999999, size = 5 * samplesPerId, replace = FALSE),
val = round(runif(n = 5 * samplesPerId, min = 0, max = 1), 2)
)
setorderv(foo, c("id", "time"))
foo[, val_cmltv_max := cummax(val), by = id]
bar <- data.table(time = seq(1, 999999, by = 1))
> foo
id time val val_cmltv_max
1: 459383 11250 0.83 0.83
2: 459383 13774 0.45 0.83
3: 459383 22266 0.27 0.83
4: 459383 44513 0.37 0.83
5: 459383 49432 0.86 0.86
---
496: 826316 950991 0.36 0.98
497: 826316 960187 0.80 0.98
498: 826316 961433 0.17 0.98
499: 826316 965398 0.36 0.98
500: 826316 994626 0.07 0.98
> bar
time
1: 1
2: 2
3: 3
4: 4
5: 5
---
999995: 999995
999996: 999996
999997: 999997
999998: 999998
999999: 999999
对于每个时间点1、2 ... 999999,我想获取该时间点已知的id的val_cmltv_max之和。例如,在时间1,总和应该为0,因为甚至不存在任何ID,而在时间999999,总和应该略低于5,因为有5个ID,并且到时间999999,每个ID的val_cmltv_max应该接近1。
在这里,我从每个时间点(1、2,...,9999999)的每个ID(1、2、3、4、5)的笛卡尔乘积表开始,这使 big 约500万行的中间表。然后,我使用滚动连接将来自foo的每个ID的最新记录连接到大中间表,然后我可以按时间总计val_cmltv_max的总和来汇总。
temp <- CJ(time = bar$time, id = sort(unique(foo$id)))
temp2 <- foo[temp, on = c("id", "time"), roll = TRUE]
result <- temp2[, list(sum_val_cmltv_max = sum(val_cmltv_max, na.rm = T)), by = time]
> result
time sum_val_cmltv_max
1: 1 0.00
2: 2 0.00
3: 3 0.00
4: 4 0.00
5: 5 0.00
---
999995: 999995 4.95
999996: 999996 4.95
999997: 999997 4.95
999998: 999998 4.95
999999: 999999 4.95
有没有一种方法可以快速而又高效地实现此目的,从而避免了巨大的中间表?
答案 0 :(得分:5)
U。发布5分钟后,我意识到了解决方法。
# get the first row per unique (id, val_cmltv_max)
changes <- foo[foo[, .I[1L], by = list(id, val_cmltv_max)]$V1]
# For each id, get the change in val_cmltv_max
# Would use shift() here but it's slow
# changes[, val_cmltv_max_prev := shift(val_cmltv_max, type = "lag", fill = 0), by = id]
changes[, val_cmltv_max_prev := c(0, head(val_cmltv_max, -1)), by = id]
changes[, change := val_cmltv_max - val_cmltv_max_prev]
# aggregate changes by time
changes <- changes[, list(change = sum(change)), by = time]
# insert into bar and cumsum
bar[, change := 0]
bar[changes, change := i.change, on = "time"]
bar[, sum_val_cmltv_max := cumsum(change)]