Rolling join + sum without exhausting memory

Time: 2019-11-27 21:05:56

Tags: r data.table

My question is best explained with an example.

Setup

library(data.table)

IDs <- 5
samplesPerId <- 100
set.seed(2019)

foo <- data.table(
  id = rep(sample(1000000, size = IDs, replace = FALSE), each = samplesPerId),
  time = sample(999999, size = IDs * samplesPerId, replace = FALSE),
  val = round(runif(n = IDs * samplesPerId, min = 0, max = 1), 2)
)

setorderv(foo, c("id", "time"))
foo[, val_cmltv_max := cummax(val), by = id]
bar <- data.table(time = seq(1, 999999, by = 1))

> foo
         id   time  val val_cmltv_max
  1: 459383  11250 0.83          0.83
  2: 459383  13774 0.45          0.83
  3: 459383  22266 0.27          0.83
  4: 459383  44513 0.37          0.83
  5: 459383  49432 0.86          0.86
 ---                                 
496: 826316 950991 0.36          0.98
497: 826316 960187 0.80          0.98
498: 826316 961433 0.17          0.98
499: 826316 965398 0.36          0.98
500: 826316 994626 0.07          0.98

> bar
          time
     1:      1
     2:      2
     3:      3
     4:      4
     5:      5
    ---       
999995: 999995
999996: 999996
999997: 999997
999998: 999998
999999: 999999

Goal

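For each time point 1, 2, ..., 999999, I want the sum of val_cmltv_max over the ids known at that time. For example, at time 1 the sum should be 0 because no id exists yet, while at time 999999 the sum should be slightly below 5, because there are 5 ids and by time 999999 each id's val_cmltv_max should be close to 1.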

Current solution

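Here I start by building the Cartesian product of each of the 5 ids with every time point (1, 2, ..., 999999), which makes a big intermediate table of about 5 million rows. I then use a rolling join to attach each id's most recent record from foo to that intermediate table, after which I can aggregate by time, summing val_cmltv_max.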

temp <- CJ(time = bar$time, id = sort(unique(foo$id)))
temp2 <- foo[temp, on = c("id", "time"), roll = TRUE]
result <- temp2[, list(sum_val_cmltv_max = sum(val_cmltv_max, na.rm = TRUE)), by = time]

> result
          time sum_val_cmltv_max
     1:      1              0.00
     2:      2              0.00
     3:      3              0.00
     4:      4              0.00
     5:      5              0.00
    ---                         
999995: 999995              4.95
999996: 999996              4.95
999997: 999997              4.95
999998: 999998              4.95
999999: 999999              4.95

Is there a fast, memory-efficient way to do this that avoids the huge intermediate table?

1 answer:

Answer 0 (score: 5)

Ugh. Five minutes after posting, I figured out a solution.

# get the first row per unique (id, val_cmltv_max)
changes <- foo[foo[, .I[1L], by = list(id, val_cmltv_max)]$V1]

# For each id, get the change in val_cmltv_max 
# Would use shift() here but it's slow
# changes[, val_cmltv_max_prev := shift(val_cmltv_max, type = "lag", fill = 0), by = id]
changes[, val_cmltv_max_prev := c(0, head(val_cmltv_max, -1)), by = id]
changes[, change := val_cmltv_max - val_cmltv_max_prev]

# aggregate changes by time
changes <- changes[, list(change = sum(change)), by = time]

# insert into bar and cumsum
bar[, change := 0]
bar[changes, change := i.change, on = "time"]
bar[, sum_val_cmltv_max := cumsum(change)]
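
The idea behind the code above appears to be that the total only changes at times when some id's cumulative maximum rises, so it is enough to record those increments and take a running sum over the time grid. As a sanity check, here is a minimal sketch on a small made-up dataset (the toy and grid tables and their values are hypothetical, not from the question) that compares this approach with the cross join + rolling join from the question:

```r
library(data.table)

# Hypothetical toy data: 2 ids observed at a few times within 1..10
toy <- data.table(
  id   = c(1L, 1L, 1L, 2L, 2L),
  time = c(2L, 5L, 8L, 4L, 5L),
  val  = c(0.3, 0.1, 0.7, 0.5, 0.9)
)
setorderv(toy, c("id", "time"))
toy[, val_cmltv_max := cummax(val), by = id]
grid <- data.table(time = 1:10)

# Reference result: cross join + rolling join, as in the question
ref <- toy[CJ(time = grid$time, id = unique(toy$id)),
           on = c("id", "time"), roll = TRUE][
  , list(sum_val_cmltv_max = sum(val_cmltv_max, na.rm = TRUE)), by = time]

# Delta approach: keep the first row of each (id, cumulative max) pair,
# i.e. the rows where an id's cumulative max first takes a new value
deltas <- toy[toy[, .I[1L], by = list(id, val_cmltv_max)]$V1]

# Per id, the increment over the previous cumulative max (0 before the first row)
deltas[, change := val_cmltv_max - c(0, head(val_cmltv_max, -1)), by = id]

# Several ids can change at the same time, so total the increments per time
deltas <- deltas[, list(change = sum(change)), by = time]

# Place the increments on the time grid and take the running sum
grid[, change := 0]
grid[deltas, change := i.change, on = "time"]
grid[, sum_val_cmltv_max := cumsum(change)]

# Both approaches should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(ref$sum_val_cmltv_max, grid$sum_val_cmltv_max)))
```

With this approach the largest object is the time grid itself (one row per time point), rather than one row per (time, id) pair.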