Summarize transactions meeting criteria within a sliding window

Time: 2016-09-22 16:51:25

Tags: r data.table dplyr

We have a table of transactions:

library(data.table)

set.seed(1)
X <- data.table(id   = 1:10,
                time = c(1, 2, 5, 6, 9, 12, 14, 20, 21, 23),
                val  = sample(0.1 * 10^(1:4), 10, replace = TRUE),
                code = sample(c('A', 'A', 'C', 'B'), 10, replace = TRUE))


    id time  val code
 1:  1    1   10    A
 2:  2    2   10    A
 3:  3    5  100    C
 4:  4    6 1000    A
 5:  5    9    1    B
 6:  6   12 1000    A
 7:  7   14 1000    C
 8:  8   20  100    B
 9:  9   21  100    A
10: 10   23    1    B

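For each row, I want to compute the number of occurrences of code == 'A' and the sum of val for those occurrences, over the current row and the previous rows with previous_row$time >= current_row$time - 3, i.e. the expected result should be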

    id time  val code count_A_within_3 sum_a_within_3
 1:  1    1   10    A                1             10
 2:  2    2   10    A                2             20
 3:  3    5  100    C                1             10
 4:  4    6 1000    A                1           1000
 5:  5    9    1    B                1           1000
 6:  6   12 1000    A                1           1000
 7:  7   14 1000    C                1           1000
 8:  8   20  100    B                0              0
 9:  9   21  100    A                1            100
10: 10   23    1    B                1            100

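Can this be computed efficiently using data.table or dplyr?

The real data set contains ~1M groups, and this operation should be performed within each group. The number of rows per group ranges from 1 to 1000. An imperative solution (a for loop with nested ifs and state variables) is feasible but very slow.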
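For reference, the window definition can be written out as a brute-force check in base R. This is only a sketch of the intended semantics, not the asker's loop and not an efficient solution, since it compares every row with every other row. The values are hard-coded from the printed table, because the default `sample()` algorithm changed in R 3.6.0 and `set.seed(1)` may no longer reproduce them; the helper name `in_window` is made up for this illustration.

```r
# Values hard-coded from the printed table in the question.
X <- data.frame(
  id   = 1:10,
  time = c(1, 2, 5, 6, 9, 12, 14, 20, 21, 23),
  val  = c(10, 10, 100, 1000, 1, 1000, 1000, 100, 100, 1),
  code = c("A", "A", "C", "A", "B", "A", "C", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows whose time lies in [time[i] - 3, time[i]].
# The window includes row i itself.
in_window <- function(i) X$time >= X$time[i] - 3 & X$time <= X$time[i]

# Number of 'A' rows inside each row's window.
X$count_A_within_3 <- vapply(seq_len(nrow(X)), function(i) {
  sum(X$code[in_window(i)] == "A")
}, integer(1))

# Sum of val over the 'A' rows inside each row's window.
X$sum_a_within_3 <- vapply(seq_len(nrow(X)), function(i) {
  sum(X$val[in_window(i) & X$code == "A"])
}, numeric(1))

print(X)
```

The two new columns should match the expected table above, which makes a check like this useful for validating a faster solution on small inputs.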

1 answer:

Answer 0: (score: 2)

Using non-equi joins, available in the latest devel version (1.9.7+):

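# Lower bound of each row's window
X[, prev.time := time - 3]

# Self-join: for each row of the inner X used as i, match the rows of x
# with x.time >= i.prev.time and x.time <= i.time, then aggregate per i row
X[, c("count_A_within_3", "sum_a_within_3") :=
      X[X, on = .(time >= prev.time, time <= time),
        .(sum(code == "A"), sum(val[code == "A"])), by = .EACHI][, .(V1, V2)]]
X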
#    id time  val code prev.time count_A_within_3 sum_a_within_3
# 1:  1    1   10    A        -2                1             10
# 2:  2    2   10    A        -1                2             20
# 3:  3    5  100    C         2                1             10
# 4:  4    6 1000    A         3                1           1000
# 5:  5    9    1    B         6                1           1000
# 6:  6   12 1000    A         9                1           1000
# 7:  7   14 1000    C        11                1           1000
# 8:  8   20  100    B        17                0              0
# 9:  9   21  100    A        18                1            100
#10: 10   23    1    B        20                1            100

You may want to replace the two inner X's with .SD for more robust code; I left them as X so that it is clear and easy to see how it works.
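Since the question mentions that the real data is split into many groups, one possible extension is to add the grouping column as an equality condition in `on`, so that the whole table is handled in a single join. The sketch below assumes a hypothetical column `grp` and made-up values, and it has not been benchmarked on data of the size described.

```r
library(data.table)

# Two hypothetical groups; the values are made up for illustration only.
X <- data.table(
  grp  = c("g1", "g1", "g1", "g2", "g2"),
  time = c(1, 2, 5, 1, 3),
  val  = c(10, 10, 100, 1000, 1),
  code = c("A", "A", "C", "A", "B")
)

# Lower bound of each row's window.
X[, prev.time := time - 3]

# Same non-equi self-join as above, with grp added as an equality condition
# so that rows only match within their own group.
res <- X[X, on = .(grp, time >= prev.time, time <= time),
         .(count_A = sum(code == "A"), sum_a = sum(val[code == "A"])),
         by = .EACHI]

# by = .EACHI returns one row per row of i, in i's order, so the
# aggregated columns line up with the rows of X.
X[, c("count_A_within_3", "sum_a_within_3") := res[, .(count_A, sum_a)]]
print(X)
```

Naming the aggregated columns (`count_A`, `sum_a`) instead of relying on the default `V1` and `V2` is a stylistic choice here; it keeps the final assignment readable.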