We have a transaction table:
library(data.table)
set.seed(1)
X <- data.table(id = 1:10,
                time = c(1,2,5,6,9,12,14,20,21,23),
                val = sample(0.1*10^(1:4), 10, replace=TRUE),
                code = sample(c('A','A','C','B'), 10, replace=TRUE)
)
    id time  val code
 1:  1    1   10    A
 2:  2    2   10    A
 3:  3    5  100    C
 4:  4    6 1000    A
 5:  5    9    1    B
 6:  6   12 1000    A
 7:  7   14 1000    C
 8:  8   20  100    B
 9:  9   21  100    A
10: 10   23    1    B
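For each row, I want to compute the number of occurrences of code == 'A' and the sum of val for those occurrences, taken over the current row and the previous rows that satisfy previous_row$time >= current_row$time - 3.
That is, the expected result should be: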
    id time  val code count_A_within_3 sum_a_within_3
 1:  1    1   10    A                1             10
 2:  2    2   10    A                2             20
 3:  3    5  100    C                1             10
 4:  4    6 1000    A                1           1000
 5:  5    9    1    B                1           1000
 6:  6   12 1000    A                1           1000
 7:  7   14 1000    C                1           1000
 8:  8   20  100    B                0              0
 9:  9   21  100    A                1            100
10: 10   23    1    B                1            100
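Can this be computed efficiently with data.table or dplyr?
The real dataset contains ~1M groups, and this operation has to be performed within each group. The number of rows per group ranges from 1 to 1000. An imperative solution (a for loop with nested ifs and state variables) is feasible but very slow.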
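For reference, the windowing rule above can be written out row by row. This is only a minimal base R sketch of the slow baseline, not the efficient solution being asked for. The helper name window_stats is made up for illustration, and val and code are hard-coded from the printed table because the output of sample() after set.seed(1) depends on the R version.

```r
# Data copied from the printed table (hard-coded so the result does not
# depend on the R version's sampling algorithm).
X <- data.frame(
  id   = 1:10,
  time = c(1, 2, 5, 6, 9, 12, 14, 20, 21, 23),
  val  = c(10, 10, 100, 1000, 1, 1000, 1000, 100, 100, 1),
  code = c("A", "A", "C", "A", "B", "A", "C", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row i, look at rows whose time lies in
# [time[i] - 3, time[i]] (this includes row i itself) and keep only code "A".
window_stats <- function(d) {
  n <- nrow(d)
  count_a <- integer(n)
  sum_a <- numeric(n)
  for (i in seq_len(n)) {
    hit <- d$time >= d$time[i] - 3 & d$time <= d$time[i] & d$code == "A"
    count_a[i] <- sum(hit)
    sum_a[i] <- sum(d$val[hit])
  }
  data.frame(count_A_within_3 = count_a, sum_a_within_3 = sum_a)
}

res <- window_stats(X)
```

Each row scans the whole table, so the cost is quadratic in the number of rows per group, which is why a join-based approach is attractive at ~1M groups.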
Answer 0 (score: 2)
Using non-equi joins from the latest devel version (1.9.7+):
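# lower bound of each row's look-back window
X[, prev.time := time - 3]
# self-join: for each row of the inner i table, match rows of the inner x table
# with time in [prev.time, time], then aggregate the code == "A" matches per row
X[, c("count_A_within_3", "sum_a_within_3") :=
    X[X, on = .(time >= prev.time, time <= time),
      .(sum(code == "A"), sum(val[code == "A"])), by = .EACHI][, .(V1, V2)]]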
X
#     id time  val code prev.time count_A_within_3 sum_a_within_3
#  1:  1    1   10    A        -2                1             10
#  2:  2    2   10    A        -1                2             20
#  3:  3    5  100    C         2                1             10
#  4:  4    6 1000    A         3                1           1000
#  5:  5    9    1    B         6                1           1000
#  6:  6   12 1000    A         9                1           1000
#  7:  7   14 1000    C        11                1           1000
#  8:  8   20  100    B        17                0              0
#  9:  9   21  100    A        18                1            100
# 10: 10   23    1    B        20                1            100
You may want to replace the two inner X with .SD for more robust code; I just left them as X to make it clear and easy to see how this works.
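Since the question mentions that the real data is split into groups, one possible way to extend the join is to add the grouping column as an equality condition in on. The following is a sketch under assumptions, not part of the original answer: the column name grp and the small two-group table DT are hypothetical, and it assumes a data.table version with non-equi join support (the answer cites 1.9.7+).

```r
library(data.table)

# Hypothetical two-group example; `grp` is an assumed column name.
DT <- data.table(
  grp  = c(1, 1, 1, 2, 2),
  time = c(1, 2, 5, 1, 3),
  val  = c(10, 10, 100, 1000, 1),
  code = c("A", "A", "C", "A", "B")
)
DT[, prev.time := time - 3]

# Same non-equi self-join as above, with `grp` added as an equality condition
# so that a row only matches rows from its own group.
stats <- DT[DT, on = .(grp, time >= prev.time, time <= time),
            .(count_A = sum(code == "A"), sum_A = sum(val[code == "A"])),
            by = .EACHI]

# by = .EACHI returns one row per row of the inner DT, in the same order,
# so the two aggregate columns can be assigned back directly.
DT[, c("count_A_within_3", "sum_a_within_3") := stats[, .(count_A, sum_A)]]
```

Putting the group key into the join keeps everything in a single join call instead of running one join per group, which may matter with ~1M groups; whether it is actually faster on the real data would need to be measured.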