Applying cumsum() from different starting points

Time: 2018-02-28 15:55:33

Tags: r data.table vectorization

I have the following data:

library(data.table)

set.seed(42)
t <- data.table(time=1:1000, value=runif(100,0,1))
p <- data.table(id=1:10, cut=sample(1:100,5))
vals <- 1:5

> head(t)
   time     value
1:    1 0.9148060
2:    2 0.9370754
3:    3 0.2861395
4:    4 0.8304476
5:    5 0.6417455
6:    6 0.5190959

> head(p)
    id cut
 1:  1  63
 2:  2  22
 3:  3  99
 4:  4  38
 5:  5  91
 6:  6  63

> vals
[1] 1 2 3 4 5

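Here t holds a vector of values associated with time points, and p gives each person (id) a cutoff value of time.

For each person, I want the number of time units it takes to accumulate each of the values in vals.

My current approach is a for loop that builds, for each person, a temporary vector of cumulative sums starting just after that person's cutoff in time. I then use findInterval() to find the position at which the cumsum reaches each level in vals.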

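# one row per person, one column per target level in vals
out <- matrix(NA, nrow=nrow(p), ncol=length(vals))
colnames(out) <- vals
for(i in seq_len(nrow(p))){
   # running total of the values strictly after this person's cutoff
   temp <- cumsum(t$value[t$time > p$cut[i]])
   temp <- temp[!is.na(temp)]
   # number of time steps whose running total is still <= each level
   out[i,] <- findInterval(vals, temp)
}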

which should produce

      1 2 3 4  5
 [1,] 1 4 5 9 12
 [2,] 1 2 5 6  7
 [3,] 1 2 4 5  7
 [4,] 1 3 5 7  8
 [5,] 2 3 5 7  8
 [6,] 1 4 5 9 12
 [7,] 1 2 5 6  7
 [8,] 1 2 4 5  7
 [9,] 1 3 5 7  8
[10,] 2 3 5 7  8

This is of course very inefficient and does not do justice to the power of R. Is there a way to speed this up?

1 answer:

Answer 0 (score: 2)

I would do

# precompute cumsum on full table
t[, cs := cumsum(value)]

# compute one time per unique cut value, not per id
cuts = unique(p[, .(t_cut = cut)])

# look up value at cut time
cuts[t, on=.(t_cut = time), v_cut := i.cs]

# look up time at every cut value combo
cutres = cuts[, .(pt = vals + v_cut), by=t_cut][, .(
  t_cut, 
  v = vals,
  t_plus = t[.SD, on=.(cs = pt), roll=TRUE, x.time] - t_cut
)]

which gives

    t_cut v t_plus
 1:    63 1      1
 2:    63 2      4
 3:    63 3      5
 4:    63 4      9
 5:    63 5     12
 6:    22 1      1
 7:    22 2      2
 8:    22 3      5
 9:    22 4      6
10:    22 5      7
11:    99 1      1
12:    99 2      2
13:    99 3      4
14:    99 4      5
15:    99 5      7
16:    38 1      1
17:    38 2      3
18:    38 3      5
19:    38 4      7
20:    38 5      8
21:    91 1      2
22:    91 2      3
23:    91 3      5
24:    91 4      7
25:    91 5      8
    t_cut v t_plus
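
The same idea (compute the cumulative sum once, shift each target by the running total at the cutoff, then look up the position) can be sketched in base R. This is only an illustration on made-up toy data, not part of the original answer. It assumes that time equals the row index, that every cutoff is a valid row index, and that all values are non-negative, so the cumulative sum is sorted. The names value, cuts, loop_out and fast_out are hypothetical.

```r
# Toy data (hypothetical): position in `value` plays the role of time.
value <- c(0.5, 0.25, 1, 0.75, 0.5, 2, 0.25, 1)
cuts  <- c(2, 5, 2)   # one cutoff per person
vals  <- c(1, 2)      # levels to accumulate

# Reference: the per-person loop, recomputing a cumsum after each cutoff.
loop_out <- do.call(rbind, lapply(cuts, function(k) {
  findInterval(vals, cumsum(value[seq_along(value) > k]))
}))

# Sketch of the precomputed version: one cumsum for everything.
cs <- cumsum(value)
# targets[i, j] = running total at person i's cutoff + level j
targets <- outer(cs[cuts], vals, "+")
# Position of each target in the global cumsum, counted from the cutoff.
# `- cuts` recycles down the columns, so row i is shifted by cuts[i].
fast_out <- matrix(findInterval(targets, cs), nrow = length(cuts)) - cuts

print(fast_out)
```

On this toy input both versions agree. With arbitrary floating-point values the two could in principle differ at exact ties, because the sums are accumulated in a different order.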

If you want to map cutres back to id and reshape it into an id x vals table...

cutres[p, on=.(t_cut = cut), allow.cartesian=TRUE, 
  dcast(.SD, id ~ v, value.var = "t_plus")]

    id 1 2 3 4  5
 1:  1 1 4 5 9 12
 2:  2 1 2 5 6  7
 3:  3 1 2 4 5  7
 4:  4 1 3 5 7  8
 5:  5 2 3 5 7  8
 6:  6 1 4 5 9 12
 7:  7 1 2 5 6  7
 8:  8 1 2 4 5  7
 9:  9 1 3 5 7  8
10: 10 2 3 5 7  8

(Alternatively, the key step could be written as t_plus = t[.SD, on=.(cs = pt), roll=TRUE, which=TRUE] - t_cut, since t$time is the row number.)
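
To see why the two spellings of the rolling join are interchangeable here, the following small sketch compares them on a toy table. It is only an illustration: toy, targets, by_time and by_row are hypothetical names, and the equivalence relies on time being exactly the row number.

```r
library(data.table)

# Toy table (hypothetical) in which time is simply the row number.
toy <- data.table(time = 1:6, value = c(0.5, 0.25, 1, 0.75, 0.5, 2))
toy[, cs := cumsum(value)]

# Cumulative totals to look up.
targets <- data.table(pt = c(1, 2, 3))

# Rolling join: for each pt, take the last row whose cs is <= pt.
by_time <- toy[targets, on = .(cs = pt), roll = TRUE, x.time]       # time of that row
by_row  <- toy[targets, on = .(cs = pt), roll = TRUE, which = TRUE] # its row number

print(by_time)
print(by_row)
```

Both calls should return the same positions, and for targets at or above the first cumulative value they match findInterval(targets$pt, toy$cs).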