R - Rolling window on a data.table

Date: 2014-05-11 20:46:03

Tags: r data.table

I have the following data.table:

          time       id type   price      size  api start.point  end.point
 1: 1399672906 37119594  ASK 440.002 1.4840000 TRUE  1399672606 1399672906
 2: 1399672940 37119597  BID 441.000 0.1758830 TRUE  1399672640 1399672940
 3: 1399672940 37119598  BID 441.000 0.0491166 TRUE  1399672640 1399672940
 4: 1399673105 37119638  ASK 440.002 0.1313700 TRUE  1399672805 1399673105
 5: 1399673198 37119668  BID 441.000 0.0233013 TRUE  1399672898 1399673198
 6: 1399673198 37119669  BID 441.000 0.9744230 TRUE  1399672898 1399673198
 7: 1399673208 37119675  BID 441.000 0.1587060 TRUE  1399672908 1399673208
 8: 1399673208 37119676  BID 441.000 0.1238870 TRUE  1399672908 1399673208
 9: 1399673208 37119677  BID 441.001 0.0100000 TRUE  1399672908 1399673208
10: 1399673208 37119678  BID 441.175 0.0129740 TRUE  1399672908 1399673208
11: 1399673208 37119679  BID 441.192 0.0100000 TRUE  1399672908 1399673208
12: 1399673208 37119680  BID 441.399 0.0129740 TRUE  1399672908 1399673208
13: 1399673208 37119681  BID 441.499 1.7500000 TRUE  1399672908 1399673208
14: 1399673208 37119682  BID 441.500 8.0214600 TRUE  1399672908 1399673208
15: 1399673241 37119691  BID 441.500 0.0453001 TRUE  1399672941 1399673241
16: 1399673274 37119696  ASK 440.030 0.9133460 TRUE  1399672974 1399673274
17: 1399673360 37119705  BID 440.030 0.0580000 TRUE  1399673060 1399673360
18: 1399673433 37119709  ASK 440.002 0.0319611 TRUE  1399673133 1399673433
19: 1399673506 37119711  ASK 440.002 0.2618460 TRUE  1399673206 1399673506
20: 1399673507 37119712  BID 440.002 1.0000000 TRUE  1399673207 1399673507

where:

  • time is a Unix timestamp
  • id is the trade ID assigned by the exchange
  • start.point is time minus 5 minutes (300 seconds)
  • end.point is simply equal to time
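For reference, the two window columns can be derived from time in a single step. This is a minimal sketch, assuming the window is exactly 300 seconds; the three-row table and the variable name window are illustrative only and not part of the original data set.

```r
library(data.table)

# Illustrative subset of the trade times shown above
trades <- data.table(time = c(1399672906L, 1399672940L, 1399673105L))

window <- 300L  # assumed window length: 5 minutes, in seconds

# The window ends at the trade itself and starts 5 minutes earlier
trades[, `:=`(start.point = time - window, end.point = time)]
```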

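The series is not equally spaced. The variables start.point and end.point effectively define a 5-minute moving window that ends at time. I want to compute the frequency of trades, i.e. the number of trades, within each such window.

I got it working with a for loop: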

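# status.bar is assumed to have been created earlier, e.g. with txtProgressBar()
for (i in seq_len(nrow(trades))) {
  # count the distinct trade ids whose time falls inside row i's window
  trades[i, freq := length(unique(trades[time >= start.point[i] & time <= end.point[i]]$id))]
  setTxtProgressBar(status.bar, i)
}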

However, I was wondering whether there is a more idiomatic "data.table" way to do this. I tried something like:

trades[, freq := list(length(unique(trades[time >= start.point & time <= end.point,]$id))), by = list(id)]

But the result is wrong; it does not seem to be evaluated row by row:

            time       id type   price       size  api start.point  end.point freq
  1: 1399672906 37119594  ASK 440.002  1.4840000 TRUE  1399672606 1399672906  100
  2: 1399672940 37119597  BID 441.000  0.1758830 TRUE  1399672640 1399672940  100
  3: 1399672940 37119598  BID 441.000  0.0491166 TRUE  1399672640 1399672940  100
  4: 1399673105 37119638  ASK 440.002  0.1313700 TRUE  1399672805 1399673105  100
  5: 1399673198 37119668  BID 441.000  0.0233013 TRUE  1399672898 1399673198  100
  6: 1399673198 37119669  BID 441.000  0.9744230 TRUE  1399672898 1399673198  100
  7: 1399673208 37119675  BID 441.000  0.1587060 TRUE  1399672908 1399673208  100
  8: 1399673208 37119676  BID 441.000  0.1238870 TRUE  1399672908 1399673208  100
  9: 1399673208 37119677  BID 441.001  0.0100000 TRUE  1399672908 1399673208  100
 10: 1399673208 37119678  BID 441.175  0.0129740 TRUE  1399672908 1399673208  100
 11: 1399673208 37119679  BID 441.192  0.0100000 TRUE  1399672908 1399673208  100
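A likely reason for the constant value: inside the inner trades[...] call, the names time, start.point and end.point all refer to the full columns of the inner table, not to the values of the current by group. Each row is therefore compared with its own window, every row passes, and the count is the total number of distinct ids. The sketch below reproduces this on the six rows from the dput output and shows one possible row-wise alternative using vapply; it is an illustration under those assumptions, not necessarily the fastest approach.

```r
library(data.table)

# Six rows taken from the dput output in the update
trades <- data.table(
  time = c(1399672906L, 1399673105L, 1399673274L,
           1399673433L, 1399673506L, 1399673531L),
  id   = c(37119594L, 37119638L, 37119696L,
           37119709L, 37119711L, 37119717L)
)
trades[, `:=`(start.point = time - 300L, end.point = time)]

# What the attempt effectively computes: each row tested against its own
# window, so every row matches and the result is the total number of ids
all_rows <- trades[time >= start.point & time <= end.point, length(unique(id))]

# Row-wise alternative: index the window bounds explicitly with i
trades[, freq := vapply(
  seq_len(.N),
  function(i) length(unique(id[time >= start.point[i] & time <= end.point[i]])),
  integer(1)
)]
```

On this subset, all_rows equals the number of rows, while freq varies from row to row.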

Update

Here is the structure (dput output):

structure(list(time = c(1399672906L, 1399673105L, 1399673274L, 
1399673433L, 1399673506L, 1399673531L), id = c(37119594L, 37119638L, 
37119696L, 37119709L, 37119711L, 37119717L), type = c("ASK", 
"ASK", "ASK", "ASK", "ASK", "ASK"), price = c(440.002, 440.002, 
440.03, 440.002, 440.002, 440), size = c(1.484, 0.13137, 0.913346, 
0.0319611, 0.261846, 3.168), api = c(TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE), start.point = c(1399672606, 1399672805, 1399672974, 
1399673133, 1399673206, 1399673231), end.point = c(1399672906L, 
1399673105L, 1399673274L, 1399673433L, 1399673506L, 1399673531L
), freq = c(1L, 4L, 13L, 14L, 13L, 11L)), .Names = c("time", 
"id", "type", "price", "size", "api", "start.point", "end.point", 
"freq"), sorted = c("type", "time"), class = c("data.table", 
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000002e50788>)

1 answer:

Answer 0 (score: 4)

I think this is currently best done with the Bioconductor package IRanges, until interval joins / range joins are implemented in data.table.

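require(data.table)
require(IRanges)
# each trade time as a width-1 range (the query)
ir1 = IRanges(trades$time, width=1L)
# each row's 5-minute window (the subject)
ir2 = IRanges(trades$start.point, trades$end.point)

# (trade, window) pairs where the trade time lies inside the window
olaps = findOverlaps(ir1, ir2, type = "within")
# number of trades per window; V2 is the window's row index
dt = data.table(queryHits(olaps), subjectHits(olaps))[, .N, by=V2]

# write the counts back to the matching rows
trades[dt$V2, freq := dt$N]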

#          time       id type   price      size  api start.point  end.point freq
# 1: 1399672906 37119594  ASK 440.002 1.4840000 TRUE  1399672606 1399672906    1
# 2: 1399673105 37119638  ASK 440.002 0.1313700 TRUE  1399672805 1399673105    2
# 3: 1399673274 37119696  ASK 440.030 0.9133460 TRUE  1399672974 1399673274    2
# 4: 1399673433 37119709  ASK 440.002 0.0319611 TRUE  1399673133 1399673433    2
# 5: 1399673506 37119711  ASK 440.002 0.2618460 TRUE  1399673206 1399673506    3
# 6: 1399673531 37119717  ASK 440.000 3.1680000 TRUE  1399673231 1399673531    4
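Later releases of data.table did add range-style joins: foverlaps() and non-equi conditions in on=. The following is a sketch of the non-equi variant, assuming a data.table version that supports it (1.9.8 or later, as far as I know). The helper table windows is introduced here for illustration only, and the data are the six rows from the question's dput output.

```r
library(data.table)

trades <- data.table(
  time = c(1399672906L, 1399673105L, 1399673274L,
           1399673433L, 1399673506L, 1399673531L),
  id   = c(37119594L, 37119638L, 37119696L,
           37119709L, 37119711L, 37119717L)
)
trades[, `:=`(start.point = time - 300L, end.point = time)]

# One row per window, in the same order as trades (illustrative helper)
windows <- trades[, .(start.point, end.point)]

# For each window, count the trades whose time lies inside it
counts <- trades[windows, .N,
                 on = .(time >= start.point, time <= end.point),
                 by = .EACHI]

# by = .EACHI keeps the row order of windows, so N lines up with trades
trades[, freq := counts$N]
```

This should give the same freq column as the IRanges result above, without the extra dependency.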

HTH