Using data.table

Time: 2019-02-20 03:37:45

Tags: r data.table

I have some data:

head(stockAtt, 10)
          DATE         TIME EX SYM_ROOT SIZE
 1: 2018-12-03 34201.549405  X        T    1
 2: 2018-12-03 34201.549405  P        T   28
 3: 2018-12-03 34301.549405  P        T   28
 4: 2018-12-03 35401.549405  T        T   36
 5: 2018-12-03 35501.549405  T        T   36
 6: 2018-12-03 36601.549405  T        T   36
 7: 2018-12-03 36101.549405  Z        T    3
 8: 2018-12-03 36801.549405  Z        T   23
 9: 2018-12-03 37001.549405  Z        T   16
10: 2018-12-03 39001.549405  X        T    5

I also have a sequence of times in seconds, which can be treated as bin boundaries.

seq(from = 34200, to = 40000, by = 1000 )
[1] 34200 35200 36200 37200 38200 39200

I want to split the data.table into pieces according to the interval that TIME falls in, as shown below.

         DATE         TIME EX SYM_ROOT SIZE
1: 2018-12-03 34201.549405  X        T    1
2: 2018-12-03 34201.549405  P        T   28
3: 2018-12-03 34301.549405  P        T   28
         DATE         TIME EX SYM_ROOT SIZE
1: 2018-12-03 35401.549405  T        T   36
2: 2018-12-03 35501.549405  T        T   36
         DATE         TIME EX SYM_ROOT SIZE
1: 2018-12-03 36601.549405  T        T   36
2: 2018-12-03 36101.549405  Z        T    3
3: 2018-12-03 36801.549405  Z        T   23
         DATE         TIME EX SYM_ROOT SIZE
1: 2018-12-03 37001.549405  Z        T   16
         DATE         TIME EX SYM_ROOT SIZE
1: 2018-12-03 39001.549405  X        T    5

1 answer:

Answer 0: (score: 1)

Here are some options:

1) Using split, which dispatches to data.table's split.data.table method:

split(DT, DT[, cut(TIME, seq(34200, 40000, 1000))])

2) Using cut inside by:

DT[, .(.(as.data.table(c(.(TIME=TIME), .SD)))), by=cut(TIME, seq(34200, 40000, 1000))]$V1

DT[, tm := TIME][, .(.(.SD)), by=cut(tm, seq(34200, 40000, 1000))]$V1

3) Another approach, suggested by jangorecki in the comments:

data.table:::split.data.table(DT[, cut_col := cut(TIME, seq(34200, 40000, 1000))], by="cut_col")
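As a minimal sketch of option 1 on the question's sample rows: the table below is retyped from the printed output in the question, and `drop = TRUE` and `dig.lab = 5` are additions for readability, not part of the options above.

```r
library(data.table)

# Sample rows retyped from the question's printed stockAtt
stockAtt <- data.table(
  DATE = as.Date("2018-12-03"),
  TIME = c(34201.549405, 34201.549405, 34301.549405, 35401.549405,
           35501.549405, 36601.549405, 36101.549405, 36801.549405,
           37001.549405, 39001.549405),
  EX = c("X", "P", "P", "T", "T", "T", "Z", "Z", "Z", "X"),
  SYM_ROOT = "T",
  SIZE = c(1L, 28L, 28L, 36L, 36L, 36L, 3L, 23L, 16L, 5L)
)

# Bin boundaries from the question: 34200, 35200, ..., 39200
breaks <- seq(34200, 40000, 1000)

# One factor level per interval; dig.lab = 5 keeps the labels out of
# scientific notation
bins <- cut(stockAtt$TIME, breaks, dig.lab = 5)

# A list with one data.table per bin; drop = TRUE omits bins with no rows
pieces <- split(stockAtt, bins, drop = TRUE)

# Number of rows in each piece
sapply(pieces, nrow)
```

With these breaks the pieces hold 3, 3, 3 and 1 rows. Note that 36101.549405 lands in the (35200,36200] bin, not alongside 36601.549405, because cut assigns rows strictly by the break points. The (37200,38200] bin is empty and is dropped.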

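The real workhorse here is cut. From the help page for cut:

"cut divides the range of x into intervals and codes the values in x according to which interval they fall."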
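To make that behaviour concrete, here is a small base R sketch using the same breaks. The `times` vector is made up for illustration. By default the intervals are open on the left and closed on the right, and values that fall outside the breaks are coded as NA.

```r
# Same breaks as above: 34200, 35200, 36200, 37200, 38200, 39200
breaks <- seq(from = 34200, to = 40000, by = 1000)

# Hypothetical values chosen to probe the interval edges
times <- c(34200, 34201.5, 35200, 36101.5, 39001.5, 39500)

# Default is right = TRUE, include.lowest = FALSE
codes <- cut(times, breaks, dig.lab = 5)

# Integer code of the interval each value falls in
as.integer(codes)
# NA  1  1  2  5 NA
```

The first value equals the lowest break and is excluded unless `include.lowest = TRUE` is passed. The last value is NA because the sequence stops at 39200, so nothing above 39200 is covered even though `to = 40000`.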


Timing code:

library(data.table)

set.seed(0L)
nr <- 1e7
# TIME is drawn from N(37100, 1), so essentially every row falls in the (36200,37200] bin
DT <- data.table(TIME=rnorm(nr, 37100))
# separate copies, because the := calls below modify their table by reference
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
microbenchmark::microbenchmark(
    split_f=data.table:::split.data.table(DT, f=DT[, cut(TIME, seq(34200, 40000, 1000))]),
    split_by=data.table:::split.data.table(DT2[, cut_col := cut(TIME, seq(34200, 40000, 1000))], by="cut_col"),
    by1=DT3[, tm := TIME][, .(.(.SD)), by=cut(tm, seq(34200, 40000, 1000))]$V1,
    by2=DT4[, .(.(as.data.table(c(.(TIME=TIME), .SD)))), by=cut(TIME, seq(34200, 40000, 1000))]$V1,
    times=3L
)

Timings:

Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
  split_f 691.6382 716.6919 748.6798 741.7457 777.2006 812.6554     3   a
 split_by 840.0505 910.3817 938.2106 980.7129 987.2906 993.8683     3   a
      by1 738.8859 749.1444 797.0015 759.4029 826.0593 892.7157     3   a
      by2 623.7743 667.5200 720.1821 711.2658 768.3860 825.5063     3   a