How can I quickly compute a new column in a data.table?

Asked: 2018-03-29 10:36:21

Tags: r performance data.table

I have two data.tables in R. Ticks:

        ask     bid          createTime
 1: 106.788 106.487 2018-03-01 00:00:01
 2: 106.788 106.487 2018-03-01 00:00:01
 3: 106.788 106.487 2018-03-01 00:00:02
 4: 106.788 106.487 2018-03-01 00:00:02
 5: 106.788 106.487 2018-03-01 00:00:03
         .         .
         .         .
992698: 105.730 105.431 2018-03-06 23:59:56
992699: 105.730 105.431 2018-03-06 23:59:56
992700: 105.732 105.431 2018-03-06 23:59:57
992701: 105.732 105.431 2018-03-06 23:59:57
992702: 105.732 105.431 2018-03-06 23:59:59

and Bars:

     volume                from                  to
  1.196550000 2018-03-01 00:00:00 2018-03-01 00:01:00
  2.233350000 2018-03-01 00:01:00 2018-03-01 00:02:00
  3.201950000 2018-03-01 00:02:00 2018-03-01 00:03:00
  4.97700000 2018-03-01 00:03:00 2018-03-01 00:04:00
  5.34200000 2018-03-01 00:04:00 2018-03-01 00:05:00
                .         .
                .         .     
8068:53800000 2018-03-06 23:55:00 2018-03-06 23:56:00

For each row of the Bars table, I want to count the Ticks rows where createTime >= from and createTime < to. Like this:

    volume                from                  to     TicksCount
  1.196550000 2018-03-01 00:00:00 2018-03-01 00:01:00     187
  2.233350000 2018-03-01 00:01:00 2018-03-01 00:02:00     72
  3.201950000 2018-03-01 00:02:00 2018-03-01 00:03:00     56
  4.97700000 2018-03-01 00:03:00 2018-03-01 00:04:00      58
  5.34200000 2018-03-01 00:04:00 2018-03-01 00:05:00      52

I found a way to do this, but it is very slow. This is what I tried:

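    # For every bar, subset Ticks to the half-open window [from, to) and count rows.
    # Each iteration scans the whole Ticks table, which is why this is slow.
    Bars[, TicksCount := sapply(seq_len(nrow(Bars)), function(i) {
      nrow(Ticks[Bars$from[i] <= createTime & createTime < Bars$to[i]])
    })]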

Does anyone know how to make this faster? Any help is appreciated!

2 Answers:

Answer 0 (score: 1)

data.table::foverlaps() does this quickly.

Your two tables (simulated here, with numeric values standing in for the timestamps):

library(data.table)

ticks <-
  data.table(
    ask = runif(1e5, 0, 1e5),
    bid = runif(1e5, 0, 1e5),
    createTime = runif(1e5, 0, 1e3)
  )

bars <-
  data.table(
    volume = runif(1e3, 0, 1e3),
    from = seq(0, 1e3 - 1, 1),
    to = seq(1, 1e3)
  )

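To use foverlaps(), both tables need a range (a start column and an end column), not just one of them. So add a helper column to ticks to create a temporary range whose start and end are the same: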

ticks[, helper := createTime]

Then create an ID for each bar (assuming bars has no duplicates and no overlapping ranges):

bars[, bar.id := .I]

Each table must have a data.table key in which the first key column is the range start and the second is the range end:

setkey(ticks, createTime, helper)
setkey(bars, from, to)

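Then run foverlaps with type = 'within' on the two data sets, where x is ticks and y is bars. This creates a new table by joining x and y on overlapping ranges (rows where the x range falls within the y range). The second step below aggregates that new table, counting ticks by bar.id, and the third step joins the aggregated table back to bars, which yields bars with an added ticksCount column.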

foverlaps(ticks, bars, type = 'within')[,
    .(ticksCount = .N), .(bar.id)
        ][bars, on = 'bar.id']
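
One caveat worth checking: as far as I know, foverlaps() treats ranges as closed on both ends, so a tick landing exactly on a bar boundary could be counted in two adjacent bars. If the half-open condition from the question (from <= createTime < to) matters, a non-equi join is one alternative. The following is only a minimal sketch on hypothetical toy data (numeric times instead of POSIXct, column names taken from the question), not a benchmarked replacement:

```r
library(data.table)

# Hypothetical toy data: numeric times stand in for POSIXct timestamps.
ticks <- data.table(createTime = c(0.2, 0.5, 1.0, 1.7, 2.0, 2.5, 2.9))
bars  <- data.table(from = c(0, 1, 2), to = c(1, 2, 3))

# Non-equi join: for each row of bars (by = .EACHI), count the ticks
# that satisfy from <= createTime < to. Results come back in bars' row order.
counts <- ticks[bars, on = .(createTime >= from, createTime < to), .N, by = .EACHI]

# Attach the per-bar counts to bars by reference.
bars[, TicksCount := counts$N]
print(bars)
```

Here the tick at 1.0 is counted only in the second bar and the tick at 2.0 only in the third, because the upper bound is strict.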

Answer 1 (score: 0)

Another way with sapply

Try this simple solution:
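# Count the ticks that fall inside bar i, i.e. from[i] <= createTime < to[i].
f <- function(i, Bars, Ticks) {
  sum(Bars$from[i] <= Ticks$createTime & Ticks$createTime < Bars$to[i])
}

# One count per row of Bars, so the result has the same length as Bars.
Bars$TickCount <- sapply(seq_len(nrow(Bars)), f, Bars = Bars, Ticks = Ticks)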

Your output:

Bars
   volume                from                  to TickCount
1 1.19655 2018-03-01 00:00:00 2018-03-01 00:01:00         2
2 2.23335 2018-03-01 00:00:00 2018-03-01 00:02:00         2
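
If the bars are sorted by from and do not overlap (as the one-minute bars in the question appear to be), the per-bar loop can probably be avoided altogether with base R's findInterval(). This is a rough sketch on made-up numeric data, with plain vectors standing in for Bars$from, Bars$to and Ticks$createTime; it assumes sorted, non-overlapping bars and is not taken from either answer:

```r
# Hypothetical toy data: three one-minute bars, times in seconds.
from <- c(0, 60, 120)
to   <- c(60, 120, 180)
createTime <- c(1, 1, 59, 60, 130, 179, 200)

# Index of the last bar whose start is <= createTime (0 if before the first bar).
idx <- findInterval(createTime, from)

# Keep only ticks that are also strictly before the end of that bar.
inside <- idx > 0
inside[inside] <- createTime[inside] < to[idx[inside]]

# Count ticks per bar; bars without ticks get 0.
TickCount <- tabulate(idx[inside], nbins = length(from))
print(TickCount)
```

The tick at 200 is dropped because it lies past the end of the last bar, and the tick at 60 is assigned to the second bar only.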