data.table: fast computation of statistics on row times within a two-sided moving time window

Date: 2018-04-09 21:51:50

Tags: r data.table

library(data.table)
library(lubridate)
df <- data.table(col1 = c('A', 'A', 'A', 'B', 'B', 'B'), col2 = c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28", "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40"))

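For each row, I want to compute the standard deviation of the times (col2) of all rows that have the same value of 'col1' and whose time falls inside a window around that row's time: from 10 minutes before it to 10 minutes after it, both ends inclusive.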

I tried a fast approach based on the solution to a previous question:
df$col2 <- as_datetime(df$col2)
gap <- 10L
df[, feat1 := .SD[.(col1 = col1, t1 = col2 - gap * 60L, t2 = col2 + gap * 60L)
                  , on = .(col1, col2 >= t1, col2 <= t2)
                  , .(col1, col2 = x.col2, times = as.numeric(col2))
                  ][, .(sd_times = sd(times))
                    , by = .(col1, col2)]$sd_times][]

But I get the following error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 14 rows; more than 12 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
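The 14 in the message can be reproduced by counting, for each row, how many rows of the same col1 group fall inside its +/- 10-minute window. The sketch below does this by brute force, independently of the join. `secs` and `n_matches` are names introduced here for illustration, and parsing the timestamps as UTC is an assumption.

```r
library(data.table)

df <- data.table(
  col1 = c("A", "A", "A", "B", "B", "B"),
  col2 = as.POSIXct(c("2015-03-06 01:37:57", "2015-03-06 01:39:57",
                      "2015-03-06 01:45:28", "2015-03-06 02:31:44",
                      "2015-03-06 03:55:45", "2015-03-06 04:01:40"),
                    tz = "UTC")  # assumption: timestamps are UTC
)
gap <- 10L

# Times as seconds since the epoch, so differences are plain numbers
secs <- as.numeric(df$col2)

# For each row i, count the rows with the same col1 whose time lies
# within gap minutes of row i's time (both ends inclusive)
n_matches <- sapply(seq_len(nrow(df)), function(i) {
  sum(df$col1 == df$col1[i] & abs(secs - secs[i]) <= gap * 60)
})

n_matches       # 3 3 3 1 2 2
sum(n_matches)  # 14, which exceeds nrow(x) + nrow(i) = 6 + 6 = 12
```

Because the three 'A' rows all lie within 10 minutes of each other, each of them matches all three, so the join result is larger than the limit data.table applies by default.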

1 Answer:

Answer 0 (score: 0)

I solved my task using Frank's comment above, by adding allow.cartesian=TRUE to the join:

df[, feat1 := .SD[.(col1 = col1, t1 = col2 - gap * 60L, t2 = col2 + gap * 60L)
                  , on = .(col1, col2 >= t1, col2 <= t2)
                  , .(col1, col2 = x.col2, times = as.numeric(col2)), allow.cartesian=TRUE
                  ][, .(sd_times = sd(times))
                    , by = .(col1, col2)]$sd_times][]
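On small data, feat1 can be sanity-checked against a plain loop over the rows. This is only a hypothetical cross-check, not part of the original answer: `secs`, `ref` and the `ref_sd` column are illustrative names, the UTC time zone is an assumption, and the loop is quadratic in the number of rows, so it is not a substitute for the join on large tables.

```r
library(data.table)

df <- data.table(
  col1 = c("A", "A", "A", "B", "B", "B"),
  col2 = as.POSIXct(c("2015-03-06 01:37:57", "2015-03-06 01:39:57",
                      "2015-03-06 01:45:28", "2015-03-06 02:31:44",
                      "2015-03-06 03:55:45", "2015-03-06 04:01:40"),
                    tz = "UTC")  # assumption: timestamps are UTC
)
gap <- 10L

secs <- as.numeric(df$col2)  # seconds since the epoch

# Brute-force reference: for each row, take the sd of the times of all
# rows in the same col1 group within +/- gap minutes (inclusive)
ref <- sapply(seq_len(nrow(df)), function(i) {
  in_window <- df$col1 == df$col1[i] & abs(secs - secs[i]) <= gap * 60
  sd(secs[in_window])
})

# Store the reference values next to the data for comparison with feat1
df[, ref_sd := ref]
```

Note that sd() returns NA for a window that contains only one row, which is the case for the row at 02:31:44 in this data.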