foverlaps and data.table

Time: 2018-01-07 22:05:44

Tags: r data.table time-series

I have 2 tables

dums:

start   end 10min 

2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05

2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1

2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15

2013-04-02 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2

2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25

2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3

2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35

2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4

2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45

map: start and end are 10-minute interval blocks spanning 2013-04-01 00:00:00 to 2013-04-04

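I want to add the 3rd column of dt1 to map whenever the start and end times fall within a 10-minute block, and keep appending columns for each further match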

Ideally the output should be

start              end            10min

4/1/2013 0:00:00 4/1/2013 0:10:00   0.05  0

4/1/2013 0:10   4/1/2013 0:20   0.05     0

4/1/2013 0:20   4/1/2013 0:30   0.05    0

4/1/2013 0:30   4/1/2013 0:40   0.05    0

4/1/2013 0:40   4/1/2013 0:50   0.05    0.01

4/1/2013 0:50   4/1/2013 1:00   0.05    0.01

I tried

setkey(dums,start,end)

setkey(map,start,end)

foverlaps(map,dums,type="within",nomatch=0L)

I keep getting the error:

Error in foverlaps(map, dums, type = "within", nomatch = 0L) :   All entries in column start should be <= corresponding entries in column end in data.table 'y'

Any pointers or alternative approaches?

Thanks

2 answers:

Answer 0: (score: 1)

The error message

  All entries in column start should be <= corresponding entries in column end in data.table 'y'

is probably caused by a typo in the dataset.

dums[start > end, which = TRUE]

returns 4, and row 4 of dums is:

                 start                 end min10
1: 2013-04-02 02:22:00 2013-04-01 04:33:12   0.2

After changing start to 2013-04-01 02:22:00, the OP's code runs fine.
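As a minimal sketch of this check on toy data: the table name dt and the one-day correction below are assumptions for illustration only. In real data each offending row has to be inspected before deciding how to fix it.

```r
library(data.table)

# Toy intervals; the second row deliberately has start > end
dt <- data.table(
  start = as.POSIXct(c("2013-04-01 02:13:20", "2013-04-02 02:22:00"), tz = "UTC"),
  end   = as.POSIXct(c("2013-04-01 04:53:42", "2013-04-01 04:33:12"), tz = "UTC"),
  min10 = c(0.15, 0.2)
)

# Row numbers of the intervals that would trigger the foverlaps() error
bad_rows <- dt[start > end, which = TRUE]

# Assumed fix for this toy case only: the start date is off by one day
dt[bad_rows, start := start - 24 * 60 * 60]
```

Running the same start > end check again should then return no rows, after which both tables can be keyed and passed to foverlaps().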

However, to get the expected output, the result of foverlaps() needs to be reshaped from long to wide format.

This can be done in two ways:

dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ min10, 
      value.var = "min10")
                 i.start               i.end 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45
  1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05  NA   NA  NA   NA  NA   NA  NA   NA
  2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05  NA   NA  NA   NA  NA   NA  NA   NA
  3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05  NA   NA  NA   NA  NA   NA  NA   NA
  4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05  NA   NA  NA   NA  NA   NA  NA   NA
  5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.1   NA  NA   NA  NA   NA  NA   NA
 ---                                                                                 
311: 2013-04-03 04:40:00 2013-04-03 04:50:00   NA  NA   NA  NA   NA  NA 0.35  NA 0.45
312: 2013-04-03 04:50:00 2013-04-03 05:00:00   NA  NA   NA  NA   NA  NA 0.35  NA 0.45
313: 2013-04-03 05:00:00 2013-04-03 05:10:00   NA  NA   NA  NA   NA  NA 0.35  NA 0.45
314: 2013-04-03 05:10:00 2013-04-03 05:20:00   NA  NA   NA  NA   NA  NA 0.35  NA 0.45
315: 2013-04-03 05:20:00 2013-04-03 05:30:00   NA  NA   NA  NA   NA  NA 0.35  NA 0.45

Or, closer to the OP's expected result:

dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ rowid(i.start), 
      value.var = "min10")
                 i.start               i.end    1    2  3  4  5
  1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05   NA NA NA NA
  2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05   NA NA NA NA
  3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05   NA NA NA NA
  4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05   NA NA NA NA
  5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.10 NA NA NA
 ---                                                           
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.35 0.45 NA NA NA
312: 2013-04-03 04:50:00 2013-04-03 05:00:00 0.35 0.45 NA NA NA
313: 2013-04-03 05:00:00 2013-04-03 05:10:00 0.35 0.45 NA NA NA
314: 2013-04-03 05:10:00 2013-04-03 05:20:00 0.35 0.45 NA NA NA
315: 2013-04-03 05:20:00 2013-04-03 05:30:00 0.35 0.45 NA NA NA

Note that the argument type = "within" has been omitted for brevity, so foverlaps() falls back to its default, type = "any".
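Leaving out type changes which blocks are matched, because foverlaps() defaults to type = "any". The sketch below uses made-up toy tables (events and blocks are hypothetical names, not part of the question) to show the difference for a block that only partly overlaps an interval.

```r
library(data.table)

# One event interval (the keyed y table)
events <- data.table(
  start = as.POSIXct("2013-04-01 00:00:54", tz = "UTC"),
  end   = as.POSIXct("2013-04-01 00:20:00", tz = "UTC"),
  min10 = 0.05
)
setkey(events, start, end)

# Two 10-minute blocks: 00:00-00:10 and 00:10-00:20 (the x table)
ts <- seq(as.POSIXct("2013-04-01 00:00:00", tz = "UTC"),
          by = "10 min", length.out = 3)
blocks <- data.table(start = head(ts, -1L), end = tail(ts, -1L))
setkey(blocks, start, end)

# "any": a block matches if it overlaps the event at all
any_hits <- foverlaps(blocks, events, type = "any", nomatch = 0L)

# "within": a block matches only if it lies entirely inside the event
within_hits <- foverlaps(blocks, events, type = "within", nomatch = 0L)
```

Here the first block starts 54 seconds before the event, so it is matched under "any" but dropped under "within". The OP's expected output includes the 00:00 to 00:10 block, which is consistent with "any".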

Data

# corrected: start of row 4 changed from 2013-04-02 to 2013-04-01
dums <- fread(
  " 2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05
    2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1
    2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15
    2013-04-01 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2
    2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25
    2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3
    2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35
    2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4
    2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45"
)
dums <- dums[, .(start = as.POSIXct(paste(V1, V2, V3)),
         end = as.POSIXct(paste(V4, V5, V6)),
         min10 = V7)]
setkey(dums, start, end)
ts <- seq(as.POSIXct("2013-04-01 00:00:00 UTC"),
          as.POSIXct("2013-04-04 00:00:00 UTC"),
          by = "10 min")
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
                   key = c("start", "end"))

Answer 1: (score: 0)

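That was a good catch on the POSIXct time being off in one row. I feel really silly for overlooking an error like that in the input data.

The end goal is to have 3 column variables: YYYY-DD-MM; start time (POSIXct); end time (POSIXct). The start and end times are 10-minute windows. There are 365 days, so effectively I am looking at 365 * 144 windows (144 ten-minute slices per day). The problem is that I have 450,000 rows of "dums" data, and min10 is not made up of evenly spaced discrete intervals; it is continuous data. If I have to aggregate (sum, mean, sd, etc.), is there a way to combine dcast + aggregate + foverlaps with grouping? I could use a for loop and just sum the min10 values from start to end, but that looks very time-consuming and inefficient.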

The output would be

  5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.15
  ---
  311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.80 

  map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
                    key = c("start", "end"))
  # plus do something along the lines of
  dums[, .(count = .N, sum = sum(min10)), by = ID1]
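One possible way to get such per-window aggregates without a loop is to group the long-format foverlaps() result by the window columns. This is only a sketch under assumptions: it uses two of the corrected dums rows, a shortened two-hour map, and the default type = "any", and it has not been checked against the full 450,000-row data.

```r
library(data.table)

# Two of the (corrected) dums intervals from the question
dums <- data.table(
  start = as.POSIXct(c("2013-04-01 00:00:54", "2013-04-01 00:40:26"), tz = "UTC"),
  end   = as.POSIXct(c("2013-04-01 01:00:10", "2013-04-01 01:00:00"), tz = "UTC"),
  min10 = c(0.05, 0.1)
)
setkey(dums, start, end)

# 10-minute windows covering the first two hours only
ts <- seq(as.POSIXct("2013-04-01 00:00:00", tz = "UTC"),
          as.POSIXct("2013-04-01 02:00:00", tz = "UTC"),
          by = "10 min")
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
                  key = c("start", "end"))

# Overlap join in long format, then aggregate per 10-minute window;
# i.start and i.end are the window columns coming from map
agg <- foverlaps(map, dums, nomatch = 0L)[
  , .(count = .N, sum = sum(min10), mean = mean(min10)),
  by = .(i.start, i.end)]
```

For the 00:40 to 00:50 window both intervals overlap, so the sum is 0.05 + 0.1 = 0.15, which matches the value shown in the desired output above. Further statistics such as sd(min10) can be added to the same j expression.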