Subset records between timestamps

Time: 2017-12-30 01:24:04

Tags: r dplyr data.table subset posixct

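I have two data frames: trips, which holds the unique trips of bicycles, each with a unique id, and intervals, which shows the location of each bike ID every 10 minutes. My goal is to remove the records from intervals whose time falls between the start and finish of a trip in trips with the same bike_id. The times are of class POSIXct, and the original data frames have hundreds of thousands of records.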

For example, the two data sets below should result in outcome:

> trips
  bike_id               start              finish
1       1 2017-11-22 15:52:36 2017-11-22 17:47:53
2       2 2017-11-22 16:05:44 2017-11-22 16:23:25
3       3 2017-11-22 16:31:06 2017-11-22 17:11:20


  > intervals
                      time bike_id
    3  2017-11-22 16:00:03       1
    4  2017-11-22 16:10:03       1
    5  2017-11-22 16:20:02       1
    6  2017-11-22 16:30:02       1
    7  2017-11-22 16:40:03       1
    8  2017-11-22 16:50:02       1
    9  2017-11-22 17:00:02       1
    10 2017-11-22 17:10:02       1
    11 2017-11-22 17:20:03       1
    12 2017-11-22 17:30:03       1
    13 2017-11-22 16:00:03       2
    14 2017-11-22 16:10:03       2
    15 2017-11-22 16:20:02       2
    16 2017-11-22 16:30:02       2
    17 2017-11-22 16:40:03       2
    18 2017-11-22 16:50:02       2
    19 2017-11-22 17:00:02       2
    20 2017-11-22 17:10:02       2
    21 2017-11-22 17:20:03       2
    22 2017-11-22 17:30:03       2
    23 2017-11-22 16:30:02       3
    24 2017-11-22 16:40:03       3
    25 2017-11-22 16:50:02       3
    26 2017-11-22 17:00:02       3
    27 2017-11-22 17:10:02       3
    28 2017-11-22 17:20:03       3
    29 2017-11-22 17:30:03       3

Outcome

  > outcome
                      time bike_id
    13 2017-11-22 16:00:03       2
    16 2017-11-22 16:30:02       2
    17 2017-11-22 16:40:03       2
    18 2017-11-22 16:50:02       2
    19 2017-11-22 17:00:02       2
    20 2017-11-22 17:10:02       2
    21 2017-11-22 17:20:03       2
    22 2017-11-22 17:30:03       2
    23 2017-11-22 16:30:02       3
    28 2017-11-22 17:20:03       3
    29 2017-11-22 17:30:03       3

Not sure where to start. Any suggestions, whether with dplyr or the apply functions, would be much appreciated!

Here is the sample data:

> dput(intervals)
structure(list(time = structure(c(1511384403.94561, 1511385003.17654, 
1511385602.47887, 1511386202.99895, 1511386803.18361, 1511387402.98233, 
1511388002.69461, 1511388602.5818, 1511389203.52712, 1511389803.652, 
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895, 
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818, 
1511389203.52712, 1511389803.652, 1511386202.99895, 1511386803.18361, 
1511387402.98233, 1511388002.69461, 1511388602.5818, 1511389203.52712, 
1511389803.652), class = c("POSIXct", "POSIXt"), tzone = ""), 
    bike_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 
    2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), .Names = c("time", 
"bike_id"), row.names = 3:29, class = "data.frame")

> dput(trips)
structure(list(bike_id = c(1, 2, 3), start = structure(c(1511383956, 
1511384744, 1511386266), class = c("POSIXct", "POSIXt"), tzone = ""), 
    finish = structure(c(1511390873, 1511385805, 1511388680), class = c("POSIXct", 
    "POSIXt"), tzone = "")), .Names = c("bike_id", "start", "finish"
), row.names = c(NA, 3L), class = "data.frame")

3 answers:

Answer 0 (score: 3)

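I am very new to the data.table package, so please test the following approach carefully.

I chose data.table rather than dplyr because this task requires a join by range, which dplyr cannot do at the moment. My solution uses the foverlaps function to perform the join.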
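A minimal sketch of how a foverlaps()-based approach could look, not necessarily the code this answer used. The names trips_dt, intervals_dt, time2 and hits are illustrative, the sample rows are a small subset of the question's data, and the time zone is fixed to UTC as an assumption for reproducibility.

```r
library(data.table)

# Illustrative subset of the question's data (UTC is an assumption; the
# original data uses the local time zone).
trips_dt <- data.table(
  bike_id = c(2, 3),
  start   = as.POSIXct(c("2017-11-22 16:05:44", "2017-11-22 16:31:06"), tz = "UTC"),
  finish  = as.POSIXct(c("2017-11-22 16:23:25", "2017-11-22 17:11:20"), tz = "UTC")
)
intervals_dt <- data.table(
  time = as.POSIXct(c("2017-11-22 16:00:03", "2017-11-22 16:10:03",
                      "2017-11-22 16:30:02", "2017-11-22 16:30:02",
                      "2017-11-22 16:40:03", "2017-11-22 17:20:03"), tz = "UTC"),
  bike_id = c(2, 2, 2, 3, 3, 3)
)

# foverlaps() compares intervals, so each timestamp becomes a zero-length
# interval [time, time2].
intervals_dt[, time2 := time]

# The second argument must be keyed; the last two key columns are the bounds.
setkey(trips_dt, bike_id, start, finish)

# which = TRUE returns index pairs (xid, yid); nomatch = 0L drops timestamps
# that fall inside no trip.
hits <- foverlaps(intervals_dt, trips_dt,
                  by.x = c("bike_id", "time", "time2"),
                  by.y = c("bike_id", "start", "finish"),
                  type = "within", which = TRUE, nomatch = 0L)

# Remove the rows that fell inside a trip, then drop the helper column.
outcome <- if (nrow(hits) > 0L) intervals_dt[-unique(hits$xid)] else copy(intervals_dt)
outcome[, time2 := NULL]
print(outcome)
```

The helper column time2 is only needed because foverlaps() expects a start and an end column on both sides.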

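Answer 1 (score: 1)

This can be solved with a non-equi anti-join.

Non-equi joins have been available in data.table since version 1.9.8 (on CRAN 25 Nov 2016) and can serve as a convenient replacement for foverlaps() in many cases. In particular, foverlaps() requires the second argument to be keyed, whereas non-equi joins work with unkeyed as well as keyed data.tables.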

First, the non-equi join is used to identify the indices of the rows of intervals that lie between the start and finish times of trips. Then, these rows are removed from intervals:

library(data.table)
tmp <- setDT(intervals)[setDT(trips), on = .(bike_id, time >= start, time <= finish), which = TRUE]
intervals[!tmp]

                   time bike_id
 1: 2017-11-22 16:00:03       2
 2: 2017-11-22 16:30:02       2
 3: 2017-11-22 16:40:03       2
 4: 2017-11-22 16:50:02       2
 5: 2017-11-22 17:00:02       2
 6: 2017-11-22 17:10:02       2
 7: 2017-11-22 17:20:03       2
 8: 2017-11-22 17:30:03       2
 9: 2017-11-22 16:30:02       3
10: 2017-11-22 17:20:03       3
11: 2017-11-22 17:30:03       3

tmp contains the indices of the rows to be removed.
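A slightly more defensive variant of the same idea, offered only as a sketch. It assumes that some trips may have no interval rows at all (bike 4 below is hypothetical) and that trips of the same bike may overlap, so unmatched trips are dropped with nomatch = 0L and the indices are de-duplicated before removal. The names trips_dt, intervals_dt and idx, the reduced sample rows, and the UTC time zone are illustrative assumptions.

```r
library(data.table)

# Illustrative subset of the question's data; bike 4 is a hypothetical trip
# without any interval rows.
trips_dt <- data.table(
  bike_id = c(2, 3, 4),
  start   = as.POSIXct(c("2017-11-22 16:05:44", "2017-11-22 16:31:06",
                         "2017-11-22 16:00:00"), tz = "UTC"),
  finish  = as.POSIXct(c("2017-11-22 16:23:25", "2017-11-22 17:11:20",
                         "2017-11-22 17:00:00"), tz = "UTC")
)
intervals_dt <- data.table(
  time = as.POSIXct(c("2017-11-22 16:00:03", "2017-11-22 16:10:03",
                      "2017-11-22 16:30:02", "2017-11-22 16:30:02",
                      "2017-11-22 16:40:03", "2017-11-22 17:20:03"), tz = "UTC"),
  bike_id = c(2, 2, 2, 3, 3, 3)
)

# Row numbers of intervals_dt that fall inside a trip of the same bike.
# nomatch = 0L omits trips that match no interval row.
idx <- intervals_dt[trips_dt,
                    on = .(bike_id, time >= start, time <= finish),
                    which = TRUE, nomatch = 0L]
idx <- unique(idx)

# Guard against an empty index vector before negative indexing.
outcome <- if (length(idx) > 0L) intervals_dt[-idx] else intervals_dt
print(outcome)
```

Neither table needs a key here, which is the convenience the answer mentions for non-equi joins.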

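Answer 2 (score: 0)

Here is my answer. trips is the reference data set.

matched() is a function that looks up the start and finish of each trip in trips and matches them to the rows of intervals.

Answer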

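library(dplyr)

# Reference data set; start and finish are converted to numeric (seconds since
# the epoch) so they can be compared with the numeric time below.
trips <- data.frame(bike_id = 1:3,
                    start = as.POSIXct(c("2017-11-22 15:52:36", "2017-11-22 16:05:44", "2017-11-22 16:31:06")),
                    finish = as.POSIXct(c("2017-11-22 17:47:53", "2017-11-22 16:23:25", "2017-11-22 17:11:20"))) %>%
  mutate(start = as.numeric(start),
         finish = as.numeric(finish))

# Look up column var1 of df2 for each row of df1, matching on the key column var2.
matched <- function(var1, var2, df1, df2) {
  df2[, var1][match(df1[, var2], df2[, var2])]
}

# Attach each bike's trip start and finish to its interval rows, then keep only
# the rows that fall outside the trip.
intervals %>%
  mutate(time_num = as.numeric(time),
         start = matched("start", "bike_id", intervals, trips),
         finish = matched("finish", "bike_id", intervals, trips)) %>%
  filter(time_num < start | time_num > finish) %>%
  select(time, bike_id)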


                  time bike_id
1  2017-11-22 16:00:03       2
2  2017-11-22 16:30:02       2
3  2017-11-22 16:40:03       2
4  2017-11-22 16:50:02       2
5  2017-11-22 17:00:02       2
6  2017-11-22 17:10:02       2
7  2017-11-22 17:20:03       2
8  2017-11-22 17:30:03       2
9  2017-11-22 16:30:02       3
10 2017-11-22 17:20:03       3
11 2017-11-22 17:30:03       3

For some strange reason, I could not get between() to work. I will look into this later.
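One possible explanation, which is my assumption and not confirmed by the answer, is that older dplyr releases accepted only single values for the bounds of between(), while here the bounds differ per row. The sketch below avoids between() altogether by joining the trip bounds onto the interval rows and comparing the POSIXct columns directly. It assumes at most one trip per bike_id; the names trips_df and intervals_df, the reduced sample rows, and the UTC time zone are illustrative.

```r
library(dplyr)

# Illustrative subset of the question's data (UTC is an assumption).
trips_df <- data.frame(
  bike_id = c(2, 3),
  start   = as.POSIXct(c("2017-11-22 16:05:44", "2017-11-22 16:31:06"), tz = "UTC"),
  finish  = as.POSIXct(c("2017-11-22 16:23:25", "2017-11-22 17:11:20"), tz = "UTC")
)
intervals_df <- data.frame(
  time = as.POSIXct(c("2017-11-22 16:00:03", "2017-11-22 16:10:03",
                      "2017-11-22 16:30:02", "2017-11-22 16:30:02",
                      "2017-11-22 16:40:03", "2017-11-22 17:20:03"), tz = "UTC"),
  bike_id = c(2, 2, 2, 3, 3, 3)
)

outcome <- intervals_df %>%
  # Bring each bike's trip bounds onto its interval rows.
  left_join(trips_df, by = "bike_id") %>%
  # Keep rows outside the trip; rows for bikes without a trip are kept too.
  filter(is.na(start) | time < start | time > finish) %>%
  select(time, bike_id)

print(outcome)
```

Comparing the POSIXct columns directly also removes the need to convert the timestamps to numeric first.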