Left join in R between two timestamps

Time: 2019-06-20 14:49:27

Tags: r dplyr data.table tidyverse

My goal is to left join intervals onto records where bike_id matches and the created_at timestamp in records falls between start and end in the intervals table.

> class(records)
[1] "data.table" "data.frame"
> class(intervals)
[1] "data.table" "data.frame"
> records
  bike_id          created_at         resolved_at
1   28780 2019-05-03 08:29:18 2019-05-03 08:35:37
2   28780 2019-05-03 21:05:28 2019-05-03 21:07:28
3   28780 2019-05-04 21:13:39 2019-05-04 21:15:40
4   28780 2019-05-07 17:24:20 2019-05-07 17:26:39
5   28780 2019-05-08 11:34:32 2019-05-08 12:16:44
6   28780 2019-05-08 23:38:39 2019-05-08 23:40:36
> intervals
   bike_id               start                 end id
1:   28780 2019-05-03 04:44:45 2019-05-03 16:58:56  1
2:   28780 2019-05-04 07:07:39 2019-05-04 14:48:29  2
3:   28780 2019-05-07 23:28:32 2019-05-08 12:56:24  3
4:   28780 2019-05-10 06:06:21 2019-05-10 13:12:08  4
5:   28780 2019-05-12 05:21:24 2019-05-12 11:35:52  5
6:   28780 2019-05-13 08:44:54 2019-05-13 12:28:31  6

In this case, the output would look like

> output
  bike_id          created_at         resolved_at   id
1   28780 2019-05-03 08:29:18 2019-05-03 08:35:37    1
2   28780 2019-05-03 21:05:28 2019-05-03 21:07:28  NULL
3   28780 2019-05-04 21:13:39 2019-05-04 21:15:40  NULL
4   28780 2019-05-07 17:24:20 2019-05-07 17:26:39  NULL
5   28780 2019-05-08 11:34:32 2019-05-08 12:16:44  NULL
6   28780 2019-05-08 23:38:39 2019-05-08 23:40:36  NULL

I tried the fuzzy_left_join solution posted here, but it causes R to run out of memory (even though each table holds only about 100K records):

fuzzy_left_join(
  records, intervals,
  by = c(
    "bike_id" = "bike_id",
    "created_at" = "start",
    "created_at" = "end"
  ),
  match_fun = list(`==`, `>=`, `<=`)
) %>%
  select(id, bike_id = bike_id.x, created_at, start, end)

This throws an error:

Error: vector memory exhausted (limit reached?)

Is there an alternative to a rolling join in the tidyverse, or even in base R or data.table? What is a good way to join two data frames by id where a timestamp falls between two others in the joined table?

Here is the data

3 answers:

Answer 0: (score: 5)

We can use a data.table non-equi join:

library(data.table)
setDT(records)[intervals, on = .(bike_id, created_at >= start, created_at <= end)]
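As written, the join above is driven by the rows of intervals, so records that fall in no interval do not appear in the result. If every record should be kept, as in the output the question asks for, one option is an update join that adds id to records by reference. The following is a minimal sketch, not the answerer's code; it rebuilds the sample data with POSIXct columns and assumes the "GMT" time zone used elsewhere on this page.

```r
library(data.table)

# Sample data from the question, rebuilt with POSIXct timestamps (time zone assumed to be GMT)
records <- data.table(
  bike_id = 28780L,
  created_at = as.POSIXct(c("2019-05-03 08:29:18", "2019-05-03 21:05:28",
                            "2019-05-04 21:13:39", "2019-05-07 17:24:20",
                            "2019-05-08 11:34:32", "2019-05-08 23:38:39"), tz = "GMT"),
  resolved_at = as.POSIXct(c("2019-05-03 08:35:37", "2019-05-03 21:07:28",
                             "2019-05-04 21:15:40", "2019-05-07 17:26:39",
                             "2019-05-08 12:16:44", "2019-05-08 23:40:36"), tz = "GMT")
)
intervals <- data.table(
  bike_id = 28780L,
  start = as.POSIXct(c("2019-05-03 04:44:45", "2019-05-04 07:07:39",
                       "2019-05-07 23:28:32", "2019-05-10 06:06:21",
                       "2019-05-12 05:21:24", "2019-05-13 08:44:54"), tz = "GMT"),
  end = as.POSIXct(c("2019-05-03 16:58:56", "2019-05-04 14:48:29",
                     "2019-05-08 12:56:24", "2019-05-10 13:12:08",
                     "2019-05-12 11:35:52", "2019-05-13 12:28:31"), tz = "GMT"),
  id = 1:6
)

# Update join: for each interval, find the records inside it and copy the
# interval id onto those rows; records without a match keep id = NA
records[intervals,
        on = .(bike_id, created_at >= start, created_at <= end),
        id := i.id]

# id is 1, NA, NA, NA, 3, NA for the six sample records
print(records)
```

Because the column is added in place, records keeps its original row count and order. If intervals for the same bike_id can overlap, a record matching several intervals ends up with the id of only one of them, so this sketch suits non-overlapping intervals best.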

Answer 1: (score: 3)

I know the OP asked for a tidyverse or data.table solution, but SQL seems like the perfect tool for this problem:

library(sqldf)

sqldf("select a.*, b.id 
        from records as a
        left join intervals as b
          on a.bike_id = b.bike_id and
            a.created_at >= b.start and
            a.created_at <= b.end")

Or, using between as an alternative syntax:

sqldf("select a.*, b.id 
        from records as a
        left join intervals as b
          on a.bike_id = b.bike_id and
            a.created_at between b.start and b.end")

Edit: As @G. Grothendieck noted, we can set the environment's time zone with Sys.setenv before reading in the data, so that it matches the OP's time zone.

Output:

  bike_id          created_at         resolved_at id
1   28780 2019-05-03 08:29:18 2019-05-03 08:35:37  1
2   28780 2019-05-03 21:05:28 2019-05-03 21:07:28 NA
3   28780 2019-05-04 21:13:39 2019-05-04 21:15:40 NA
4   28780 2019-05-07 17:24:20 2019-05-07 17:26:39 NA
5   28780 2019-05-08 11:34:32 2019-05-08 12:16:44  3
6   28780 2019-05-08 23:38:39 2019-05-08 23:40:36 NA

Data: (the OP's dput does not work because of the pointer created by data.table)

Sys.setenv(TZ = "GMT")

records <- structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L, 
28780L), created_at = c("2019-05-03 08:29:18", "2019-05-03 21:05:28", 
"2019-05-04 21:13:39", "2019-05-07 17:24:20", "2019-05-08 11:34:32", 
"2019-05-08 23:38:39"), resolved_at = c("2019-05-03 08:35:37", 
"2019-05-03 21:07:28", "2019-05-04 21:15:40", "2019-05-07 17:26:39", 
"2019-05-08 12:16:44", "2019-05-08 23:40:36")), class = "data.frame", row.names = c(NA, 
-6L))

intervals <- structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L, 
28780L), start = c("2019-05-03 04:44:45", "2019-05-04 07:07:39", 
"2019-05-07 23:28:32", "2019-05-10 06:06:21", "2019-05-12 05:21:24", 
"2019-05-13 08:44:54"), end = c("2019-05-03 16:58:56", "2019-05-04 14:48:29", 
"2019-05-08 12:56:24", "2019-05-10 13:12:08", "2019-05-12 11:35:52", 
"2019-05-13 12:28:31"), id = c(1, 2, 3, 4, 5, 6)), class = "data.frame", row.names = c(NA, 
-6L))

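Answer 2: (score: 1)

An alternative is to join on bike_id and the date part of created_at, then drop the id wherever created_at does not fall within the start-end range. Breaking the work into separate steps like this may get around the memory issue:

library(dplyr)
library(lubridate)
library(purrr)
library(tidyr)  # provides replace_na()

intervals %>% 
    # Join key: the calendar date on which each interval starts.
    # An interval that spans midnight is therefore only compared with
    # records created on its start date.
    mutate(date = date(start)) %>% 
    right_join(mutate(records,
                      date = date(created_at)),
                      by = c("bike_id", "date")
              ) %>% 
    # Flag records whose created_at lies inside the joined interval;
    # records with no same-day interval give NA, which is set to FALSE
    mutate(within = created_at %within% interval(start, end),
           within = replace_na(within, FALSE),
           # Keep the interval id only where the record is inside it
           id = map2_dbl(id, within, ~ ifelse(.y, .x, NA))
           ) %>% 
    select(bike_id, id, created_at, resolved_at)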
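For readers on a recent dplyr, the same left join can also be written without the intermediate date key. The sketch below is not part of this answer: it assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available, and it rebuilds the sample data with POSIXct columns in an assumed "GMT" time zone.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

# Sample data from the question, rebuilt with POSIXct timestamps (time zone assumed to be GMT)
records <- data.frame(
  bike_id = 28780L,
  created_at = as.POSIXct(c("2019-05-03 08:29:18", "2019-05-03 21:05:28",
                            "2019-05-04 21:13:39", "2019-05-07 17:24:20",
                            "2019-05-08 11:34:32", "2019-05-08 23:38:39"), tz = "GMT"),
  resolved_at = as.POSIXct(c("2019-05-03 08:35:37", "2019-05-03 21:07:28",
                             "2019-05-04 21:15:40", "2019-05-07 17:26:39",
                             "2019-05-08 12:16:44", "2019-05-08 23:40:36"), tz = "GMT")
)
intervals <- data.frame(
  bike_id = 28780L,
  start = as.POSIXct(c("2019-05-03 04:44:45", "2019-05-04 07:07:39",
                       "2019-05-07 23:28:32", "2019-05-10 06:06:21",
                       "2019-05-12 05:21:24", "2019-05-13 08:44:54"), tz = "GMT"),
  end = as.POSIXct(c("2019-05-03 16:58:56", "2019-05-04 14:48:29",
                     "2019-05-08 12:56:24", "2019-05-10 13:12:08",
                     "2019-05-12 11:35:52", "2019-05-13 12:28:31"), tz = "GMT"),
  id = 1:6
)

# Left join: equality on bike_id, plus created_at within [start, end]
output <- records %>%
  left_join(intervals,
            by = join_by(bike_id, between(created_at, start, end))) %>%
  select(bike_id, created_at, resolved_at, id)

# id is 1, NA, NA, NA, 3, NA for the six sample records
print(output)
```

Since the comparison uses the full timestamps, an interval that crosses midnight (such as id 3 in the sample data) is matched as well. Whether this stays within memory on the full 100K-row tables has not been checked here.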