Question

I want to compare two data frames.

instances <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
         dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00")))

ranges <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
             start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
             end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00",NA)))

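By id, I want to compare each date in the instances data frame against the corresponding date ranges listed in the ranges data frame. If there is no matching id in the ranges data frame, it should return FALSE, and if ranges$end is NA it should also return FALSE. The result should look like this: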

result <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
             dates = c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00"),
             inRange = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
             outsideRange = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))

Answer 1

library(dplyr)

instances %>% 
  full_join(ranges, by = "id") %>% 
  mutate(inRange = case_when(dates >= start & dates <= end ~ TRUE, TRUE ~ FALSE))

    id               dates               start                 end inRange
1  AED 2018-05-17 09:52:00 2018-05-17 10:00:00 2018-05-17 11:56:00 FALSE
2  AED 2018-05-17 10:49:00 2018-05-17 10:00:00 2018-05-17 11:56:00  TRUE
3  CFR 2018-05-17 10:38:00 2018-05-17 10:18:00 2018-05-17 12:23:00  TRUE
4  DRR 2018-05-17 11:29:00 2018-05-17 11:18:00 2018-05-17 12:01:00  TRUE
5  DRR 2018-05-17 11:29:00 2018-05-17 13:10:00 2018-05-17 14:18:00 FALSE
6  DRR 2018-05-17 12:12:00 2018-05-17 11:18:00 2018-05-17 12:01:00 FALSE
7  DRR 2018-05-17 12:12:00 2018-05-17 13:10:00 2018-05-17 14:18:00 FALSE
8  DRR 2018-05-17 13:20:00 2018-05-17 11:18:00 2018-05-17 12:01:00 FALSE
9  DRR 2018-05-17 13:20:00 2018-05-17 13:10:00 2018-05-17 14:18:00  TRUE
10  UN 2018-05-17 14:28:00 2018-05-17 14:18:00                <NA> FALSE
11  PO 2018-05-17 15:59:00                <NA>                <NA> FALSE
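
The join above returns one row per instance/range pair (11 rows), while the question asks for one row per instance. A minimal sketch of collapsing the pairs back to 8 rows follows. It rests on two assumptions that are not in the original answer: a range with an NA start or end is treated as unusable, and outsideRange is TRUE only when the id has at least one complete range and the date falls in none of them. Under these assumptions UN comes out as FALSE for inRange, which differs from the expected result in the question. The names `complete`, `hit` and `collapsed` are illustrative.

```r
library(dplyr)

instances <- data.frame(
  id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
  dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00",
                       "2018-05-17 10:38:00","2018-05-17 11:29:00",
                       "2018-05-17 12:12:00","2018-05-17 13:20:00",
                       "2018-05-17 14:28:00","2018-05-17 15:59:00")))

ranges <- data.frame(
  id = c("AED","CFR","DRR","DRR","UN"),
  start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00",
                       "2018-05-17 11:18:00","2018-05-17 13:10:00",
                       "2018-05-17 14:18:00")),
  end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00",
                     "2018-05-17 12:01:00","2018-05-17 14:18:00", NA)))

collapsed <- instances %>%
  # one row per instance/range pair; recent dplyr versions may warn
  # about a many-to-many relationship, which is intended here
  left_join(ranges, by = "id") %>%
  mutate(
    # assumption: a range is usable only if both bounds are present
    complete = !is.na(start) & !is.na(end),
    hit = complete & dates >= start & dates <= end
  ) %>%
  # collapse the pairs back to one row per instance (rows end up sorted by id)
  group_by(id, dates) %>%
  summarise(
    inRange = any(hit),
    outsideRange = any(complete) & !any(hit),
    .groups = "drop"
  )

print(collapsed)
```

Note that `group_by()` sorts the output by id, so the row order differs from `instances`.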

Answer 2

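data.table solution

I would solve this with the foverlaps() function from data.table. The only catch is that it only accepts complete date ranges, and row 5 of ranges in the sample data has no end date: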

> ranges
   id               start                 end
1 AED 2018-05-17 10:00:00 2018-05-17 11:56:00
2 CFR 2018-05-17 10:18:00 2018-05-17 12:23:00
3 DRR 2018-05-17 11:18:00 2018-05-17 12:01:00
4 DRR 2018-05-17 13:10:00 2018-05-17 14:18:00
5  UN 2018-05-17 14:18:00                <NA>

For the solution below to work, every range must have both a start and an end. So let's fill in the NA with a made-up timestamp.

ranges <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
                     start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
                     end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00", "2018-05-17 16:18:00")))

> ranges
   id               start                 end
1 AED 2018-05-17 10:00:00 2018-05-17 11:56:00
2 CFR 2018-05-17 10:18:00 2018-05-17 12:23:00
3 DRR 2018-05-17 11:18:00 2018-05-17 12:01:00
4 DRR 2018-05-17 13:10:00 2018-05-17 14:18:00
5  UN 2018-05-17 14:18:00 2018-05-17 16:18:00

Workflow

library(data.table)
#make instances a data.table without key
instances.dt <- setDT( instances, key = NULL )
#create a data.table with the ranges, set keys 
ranges.dt <- setDT( ranges, key = c("id", "start", "end") )

#create a temporary 'range', where start == end, based on the dates-column
instances.dt[, c( "start", "end") := dates]

#create a column 'inRange' using data.table's foverlaps(). 
#use the second column of the foverlaps() result. If this column is NA, then no 'hit' was found 
#in ranges.dt and inRange == FALSE, else inRange == TRUE
instances.dt[, inRange := !is.na( foverlaps(instances.dt, ranges.dt, type = "within", mult = "first", nomatch = NA)[,2] )]

#outsideRange is the opposite of inRange
instances.dt[, outsideRange := !inRange]

#remove the temporary columns 'start' and 'end'
instances.dt[, c("start", "end") := NULL]

Result

> instances.dt
    id               dates inRange outsideRange
1: AED 2018-05-17 09:52:00   FALSE         TRUE
2: AED 2018-05-17 10:49:00    TRUE        FALSE
3: CFR 2018-05-17 10:38:00    TRUE        FALSE
4: DRR 2018-05-17 11:29:00    TRUE        FALSE
5: DRR 2018-05-17 12:12:00   FALSE         TRUE
6: DRR 2018-05-17 13:20:00    TRUE        FALSE
7:  UN 2018-05-17 14:28:00    TRUE        FALSE
8:  PO 2018-05-17 15:59:00   FALSE         TRUE

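This is very fast, even for huge data.tables.

You can shorten the code, but I always prefer to go through the analysis step by step for readability.

Chaining with magrittr's pipe operator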

library(data.table)
library(magrittr)

ranges.dt <- setDT( ranges, key = c("id", "start", "end") )
result <- setDT( instances, key = NULL ) %>% 
  .[, c( "start", "end") := dates] %>%
  .[, inRange := !is.na( foverlaps( ., ranges.dt, type = "within", mult = "first", nomatch = NA )[,2] )] %>%
  .[, outsideRange := !inRange] %>%
  .[, c("start", "end") := NULL]
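
If an open-ended range should count as FALSE, as the question text requests, one alternative to inventing an end timestamp is to drop the incomplete ranges before the overlap join. The following is only a sketch of that variation: it rebuilds the sample data with the original NA end, and the names `complete` and `hits` are illustrative rather than part of the answer above.

```r
library(data.table)

instances <- data.table(
  id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
  dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00",
                       "2018-05-17 10:38:00","2018-05-17 11:29:00",
                       "2018-05-17 12:12:00","2018-05-17 13:20:00",
                       "2018-05-17 14:28:00","2018-05-17 15:59:00")))

ranges <- data.table(
  id = c("AED","CFR","DRR","DRR","UN"),
  start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00",
                       "2018-05-17 11:18:00","2018-05-17 13:10:00",
                       "2018-05-17 14:18:00")),
  end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00",
                     "2018-05-17 12:01:00","2018-05-17 14:18:00", NA)))

# keep only ranges that have both a start and an end, then key them
complete <- ranges[!is.na(start) & !is.na(end)]
setkey(complete, id, start, end)

# temporary zero-length interval (start == end) for every instance
instances[, c("start", "end") := dates]

# one result row per instance; start is NA where no range contains the date
hits <- foverlaps(instances, complete, type = "within",
                  mult = "first", nomatch = NA)
instances[, inRange := !is.na(hits$start)]

# remove the temporary columns
instances[, c("start", "end") := NULL]

print(instances)
```

With the incomplete UN range dropped, UN is treated like an id without any range, so its inRange is FALSE instead of the TRUE shown in the result above.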
