I want to compare two data frames.
instances <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00")))
ranges <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00",NA)))
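By id, I want to compare each date in the instances data frame against the corresponding date range(s) listed in the ranges data frame. If there is no matching id in the ranges data frame, it should return FALSE, and if ranges$end is NA it should also return FALSE. The result should look like this: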
result <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
dates = c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00"),
inRange = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
outsideRange = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
Answer 0 (score: 2)
library(dplyr)
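instances %>%
  # join on id explicitly; every instance is paired with every range sharing its id
  full_join(ranges, by = "id") %>%
  # TRUE only when the date lies between start and end; NA bounds fall through to FALSE
  mutate(inRange = case_when(dates >= start & dates <= end ~ TRUE, TRUE ~ FALSE))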
id dates start end inRange
1 AED 2018-05-17 09:52:00 2018-05-17 10:00:00 2018-05-17 11:56:00 FALSE
2 AED 2018-05-17 10:49:00 2018-05-17 10:00:00 2018-05-17 11:56:00 TRUE
3 CFR 2018-05-17 10:38:00 2018-05-17 10:18:00 2018-05-17 12:23:00 TRUE
4 DRR 2018-05-17 11:29:00 2018-05-17 11:18:00 2018-05-17 12:01:00 TRUE
5 DRR 2018-05-17 11:29:00 2018-05-17 13:10:00 2018-05-17 14:18:00 FALSE
6 DRR 2018-05-17 12:12:00 2018-05-17 11:18:00 2018-05-17 12:01:00 FALSE
7 DRR 2018-05-17 12:12:00 2018-05-17 13:10:00 2018-05-17 14:18:00 FALSE
8 DRR 2018-05-17 13:20:00 2018-05-17 11:18:00 2018-05-17 12:01:00 FALSE
9 DRR 2018-05-17 13:20:00 2018-05-17 13:10:00 2018-05-17 14:18:00 TRUE
10 UN 2018-05-17 14:28:00 2018-05-17 14:18:00 <NA> FALSE
11 PO 2018-05-17 15:59:00 <NA> <NA> FALSE
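The joined output above has one row per instance/range pair, so the DRR instances appear twice. A minimal sketch of collapsing it back to one row per instance, as in the question's expected result, could look like the following. The name `collapsed`, the explicit `tz = "UTC"`, and `stringsAsFactors = FALSE` are assumptions added for illustration, and `summarise()` returns the rows sorted by id rather than in the original order.

```r
library(dplyr)

# Sample data from the question (tz = "UTC" is an assumption, for reproducibility)
instances <- data.frame(
  id = c("AED", "AED", "CFR", "DRR", "DRR", "DRR", "UN", "PO"),
  dates = as.POSIXct(c("2018-05-17 09:52:00", "2018-05-17 10:49:00",
                       "2018-05-17 10:38:00", "2018-05-17 11:29:00",
                       "2018-05-17 12:12:00", "2018-05-17 13:20:00",
                       "2018-05-17 14:28:00", "2018-05-17 15:59:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
ranges <- data.frame(
  id = c("AED", "CFR", "DRR", "DRR", "UN"),
  start = as.POSIXct(c("2018-05-17 10:00:00", "2018-05-17 10:18:00",
                       "2018-05-17 11:18:00", "2018-05-17 13:10:00",
                       "2018-05-17 14:18:00"), tz = "UTC"),
  end = as.POSIXct(c("2018-05-17 11:56:00", "2018-05-17 12:23:00",
                     "2018-05-17 12:01:00", "2018-05-17 14:18:00", NA), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Newer dplyr versions may warn about a many-to-many match here (DRR has
# several instances and several ranges); that pairing is intended.
collapsed <- instances %>%
  left_join(ranges, by = "id") %>%
  # a pair counts as a hit only if both bounds exist and enclose the date
  mutate(hit = !is.na(start) & !is.na(end) & dates >= start & dates <= end) %>%
  group_by(id, dates) %>%
  # an instance is in range if any of its candidate ranges is a hit
  summarise(inRange = any(hit), .groups = "drop")
```

With this rule the UN instance comes out FALSE, because its only range has an NA end.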
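Answer 1 (score: 1)
data.table solution
I would solve this with the foverlaps() function from data.table. The only catch is that it only accepts complete date ranges, and in the sample data provided ranges[5, ] has no end date: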
> ranges
id start end
1 AED 2018-05-17 10:00:00 2018-05-17 11:56:00
2 CFR 2018-05-17 10:18:00 2018-05-17 12:23:00
3 DRR 2018-05-17 11:18:00 2018-05-17 12:01:00
4 DRR 2018-05-17 13:10:00 2018-05-17 14:18:00
5 UN 2018-05-17 14:18:00 <NA>
For the solution below to work, every range must have both a start and an end. So let's fill in the NA with a made-up timestamp.
ranges <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00", "2018-05-17 16:18:00")))
> ranges
id start end
1 AED 2018-05-17 10:00:00 2018-05-17 11:56:00
2 CFR 2018-05-17 10:18:00 2018-05-17 12:23:00
3 DRR 2018-05-17 11:18:00 2018-05-17 12:01:00
4 DRR 2018-05-17 13:10:00 2018-05-17 14:18:00
5 UN 2018-05-17 14:18:00 2018-05-17 16:18:00
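If the question's rule should be kept instead (an NA end means FALSE), an alternative to inventing a timestamp is to drop the incomplete ranges before they are keyed. This is only a sketch in base R; the name `ranges_complete` and the explicit `tz = "UTC"` are assumptions added for illustration.

```r
# Original ranges from the question, including the NA end for UN
ranges <- data.frame(
  id = c("AED", "CFR", "DRR", "DRR", "UN"),
  start = as.POSIXct(c("2018-05-17 10:00:00", "2018-05-17 10:18:00",
                       "2018-05-17 11:18:00", "2018-05-17 13:10:00",
                       "2018-05-17 14:18:00"), tz = "UTC"),
  end = as.POSIXct(c("2018-05-17 11:56:00", "2018-05-17 12:23:00",
                     "2018-05-17 12:01:00", "2018-05-17 14:18:00", NA), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Keep only ranges with both bounds present; instances whose id has no
# complete range then simply find no match and end up FALSE
ranges_complete <- ranges[!is.na(ranges$start) & !is.na(ranges$end), ]
```

Passing `ranges_complete` to the workflow in place of the filled-in `ranges` would leave the UN instance unmatched.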
Workflow
library(data.table)
# make instances a data.table without a key (setDT converts by reference)
instances.dt <- setDT( instances, key = NULL )
# make ranges a data.table and set the keys foverlaps() needs
ranges.dt <- setDT( ranges, key = c("id", "start", "end") )
# create a temporary zero-length 'range' (start == end) from the dates column
instances.dt[, c( "start", "end") := list(dates, dates)]
# create a column 'inRange' using data.table's foverlaps().
# use the second column of the foverlaps() result (the start of the matched range).
# if it is NA, no 'hit' was found in ranges.dt and inRange == FALSE, else inRange == TRUE
instances.dt[, inRange := !is.na( foverlaps(instances.dt, ranges.dt, type = "within", mult = "first", nomatch = NA)[[2]] )]
# outsideRange is the opposite of inRange
instances.dt[, outsideRange := !inRange]
# remove the temporary columns 'start' and 'end'
instances.dt[, c("start", "end") := NULL]
Result
> instances.dt
id dates inRange outsideRange
1: AED 2018-05-17 09:52:00 FALSE TRUE
2: AED 2018-05-17 10:49:00 TRUE FALSE
3: CFR 2018-05-17 10:38:00 TRUE FALSE
4: DRR 2018-05-17 11:29:00 TRUE FALSE
5: DRR 2018-05-17 12:12:00 FALSE TRUE
6: DRR 2018-05-17 13:20:00 TRUE FALSE
7: UN 2018-05-17 14:28:00 TRUE FALSE
8: PO 2018-05-17 15:59:00 FALSE TRUE
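This is very fast, even on huge data.tables.
You could shorten the code, but I always prefer to work through the analysis step by step, which improves readability.
Chaining with magrittr's pipe operator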
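As a cross-check on the foverlaps() result, the same rule can be written row by row in base R. This is a sketch under assumptions, not part of the answer: the helper `in_any_range` is hypothetical, `tz = "UTC"` is added for reproducibility, and the original ranges (with the NA end) are used, so UN follows the question's rule and comes out FALSE rather than TRUE.

```r
instances <- data.frame(
  id = c("AED", "AED", "CFR", "DRR", "DRR", "DRR", "UN", "PO"),
  dates = as.POSIXct(c("2018-05-17 09:52:00", "2018-05-17 10:49:00",
                       "2018-05-17 10:38:00", "2018-05-17 11:29:00",
                       "2018-05-17 12:12:00", "2018-05-17 13:20:00",
                       "2018-05-17 14:28:00", "2018-05-17 15:59:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
ranges <- data.frame(
  id = c("AED", "CFR", "DRR", "DRR", "UN"),
  start = as.POSIXct(c("2018-05-17 10:00:00", "2018-05-17 10:18:00",
                       "2018-05-17 11:18:00", "2018-05-17 13:10:00",
                       "2018-05-17 14:18:00"), tz = "UTC"),
  end = as.POSIXct(c("2018-05-17 11:56:00", "2018-05-17 12:23:00",
                     "2018-05-17 12:01:00", "2018-05-17 14:18:00", NA), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `date` lies inside any complete range for `id`
in_any_range <- function(id, date, ranges) {
  r <- ranges[ranges$id == id & !is.na(ranges$start) & !is.na(ranges$end), ]
  # any() over zero candidate ranges is FALSE, which covers unmatched ids
  any(date >= r$start & date <= r$end)
}

inRange <- vapply(
  seq_len(nrow(instances)),
  function(i) in_any_range(instances$id[i], instances$dates[i], ranges),
  logical(1)
)
```

This loops over every instance, so it is meant for verifying small samples rather than for large tables.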
library(data.table)
library(magrittr)
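ranges.dt <- setDT( ranges, key = c("id", "start", "end") )
result <- setDT( instances, key = NULL ) %>%
  # temporary zero-length 'range' built from the dates column
  .[, c( "start", "end") := list(dates, dates)] %>%
  # second column of the foverlaps() result is NA when no range matched
  .[, inRange := !is.na( foverlaps( ., ranges.dt, type = "within", mult = "first", nomatch = NA )[[2]] )] %>%
  .[, outsideRange := !inRange] %>%
  # drop the temporary columns again
  .[, c("start", "end") := NULL]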
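Another option that avoids both the temporary start/end columns and the made-up end timestamp is a non-equi update join, which data.table supports through the `on` argument. The sketch below is an alternative technique, not the answer's method; it rebuilds the sample data with an assumed `tz = "UTC"` and filters out incomplete ranges first so that an NA end yields FALSE.

```r
library(data.table)

instances.dt <- data.table(
  id = c("AED", "AED", "CFR", "DRR", "DRR", "DRR", "UN", "PO"),
  dates = as.POSIXct(c("2018-05-17 09:52:00", "2018-05-17 10:49:00",
                       "2018-05-17 10:38:00", "2018-05-17 11:29:00",
                       "2018-05-17 12:12:00", "2018-05-17 13:20:00",
                       "2018-05-17 14:28:00", "2018-05-17 15:59:00"), tz = "UTC")
)
ranges.dt <- data.table(
  id = c("AED", "CFR", "DRR", "DRR", "UN"),
  start = as.POSIXct(c("2018-05-17 10:00:00", "2018-05-17 10:18:00",
                       "2018-05-17 11:18:00", "2018-05-17 13:10:00",
                       "2018-05-17 14:18:00"), tz = "UTC"),
  end = as.POSIXct(c("2018-05-17 11:56:00", "2018-05-17 12:23:00",
                     "2018-05-17 12:01:00", "2018-05-17 14:18:00", NA), tz = "UTC")
)

# start with everything out of range
instances.dt[, inRange := FALSE]
# for each complete range, flag the instances with the same id whose date
# lies between start and end (both bounds inclusive)
instances.dt[ranges.dt[!is.na(start) & !is.na(end)],
             on = .(id, dates >= start, dates <= end),
             inRange := TRUE]
# outsideRange is the opposite of inRange, as in the workflow above
instances.dt[, outsideRange := !inRange]
```

Because the join updates `instances.dt` by reference, the original row order is preserved.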