Check whether values fall within specific ranges in a separate data frame

Date: 2018-05-29 19:44:33

Tags: r dataframe compare which

I have the following two data frames:

df <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
             dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00")))

events <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
                 start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
                 end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00",NA)))

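For each unique id, I want to compare every date in df against the date ranges listed in the events data frame (each row of events is treated as its own time range), so that I get the following result: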

result <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
                 dates = c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00"),
                 inRange = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
                 outsideRange = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))

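If an id from df does not appear in events, both inRange and outsideRange should be FALSE. If a df date is greater than events$start but events$end is NA, inRange should be TRUE.

I want to apply the solution to a much larger data set of at least 500,000 rows.

3 answers:

Answer 0 (score: 1)

In base R: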

df2 <- merge(df,events)
df2 <- within(df2, inRange <- dates > start & dates < end)
df2 <- aggregate(inRange ~ dates,df2,any)
#                 dates inRange
# 1 2018-05-17 09:52:00   FALSE
# 2 2018-05-17 09:56:00   FALSE
# 3 2018-05-17 10:38:00    TRUE
# 4 2018-05-17 11:29:00    TRUE
# 5 2018-05-17 12:12:00   FALSE
# 6 2018-05-17 13:20:00   FALSE
# 7 2018-05-17 14:28:00    TRUE
# 8 2018-05-17 15:59:00   FALSE

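The first merge is a Cartesian product when the two data frames share no column names (with the id column shown in the question, merge() joins on id instead). If your data is large, it may be better to extract the day from both sides first and then merge on it.

That means running the following before the code above: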

df$year <- as.Date(df$dates)
events$year <- as.Date(events$start) # assuming start and end are always on same day
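
The code in this answer does not use the id column or handle an NA end. As a hedged sketch only, and not the answer author's code, the merge approach might be adapted as follows. The helper columns row and hit, the tz = "UTC" setting, and the strict comparisons (copied from the answer) are assumptions made for illustration.

```r
# Data from the question; tz = "UTC" is an assumption for reproducibility
df <- data.frame(
    id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
    dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00",
                         "2018-05-17 10:38:00","2018-05-17 11:29:00",
                         "2018-05-17 12:12:00","2018-05-17 13:20:00",
                         "2018-05-17 14:28:00","2018-05-17 15:59:00"), tz = "UTC"),
    stringsAsFactors = FALSE)
events <- data.frame(
    id = c("AED","CFR","DRR","DRR","UN"),
    start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00",
                         "2018-05-17 11:18:00","2018-05-17 13:10:00",
                         "2018-05-17 14:18:00"), tz = "UTC"),
    end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00",
                       "2018-05-17 12:01:00","2018-05-17 14:18:00", NA), tz = "UTC"),
    stringsAsFactors = FALSE)

df$row <- seq_len(nrow(df))            # hypothetical helper: original row position
m <- merge(df, events, by = "id")      # inner join on id; ids without events drop out

# An NA end is treated as open-ended: any date after start counts as in range
m$hit <- m$dates > m$start & (is.na(m$end) | m$dates < m$end)

hit <- tapply(m$hit, m$row, any)       # TRUE if any event of that id covers the date
df$inRange <- FALSE                    # default for ids that have no events
df$inRange[as.integer(names(hit))] <- as.vector(hit)
df$outsideRange <- df$id %in% events$id & !df$inRange
df$row <- NULL
```

The merge still creates one row per (date, event) pair within each id, so memory use grows with the number of events per id.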

Answer 1 (score: 1)

One option is a non-equi update join with data.table. Join df with events on dates >= start and dates <= end, and set the inRange column to TRUE for the matching records.

library(data.table)

setDT(df)
setDT(events)

df[events, on=c("dates>=start", "dates<=end"), inRange := TRUE]
df
#                  dates inRange
# 1: 2018-05-17 09:52:00      NA
# 2: 2018-05-17 09:56:00      NA
# 3: 2018-05-17 10:38:00    TRUE
# 4: 2018-05-17 11:29:00    TRUE
# 5: 2018-05-17 12:12:00      NA
# 6: 2018-05-17 13:20:00      NA
# 7: 2018-05-17 14:28:00    TRUE
# 8: 2018-05-17 15:59:00      NA
# 
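
The join above leaves NA for unmatched rows and ignores id. A possible variant, offered as a sketch under assumptions rather than as the answer author's code, adds id to the join condition, caps an NA end at the latest date in df (an assumed reading of "open-ended"), and fills in FALSE. The tz = "UTC" setting is also an assumption.

```r
library(data.table)

df <- data.table(
    id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
    dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00",
                         "2018-05-17 10:38:00","2018-05-17 11:29:00",
                         "2018-05-17 12:12:00","2018-05-17 13:20:00",
                         "2018-05-17 14:28:00","2018-05-17 15:59:00"), tz = "UTC"))
events <- data.table(
    id = c("AED","CFR","DRR","DRR","UN"),
    start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00",
                         "2018-05-17 11:18:00","2018-05-17 13:10:00",
                         "2018-05-17 14:18:00"), tz = "UTC"),
    end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00",
                       "2018-05-17 12:01:00","2018-05-17 14:18:00", NA), tz = "UTC"))

# Assumption: an NA end means the event is still open, so cap it at the last date in df
events[is.na(end), end := max(df$dates)]

df[, inRange := FALSE]                                   # default when nothing matches
df[events, on = .(id, dates >= start, dates <= end),     # non-equi update join within id
   inRange := TRUE]
df[, outsideRange := id %in% events$id & !inRange]       # id has events, but none covers the date
```

Because the update happens by reference inside the join, no intermediate merged table is built, which matters at 500,000 rows.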

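Answer 2 (score: 1)

If the events do not overlap, sort the start and end coordinates together and use findInterval() to identify the dates that fall in odd-numbered intervals.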

x = with(events, sort(c(start, end)))
df$inRange = findInterval(df$dates, x) %% 2 == 1
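
To see why odd-numbered intervals are the ones inside an event, here is a small illustration on made-up numeric boundaries (the values and the names bounds, x, idx and in_range are hypothetical, chosen only for this example):

```r
bounds <- c(10, 20, 30, 40)        # two toy events: [10, 20) and [30, 40)
x <- c(5, 15, 25, 35, 45)          # values to classify

idx <- findInterval(x, bounds)     # number of boundaries at or below each value
idx
# [1] 0 1 2 3 4

in_range <- idx %% 2 == 1          # odd count: a start was passed but not yet its end
in_range
# [1] FALSE  TRUE FALSE  TRUE FALSE
```

Since sort() drops NA by default, an event with an NA end contributes only its start, which leaves the final interval open-ended.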

If the events overlap, create a vector of all the event times, work out how to put them in order, and then sort them

times <- with(events, c(start, end))
o <- order(times)
times <- times[o]

Create an event vector that is 1 where a start occurs and -1 where an end occurs, and put these in the same order

event <- rep(c(1, -1), each = nrow(events))[o]

Compute the 'coverage', the number of events in effect at each point.

cvg <- cumsum(event)

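Finally, build the boundaries of the updated (merged) events: the starts are the times where the event is a start and the coverage is 1, and the ends are the times where the event is an end and the coverage is 0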

times[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]

and proceed as above.
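
The steps can be traced on a small numeric example. The three toy intervals below ([1, 5], [3, 8] and [10, 12]) and the names merged and probe are made up for illustration; the first two intervals overlap and should collapse into [1, 8].

```r
start <- c(1, 3, 10)
end   <- c(5, 8, 12)

times <- c(start, end)
o <- order(times)
times <- times[o]                  # 1 3 5 8 10 12

event <- rep(c(1, -1), each = length(start))[o]
event                              # 1 1 -1 -1 1 -1

cvg <- cumsum(event)
cvg                                # 1 2 1 0 1 0

# Keep the boundaries where a merged interval opens (coverage rises to 1)
# or closes (coverage falls to 0)
merged <- times[(event == 1 & cvg == 1) | (event == -1 & cvg == 0)]
merged
# [1]  1  8 10 12

probe <- c(0, 4, 9, 11)
findInterval(probe, merged) %% 2 == 1
# [1] FALSE  TRUE FALSE  TRUE
```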

Putting it all together, we have

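reduce_int <- function(start, end) {
    # pool all boundaries and sort them
    x <- c(start, end)
    o <- order(x)
    x <- x[o]

    # +1 for each start, -1 for each end, in sorted order
    event <- rep(c(1, -1), each = length(start))[o]
    cvg <- cumsum(event)    # number of events in effect after each boundary

    # keep only the boundaries where a merged interval opens or closes
    x[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]
}

overlaps <- function(x, events) {
    vec <- reduce_int(events$start, events$end)
    findInterval(x, vec) %% 2 == 1
}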

Use it as

df$inRange <- overlaps(df$dates, events)
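
The call above treats all events as one pool and cannot take an NA end, because findInterval() rejects NA in its second argument. One way this might be extended to the question's per-id requirement is sketched below. The function overlaps_by_id, the conversion to numeric, the use of Inf for an open end, and tz = "UTC" are all assumptions for illustration, not part of the original answer.

```r
reduce_int <- function(start, end) {
    x <- c(start, end)
    o <- order(x)
    x <- x[o]
    event <- rep(c(1, -1), each = length(start))[o]
    cvg <- cumsum(event)
    x[(event == 1 & cvg == 1) | (event == -1 & cvg == 0)]
}

# Hypothetical wrapper: apply the interval test separately for each id
overlaps_by_id <- function(x, x_id, events) {
    out <- logical(length(x))              # FALSE for ids that have no events
    for (k in unique(events$id)) {
        idx <- which(x_id == k)
        if (length(idx) == 0) next
        ev <- events[events$id == k, ]
        s <- as.numeric(ev$start)
        e <- as.numeric(ev$end)
        e[is.na(e)] <- Inf                 # assumption: NA end means open-ended
        vec <- reduce_int(s, e)
        out[idx] <- findInterval(as.numeric(x[idx]), vec) %% 2 == 1
    }
    out
}

df <- data.frame(
    id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
    dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00",
                         "2018-05-17 10:38:00","2018-05-17 11:29:00",
                         "2018-05-17 12:12:00","2018-05-17 13:20:00",
                         "2018-05-17 14:28:00","2018-05-17 15:59:00"), tz = "UTC"),
    stringsAsFactors = FALSE)
events <- data.frame(
    id = c("AED","CFR","DRR","DRR","UN"),
    start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00",
                         "2018-05-17 11:18:00","2018-05-17 13:10:00",
                         "2018-05-17 14:18:00"), tz = "UTC"),
    end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00",
                       "2018-05-17 12:01:00","2018-05-17 14:18:00", NA), tz = "UTC"),
    stringsAsFactors = FALSE)

df$inRange <- overlaps_by_id(df$dates, df$id, events)
df$outsideRange <- df$id %in% events$id & !df$inRange
```

The loop runs once per id, so it is likely to suit data with a moderate number of distinct ids; with very many ids a join-based approach may be faster.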