I have the following two data frames:
df <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00")))
events <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00",NA)))
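For each unique id, I want to compare every date in df against the date ranges listed in the events data frame (each row of events is treated as its own time range), so that I get the following result: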
result <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
dates = c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00"),
inRange = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
outsideRange = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
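If an id from df is not present in events, both inRange and outsideRange should be FALSE. If the df date is later than events$start but events$end is NA, inRange should be TRUE.
I would like to apply the solution to a much larger data set of at least 500,000 rows.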
Answer 0 (score: 1)
In base R:
df2 <- merge(df,events)
df2 <- within(df2, inRange <- dates > start & dates < end)
df2 <- aggregate(inRange ~ dates,df2,any)
# dates inRange
# 1 2018-05-17 09:52:00 FALSE
# 2 2018-05-17 09:56:00 FALSE
# 3 2018-05-17 10:38:00 TRUE
# 4 2018-05-17 11:29:00 TRUE
# 5 2018-05-17 12:12:00 FALSE
# 6 2018-05-17 13:20:00 FALSE
# 7 2018-05-17 14:28:00 TRUE
# 8 2018-05-17 15:59:00 FALSE
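The first merge is a Cartesian product (merge() pairs every row with every row when the two data frames share no column), so if your data is large it is probably better to extract the day on both sides first and merge on that.
That means running the following before the code above: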
df$day <- as.Date(df$dates)
events$day <- as.Date(events$start) # assuming start and end are always on the same day
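The question also keys on id and treats an NA end as open-ended, which the snippet above does not cover. Below is a minimal base R sketch of one way to handle that, not the answerer's own code. It assumes strict inequalities as in the answer, an explicit UTC time zone for reproducibility, and reads outsideRange as "the id has events, but the date falls in none of them". The helper column names row and hit are made up for illustration.

```r
# Data from the question; tz = "UTC" is an assumption added for reproducibility.
df <- data.frame(
  id = c("AED", "AED", "CFR", "DRR", "DRR", "DRR", "UN", "PO"),
  dates = as.POSIXct(c("2018-05-17 09:52:00", "2018-05-17 10:49:00",
                       "2018-05-17 10:38:00", "2018-05-17 11:29:00",
                       "2018-05-17 12:12:00", "2018-05-17 13:20:00",
                       "2018-05-17 14:28:00", "2018-05-17 15:59:00"), tz = "UTC"))
events <- data.frame(
  id = c("AED", "CFR", "DRR", "DRR", "UN"),
  start = as.POSIXct(c("2018-05-17 10:00:00", "2018-05-17 10:18:00",
                       "2018-05-17 11:18:00", "2018-05-17 13:10:00",
                       "2018-05-17 14:18:00"), tz = "UTC"),
  end = as.POSIXct(c("2018-05-17 11:56:00", "2018-05-17 12:23:00",
                     "2018-05-17 12:01:00", "2018-05-17 14:18:00", NA), tz = "UTC"))

df$row <- seq_len(nrow(df))                      # remember the original row order
m <- merge(df, events, by = "id", all.x = TRUE)  # left join on id, not a Cartesian product

# A row is a hit when it has an event, lies after its start, and either
# lies before its end or the end is NA (treated as open-ended).
m$hit <- !is.na(m$start) & m$dates > m$start & (is.na(m$end) | m$dates < m$end)

# Collapse back to one value per original df row.
hit_by_row <- tapply(m$hit, m$row, any)
df$inRange <- as.vector(hit_by_row[as.character(df$row)])

# Assumed reading of outsideRange: the id has events, but none contains the date.
df$outsideRange <- df$id %in% events$id & !df$inRange
df$row <- NULL
```

Because the join is restricted to matching ids, the intermediate table grows with the number of events per id rather than with the full product of both tables, which matters at 500,000 rows.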
Answer 1 (score: 1)
One option is to use a non-equi update join with data.table. Join df and events on dates >= start and dates <= end, and set the inRange column to TRUE for the matching records.
library(data.table)
setDT(df)
setDT(events)
df[events, on=c("dates>=start", "dates<=end"), inRange := TRUE]
df
# dates inRange
# 1: 2018-05-17 09:52:00 NA
# 2: 2018-05-17 09:56:00 NA
# 3: 2018-05-17 10:38:00 TRUE
# 4: 2018-05-17 11:29:00 TRUE
# 5: 2018-05-17 12:12:00 NA
# 6: 2018-05-17 13:20:00 NA
# 7: 2018-05-17 14:28:00 TRUE
# 8: 2018-05-17 15:59:00 NA
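Answer 2 (score: 1)
If the events do not overlap, sort the start and end coordinates together and use findInterval() to identify the dates that fall in odd-numbered intervals: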
x = with(events, sort(c(start, end)))
df$inRange = findInterval(df$dates, x) %% 2 == 1
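To see why the odd/even test works, here is a small sketch on made-up numeric data (the values and variable names are illustrative, not from the question). Once starts and ends are sorted into a single vector, a point that lands in interval 1, 3, 5, ... sits between a start and its end.

```r
# Two non-overlapping intervals: [10, 20) and [30, 40)
starts <- c(10, 30)
ends   <- c(20, 40)
breaks <- sort(c(starts, ends))  # 10 20 30 40

x <- c(5, 10, 15, 20, 25, 35, 45)

# findInterval() returns how many breaks are <= each point:
# 0 = before the first start, 1 = inside the first interval, 2 = in the gap, ...
idx <- findInterval(x, breaks)

# Odd counts mean "after a start but before its end"
in_range <- idx %% 2 == 1
```

With the default arguments the intervals are closed on the left and open on the right, so a point exactly equal to a start counts as inside and a point exactly equal to an end counts as outside.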
If the events do overlap, create a vector of all event times, work out the order they should go in, and sort them:
times <- with(events, c(start, end))
o <- order(times)
times <- times[o]
Create an event vector that is 1 where a start occurs and -1 where an end occurs, and put these events in the same order:
event <- rep(c(1, -1), each = nrow(events))[o]
Compute the 'coverage', the number of events in effect at each time:
cvg <- cumsum(event)
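Finally, build the reduced, non-overlapping set of boundaries: take the times where the event is a start and the coverage is 1, and likewise the times where the event is an end and the coverage is 0: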
times[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]
and then proceed as above.
Putting it all together, we have
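reduce_int <- function(start, end) {
    # merge overlapping intervals into sorted, non-overlapping boundaries
    x <- c(start, end)
    o <- order(x)
    x <- x[o]
    # +1 for each start, -1 for each end; use length(start), not a global
    event <- rep(c(1, -1), each = length(start))[o]
    cvg <- cumsum(event)  # number of events in effect
    x[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]
}
overlaps <- function(x, events) {
    vec <- reduce_int(events$start, events$end)
    # odd-numbered intervals lie between a start and its end
    findInterval(x, vec) %% 2 == 1
}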
Use it as
df$inRange <- overlaps(df$dates, events)
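As a sanity check of the coverage idea, here is a self-contained sketch on made-up numeric intervals. The function name merge_boundaries and the data are hypothetical; the logic mirrors the steps described in this answer.

```r
# Hypothetical helper: collapse possibly overlapping intervals into the
# sorted boundaries of their union.
merge_boundaries <- function(start, end) {
  x <- c(start, end)
  o <- order(x)                                    # ties keep starts before ends
  event <- rep(c(1, -1), each = length(start))[o]  # +1 = start, -1 = end
  cvg <- cumsum(event)                             # intervals open at each boundary
  x[o][(event == 1 & cvg == 1) | (event == -1 & cvg == 0)]
}

# [1, 5) and [3, 8) overlap and should merge into [1, 8); [10, 12) stays separate
bounds <- merge_boundaries(start = c(1, 3, 10), end = c(5, 8, 12))

pts <- c(0, 4, 6, 9, 11, 13)
inside <- findInterval(pts, bounds) %% 2 == 1
```

Note that this sketch assumes every interval has a known end. An NA end, like the one in the question's events data, would need to be replaced first (for example with a value later than any date in the data), since findInterval() does not accept NA break points.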