I have this data frame, sorted by end time:
A B C D E
--------------- ----------
D1 D2 D3 E1 E2 E3
------- ------- -------
F G H I J K
Each row is a time interval with a start time and an end time.
# tz = "GMT" is set explicitly so the times do not depend on the local time zone
df <- data.frame(ID = c(1,1,1,1,1,1,1), NumberInSequence = c(1,2,3,4,5,6,7),
  StartTime = as.POSIXct(c("2016-01-15 18:02:11","2016-01-15 18:10:33","2016-01-15 18:25:08",
    "2016-01-15 18:33:56","2016-01-15 18:21:03","2016-01-15 19:55:09","2016-01-15 19:57:03"), tz = "GMT"),
  EndTime = as.POSIXct(c("2016-01-15 18:02:17","2016-01-15 18:10:39","2016-01-15 18:25:14",
    "2016-01-15 18:34:02","2016-01-15 19:53:17","2016-01-15 19:56:15","2016-01-15 19:58:17"), tz = "GMT")
)
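I then use dplyr (code and output further down) to add a couple of fields: the next start time, and the wait time, which is the difference between NextStartTime and EndTime. This creates a WaitTime column that works in most cases, except when there are multiple overlaps.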
df
ID NumberInSequence StartTime EndTime
1 1 1 2016-01-15 18:02:11 2016-01-15 18:02:17
2 1 2 2016-01-15 18:10:33 2016-01-15 18:10:39
3 1 3 2016-01-15 18:25:08 2016-01-15 18:25:14
4 1 4 2016-01-15 18:33:56 2016-01-15 18:34:02
5 1 5 2016-01-15 18:21:03 2016-01-15 19:53:17
6 1 6 2016-01-15 19:55:09 2016-01-15 19:56:15
7 1 7 2016-01-15 19:57:03 2016-01-15 19:58:17
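Now I need to add a column called "FLAG" whose value is either OK or NOT OK.
"OK" means the interval is neither inside another interval nor partly covered by one. So an "OK" interval has no overlap with any other interval.
"NOT OK" means the interval IS partly or fully inside another interval. So a "NOT OK" interval overlaps with other intervals.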
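For the intervals above, the FLAG column should come out as in the table at the end of this question, shown with a short description of each result.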
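library(dplyr)

df %>% group_by(ID) %>%
  mutate(
    # next row's StartTime, kept only when the next row is the next number in sequence
    NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)],
    # gap between this interval's end and the next interval's start (negative = overlap)
    WaitTime = difftime(NextStartTime, EndTime, units = 's')
    #max_s = max(StartTime) #,
    # cum_max_s = as.POSIXct(cummin(as.numeric(StartTime)),origin="1970-01-01")
  )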
ID NumberInSequence StartTime EndTime NextStartTime WaitTime
1 1 1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33 496 secs
2 1 2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08 869 secs
3 1 3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56 522 secs
4 1 4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs
5 1 5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09 112 secs
6 1 6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03 48 secs
7 1 7 2016-01-15 19:57:03 2016-01-15 19:58:17 <NA> NA secs
I was thinking of using cummin or cummax in dplyr... maybe...
StartTime EndTime FLAG
2016-01-15 18:02:11 2016-01-15 18:02:17 OK - this interval does not overlap with other intervals
2016-01-15 18:10:33 2016-01-15 18:10:39 OK - this interval does not overlap with other intervals
2016-01-15 18:25:08 2016-01-15 18:25:14 NOT OK - this interval is within the 18:21:03 start time interval
2016-01-15 18:33:56 2016-01-15 18:34:02 NOT OK - this interval is within the 18:21:03 start time interval
2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK - this interval contains other intervals
2016-01-15 19:55:09 2016-01-15 19:56:15 OK - this interval does not overlap with other intervals
2016-01-15 19:57:03 2016-01-15 19:58:17 OK - this interval does not overlap with other intervals
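The OK / NOT OK rule described above can be written down as a pairwise test: two intervals overlap when each one starts no later than the other ends. The following is only a minimal sketch of that rule, not a settled solution. The helper name flag_overlaps is hypothetical, touching endpoints are assumed to count as an overlap, and every pair of rows is compared, so it only suits small inputs.

```r
# Hypothetical helper: flag each interval "OK" or "NOT OK".
flag_overlaps <- function(start, end) {
  s <- as.numeric(start)
  e <- as.numeric(end)
  # overlap[i, j] is TRUE when s[i] <= e[j] and s[j] <= e[i],
  # i.e. intervals i and j share at least one instant
  overlap <- outer(s, e, "<=") & t(outer(s, e, "<="))
  diag(overlap) <- FALSE  # every interval overlaps itself; ignore that
  ifelse(rowSums(overlap) > 0, "NOT OK", "OK")
}

# Three of the intervals from the question: the second lies inside the third.
start <- as.POSIXct(c("2016-01-15 18:02:11", "2016-01-15 18:25:08",
                      "2016-01-15 18:21:03"), tz = "GMT")
end   <- as.POSIXct(c("2016-01-15 18:02:17", "2016-01-15 18:25:14",
                      "2016-01-15 19:53:17"), tz = "GMT")
flags <- flag_overlaps(start, end)
print(flags)
```

The first interval comes out as OK and the other two as NOT OK, which agrees with the corresponding rows of the expected table.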
Answer 0 (score: 2)
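Here is my attempt. I think foverlaps() from the data.table package is our friend here. You can find some examples on SO; it is worth going through them to understand the function. You need a dummy data.table that includes the start and end times; in your case you already have them. I created dummy with the minimum information needed. Then you use setkey() and foverlaps().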
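library(data.table)

# df2 is assumed here to be the data frame from the question
# with NextStartTime and WaitTime already added.
# Create a dummy data.table for foverlaps().
dummy <- setDT(df2)[, 1:4, with = FALSE]
# Key df2 on the interval columns, then use foverlaps().
setkey(setDT(df2), StartTime, EndTime)
foo <- foverlaps(dummy, setDT(df2), by.x = c("StartTime", "EndTime"))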
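To see what the join returns before any clean-up, here is a small self-contained sketch on made-up integer intervals. The tables x and y and their values are illustrative only and do not come from the question. With the default type = "any", foverlaps() returns one row per overlapping pair, and since every interval also matches itself, an interval with more than one match overlaps something else.

```r
library(data.table)

# Three made-up intervals; the second lies inside the third.
y <- data.table(id = 1:3, start = c(1L, 12L, 10L), end = c(5L, 14L, 20L))
x <- copy(y)

# foverlaps() needs y keyed on its interval columns (start, then end).
setkey(y, start, end)

# Default type = "any": every pair of rows whose intervals share a point.
# Columns coming from x are prefixed with "i." in the result.
res <- foverlaps(x, y, by.x = c("start", "end"))

# Count matches per interval of x; a count above 1 means a real overlap.
counts <- res[, .N, by = i.id][order(i.id)]
print(counts)
```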
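Now it is time to clean up the data. For each NumberInSequence, if there is more than one overlapping interval (.N > 1), remove the rows where the start and end times are identical (StartTime == i.StartTime & EndTime == i.EndTime), which are the interval matched with itself. Then remove the duplicated rows for each NumberInSequence; one row showing an overlap with another interval is enough, right? Finally, if StartTime == i.StartTime & EndTime == i.EndTime is TRUE, no other interval overlaps that interval, so you say OK. Otherwise, NOT OK. Remove the extra columns afterwards if necessary.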
out <- foo[, .SD[!(StartTime == i.StartTime & EndTime == i.EndTime & .N > 1)],
           by = c("ID", "NumberInSequence")][!duplicated(NumberInSequence)][,
           check := ifelse(StartTime == i.StartTime & EndTime == i.EndTime,
                           "OK", "NOT OK")]
print(out)
# ID NumberInSequence StartTime EndTime NextStartTime WaitTime i.ID i.NumberInSequence
#1: 1 1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33 496 secs 1 1
#2: 1 2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08 869 secs 1 2
#3: 1 5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09 112 secs 1 3
#4: 1 3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56 522 secs 1 5
#5: 1 4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs 1 5
#6: 1 6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03 48 secs 1 6
#7: 1 7 2016-01-15 19:57:03 2016-01-15 19:58:17 <NA> NA secs 1 7
# i.StartTime i.EndTime check
#1: 2016-01-15 18:02:11 2016-01-15 18:02:17 OK
#2: 2016-01-15 18:10:33 2016-01-15 18:10:39 OK
#3: 2016-01-15 18:25:08 2016-01-15 18:25:14 NOT OK
#4: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK
#5: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK
#6: 2016-01-15 19:55:09 2016-01-15 19:56:15 OK
#7: 2016-01-15 19:57:03 2016-01-15 19:58:17 OK
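As a cross-check, and picking up the cummax idea from the question, the same flags can be sketched in base R without a join. This is an illustrative alternative, not part of the foverlaps() approach above. It assumes that touching endpoints count as an overlap, and the names ord, prev_max_end and next_start are my own. After sorting by start time, an interval overlaps an earlier one if it starts no later than the latest end seen so far, and it overlaps a later one if the next interval starts no later than its own end.

```r
start <- as.POSIXct(c("2016-01-15 18:02:11", "2016-01-15 18:10:33", "2016-01-15 18:25:08",
                      "2016-01-15 18:33:56", "2016-01-15 18:21:03", "2016-01-15 19:55:09",
                      "2016-01-15 19:57:03"), tz = "GMT")
end   <- as.POSIXct(c("2016-01-15 18:02:17", "2016-01-15 18:10:39", "2016-01-15 18:25:14",
                      "2016-01-15 18:34:02", "2016-01-15 19:53:17", "2016-01-15 19:56:15",
                      "2016-01-15 19:58:17"), tz = "GMT")

# Work on the intervals sorted by start time.
ord <- order(start)
s <- as.numeric(start)[ord]
e <- as.numeric(end)[ord]
n <- length(s)

# Latest end among all intervals that start earlier (-Inf for the first one).
prev_max_end <- c(-Inf, cummax(e)[-n])
# Because rows are sorted by start, the earliest later start is the next row's start.
next_start <- c(s[-1], Inf)

overlaps <- s <= prev_max_end | next_start <= e

# Write the flags back in the original row order.
flag <- character(n)
flag[ord] <- ifelse(overlaps, "NOT OK", "OK")
print(data.frame(StartTime = start, EndTime = end, FLAG = flag))
```

On the question's data this flags rows 3, 4 and 5 as NOT OK and the rest as OK, the same result as the check column above.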