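I have a data frame created by merging two data frames. Both span the same time period but contain different information. When I put them together the intervals overlap, because one of the two data frames has no gaps in its time intervals. In the example below, the rows with sp = A and sp = B come from the first df and the rows with sp = C come from the second. The first data frame is continuous, whereas the second contains sporadic events. The resulting data frame looks like this: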
start end sp
2010-06-01 17:00:00 2010-06-01 19:30:00 A
2010-06-01 19:30:01 2010-06-01 20:00:00 B
2010-06-01 19:45:00 2010-06-01 19:55:00 C
2010-06-01 20:00:01 2010-06-01 20:30:00 A
2010-06-01 20:05:00 2010-06-01 20:10:00 C
2010-06-01 20:12:00 2010-06-01 20:15:00 C
2010-06-01 20:30:01 2010-06-01 20:40:00 B
2010-06-01 20:35:00 2010-06-01 20:40:10 C
2010-06-01 20:40:01 2010-06-01 20:50:00 A
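I want to give priority to "C": whenever a "C" interval overlaps the interval of another "sp", the "A" or "B" interval should be cut accordingly. As the example shows, a single "A" or "B" event is sometimes overlapped by several "C" events. The result would be: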
start end sp
2010-06-01 17:00:00 2010-06-01 19:30:00 A
2010-06-01 19:30:01 2010-06-01 19:44:59 B
2010-06-01 19:45:00 2010-06-01 19:55:00 C
2010-06-01 19:55:01 2010-06-01 20:00:00 B
2010-06-01 20:00:01 2010-06-01 20:04:59 A
2010-06-01 20:05:00 2010-06-01 20:10:00 C
2010-06-01 20:10:01 2010-06-01 20:11:59 A
2010-06-01 20:12:00 2010-06-01 20:15:00 C
2010-06-01 20:15:01 2010-06-01 20:30:00 A
2010-06-01 20:30:01 2010-06-01 20:34:59 B
2010-06-01 20:35:00 2010-06-01 20:40:10 C
2010-06-01 20:40:11 2010-06-01 20:50:00 A
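Note that the cut boundaries in the table above are offset by one second from the "C" boundaries (B ends at 19:44:59 when C starts at 19:45:00). With POSIXct values this is plain arithmetic, since adding or subtracting a number shifts the time by that many seconds. A minimal sketch, where `tz = "UTC"` and the variable names are assumptions made only for illustration:

```r
# tz = "UTC" is an assumption; the question does not state a time zone.
c_start <- as.POSIXct("2010-06-01 19:45:00", tz = "UTC")
c_end   <- as.POSIXct("2010-06-01 19:55:00", tz = "UTC")

# Arithmetic on POSIXct is in seconds.
b_first_end    <- c_start - 1  # 2010-06-01 19:44:59 UTC
b_second_start <- c_end + 1    # 2010-06-01 19:55:01 UTC
```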
My date/time columns are in POSIXct. If anything is unclear, don't hesitate to ask.
Thanks in advance.
Answer 0 (score: 2)
Here is one way to do this using the plyr package and a recursive function:
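library(plyr)

# arow: a single A/B row; df: the data frame of C intervals.
# Returns arow trimmed or split so that it no longer overlaps any C interval.
splitTimes <- function(arow, df) {
  # arow lies entirely inside a C interval
  overlap_all = arow$start > df[, 'start'] & arow$end < df[, 'end']
  # a C interval lies entirely inside arow
  overlap_middle = arow$start < df[, 'start'] & arow$end > df[, 'end']
  # a C interval covers the end of arow
  overlap_end = arow$start < df[, 'start'] & arow$end > df[, 'start'] & arow$end < df[, 'end']
  # a C interval covers the start of arow
  overlap_start = arow$start > df[, 'start'] & arow$end > df[, 'end'] & arow$start < df[, 'end']
  if(any(overlap_all)) {
    # nothing of arow survives
    data.frame()
  } else if(any(overlap_middle)) {
    # split around the first enclosed C interval, then recurse on both pieces
    outrows = rbind(data.frame(start=arow$start, end=df[overlap_middle, 'start'][1]-1, sp=arow$sp),
                    data.frame(start=df[overlap_middle, 'end'][1]+1, end=arow$end, sp=arow$sp))
    ddply(outrows, 'start', 'splitTimes', df)
  } else if(any(overlap_end)) {
    # stop one second before the C interval starts
    data.frame(start=arow$start, end=df[overlap_end, 'start']-1, sp=arow$sp)
  } else if(any(overlap_start)) {
    # resume one second after the C interval ends
    data.frame(start=df[overlap_start, 'end']+1, end=arow$end, sp=arow$sp)
  } else {
    # no overlap: keep the row unchanged
    arow
  }
}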
Then you can do this:
> dfall = read.table('data.txt', header=T, colClasses=c('POSIXct', 'POSIXct', 'factor'))
> dfAB = subset(dfall, sp %in% c('A', 'B'))
> dfC = subset(dfall, sp == 'C')
> arrange(rbind(ddply(dfAB, 'start', 'splitTimes', dfC), dfC), start)
start end sp
1 2010-06-01 17:00:00 2010-06-01 19:30:00 A
2 2010-06-01 19:30:01 2010-06-01 19:44:59 B
3 2010-06-01 19:45:00 2010-06-01 19:55:00 C
4 2010-06-01 19:55:01 2010-06-01 20:00:00 B
5 2010-06-01 20:00:01 2010-06-01 20:04:59 A
6 2010-06-01 20:05:00 2010-06-01 20:10:00 C
7 2010-06-01 20:10:01 2010-06-01 20:11:59 A
8 2010-06-01 20:12:00 2010-06-01 20:15:00 C
9 2010-06-01 20:15:01 2010-06-01 20:30:00 A
10 2010-06-01 20:30:01 2010-06-01 20:34:59 B
11 2010-06-01 20:35:00 2010-06-01 20:40:10 C
12 2010-06-01 20:40:11 2010-06-01 20:50:00 A
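That gives you the output you wanted.
There may be a few small bugs in other cases, since your example dataset doesn't cover all of them, but this is at least the general idea. Hope that helps. Good luck!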
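For the cases the function above might miss (for example a "C" interval sharing an exact boundary with an "A" or "B" row, or one row clipped at both ends), a different approach is to walk through the "C" intervals in order and keep whatever is left of each row. This is only a sketch in base R, not the plyr method above. The helper names `cut_out` and `resolve_overlaps`, the `tz = "UTC"` setting, and the treatment of intervals as inclusive at one-second resolution are all assumptions made for illustration.

```r
# Hypothetical helper: remove every interval in cdf from [start, end].
# Intervals are assumed inclusive, with one-second resolution.
cut_out <- function(start, end, sp, cdf) {
  cdf <- cdf[order(cdf$start), ]
  pieces <- list()
  cursor <- start                      # first second not yet handled
  for (i in seq_len(nrow(cdf))) {
    cs <- cdf$start[i]
    ce <- cdf$end[i]
    if (ce < cursor || cs > end) next  # no overlap with what is left
    if (cs > cursor) {                 # keep the part before this C interval
      pieces[[length(pieces) + 1]] <- data.frame(start = cursor, end = cs - 1, sp = sp)
    }
    cursor <- max(cursor, ce + 1)      # resume one second after C ends
    if (cursor > end) break
  }
  if (cursor <= end) {                 # keep the tail after the last C interval
    pieces[[length(pieces) + 1]] <- data.frame(start = cursor, end = end, sp = sp)
  }
  do.call(rbind, pieces)               # NULL when nothing survives
}

# Hypothetical driver: trim all non-priority rows, then add the priority rows back.
resolve_overlaps <- function(df, priority = "C") {
  cdf  <- df[df$sp == priority, ]
  rest <- df[df$sp != priority, ]
  trimmed <- lapply(seq_len(nrow(rest)), function(i) {
    cut_out(rest$start[i], rest$end[i], rest$sp[i], cdf)
  })
  out <- do.call(rbind, c(trimmed, list(cdf)))
  out <- out[order(out$start), ]
  rownames(out) <- NULL
  out
}

# The example data from the question; tz = "UTC" is an assumption.
ts <- function(x) as.POSIXct(paste("2010-06-01", x), tz = "UTC")
dfall <- data.frame(
  start = ts(c("17:00:00", "19:30:01", "19:45:00", "20:00:01", "20:05:00",
               "20:12:00", "20:30:01", "20:35:00", "20:40:01")),
  end   = ts(c("19:30:00", "20:00:00", "19:55:00", "20:30:00", "20:10:00",
               "20:15:00", "20:40:00", "20:40:10", "20:50:00")),
  sp    = c("A", "B", "C", "A", "C", "C", "B", "C", "A")
)

res <- resolve_overlaps(dfall)
print(res)
```

On the example data this should reproduce the 12 rows of the desired output. Because each row is handled with a single pass over the sorted "C" intervals, no recursion is needed, but it has not been checked against data beyond this example.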