Question

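I am looking for an implementation of union for time intervals that can handle unions that are not themselves intervals.

I noticed that lubridate includes a union function for time intervals, but it always returns a single interval, even when the union is not an interval (i.e. it returns the interval defined by the minimum of the two start dates and the maximum of the two end dates, ignoring any intervening period not covered by either interval):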

library(lubridate)
int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
union(int1, int2)
# Union includes intervening time between intervals.
# [1] 2001-01-01 UTC--2004-01-01 UTC

I also looked at the interval package, but its documentation makes no mention of union.

My ultimate goal is to use the complex union with %within%:

my_int %within% Reduce(union, list_of_intervals)

So, to consider a concrete example, suppose list_of_intervals is:

[[1]] 2000-01-01 -- 2001-01-02 
[[2]] 2001-01-01 -- 2004-01-02 
[[3]] 2005-01-01 -- 2006-01-02

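Then my_int <- 2001-01-01 -- 2004-01-01 is %within% the union of list_of_intervals (the first two intervals overlap and together cover 2000-01-01 -- 2004-01-02), so it should return TRUE, whereas my_int <- 2003-01-01 -- 2006-01-01 crosses the uncovered gap between 2004-01-02 and 2005-01-01, so it should return FALSE.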

However, I suspect that complex unions have more uses than this.

Answer 1

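If I understand your question, you want to start with a set of possibly overlapping intervals and obtain a list of intervals that represents the UNION of the input set, rather than just a single interval spanning the minimum and maximum of the input set. This is the same problem I had.

A similar question was asked in Union of intervals

...but the accepted response fails when intervals overlap. However, hosolmaz (I am new to SO, so I don't know how to link to this user) posted a modification (in Python) that fixes the issue, which I then converted to R as follows: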

library(dplyr) # for %>%, arrange, bind_rows

interval_union <- function(input) {
  # Nothing to merge for zero or one interval.
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by start so each interval only has to be compared with the last merged one.
  input <- input %>% arrange(start)
  output <- input[1, ]
  for (i in 2:nrow(input)) {
    x <- input[i, ]
    last <- nrow(output)
    if (output$stop[last] < x$start) {
      # Gap before x: open a new merged interval.
      output <- bind_rows(output, x)
    } else if (x$stop > output$stop[last]) {
      # x touches or overlaps the last merged interval: extend it.
      output$stop[last] <- x$stop
    }
  }
  return(output)
}

Your example contains both overlapping and non-contiguous intervals:

d <- as.data.frame(list(
  start = c('2005-01-01', '2000-01-01', '2001-01-01'),
  stop = c('2006-01-02', '2001-01-02', '2004-01-02')),
  stringsAsFactors = FALSE)

This produces:

> d
       start       stop
1 2005-01-01 2006-01-02
2 2000-01-01 2001-01-02
3 2001-01-01 2004-01-02

> interval_union(d)
       start       stop
1 2000-01-01 2004-01-02
2 2005-01-01 2006-01-02

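I am a relative novice at R programming, so it would be great if someone could convert the interval_union() function above to accept as arguments not only the input data frame but also the names of the 'start' and 'stop' columns to use, so that the function can be reused more easily.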
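One possible way to generalise the function along those lines is sketched below. It is only an illustration: the name interval_union_cols and the parameters start_col and stop_col are made up for this example, it uses base R rather than dplyr, and it keeps only the two interval columns in its result.

```r
# Hypothetical generalisation of interval_union(); base R only.
# `start_col` and `stop_col` are assumed parameter names, not part of any package.
interval_union_cols <- function(input, start_col = "start", stop_col = "stop") {
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by the start column so overlaps can only involve the last merged interval.
  input <- input[order(input[[start_col]]), , drop = FALSE]
  starts <- input[[start_col]]
  stops <- input[[stop_col]]
  out_start <- starts[1]
  out_stop <- stops[1]
  for (i in seq_along(starts)[-1]) {
    n <- length(out_stop)
    if (out_stop[n] < starts[i]) {
      # Gap before interval i: open a new merged interval.
      out_start <- c(out_start, starts[i])
      out_stop <- c(out_stop, stops[i])
    } else if (stops[i] > out_stop[n]) {
      # Interval i touches or overlaps the last merged one: extend it.
      out_stop[n] <- stops[i]
    }
  }
  # Only the two interval columns are returned; other columns are dropped.
  output <- data.frame(out_start, out_stop, stringsAsFactors = FALSE)
  names(output) <- c(start_col, stop_col)
  output
}

# Same intervals as above, but stored as Dates under different column names.
d2 <- data.frame(
  begin = as.Date(c("2005-01-01", "2000-01-01", "2001-01-01")),
  end   = as.Date(c("2006-01-02", "2001-01-02", "2004-01-02"))
)
merged <- interval_union_cols(d2, start_col = "begin", stop_col = "end")
# merged has two rows: 2000-01-01 to 2004-01-02 and 2005-01-01 to 2006-01-02
```

Passing the column names as strings and indexing with [[ ]] avoids any dependence on non-standard evaluation, at the cost of being a little more verbose than the dplyr version.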

Answer 2

Well, in the example you provided, the union of int1 and int2 can be seen as a vector containing the two intervals:

int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
ints <- c(int1,int2)

%within% works on vectors, so you can do something like this:

my_int <- new_interval(ymd("2001-01-01"), ymd("2004-01-01"))
my_int %within% ints
# [1]  TRUE FALSE

You can therefore use any to check whether your interval falls within one of the intervals in the list:

any(my_int %within% ints)
# [1] TRUE

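Your comment is right: the result given by %within% seems inconsistent with the documentation, which says:

If a is an interval, both its start and end dates must fall within b to return TRUE.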

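If I look at the source code of %within% for the case where a and b are both intervals, it appears to be the following:

setMethod("%within%", signature(a = "Interval", b = "Interval"), function(a,b){
    as.numeric(a@start) - as.numeric(b@start) <= b@.Data & as.numeric(a@start) - as.numeric(b@start) >= 0
})

So it seems that only the start of a is tested against b, which is consistent with the results above. Perhaps this should be considered a bug and reported?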
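Until that is resolved, a containment test that checks both endpoints can be written by hand. The sketch below is an illustration in base R with plain Date values; interval_within is a hypothetical helper, not a lubridate function. With lubridate intervals, the endpoints could presumably be extracted with int_start() and int_end() first. Note that combining it with any is only valid against a set of intervals that has already been merged so that no two of them overlap or touch.

```r
# Hypothetical helper: TRUE where [a_start, a_end] lies entirely inside [b_start, b_end].
# Vectorised over b_start / b_end, so b can be a whole set of intervals.
interval_within <- function(a_start, a_end, b_start, b_end) {
  a_start >= b_start & a_end <= b_end
}

# An already merged (non-overlapping) set of intervals, as plain Dates.
union_start <- as.Date(c("2000-01-01", "2005-01-01"))
union_end   <- as.Date(c("2004-01-02", "2006-01-02"))

# Lies inside the first merged interval.
inside <- any(interval_within(as.Date("2001-01-01"), as.Date("2004-01-01"),
                              union_start, union_end))
# Crosses the gap between 2004-01-02 and 2005-01-01.
crossing <- any(interval_within(as.Date("2003-01-01"), as.Date("2006-01-01"),
                                union_start, union_end))
# inside is TRUE, crossing is FALSE
```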

Union of time intervals need not be contiguous
