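Filter for the closest date interval

Time: 2019-09-06 16:45:00

Tags: r dplyr

I would like to know whether there is a simple way to filter a dataset so that only the record with the closest interval is kept.

My data looks like this: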

library(tidyverse)
library(lubridate)    
df <- data.frame(country = rep(c("Spain","Portugal"), each = 4), 
                     type = rep(c("1","2"), each = 4), 
                     name = rep(c("A","B"), each = 4), 
                     event_start = as.Date(c("2012-07-13", "2014-09-05", "2016-12-23", "2017-01-01", "2015-11-27", "2014-06-27", "2013-04-11", "2012-11-27")), 
                     event_end = as.Date(c("2014-09-04", "2016-12-22", "2016-12-31", "2017-01-09", "2016-02-10", "2014-11-26", "2014-06-26", "2013-04-10")), 
                     start = rep(as.Date(c("2008-10-01", "2017-01-01")), each = 4),
                     end = rep(as.Date(c("2008-12-31", "2017-12-31")), each = 4),
                     stringsAsFactors = FALSE) %>%
      mutate(event_interval = interval(event_start, event_end),
             int = interval(start, end))

Expected result:

country  type name event_start event_end  start      end        event_interval                 int
Spain    1    A    2012-07-13  2014-09-04 2008-10-01 2008-12-31 2012-07-13 UTC--2014-09-04 UTC 2008-10-01 UTC--2008-12-31 UTC
Portugal 2    B    2015-11-27  2016-02-10 2017-01-01 2017-12-31 2015-11-27 UTC--2016-02-10 UTC 2017-01-01 UTC--2017-12-31 UTC

Essentially, for each combination of country / type / name, I want to keep the row whose event_interval is closest to int.

I have tried this (with some success) using a not very pretty for loop, but I was wondering whether you know a simpler way with dplyr?

Cheers

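Edit: To clarify, in the example above event_interval and int never intersect, but that is not always true in my full dataset. In fact, for many combinations of country / type / name there may be several event_interval values that overlap int, so I really need to find which event_interval is most similar to int, i.e. the one that overlaps int the most or, when none overlaps, the one closest to it.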
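
One way to make "overlaps the most, or is closest" concrete is a single signed day count per pair of date ranges. The following is only a minimal sketch in base R; signed_overlap is a hypothetical helper name chosen for illustration, not a function from lubridate or dplyr, and it assumes both ranges are given as inclusive Date endpoints.

```r
# Hypothetical helper: signed number of days shared by [s1, e1] and [s2, e2].
# Positive      = number of overlapping days (endpoints inclusive).
# Zero/negative = no overlap; the more negative, the wider the gap.
signed_overlap <- function(s1, e1, s2, e2) {
  as.numeric(pmin(e1, e2) - pmax(s1, s2)) + 1
}

int_start <- as.Date("2017-01-01")
int_end   <- as.Date("2017-12-31")

# An event lying inside the reference interval: 9 shared days
inside <- signed_overlap(as.Date("2017-01-01"), as.Date("2017-01-09"),
                         int_start, int_end)

# An event that ends before the reference interval starts: negative score
before <- signed_overlap(as.Date("2015-11-27"), as.Date("2016-02-10"),
                         int_start, int_end)

print(c(inside = inside, before = before))
```

Ranking the rows of each group by this score in descending order puts the most overlapping event first and, when nothing overlaps, the nearest one.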

1 Answer:

Answer 0: (score: 0)

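Based on the comments above, I figured it out. Asking the question really helped me work out what I wanted to do, and I then found "R / lubridate: Calculate number of overlapping days between two periods".

The following code does what I need: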

df <- df %>%
  # Signed day count: positive = days of overlap with [start, end];
  # zero or negative = no overlap, and more negative means further away
  mutate(ndays = as.numeric(pmin(end, event_end) - pmax(start, event_start)) + 1) %>%
  group_by(country, type, name) %>%
  arrange(country, type, name, desc(ndays)) %>%
  filter(row_number() == 1) %>% # Keeps nearest or most overlapping record
  ungroup()
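
For anyone who wants to check this against the expected result in the question, here is a self-contained variant of the same idea. It is a sketch under a few assumptions: dplyr 1.0.0 or later (for slice_max()), the lubridate interval columns left out because the score only needs the four Date columns, and events / closest as names picked here for illustration.

```r
library(dplyr)

# Same toy data as in the question, without the interval columns
events <- data.frame(
  country = rep(c("Spain", "Portugal"), each = 4),
  type = rep(c("1", "2"), each = 4),
  name = rep(c("A", "B"), each = 4),
  event_start = as.Date(c("2012-07-13", "2014-09-05", "2016-12-23", "2017-01-01",
                          "2015-11-27", "2014-06-27", "2013-04-11", "2012-11-27")),
  event_end = as.Date(c("2014-09-04", "2016-12-22", "2016-12-31", "2017-01-09",
                        "2016-02-10", "2014-11-26", "2014-06-26", "2013-04-10")),
  start = rep(as.Date(c("2008-10-01", "2017-01-01")), each = 4),
  end = rep(as.Date(c("2008-12-31", "2017-12-31")), each = 4),
  stringsAsFactors = FALSE
)

closest <- events %>%
  # Signed overlap in days between the event and the reference interval
  mutate(ndays = as.numeric(pmin(end, event_end) - pmax(start, event_start)) + 1) %>%
  group_by(country, type, name) %>%
  # Keep the single highest-scoring row per group; ties broken by row order
  slice_max(ndays, n = 1, with_ties = FALSE) %>%
  ungroup()

print(closest %>% select(country, type, name, event_start, event_end, ndays))
```

On this data the pipeline keeps the 2012-07-13 event for Spain and the 2015-11-27 event for Portugal, matching the expected result. Setting with_ties = FALSE guarantees exactly one row per group even when two events score the same.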