Question

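I have a vector of month values, months = 5:10 (May through October), and a data.table with two date columns. I want to remove every row whose date range (including the start and end dates themselves) contains none of the months in the vector. In other words, I want to keep a row if any of those months falls between its two dates. Any help would be much appreciated!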

df
start       end
2018-06-01  2019-05-31
2018-06-04  2019-05-31
2018-06-05  2019-05-31
2018-07-20  2019-05-31
2018-11-01  2019-04-30
2019-01-01  2019-05-31
2019-04-01  2019-05-31
2019-05-01  2019-05-31
2019-06-01  2019-10-31
2019-06-01  2020-05-31
2019-11-01  2020-04-30
2020-05-01  2020-05-31

So in this example, these two rows are the ones that should be removed from the table:

df
start       end
2018-11-01  2019-04-30
2019-11-01  2020-04-30

Answer 1

Here is one solution. First, the required packages:

library(dplyr)
library(purrr)
library(lubridate)

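Write a function that builds a sequence of dates at one-month intervals between the start and end dates, converts those dates to numeric months, intersects them with the vector of target months, and returns the number of matches: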

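find_overlap <- function(start_date, end_date, months) {
  # Step from the first day of the start month so that no calendar month
  # between the two dates is skipped (e.g. when start falls late in a month)
  seq.Date(floor_date(start_date, "month"), end_date, "1 month") %>% 
    month() %>%              # numeric month of each date
    intersect(months) %>%    # months shared with the target vector
    length()                 # number of overlapping months
}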
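To make the month-stepping idea concrete, here is a small sketch that traces the pipeline by hand for one row of the question's data, and then looks at an edge case. The second pair of dates is hypothetical (it does not appear in the question); it is only there to show why stepping from the first day of the start month, via lubridate's floor_date, can matter when the start date falls late in a month.

```r
library(lubridate)

months <- 5:10

# Trace one row from the question: 2018-07-20 to 2019-05-31
dates     <- seq(as.Date("2018-07-20"), as.Date("2019-05-31"), by = "1 month")
covered   <- month(dates)                        # 7 8 9 10 11 12 1 2 3 4 5
n_overlap <- length(intersect(covered, months))  # 5 (months 7, 8, 9, 10 and 5)

# Edge case (hypothetical dates): a short range that crosses into May
s <- as.Date("2019-04-20")
e <- as.Date("2019-05-10")

# Stepping from the raw start date: the next step (2019-05-20) is past the end
raw_steps <- month(seq(s, e, by = "1 month"))                           # 4

# Stepping from the first day of the start month reaches May as well
floored_steps <- month(seq(floor_date(s, "month"), e, by = "1 month"))  # 4 5
```

In the question's data every start date is early enough in its month that both variants give the same counts, so the difference only shows up on ranges like the hypothetical one above.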

Use purrr::map2_int to apply the function to each row of your data table:

v <- 5:10 # months of interest (May through October)

df %>% 
  mutate(i = map2_int(start, end, ~find_overlap(.x, .y, v)))

Where there is no overlap, i = 0:

        start        end i
1  2018-06-01 2019-05-31 6
2  2018-06-04 2019-05-31 6
3  2018-06-05 2019-05-31 6
4  2018-07-20 2019-05-31 5
5  2018-11-01 2019-04-30 0
6  2019-01-01 2019-05-31 1
7  2019-04-01 2019-05-31 1
8  2019-05-01 2019-05-31 1
9  2019-06-01 2019-10-31 5
10 2019-06-01 2020-05-31 6
11 2019-11-01 2020-04-30 0
12 2020-05-01 2020-05-31 1

Then filter for rows where i is greater than 0 and drop the i column:

df %>% 
  mutate(i = map2_int(start, end, ~find_overlap(.x, .y, v))) %>%
  filter(i > 0) %>%
  select(-i)
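For readers who would rather avoid the extra packages, the same filter can be sketched in base R. This is my own rough equivalent, not part of the answer above: it assumes df is a plain data.frame with Date columns named start and end, and the helper name keep is made up for illustration.

```r
# Example data from the question as a base R data.frame (assumed layout)
df <- data.frame(
  start = as.Date(c("2018-06-01", "2018-06-04", "2018-06-05", "2018-07-20",
                    "2018-11-01", "2019-01-01", "2019-04-01", "2019-05-01",
                    "2019-06-01", "2019-06-01", "2019-11-01", "2020-05-01")),
  end   = as.Date(c("2019-05-31", "2019-05-31", "2019-05-31", "2019-05-31",
                    "2019-04-30", "2019-05-31", "2019-05-31", "2019-05-31",
                    "2019-10-31", "2020-05-31", "2020-04-30", "2020-05-31"))
)
months <- 5:10

# For each row, list the calendar months touched by [start, end] and
# check whether any of them is one of the target months
keep <- vapply(seq_len(nrow(df)), function(k) {
  first_of_month <- as.Date(format(df$start[k], "%Y-%m-01"))
  dates <- seq(first_of_month, df$end[k], by = "1 month")
  any(as.integer(format(dates, "%m")) %in% months)
}, logical(1))

result <- df[keep, ]
```

Indexing rows by position keeps the Date class intact inside the anonymous function, which is why the sketch loops over seq_len(nrow(df)) rather than over the date columns directly.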

Answer 2

One possible data.table approach:

df[
    df[, {
            #get all months between dates
            m <- seq((year(start)-1L)*12L + month(start), 
                (year(end)-1L)*12L + month(end)) %% 12L
            replace(m, m==0L, 12L)
        }, 
        by=.(rn=df[, seq_len(.N)])][
            #filter for rows with required months by using a join
            .(V1=months), on=.(V1), sort(unique(rn))]
]
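The month arithmetic inside the inner expression may be easier to follow when worked through for a single row. The sketch below uses the first row that should be dropped (2018-11-01 to 2019-04-30); the variable names start_idx and end_idx are mine, added only for illustration.

```r
library(data.table)  # provides year() and month() for Date objects

start <- as.Date("2018-11-01")
end   <- as.Date("2019-04-30")

# Turn each date into a running month count so that a plain integer
# sequence can cross the year boundary
start_idx <- (year(start) - 1L) * 12L + month(start)  # 24215
end_idx   <- (year(end)   - 1L) * 12L + month(end)    # 24220

# Back to calendar months: the remainder is 0 for December, so map 0 to 12
m <- seq(start_idx, end_idx) %% 12L  # 11 0 1 2 3 4
m <- replace(m, m == 0L, 12L)        # 11 12 1 2 3 4

has_target_month <- any(m %in% 5:10) # FALSE, so this row is dropped
```

The join on V1 in the answer then plays the role of the %in% check here, collecting the row numbers rn that have at least one matching month.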

Data:

library(data.table)
months <- 5:10
df <- fread("start       end
2018-06-01  2019-05-31
2018-06-04  2019-05-31
2018-06-05  2019-05-31
2018-07-20  2019-05-31
2018-11-01  2019-04-30
2019-01-01  2019-05-31
2019-04-01  2019-05-31
2019-05-01  2019-05-31
2019-06-01  2019-10-31
2019-06-01  2020-05-31
2019-11-01  2020-04-30
2020-05-01  2020-05-31")
df[, c("start","end") := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=c("start","end")]

Check whether certain months fall between two date columns of a data.table
