遍历有条件的日期

时间:2019-09-09 09:24:22

标签: r date dplyr lag

我有一个数据集,其中包含与个人可能相关的时间情节,这些情节可以重叠(即,情节可以晚于开始,但比之前更早结束)。由于存在重叠问题,一旦按start_date排序,我就很难按顺序获取最新的end_date。

我一直在使用的代码可以工作到一定程度,但是我必须重复下面的代码所示。出于这个原因,我想我需要一些循环函数来完成一个过程,直到满足条件为止(end_date晚于上一行的end_date,或者id表示一个新的个体)。

library(dplyr)

## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
        "A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
                        "2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
                        "2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
                        "2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
                      "2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                      "2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
                      "2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
                         "2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
                         "2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
                         "2012-09-28","2013-10-11"))

df <- data.frame(id, start_date, end_date, target_date)

使用the method to flatten overlapping time periods使我接近,但我认为添加某个地方来复制target_date会需要一定的滞后时间...

df <- df %>%
    arrange(id, start_date) %>%
    group_by(id) %>%
    mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
                                    cummax(as.numeric(end_date)))[-n()])) %>%
    group_by(id, indx) %>%       
    mutate(latest_date = max(end_date)) %>%
    ungroup()

2 个答案:

答案 0 :(得分:1)

与使用lag相比,我将为该问题提供不同的方法。问题在于您的数据中存在一个可以具有多个级别的层次结构。

在下面的代码中,我尝试查找当前行所属的其他情节(即完全位于另一情节中)。 然后,我用min(start_date)max(end_date)来定义最外层情节。


library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(id, start_date, end_date, target_date) %>%
  mutate(episode = row_number())

df %>%
  select(id, episode,start_date, end_date) %>%
  inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
  group_by(id,episode,start_date, end_date) %>%
  nest() %>%
  mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
                                                                end_date_outer >= ..3))) %>%
  mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
         end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))


这导致


# A tibble: 19 x 8
   id    episode start_date end_date   data              match            start_date_parent end_date_parent
   <fct>   <int> <date>     <date>     <list>            <list>           <date>            <date>         
 1 A           1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23        2009-06-30     
 2 A           2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 3 A           3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 4 A           4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 5 A           5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23        2009-06-30     
 6 A           6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 7 A           7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23        2009-06-30     
 8 A           8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12        2010-11-05     
 9 A           9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08        2011-11-18     
10 A          10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
11 A          11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26        2012-06-26     
12 A          12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11        2012-09-11     
13 A          13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03        2014-04-01     
14 B          14 2003-12-01 2012-08-29 <tibble [6 x 3]>  <tibble [1 x 3]> 2003-12-01        2012-08-29     
15 B          15 2006-02-28 2006-02-28 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
16 B          16 2012-04-16 2012-04-16 <tibble [6 x 3]>  <tibble [2 x 3]> 2003-12-01        2012-08-29     
17 B          17 2012-08-30 2012-09-28 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-08-30        2012-09-28     
18 B          18 2012-09-19 2013-10-11 <tibble [6 x 3]>  <tibble [1 x 3]> 2012-09-19        2013-10-11     
19 B          19 2012-09-28 2013-07-19 <tibble [6 x 3]>  <tibble [2 x 3]> 2012-09-19        2013-10-11  

我们在这里可以看到ID A的前7集是第1集的一部分 其余的都靠自己站着。


例如,如果数据集变大,则另一种选择是使用sqldf


require(sqldf)

result <- sqldf("select
      df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
      from df AS df1

      inner join df AS df2 
      on df1.id = df2.id
      and df1.start_date >= df2.start_date
      and df1.end_date <= df2.end_date

      group by df1.id, df1.episode
      ")

result %>%
  select(id, start_date, end_date) %>%
  distinct()

导致:


  id start_date   end_date
1  A 2004-01-23 2009-06-30
2  A 2010-10-12 2010-11-05
3  A 2011-08-08 2011-11-18
4  A 2012-06-26 2012-06-26
5  A 2012-09-11 2012-09-11
6  A 2012-10-03 2014-04-01
7  B 2003-12-01 2012-08-29
8  B 2012-08-30 2012-09-28
9  B 2012-09-19 2013-10-11

答案 1 :(得分:1)

如果我理解正确,则OP希望确定较长情节完全包含的重叠情节。此外,拥抱期的结束日期应显示在下一行(在id内)

这可以通过更改David Arenburg's approach来实现:

df %>% 
  arrange(id, start_date) %>% # df must be ordered appropriately
  group_by(id) %>% # create new grouping variable
  mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>% 
  group_by(id, grp) %>% 
  mutate(target_date_new = max(end_date)) %>% 
  group_by(id) %>% # re-group ...
  mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6
# Groups:   id [2]
   id    start_date end_date   target_date   grp target_date_new
   <fct> <date>     <date>     <date>      <int> <date>         
 1 A     2004-01-23 2009-06-30 NA              1 NA             
 2 A     2005-03-31 2005-09-17 2009-06-30      1 2009-06-30     
 3 A     2005-03-31 2005-09-19 2009-06-30      1 2009-06-30     
 4 A     2005-12-20 2005-12-30 2009-06-30      1 2009-06-30     
 5 A     2005-12-20 2005-12-30 2009-06-30      1 2009-06-30     
 6 A     2006-04-03 2006-06-19 2009-06-30      1 2009-06-30     
 7 A     2007-11-26 2009-06-30 2009-06-30      1 2009-06-30     
 8 A     2010-10-12 2010-11-05 2009-06-30      2 2009-06-30     
 9 A     2011-08-08 2011-11-18 2010-11-05      3 2010-11-05     
10 A     2012-06-26 2012-06-26 2011-11-18      4 2011-11-18     
11 A     2012-06-26 2012-06-26 2012-06-26      4 2012-06-26     
12 A     2012-09-11 2012-09-11 2012-06-26      5 2012-06-26     
13 A     2012-10-03 2014-04-01 2012-09-11      6 2012-09-11     
14 B     2003-12-01 2012-08-29 NA              1 NA             
15 B     2006-02-28 2006-02-28 2012-08-29      1 2012-08-29     
16 B     2012-04-16 2012-04-16 2012-08-29      1 2012-08-29     
17 B     2012-08-30 2012-09-28 2012-08-29      2 2012-08-29     
18 B     2012-09-19 2013-10-11 2012-09-28      3 2012-09-28     
19 B     2012-09-28 2013-07-19 2013-10-11      3 2013-10-11

在这里,比较end_date是因为OP希望检测完全包含的时间段。因此,每当出现end_date大于以前的end_date中的任何一个时,情节计数器grp都会前进,因为当前情节并未完全包含在前一期间中。

由于cummax()Date类型的对象没有方法,因此日期被强制为整数值。