我有一个数据集,其中包含与个人可能相关的时间情节,这些情节可以重叠(即,情节可以晚于开始,但比之前更早结束)。由于存在重叠问题,一旦按start_date排序,我就很难按顺序获取最新的end_date。
我一直在使用的代码可以工作到一定程度,但是我必须重复下面的代码所示。出于这个原因,我想我需要一些循环函数来完成一个过程,直到满足条件为止(end_date晚于上一行的end_date,或者id表示一个新的个体)。
library(dplyr)
## creates example dataframe
id <- c("A","A","A","A","A","A","A","A","A","A",
"A","A","A","B","B","B","B","B","B")
start_date <- as.Date(c("2004-01-23","2005-03-31","2005-03-31","2005-12-20","2005-12-20",
"2006-04-03","2007-11-26","2010-10-12","2011-08-08","2012-06-26",
"2012-06-26","2012-09-11","2012-10-03","2003-12-01","2006-02-28",
"2012-04-16","2012-08-30","2012-09-19","2012-09-28"))
end_date <- as.Date(c("2009-06-30","2005-09-17","2005-09-19","2005-12-30","2005-12-30",
"2006-06-19","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11","2014-04-01","2012-08-29","2006-02-28",
"2012-04-16","2012-09-28","2013-10-11","2013-07-19"))
target_date <- as.Date(c(NA,"2009-06-30","2009-06-30","2009-06-30","2009-06-30","2009-06-30",
"2009-06-30","2009-06-30","2010-11-05","2011-11-18","2012-06-26",
"2012-06-26","2012-09-11",NA,"2012-08-29","2012-08-29","2012-08-29",
"2012-09-28","2013-10-11"))
df <- data.frame(id, start_date, end_date, target_date)
使用the method to flatten overlapping time periods使我接近,但我认为添加某个地方来复制target_date会需要一定的滞后时间...
df <- df %>%
arrange(id, start_date) %>%
group_by(id) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_date)) >
cummax(as.numeric(end_date)))[-n()])) %>%
group_by(id, indx) %>%
mutate(latest_date = max(end_date)) %>%
ungroup()
答案 0 :(得分:1)
与使用lag
相比,我将为该问题提供不同的方法。问题在于您的数据中存在一个可以具有多个级别的层次结构。
在下面的代码中,我尝试查找当前行所属的其他情节(即完全位于另一情节中)。
然后,我用min(start_date)
和max(end_date)
来定义最外层情节。
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(id, start_date, end_date, target_date) %>%
mutate(episode = row_number())
df %>%
select(id, episode,start_date, end_date) %>%
inner_join(df %>% select(id, start_date_outer = start_date, end_date_outer = end_date,outer_episode = episode), by = 'id') %>%
group_by(id,episode,start_date, end_date) %>%
nest() %>%
mutate(match = pmap(list(data,start_date,end_date), ~ ..1 %>% filter(start_date_outer <= ..2,
end_date_outer >= ..3))) %>%
mutate(start_date_parent = as.Date(map_dbl(match, ~ min(.x$start_date_outer)),origin = '1970-01-01'),
end_date_parent = as.Date(map_dbl(match, ~max(.x$end_date_outer)),origin = '1970-01-01'))
这导致
# A tibble: 19 x 8
id episode start_date end_date data match start_date_parent end_date_parent
<fct> <int> <date> <date> <list> <list> <date> <date>
1 A 1 2004-01-23 2009-06-30 <tibble [13 x 3]> <tibble [1 x 3]> 2004-01-23 2009-06-30
2 A 2 2005-03-31 2005-09-17 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
3 A 3 2005-03-31 2005-09-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
4 A 4 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
5 A 5 2005-12-20 2005-12-30 <tibble [13 x 3]> <tibble [3 x 3]> 2004-01-23 2009-06-30
6 A 6 2006-04-03 2006-06-19 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
7 A 7 2007-11-26 2009-06-30 <tibble [13 x 3]> <tibble [2 x 3]> 2004-01-23 2009-06-30
8 A 8 2010-10-12 2010-11-05 <tibble [13 x 3]> <tibble [1 x 3]> 2010-10-12 2010-11-05
9 A 9 2011-08-08 2011-11-18 <tibble [13 x 3]> <tibble [1 x 3]> 2011-08-08 2011-11-18
10 A 10 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
11 A 11 2012-06-26 2012-06-26 <tibble [13 x 3]> <tibble [2 x 3]> 2012-06-26 2012-06-26
12 A 12 2012-09-11 2012-09-11 <tibble [13 x 3]> <tibble [1 x 3]> 2012-09-11 2012-09-11
13 A 13 2012-10-03 2014-04-01 <tibble [13 x 3]> <tibble [1 x 3]> 2012-10-03 2014-04-01
14 B 14 2003-12-01 2012-08-29 <tibble [6 x 3]> <tibble [1 x 3]> 2003-12-01 2012-08-29
15 B 15 2006-02-28 2006-02-28 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
16 B 16 2012-04-16 2012-04-16 <tibble [6 x 3]> <tibble [2 x 3]> 2003-12-01 2012-08-29
17 B 17 2012-08-30 2012-09-28 <tibble [6 x 3]> <tibble [1 x 3]> 2012-08-30 2012-09-28
18 B 18 2012-09-19 2013-10-11 <tibble [6 x 3]> <tibble [1 x 3]> 2012-09-19 2013-10-11
19 B 19 2012-09-28 2013-07-19 <tibble [6 x 3]> <tibble [2 x 3]> 2012-09-19 2013-10-11
我们在这里可以看到ID A的前7集是第1集的一部分 其余的都靠自己站着。
例如,如果数据集变大,则另一种选择是使用sqldf
。
require(sqldf)
result <- sqldf("select
df1.id, df1.episode, min(df2.start_date) AS start_date, max(df2.end_date) AS end_date
from df AS df1
inner join df AS df2
on df1.id = df2.id
and df1.start_date >= df2.start_date
and df1.end_date <= df2.end_date
group by df1.id, df1.episode
")
result %>%
select(id, start_date, end_date) %>%
distinct()
导致:
id start_date end_date
1 A 2004-01-23 2009-06-30
2 A 2010-10-12 2010-11-05
3 A 2011-08-08 2011-11-18
4 A 2012-06-26 2012-06-26
5 A 2012-09-11 2012-09-11
6 A 2012-10-03 2014-04-01
7 B 2003-12-01 2012-08-29
8 B 2012-08-30 2012-09-28
9 B 2012-09-19 2013-10-11
答案 1 :(得分:1)
如果我理解正确,则OP希望确定较长情节完全包含的重叠情节。此外,拥抱期的结束日期应显示在下一行(在id
内)
这可以通过更改David Arenburg's approach来实现:
df %>%
arrange(id, start_date) %>% # df must be ordered appropriately
group_by(id) %>% # create new grouping variable
mutate(grp = cumsum(cummax(lag(as.integer(end_date), default = 0)) < as.integer(end_date))) %>%
group_by(id, grp) %>%
mutate(target_date_new = max(end_date)) %>%
group_by(id) %>% # re-group ...
mutate(target_date_new = lag(target_date_new)) # ... for lagging
# A tibble: 19 x 6 # Groups: id [2] id start_date end_date target_date grp target_date_new <fct> <date> <date> <date> <int> <date> 1 A 2004-01-23 2009-06-30 NA 1 NA 2 A 2005-03-31 2005-09-17 2009-06-30 1 2009-06-30 3 A 2005-03-31 2005-09-19 2009-06-30 1 2009-06-30 4 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 5 A 2005-12-20 2005-12-30 2009-06-30 1 2009-06-30 6 A 2006-04-03 2006-06-19 2009-06-30 1 2009-06-30 7 A 2007-11-26 2009-06-30 2009-06-30 1 2009-06-30 8 A 2010-10-12 2010-11-05 2009-06-30 2 2009-06-30 9 A 2011-08-08 2011-11-18 2010-11-05 3 2010-11-05 10 A 2012-06-26 2012-06-26 2011-11-18 4 2011-11-18 11 A 2012-06-26 2012-06-26 2012-06-26 4 2012-06-26 12 A 2012-09-11 2012-09-11 2012-06-26 5 2012-06-26 13 A 2012-10-03 2014-04-01 2012-09-11 6 2012-09-11 14 B 2003-12-01 2012-08-29 NA 1 NA 15 B 2006-02-28 2006-02-28 2012-08-29 1 2012-08-29 16 B 2012-04-16 2012-04-16 2012-08-29 1 2012-08-29 17 B 2012-08-30 2012-09-28 2012-08-29 2 2012-08-29 18 B 2012-09-19 2013-10-11 2012-09-28 3 2012-09-28 19 B 2012-09-28 2013-07-19 2013-10-11 3 2013-10-11
在这里,比较end_date
是因为OP希望检测完全包含的时间段。因此,每当出现end_date
大于以前的end_date
中的任何一个时,情节计数器grp
都会前进,因为当前情节并未完全包含在前一期间中。
由于cummax()
对Date
类型的对象没有方法,因此日期被强制为整数值。