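Merge rows of data if they have consecutive time intervals

Time: 2018-07-05 23:28:14

Tags: r merge gaps-and-islands

I have a patient medication dataset with a Start.Date and a Stop.Date column. Each medication period is one row. I want to merge rows whose time intervals are consecutive, as in the following data: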

ID = c(2, 2, 2, 2, 3, 5) 
Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil") 
Start.Date = c("05/01/2017", "05/05/2017", "06/20/2017", "05/01/2017", "05/06/2017", "05/28/2017")
Stop.Date = c("05/04/2017", "05/10/2017", "06/27/2017", "05/15/2017", "05/12/2017", "06/13/2017")
df = data.frame(ID, Medication, Start.Date, Stop.Date) 


  ID Medication Start.Date  Stop.Date
   2    aspirin 05/01/2017 05/04/2017
   2    aspirin 05/05/2017 05/10/2017
   2    aspirin 06/20/2017 06/27/2017
   2    tylenol 05/01/2017 05/15/2017
   3    lipitor 05/06/2017 05/12/2017
   5      advil 05/28/2017 06/13/2017

If one Stop.Date is the day before the next Start.Date, I want to collapse those rows, per ID and Medication. The result should look like this:

  ID Medication Start.Date  Stop.Date
   2    aspirin 05/01/2017 05/10/2017
   2    aspirin 06/20/2017 06/27/2017
   2    tylenol 05/01/2017 05/15/2017
   3    lipitor 05/06/2017 05/12/2017
   5      advil 05/28/2017 06/13/2017

3 answers:

Answer 0 (score: 1)

How about this?

library(dplyr)
df %>%
    mutate_at(vars(ends_with("Date")), function(x) as.Date(x, format = "%m/%d/%Y")) %>%
    group_by(ID, Medication) %>%
    mutate(
        isConsecutive = lead(Start.Date) - Stop.Date == 1,
        isConsecutive = ifelse(
            is.na(isConsecutive) & lag(isConsecutive) == TRUE,
            FALSE,
            isConsecutive),
        grp = cumsum(isConsecutive)) %>%
    group_by(ID, Medication, grp) %>%
    mutate(Start.Date = min(Start.Date), Stop.Date = max(Stop.Date)) %>%
    slice(1) %>%
    ungroup() %>%
    select(-isConsecutive, -grp)
## A tibble: 5 x 4
#     ID Medication Start.Date Stop.Date
#  <dbl> <fct>      <date>     <date>
#1    2. aspirin    2017-05-01 2017-05-10
#2    2. aspirin    2017-06-20 2017-06-27
#3    2. tylenol    2017-05-01 2017-05-15
#4    3. lipitor    2017-05-06 2017-05-12
#5    5. advil      2017-05-28 2017-06-13

It is probably a good idea to test with a few more examples to ensure robustness. Let's try a more complex example:

df <- structure(list(ID = c(2, 2, 2, 2, 2, 3, 5, 5), Medication = structure(c(2L,
2L, 2L, 2L, 4L, 3L, 1L, 1L), .Label = c("advil", "aspirin", "lipitor",
"tylenol"), class = "factor"), Start.Date = structure(c(1L, 2L, 6L,
7L, 1L, 3L, 4L, 5L), .Label = c("05/01/2017", "05/05/2017", "05/06/2017",
"05/28/2017", "06/14/2017", "06/20/2017", "06/28/2017"), class = "factor"),
    Stop.Date = structure(c(2L, 3L, 8L, 1L, 5L, 4L, 6L, 7L), .Label = c("04/30/2017",
    "05/04/2017", "05/10/2017", "05/12/2017", "05/15/2017", "06/13/2017",
    "06/20/2017", "06/27/2017"), class = "factor")), .Names = c("ID",
"Medication", "Start.Date", "Stop.Date"), row.names = c(NA, -8L), class = "data.frame")
df;
#  ID Medication Start.Date  Stop.Date
#1  2    aspirin 05/01/2017 05/04/2017
#2  2    aspirin 05/05/2017 05/10/2017
#3  2    aspirin 06/20/2017 06/27/2017
#4  2    aspirin 06/28/2017 04/30/2017
#5  2    tylenol 05/01/2017 05/15/2017
#6  3    lipitor 05/06/2017 05/12/2017
#7  5      advil 05/28/2017 06/13/2017
#8  5      advil 06/14/2017 06/20/2017

Note that here ID=2 has two consecutive blocks (rows 1+2 and rows 3+4), and ID=5 has one consecutive block (rows 7+8).

The result seems robust.

Answer 1 (score: 1)

library(tidyverse)
library(lubridate)
df %>%
  # parse the date columns before grouping, then group by both keys
  mutate_at(vars(Start.Date, Stop.Date), mdy) %>%
  group_by(ID, Medication) %>%
  # if a row starts the day after the previous row stops, inherit the previous start
  mutate(Start.Date = coalesce(
                 if_else((Start.Date - lag(Stop.Date)) == 1, lag(Start.Date), Start.Date), Start.Date),
         s = lead(Start.Date) != Start.Date) %>%
  # keep only the last row of each run of equal start dates
  filter(s | is.na(s)) %>%
  select(-s)

# A tibble: 5 x 4
# Groups:   ID, Medication [4]
     ID Medication Start.Date Stop.Date 
  <dbl> <chr>      <date>     <date>    
1     2 aspirin    2017-05-01 2017-05-10
2     2 aspirin    2017-06-20 2017-06-27
3     2 tylenol    2017-05-01 2017-05-15
4     3 lipitor    2017-05-06 2017-05-12
5     5 advil      2017-05-28 2017-06-13

Answer 2 (score: 0)

Convert the 'Start' and 'Stop' date columns to Date class (with mdy from lubridate), group by 'ID' and 'Medication', and filter the rows where the abs difference between the lead of 'Start.Date' and 'Stop.Date' is not equal to 1.

library(dplyr)
library(lubridate)
df %>%
  mutate_at(3:4, mdy) %>% 
  group_by(ID, Medication) %>%
  filter(abs(lead(Start.Date, default = last(Start.Date)) - Stop.Date) != 1)
# A tibble: 5 x 4
# Groups:   ID, Medication [4]
#     ID Medication Start.Date Stop.Date 
#  <dbl> <fct>      <date>     <date>    
#1     2 aspirin    2017-05-05 2017-05-10
#2     2 aspirin    2017-06-20 2017-06-27
#3     2 tylenol    2017-05-01 2017-05-15
#4     3 lipitor    2017-05-06 2017-05-12
#5     5 advil      2017-05-28 2017-06-13

Or using a similar approach in data.table:

library(data.table)
setDT(df)[df[, (shift(mdy(Start.Date), type = 'lead', 
         fill = last(Start.Date)) - mdy(Stop.Date)) != 1 , ID]$V1]
#  ID Medication Start.Date  Stop.Date
#1:  2    aspirin 05/05/2017 05/10/2017
#2:  2    aspirin 06/20/2017 06/27/2017
#3:  2    tylenol 05/01/2017 05/15/2017
#4:  3    lipitor 05/06/2017 05/12/2017
#5:  5      advil 05/28/2017 06/13/2017

Note: we can convert the date columns to Date class first, as before.

Note 2: Both are simple approaches based on the example provided by the OP.
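The filter-based outputs above keep the row starting 05/05/2017, whereas the question asks for the merged interval starting 05/01/2017. As a rough base R sketch of the usual gaps-and-islands grouping (not taken from any of the answers), the rows can be sorted, flagged where they continue the previous row, and collapsed per run. The helper name merge_consecutive is hypothetical, and the sketch assumes the date columns are already of class Date and that Stop.Date is never earlier than Start.Date within a run.

```r
# Sample data from the question, with the dates parsed up front.
df <- data.frame(
  ID         = c(2, 2, 2, 2, 3, 5),
  Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil"),
  Start.Date = as.Date(c("05/01/2017", "05/05/2017", "06/20/2017",
                         "05/01/2017", "05/06/2017", "05/28/2017"),
                       format = "%m/%d/%Y"),
  Stop.Date  = as.Date(c("05/04/2017", "05/10/2017", "06/27/2017",
                         "05/15/2017", "05/12/2017", "06/13/2017"),
                       format = "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# merge_consecutive() is a hypothetical helper, not part of any package.
merge_consecutive <- function(d) {
  if (nrow(d) < 2) return(d)
  d <- d[order(d$ID, d$Medication, d$Start.Date), ]
  n <- nrow(d)
  # A row continues the previous one when ID and Medication match and
  # its Start.Date is exactly one day after the previous Stop.Date.
  continues <- c(FALSE,
                 d$ID[-1] == d$ID[-n] &
                   d$Medication[-1] == d$Medication[-n] &
                   as.numeric(d$Start.Date[-1] - d$Stop.Date[-n]) == 1)
  # Every row that does not continue its predecessor opens a new island.
  island <- cumsum(!continues)
  first <- !duplicated(island)                   # first row of each island
  last  <- !duplicated(island, fromLast = TRUE)  # last row of each island
  out <- d[first, c("ID", "Medication", "Start.Date")]
  out$Stop.Date <- d$Stop.Date[last]
  rownames(out) <- NULL
  out
}

merged <- merge_consecutive(df)
merged
```

On the question's sample data this gives five rows, with the first aspirin interval running from 2017-05-01 to 2017-05-10. Taking Stop.Date from the last row of each run is a design shortcut that relies on the rows being sorted and non-overlapping; with overlapping intervals a per-island max would be the safer choice.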