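Consolidating data rows if they have sequential time intervals within a specified gap

Asked: 2018-07-20 22:42:15

Tags: r merge intervals gaps-and-islands

I have a dataset of patient medications with a Start.Date and a Stop.Date, where each row represents one medication period. I want to merge rows that belong to the same medication, but only if the Start.Date of the later interval falls within 30 days (or any number of days I choose to specify) of the Stop.Date of the earlier interval. Suppose the data frame looks like this: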

ID = c(2, 2, 2, 2, 3, 5)
Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil")
Start.Date = c("05/01/2017", "05/30/2017", "07/15/2017", "05/01/2017", "05/06/2017", "05/28/2017")
Stop.Date = c("05/04/2017", "06/10/2017", "07/27/2017", "05/15/2017", "05/12/2017", "06/13/2017")
df = data.frame(ID, Medication, Start.Date, Stop.Date)


  ID Medication Start.Date  Stop.Date
   2    aspirin 05/01/2017 05/04/2017
   2    aspirin 05/30/2017 06/10/2017
   2    aspirin 07/15/2017 07/27/2017
   2    tylenol 05/01/2017 05/15/2017
   3    lipitor 05/06/2017 05/12/2017
   5      advil 05/28/2017 06/13/2017

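I would like to reduce the number of rows per ID and Medication whenever a Start.Date falls within the specified 30 days after the previous Stop.Date. The new Start.Date and Stop.Date should span both medication intervals plus the gap of 30 days or less between them. The result should look like this:

  ID Medication Start.Date  Stop.Date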
   2    aspirin 05/01/2017 06/10/2017
   2    aspirin 07/15/2017 07/27/2017
   2    tylenol 05/01/2017 05/15/2017
   3    lipitor 05/06/2017 05/12/2017
   5      advil 05/28/2017 06/13/2017

3 Answers:

Answer 0 (score: 1)

First convert the date strings to class Date so that you can compute time intervals:

df$Start.Date <- as.Date(df$Start.Date, '%m/%d/%Y')
df$Stop.Date <- as.Date(df$Stop.Date, '%m/%d/%Y')

> df$Stop.Date - df$Start.Date
Time differences in days
[1]  3 11 12 14  6 16

To compute the difference between each stop date and the next start date:

c(Inf, df[-1,'Start.Date'] - df[-nrow(df),'Stop.Date'])

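Wrapping the result in c(SOMETHING, ...) keeps the vector the same length as the data, because the first row has no preceding interval. Sort the data first to make sure the dates are in the right order: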

df <- df[order(df$ID, df$Medication, df$Start.Date), ]
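As a small illustration of the gap calculation described above, here is a sketch restricted to the three aspirin rows of patient 2 from the question. The vector names start_date, stop_date and gap are chosen for this example only and do not appear in the answer.

```r
# Three aspirin periods for patient 2, copied from the question's data.
start_date <- as.Date(c("05/01/2017", "05/30/2017", "07/15/2017"), format = "%m/%d/%Y")
stop_date  <- as.Date(c("05/04/2017", "06/10/2017", "07/27/2017"), format = "%m/%d/%Y")

# Days between each start and the previous stop. The first row has no
# predecessor, so Inf acts as a placeholder that never counts as "close".
gap <- c(Inf, as.numeric(start_date[-1] - stop_date[-length(stop_date)]))
print(gap)  # Inf, 26, 35
```

With a 30-day threshold, only the second row (gap of 26 days) would be merged into its predecessor; the third row (gap of 35 days) starts a new period.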

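Now compute the gap for each patient and medication. This is convenient with either dplyr or data.table:

# Using dplyr:
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe, which dplyr does not export
df %<>% group_by(ID, Medication) %>% mutate(interval = c(Inf, Start.Date[-1] - Stop.Date[-n()])) %>% ungroup()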

# Using data.table:
library(data.table)
df <- as.data.table(df)
df[, interval := c(Inf, Start.Date[-1] - Stop.Date[-.N]), by = .(ID, Medication)]

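Update the Stop.Date of the preceding row wherever the gap is 30 days or less. Iterating from the last row to the first lets a chain of three or more consecutive rows carry its final Stop.Date back to the first row of the chain:

for (i in rev(seq_len(nrow(df)))) if (df$interval[i] <= 30) df$Stop.Date[i-1] <- df$Stop.Date[i]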

Finally, drop the rows whose interval is <= 30, then remove the interval column:

# If you're using dplyr:
df <- df %>% filter(interval > 30) %>% select(-interval)

# If you're using data.table:
df <- df[!interval<=30, ]; df[, interval := NULL]

> df
   ID Medication Start.Date  Stop.Date
1:  2    aspirin 2017-05-01 2017-06-10
2:  2    aspirin 2017-07-15 2017-07-27
3:  2    tylenol 2017-05-01 2017-05-15
4:  3    lipitor 2017-05-06 2017-05-12
5:  5      advil 2017-05-28 2017-06-13
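For readers who prefer to avoid extra packages, the steps above can be sketched in base R as a single function. This is only a sketch under stated assumptions: collapse_periods() and its max_gap argument are hypothetical names invented for this example, the dates are assumed to be in month/day/year format as in the question, and the input is assumed to have at least one row. It numbers the periods with cumsum() instead of a loop, so chains of several close rows collapse into one.

```r
# Hypothetical helper (not from any package): merge rows of the same ID and
# Medication whose gap to the previous row is at most max_gap days.
collapse_periods <- function(df, max_gap = 30) {
  df$Start.Date <- as.Date(as.character(df$Start.Date), format = "%m/%d/%Y")
  df$Stop.Date  <- as.Date(as.character(df$Stop.Date),  format = "%m/%d/%Y")
  df <- df[order(df$ID, df$Medication, df$Start.Date), ]
  n <- nrow(df)

  # Does each row (from the second on) share ID and Medication with the row before it?
  same_group <- df$ID[-1] == df$ID[-n] &
    as.character(df$Medication[-1]) == as.character(df$Medication[-n])
  # Days between each start and the previous row's stop.
  gap <- as.numeric(df$Start.Date[-1] - df$Stop.Date[-n])

  # A row opens a new period if it is the first row, belongs to a different
  # patient or medication, or follows a gap longer than max_gap.
  new_period <- c(TRUE, !same_group | gap > max_gap)
  period <- cumsum(new_period)

  # Latest stop date within each period, computed on the numeric day counts.
  last_stop <- vapply(split(as.numeric(df$Stop.Date), period), max, numeric(1))

  data.frame(
    ID = df$ID[new_period],
    Medication = df$Medication[new_period],
    Start.Date = df$Start.Date[new_period],
    Stop.Date = as.Date(unname(last_stop), origin = "1970-01-01")
  )
}

# Usage with the data from the question.
df <- data.frame(
  ID = c(2, 2, 2, 2, 3, 5),
  Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil"),
  Start.Date = c("05/01/2017", "05/30/2017", "07/15/2017", "05/01/2017", "05/06/2017", "05/28/2017"),
  Stop.Date = c("05/04/2017", "06/10/2017", "07/27/2017", "05/15/2017", "05/12/2017", "06/13/2017")
)
result <- collapse_periods(df)
print(result)
```

Passing a different max_gap covers the "any number of days I choose" part of the question without touching the rest of the code.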

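Answer 1 (score: 1)

The OP has asked to collapse medication periods when the gap between the end of one period and the start of the next is 30 days or less.

The solutions below require that medication periods for the same person and the same medication never overlap, which is a reasonable assumption (and is checked).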
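To make the non-overlap assumption concrete, here is a minimal base R sketch of such a check for the periods of one patient and one medication. The helper has_overlap() is a hypothetical name introduced for this example. It treats a period that starts on the day the previous one stops as non-overlapping, which is an assumption of this sketch.

```r
# Hypothetical helper (not from any package). Expects the periods of ONE
# patient and ONE medication as two Date vectors of equal length.
has_overlap <- function(start_date, stop_date) {
  ord <- order(start_date)
  start_date <- start_date[ord]
  stop_date  <- stop_date[ord]
  n <- length(start_date)
  if (n < 2) return(FALSE)
  # After sorting by start, comparing each start with the previous stop is
  # enough to detect any overlap.
  any(start_date[-1] < stop_date[-n])
}

# Two aspirin periods from the question: no overlap.
clean_case <- has_overlap(as.Date(c("2017-05-01", "2017-05-30")),
                          as.Date(c("2017-05-04", "2017-06-10")))
# Made-up example where the second period starts before the first one stops.
overlap_case <- has_overlap(as.Date(c("2017-05-01", "2017-05-03")),
                            as.Date(c("2017-05-04", "2017-06-10")))
print(c(clean_case, overlap_case))  # FALSE TRUE
```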

1. dplyr

library(dplyr)
library(magrittr)
min_gap <- 30
df %>%
  # convert date strings to class Date
  mutate_at(c("Start.Date", "Stop.Date"), lubridate::mdy) %>%
  arrange(ID, Medication, Start.Date) %>% 
  group_by(ID, Medication) %T>%
  # medication periods must not overlap for ID and Medication
  {summarize(., tmp = all(Start.Date >= lag(Stop.Date, default = Start.Date[1] - 1))) %$% 
      stopifnot(all(tmp))} %>% 
  # count non-subsequent medication periods, i.e., those preceded by a gap of more than min_gap days
  mutate(Medic.Period = cumsum(Start.Date > lag(Stop.Date, default = Start.Date[1]) + min_gap)) %>%
  # determine start and stop dates for each collapsed period
  group_by(ID, Medication, Medic.Period) %>%
  summarise(Start.Date = first(Start.Date), Stop.Date = last(Stop.Date))
# A tibble: 5 x 5
# Groups:   ID, Medication [?]
     ID Medication Medic.Period Start.Date Stop.Date 
  <dbl> <fct>             <int> <date>     <date>    
1     2 aspirin               0 2017-05-01 2017-06-10
2     2 aspirin               1 2017-07-15 2017-07-27
3     2 tylenol               0 2017-05-01 2017-05-15
4     3 lipitor               0 2017-05-06 2017-05-12
5     5 advil                 0 2017-05-28 2017-06-13

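The cumsum() function increments the medication period counter whenever a new period begins, i.e., when the gap between the stop of the previous period and the start of the current period exceeds 30 days.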
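The counter can be traced by hand on the three aspirin rows of patient 2. The following base R sketch mirrors the comparison used in the pipeline above; the names start_date, stop_date, prev_stop and starts_new_period are chosen for this illustration only.

```r
start_date <- as.Date(c("2017-05-01", "2017-05-30", "2017-07-15"))
stop_date  <- as.Date(c("2017-05-04", "2017-06-10", "2017-07-27"))
min_gap <- 30

# Stop date of the previous row; the first row is compared with its own
# start date, so it can never be flagged as a new period.
prev_stop <- c(start_date[1], stop_date[-length(stop_date)])

# TRUE where a row starts more than min_gap days after the previous stop.
starts_new_period <- start_date > prev_stop + min_gap

# The running sum of the flags numbers the periods: 0, 0, 1.
period <- cumsum(starts_new_period)
print(period)
```

Rows that share a period number are the ones that get collapsed, which matches the Medic.Period column in the output above.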

2. data.table

library(data.table)
min_gap <- 30
# coerce date strings to class Date
cols <- stringr::str_subset(names(df), "Date$")
setDT(df)[, (cols) := lapply(.SD, lubridate::mdy), .SDcols = cols][
  # create medication counters for each ID and Medication
  order(Start.Date), 
   Medic.Period := {
     tmp <- shift(Stop.Date, fill = Start.Date[1] - 1)
     stopifnot(all(Start.Date > tmp))
     cumsum(Start.Date > tmp + min_gap)
   }, 
   by = .(ID, Medication)][
     # aggregate to find the overall start and stop dates for each medication period
     , .(Start.Date = min(Start.Date), Stop.Date = max(Stop.Date)), 
     by = .(ID, Medication, Medic.Period)]
   ID Medication Medic.Period Start.Date  Stop.Date
1:  2    aspirin            0 2017-05-01 2017-06-10
2:  2    aspirin            1 2017-07-15 2017-07-27
3:  2    tylenol            0 2017-05-01 2017-05-15
4:  3    lipitor            0 2017-05-06 2017-05-12
5:  5      advil            0 2017-05-28 2017-06-13

Answer 2 (score: 0)

Could you please try the following and let me know if it helps.

df$date_diff <- as.Date(as.character(df$Stop.Date), format="%m/%d/%Y")-as.Date(as.character(df$Start.Date), format="%m/%d/%Y")
ind <- apply( df[5] , 1 , function(x) any( x < 30 ) )
df[ ind , ]