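How to filter rows that are 365 days (+/- 90) from the first date?

Time: 2018-05-25 06:53:25

Tags: r dplyr

This is actually an adaptation of another question I posted previously. I hope other users can give feedback on my code or suggest better alternatives. Thanks!

I have a dataset of lab tests, and I only want the lab tests that fall 365 days (+/- 90 days) after the first lab test.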

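  1. If none of a patient's values fall within 365 +/- 90 days of the first date, we output only the first date, e.g. PATIENT_ID == 1.
  2. I want to output (i) the first date, (ii) the next date that is within the 365 +/- 90 day range and closest to the 365-day mark from the first date, then (iii) the next date that is within 365 +/- 90 days of the second date and closest to its 365-day mark, and so on. For PATIENT_ID == 2, 30/05/2016 and 01/08/2016 are both within 365 +/- 90 days of the first date, but only the latter is selected because it is closer to the 365-day mark. The third date, 27/07/2017, is selected because it is within 365 +/- 90 days of the second date, and so on.

Data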

        PATIENT_ID LAB_TEST_DATE LAB_TEST
     1:          1    2012-11-19       31
     2:          1    2012-11-21       30
     3:          1    2012-11-23       31
     4:          1    2012-11-26       30
     5:          1    2012-11-28       30
     6:          1    2012-12-01       30
     7:          1    2012-12-05       29
     8:          1    2012-12-06       30
     9:          2    2015-07-23       43
    10:          2    2015-08-05       41
    11:          2    2015-08-19       44
    12:          2    2015-09-02       41
    13:          2    2015-09-30       40
    14:          2    2015-12-23       45
    15:          2    2016-03-16       46
    16:          2    2016-05-30       40
    17:          2    2016-08-01       46
    18:          2    2017-07-27       44
    19:          2    2018-10-15       49
    20:          3    2011-08-11       30
    
    ...trunc...
    

Desired output

    PATIENT_ID  LAB_TEST_DATE   LAB_TEST
             1     19/11/2012   31
             2     23/07/2015   43
             2     01/08/2016   46
             2     27/07/2017   44
             2     15/10/2018   49
             3     11/08/2011   30
             3     13/08/2012   36
             4     01/10/2014   41
             4     26/08/2015   42
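
To make the selection rule concrete, here is a small base R sketch for PATIENT_ID == 2 using four of the dates from the data above. The variable names (`first_date`, `candidates`, `offset`, `in_window`, `distance`, `chosen`) are illustrative only and do not appear in the original code.

```r
# Day offsets of some PATIENT_ID == 2 lab tests from the first test
first_date <- as.Date("2015-07-23")
candidates <- as.Date(c("2016-03-16", "2016-05-30", "2016-08-01", "2017-07-27"))

offset    <- as.numeric(candidates - first_date)        # days since the first test
in_window <- offset >= 365 - 90 & offset <= 365 + 90    # inside 275..455 days?
distance  <- abs(offset - 365)                          # distance from the 365-day mark

# Among the in-window candidates, keep the one closest to the 365-day mark
chosen <- candidates[in_window][which.min(distance[in_window])]

print(data.frame(candidates, offset, in_window, distance))
print(chosen)
```

The offsets are 237, 312, 375 and 735 days, so only 30/05/2016 (312) and 01/08/2016 (375) are in the window, and 01/08/2016 is kept because it is 10 days from the 365-day mark rather than 53.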
    

Input data:

    df <- structure(list(PATIENT_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), 
    LAB_TEST_DATE = structure(c(15663, 15665, 15667, 15670, 15672, 
    15675, 15679, 15680, 16639, 16652, 16666, 16680, 16708, 16792, 
    16876, 16951, 17014, 17374, 17819, 15197, 15202, 15217, 15300, 
    15335, 15357, 15405, 15413, 15434, 15453, 15565, 16344, 16352, 
    16364, 16379, 16414, 16442, 16505, 16589, 16673), class = "Date"), 
    LAB_TEST = c(31L, 30L, 31L, 30L, 30L, 30L, 29L, 30L, 43L, 
    41L, 44L, 41L, 40L, 45L, 46L, 40L, 46L, 44L, 49L, 30L, 31L, 
    34L, 34L, 36L, 36L, 33L, 36L, 33L, 35L, 36L, 41L, 43L, 43L, 
    40L, 39L, 42L, 40L, 40L, 42L)), class = "data.frame", .Names = c("PATIENT_ID", 
    "LAB_TEST_DATE", "LAB_TEST"), row.names = c(NA, -39L))
    

Code

I wrote a recursive function: if a date is within the range and closest to the 365-day mark, I keep that date.

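    library(dplyr)  # provides first(), %>%, group_by() and slice()

    # d: dates for one patient; ind: index of the current reference date
    f <- function(d, ind = 1) {

      datediff <- difftime(d, d[ind], units = "days")
      # indices within 365 +/- 90 days of the reference date
      ind.range <- which(datediff >= 275 & datediff <= 455)
      # index of the date closest to the 365-day mark
      ind.min <- which.min(abs(datediff - 365))
      # first() gives NA when the intersection is empty
      ind.next <- first(intersect(ind.range, ind.min))

      if (is.na(ind.next))
        return(ind)
      else
        return(c(ind, f(d, ind.next)))
    }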
    
    df %>% group_by(PATIENT_ID) %>% slice(f(LAB_TEST_DATE))
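
The same chained rule can also be written without recursion. The following is only a sketch in base R, not a tested replacement: `select_yearly`, `target` and `tol` are hypothetical names, and it assumes the dates for one patient are non-missing and sorted in ascending order so that element 1 is the first test.

```r
# Hypothetical helper: iterative version of the chained 365 +/- 90 day rule.
# d: Date vector for one patient, sorted ascending.
# Returns the indices of the selected tests, starting with the first one.
select_yearly <- function(d, target = 365, tol = 90) {
  keep <- 1L
  repeat {
    last      <- keep[length(keep)]
    offset    <- as.numeric(d - d[last])            # days since the last kept test
    in_window <- which(abs(offset - target) <= tol) # 275..455 days by default
    if (length(in_window) == 0L) break              # nothing in range: stop
    # in-window test closest to the target offset
    nxt  <- in_window[which.min(abs(offset[in_window] - target))]
    keep <- c(keep, nxt)
  }
  keep
}

# Dates of PATIENT_ID == 2 from the data above
dates <- as.Date(c("2015-07-23", "2015-08-05", "2015-08-19", "2015-09-02",
                   "2015-09-30", "2015-12-23", "2016-03-16", "2016-05-30",
                   "2016-08-01", "2017-07-27", "2018-10-15"))
idx <- select_yearly(dates)
print(dates[idx])
```

Every in-window date is at least 275 days after the current reference date, so each step moves strictly forward and the loop ends. With dplyr loaded it could presumably be used the same way as `f`, i.e. inside `slice()` after `group_by(PATIENT_ID)`.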
    

1 answer:

Answer 0 (score: 0)

Here is a solution using data.table. Explanations are inline.

library(data.table)
setDT(df)

#extract the first visit for each patient
firstDates <- df[, .SD[1L], by=PATIENT_ID]

#create the period for each lab test
df[, ':=' (STARTDATE=LAB_TEST_DATE+365-90, ENDDATE=LAB_TEST_DATE+365+90)]

#for each lab test, find the lab tests that are within 365 +/- 90 days 
#after that lab test by performing a non-equi self join
withinPeriod <- df[
    df, 
    .(PATIENT_ID, x.LAB_TEST, x.LAB_TEST_DATE, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST), 
    by=.EACHI,
    on=.(PATIENT_ID, LAB_TEST_DATE >= STARTDATE, LAB_TEST_DATE <= ENDDATE)][
        !is.na(x.LAB_TEST), -3L:-1L]

#find the lab test that is closest to the 365 days after that lab test 
#and extract only relevant columns
selected <- withinPeriod[, .SD[which.min(abs(i.LAB_TEST_DATE + 365 - x.LAB_TEST_DATE))], 
    by=.(PATIENT_ID, i.LAB_TEST_DATE, i.STARTDATE, i.ENDDATE, i.LAB_TEST)][, 
        .(PATIENT_ID, LAB_TEST_DATE=x.LAB_TEST_DATE, LAB_TEST=x.LAB_TEST)]

#row-bind the first dates with those selected
ans <- rbindlist(list(firstDates, unique(selected)), use.names=TRUE)
setorder(ans, PATIENT_ID, LAB_TEST_DATE)
ans

#   PATIENT_ID LAB_TEST_DATE LAB_TEST
#1:          1    2012-11-19       31
#2:          2    2015-07-23       43
#3:          2    2016-08-01       46
#4:          2    2017-07-27       44
#5:          2    2018-10-15       49
#6:          3    2011-08-11       30
#7:          3    2012-08-13       36
#8:          4    2014-10-01       41
#9:          4    2015-08-26       42