Remove cases within each group based on their position in the group and the time interval

Time: 2021-03-15 21:27:21

Tags: r dplyr

Here is the sample data I have:

data<- data.frame(ID=c(rep(1,4),rep(2,5),rep(3,5),rep(4,4),rep(5,3),rep(6,3),rep(7,4),rep(8,5)),
                  test_results=c("POS","NEG","NA","NA", 
                                 "NA","NEG","POS","NA","NA",
                                 "NEG","NEG","NEG","POS","NA",
                                 "NA","NA","NA","NA",
                                 "NEG","NEG","NEG",
                                 "POS","POS","POS",
                                 "NEG","NEG","NEG","NA",
                                 "POS","POS","POS","NA","POS"),
                  Test_date=c("2000-1-1","2002-1-2","2003-1-1","2004-1-1",
                              "2000-2-1","2000-10-1","2002-10-2","2002-11-1","2002-12-1",
                              "2000-1-1","2002-1-1","2004-1-1","2006-1-1","2008-1-1",
                              "2000-1-1","2001-1-1","2002-1-1","2003-1-1",
                              "2000-1-1","2002-1-1","2004-1-1",
                              "2002-1-1","2004-1-1","2006-1-1",
                              "2000-1-1","2002-2-1","2003-12-1","2003-12-30",
                              "2002-3-1","2004-5-2","2005-12-30","2005-12-31","2007-9-10"))

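Within each ID, I want to remove any "NA" row that comes after a "POS" if the time interval between that "NA" and the "POS" is less than 3 months. This is the expected result: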

data.frame(ID=c(rep(1,4),rep(2,3),rep(3,5),rep(4,4),rep(5,3),rep(6,3),rep(7,4),rep(8,4)),
                  test_results=c("POS","NEG","NA","NA", 
                                 "NA","NEG","POS",
                                 "NEG","NEG","NEG","POS","NA",
                                 "NA","NA","NA","NA",
                                 "NEG","NEG","NEG",
                                 "POS","POS","POS",
                                 "NEG","NEG","NEG","NA",
                                 "POS","POS","POS","POS"),
                  Test_date=c("2000-1-1","2002-1-2","2003-1-1","2004-1-1",
                              "2000-2-1","2000-10-1","2002-10-2",
                              "2000-1-1","2002-1-1","2004-1-1","2006-1-1","2008-1-1",
                              "2000-1-1","2001-1-1","2002-1-1","2003-1-1",
                              "2000-1-1","2002-1-1","2004-1-1",
                              "2002-1-1","2004-1-1","2006-1-1",
                              "2000-1-1","2002-2-1","2003-12-1","2003-12-30",
                              "2002-3-1","2004-5-2","2005-12-30","2007-9-10"))

I have tried many times to find a good way to do this but have not found a solution. Any insight would be much appreciated. Thanks!

1 answer:

Answer 0: (score: 2)

Here is a data.table option

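library(data.table)
library(lubridate)
setDT(data)[
  ,
  # parse the date strings so they can be compared
  Test_date := ymd(Test_date)
][
  ,
  # Q: per ID, the latest POS date so far plus 3 months (NA before the first POS)
  Q := c(NA, Test_date[test_results == "POS"] %m+% months(3))[cumsum(test_results == "POS") + 1],
  ID
][
  # drop "NA" rows dated on or before Q; rows where Q is NA are kept
  !replace(rep(FALSE, .N), test_results == "NA" & Test_date <= Q, TRUE)
][
  ,
  # remove the helper column
  Q := NULL
][]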

which gives

   ID test_results  Test_date
 1:  1          POS 2000-01-01
 2:  1          NEG 2002-01-02
 3:  1           NA 2003-01-01
 4:  1           NA 2004-01-01
 5:  2           NA 2000-02-01
 6:  2          NEG 2000-10-01
 7:  2          POS 2002-10-02
 8:  3          NEG 2000-01-01
 9:  3          NEG 2002-01-01
10:  3          NEG 2004-01-01
11:  3          POS 2006-01-01
12:  3           NA 2008-01-01
13:  4           NA 2000-01-01
14:  4           NA 2001-01-01
15:  4           NA 2002-01-01
16:  4           NA 2003-01-01
17:  5          NEG 2000-01-01
18:  5          NEG 2002-01-01
19:  5          NEG 2004-01-01
20:  6          POS 2002-01-01
21:  6          POS 2004-01-01
22:  6          POS 2006-01-01
23:  7          NEG 2000-01-01
24:  7          NEG 2002-02-01
25:  7          NEG 2003-12-01
26:  7           NA 2003-12-30
27:  8          POS 2002-03-01
28:  8          POS 2004-05-02
29:  8          POS 2005-12-30
30:  8          POS 2007-09-10
    ID test_results  Test_date

A dplyr option following a similar idea

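library(tidyverse)
library(lubridate)
data %>%
  # parse the date strings so they can be compared
  mutate(Test_date = ymd(Test_date)) %>%
  group_by(ID) %>%
  # Q: per ID, the latest POS date so far plus 3 months (NA before the first POS)
  mutate(Q = c(NA, Test_date[test_results == "POS"] %m+% months(3))[cumsum(test_results == "POS") + 1]) %>%
  # drop "NA" rows dated on or before Q; rows where Q is NA are kept
  filter(!replace(rep(FALSE, n()), test_results == "NA" & Test_date <= Q, TRUE)) %>%
  # remove the helper column
  select(-Q) %>%
  ungroup()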
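
Both options rely on the same lookup: cumsum(test_results == "POS") + 1 is used as an index into a vector of cutoff dates with a leading NA. The following is a minimal sketch of that step for a single group (ID 2 from the question), not a replacement for the code above. The names is_pos, cutoffs, idx, drop and kept are hypothetical, and as.Date(NA) is used in place of a bare NA only so that Q stays a Date in this illustration.

```r
library(lubridate)

# One group taken from the question's data (ID 2)
test_results <- c("NA", "NEG", "POS", "NA", "NA")
Test_date <- ymd(c("2000-2-1", "2000-10-1", "2002-10-2", "2002-11-1", "2002-12-1"))

is_pos <- test_results == "POS"

# One cutoff per POS row: the POS date plus 3 calendar months
cutoffs <- Test_date[is_pos] %m+% months(3)

# cumsum(is_pos) counts the POS rows seen so far (0 before the first one).
# Adding 1 turns that count into an index into c(NA, cutoffs), so rows
# before any POS pick up NA and later rows pick up the latest cutoff.
idx <- cumsum(is_pos) + 1
Q <- c(as.Date(NA), cutoffs)[idx]  # as.Date(NA): assumption of this sketch

# A row is dropped only when it is "NA" and dated on or before its cutoff.
# `%in% TRUE` treats comparisons that evaluate to NA as "do not drop".
drop <- (test_results == "NA" & Test_date <= Q) %in% TRUE
kept <- test_results[!drop]
print(data.frame(test_results, Test_date, idx, Q, drop))
```

Note that the comparison is Test_date <= Q, so an "NA" dated exactly 3 months after the "POS" is also dropped; if the interval must be strictly less than 3 months, < would be the comparison to use.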