Here is my sample data:
data<- data.frame(ID=c(rep(1,4),rep(2,5),rep(3,5),rep(4,4),rep(5,3),rep(6,3),rep(7,4),rep(8,5)),
test_results=c("POS","NEG","NA","NA",
"NA","NEG","POS","NA","NA",
"NEG","NEG","NEG","POS","NA",
"NA","NA","NA","NA",
"NEG","NEG","NEG",
"POS","POS","POS",
"NEG","NEG","NEG","NA",
"POS","POS","POS","NA","POS"),
Test_date=c("2000-1-1","2002-1-2","2003-1-1","2004-1-1",
"2000-2-1","2000-10-1","2002-10-2","2002-11-1","2002-12-1",
"2000-1-1","2002-1-1","2004-1-1","2006-1-1","2008-1-1",
"2000-1-1","2001-1-1","2002-1-1","2003-1-1",
"2000-1-1","2002-1-1","2004-1-1",
"2002-1-1","2004-1-1","2006-1-1",
"2000-1-1","2002-2-1","2003-12-1","2003-12-30",
"2002-3-1","2004-5-2","2005-12-30","2005-12-31","2007-9-10"))
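Within each ID, I want to remove any "NA" that follows a "POS" when the time between that "NA" and the "POS" is less than 3 months. This is the expected result: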
data.frame(ID=c(rep(1,4),rep(2,3),rep(3,5),rep(4,4),rep(5,3),rep(6,3),rep(7,4),rep(8,4)),
test_results=c("POS","NEG","NA","NA",
"NA","NEG","POS",
"NEG","NEG","NEG","POS","NA",
"NA","NA","NA","NA",
"NEG","NEG","NEG",
"POS","POS","POS",
"NEG","NEG","NEG","NA",
"POS","POS","POS","POS"),
Test_date=c("2000-1-1","2002-1-2","2003-1-1","2004-1-1",
"2000-2-1","2000-10-1","2002-10-2",
"2000-1-1","2002-1-1","2004-1-1","2006-1-1","2008-1-1",
"2000-1-1","2001-1-1","2002-1-1","2003-1-1",
"2000-1-1","2002-1-1","2004-1-1",
"2002-1-1","2004-1-1","2006-1-1",
"2000-1-1","2002-2-1","2003-12-1","2003-12-30",
"2002-3-1","2004-5-2","2005-12-30","2007-9-10"))
I have tried several times to find a good way to do this, but have not found a solution. Any insight would be appreciated. Thanks!
Answer 0 (score: 2)
Here is a data.table option
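library(data.table)
library(lubridate)

# Convert to a data.table by reference and parse the dates
setDT(data)[
  ,
  Test_date := ymd(Test_date)
][
  ,
  # Q: date of the most recent "POS" so far within the ID, plus 3 months
  # (NA for rows before the first "POS")
  Q := c(NA, Test_date[test_results == "POS"] %m+% months(3))[cumsum(test_results == "POS") + 1],
  ID
][
  # Drop "NA" rows dated on or before Q; rows where Q is NA are kept
  !replace(rep(FALSE, .N), test_results == "NA" & Test_date <= Q, TRUE)
][
  ,
  Q := NULL
][]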
which gives
ID test_results Test_date
1: 1 POS 2000-01-01
2: 1 NEG 2002-01-02
3: 1 NA 2003-01-01
4: 1 NA 2004-01-01
5: 2 NA 2000-02-01
6: 2 NEG 2000-10-01
7: 2 POS 2002-10-02
8: 3 NEG 2000-01-01
9: 3 NEG 2002-01-01
10: 3 NEG 2004-01-01
11: 3 POS 2006-01-01
12: 3 NA 2008-01-01
13: 4 NA 2000-01-01
14: 4 NA 2001-01-01
15: 4 NA 2002-01-01
16: 4 NA 2003-01-01
17: 5 NEG 2000-01-01
18: 5 NEG 2002-01-01
19: 5 NEG 2004-01-01
20: 6 POS 2002-01-01
21: 6 POS 2004-01-01
22: 6 POS 2006-01-01
23: 7 NEG 2000-01-01
24: 7 NEG 2002-02-01
25: 7 NEG 2003-12-01
26: 7 NA 2003-12-30
27: 8 POS 2002-03-01
28: 8 POS 2004-05-02
29: 8 POS 2005-12-30
30: 8 POS 2007-09-10
ID test_results Test_date
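To make the `Q` lookup easier to follow, here is a minimal base R sketch that walks through it for a single ID (the five rows of ID 8 from the sample data). It is only an illustration, not the answer's own code: the helper `add_3_months` is a hypothetical name introduced here, and it uses `seq(..., by = "3 months")` in place of lubridate's `%m+% months(3)`, which I assume behaves the same for these dates but can differ for end-of-month dates.

```r
# Rows of ID 8 from the sample data
results <- c("POS", "POS", "POS", "NA", "POS")
dates   <- as.Date(c("2002-03-01", "2004-05-02", "2005-12-30",
                     "2005-12-31", "2007-09-10"))

# Hypothetical helper: add three calendar months to each date (base R only)
add_3_months <- function(d) {
  do.call(c, lapply(seq_along(d), function(i) {
    seq(d[i], by = "3 months", length.out = 2)[2]
  }))
}

is_pos <- results == "POS"

# One cutoff per "POS" row, with a leading NA for rows before the first "POS"
cutoffs <- c(as.Date(NA), add_3_months(dates[is_pos]))

# cumsum(is_pos) counts the "POS" rows seen so far, so adding 1 picks
# the cutoff that belongs to the most recent "POS" for every row
Q <- cutoffs[cumsum(is_pos) + 1]

# An "NA" row is dropped when it falls on or before its cutoff
to_drop <- results == "NA" & !is.na(Q) & dates <= Q
print(to_drop)
# FALSE FALSE FALSE  TRUE FALSE

kept <- data.frame(test_results = results[!to_drop],
                   Test_date    = dates[!to_drop])
print(kept)
```

Only the "NA" on 2005-12-31 is removed, because it falls one day after the "POS" on 2005-12-30; this matches the four rows kept for ID 8 in the expected result.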
A dplyr option following a similar idea
library(tidyverse)
library(lubridate)
data %>%
  mutate(Test_date = ymd(Test_date)) %>%
  group_by(ID) %>%
  # Q: date of the most recent "POS" so far within the ID, plus 3 months
  mutate(Q = c(NA, Test_date[test_results == "POS"] %m+% months(3))[cumsum(test_results == "POS") + 1]) %>%
  # Drop "NA" rows dated on or before Q; rows where Q is NA are kept
  filter(!replace(rep(FALSE, n()), test_results == "NA" & Test_date <= Q, TRUE)) %>%
  select(-Q) %>%
  ungroup()