Suppose I have a dataset like this:
ID  DOB       STARTDATE  ENDDATE  FAILURE
1   10/10/75  5/10/84    15/5/03  0
2   15/9/76   10/3/84    19/6/92  0
2   15/9/76   22/2/93    15/1/99  0
2   15/9/76   15/4/99    15/1/03  0
where some people have multiple start dates.
I am trying to manipulate it so that there is a new STARTDATE entry at 10 and 20 years after DOB (until ENDDATE is reached), like this:
ID  DOB       STARTDATE  ENDDATE  FAILURE
1   10/10/75  5/10/84    15/5/03  0
2   15/9/76   10/3/84    19/6/92  0
2   15/9/76   14/9/86    19/6/92  0
2   15/9/76   22/2/93    15/1/99  0
2   15/9/76   15/9/96    15/1/99  0
2   15/9/76   15/4/99    15/1/03  0
I know Stata handles this well. So far I have tried to solve it in R by adding a new column holding the age at entry (STARTDATE - DOB):
library(eeptools)
AGEENTRY <- age_calc(DOB, STARTDATE, units = "years")
and then running survSplit like this:
survSplit(DATA, cut = c(10, 20), end = "AGEENTRY",
          event = "FAILURE", start = "START")
However, this does not do quite what I hoped. I have been stuck on this for more than a day, so any help would be greatly appreciated!
Answer 0 (score: 0)
One possible approach:
library(tidyverse)
library(lubridate)
library(anytime)

# Convert the character columns of the sample data (those containing "/") to Date columns
df <- df %>%
  mutate_if(grepl("/", .), ~ as.Date(., format = "%d/%m/%Y"))

# Identify the decade birthdays (DOB + 10, 20, 30 ... years), then reshape the data into the desired result
df %>%
  group_by(ID, DOB) %>%
  # all decade birthdays up to the group's latest STARTDATE, as one comma-separated string
  mutate(temp = paste(seq.Date(unique(DOB), max(STARTDATE), by = "10 year")[-1], collapse = ',')) %>%
  separate_rows(temp, sep = ',') %>%
  # `.add = TRUE` was spelled `add = TRUE` before dplyr 1.0.0
  group_by(STARTDATE, .add = TRUE) %>%
  # keep only the first decade birthday after each STARTDATE
  filter(anydate(temp) == min(anydate(temp)[anydate(temp) > STARTDATE])) %>%
  ungroup() %>%
  # bring back the rows that have no such birthday (temp is NA for them)
  right_join(df) %>%
  mutate(STARTDATE = ifelse(!is.na(temp), paste(STARTDATE, temp, sep = ','), as.character(STARTDATE))) %>%
  separate_rows(STARTDATE, sep = ',') %>%
  select(-temp)
which gives
  ID        DOB  STARTDATE    ENDDATE FAILURE
1  1 1975-10-10 1984-10-05 2003-05-15       0
2  2 1976-09-15 1984-03-10 1992-06-19       0
3  2 1976-09-15 1986-09-15 1992-06-19       0
4  2 1976-09-15 1993-02-22 1999-01-15       0
5  2 1976-09-15 1996-09-15 1999-01-15       0
6  2 1976-09-15 1999-04-15 2003-01-15       0
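The step that drives the pipeline is `seq.Date()` stepping from DOB in 10-year increments up to the group's latest STARTDATE. Below is a minimal base R sketch of just that step, using dates from the sample data; the variable names are illustrative and not part of the answer.

```r
# ID 2: DOB and latest STARTDATE, parsed as day/month/year
dob_2        <- as.Date("15/9/1976", format = "%d/%m/%Y")
last_start_2 <- as.Date("15/4/1999", format = "%d/%m/%Y")

# Step from DOB in 10-year increments; [-1] drops DOB itself
birthdays_2 <- seq(dob_2, last_start_2, by = "10 year")[-1]
print(birthdays_2)  # 1986-09-15 and 1996-09-15

# ID 1: the latest STARTDATE comes before the 10th birthday,
# so no decade birthday is generated at all
dob_1        <- as.Date("10/10/1975", format = "%d/%m/%Y")
last_start_1 <- as.Date("5/10/1984", format = "%d/%m/%Y")
birthdays_1  <- seq(dob_1, last_start_1, by = "10 year")[-1]
print(length(birthdays_1))  # 0
```

Because the sequence stops at the latest STARTDATE rather than at ENDDATE, ID 1 gets no extra row even though DOB + 10 years (1985-10-10) falls before its ENDDATE.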
Sample data
df <- structure(list(
  ID = c(1L, 2L, 2L, 2L),
  DOB = c("10/10/1975", "15/9/1976", "15/9/1976", "15/9/1976"),
  STARTDATE = c("5/10/1984", "10/3/1984", "22/2/1993", "15/4/1999"),
  ENDDATE = c("15/5/2003", "19/6/1992", "15/1/1999", "15/1/2003"),
  FAILURE = c(0L, 0L, 0L, 0L)
), .Names = c("ID", "DOB", "STARTDATE", "ENDDATE", "FAILURE"),
class = "data.frame", row.names = c(NA, -4L))
#   ID        DOB STARTDATE   ENDDATE FAILURE
# 1  1 10/10/1975 5/10/1984 15/5/2003       0
# 2  2  15/9/1976 10/3/1984 19/6/1992       0
# 3  2  15/9/1976 22/2/1993 15/1/1999       0
# 4  2  15/9/1976 15/4/1999 15/1/2003       0
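For comparison, here is a base R sketch that takes the question's wording literally: every 10th or 20th birthday that falls strictly between a row's STARTDATE and ENDDATE becomes an additional STARTDATE row. This is an assumption about the intended rule, not the method used above, and `split_row` is a hypothetical helper name. Each added row keeps the original ENDDATE, as in the question's expected table.

```r
# Sample data, with the date columns parsed as day/month/year
parse_dmy <- function(x) as.Date(x, format = "%d/%m/%Y")
df <- data.frame(
  ID        = c(1L, 2L, 2L, 2L),
  DOB       = parse_dmy(c("10/10/1975", "15/9/1976", "15/9/1976", "15/9/1976")),
  STARTDATE = parse_dmy(c("5/10/1984", "10/3/1984", "22/2/1993", "15/4/1999")),
  ENDDATE   = parse_dmy(c("15/5/2003", "19/6/1992", "15/1/1999", "15/1/2003")),
  FAILURE   = 0L
)

# Hypothetical helper: expand a one-row data frame into one row per start date
split_row <- function(row) {
  # 10th and 20th birthdays: step from DOB in 10-year increments, drop DOB itself
  cuts <- seq(row$DOB, by = "10 years", length.out = 3)[-1]
  # keep only birthdays strictly inside (STARTDATE, ENDDATE)
  cuts <- cuts[cuts > row$STARTDATE & cuts < row$ENDDATE]
  # repeat the row once per start date, then overwrite STARTDATE
  out <- row[rep(1, length(cuts) + 1), ]
  out$STARTDATE <- c(row$STARTDATE, cuts)
  out
}

result <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) split_row(df[i, ])))
rownames(result) <- NULL
print(result)
```

Under this rule the rows for ID 2 match the output shown earlier, but ID 1 also gains rows starting 1985-10-10 and 1995-10-10, because both birthdays fall before its ENDDATE of 2003-05-15.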