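I have a large long-format data frame of survival observation data (multiple entries per subject ID). I am trying to find out which subjects had their last observation recorded before the end of the study observation period (e.g. before week 100 in the case of this study). Essentially, I am trying to find out who was lost to follow-up. Is there a function that does this? Sorry if a similar question has already been answered, but I could not think of terms technical enough to find anything in a web search. I have basic literacy in R, but I do not have a very strong technical background. Thanks for your time and help!
Below is an excerpt of the data frame in question. It contains one example where the last observation falls before week 105 (at week 104).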
structure(list(ID = c(140L, 140L, 141L, 142L, 142L, 143L, 143L,
144L, 144L, 144L, 144L), WEEK = c(40L, 105L, 105L, 11L, 105L,
103L, 104L, 37L, 48L, 65L, 105L), OBSDATE = structure(c(40L,
107L, 107L, 11L, 107L, 105L, 106L, 37L, 48L, 65L, 107L), .Label = c("2002-12-29",
"2003-01-05", "2003-01-12", "2003-01-19", "2003-01-26", "2003-02-02",
"2003-02-09", "2003-02-16", "2003-02-23", "2003-03-02", "2003-03-09",
"2003-03-16", "2003-03-23", "2003-03-30", "2003-04-06", "2003-04-13",
"2003-04-20", "2003-04-27", "2003-05-04", "2003-05-11", "2003-05-18",
"2003-05-25", "2003-06-01", "2003-06-08", "2003-06-15", "2003-06-22",
"2003-06-29", "2003-07-06", "2003-07-13", "2003-07-20", "2003-07-27",
"2003-08-03", "2003-08-10", "2003-08-17", "2003-08-24", "2003-08-31",
"2003-09-07", "2003-09-14", "2003-09-21", "2003-09-28", "2003-10-05",
"2003-10-12", "2003-10-19", "2003-10-26", "2003-11-02", "2003-11-09",
"2003-11-16", "2003-11-23", "2003-11-30", "2003-12-07", "2003-12-14",
"2003-12-21", "2003-12-28", "2004-01-04", "2004-01-11", "2004-01-18",
"2004-01-25", "2004-02-01", "2004-02-08", "2004-02-15", "2004-02-22",
"2004-02-29", "2004-03-07", "2004-03-14", "2004-03-21", "2004-03-27",
"2004-03-28", "2004-04-04", "2004-04-11", "2004-04-18", "2004-04-25",
"2004-05-02", "2004-05-09", "2004-05-16", "2004-05-23", "2004-05-30",
"2004-06-06", "2004-06-10", "2004-06-13", "2004-06-20", "2004-06-27",
"2004-07-04", "2004-07-11", "2004-07-18", "2004-07-25", "2004-08-01",
"2004-08-08", "2004-08-15", "2004-08-22", "2004-08-29", "2004-09-05",
"2004-09-12", "2004-09-19", "2004-09-26", "2004-10-03", "2004-10-10",
"2004-10-17", "2004-10-24", "2004-10-31", "2004-11-07", "2004-11-14",
"2004-11-21", "2004-11-28", "2004-12-05", "2004-12-12", "2004-12-19",
"2004-12-26", "2005-11-24", "2006-11-02", "2007-02-26", "2009-05-18",
"2010-08-11", "2011-01-29", "2013-09-06", "2017-04-23", "2017-05-13",
"2019-05-01", "2022-11-22", "2026-03-20", "2026-08-15", "2028-09-26",
"2030-02-08", "2034-08-30", "2035-01-22", "2035-10-14", "2037-09-20",
"2038-05-09", "2043-01-31", "2043-08-19", "2045-03-29", "2046-05-15",
"2050-03-06", "2053-10-15", "2054-05-22", "2056-06-09", "2060-03-13",
"2061-04-15", "2061-08-30", "2062-07-10"), class = "factor")), .Names = c("ID",
"WEEK", "OBSDATE"), row.names = 231:241, class = "data.frame")
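Answer 0 (score: 0)
One way to tackle this is with an old function I used to analyze controlled trial studies. It numbers each subject's observations in time order (1 = first visit).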
followup <- function(id, time) {
  if (length(id) != length(time)) stop("The length of these two variables must be equal")
  if (any(duplicated(paste(id, time)))) stop("The combination of id and time must be unique")
  # Row order that sorts the data by id, then by time within id
  ord <- order(id, time)
  # Run lengths of the sorted ids give the number of visits per subject
  runs <- rle(as.vector(id[ord]))
  # Number the visits 1, 2, ... within each subject (still in sorted order)
  visit <- sequence(runs$lengths)
  # Map the visit numbers back to the original row order
  visit[order(ord)]
}
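For comparison, the same per-subject visit numbering can be sketched in base R with `ave()` and `rank()`. This is only an illustration on made-up toy vectors (the names `id`, `time` and `visit` and all values are assumptions, not taken from the question), and it assumes there are no duplicated times within a subject.

```r
# Hypothetical toy data: two subjects, rows deliberately out of time order
id   <- c("A", "B", "A", "B", "A")
time <- as.Date(c("2012-05-01", "2011-01-10", "2010-03-15",
                  "2013-07-04", "2011-09-30"))

# Within each id, rank the observation times: 1 = earliest visit
visit <- ave(as.numeric(time), id, FUN = rank)

data.frame(id, time, visit)
```

The result stays in the original row order, so it can be assigned directly as a new column.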
Since I have no further details about your data, I will simulate some here:
data <- data.frame(ID = sample(LETTERS, 50, replace = TRUE), variable = rnorm(50, 50, 10))
# Draw one random date per row between start.day and end.day
rand.date <- function(start.day, end.day, data) {
  size <- nrow(data)
  days <- seq.Date(as.Date(start.day), as.Date(end.day), by = "day")
  sample(days, size, replace = TRUE)
}
data$date <- rand.date("2010-01-01", "2015-07-18", data)
> data
  ID variable       date
1  L 52.75080 2010-12-28
2  W 51.36106 2011-11-24
3  S 46.52550 2011-06-19
4  S 64.37270 2013-06-18
5  X 68.47047 2015-03-17
6  Y 44.52643 2010-11-18
7  O 51.61603 2015-04-13
..... ......
# Executing the function:
data$follow<- followup(data$ID, data$date)
> data
  ID variable       date follow
1  L 52.75080 2010-12-28      1
2  W 51.36106 2011-11-24      2
3  S 46.52550 2011-06-19      1
4  S 64.37270 2013-06-18      2
5  X 68.47047 2015-03-17      3
6  Y 44.52643 2010-11-18      1
7  O 51.61603 2015-04-13      2
8  C 60.06102 2014-06-22      3
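So all you have to do is sort the data.frame by ID and by the follow column, and look at each subject's highest follow value to see when that subject was last seen in the study.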
library(dplyr)
> data %>% group_by(ID) %>% arrange(ID, follow)
Source: local data frame [50 x 4]
Groups: ID
   ID variable       date follow
1   A 61.75308 2014-06-28      1
2   A 32.19119 2015-05-15      2
3   B 45.40385 2011-09-07      1
4   B 52.31812 2014-12-24      2
5   C 50.75906 2014-06-09      1
6   C 34.27607 2012-10-29      2
7   C 60.06102 2014-06-22      3
8   D 61.69071 2014-06-17      1
9   D 51.49701 2014-05-22      2
..  ..      ...        ...    ...
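To go from a sorted table to an actual list of subjects lost to follow-up, one possible base R sketch is to keep only each subject's last row and compare its date with the end of the study. The toy data frame and the cutoff `study_end` below are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data standing in for the simulated data frame
data <- data.frame(
  ID   = c("A", "A", "B", "B", "C"),
  date = as.Date(c("2014-06-28", "2015-05-15", "2011-09-07",
                   "2014-12-24", "2012-10-29"))
)
study_end <- as.Date("2015-01-01")  # assumed end of the observation period

# Sort by subject and date, then keep the last row of each subject
sorted    <- data[order(data$ID, data$date), ]
last_seen <- sorted[!duplicated(sorted$ID, fromLast = TRUE), ]

# TRUE when the subject's final observation is before the study end
last_seen$lost <- last_seen$date < study_end
last_seen
```

Subjects with `lost == TRUE` are the ones whose final observation precedes the assumed cutoff.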
Answer 1 (score: 0)
Using the data you provided (and calling it dat):
library(dplyr)
group_by(dat, ID) %>%
summarize(censored = max(WEEK) < 105)
# Source: local data frame [5 x 2]
#
# ID censored
# 1 140 FALSE
# 2 141 FALSE
# 3 142 FALSE
# 4 143 TRUE
# 5 144 FALSE
If you want the row indices, in the original data, of the censored subject IDs:
cens_id = group_by(dat, ID) %>%
summarize(censored = max(WEEK) < 105) %>%
filter(censored)
which(dat$ID %in% cens_id$ID)
# [1] 6 7
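If dplyr is not available, the same check can be sketched in base R with `tapply()`. The `ID` and `WEEK` columns below are copied from the excerpt in the question; the name `study_end_week` and the choice of week 105 as the end of observation are assumptions based on that excerpt.

```r
# ID and WEEK columns from the question's excerpt
dat <- data.frame(
  ID   = c(140L, 140L, 141L, 142L, 142L, 143L, 143L, 144L, 144L, 144L, 144L),
  WEEK = c(40L, 105L, 105L, 11L, 105L, 103L, 104L, 37L, 48L, 65L, 105L)
)
study_end_week <- 105L  # assumed final week of the observation period

# Last observed week per subject (a vector named by ID)
last_week <- tapply(dat$WEEK, dat$ID, max)

# Subjects whose last observation is before the final week
lost_ids <- as.integer(names(last_week)[last_week < study_end_week])
lost_ids

# Row positions of those subjects in the original data
which(dat$ID %in% lost_ids)
```

Here only subject 143 is flagged, since its last recorded week is 104.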