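I am trying to carry out a study but am not sure whether it is feasible. I have a question about it, and I hope the answer will be useful to others as well.
I have a dataset like this: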
id age gender v1 v2 v3 event
1 30 0 2.3 3.7 NA 1
2 31 0 1.3 4.3 4.1 0
3 40 1 3.1 NA NA 1
4 41 1 2.3 2.7 NA 0
5 42 1 2.6 3.2 NA 0
6 53 1 2.5 2.4 NA 0
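First, for each case (event == 1), I want to find controls (event == 0) matched on age and gender. Second, where the case has a missing measurement (v2, v3), I want to set that same measurement to missing in its matched controls.
The desired dataset should look like this: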
id age gender v1 v2 v3 event
1 30 0 2.3 3.7 NA 1
2 31 0 1.3 4.3 NA 0
3 40 1 3.1 NA NA 1
4 41 1 2.3 NA NA 0
5 42 1 2.6 NA NA 0
I hope this is clear and that it may be useful to others.
Answer 0 (score: 2)
Try:
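library(data.table)
# Bin age into ten-year bands so that cases and controls can be matched within a band
df$ageGrp <- cut(df$age, breaks=c(29,39,49,59), labels=c(30,40,50))
# Keep only the age band/gender groups that contain both a case and a control
indx <- with(df, !!ave(event, ageGrp, gender,
           FUN=function(x) any(!x) & any(!!x)))
df1 <- df[indx,]
# If any value in a group is NA, set the whole group to NA for that column
fun1 <- function(x) {if(any(is.na(x))) rep(NA_real_, length(x)) else x}
nm1 <- paste0("v", 1:3)
# Note: re-attaching id, age and event by position assumes the rows of each group are contiguous in df1
res <- setDT(df1)[, lapply(.SD, fun1),by=list(gender, ageGrp),
         .SDcols=nm1][,c("id", "age", "event"):= list(df1$id, df1$age,
         df1$event)][,ageGrp:=NULL]
res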
# gender v1 v2 v3 id age event
#1: 0 2.3 3.7 NA 1 30 1
#2: 0 1.3 4.3 NA 2 31 0
#3: 1 3.1 NA NA 3 40 1
#4: 1 2.3 NA NA 4 41 0
#5: 1 2.6 NA NA 5 42 0
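The same two steps (keep only the strata that contain both a case and a control, then propagate missing values within each stratum) can also be sketched in base R alone. This is a minimal sketch that assumes the ten-year age bands used above; the names `has_both`, `matched` and `any_na` are made up for illustration and are not part of the original answer.

```r
# Sample data from the question.
df <- data.frame(
  id     = 1:6,
  age    = c(30L, 31L, 40L, 41L, 42L, 53L),
  gender = c(0L, 0L, 1L, 1L, 1L, 1L),
  v1     = c(2.3, 1.3, 3.1, 2.3, 2.6, 2.5),
  v2     = c(3.7, 4.3, NA, 2.7, 3.2, 2.4),
  v3     = c(NA, 4.1, NA, NA, NA, NA),
  event  = c(1L, 0L, 1L, 0L, 0L, 0L)
)

# Ten-year age bands: (29,39], (39,49], (49,59].
df$ageGrp <- cut(df$age, breaks = c(29, 39, 49, 59), labels = c(30, 40, 50))

# Flag rows whose (ageGrp, gender) stratum holds at least one case and one control.
has_both <- ave(df$event, df$ageGrp, df$gender,
                FUN = function(x) any(x == 1) && any(x == 0)) == 1
matched <- df[has_both, ]

# For each measurement, blank out the whole stratum if any value in it is NA.
for (col in c("v1", "v2", "v3")) {
  any_na <- ave(is.na(matched[[col]]), matched$ageGrp, matched$gender, FUN = any)
  matched[[col]][as.logical(any_na)] <- NA
}

matched$ageGrp <- NULL  # drop the helper column
matched
```

Because the values are modified in place, row order and the `id`, `age` and `event` columns stay aligned without having to be re-attached afterwards.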
Or you can use dplyr:
library(dplyr)
df %>%
group_by(gender, ageGrp) %>%
filter(any(event==1)&any(event==0)) %>%
mutate_each(funs(fun1), starts_with("v")) %>%
ungroup() %>%
select(-ageGrp)
# id age gender v1 v2 v3 event
#1 1 30 0 2.3 3.7 NA 1
#2 2 31 0 1.3 4.3 NA 0
#3 3 40 1 3.1 NA NA 1
#4 4 41 1 2.3 NA NA 0
#5 5 42 1 2.6 NA NA 0
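In more recent dplyr releases `mutate_each()` and `funs()` are deprecated. The sketch below shows how the same pipeline might be written with `mutate(across(...))`, assuming dplyr 1.0.0 or later is installed. It rebuilds the sample data and `ageGrp` so that it runs on its own; `fun1` is the helper from the answer, written here with `anyNA()`.

```r
library(dplyr)

# Sample data from the question.
df <- data.frame(
  id     = 1:6,
  age    = c(30L, 31L, 40L, 41L, 42L, 53L),
  gender = c(0L, 0L, 1L, 1L, 1L, 1L),
  v1     = c(2.3, 1.3, 3.1, 2.3, 2.6, 2.5),
  v2     = c(3.7, 4.3, NA, 2.7, 3.2, 2.4),
  v3     = c(NA, 4.1, NA, NA, NA, NA),
  event  = c(1L, 0L, 1L, 0L, 0L, 0L)
)

# Ten-year age bands: (29,39], (39,49], (49,59].
df$ageGrp <- cut(df$age, breaks = c(29, 39, 49, 59), labels = c(30, 40, 50))

# If any value in the group is NA, return NA for the whole group.
fun1 <- function(x) if (anyNA(x)) rep(NA_real_, length(x)) else x

res <- df %>%
  group_by(gender, ageGrp) %>%
  # keep only groups that contain both a case and a control
  filter(any(event == 1) & any(event == 0)) %>%
  # apply fun1 group-wise to v1, v2 and v3
  mutate(across(starts_with("v"), fun1)) %>%
  ungroup() %>%
  select(-ageGrp)

res
```

The grouping columns are excluded from `across()` automatically, and `event` is not selected because it does not start with "v".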
If the NAs are replaced with 25, and you want to fill in 25 for the whole group whenever the group contains that value:
df$v2[is.na(df$v2)] <- 25 #change the NAs to 25 in the dataset for testing
df$v3[is.na(df$v3)] <- 25
fun2 <- function(x) {if(any(x==25)) rep(25,length(x)) else x}
df %>%
group_by(gender, ageGrp) %>%
filter(any(event==1)&any(event==0)) %>%
mutate_each(funs(fun2), starts_with("v")) %>%
ungroup() %>%
select(-ageGrp)
#Source: local data frame [5 x 7]
# id age gender v1 v2 v3 event
#1 1 30 0 2.3 3.7 25 1
#2 2 31 0 1.3 4.3 25 0
#3 3 40 1 3.1 25.0 25 1
#4 4 41 1 2.3 25.0 25 0
#5 5 42 1 2.6 25.0 25 0
# Sample data used in the examples above
df <- structure(list(id = 1:6, age = c(30L, 31L, 40L, 41L, 42L, 53L
), gender = c(0L, 0L, 1L, 1L, 1L, 1L), v1 = c(2.3, 1.3, 3.1,
2.3, 2.6, 2.5), v2 = c(3.7, 4.3, NA, 2.7, 3.2, 2.4), v3 = c(NA,
4.1, NA, NA, NA, NA), event = c(1L, 0L, 1L, 0L, 0L, 0L)), .Names = c("id",
"age", "gender", "v1", "v2", "v3", "event"), class = "data.frame", row.names = c(NA,
-6L))