I have been trying to find an answer to this without any luck, so I hope someone can help. I have a set of patient data:
PatientID <- c('1', "1", "1","1", "2","2","2","2","3","3","3","3")
admission.duration.minutes <- c(0,0.5,1.2,2,0,2.5,3.6,8,0,4,22,24)
has.fever <- c(1,1,NA,0,1,NA,1,1,NA,0,1,NA)
on.ventilator<-c(1,0,1,1,0,1,0,1,NA,1,0,NA)
high.bloodpressure<-c(1,0,1,0,1,0,1,1,1,1,NA,1)
df <- data.frame(PatientID, admission.duration.minutes, has.fever,on.ventilator,high.bloodpressure)
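I want to reshape the dataset so that there is one row per patient, and I want to count how many patients had a fever in hour 1, were on a ventilator in hour 1, had high blood pressure in hour 1, and had each combination of fever, ventilator and blood pressure in hour 1. The same goes for hours 2, 3, and so on.
So I believe I first need to add a time-stratum variable that defines hours 1, 2, 3, etc., where hour 1 = 0.0 to 1.0 and hour 2 is greater than 1.0 up to 2.0. Then I would do a conditional count or something similar.
I have tried the Publish package but could not get the correct output.
The new data frame should look like this: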
PatientID hour1.fev hour1.vent hour1.BP hour1.fev&vent hour1.fev&BP
1 1 1 1 1 1
hour1.vent&BP hour2.fev hour2.vent hour2.BP hour2.fev&vent hour2.fev&BP
1 0 1 0 1 1
hour2.vent&BP
1
Can you help me?
Answer 0: (score: 1)
As a first approach, I would suggest the following. First, group the data by patient and time span:
library("dplyr")
# definition of time spans
df$strata <- if_else(df$admission.duration.minutes == 0, 1, ceiling(df$admission.duration.minutes))
# note that NA measurements are silently treated as zeros here
df_groupped <- df %>% group_by(PatientID, strata) %>% summarise_at(vars(has.fever:high.bloodpressure),
sum, na.rm = TRUE)
If we want to handle NAs differently, one solution could be:
# the result is NA only if all measurements in the stratum are NA
# (a formula lambda replaces the deprecated funs(); NA_real_ keeps the column type numeric)
df_groupped <- df %>% group_by(PatientID, strata) %>%
summarise_at(.vars = vars(has.fever:high.bloodpressure),
.funs = ~ if (all(is.na(.))) NA_real_ else sum(., na.rm = TRUE))
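To see this rule in isolation, here is a minimal base R sketch. The helper name `sum_or_na` is hypothetical and not part of the code above; it only illustrates "NA when everything is missing, otherwise the sum of what is present".

```r
# Hypothetical helper: NA only when every value is NA,
# otherwise the sum of the non-missing values.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

all_missing  <- sum_or_na(c(NA, NA))    # every measurement is missing
some_missing <- sum_or_na(c(1, NA, 0))  # the missing value is skipped
```

With this rule, a stratum in which nothing was measured stays distinguishable from one in which the measurement was taken and came back 0.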
This gives us the grouped data frame in long format:
# transform counts of measurements to booleans
df_groupped <- df_groupped %>% mutate(
has.fever = as.integer(as.logical(has.fever)),
on.ventilator = as.integer(as.logical(on.ventilator)),
high.bloodpressure = as.integer(as.logical(high.bloodpressure)),
# ".and." means `*` instead of `+`
fev.and.BP = as.integer(as.logical(has.fever * high.bloodpressure)),
fev.and.vent = as.integer(as.logical(has.fever * on.ventilator))
)
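The expected output in the question also lists a ventilator-and-blood-pressure column, which the step above does not create. A minimal base R sketch of all three pairwise flags follows. The toy values and the column name `vent.and.BP` are assumptions made for illustration, not the patient data.

```r
# Toy per-stratum 0/1 indicators (made-up values, not the patient data above)
flags <- data.frame(
  has.fever          = c(1L, 0L, NA),
  on.ventilator      = c(1L, 1L, 1L),
  high.bloodpressure = c(0L, 1L, 1L)
)

# The product of two 0/1 indicators is 1 only when both are 1.
# An NA in either indicator gives NA in the product.
flags$fev.and.vent <- flags$has.fever * flags$on.ventilator
flags$fev.and.BP   <- flags$has.fever * flags$high.bloodpressure
flags$vent.and.BP  <- flags$on.ventilator * flags$high.bloodpressure
```

Note that in R `NA * 0L` is `NA`, so a missing indicator makes the combination missing even when the other indicator is 0.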
Then create a function that builds a data frame with the desired structure:
fill_form <- function(periods, df_Patient, n_param){
# obtain names of the measured parameters & the first column
long_col_names <- names(df_Patient)[-(1:2)]
# index by position so that periods need not be exactly 1:n
long_df_names <- sapply(seq_along(periods), function(i) paste0("hour", periods[i], ".", long_col_names))
# add the names of the first column with the Patient's ID
long_df_names <- c(names(df_Patient)[1], long_df_names)
long_df <- as.data.frame(matrix(NA, nrow = 1, ncol = 1 + length(periods) * n_param))
names(long_df) <- long_df_names
# extract the ID as a plain value so that a factor column is not converted to its integer code
long_df[, 1] <- as.character(df_Patient[[1]][1])
for (i in seq(along.with = periods)) {
if (nrow(filter(df_Patient, strata == periods[i])) > 0) {
# each period occupies exactly n_param columns
long_df[, (2 + n_param * (i - 1)):(1 + n_param * i)] <- filter(df_Patient, strata == periods[i])[-(1:2)]
}
}
return(long_df)
}
Then apply this function to each patient's data in turn:
# the ID's of the patients extracted from the initial df
PatientIDs_names <- unique(unlist(lapply(df["PatientID"], as.character)))
n_of_patients <- length(PatientIDs_names)
n_monit_param <- (ncol(df_groupped) - 2)
# outputted periods are restricted for demonstration purposes
hours_to_monitor <- c(1:5)
records <- lapply(function(i) fill_form(periods = hours_to_monitor,
df_Patient = filter(df_groupped, PatientID == PatientIDs_names[i]), n_param = n_monit_param),
X = seq(along.with = PatientIDs_names))
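The result of the call above, `records`, is a list with one single-row data frame per patient. As a sketch of the remaining step, the rows can be stacked into one wide data frame. The toy list `records_demo` below is an assumption standing in for `records`, so that the snippet runs on its own.

```r
# Toy stand-in for `records`: one-row data frames with identical columns
records_demo <- list(
  data.frame(PatientID = "1", hour1.has.fever = 1L, stringsAsFactors = FALSE),
  data.frame(PatientID = "2", hour1.has.fever = 0L, stringsAsFactors = FALSE)
)

# Stack the rows: one row per patient, one column per hour/parameter pair
wide_demo <- do.call(rbind, records_demo)
```

The same `do.call(rbind, ...)` call should work on `records`, because `fill_form` gives every patient the same column names.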
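I hope this helps. However, I am unsure about two things:
1) In the example output, hour2.fev and hour2.BP are both 0, so why is hour2.fev&vent equal to 1?
2) Why is high.bloodpressure 0 for PatientID == 1 in the second time span? high.bloodpressure == 1 at time 1.2. That time should fall within the second time span (hour 2, between 1 and 2), shouldn't it?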
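The second point can be checked directly against patient 1's rows from the question. The sketch below is plain base R and assumes the same stratification as above, that is, hour 2 covers times greater than 1 and up to 2.

```r
# Patient 1's measurements, copied from the question
time  <- c(0, 0.5, 1.2, 2)
fever <- c(1, 1, NA, 0)
vent  <- c(1, 0, 1, 1)
bp    <- c(1, 0, 1, 0)

# Rows that fall into hour 2, i.e. times in (1, 2]
in_hour2 <- ceiling(time) == 2

# Flag is 1 if the parameter was observed at least once in hour 2 (NAs ignored)
hour2 <- sapply(
  list(fev = fever, vent = vent, BP = bp),
  function(x) as.integer(sum(x[in_hour2], na.rm = TRUE) > 0)
)
```

Under this stratification `hour2[["BP"]]` is 1, because of the reading at time 1.2, which is why the 0 in the expected output looks inconsistent.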