In R

Time: 2018-02-12 11:10:36

Tags: r dataframe dplyr

I have been trying to find an answer, without any luck. I hope someone can help me. I have a set of patient data.

PatientID <- c('1', "1", "1","1", "2","2","2","2","3","3","3","3")
admission.duration.minutes <- c(0,0.5,1.2,2,0,2.5,3.6,8,0,4,22,24)
has.fever <- c(1,1,NA,0,1,NA,1,1,NA,0,1,NA)
on.ventilator<-c(1,0,1,1,0,1,0,1,NA,1,0,NA)
high.bloodpressure<-c(1,0,1,0,1,0,1,1,1,1,NA,1)
df <- data.frame(PatientID, admission.duration.minutes, has.fever,on.ventilator,high.bloodpressure)

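I want to reshape the dataset so that I have one row per patient, and I want to count how many patients had a fever in hour 1, were on a ventilator in hour 1, had high blood pressure in hour 1, and had each combination of fever, ventilator and blood pressure in hour 1. The same goes for hours 2, 3, and so on.

So I believe I first need to add a time-stratum variable that defines hour 1, 2, 3, etc., where hour 1 = 0.0 to 1.0 and hour 2 is greater than 1.0 up to 2.0. Then I would do a conditional count or something similar.

I have tried using the publish package, but could not get the output right.

The output of the new data frame should look like this: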

PatientID       hour1.fev   hour1.vent  hour1.BP    hour1.fev&vent  hour1.fev&BP    
1               1           1           1           1               1
hour1.vent&BP   hour2.fev   hour2.vent  hour2.BP    hour2.fev&vent  hour2.fev&BP
1               0           1           0           1               1 
hour2.vent&BP
1
Can you help me?

1 answer:

Answer 0: (score: 1)

As a first approach, I would propose the following. First, group the data by patient and time span:

library("dplyr")
# definition of time spans
df$strata <- if_else(df$admission.duration.minutes == 0, 1, ceiling(df$admission.duration.minutes))
# note that NA measurements are silently treated as zeros here (na.rm = TRUE)
df_groupped <- df %>% group_by(PatientID, strata) %>% summarise_at(vars(has.fever:high.bloodpressure), 
    sum, na.rm = TRUE)
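
To make the stratum rule concrete, here is a minimal base-R sketch that applies it to the durations from the question and compares it with binning via `cut()`. It assumes the values in `admission.duration.minutes` are really hours, as the question's description implies; the names `duration`, `strata_ceiling`, `upper` and `strata_cut` are introduced here only for illustration.

```r
# Durations from the question (assumed to be hours, despite the column name)
duration <- c(0, 0.5, 1.2, 2, 0, 2.5, 3.6, 8, 0, 4, 22, 24)

# Same rule as in the answer: 0 belongs to hour 1, everything else is rounded up
strata_ceiling <- ifelse(duration == 0, 1, ceiling(duration))

# Equivalent binning with cut(): intervals [0,1], (1,2], (2,3], ...
upper <- max(1, ceiling(max(duration)))
strata_cut <- as.integer(cut(duration, breaks = 0:upper, include.lowest = TRUE))

# Both columns should agree row by row
print(data.frame(duration, strata_ceiling, strata_cut))
```

Both variants put an observation that falls exactly on a boundary (for example 2.0) into the earlier hour, which matches "greater than 1.0 up to 2.0" in the question.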

If we want to handle NA values differently, the solution could be

# the result is NA only if all values of a parameter in the stratum are NA
# (funs() is deprecated since dplyr 0.8, so a formula lambda is used instead)
df_groupped <- df %>% group_by(PatientID, strata) %>% 
    summarise_at(.vars = vars(has.fever:high.bloodpressure), 
        .funs = ~ if (all(is.na(.x))) NA_real_ else sum(.x, na.rm = TRUE))
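
The difference between the two NA treatments is easiest to see on small vectors. The following base-R sketch uses a hypothetical helper, `sum_or_na`, that implements the "NA only if everything is NA" rule described above; it is not part of dplyr.

```r
# Hypothetical helper: NA if every value is missing, otherwise the sum of the rest
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

all_missing  <- c(NA, NA)
some_missing <- c(1, NA, 0)

# Plain sum with na.rm = TRUE cannot tell "no data" apart from "no events"
print(sum(all_missing, na.rm = TRUE))   # 0
print(sum_or_na(all_missing))           # NA
print(sum_or_na(some_missing))          # 1
```

Which behaviour is preferable depends on whether a stratum with no recorded measurement should count as "condition absent" or as "unknown".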

Thus, we obtain the grouped data frame in long format:

# transform numbers of measurements to 0/1 flags
df_groupped <- df_groupped %>% mutate(
    has.fever = as.integer(as.logical(has.fever)),
    on.ventilator = as.integer(as.logical(on.ventilator)),
    high.bloodpressure = as.integer(as.logical(high.bloodpressure)),
    # ".and." means `*` (logical AND on 0/1 flags) instead of `+`
    fev.and.BP = as.integer(as.logical(has.fever * high.bloodpressure)),
    fev.and.vent = as.integer(as.logical(has.fever * on.ventilator))
)
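
Multiplying 0/1 flags behaves like a logical AND, but it treats missing values differently from `&`. The short base-R sketch below uses made-up vectors (`fever`, `bp`) to show the difference; they are illustrative values, not the question's data.

```r
# Illustrative 0/1 flags with some missing values
fever <- c(1, 1, 0, NA, 0)
bp    <- c(1, 0, 0, 1, NA)

# Product: any NA operand makes the result NA, even when the other flag is 0
and_product <- fever * bp

# Logical AND: FALSE & NA is FALSE, because the result is already decided
and_logical <- as.integer(as.logical(fever) & as.logical(bp))

print(rbind(and_product, and_logical))
```

With the product, a combination is reported as unknown whenever either flag is unknown; with `&`, a known 0 is enough to report 0.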

Then create a function that produces a data frame of the desired structure:

fill_form <- function(periods, df_Patient, n_param){
    # names of the measured parameters (all columns except PatientID and strata)
    long_col_names <- names(df_Patient)[-(1:2)]
    # one block of column names per period, e.g. "hour1.has.fever"
    long_df_names <- sapply(periods, function(p) paste("hour", p, ".", long_col_names, sep = ""))
    # prepend the name of the first column, which holds the patient's ID
    long_df_names <- c(names(df_Patient)[1], long_df_names)
    long_df <- as.data.frame(matrix(NA, nrow = 1, ncol = 1 + length(periods) * n_param))
    names(long_df) <- long_df_names
    long_df[, 1] <- as.character(df_Patient[[1]][1])
    for (i in seq(along.with = periods)) {
        period_rows <- as.data.frame(filter(df_Patient, strata == periods[i]))
        if (nrow(period_rows) > 0) {
            # each period fills exactly n_param columns
            long_df[, (2 + n_param * (i - 1)):(1 + n_param * i)] <- period_rows[-(1:2)]
        }
    }
    return(long_df)
}

Then apply this function to each patient's data in turn:

# the IDs of the patients extracted from the initial df
PatientIDs_names <- unique(as.character(df$PatientID))
n_of_patients <- length(PatientIDs_names)
n_monit_param <- (ncol(df_groupped) - 2)
# outputted periods are restricted for demonstration purposes
hours_to_monitor <- c(1:5)
records <- lapply(PatientIDs_names, function(id) fill_form(periods = hours_to_monitor, 
    df_Patient = filter(df_groupped, PatientID == id), n_param = n_monit_param))
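
The question also asks how many patients had each condition in each hour. As a rough cross-check that does not depend on the dplyr code above, the following base-R sketch rebuilds the example data, reduces it to one 0/1 flag per patient and hour, and then counts patients per hour. It assumes the duration column holds hours and that NA counts as "condition absent"; `any_flag`, `long` and `counts` are names introduced here for illustration.

```r
# Example data from the question; the duration column is treated as hours (assumption)
df <- data.frame(
  PatientID = c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"),
  admission.duration.minutes = c(0, 0.5, 1.2, 2, 0, 2.5, 3.6, 8, 0, 4, 22, 24),
  has.fever = c(1, 1, NA, 0, 1, NA, 1, 1, NA, 0, 1, NA),
  on.ventilator = c(1, 0, 1, 1, 0, 1, 0, 1, NA, 1, 0, NA),
  high.bloodpressure = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 1, NA, 1)
)
df$strata <- ifelse(df$admission.duration.minutes == 0, 1,
                    ceiling(df$admission.duration.minutes))

# 1 if the condition was recorded at least once in the stratum, else 0 (NA ignored)
any_flag <- function(x) as.integer(sum(x, na.rm = TRUE) > 0)

# One row per patient and hour; na.pass keeps rows that contain NA values
long <- aggregate(cbind(has.fever, on.ventilator, high.bloodpressure) ~ PatientID + strata,
                  data = df, FUN = any_flag, na.action = na.pass)

# Number of patients with each condition, per hour
counts <- aggregate(cbind(has.fever, on.ventilator, high.bloodpressure) ~ strata,
                    data = long, FUN = sum)
print(counts)
```

If a single wide data frame is wanted from the `records` list instead, `do.call(rbind, records)` should stack the per-patient rows, provided every element has the same column names.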

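Hope this helps. However, I am unsure about two things:

1) In the output example, hour2.fev and hour2.BP are both 0, so why is hour2.fev&vent equal to 1?

2) Why is high.bloodpressure 0 in the second time span for PatientID == 1? high.bloodpressure == 1 at time 1.2. That time should be included in the second time span (hour 2, between 1 and 2), shouldn't it?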
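
The second point can be checked directly against the question's data. This is a small base-R sketch restricted to patient 1; the data frame `p1` and the column name `hours` are introduced here, and the durations are assumed to be hours.

```r
# Patient 1's rows from the question (durations assumed to be hours)
p1 <- data.frame(
  hours = c(0, 0.5, 1.2, 2),
  high.bloodpressure = c(1, 0, 1, 0)
)

# Observations that fall in hour 2, i.e. greater than 1 and up to 2
second_hour <- p1[p1$hours > 1 & p1$hours <= 2, ]
print(second_hour)

# TRUE if high blood pressure was recorded at least once in that hour
bp_in_hour2 <- any(second_hour$high.bloodpressure == 1, na.rm = TRUE)
print(bp_in_hour2)
```

Under the hour definition given in the question, the observation at 1.2 falls in hour 2, so hour2.BP for patient 1 would be 1 rather than 0.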