My data is in the following format:
ID EVID ADMIT DC DRG CLIN_C PRIN_DX
1 AA 1/1/13 2/1/13 ABC 1A234 Y
1 AA 1/1/13 2/1/13 ABC 1B345 N
1 AA 1/1/13 2/1/13 ABC 1C234 N
1 AA 1/1/13 2/1/13 ABC 1234C N
1 BB 3/1/13 2/15/13 EEE C12C3 Y
1 BB 3/1/13 2/15/13 EEE 1B345 N
1 BB 3/1/13 2/15/13 EEE 1C234 N
1 BB 3/1/13 2/15/13 EEE 987D N
2 CC 3/1/13 2/15/13 EEE C12C3 Y
2 CC 3/1/13 2/15/13 EEE 546X N
2 CC 3/1/13 2/15/13 EEE 1C234 N
2 CC 3/1/13 2/15/13 EEE 1234C N
I would like the data in the following format:
ID EVID ADMIT DC DRG PRIN_DX 1B345 1C234 1234C 987D 546X
1 AA 1/1/13 2/1/13 ABC 1A234 1 1 1 0 0
1 BB 3/1/13 2/15/13 EEE C12C3 1 1 0 1 0
2 CC 3/1/13 2/15/13 EEE C12C3 0 1 0 0 1
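If possible, I would like to do this in R. I have tried reshape / reshape2, but I could not find an obvious way to handle the grouped rows: splitting the grouped rows into columns and aggregating the remaining rows.
The data is admission records from a few hundred hospitals, so it is reasonably sized.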
Answer 0 (score: 5)
Try this, assuming DF is the input data frame:
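library(reshape2)
# For the row indices i of one (ID, EVID) group, return the CLIN_C code flagged "Y"
FUN <- function(i) with(DF[i, ], as.character(CLIN_C[PRIN_DX == "Y"]))
# Overwrite the Y/N flag with the group's principal diagnosis code on every row
DF$PRIN_DX <- ave(seq_len(nrow(DF)), DF$ID, DF$EVID, FUN = FUN)
# One row per admission, one count column per CLIN_C code
dcast(DF, ID + EVID + ADMIT + DC + DRG + PRIN_DX ~ CLIN_C, fun.aggregate = length, value.var = "CLIN_C")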
This gives:
ID EVID ADMIT DC DRG PRIN_DX 1234C 1A234 1B345 1C234 546X 987D C12C3
1 1 AA 1/1/13 2/1/13 ABC 1A234 1 1 1 1 0 0 0
2 1 BB 3/1/13 2/15/13 EEE C12C3 0 0 1 1 0 1 1
3 2 CC 3/1/13 2/15/13 EEE C12C3 1 0 0 1 1 0 1
Update: simplified.
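The ave() step may be the least obvious part of this answer, so here is a minimal sketch of it on made-up vectors. The names g, code, and flag are hypothetical stand-ins for the grouping key, CLIN_C, and PRIN_DX; they are not taken from the question.

```r
# Hypothetical toy data: two groups, one row flagged "Y" in each
g    <- c("a", "a", "b")
code <- c("X1", "X2", "X3")
flag <- c("N", "Y", "Y")

# ave() splits the row indices by group, applies the function to each
# subset of indices, and writes the result back on every row of that group
res <- ave(seq_along(code), g, FUN = function(i) code[i][flag[i] == "Y"])
print(res)  # "X2" "X2" "X3"
```

Passing row indices rather than a data column lets the function look at several columns at once, which is why the answer passes the row numbers of DF.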
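Answer 1 (score: 1)
Another approach is to use plyr and model.matrix to coerce the factor into dummy variables. I simplified the data and assumed there is always exactly one PRIN_DX per group.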
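library(plyr)
# Simplified example data: two diagnosis rows per (ID, EVID) admission
df <- data.frame(ID = c(1, 1, 2, 2, 3, 3), EVID = c(0, 0, 1, 1, 3, 3),
                 CLIN_C = c('A1', 'B1', 'C1', 'D1', 'C1', 'D2'),
                 PRIN_DX = c('Y', 'N', 'Y', 'N', 'Y', 'N'))
df$CLIN_C <- factor(df$CLIN_C)
agg_fun <- function(x) {
  # Code of the first row flagged as the principal diagnosis
  temp1 <- x$CLIN_C[which(x$PRIN_DX == 'Y')[1]]
  # One dummy column per factor level, summed over the rows of the group
  temp2 <- colSums(model.matrix(~ CLIN_C - 1, data = x))
  out <- data.frame(temp1, t(temp2))
  names(out) <- c('PRIN_DX', levels(x$CLIN_C))
  out
}
# Apply agg_fun to each (ID, EVID) group and stack the one-row results
ddply(df, .(ID, EVID), agg_fun)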
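To see what the model.matrix step contributes, here is a small sketch on a hypothetical three-element factor f (not part of the answer's data). Dropping the intercept with "- 1" yields one 0/1 column per level, and summing the columns gives the per-level counts.

```r
# Hypothetical factor with a repeated level
f <- factor(c("A1", "B1", "A1"))

# No intercept, so every level gets its own indicator column
mm <- model.matrix(~ f - 1)
print(colnames(mm))  # "fA1" "fB1"

# Column sums count how often each level occurs
counts <- colSums(mm)
print(counts)        # fA1 = 2, fB1 = 1
```

Because the factor keeps all of its levels inside each group, levels that are absent from a group still get a column, filled with 0.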
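Answer 2 (score: 1)
I noticed that in the original question the principal diagnosis codes (the rows with PRIN_DX == "Y") do not appear among the indicator columns of the desired output. So here is an option that uses plyr and reshape2 to get that result.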
library(reshape2)
library(plyr)
# df is assumed to be the full input data frame from the question
# Make a variable specifically for the principal diagnosis
df2 <- ddply(df, .(ID, EVID, ADMIT, DC, DRG), transform, PRIN_DX2 = CLIN_C[PRIN_DX == "Y"])
# Keep only the non-principal diagnoses; principal rows become NA
df2$CLIN_C <- ifelse(df2$PRIN_DX == "N", as.character(df2$CLIN_C), NA)
# Make the order of CLIN_C match the order of appearance
df2$CLIN_C <- factor(df2$CLIN_C, levels = unique(df2$CLIN_C))
# Drop the NA (principal) rows, then count each remaining code per admission
dcast(na.omit(df2), ID + EVID + ADMIT + DC + DRG + PRIN_DX2 ~ CLIN_C, fun.aggregate = length, value.var = "CLIN_C")
This gives:
ID EVID ADMIT DC DRG PRIN_DX2 1B345 1C234 1234C 987D 546X
1 1 AA 1/1/13 2/1/13 ABC 1A234 1 1 1 0 0
2 1 BB 3/1/13 2/15/13 EEE C12C3 1 1 0 1 0
3 2 CC 3/1/13 2/15/13 EEE C12C3 0 1 1 0 1
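For comparison, the same idea can be sketched in base R without plyr or reshape2. This is only an illustration under assumptions: DF is rebuilt here from the sample in the question, and adm, sec, counts, and wide are hypothetical names introduced for the sketch. The indicator columns come out in sorted order rather than order of appearance.

```r
# Sample data from the question, read as character columns (no factors)
DF <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID EVID ADMIT DC DRG CLIN_C PRIN_DX
1 AA 1/1/13 2/1/13 ABC 1A234 Y
1 AA 1/1/13 2/1/13 ABC 1B345 N
1 AA 1/1/13 2/1/13 ABC 1C234 N
1 AA 1/1/13 2/1/13 ABC 1234C N
1 BB 3/1/13 2/15/13 EEE C12C3 Y
1 BB 3/1/13 2/15/13 EEE 1B345 N
1 BB 3/1/13 2/15/13 EEE 1C234 N
1 BB 3/1/13 2/15/13 EEE 987D N
2 CC 3/1/13 2/15/13 EEE C12C3 Y
2 CC 3/1/13 2/15/13 EEE 546X N
2 CC 3/1/13 2/15/13 EEE 1C234 N
2 CC 3/1/13 2/15/13 EEE 1234C N
")

# One row per admission, carrying its principal diagnosis code
adm <- DF[DF$PRIN_DX == "Y", c("ID", "EVID", "ADMIT", "DC", "DRG", "CLIN_C")]
names(adm)[6] <- "PRIN_DX"

# Count the non-principal codes per admission
sec <- DF[DF$PRIN_DX == "N", ]
counts <- as.data.frame.matrix(table(paste(sec$ID, sec$EVID), sec$CLIN_C))

# Align the count rows with the admission rows by their "ID EVID" key
wide <- cbind(adm, counts[paste(adm$ID, adm$EVID), ])
rownames(wide) <- NULL
print(wide)
```

This sketch assumes every admission has exactly one row with PRIN_DX == "Y" and at least one row with "N"; an admission with no secondary codes would need its counts filled with zeros separately.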