I have a sample dataset, shown below:
PID=101:110
Subject_ID=c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4)
status=c("Active","Active", "Active", "Active", "Active", "Active", "Withdrawn", "Withdrawn", "Withdrawn", "Active")
mydata=data.frame(PID, Subject_ID, status)
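I basically want to create three subsets that together contain all the source data (including repeated Subject_IDs): 1) Subject_IDs whose statuses are all Active, 2) Subject_IDs whose statuses are all Withdrawn, and 3) Subject_IDs that have both Active and Withdrawn statuses. All rows for a given Subject_ID must end up in the same group.
My idea so far is to create a dummy variable for status and sum that derived variable within each Subject_ID to tell the groups apart.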
mydata$Status_Code [((mydata$status=="Active"))] <- "0" #11414
mydata$Status_Code [((mydata$status=="Withdrawn"))] <- "1" #386
mydata$Status_Code=as.numeric(mydata$Status_Code)
library(data.table)
DT=data.table(mydata)
DT[, Flag := as.numeric((sum(Status_Code))), by=Subject_ID]
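For illustration, the summed flag could be turned into the three subsets roughly as follows. This is only a sketch under a few assumptions: it counts Withdrawn rows directly instead of going through Status_Code, it builds the sample data with PID as 101:110, and the Group column name is made up for the example.

```r
library(data.table)

# Sample data from above
DT <- data.table(
  PID = 101:110,
  Subject_ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  status = c("Active", "Active", "Active", "Active", "Active",
             "Active", "Withdrawn", "Withdrawn", "Withdrawn", "Active")
)

# Number of Withdrawn rows per subject (same role as the summed dummy variable)
DT[, Flag := sum(status == "Withdrawn"), by = Subject_ID]

# Label each subject: no Withdrawn rows, only Withdrawn rows, or a mix.
# "Group" is a hypothetical column name used only in this sketch.
DT[, Group := if (Flag[1] == 0) "active"
              else if (Flag[1] == .N) "withdrawn"
              else "mixed",
   by = Subject_ID]

# One table per group, keeping only the original columns
active    <- DT[Group == "active",    .(PID, Subject_ID, status)]
withdrawn <- DT[Group == "withdrawn", .(PID, Subject_ID, status)]
mixed     <- DT[Group == "mixed",     .(PID, Subject_ID, status)]
```

Comparing the flag with the group size (.N) is what separates "all withdrawn" from "mixed"; the sum alone cannot do that, because subjects have different numbers of rows.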
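This is the output I want: three separate tables (all rows for a subject_id must be in the same subset, and each subject_id appears in only one subset).
First table: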
PID=c(101, 102, 103, 104, 110)
Subject_ID=c(1, 1, 1, 1, 4)
status=c("Active","Active", "Active", "Active", "Active")
active=data.frame(PID, Subject_ID, status)
Second table:
PID=c(108, 109)
Subject_ID=c(3, 3)
status=c("Withdrawn", "Withdrawn")
withdrawn=data.frame(PID, Subject_ID, status)
Third table:
PID=c(105, 106, 107)
Subject_ID=c(2, 2, 2)
status=c("Active","Active", "Withdrawn")
mixed=data.frame(PID, Subject_ID, status)
I suspect there is a simpler way to create these subsets than summing a dummy variable. Any ideas?