Subsetting data by group to get duplicate records

Asked: 2015-11-12 21:53:13

Tags: r duplicates subset

I have a sample dataset that looks like this:

PID=as.numeric(101:110)
Subject_ID=c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4)
status=c("Active","Active", "Active", "Active", "Active", "Active", "Withdrawn", "Withdrawn", "Withdrawn", "Active")
mydata=data.frame(PID, Subject_ID, status)

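I basically want to create three subsets that keep all of the source data (including the repeated Subject_IDs): 1) Subject_IDs whose statuses are all Active, 2) Subject_IDs whose statuses are all Withdrawn, 3) Subject_IDs with both Active and Withdrawn statuses. All rows for a given Subject_ID must end up in the same group.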

My idea so far is to create a dummy variable for status and sum that derived variable within each Subject_ID to tell the groups apart.

mydata$Status_Code [((mydata$status=="Active"))] <- "0" #11414
mydata$Status_Code [((mydata$status=="Withdrawn"))] <- "1" #386
mydata$Status_Code=as.numeric(mydata$Status_Code)

library(data.table)
DT=data.table(mydata)
DT[, Flag := as.numeric((sum(Status_Code))), by=Subject_ID]
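For what it's worth, here is a rough sketch of how the per-subject sum could be turned into the three subsets. It compares the number of Withdrawn rows (Flag) with the total number of rows for each subject. The names N_rows, active_dt, withdrawn_dt and mixed_dt are made up for illustration, and it assumes status only ever takes the values "Active" and "Withdrawn".

```r
library(data.table)

# Rebuild the sample data from the question
mydata <- data.frame(
  PID = 101:110,
  Subject_ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  status = c("Active", "Active", "Active", "Active", "Active",
             "Active", "Withdrawn", "Withdrawn", "Withdrawn", "Active"),
  stringsAsFactors = FALSE
)

DT <- data.table(mydata)

# Flag   = number of "Withdrawn" rows for the subject
# N_rows = total number of rows for the subject (hypothetical helper column)
DT[, `:=`(Flag = sum(status == "Withdrawn"), N_rows = .N), by = Subject_ID]

active_dt    <- DT[Flag == 0]                 # no Withdrawn rows at all
withdrawn_dt <- DT[Flag == N_rows]            # every row is Withdrawn
mixed_dt     <- DT[Flag > 0 & Flag < N_rows]  # both statuses present
```

Because Flag and N_rows are constant within a Subject_ID, each subject's rows always land together in exactly one of the three tables.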

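Here is the output I want: three separate tables (all rows for a Subject_ID must be in the same subset, and each Subject_ID must appear in only one subset).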

First table:

PID=c(101, 102, 103, 104, 110)
Subject_ID=c(1, 1, 1, 1,  4)
status=c("Active","Active", "Active", "Active", "Active")
active=data.frame(PID, Subject_ID, status)

Second table:

PID=c(108, 109)
Subject_ID=c(3, 3)
status=c("Withdrawn", "Withdrawn")
withdrawn=data.frame(PID, Subject_ID, status)

Third table:

PID=c(105, 106, 107)
Subject_ID=c(2, 2, 2)
status=c("Active","Active", "Withdrawn")
mixed=data.frame(PID, Subject_ID, status)

I suspect there is a simpler way to create these subsets than summing a dummy variable. Any ideas?
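One possible direction without a dummy variable, sketched in base R only: label every row with its subject's overall pattern using ave(), then split() on that label. The names classify, group and subsets are hypothetical, and the sketch again assumes status is a character column containing only "Active" and "Withdrawn".

```r
# Rebuild the sample data from the question
mydata <- data.frame(
  PID = 101:110,
  Subject_ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  status = c("Active", "Active", "Active", "Active", "Active",
             "Active", "Withdrawn", "Withdrawn", "Withdrawn", "Active"),
  stringsAsFactors = FALSE
)

# Classify one subject's vector of statuses and repeat the label per row
classify <- function(s) {
  label <- if (all(s == "Active")) {
    "active"
  } else if (all(s == "Withdrawn")) {
    "withdrawn"
  } else {
    "mixed"
  }
  rep(label, length(s))
}

# ave() applies classify() within each Subject_ID and keeps row order
group <- ave(mydata$status, mydata$Subject_ID, FUN = classify)

# One data frame per label: subsets$active, subsets$withdrawn, subsets$mixed
subsets <- split(mydata, group)
```

The label is computed per Subject_ID, so a subject cannot be spread across more than one subset.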

0 answers:

No answers yet