I have the following dataset:
PatientName BVAID Rank TreatmentCode TreatmentID DoseID
Tim Stuart BVA-027 3 OP_TBC 1 1
Tim Stuart BVA-041 4 OP_TBC 1 1
Tim Stuart BVA-021 7 OP_TBC 1 1
Tim Stuart BVA-048 10 OP_TBC 1 1
Tim Stuart BVA-020 14 OP_TBC 1 1
Tim Stuart BVA-024 15 OP_TBC 1 1
Tim Stuart BVA-001 16 OP_TBC 1 1
Tim Stuart BVA-013 27 OP_TBC 1 1
Tim Stuart BVA-018 28 OP_TBC 1 1
Tim Stuart BVA-051 29 OP_TBC 1 1
Tim Stuart BVA-027 3 OP_TC 2 1
Tim Stuart BVA-041 4 OP_TC 2 1
Tim Stuart BVA-048 10 OP_TC 2 1
Tim Stuart BVA-020 14 OP_TC 2 1
Tim Stuart BVA-001 16 OP_TC 2 1
Tim Stuart BVA-002 17 OP_TC 2 1
Tim Stuart BVA-019 18 OP_TC 2 1
Tim Stuart BVA-044 22 OP_TC 2 1
Tim Stuart BVA-025 23 OP_TC 2 1
Tim Stuart BVA-016 26 OP_TC 2 1
Tim Stuart BVA-013 27 OP_TC 2 1
Tim Stuart BVA-001 16 OP_SICO 3 1
Tim Stuart BVA-002 17 OP_SICO 3 1
Tim Stuart BVA-013 27 OP_SICO 3 1
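I need to output the record with the minimum Rank in each TreatmentID group. However, if that Rank was already output for a prior TreatmentID group, I need to select the next Rank and output that record for the TreatmentID group instead. I only need one record per TreatmentID group.
This needs to be a scalable solution that I can automate.
My output file will have only three unique records, i.e. one record per group; each record is unique in BVAID and has the minimum available Rank in its group.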
PatientName BVAID Rank TreatmentCode TreatmentID DoseID
Tim Stuart BVA-027 3 OP_TBC 1 1
Tim Stuart BVA-041 4 OP_TC 2 1
Tim Stuart BVA-001 16 OP_SICO 3 1
Which program handles this better, SAS or R?
Answer 0 (score: 13)
A compact, scalable, and readable R solution:
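library(data.table)
DT = as.data.table(dat) # dat input from Brian's answer
r = 0 # Rank picked for the previous TreatmentID; assumes every Rank is > 0
# Groups are visited in order of appearance, so dat must already be ordered by TreatmentID.
# Each group takes the smallest Rank above the previous pick and stores it in r via <<-.
DT[, {r <<- min(Rank[Rank > r]); .SD[Rank == r]}, by = TreatmentID]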
TreatmentID PatientName BVAID Rank TreatmentCode DoseID
[1,] 1 Tim Stuart BVA-027 3 OP_TBC 1
[2,] 2 Tim Stuart BVA-041 4 OP_TC 1
[3,] 3 Tim Stuart BVA-001 16 OP_SICO 1
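For readers without data.table, the same idea (carry the last chosen Rank from one TreatmentID group to the next) can be sketched in base R. This is only an illustration: the function name `pick_increasing` and the two-column toy data frame are assumptions made here, and the sketch assumes that all ranks are positive.

```r
# Toy data: only the TreatmentID and Rank columns from the question.
dat <- data.frame(
  TreatmentID = rep(1:3, times = c(10, 11, 3)),
  Rank = c(3, 4, 7, 10, 14, 15, 16, 27, 28, 29,
           3, 4, 10, 14, 16, 17, 18, 22, 23, 26, 27,
           16, 17, 27)
)

# Hypothetical helper: for each TreatmentID (in ascending order), keep the
# row with the smallest Rank that is strictly above the previous pick.
pick_increasing <- function(d) {
  r <- 0                 # last chosen Rank; assumes all ranks are > 0
  picked <- integer(0)   # row indices of the chosen records
  for (id in sort(unique(d$TreatmentID))) {
    idx <- which(d$TreatmentID == id & d$Rank > r)
    if (length(idx) == 0) next          # no eligible row left in this group
    best <- idx[which.min(d$Rank[idx])]
    r <- d$Rank[best]
    picked <- c(picked, best)
  }
  d[picked, ]
}

res <- pick_increasing(dat)
print(res)
```

Like the data.table version, this compares against the previous pick only ("greater than the last chosen Rank"), which is a slightly stronger condition than "Rank not used before".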
Answer 1 (score: 5)
Here is an R solution. I would really like to know whether there is a more compact way to do this.
library(plyr)
df <- df[order(df$PatientName, df$TreatmentID),]
ddply(df, .(PatientName), function(DF) {
    # For each Treatment, find the value of Rank to be kept
    splitRanks <- split(DF$Rank, DF$TreatmentID)
    minRanks <- Reduce(f = function(X, Y) min(Y[Y>min(X)]),
                       x = splitRanks[-1],
                       init = min(splitRanks[[1]]), accumulate = TRUE)
    # For each Treatment, extract row w/ Rank determined by the calculation above
    splitDF <- split(DF, DF$TreatmentID)
    rows <- mapply(FUN = function(X, Y) X[X$Rank==Y,], splitDF, minRanks,
                   SIMPLIFY = FALSE)
    # Bind the extracted rows back together in a data frame
    do.call("rbind", rows)
})
# PatientName BVAID Rank TreatmentCode TreatmentID DoseID
# 1 Tim Stuart BVA-027 3 OP_TBC 1 1
# 2 Tim Stuart BVA-041 4 OP_TC 2 1
# 3 Tim Stuart BVA-001 16 OP_SICO 3 1
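The key step above is the `Reduce(..., accumulate = TRUE)` call, which threads the previously chosen Rank through the groups. A minimal sketch of just that step, using a shortened, made-up list of ranks per treatment (the names `split_ranks` and `min_ranks` are illustrative only):

```r
# Ranks per TreatmentID, shortened for illustration.
split_ranks <- list(`1` = c(3, 4, 7), `2` = c(3, 4, 10), `3` = c(16, 17, 27))

# Start from the smallest Rank of the first group, then for each later group
# take the smallest Rank that is strictly greater than the previous result.
min_ranks <- Reduce(
  f = function(prev, ranks) min(ranks[ranks > prev]),
  x = split_ranks[-1],
  init = min(split_ranks[[1]]),
  accumulate = TRUE
)

print(min_ranks)
```

With `accumulate = TRUE`, `Reduce` returns every intermediate result rather than only the last one, so there is one chosen Rank per treatment.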
Answer 2 (score: 5)
My SAS solution. All steps are scalable:
data test;
input PatientName $ 1-10
BVAID $
Rank
TreatmentCode $
TreatmentID
DoseID
;
datalines;
Tim Stuart BVA-027 3 OP_TBC 1 1
Tim Stuart BVA-041 4 OP_TBC 1 1
Tim Stuart BVA-021 7 OP_TBC 1 1
Tim Stuart BVA-048 10 OP_TBC 1 1
Tim Stuart BVA-020 14 OP_TBC 1 1
Tim Stuart BVA-024 15 OP_TBC 1 1
Tim Stuart BVA-001 16 OP_TBC 1 1
Tim Stuart BVA-013 27 OP_TBC 1 1
Tim Stuart BVA-018 28 OP_TBC 1 1
Tim Stuart BVA-051 29 OP_TBC 1 1
Tim Stuart BVA-027 3 OP_TC 2 1
Tim Stuart BVA-041 4 OP_TC 2 1
Tim Stuart BVA-048 10 OP_TC 2 1
Tim Stuart BVA-020 14 OP_TC 2 1
Tim Stuart BVA-001 16 OP_TC 2 1
Tim Stuart BVA-002 17 OP_TC 2 1
Tim Stuart BVA-019 18 OP_TC 2 1
Tim Stuart BVA-044 22 OP_TC 2 1
Tim Stuart BVA-025 23 OP_TC 2 1
Tim Stuart BVA-016 26 OP_TC 2 1
Tim Stuart BVA-013 27 OP_TC 2 1
Tim Stuart BVA-001 16 OP_SICO 3 1
Tim Stuart BVA-002 17 OP_SICO 3 1
Tim Stuart BVA-013 27 OP_SICO 3 1
;
run;
proc sort data=test;
by treatmentid;
run;
data test2;
  set test;
  by treatmentid;
  length smallest_bvaid $ 8;
  retain smallest smallest_bvaid;
  **
  ** CREATE AN EMPTY HASH TABLE THAT WE CAN STORE A LIST OF
  ** RANKS IN THAT HAVE ALREADY BEEN USED. DONE THIS WAY FOR
  ** SCALABILITY.
  *;
  if _n_ eq 1 then do;
    declare hash ht();
    ht.definekey ('rank');
    ht.definedone();
  end;
  if first.treatmentid then do;
    smallest = .;
    smallest_bvaid = ' ';
  end;
  **
  ** IF THE CURRENT RANK HAS NOT ALREADY BEEN USED THEN
  ** EVALUATE IT TO SEE IF IT IS THE SMALLEST VALUE. ALSO
  ** REMEMBER THE BVAID OF THAT ROW SO THE OUTPUT RECORD
  ** DOES NOT CARRY THE BVAID OF THE LAST ROW IN THE GROUP.
  *;
  if ht.find() ne 0 then do;
    if missing(smallest) or rank < smallest then do;
      smallest = rank;
      smallest_bvaid = bvaid;
    end;
  end;
  **
  ** SAVE THE SMALLEST UNUSED RANK AND ITS BVAID BACK TO THE
  ** RECORD. THEN ADD THE RANK TO THE HASH TABLE AND FINALLY
  ** OUTPUT THE OBSERVATION.
  *;
  if last.treatmentid then do;
    rank = smallest;
    bvaid = smallest_bvaid;
    ht.add();
    output;
  end;
  drop smallest smallest_bvaid;
run;
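Does SAS win? JK! ;-)
Answer 3 (score: 3)
Here is another R solution. What makes this problem harder than most is that it cannot be treated as a split-apply-combine problem: the row to select depends not only on all the rows with a given TreatmentID, but also on which row was chosen for the previous (assuming that means the next smallest) TreatmentID.
First, the data in a pasteable form (in case anyone else wants to take a crack at it):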
dat <-
structure(list(PatientName = c("Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart"), BVAID = c("BVA-027", "BVA-041", "BVA-021", "BVA-048",
"BVA-020", "BVA-024", "BVA-001", "BVA-013", "BVA-018", "BVA-051",
"BVA-027", "BVA-041", "BVA-048", "BVA-020", "BVA-001", "BVA-002",
"BVA-019", "BVA-044", "BVA-025", "BVA-016", "BVA-013", "BVA-001",
"BVA-002", "BVA-013"), Rank = c(3L, 4L, 7L, 10L, 14L, 15L, 16L,
27L, 28L, 29L, 3L, 4L, 10L, 14L, 16L, 17L, 18L, 22L, 23L, 26L,
27L, 16L, 17L, 27L), TreatmentCode = c("OP_TBC", "OP_TBC", "OP_TBC",
"OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC",
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC",
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_SICO", "OP_SICO", "OP_SICO"
), TreatmentID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), DoseID = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("PatientName", "BVAID",
"Rank", "TreatmentCode", "TreatmentID", "DoseID"), class = "data.frame",
row.names = c(NA, -24L))
Now my solution:
matches <- dat[0,]
TreatmentIDs <- sort(unique(dat$TreatmentID))
for (TreatmentIDidx in seq_along(TreatmentIDs)) {
    TreatmentID <- TreatmentIDs[TreatmentIDidx]
    treat.flg <- dat$TreatmentID == TreatmentID
    match <- dat[treat.flg &
                 dat$Rank == min(setdiff(dat$Rank[treat.flg],
                                         matches$Rank[matches$TreatmentID ==
                                                      TreatmentIDs[TreatmentIDidx-1]])),]
    matches <- rbind(matches, match)
}
which gives the desired result:
> matches
PatientName BVAID Rank TreatmentCode TreatmentID DoseID
1 Tim Stuart BVA-027 3 OP_TBC 1 1
12 Tim Stuart BVA-041 4 OP_TC 2 1
22 Tim Stuart BVA-001 16 OP_SICO 3 1
My SAS is rusty, and I don't have a copy here to try it with right now, so I'll leave it to others to write a SAS solution to compare this with.
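Note that the loop above excludes only the Rank chosen for the immediately preceding TreatmentID. If the requirement is read as "never reuse any Rank that was already output", a variation that remembers every used Rank might look like the sketch below. The function name `pick_unused` and the toy data are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: one row per TreatmentID, never reusing a Rank that
# was already chosen for ANY earlier TreatmentID.
pick_unused <- function(d) {
  used <- numeric(0)   # all ranks output so far
  out <- d[0, ]        # empty result with the same columns as d
  for (id in sort(unique(d$TreatmentID))) {
    grp <- d[d$TreatmentID == id, ]
    candidates <- setdiff(grp$Rank, used)
    if (length(candidates) == 0) next   # every rank in this group is taken
    r <- min(candidates)
    used <- c(used, r)
    out <- rbind(out, grp[grp$Rank == r, ][1, ])
  }
  out
}

# Toy data where the two readings differ: excluding only the previous pick
# would select Rank 3 again for TreatmentID 3.
toy <- data.frame(
  TreatmentID = c(1, 2, 2, 3, 3, 3),
  Rank        = c(3, 3, 4, 3, 4, 5)
)

res <- pick_unused(toy)
print(res)
```

On the dataset from the question both readings select the same three rows, so the difference only matters for data like the toy example.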
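Answer 4 (score: 2)
My solution.
Assume you have the data set (test) and that it is sorted the way you show it here (by patient name, then treatment, then rank). This code works when there are multiple patient names and assumes the steps are performed for each patient name (if you do not need that level, remove every reference to patient name).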
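%macro m1();
  /* Append the current first row of test to the result data set new. */
  %begin: proc append base=new data=test(firstobs=1 obs=1);
  run;
  /* Store the rank, treatment and patient of that row in macro variables. */
  data _null_;
    set test(firstobs=1 obs=1);
    call symput('r', rank);
    call symput('id',Treatmentid);
    call symput('name',patientname);
  run;
  /* Drop every row of this patient that shares the chosen rank or treatment. */
  data test;
    set test;
    if (rank=&r or Treatmentid=&id) and patientname=symget('name') then delete;
  run;
  /* Count the remaining rows and repeat until none are left. */
  %let dsid=%sysfunc(open(test));
  %let nobs=%sysfunc(attrn(&dsid,nobs));
  %let rc=%sysfunc(close(&dsid));
  %if &nobs^=0 %then %goto begin;
%mend;
%m1(); run;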
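For comparison with the R answers, the elimination idea behind this macro (take the first remaining row, then delete every row that shares its Rank or its TreatmentID, and repeat) can be sketched in R. This is a rough illustration for a single patient; the function name `eliminate` and the two-column toy data frame are assumptions made here.

```r
# Toy data: only the TreatmentID and Rank columns from the question.
dat <- data.frame(
  TreatmentID = rep(1:3, times = c(10, 11, 3)),
  Rank = c(3, 4, 7, 10, 14, 15, 16, 27, 28, 29,
           3, 4, 10, 14, 16, 17, 18, 22, 23, 26, 27,
           16, 17, 27)
)

# Hypothetical helper mirroring the macro loop for one patient.
eliminate <- function(d) {
  d <- d[order(d$TreatmentID, d$Rank), ]   # same sort order the macro assumes
  out <- d[0, ]
  while (nrow(d) > 0) {
    first <- d[1, ]                        # keep the first remaining row
    out <- rbind(out, first)
    # remove rows with the same Rank or the same TreatmentID
    d <- d[d$Rank != first$Rank & d$TreatmentID != first$TreatmentID, ]
  }
  out
}

res <- eliminate(dat)
print(res)
```

Because each pass removes the chosen Rank from all remaining groups, this version never reuses a Rank, at the cost of rewriting the working data once per treatment.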