使用SAS和R对记录进行排序和输出

时间:2012-01-11 21:09:03

标签: r sas

我有以下数据集

PatientName BVAID   Rank    TreatmentCode   TreatmentID DoseID  
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TBC            1             1  
Tim Stuart  BVA-021 7   OP_TBC            1             1  
Tim Stuart  BVA-048 10  OP_TBC            1             1  
Tim Stuart  BVA-020 14  OP_TBC            1             1  
Tim Stuart  BVA-024 15  OP_TBC            1             1  
Tim Stuart  BVA-001 16  OP_TBC            1             1  
Tim Stuart  BVA-013 27  OP_TBC            1             1  
Tim Stuart  BVA-018 28  OP_TBC            1             1  
Tim Stuart  BVA-051 29  OP_TBC            1             1  
Tim Stuart  BVA-027 3   OP_TC             2             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-048 10  OP_TC             2             1  
Tim Stuart  BVA-020 14  OP_TC             2             1    
Tim Stuart  BVA-001 16  OP_TC             2             1  
Tim Stuart  BVA-002 17  OP_TC             2             1    
Tim Stuart  BVA-019 18  OP_TC             2             1  
Tim Stuart  BVA-044 22  OP_TC             2             1  
Tim Stuart  BVA-025 23  OP_TC             2             1  
Tim Stuart  BVA-016 26  OP_TC             2             1  
Tim Stuart  BVA-013 27  OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  
Tim Stuart  BVA-002 17  OP_SICO           3             1  
Tim Stuart  BVA-013 27  OP_SICO           3             1  

我需要在每个rank组中输出最小TreatmentID的记录,但如果记录是在上一个TreatmentID组中输出的,我需要选择下一个rank并输出TreamtmentID组的记录 - 我每个TreatmentID组只需要一条记录。 这需要是一个可以自动化的可扩展解决方案。 我的输出文件只有树唯一记录,即每个组一个记录,每个记录在BVAID中是唯一的,并且在该组中的排名最小。

PatientName BVAID   Rank    TreatmentCode   TreatmentID DoseID  
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  

哪个程序可以处理这个更好的SAS或R

5 个答案:

答案 0 :(得分:13)

紧凑,可扩展且可读的R解决方案:

require(data.table)
DT = as.data.table(dat)   # dat input from Brian's answer
r = 0
DT[,{r<<-min(Rank[Rank>r]); .SD[Rank==r]}, by=TreatmentID]

     TreatmentID PatientName   BVAID Rank TreatmentCode DoseID
[1,]           1  Tim Stuart BVA-027    3        OP_TBC      1
[2,]           2  Tim Stuart BVA-041    4         OP_TC      1
[3,]           3  Tim Stuart BVA-001   16       OP_SICO      1

答案 1 :(得分:5)

这是一个R解决方案。我真的很想知道是否有比这更紧凑的方法。

library(plyr)

df <- df[order(df$PatientName, df$TreatmentID),]

ddply(df, .(PatientName), function(DF) {
    # For each Treatment, find the value of Rank to be kept    
    splitRanks <- split(DF$Rank, DF$TreatmentID)
    minRanks <- Reduce(f = function(X, Y) min(Y[Y>min(X)]), 
                       x = splitRanks[-1], 
                       init = min(splitRanks[[1]]), accumulate = TRUE)
    # For each Treatment, extract row w/ Rank determined by the calculation above
    splitDF <- split(DF, DF$TreatmentID)
    rows <- mapply(FUN = function(X, Y) X[X$Rank==Y,], splitDF, minRanks, 
                   SIMPLIFY = FALSE)
    # Bind the extracted rows back together in a data frame
    do.call("rbind", rows)
})

#   PatientName   BVAID Rank TreatmentCode TreatmentID DoseID
# 1  Tim Stuart BVA-027    3        OP_TBC           1      1
# 2  Tim Stuart BVA-041    4         OP_TC           2      1
# 3  Tim Stuart BVA-001   16       OP_SICO           3      1

答案 2 :(得分:5)

我的SAS解决方案。所有步骤都是可扩展的:

data test;
  input  PatientName $ 1-10
         BVAID $ 
         Rank    
         TreatmentCode $  
         TreatmentID 
         DoseID 
        ;
datalines;
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TBC            1             1  
Tim Stuart  BVA-021 7   OP_TBC            1             1  
Tim Stuart  BVA-048 10  OP_TBC            1             1  
Tim Stuart  BVA-020 14  OP_TBC            1             1  
Tim Stuart  BVA-024 15  OP_TBC            1             1  
Tim Stuart  BVA-001 16  OP_TBC            1             1  
Tim Stuart  BVA-013 27  OP_TBC            1             1  
Tim Stuart  BVA-018 28  OP_TBC            1             1  
Tim Stuart  BVA-051 29  OP_TBC            1             1  
Tim Stuart  BVA-027 3   OP_TC             2             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-048 10  OP_TC             2             1  
Tim Stuart  BVA-020 14  OP_TC             2             1    
Tim Stuart  BVA-001 16  OP_TC             2             1  
Tim Stuart  BVA-002 17  OP_TC             2             1    
Tim Stuart  BVA-019 18  OP_TC             2             1  
Tim Stuart  BVA-044 22  OP_TC             2             1  
Tim Stuart  BVA-025 23  OP_TC             2             1  
Tim Stuart  BVA-016 26  OP_TC             2             1  
Tim Stuart  BVA-013 27  OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  
Tim Stuart  BVA-002 17  OP_SICO           3             1  
Tim Stuart  BVA-013 27  OP_SICO           3             1  
;
run;

proc sort data=test;
  by treatmentid;
run;

data test2;
  set test;
  by treatmentid;
  retain smallest;

  **
  ** CREATE AN EMPTY HASH TABLE THAT WE CAN STORE A LIST OF 
  ** RANKS IN THAT HAVE ALREADY BEEN USED. DONE THIS WAY FOR
  ** SCALABILITY.
  *;
  if _n_ eq 1 then do;
    declare hash ht();
    ht.definekey ('rank');
    ht.definedone();
  end;

  if first.treatmentid then do;
    smallest = .;
  end;

  **
  ** IF THE CURRENT RANK HAS NOT ALREADY BEEN USED THEN 
  ** EVALUATE IT TO SEE IF ITS THE SMALLEST VALUE.
  *;
  if ht.find() ne 0 then do;
    smallest = min(smallest,rank);
  end;

  **
  ** SAVE THE SMALLEST UNUSED RANK BACK TO THE RANK VALUE.
  ** THEN ADD IT TO THE HASH TABLE AND FINALLY OUTPUT THE
  ** OBSERVATION.
  *;
  if last.treatmentid then do;
    rank = smallest;
    ht.add();    
    output;
  end;

  drop smallest;
run;

SAS赢了吗? JK! ; - )

答案 3 :(得分:3)

这是另一个R解决方案。使这个问题比大多数问题更难的是它不能被视为拆分 - 应用 - 组合问题,因为要选择的行不仅取决于具有给定TreatmentID的所有行,而且还取决于什么行的结果由前一个决定(假设这意味着下一个最小的)TreatmentID

首先,以可粘贴的形式提供数据(以防其他任何人想要破解它):

dat <-
structure(list(PatientName = c("Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart"), BVAID = c("BVA-027", "BVA-041", "BVA-021", "BVA-048", 
"BVA-020", "BVA-024", "BVA-001", "BVA-013", "BVA-018", "BVA-051", 
"BVA-027", "BVA-041", "BVA-048", "BVA-020", "BVA-001", "BVA-002", 
"BVA-019", "BVA-044", "BVA-025", "BVA-016", "BVA-013", "BVA-001", 
"BVA-002", "BVA-013"), Rank = c(3L, 4L, 7L, 10L, 14L, 15L, 16L, 
27L, 28L, 29L, 3L, 4L, 10L, 14L, 16L, 17L, 18L, 22L, 23L, 26L, 
27L, 16L, 17L, 27L), TreatmentCode = c("OP_TBC", "OP_TBC", "OP_TBC", 
"OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", 
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", 
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_SICO", "OP_SICO", "OP_SICO"
), TreatmentID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), DoseID = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("PatientName", "BVAID", 
"Rank", "TreatmentCode", "TreatmentID", "DoseID"), class = "data.frame", 
row.names = c(NA, -24L))

现在我的解决方案

matches <- dat[0,]
TreatmentIDs <- sort(unique(dat$TreatmentID))
for (TreatmentIDidx in seq_along(TreatmentIDs)) {
    TreatmentID <- TreatmentIDs[TreatmentIDidx]
    treat.flg <- dat$TreatmentID == TreatmentID
    match <- dat[treat.flg &
                 dat$Rank == min(setdiff(dat$Rank[treat.flg],
                                         matches$Rank[matches$TreatmentID==
                                                      TreatmentIDs[TreatmentIDidx-1]])),]
    matches <- rbind(matches, match)
}

给出了期望的结果:

> matches
   PatientName   BVAID Rank TreatmentCode TreatmentID DoseID
1   Tim Stuart BVA-027    3        OP_TBC           1      1
12  Tim Stuart BVA-041    4         OP_TC           2      1
22  Tim Stuart BVA-001   16       OP_SICO           3      1

SAS生锈了,我现在没有副本可以试用,所以我会留给其他人制作一个SAS解决方案来与之比较。< / p>

答案 4 :(得分:2)

我的解决方案。

假设您有数据集(测试)并将其排序为您在此处所做的(按患者名称,治疗然后排名)。此代码适用于多个患者姓名情况,并假设这些步骤是针对每个患者名称执行的(如果您不想要此级别,请删除所有相关患者名称)

%macro m1();
%begin: proc append base=new data=test(firstobs=1 obs=1);

data _null_;
set test(firstobs=1 obs=1);
call symput('r', rank);
call symput('id',Treatmentid);
call symput('name',patientname);

data test;
set test;
if (rank=&r or Treatmentid=&id) and patientname=symget('name') then delete;

%let dsid=%sysfunc(open(test));
%let nobs=%sysfunc(attrn(&dsid,nobs)); 
%let rc=%sysfunc(close(&dsid));

%if &nobs^=0 %then %goto begin;
%mend;

%m1(); run;