Aggregating data by variable issue in SAS

Time: 2015-08-28 21:36:04

Tags: sas

My data looks like this:

ID  FileSource      Age MamUlt  ProcDate    Name
223 Facility        35  M       19591       SWEDISH
223 Facility        35  M       19592       SWEDISH
223 Facility        35  U       19592       SWEDISH
223 Facility        35  U       19593       SWEDISH
223 Non-Facility    35  M       19594       RADIA
223 Non-Facility    35  U       19594       RADIA

What I want to do is combine this data (for each ID in the dataset) so that it looks like this:

ID   Age MAMs ULTs SameDate 
223  35  3    3    2

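So, for each ID, I need the total number of times "M" and "U" occur, plus the number of dates on which both occur; in this sample that happens twice.

This is what I have done so far: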

data ImageTotals;
    set ImageClaims;
    by ID;
    retain ID MAMs ULTs SameDate;

    if first.ID then do;
        MAMs = 0;
        ULTs = 0;
        MamDate = .;
        UltDate = .;
        SameDate = 0;
    end;

    if MamUlt = "M" then do; MAMs = MAMs + 1; MamDate = ProcDate; end;
    if MamUlt = "U" then do; ULTs = ULTs + 1; UltDate = ProcDate; end;
    if MamDate = UltDate and MamDate ^= . then do; SameDate = SameDate+1; end;

    if last.ID;
    keep ID MAMs ULTs SameDate;
run;

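Any suggestions? This solves the counting problem, but not the SameDate problem (it is still zero for this instance).

3 answers:

Answer 0: (score: 2)

You can use a DOW loop to do the aggregation in a data step. The data must be sorted by ID and PROCDATE. Count how many times M and U occur within each date. You can then use those per-date counts to aggregate at the ID level and to test whether both occurred on the same date. Simply keep the AGE variable, so that it carries the value from the last record for that ID.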

data counts ;
  do until (last.id);
    m=0;
    u=0;
    do until (last.procdate);
      set imageclaims;
      by id procdate;
      /* proc is the M/U code variable (named MamUlt in the sample data of the question) */
      m= sum(m,proc='M');
      u= sum(u,proc='U');
    end;
    MAMs=sum(mams,m);
    ULTs=sum(ults,u);
    SameDate=sum(samedate,m and u);
  end;
  keep id age mams ults samedate ;
run;
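
The DOW loop above relies on BY-group processing, so the input has to be in ID and PROCDATE order first. A minimal sketch of that preparatory step, assuming the dataset is called imageclaims as in the question and that sorting it in place is acceptable (add an out= option if the original order must be preserved):

```sas
/* Sort in place so that first./last. flags work for BY id procdate */
proc sort data=imageclaims;
  by id procdate;
run;
```

Without the sort, a data step with a BY statement typically stops with a "BY variables are not properly sorted" error as soon as it meets an out-of-order row.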

Answer 1: (score: 1)

I think this may be a SQL problem (not my forte), but since you started with a DATA step solution, I had a go at both. I also added more test data.

data ImageClaims;
  input id age Proc $1. ProcDate;
  cards;
223 35 M 19591
223 35 M 19592
223 35 U 19592
223 35 U 19593
223 35 M 19594
223 35 U 19594
224 35 M 19591
224 35 M 19592
224 35 M 19593
224 35 M 19593
224 35 M 19594
224 35 U 19595
225 35 M 19592
225 35 U 19592
225 35 U 19593
225 35 M 19593
225 35 M 19594
225 35 U 19594
;
run;

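For the DATA step approach, create counters for MAMs, ULTs, and MAMULTs (a mammogram and an ultrasound on the same date). Note that because I use the sum statement for these counters (MAMs++1), they are implicitly retained.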

data ImageTotals (keep=id Age MAMs ULTs MAMULTs);
  set ImageClaims;
  by ID ProcDate;
  retain HaveMam HaveUlt; *Count vars are implicitly retained by sum statement;
  if first.ID then do;
    MAMs=0;    *count of mammograms;
    ULTs=0;    *count of ultrasounds;
    MAMULTs=0; *count of mammograms and ultrasounds on same date;
  end;
  if first.ProcDate then do;
    HaveMam=0;  *indicator for have a mammogram or not on that date;
    HaveUlt=0;  *indicator for have an ultrasound or not on that date;
  end;

  if Proc='M' then do;
    HaveMam=1;  *set mammogram indicator (for that date);
    MAMs++1;    *increment counter;
  end;
  else if Proc='U' then do;
    HaveUlt=1;  *set ultrasound indicator (for that date);
    ULTs++1;    *increment counter;
  end;

  if last.ProcDate then do;
    MAMULTs++(HaveMam=1 and HaveUlt=1); *increment MamUlts counter if had both on same date;
  end;

  if last.id;
run;
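
Two features carry most of the step above: the sum statement, which retains its variable across rows and starts it at 0, and the fact that a comparison evaluates to 1 or 0. A toy sketch of both, where the dataset names toy and toy_counts and the variables x, n_rows and n_big are hypothetical and not part of the answer:

```sas
/* Hypothetical toy data, values chosen only for illustration */
data toy;
  input x;
  cards;
1
2
3
;

data toy_counts;
  set toy;
  n_rows + 1;       /* sum statement: n_rows starts at 0 and is retained across rows */
  n_big + (x > 1);  /* the comparison evaluates to 1 when true and 0 when false */
run;

proc print data=toy_counts;
run;
```

This is the same mechanism that lets MAMULTs++(HaveMam=1 and HaveUlt=1) add either 1 or 0 at the end of each date.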

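For the SQL solution, I use a subquery that computes MAMs, ULTs, and MAMULTs by ID and ProcDate, and then the outer query sums them by ID. There may be a better SQL solution, but I think this works.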

proc sql;
  create table ImageTotals as
    select id
          ,max(age) as age  /*arbitrary use of max; age is constant within id*/
          ,sum(MAMs) as MAMs
          ,sum(ULTs) as ULTs
          ,sum(MAMULTs) as MAMULTs
    from (
          select id
                ,procdate
                ,max(age) as age
                ,sum(Proc='M') as MAMs
                ,sum(Proc='U') as ULTs
                ,count(distinct(Proc))=2 as MAMULTs
          from ImageClaims
          group by id,ProcDate
          )
    group by id
  ;
quit;

proc print;
run;

The Work.ImageTotals I get from both steps is:

Obs     id    age    MAMs    ULTs    MAMULTs

 1     223     35      3       3        2
 2     224     35      5       1        0
 3     225     35      3       3        3
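
One caveat about the SQL version: count(distinct(Proc))=2 flags a date as "both" only if M and U are the only codes that can appear in Proc. If other codes were possible (an assumption, since the sample data contains none), a variant like the sketch below tests for each code explicitly. It only selects and creates no table, so Work.ImageTotals is left untouched.

```sas
proc sql;
  select id
        ,sum(MAMs)    as MAMs
        ,sum(ULTs)    as ULTs
        ,sum(MAMULTs) as MAMULTs
  from (
        select id
              ,ProcDate
              ,sum(Proc='M') as MAMs
              ,sum(Proc='U') as ULTs
              /* 1 only when the date has at least one M and at least one U */
              ,max(Proc='M') * max(Proc='U') as MAMULTs
        from ImageClaims
        group by id, ProcDate
       )
  group by id
  ;
quit;
```

On the test data above, which contains only M and U, this should give the same totals as the original query.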

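Answer 2: (score: 0)

Picking up on Q's suggestion, I think this can be solved with proc sql (count / group by), unless I am misunderstanding the complexity here... I would post some code, but I will let you have a go at it first...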