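Using a hash object in SAS to extract certain rows from data

Asked: 2012-02-10 22:08:30

Tags: hash sas

I have two SAS data tables. The first has many millions of records, each labelled with a sequential record ID, like this: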

Table A

Rec  Var1 Var2 ... VarX
1    ...
2
3

The second table specifies which rows of Table A should be assigned which code:

Table B

Code  BegRec    EndRec
AA      1200      4370
AX      7241      9488
BY     12119     14763

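So the first row of Table B means that every row of Table A with Rec between 1200 and 4370 should be assigned the code AA.

I know how to do this with proc sql, but I would like to learn how to do it with a hash object.

In SQL it is simply: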

proc sql;
 select b.code, a.*
 from tableA a, tableB b
 where b.begrec<=a.rec<=b.endrec;
quit;

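My real data runs to several hundred GB, so I want the processing to be as efficient as possible. My understanding is that a hash object could help here, but I have not been able to work out how to map it onto what I am doing.

4 answers:

Answer 0 (score: 4)

A hash object solution (data input code borrowed from @Rob_Penridge).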

    data big;
      do rec = 1 to 20000;
       output;
      end;
    run;

    data lookup;      
      input Code $ BegRec EndRec;
      datalines;
      AA      1200      4370
      AX      7241      9488
      BY     12119     14763
      ;
    run;


    data created;
      /* Define the host variables the hash object needs in the PDV */
      format code $4.;
      format begrec endrec best8.;
      if _n_=1 then do;
        /* Load the lookup table into memory once */
        declare hash h(dataset:'lookup');
        h.definekey('Code');
        h.definedata('code','begrec','endrec');
        h.definedone();
        call missing(code,begrec,endrec);
        declare hiter iter('h');
      end;

      set big;
      /* Walk every lookup range; keep the code of a range containing rec */
      rc=iter.first();
      do until (rc^=0);
        if begrec <= rec <= endrec then do;
          code_dup=code;
        end;
        rc=iter.next();
      end;
      keep rec code_dup;
    run;
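The step above visits every lookup row for every record of big and also writes out the records that match no range (with a blank code_dup), whereas the proc sql join in the question keeps matched rows only. As a rough sketch of a variant, not a tuned solution, the loop can stop at the first matching range and output matches only. It assumes the big and lookup tables created above and non-overlapping ranges; the names created_matched and found are made up for this illustration.

```sas
/* Sketch only: assumes the big and lookup tables created above,
   and that the ranges in lookup do not overlap. */
data created_matched;
  /* Put code, begrec and endrec into the PDV with the attributes
     they have in lookup; this SET never actually executes. */
  if 0 then set lookup;
  if _n_ = 1 then do;
    declare hash h(dataset: 'lookup');
    h.definekey('code');
    h.definedata('code', 'begrec', 'endrec');
    h.definedone();
    declare hiter iter('h');
  end;

  set big;

  /* Scan the ranges, stopping as soon as one contains rec */
  found = 0;
  rc = iter.first();
  do while (rc = 0 and not found);
    if begrec <= rec <= endrec then found = 1;
    else rc = iter.next();
  end;

  /* Keep matched records only, like the inner join in the question */
  if found then output;
  keep rec code;
run;
```

With only a handful of ranges the saving per record is small. Note also that this is still a sequential scan of the ranges held in memory: a hash lookup by key cannot answer a range question directly.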

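Answer 1 (score: 2)

I am not sure a hash table is the most efficient approach here either. I would probably use a SELECT statement, since the conditional logic is fast and it still needs only one pass over the data: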

select;
  when ( 1200 <= _n_ <=4370) code = 'AA';
  ...
  otherwise;
end;

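Presumably you will need to run this code more than once, and the data may change each time, so you will not want to hard-code the select statement. The best solution is therefore to build it dynamically with a macro. I have a utility macro that I use for situations like this (included at the bottom):

1) Create the data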

data big;
  do i = 1 to 20000;
    output;
  end;
run;

data lookup;      
  input Code $ BegRec EndRec;
  datalines;
AA      1200      4370
AX      7241      9488
BY     12119     14763
;
run;

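2) Save the contents of the smaller table into macro variables. You could also do this with call symput or any other method you prefer. This approach assumes the lookup table does not have too many rows.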

%table_parse(iDs=lookup, iField=code  , iPrefix=code);
%table_parse(iDs=lookup, iField=begrec, iPrefix=begrec);
%table_parse(iDs=lookup, iField=endrec, iPrefix=endrec);
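For anyone who would rather not depend on the utility macro, here is a minimal sketch of the call symput route mentioned in step 2, using call symputx so the values are trimmed. It assumes the same lookup table and creates the same global macro variable names that step 3 relies on (code1..codeN, begrec1..begrecN, endrec1..endrecN, plus code holding the row count). It does not reproduce the extra quoting safeguards of %table_parse.

```sas
/* Sketch only: a stand-in for the three %table_parse calls above */
data _null_;
  set lookup end=last;
  /* One global macro variable per row and per field, e.g. code1, begrec1 */
  call symputx(cats('code',   _n_), code,   'G');
  call symputx(cats('begrec', _n_), begrec, 'G');
  call symputx(cats('endrec', _n_), endrec, 'G');
  /* The row count goes into &code, which the %do loop uses as its bound */
  if last then call symputx('code', _n_, 'G');
run;

/* Quick check of what was stored for the first range */
%put &code &code1 &begrec1 &endrec1;
```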

3) Build the SELECT statement dynamically.

%macro ds;
  %local cnt;

  data final;
    set big;

    select;
      %do cnt=1 %to &code;
        when (&&begrec&cnt <= _n_ <= &&endrec&cnt) code = "&&code&cnt";
      %end;
      otherwise;
    end;

  run;
%mend;
%ds;
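For the three-row lookup table above, the %ds macro should generate a step roughly equivalent to the following (written out by hand here as an illustration, not copied from a log). Running with options mprint shows the code that is actually generated.

```sas
/* Hand-expanded equivalent of %ds for the sample lookup table */
data final;
  set big;

  select;
    when (1200 <= _n_ <= 4370) code = "AA";
    when (7241 <= _n_ <= 9488) code = "AX";
    when (12119 <= _n_ <= 14763) code = "BY";
    otherwise;
  end;

run;
```

Note that the conditions test _n_, the observation counter, so this relies on the record ID being the same as the row position in big.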

Here is the utility macro:

/*****************************************************************************
**  MACRO.TABLE_PARSE.SAS
**
**  AS PER %LIST_PARSE BUT IT TAKES INPUT FROM A FIELD IN A TABLE.
**  STORE EACH OBSERVATION'S FIELD'S VALUE INTO ITS OWN MACRO VARIABLE.
**  THE TOTAL NUMBER OF WORDS IN THE STRING IS ALSO SAVED IN A MACRO VARIABLE.
**
**  THIS WAS CREATED BECAUSE %LIST_PARSE WOULD FALL OVER WITH VERY LONG INPUT
**  STRINGS.  THIS WILL NOT.
**
**  EACH VALUE IS STORED TO ITS OWN MACRO VARIABLE.  THE NAMES
**  ARE IN THE FORMAT <PREFIX>1 .. <PREFIX>N.
**
**  PARAMETERS:
**  iDS        : (LIB.DATASET) THE NAME OF THE DATASET TO USE.
**  iFIELD     : THE NAME OF THE FIELD WITHIN THE DATASET.
**  iPREFIX    : THE PREFIX TO USE FOR STORING EACH WORD OF THE ISTRING TO 
**               ITS OWN MACRO VARIABLE (AND THE TOTAL NUMBER OF WORDS). 
**  iDSOPTIONS : OPTIONAL. ANY DATASET OPTIONS YOU MAY WANT TO PASS IN
**               SUCH AS A WHERE FILTER OR KEEP STATEMENT.
**
******************************************************************************
**  HISTORY:
**  1.0  MODIFIED: 01-FEB-2007  BY: ROBERT PENRIDGE
**  - CREATED.
**  1.1  MODIFIED: 27-AUG-2010  BY: ROBERT PENRIDGE
**  - MODIFIED TO ALLOW UNMATCHED QUOTES ETC IN VALUES BEING RETURNED BY 
**    CHARACTER FIELDS.
**  1.2  MODIFIED: 30-AUG-2010  BY: ROBERT PENRIDGE
**  - MODIFIED TO ALLOW BLANK CHARACTER VALUES AND ALSO REMOVED TRAILING
**    SPACES INTRODUCED BY CHANGE 1.1.
**  1.3  MODIFIED: 31-AUG-2010  BY: ROBERT PENRIDGE
**  - MODIFIED TO ALLOW PARENTHESES IN CHARACTER VALUES.
**  1.4  MODIFIED: 31-AUG-2010  BY: ROBERT PENRIDGE
**  - ADDED SOME DEBUG VALUES TO DETERMINE WHY IT SOMETIMES LOCKS TABLES.
*****************************************************************************/
%macro table_parse(iDs=, iField=, iDsOptions=, iPrefix=);
  %local dsid pos rc cnt cell_value type;

  %let cnt=0;
  /*
  ** OPEN THE TABLE (AND MAKE SURE IT EXISTS)
  */
  %let dsid=%sysfunc(open(&iDs(&iDsOptions),i));
  %if &dsid eq 0 %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;

  /*
  ** GET THE POSITION OF THE FIELD (AND MAKE SURE IT EXISTS)
  */
  %let pos=%sysfunc(varnum(&dsid,&iField));
  %if &pos eq 0 %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;
  %else %do;
    /*
    ** DETERMINE THE TYPE OF THE FIELD
    */
    %let type = %upcase(%sysfunc(vartype(&dsid,&pos)));
  %end;

  /*
  ** READ THROUGH EACH OBSERVATION IN THE TABLE
  */
  %let rc=%sysfunc(fetch(&dsid));
  %do %while (&rc eq 0);
    %let cnt = %eval(&cnt + 1);
    %if "&type" = "C" %then %do;
      %let cell_value = %qsysfunc(getvarc(&dsid,&pos));
      %if "%trim(&cell_value)" ne "" %then %do;
        %let cell_value = %qsysfunc(cats(%nrstr(&cell_value)));
      %end;
    %end;
    %else %do;
      %let cell_value = %sysfunc(getvarn(&dsid,&pos));
    %end;

    %global &iPrefix.&cnt ;
    %let &iPrefix.&cnt = &cell_value ;

    %let rc=%sysfunc(fetch(&dsid));
  %end;


  /*
  ** CHECK FOR ABNORMAL TERMINATION OF LOOP
  */
  %if &rc ne -1 %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;


  /*
  ** ENSURE THE TABLE IS CLOSED SUCCESSFULLY
  */
  %let rc=%sysfunc(close(&dsid));
  %if &rc %then %do;
    %put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());      
  %end;

  %global &iPrefix;
  %let &iPrefix = &cnt ;
%mend;

Another example of calling this macro:

%table_parse(iDs=sashelp.class, iField=sex, iPrefix=myTable, iDsOptions=%str(where=(sex='F')));
%put &mytable &myTable1 &myTable2 &myTable3; *etc...;

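Answer 2 (score: 2)

I would be tempted to use the direct access method, POINT=, here, which reads only the required row numbers rather than the whole dataset. Here is the code, which uses the same data creation code as Rob's answer.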

    data want;
      set lookup;
      do i=begrec to endrec;
        set big point=i;
        output;
      end;
      drop begrec endrec;
    run;
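One caveat with POINT=: if a range in lookup runs past the last row of big, the direct read fails for those row numbers. A minimal sketch of a guard, assuming the big table with the rec column from the first answer, uses NOBS= (whose value is set at compile time) to clamp the loop. The names want_safe, p and n are made up for this illustration.

```sas
/* Sketch only: same idea as above, with the loop clamped to 1..n */
data want_safe;
  set lookup;
  /* n is filled in at compile time, so it is usable before SET runs */
  do p = max(begrec, 1) to min(endrec, n);
    set big point=p nobs=n;
    output;
  end;
  drop begrec endrec;
run;
```

The POINT= variable p is temporary and is not written to want_safe. A separate pointer name is used so that it cannot clash with a column of big.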

If the big dataset already has a code column and you only want to update it with the values from the lookup dataset, you can do that with MODIFY.

    data big;
      set lookup (rename=(code=code1));
      do i=begrec to endrec;
        modify big point=i;
        code=code1;
        replace;
      end;
    run;

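Answer 3 (score: 2)

Here is my solution, using proc format. This is also done in memory, much like a hash table, but needs less scaffolding code to get working.

(Data input code borrowed from @Rob_Penridge.)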

data big;
  do rec = 1 to 20000;
   output;
  end;
run;

data lookup;      
  input Code $ BegRec EndRec;
  datalines;
  ZZ         0        20
  JJ        40        60
  AA      1200      4370
  AX      7241      9488
  BY     12119     14763
  ;
run;

data lookup_f;
    set lookup;    
    rename
        BegRec  = start
        EndRec  = end
        Code    = label;

    retain fmtname 'CodeRecFormat';
run;

proc format library = work cntlin=lookup_f; run;


data big_formatted;
    format rec CodeRecFormat.;
    format rec2 8.;
    length code $5.;

    set big;    

    code = putn(rec, "CodeRecFormat.");
    rec2 = rec;
run;
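One thing to watch with the format approach: for a rec that falls in none of the ranges, putn returns the number itself as text rather than a blank, so unmatched rows are not obviously unmatched. A possible way around this, sketched here under the assumption that a catch-all label is acceptable, is to append an OTHER row (hlo = 'O') to the control dataset. The dataset name lookup_f_other and the label NONE are made up for this illustration.

```sas
/* Sketch only: control dataset with an extra catch-all (OTHER) row */
data lookup_f_other;
  set lookup(rename=(BegRec=start EndRec=end Code=label)) end=last;
  retain fmtname 'CodeRecFormat' type 'N';
  output;
  if last then do;
    /* hlo='O' marks the OTHER range; start and end are ignored for it */
    hlo   = 'O';
    label = 'NONE';
    output;
  end;
run;

proc format library=work cntlin=lookup_f_other; run;

data big_formatted_other;
  length code $5;
  set big;
  /* Unmatched rec values now come back as NONE */
  code = putn(rec, 'CodeRecFormat.');
run;
```

Filtering with where code ne 'NONE' afterwards gives the same rows as the inner join in the question.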