I have two SAS data tables. The first has many millions of records, each labeled with a sequential record ID, like this:
Table A
Rec   Var1   Var2   ...   VarX
1     ...
2
3
The second table specifies which rows of Table A should be assigned a coding variable:
Table B
Code   BegRec   EndRec
AA     1200     4370
AX     7241     9488
BY     12119    14763
So the first row of Table B means that every row of Table A with rec between 1200 and 4370 should be assigned the code AA.
I know how to do this with proc sql, but I would like to understand how to do it with a hash object.
In SQL, it is simply:
proc sql;
select b.code, a.*
from tableA a, tableB b
where b.begrec<=a.rec<=b.endrec;
quit;
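My actual data runs to hundreds of GB, so I would like to process it as efficiently as possible. My understanding is that a hash object may help here, but I have not been able to work out how to map what I am doing onto one.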
Answer 0 (score: 4)
Hash object solution (data-input code borrowed from @Rob_Penridge).
data big;
do rec = 1 to 20000;
output;
end;
run;
data lookup;
input Code $ BegRec EndRec;
datalines;
AA 1200 4370
AX 7241 9488
BY 12119 14763
;
run;
data created;
format code $4.;
format begrec endrec best8.;
if _n_=1 then do;
declare hash h(dataset:'lookup');
h.definekey('Code');
h.definedata('code','begrec','endrec');
h.definedone();
call missing(code,begrec,endrec);
declare hiter iter('h');
end;
set big;
iter.first();
do until (rc^=0);
if begrec <= rec <= endrec then do;
code_dup=code;
end;
rc=iter.next();
end;
keep rec code_dup;
run;
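If the ranges never overlap, another option is to expand each range into one hash entry per record ID, so that each row of big needs a single keyed find() rather than a walk through the iterator. This is only a sketch and not the answer author's code: it assumes the expanded ranges fit in memory, and the data set name created_keyed is made up for illustration.

```sas
data created_keyed;
  if _n_ = 1 then do;
    /* Empty hash keyed on rec; code is the only data item */
    declare hash h();
    h.definekey('rec');
    h.definedata('code');
    h.definedone();
    /* Load one entry per record ID in every range of the lookup table */
    do until (eof);
      set lookup end=eof;
      do rec = begrec to endrec;
        rc = h.add();  /* nonzero rc means the key already exists (overlapping ranges) */
      end;
    end;
  end;
  set big;
  /* find() fills code on a match; clear the stale value otherwise */
  if h.find() ne 0 then call missing(code);
  keep rec code;
run;
```

The trade-off is memory: the hash holds one entry per covered record ID instead of one per range, so this only pays off when the ranges cover a manageable number of rows.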
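Answer 1 (score: 2)
I'm not convinced that a hash table would be the most efficient approach here either. I would probably use a SELECT statement, because the conditional logic is fast and it still needs only 1 pass through the data: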
select;
when ( 1200 <= _n_ <=4370) code = 'AA';
...
otherwise;
end;
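Suppose you need to run this code many times and the data may change each time: you won't want to hard-code the select statement. The best solution is then to build it dynamically with a macro. I have a utility macro that I use for this kind of situation (included at the bottom):
1) Create the data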
data big;
do i = 1 to 20000;
output;
end;
run;
data lookup;
input Code $ BegRec EndRec;
datalines;
AA 1200 4370
AX 7241 9488
BY 12119 14763
;
run;
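2) Save the contents of the smaller table into macro variables. You could also do this with call symput or any other method you prefer. This approach assumes that your lookup table does not have too many rows.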
%table_parse(iDs=lookup, iField=code , iPrefix=code);
%table_parse(iDs=lookup, iField=begrec, iPrefix=begrec);
%table_parse(iDs=lookup, iField=endrec, iPrefix=endrec);
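The call symput alternative mentioned in step 2 could look roughly like the sketch below. It uses call symputx (which also strips blanks) and writes the same variable names the utility macro would produce: code1..codeN, begrec1..begrecN, endrec1..endrecN, plus the row count in code. Placing them in the global symbol table is an assumption made here so that the macro in step 3 can see them.

```sas
data _null_;
  set lookup end=eof;
  /* One macro variable per row and field, e.g. code1, begrec1, endrec1 */
  call symputx(cats('code',   _n_), code,   'G');
  call symputx(cats('begrec', _n_), begrec, 'G');
  call symputx(cats('endrec', _n_), endrec, 'G');
  /* The row count goes into the bare prefix, which step 3 uses as its loop limit */
  if eof then call symputx('code', _n_, 'G');
run;
```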
3) Build the SELECT statement dynamically.
%macro ds;
%local cnt;
data final;
set big;
select;
%do cnt=1 %to &code;
when (&&begrec&cnt <= _n_ <= &&endrec&cnt) code = "&&code&cnt";
%end;
otherwise;
end;
run;
%mend;
%ds;
Here is the utility macro:
/*****************************************************************************
** MACRO.TABLE_PARSE.SAS
**
** AS PER %LIST_PARSE BUT IT TAKES INPUT FROM A FIELD IN A TABLE.
** STORE EACH OBSERVATION'S FIELD'S VALUE INTO ITS OWN MACRO VARIABLE.
** THE TOTAL NUMBER OF WORDS IN THE STRING IS ALSO SAVED IN A MACRO VARIABLE.
**
** THIS WAS CREATED BECAUSE %LIST_PARSE WOULD FALL OVER WITH VERY LONG INPUT
** STRINGS. THIS WILL NOT.
**
** EACH VALUE IS STORED TO ITS OWN MACRO VARIABLE. THE NAMES
** ARE IN THE FORMAT <PREFIX>1 .. <PREFIX>N.
**
** PARAMETERS:
** iDS : (LIB.DATASET) THE NAME OF THE DATASET TO USE.
** iFIELD : THE NAME OF THE FIELD WITHIN THE DATASET.
** iPREFIX : THE PREFIX TO USE FOR STORING EACH WORD OF THE ISTRING TO
** ITS OWN MACRO VARIABLE (AND THE TOTAL NUMBER OF WORDS).
** iDSOPTIONS : OPTIONAL. ANY DATASET OPTIONS YOU MAY WANT TO PASS IN
** SUCH AS A WHERE FILTER OR KEEP STATEMENT.
**
******************************************************************************
** HISTORY:
** 1.0 MODIFIED: 01-FEB-2007 BY: ROBERT PENRIDGE
** - CREATED.
** 1.1 MODIFIED: 27-AUG-2010 BY: ROBERT PENRIDGE
** - MODIFIED TO ALLOW UNMATCHED QUOTES ETC IN VALUES BEING RETURNED BY
** CHARACTER FIELDS.
** 1.2 MODIFIED: 30-AUG-2010 BY: ROBERT PENRIDGE
** - MODIFIED TO ALLOW BLANK CHARACTER VALUES AND ALSO REMOVED TRAILING
** SPACES INTRODUCED BY CHANGE 1.1.
** 1.3 MODIFIED: 31-AUG-2010 BY: ROBERT PENRIDGE
** - MODIFIED TO ALLOW PARENTHESES IN CHARACTER VALUES.
** 1.4 MODIFIED: 31-AUG-2010 BY: ROBERT PENRIDGE
** - ADDED SOME DEBUG VALUES TO DETERMINE WHY IT SOMETIMES LOCKS TABLES.
*****************************************************************************/
%macro table_parse(iDs=, iField=, iDsOptions=, iPrefix=);
%local dsid pos rc cnt cell_value type;
%let cnt=0;
/*
** OPEN THE TABLE (AND MAKE SURE IT EXISTS)
*/
%let dsid=%sysfunc(open(&iDs(&iDsOptions),i));
%if &dsid eq 0 %then %do;
%put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());
%end;
/*
** GET THE POSITION OF THE FIELD (AND MAKE SURE IT EXISTS)
*/
%let pos=%sysfunc(varnum(&dsid,&iField));
%if &pos eq 0 %then %do;
%put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());
%end;
%else %do;
/*
** DETERMINE THE TYPE OF THE FIELD
*/
%let type = %upcase(%sysfunc(vartype(&dsid,&pos)));
%end;
/*
** READ THROUGH EACH OBSERVATION IN THE TABLE
*/
%let rc=%sysfunc(fetch(&dsid));
%do %while (&rc eq 0);
%let cnt = %eval(&cnt + 1);
%if "&type" = "C" %then %do;
%let cell_value = %qsysfunc(getvarc(&dsid,&pos));
%if "%trim(&cell_value)" ne "" %then %do;
%let cell_value = %qsysfunc(cats(%nrstr(&cell_value)));
%end;
%end;
%else %do;
%let cell_value = %sysfunc(getvarn(&dsid,&pos));
%end;
%global &iPrefix.&cnt ;
%let &iPrefix.&cnt = &cell_value ;
%let rc=%sysfunc(fetch(&dsid));
%end;
/*
** CHECK FOR ABNORMAL TERMINATION OF LOOP
*/
%if &rc ne -1 %then %do;
%put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());
%end;
/*
** ENSURE THE TABLE IS CLOSED SUCCESSFULLY
*/
%let rc=%sysfunc(close(&dsid));
%if &rc %then %do;
%put WARNING: MACRO.TABLE_PARSE.SAS: %sysfunc(sysmsg());
%end;
%global &iPrefix;
%let &iPrefix = &cnt ;
%mend;
Another example of calling this macro:
%table_parse(iDs=sashelp.class, iField=sex, iPrefix=myTable, iDsOptions=%str(where=(sex='F')));
%put &mytable &myTable1 &myTable2 &myTable3; *etc...;
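Answer 2 (score: 2)
I'd be tempted to use the direct-access method POINT= here, which reads only the required row numbers rather than the whole dataset. Here is the code, which uses the same data-creation code as Rob's answer.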
data want;
set lookup;
do i=begrec to endrec;
set big point=i;
output;
end;
drop begrec endrec;
run;
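One thing to watch with POINT= is a range that runs past the end of the big dataset, since pointing at a row that does not exist sets _ERROR_. A hedged variant that clamps each range to the rows that are actually there is sketched below. The names p and n are chosen here for illustration and are assumed not to be columns of big.

```sas
data want;
  set lookup;
  /* n is filled in by NOBS= at compile time, so it is usable before the SET below runs */
  do p = max(begrec, 1) to min(endrec, n);
    set big point=p nobs=n;  /* direct read of row p only */
    output;
  end;
  drop begrec endrec;        /* p and n are temporary and are not written out */
run;
```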
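If the big dataset already has a code column and you only want to update it with the values from the lookup dataset, you can do this using MODIFY.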
data big;
set lookup (rename=(code=code1));
do i=begrec to endrec;
modify big point=i;
code=code1;
replace;
end;
run;
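Answer 3 (score: 2)
Here is my solution, using proc format. This is also done in memory, much like the hash table, but needs less structural code to work.
(The data-input code is again borrowed from @Rob_Penridge.)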
data big;
do rec = 1 to 20000;
output;
end;
run;
data lookup;
input Code $ BegRec EndRec;
datalines;
ZZ 0 20
JJ 40 60
AA 1200 4370
AX 7241 9488
BY 12119 14763
;
run;
data lookup_f;
set lookup;
rename
BegRec = start
EndRec = end
Code = label;
retain fmtname 'CodeRecFormat';
run;
proc format library = work cntlin=lookup_f; run;
data big_formatted;
format rec CodeRecFormat.;
format rec2 8.;
length code $5.;
set big;
code = putn(rec, "CodeRecFormat.");
rec2 = rec;
run;
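With the format above, a rec that falls in no range is simply printed as its number. If a fixed marker is preferred for unmatched rows, one possible variation is to append an OTHER range (HLO='O') to the control dataset. This is a sketch: the names lookup_f2, CodeRecOther, and big_other, and the placeholder label NONE, are all invented here.

```sas
data lookup_f2;
  set lookup (rename=(BegRec=start EndRec=end Code=label)) end=eof;
  retain fmtname 'CodeRecOther' type 'N';  /* numeric format */
  output;
  if eof then do;
    /* Extra control row: everything outside the listed ranges */
    call missing(start, end);
    hlo   = 'O';
    label = 'NONE';
    output;
  end;
run;

proc format library=work cntlin=lookup_f2; run;

data big_other;
  set big;
  length code $8;
  code = put(rec, CodeRecOther.);  /* range label, or NONE when no range matches */
run;
```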