Question

I have many large_tables (billions of rows) that I want to subset based on id_list (millions of rows). I am using a hash table to speed this up:

data subset1;
    set large_table1;
    if _n_ eq 1 then do;
        declare hash ht(dataset:"id_list");
        ht.definekey('id');
        ht.definedone();
    end;
    if ht.check() eq 0 then do; output; end;
run;

How can I reuse id_list's hash table? Re-creating it in every subset query wastes too much time.

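Update: as the answers indicate, there is currently no way to make a persistent hash table in SAS. I empirically tested two less-than-ideal options with a 12-million-row id_list and a 1.5-billion-row large_table. Using a format instead of a hash table took almost twice as long (40 minutes versus 23 minutes). That makes the overhead of re-creating the hash table in every data step negligible by comparison, so for now I am simply doing that.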

Answer 1

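Sadly, hash tables cannot persist across DATA steps. AFAIK, they are erased when the step ends in order to free memory. I saw a talk by Art Carpenter at SGF 2018 in which he tried different ways to trick SAS into making a persistent hash table, and he could not get any of them to work.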

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2399-2018.pdf

Answer 2

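For completeness, this is how you can reuse a hash: use FCMP. It does not truly reuse the table across data steps (the hash table is reloaded), but within a macro it does persist.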

proc fcmp outlib=work.funcs.func;
function check_ids( name $ );
 declare hash h_ids(dataset:"work.class_names");
 rc = h_ids.defineKey( "name");
 rc = h_ids.definedone();
 rc = h_ids.check();
 return( not rc );
endsub;

quit;

data class_names;
  set sashelp.class;
  where sex='F'; 
run;

 options cmplib=work.funcs;

data class_find_f;
   set sashelp.class;
   if check_ids(name)=1;
run;

For details on hashing in FCMP, see "Hashing in PROC FCMP to Enhance Your Productivity".

Answer 3

The way I would do this is not with a hash table but with a format.

data for_fmt;
  set id_list;
  retain fmtname 'idlistf' type 'n'; *or c if id is character, and add $ to fmtname;
  start=id;
  label=1;
  output;
  if _n_=1 then do;  *this section we tell it what to do with 'other' (not found) IDs;
    hlo='o';
    call missing(start); *unneeded but I like to do this for clarity;
    label=0;
    output;
  end;
run;

*if ID can be duplicated, then run a proc sort nodupkey here;

proc format cntlin=for_fmt;
run;

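This will persist, and it should be about as fast as the hash table. If your ID list is very large, you could use a view here so that it is only processed once.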
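The answer stops at building the format, so here is a minimal sketch of how the subsetting step might then look. It assumes a numeric id and the format name idlistf from the code above; large_table1 and subset1 are taken from the question. The strip() call is a precaution against padding in the stored label and may be unnecessary.

```sas
/* Sketch only: keep rows whose id maps to the label 1 in the idlistf format. */
/* For a character id, the format would be $idlistf. instead.                 */
data subset1;
    set large_table1;
    /* put() returns the label as text: 1 for listed ids, 0 for all others */
    if strip(put(id, idlistf.)) = '1';
run;
```

Because the format lives in a catalog rather than inside one step, the same subsetting IF can be repeated against each large table without rebuilding anything.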

Answer 4

You can also use the SASFILE statement to load the smaller data set into memory.

http://documentation.sas.com/?docsetId=lestmtsglobal&docsetTarget=n0osyhi338pfaan1plin9ioilduk.htm&docsetVersion=9.4&locale=en

This speeds up each load of the hash table, because the data is then copied from memory to memory rather than from disk to memory.
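As a rough sketch of that idea, assuming id_list lives in the WORK library and that there is a second table named large_table2 (a hypothetical name used only for illustration), the pattern could look like this:

```sas
/* Sketch only: hold id_list in memory for the duration of several steps. */
sasfile work.id_list load;

data subset1;
    set large_table1;
    if _n_ eq 1 then do;
        /* the hash is still rebuilt per step, but now reads from memory */
        declare hash ht(dataset:"work.id_list");
        ht.definekey('id');
        ht.definedone();
    end;
    if ht.check() eq 0;
run;

data subset2;
    set large_table2;   /* hypothetical second large table */
    if _n_ eq 1 then do;
        declare hash ht(dataset:"work.id_list");
        ht.definekey('id');
        ht.definedone();
    end;
    if ht.check() eq 0;
run;

/* release the memory once all subsets are done */
sasfile work.id_list close;
```

The hash table itself is not reused here; the saving, if any, comes from avoiding repeated disk reads of id_list, and it depends on having enough memory to hold the data set.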

How to reuse a hash table in SAS
