How to look up two variables in a SAS dataset and update a value

Time: 2016-05-10 10:48:30

Tags: sas lookup

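I have a dataset with two variables, ID and telephone number. I want to look up the telephone number and fill it in on every row of the same ID where it is missing. For example:

In the attached example, IDs A and B have missing values in the telephone column. I want to pick up the value from a row where it is available and write it to the rows where it is not.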

Date      Visitor_ID             Telephone number
1-Mar-16  1000003634_4152228768
1-Mar-16  1000094865_1269576832
1-Mar-16  1000103735_1035466360
1-Mar-16  1000103735_1035466360  fda6a5563867eeebf19fb3
1-Mar-16  1000108145_3760680616
1-Mar-16  1000123010_2631619556
1-Mar-16  1000123010_2631619556  fda6a75c3765e0e8f797b4
1-Mar-16  1000126547_974397207
1-Mar-16  1000126592_2744218771
1-Mar-16  1000137177_3054387520
1-Mar-16  1000137208_498258799
1-Mar-16  1000137208_498258799   fda6a5563660e0ebf295b3
1-Mar-16  1000137460_2624495603
1-Mar-16  1000137460_2624495603  fda6a6583763eaeaf29eba
1-Mar-16  1000151867_3243977925
1-Mar-16  1000151867_3243977925  fda6a15a3f63eaedfb94b3
1-Mar-16  1000166048_3215927260
1-Mar-16  1000174960_357067493
1-Mar-16  1000178443_623552771
1-Mar-16  1000183569_2728954199
1-Mar-16  1000220805_3781532691
1-Mar-16  1000220805_3781532691  fda6aa5c3a64e0ebfb96b0


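2 answers:

Answer 0 (score: 1)

Here is a solution. It uses hash tables and key joins, which you may not have seen before. A hash table is simply an in-memory table that you can access quickly. Key joins suit what you are trying to do here: they look a value up through an index and update a field in an existing dataset.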

data telephone_nos;
 length id $1. telephone $2;
 id = "a"; telephone = ""; output;
 id = "b"; telephone = ""; output;
 id = "c"; telephone = ""; output;
 id = "b"; telephone = "13"; output;
 id = "a"; telephone = "12"; output;
 id = "e"; telephone = ""; output;
 id = "d"; telephone = ""; output;
 id = "c"; telephone = ""; output;
 id = "a"; telephone = ""; output;
run;


/* Create a telephone number lookup table that is deduped and indexed by id*/
data lookup_telephone_nos (drop = rc index = (id));
 /*create a hash table with a lookup id*/
 declare hash dedupe();
 dedupe.definekey('id');
 dedupe.definedone();
  do while (not e);
   /*Only read in data with telephone numbers*/
   set telephone_nos (keep = id telephone
                      where = (telephone ne "")) end = e;
   /*Check to see if you have already seen this id (the hash key)*/
   rc=dedupe.check();
   /*If you haven't add it to the hash table and output it*/
   if rc ne 0 then do;
    rc=dedupe.add(); 
    output;              
   end;
  end;
  /*Remove the hash table*/
  dedupe.delete();
  stop;
 run;

/*If you don't have enough memory to use hash tables to dedupe - then create
  the above table without deduping (see below). This may take up more    
  physical disc space, but the key join will still work as it will pick up   
  the first instance that matches*/

/*
data lookup_telephone_nos (drop = rc index = (id));
 set telephone_nos (keep = id telephone
                    where = (telephone ne ""));
run;
*/

 /*Use a key join to fill in the missing telephone numbers*/
 data telephone_nos;
  set telephone_nos;
  /*Look up id in the index of lookup_telephone_nos; on a match, telephone
    is overwritten with the looked-up value*/
  set lookup_telephone_nos key = id / unique;

  /* _iorc_ is 0 when a match is found. When no match is found, an error is written to the log,
     so reset _ERROR_ for ids that have no telephone number at all (c, d and e in the example) */
  if _iorc_ ne 0 then _ERROR_ = 0;
 run;
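
As a possible alternative to the two steps above, the lookup can also be done directly with a hash object, without building an indexed dataset first. The following is only a minimal sketch, meant to be run against the original telephone_nos sample: the output dataset name telephone_filled and the hash name lk are made up for illustration. It assumes the non-missing rows fit in memory and that each id has at most one distinct telephone number (when an id has several, the first one loaded is kept).

```sas
/* Sketch: fill missing telephone values with a hash lookup.
   telephone_filled and lk are hypothetical names, not part of the answer above. */
data telephone_filled (drop = rc);
 /* Load the rows that have a telephone number into memory once.
    By default only the first row per id is kept. */
 if _n_ = 1 then do;
  declare hash lk(dataset: "telephone_nos (where = (not missing(telephone)))");
  lk.definekey('id');
  lk.definedata('telephone');
  lk.definedone();
 end;

 set telephone_nos;

 /* find() copies telephone from the hash when the id exists;
    otherwise it returns a non-zero code and telephone stays missing */
 if missing(telephone) then rc = lk.find();
run;
```

Assigning the return code to rc keeps an unmatched id from being logged as an error, which is the role the _ERROR_ reset plays in the key join.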

Answer 1 (score: 1)

There is a simple solution that takes two simple steps.

data temp;
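   /* The : modifier applies an informat in list input; with a plain $,
      Visitor_ID would default to 8 characters and be truncated */
   input Date $ Visitor_ID :$30. Telephone_number :$30.;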
   datalines;
1-Mar-16    1000003634_4152228768 .
1-Mar-16    1000094865_1269576832 .
1-Mar-16    1000103735_1035466360 .
1-Mar-16    1000103735_1035466360   fda6a5563867eeebf19fb3
1-Mar-16    1000108145_3760680616 .
1-Mar-16    1000123010_2631619556 .
1-Mar-16    1000123010_2631619556   fda6a75c3765e0e8f797b4
1-Mar-16    1000126547_974397207 .
1-Mar-16    1000126592_2744218771 .
1-Mar-16    1000137177_3054387520 .
1-Mar-16    1000137208_498258799 .
1-Mar-16    1000137208_498258799    fda6a5563660e0ebf295b3
1-Mar-16    1000137460_2624495603 .
1-Mar-16    1000137460_2624495603   fda6a6583763eaeaf29eba
1-Mar-16    1000151867_3243977925 .
1-Mar-16    1000151867_3243977925   fda6a15a3f63eaedfb94b3
1-Mar-16    1000166048_3215927260 .
1-Mar-16    1000174960_357067493 .
1-Mar-16    1000178443_623552771 .
1-Mar-16    1000183569_2728954199 .
1-Mar-16    1000220805_3781532691 .
1-Mar-16    1000220805_3781532691   fda6aa5c3a64e0ebfb96b0
    ;
run;

First, create a list of unique visitor_id / telephone_number combinations from the observations where telephone_number is not missing:

proc sql;
    create table temp2 as select distinct
        visitor_id, telephone_number
        from temp (where = (not missing(telephone_number)));
quit;

Then join it back to the original table, taking the looked-up value wherever the original telephone_number is missing:

proc sql;
    create table temp3 as select
        a.date, a.visitor_id,
        case when missing(a.telephone_number) then b.telephone_number else a.telephone_number end as telephone_number
        from temp as a
        left join temp2 as b
        on a.visitor_id = b.visitor_id;
quit;
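
To see how many rows are still without a telephone number after the join, a quick check along these lines may help. This is only a sketch, assuming temp3 was created as above; the column alias still_missing is made up for illustration.

```sas
/* Sketch: count the rows in temp3 that still have no telephone number.
   still_missing is a hypothetical alias. */
proc sql;
    select count(*) as still_missing
        from temp3
        where missing(telephone_number);
quit;
```

A non-zero count would point to visitor_id values that never appear with a telephone number anywhere in temp.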

The above has a problem: some visitor_id values have no CTN in the dataset, because the dataset has grown over a long period of time.

This seems to work:

proc sql;
    create table temp3 as
    /* The inner join keeps only the visitor_id values that have a
       telephone_number in temp2 */
    select a.date, a.visitor_id, b.telephone_number
        from temp as a inner join temp2 as b
        on a.visitor_id = b.visitor_id;
quit;