Question

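I am matching files based on ID numbers. I need to format the data set of IDs to be matched so that the same ID number is never repeated in column a (because the ID in column b is the surviving ID once the match is complete). My list of IDs has more than one million observations, and the same ID may be repeated many times in either column or in both.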

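Here is an example of what I need:

Sample data

    ID1 ID2
    1   2
    3   4
    2   5
    6   1
    1   7
    5   8

The surviving IDs would be:

    2
    4
    5
    error - 1 no longer exists
    error - 1 no longer exists
    8

What I need

    ID1 ID2
    1   2
    3   4
    2   5
    6   5
    5   7
    7   8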

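I am probably in over my head as a SAS novice, but this is what I have tried, running it over and over because some of my IDs repeat more than 50 times.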

Proc sort data=Have;    
    by ID1;    
run;

The sort makes repeated ID1 values consecutive, so I can use LAG to replace a dead ID1 with the surviving ID2 from the previous row.

Data Want;
    set Have;
        by ID1;
    lagID1=LAG(ID1);  
    lagID2=LAG(ID2); 
    If NOT first. ID1 THEN DO;  
        If ID1=lagID1 THEN ID1=lagID2; 
        KEEP ID1 ID2;
        IF ID1=ID2 then delete;
   end;
run;

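This sort of works, but I still end up with some duplicates that will not resolve no matter how many times I run it (I would have looped it, but I don't know how), because they just flip back and forth between IDs that have other duplicates (I can get it down to about 2,000 of these).

I have figured out that instead of using LAG, I need to replace all values after the current row with ID2 for each ID1 value, but I cannot work out how to do that.

I want to read observation 1, find every later instance of its ID1 value in either the ID1 or the ID2 column, and replace that value with the current observation's ID2 value. Then I want to repeat the process with row 2, and so on.

For the example, I would look for any instance of the value 1 after the first row and replace it with 2, since that is the surviving ID of the pair. The 1 may be in either column, and I need all of them replaced. The second row would look for later values of 3 and replace them with 4, and so on. The end result should be that an ID number appears only once in the ID1 column (although it may appear several times in the ID2 column).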

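After reading the first row, the data set would look like this:

    ID1 ID2
    1   2
    3   4
    2   5
    6   2
    2   7
    5   8

Reading observation 2 would not change anything, since 3 does not occur again. After observation 3, the set would be:

    ID1 ID2
    1   2
    3   4
    2   5
    6   5
    5   7
    5   8

Again, observation 4 causes no change. But observation 5 produces the final change:

    ID1 ID2
    1   2
    3   4
    2   5
    6   5
    5   7
    7   8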

I tried using the following statements, but I cannot even tell whether I am on the wrong track or just cannot figure out the syntax.

Data want;
Set have;
      Do i=_n_;
          ID=ID2;
          Replace next var{EUID} where (EUID1=EUID1 AND EUID2=EUID1);
      End;
Run;

Thanks for your help!

Answer 1

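There is no need to work back and forth through the data file. You just need to retain the replacement information so that you can process the file in a single pass.

One way to do that is to create a temporary array that uses the values of the ID variables as the index. That is easy to do for your simple example with small ID values.

So, for example, if all of the ID values are integers between 1 and 1000, this step will do the job.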

data want ;
  set have ;
  array xx (1000) _temporary_;
  do while (not missing(xx(id1))); id1=xx(id1); end;
  do while (not missing(xx(id2))); id2=xx(id2); end;
  output;
  xx(id1)=id2;
run;

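You might need to add a test to prevent cycles (1 -> 2 -> 1).

For a more general solution, you should replace the array with a hash object. Something like this: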

data want ;
  if _n_=1 then do;
    declare hash h();
    h.definekey('old');
    h.definedata('new');
    h.definedone();
    call missing(new,old);
  end;
  set have ;
  do while (not h.find(key:id1)); id1=new; end;
  do while (not h.find(key:id2)); id2=new; end;
  output;
  h.add(key: id1,data: id2);
  drop old new;
run;
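
As a hedged sketch of the cycle test mentioned above: after both lookups finish, id1 and id2 are values that are not yet keys in the hash, so the only way a loop can be created is by storing an ID as its own replacement (for example after the rows 1 2 and 2 1). The variant below skips such rows before calling add. Dropping the row entirely is an assumption borrowed from the `IF ID1=ID2 then delete` line in the question, and numeric IDs are assumed as in the sample data.

data want;
  if _n_ = 1 then do;
    declare hash h();
    h.definekey('old');
    h.definedata('new');
    h.definedone();
    call missing(old, new);
  end;
  set have;
  /* Resolve both IDs to their current surviving values */
  do while (h.find(key: id1) = 0); id1 = new; end;
  do while (h.find(key: id2) = 0); id2 = new; end;
  /* Assumption: a row whose IDs resolve to the same survivor is redundant. */
  /* Storing id1 -> id1 would make the lookups above loop forever, so skip it. */
  if id1 = id2 then delete;
  output;
  h.add(key: id1, data: id2);
  drop old new;
run;

With the six sample rows no row is skipped, so the output should match the unguarded step.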

Answer 2

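Here is an implementation of the algorithm you proposed, using the modify statement to load and rewrite each row one at a time. It works for your trivial example, but with messier data you may still end up with duplicate values in ID1.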

data have;
input ID1 ID2 ;
datalines;
1 2    
3 4    
2 5
6 1 
1 7 
5 8 
;
run;

title "Before making replacements";
proc print data = have;
run;

/*Optional - should improve performance at cost of increased memory usage*/
sasfile have load;

data have;
    do i = 1 to nobs;
        do j = i to nobs;
            modify have point = j nobs = nobs;
            /* Make copies of target and replacement value for this pass */
            if j = i then do;
                id1_ = id1;
                id2_ = id2;
            end;
            else do;
                flag = 0; /* Keep track of whether we made a change */
                if id1 = id1_ then do;
                    id1 = id2_;
                    flag = 1;
                end;
                if id2 = id1_ then do;
                    id2 = id2_;
                    flag = 1;
                end;
                if flag then replace; /* Only rewrite the row if we made a change */                
            end;
        end;
    end;
    stop;
run;

sasfile have close;

title "After making replacements";
proc print data = have;
run;
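
Because messier data may still leave repeated ID1 values, a quick check after the rewrite can show whether any remain. This is only a sketch that assumes the data set and column names used above; the `n_rows` alias is made up for illustration.

title "ID1 values that still occur more than once";
proc sql;
  /* List every ID1 that appears on more than one row */
  select id1, count(*) as n_rows
  from have
  group by id1
  having count(*) > 1;
quit;
title;

If nothing is listed, every ID1 value is unique.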

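Bear in mind that because this modifies the data set in place, interrupting the data step while it is running could result in data loss. Make sure you take a backup first in case you need to roll back your changes.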
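
One simple way to take that backup, sketched here with a hypothetical data set name `have_backup`, is to copy the data set before the `sasfile` and `modify` steps run:

/* Keep an untouched copy before running the in-place MODIFY step */
data have_backup;
  set have;
run;

/* To roll back later, copy the backup over the modified data set */
data have;
  set have_backup;
run;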

Answer 3

It seems like this should do the trick, and it is fairly simple. Let me know if this is what you were looking for:

data have;
input id1 id2;
datalines;
1 2    
3 4    
2 5
6 1 
1 7 
5 8 
;
run;

%macro test();
  proc sql noprint;
     select count(*) into: cnt
     from have;
  quit;

  %do i = 1 %to &cnt;
     proc sql noprint;
        select id1,id2 into: id1, :id2
        from have
        where monotonic() = &i;quit;

     data have;
     set have;
     if (_n_ > input("&i",8.))then do;
        if (id1 = input("&id1",8.))then id1 = input("&id2",8.);
        if (id2 = input("&id1",8.))then id2 = input("&id2",8.);
     end;
     run;        
  %end;
%mend test;
%test();

Answer 4

This might be a little faster:

data have2;
input id1 id2;
datalines;
1 2    
3 4    
2 5
6 1 
1 7 
5 8 
;
run;

%macro test2();
   proc sql noprint;
      select count(*) into: cnt
      from have2;
   quit;

   %do i = 1 %to &cnt;
      proc sql noprint;
         select id1,id2 into: id1, :id2
         from have2
         where monotonic() = &i;

         update have2 set id1 = &id2         
         where monotonic() > &i
         and id1 = &id1;
      quit;
      proc sql noprint;
         update have2 set id2 = &id2         
         where monotonic() > &i
         and id2 = &id1;
      quit;
   %end;
%mend test2;
%test2();

SAS 9.4: replace all values after the current row based on the current value
