Question

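I have hundreds of thousands of IDs in a large dataset.

Some records have the same ID but different data points. Some of these IDs need to be merged into a single ID: people who registered for the system more than once should appear as just one person in the database.

I also have a separate file that tells me which IDs need to be merged, but it is not always a one-to-one relationship. For example, in many cases I have x -> y and then y -> z, because the person registered three times. I have a macro that is basically a set of if-then statements like the following: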

if ID='1111111' then do; ID='2222222'; end; 
if ID='2222222' then do; ID='3333333'; end;

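I believe SAS runs these one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run, and the list will only get longer. Is there a faster way to update these IDs?

Thanks

Edit: Here is an example, except that the real macro is over 15k lines long because of all the merges.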

data one; 
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;

%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;

data two; set one; %ID_Change; run;

Answer 1

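Hash tables will greatly speed up this process. Hash tables are one of the little-used but highly effective tools in SAS. They can seem a bit bizarre, since the syntax is very different from standard SAS programming. For now, think of a hash table as a way of merging data in memory (which is a big reason why it is so fast).

First, create a dataset that holds the translations you need. We want to match by ID and then convert it to New_ID. Think of ID as the key column and New_ID as the data column.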

dataset: translate

ID     New_ID
111111 222222
222222 333333
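
As a minimal sketch, the lookup table could be built from the merges in the question's example like this. The dataset and variable names follow this answer and the values come from the question; the `$5.` length is an assumption chosen to match the ID length in the question's dataset `one`.

```sas
/* Sketch only: lookup table for the question's three example merges.
   ID is the key (old ID), New_ID is the data (ID it is merged into). */
data translate;
    input ID :$5. New_ID :$5.;
    cards;
11111 77777
22222 88888
88888 99999
;
run;
```

With these rows, a record whose ID is 22222 has to follow two hops (22222 to 88888, then 88888 to 99999), which is the chained case the lookup needs to handle.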

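In a hash table you need to think about two things:

Key column(s)
Data column(s)

The data column is what replaces the observations that match the key column. In other words, New_ID is populated every time ID matches.

Next, you will want to do the hash merge. This is performed in a data step.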

data want;
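     set have;

     /* Never executes; at compile time it defines New_ID with the same
        type and length as in the translate dataset */
     if 0 then set translate;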

     /* Only declare the hash object on the first iteration. 
        Otherwise it will do this every record. */
     if(_N_ = 1) then do;
           declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
           id_h.defineKey('ID');                    *Define key for matching;
           id_h.defineData('New_ID');               *The new ID after matching;
           id_h.defineDone();                       *Done declaring this hash object;
           call missing(New_ID);                    *Prevents a warning in the log;
     end;

    /* If a customer has changed multiple times, keep iterating until 
       there is no longer a match between tables */
    do while(id_h.Find() = 0);

        _loop_count+1; *Tells us how long we've been in the loop;

       /* Just in case the while loop gets to 500 iterations, then 
          there's likely a problem and you don't want the data step to get stuck */
       if(_loop_count > 500) then do;
            put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
            leave;
       end; 

        /* If the ID of the hash table matches the ID of the dataset, then
           we'll set ID to be New_ID from the hash object */
        ID = New_ID; 
    end;

    _loop_count = 0;

   drop _loop_count;
run;

This should run very quickly and give you the desired output, assuming your lookup table is coded the way you need it to be.

Answer 2

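Use PROC SQL or a MERGE step against your separate file (after you have turned it into its own dataset with infile or proc import) to attach a unique ID to every record. If your separate file only contains the duplicates, you will need to create a dummy unique ID for the non-duplicates.
PROC SORT BY the unique ID and the registration timestamp.
Use a DATA step with BY variables. Depending on whether you want to keep the first or the last registration, do if first.unique_id then output; (or last., etc.), where unique_id stands for your unique ID variable.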
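
A minimal sketch of the sort and BY-group steps, assuming the unique ID has already been attached. The names `registrations`, `unique_id` and `reg_timestamp` are hypothetical placeholders, not names from the question.

```sas
/* Hypothetical names: registrations (input), unique_id, reg_timestamp */
proc sort data=registrations out=reg_sorted;
    by unique_id reg_timestamp;
run;

data one_per_person;
    set reg_sorted;
    by unique_id;
    /* last.unique_id flags the latest row within each unique_id;
       use first.unique_id instead to keep the earliest registration */
    if last.unique_id then output;
run;
```

Because the rows are sorted by timestamp within each unique ID, the first./last. flag on the ID variable is what picks one row per person.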

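Or you could do it all in a single PROC SQL: a left join to the separate file, a coalesce step that returns the dummy unique ID when the record is not in the separate file, group by the unique ID, and having max(timestamp) (or min). You can also coalesce any other variables you might want to retain between registrations - for example, if the first registration contained a phone number and later registrations are missing that data point.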
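
A rough sketch of that single-query version, under several assumptions: `registrations` and `reg_timestamp` are hypothetical names, `translate` is assumed to be a lookup dataset with columns ID and New_ID, and records without a match simply keep their own ID as the unique ID.

```sas
/* Sketch only: keeps the latest registration per unique ID */
proc sql;
    create table one_per_person as
    select coalesce(x.New_ID, r.ID) as unique_id,  /* fall back to the original ID */
           r.*
    from registrations as r
         left join translate as x
         on r.ID = x.ID
    group by calculated unique_id
    having r.reg_timestamp = max(r.reg_timestamp); /* use min() to keep the earliest */
quit;
```

Note that this resolves only one hop of the lookup, so chained merges (x -> y, then y -> z) would have to be flattened in the lookup table first. Ties on the timestamp would also keep more than one row per person.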

It is hard to be more specific without a reproducible example.

SAS updating records sequentially
