Question

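If an id in temp matches an id in hist, I want to delete that row from hist and insert the row from temp; if the id does not match any row in hist, I want to append the row to hist. I have two datasets with the same columns: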

data hist;
input id1 id2 var1 $;
cards;
1 10 a
2 20 b
3 30 c 
4 40 d
5 50 e
;
run;
data temp;
input id1 id2 var1 $;
cards;
2 20 b
3 30 d
4 40 e
5 50 f
6 60 g
;
run;

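temp holds the current state, and history will contain all of the historical rows.

I want to delete and re-insert a row in the history dataset if it also exists in temp (an update), and append a row from temp to the history dataset if it does not yet exist in history. The temp dataset will have at least 100 records. From the input above, I would like output like this: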

history

1 10 a
2 20 b
3 30 d
4 40 e
5 50 f
6 60 g

Rows 1, 2, 3 and 4 of temp match rows in history, so those rows are updated; row 5 of temp has no match, so it is appended to history.

Sorry for the earlier confusion. I hope it is clear now. Thanks, Sam.

Answer 1

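There is a way to let SAS and PROC APPEND do this for you.

Since I don't know your data columns, I'll speak in general terms. I assume you have one or more fields that define uniqueness.

First, create a unique index on HISTORY: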

proc sql;
create unique index hist_unq on HISTORY(col1, col2, ...);
quit;

Then use PROC APPEND:

proc append base=history data=temp force;
run;

You will see a warning in the log, and the number of observations added will be lower than the number read. Something like:

NOTE: Appending WORK.TEMP to WORK.HISTORY.
WARNING: Duplicate values not allowed on index hist_unq for file HISTORY, 36 observations rejected.
NOTE: There were 70 observations read from the data set WORK.TEMP.
NOTE: 34 observations added.
NOTE: The data set WORK.HISTORY has 144 observations and 2 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
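
As a rough sketch, the same two steps applied to the question's hist and temp datasets might look like the following. It assumes that the pair (id1, id2) is what defines uniqueness, and the index name hist_key is made up for illustration.

```sas
/* Assumption: (id1, id2) together identify a row in hist. */
/* hist_key is a hypothetical name for the composite index. */
proc sql;
  create unique index hist_key on hist(id1, id2);
quit;

/* Rows of temp whose (id1, id2) already exist in hist are rejected */
/* with a warning; with the question's data only id1 = 6 is added.  */
proc append base=hist data=temp;
run;
```

Rejected rows are skipped rather than replaced, so var1 in the matching rows of hist keeps its old value.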

Answer 2

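I think DomPazz has the best answer so far because of its simplicity, but if you are in a situation where you cannot conveniently define a unique index on history, or you really want to avoid any warning messages, then the following more complicated data step approach works. It should be about as fast as proc append, while avoiding the memory and CPU requirements of the hash object approach set out by Joe.

N.B. Although this does not require a unique index on history, if temp contains more rows for any matching id than history does, unwanted rows will be appended to history.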

data history;
input id var1 $;
cards;
1 a
2 b
3 c 
4 d
5 e
5 f
;
run;

data temp;
input id var1 $;
cards;
3 d
4 e
5 f
6 g
6 h
;
run;

proc datasets lib = work nolist;
    modify history;
    index create id;
    run;
quit;

data history;
    set temp;
    modify history key = id;
    if _iorc_ ne 0 then do;
        _ERROR_ = 0;
        output;
    end;
run;

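How this works:

1. Read a record from temp (the set statement).
2. Try to read the first record from history with a matching id.
3. If no match is found, output the new record.

Because nothing is ever read in from history for a non-matching id from temp, the values of all the other variables are still in the PDV from when they were read from temp in step 1.

The index on history is not updated until the data step has finished adding, modifying and deleting rows. So for the last row of temp, even though a row with id = 6 has already been added to history, it is not found via the index in later iterations of the same data step, and both rows are added.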

Edit: an alternative version that also updates the records in history that have matching IDs:

data history;
    set temp(rename = (var1 = new_var1));
    do _n_ = 1 by 1 until(eof);
        modify history key = id end = eof;
        if _iorc_ = 0 then do;
            var1 = new_var1;
            replace;
        end;
        else do;
            _ERROR_ = 0;
            if not(eof and _n_ > 1) then output;
        end;        
    end;
run;

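One drawback here is that you have to rename all of the non-id variables in temp, because when the modify statement reads in a row from history, it overwrites the variables of the same name in the PDV. If you have unique indexes on id in both temp and history, you can avoid this as follows: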

data history;
    set temp(keep = id);
    modify history key = id;
    if _iorc_ = 0 then do;
        set temp key = id;
        replace;
    end;
    else do;
        _ERROR_ = 0;
        output;
    end;        
run;

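If a matching record is read in from history, overwriting what was first read from temp, the additional set statement reads the relevant record from temp a second time.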
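
For reference, the unique indexes that this last version relies on could be created roughly as below. This is only a sketch: it assumes that neither dataset contains duplicate id values (unlike the example data above, where index creation would fail) and that no index named id exists yet on either dataset.

```sas
/* Assumption: id is unique within history and within temp,    */
/* and neither dataset already has an index called id.         */
proc datasets lib = work nolist;
    modify history;
    index create id / unique;
    modify temp;
    index create id / unique;
    run;
quit;
```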

Answer 3

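One way to do what you describe is a union in SQL. By default, union does not append duplicate records. However, it does take some time (because it has to identify those records).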

proc sql;
  create table history_new as
  select * from history
  union
  select * from temp;
quit;

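If you have enough memory to load the keys of history into an in-memory hash table, that is probably the fastest option. Load history into a hash, set temp, find() the current row, and if it is not found, add the row to the hash. Then, at the end, output the hash back to history.

Depending on the relative sizes of temp and history, you could instead output only the rows to be added to a dataset, rather than adding them to the hash, and then proc append that dataset.

That is probably the better option if temp is less than a quarter or so of the size of history.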
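
The load, find and output steps described above could be sketched along these lines, using the question's hist and temp datasets. The key (id1, id2) and the output name hist_new are assumptions for illustration; the result goes to a new dataset because hist is still open for reading in this step.

```sas
data _null_;
  /* Put id1, id2 and var1 in the PDV without reading any rows. */
  if 0 then set hist;

  /* Load all rows of hist into the hash (assumed key: id1, id2). */
  declare hash h(dataset:'hist', ordered:'a');
  h.defineKey('id1', 'id2');
  h.defineData('id1', 'id2', 'var1');
  h.defineDone();

  do until (eof);
    set temp end=eof;
    /* find() returns non-zero when the key is absent; the PDV   */
    /* then still holds the temp row, which add() stores.        */
    if h.find() ne 0 then h.add();
  end;

  /* hist_new is a hypothetical name for the combined result. */
  h.output(dataset:'hist_new');
  stop;
run;
```

If matching rows should also take their values from temp, as the question asks, one option is to call h.replace() for every temp row instead of the find() and add() pair, since replace() also adds the key when it is not yet present.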

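data temp_to_Add(drop=rc);
  set temp;
  /* build the lookup of history keys once, on the first iteration */
  if _n_=1 then do;
    declare hash h(dataset:'history');
    h.defineKey('keyvars');  /* replace with the actual key variable name(s) */
    h.defineDone();
  end;
  /* keep only the temp rows whose key is not in history */
  rc = h.find();
  if rc ne 0 then output;
run;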

If you also need to check temp against itself, add the row to the hash when rc ne 0.
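
A minimal sketch of that variant, followed by the proc append step mentioned earlier, might look like this. It uses the question's hist and temp datasets and assumes (id1, id2) is the key; temp_to_add is the same kind of staging dataset as above.

```sas
data temp_to_add(drop=rc);
  set temp;
  if _n_ = 1 then do;
    /* Only the keys of hist are loaded (assumed key: id1, id2). */
    declare hash h(dataset:'hist');
    h.defineKey('id1', 'id2');
    h.defineDone();
  end;
  rc = h.check();
  if rc ne 0 then do;
    output;
    /* Remember the key so a later row of temp with the same key is skipped. */
    h.add();
  end;
run;

/* Append only the new rows to hist. */
proc append base=hist data=temp_to_add;
run;
```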

Update history table with delete and insert
