Question

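I have a dataset named IOA.HAVE on a server that is updated daily. It contains the columns Date, Area, LocnID, ATTR1, ATTR2, ... ATTR10. To simplify the problem, let's say it looks like this: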

Date    Area LocnID A1 A2
01Nov14 AAA  100000 50 G
01Nov14 AAA  100001 30 G
01Nov14 AAA  100002 30 K
01Nov14 BBB  100003 20 K
02Nov14 CCC  100009 30 C  
02Nov14 AAA  100000 50 G

LocnID is unique within each Date.

Another local file (.xlsx) named Adjustment is imported into SAS every day via proc import.

Date    Area LocnID A1 A2 Type
02Nov14 BBB  100000 50 G  change
02Nov14 CCC  100009 30 C  close
03Nov14 DDD  200000 20    open

Its columns are the same as those in HAVE, apart from the extra Type column.

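If Type = change, then from that date onward all attributes of that LocnID in HAVE should be replaced with the ones in Adjustment. If Type = close, then from that date onward all records for that LocnID should be deleted from HAVE. If Type = open, then from that date onward a new record should be added to HAVE.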

After the adjustment, IOA.HAVE should be

Date    Area LocnID A1 A2
01Nov14 AAA  100000 50 G
01Nov14 AAA  100001 30 G
01Nov14 AAA  100002 30 K
01Nov14 BBB  100003 20 K
02Nov14 BBB  100000 50 G    /* change Area */
....
....
03Nov14 DDD  200000 20      /* open */

Currently I do this:

data t1 t2;
set adjustment;
if type in ('change','close') then output t1;
if type in ('change','open') then output t2;
run;

proc sql;
create table a1 as 
select * from IOA.HAVE as a
where not exists (select * from work.t1 where a.Date >= t1.Date and a.LocnID = t1.LocnID)
union
select * from t2 
where t2.Date <= today()
order by Date, LocnID;
quit;

But this is very inefficient. How can I optimize it (preferably in a more SAS-like way rather than 'SQL')?

Answer 1

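If your tables are both sorted by date locnID, your approach isn't bad. You might consider changing the exists query to a join, but I think SQL should end up optimizing them to the same query at the end of the day (but check!). I'd say this is a fairly standard way to handle add/update/delete transactions: delete (update + delete), then insert (update + add). Actually performing a delete in place would be more efficient than recreating the table.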
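The in-place delete-then-insert idea could be sketched roughly as follows. This is only an illustration, not something that was timed: it assumes the t1 / t2 split from the question, a table named have, and the simplified column names Date, Area, LocnID, A1 and A2 (a real table with ATTR1 to ATTR10 would need its own column list). It keeps the question's semantics, so every row on or after a change or close date is removed.

```sas
proc sql;
  /* Remove every HAVE row on or after the effective date of a
     change/close for the same location (same condition as the
     NOT EXISTS filter in the question). */
  delete from have as a
    where exists (select 1
                    from t1
                    where t1.LocnID = a.LocnID
                      and a.Date >= t1.Date);

  /* Append the change/open rows that are already effective.
     The column list is an assumption based on the simplified example. */
  insert into have (Date, Area, LocnID, A1, A2)
    select Date, Area, LocnID, A1, A2
      from t2
      where Date <= today();
quit;

/* Inserted rows are appended at the end, so re-sort if later steps
   depend on the Date, LocnID order. */
proc sort data=have;
  by Date LocnID;
run;
```

Whether this beats recreating the table will depend on how many rows are touched, so it is worth timing both on real data.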

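Interestingly, adding an index made your query worse, and much worse at that. I ran the query on sorted tables without an index in about 30 seconds (table sizes ~500k rows for have, ~1k for adjustment). With an index it took 3 minutes. The problem is the greater-than comparison on date; that performs very badly with an index, and it is quite common to see indexes hurt queries like this.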

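In SAS, you can do this easily with MERGE. I don't think MODIFY is the right route, because we are doing something that is hard to replicate in a MODIFY statement. It might work, but I find MERGE easier to code here.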

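Here is my example, with your SQL code included below for comparison. There is one major difference: in my SAS code I don't delete after a change, which means I end up with more rows than you do, since yours deletes all rows after a change. You could do that in the SAS approach too, but your written specification doesn't sound like that is what you really want.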

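First, I just create the tables to test with. This is about half a million rows in HAVE and about a thousand adjustments. You probably have more, or you wouldn't be asking, but this should give an idea of the speed.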

data have;
  array locnIDs[1000] _Temporary_;
  do _t = 1 to dim(locnIDs);
    locnIDs[_t] = 1;
  end;
  call streaminit(7);
  do date = '01NOV2011'd to '01DEC2014'd;
    do _t = 1 to dim(locnIDs);
      if rand('Uniform') < 0.02 then locnIDs[_t]=not(locnIDs[_t]);
      if locnIDs[_t] then do;
        locnID = 100000+_t;
        A1=2**_t;
        A2=byte(mod(_t,26)+65);
        Area = repeat(byte(mod(_t,26)+65),2);
        output;
      end;
    end;
  end;
run;

data adjustment;
  call streaminit(7);
  set have;
  by date locnID;
  length type $6;
  retain lastlocnID;
  if (first.date) then lastlocnID=.;
  if lastlocnID gt 0 and locnID-lastlocnID gt 1 then do;
    if rand('Uniform') lt 0.001 then do;       
        locnID=locnID-1;
        type='open';
        output;
    end;
    else if rand('Uniform') lt 0.001 then do;
        type='close';
        output;
    end;
  end;
  else do;
    if rand('Uniform') lt 0.001 then do;
        type='change';
        Area = '###';
        output;
    end;
  end;
  lastlocnID=locnID;
run;

Next I split into t1 / t2 and run the SQL approach from above.

data t1 t2;
  set adjustment;
  if type in ('change','close') then output t1;   *t1 is changes/deletes;
  if type in ('change','open') then output t2;    *t2 is all adds/changes;
run;
proc sql;
create table a1 as 
  select * from HAVE as a
    where not exists (select * from work.t1 where a.Date >= t1.Date and a.LocnID = t1.LocnID)
  union
  select * from t2 
    where t2.Date <= today()
  order by Date, LocnID;
quit;

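Now here is the SAS merge, which is much faster. Basically I sort by locnid date so that changes to a location can be propagated across dates, then use retained temporary variables to hold the changed values that need to be propagated down, and another one to flag deletes. Note that there are more records, because I am propagating the changes rather than deleting all later records for a changed location. If you do want to delete them, this approach still works and is even simpler (you can skip all of the __a1=a1 and a1=__a1 stuff, and just assign __delflag=1 after the output statement).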

proc sort data=have;
  by locnid date;
run;

proc sort data=adjustment;
  by locnid date;
run;

data have2;
  merge have(in=_h) adjustment(in=_a);
  by locnid date;
  retain __delflag __a1 __a2 __area;
  if first.locnid then do;   *clear the flags;
    call missing(of __delflag __a1 __a2 __area);
  end;
  if not _a and not missing(__a1) then do;  *if a previous change is pending;
    a1=__a1;
    a2=__a2;
    area=__area;
  end;
  if _a and _h and type='change' then do;  *if a change is implemented;
    __a1=a1;
    __a2=a2;
    __area=area;
    __delflag=0;  *Not sure if this should be possible, but in case;
  end;
  else if _a and _h and type='close' then do;  *if a delete is needed;
    __delflag=1;
  end;

  if __delflag=1 then delete;
  output;
run;

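Because SAS does not need to go back in time and check every record, this takes a fraction of a second. I'm not sure it replicates exactly what you want, but it should do something close to it.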

An exact replica of a1 in SAS looks like this:

data have3;
  merge have(in=_h) adjustment(in=_a);
  by locnid date;
  retain __delflag;
  if first.locnid then do;
    call missing(__delflag);
  end;
  if _a and _h and type = 'close' then do;
    __delflag=1;
  end;
  if __delflag=1 and not (type in ('open','change')) then delete;
  output;
  if _a and _h and type='change' then __delflag=1;
  if last.locnid then do;
    call missing(of __:);
  end;
run;

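I removed the propagation part, added a delete for records that follow a change, and added some conditions to the delete so that later open/change records are not removed. (These may not be possible in real data; this replicates a1 on my fake data, where an open or change can follow a close.)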
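One way to check that have3 really matches a1 is PROC COMPARE. The following is a minimal sketch, assuming a1 and have3 exist as created above; the name have3_sorted is a hypothetical work table introduced only for this check. have3 is in locnid date order, so it is re-sorted to match the order of a1 first.

```sas
/* have3 is sorted by locnid date; a1 is ordered by Date, LocnID. */
proc sort data=have3 out=have3_sorted;   /* have3_sorted: hypothetical name */
  by date locnid;
run;

/* Match rows on the key and compare only the attribute columns,
   ignoring helper variables such as __delflag and type. */
proc compare base=a1 compare=have3_sorted;
  id date locnid;
  var area a1 a2;
run;
```

If the two tables agree, the PROC COMPARE summary should report no unequal values for the listed variables.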

Updating a table from another table in SAS
