Question

I have a master dataset that contains missing values.

A sample looks like this:

Date    Index1    Index2  Key
01NOV    20          .    a
02NOV     .         30    a
02NOV    10         20    a

I also have an update dataset that contains no missing values.

Date    Index1    Index2  Key
01NOV    10         10    a
02NOV     5         40    a

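The idea is: where Date and Key match and the master dataset has a missing value in an index column, replace it with the corresponding value from the update dataset. Otherwise, keep the existing value.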

The output should be:

Date    Index1    Index2  Key
01NOV    20         10    a
02NOV     5         30    a
02NOV    10         20    a

My code is below:

proc sql;
update master as a
set index1 = case when a.index1 ^= . then a.index1 else (select index1 from update as b where a.Date = b.Date and a.Key = b.Key) end,
index2 = case when a.index2 ^= . then a.index2 else (select index2 from update as b where a.Date = b.Date and a.Key = b.Key) end;
quit;

But both master and update are very large. Is there a way to optimize this?

Edit

How can I update master only within a specific time period? Something like where a.Date = b.Date and a.Date between sDate and eDate?

Answer 1

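If the SQL update is too slow, the best way to do this is probably to create a format or a hash table, depending on your available memory and how many variables you have. Even with properly indexed tables, SQL update tends to be slow in this situation.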

That said, with properly indexed tables it may be worth trying SQL update first.

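Make sure all tables are sorted by date.
Create an index on date on both tables (the example below indexes key and date together).
Update one at a time.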
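The sorting step above can be sketched as follows. This is a minimal sketch that assumes the dataset names sample and update_t from the answer's own example, and that key and date are the join columns. The index creation itself appears in the PROC SQL step further down.

```sas
/* Sort both tables by the join columns before indexing them. */
/* sample and update_t are the names used in the answer's example. */
proc sort data=sample;
  by key date;
run;

proc sort data=update_t;
  by key date;
run;
```

If the tables were already created in this order, as they are in the example, the sort can be skipped.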

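This example was fast for me: roughly 4 minutes for 6.5MM / 1.5MM rows, where about half of the 6.5MM rows needed updating. 150MM rows will obviously take longer, but the total time should scale reasonably well.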

data sample;
  call streaminit(7);
  do key = 1 to 1000;
      do date = '01JAN2011'd to '31DEC2014'd;
        do _t = 1 to rand('Normal',5,2);
          if rand('Uniform') < 0.8 then val1=10;
          if rand('Uniform') < 0.6 then val2=20;
          output;
          call missing(of val1, val2);
        end;
      end;
  end;
run;

data update_t;
  do key = 1 to 1000;
      do date='01JAN2011'd to '31DEC2014'd;
        val1=10;
        val2=20;
        output;
      end;
  end;
run;


proc sql;
  create index keydate on sample (key, date);
  create index keydate on update_t  (key, date);

  update sample S
    set val1=coalesce(val1,
        (select val1 from update_t U where U.key = S.key and U.date=S.date)),
        val2=coalesce(val2,
        (select val2 from update_t U where U.key = S.key and U.date=S.date))
    where n(s.val1,s.val2) < 2;
quit;

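I made sure that only rows with a missing val get updated by using the where clause; otherwise this is pretty standard. Unfortunately, SAS won't do an update on a join (it may have the same effect on the back end, but you can't say update S,U set S.blah=U.blah as you can in some other SQL dialects). Note that here both the SAMPLE and UPDATE_T tables are sorted (because I created them sorted); if they weren't sorted, you would need to sort them for optimal behavior.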
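For the date-range restriction raised in the question's edit, one option is to extend the same where clause. This is a sketch only: the macro variables sDate and eDate take their names from the question, and the literal dates assigned to them here are hypothetical.

```sas
/* Hypothetical bounds; substitute the period you actually want to update. */
%let sDate = '01JAN2012'd;
%let eDate = '31DEC2012'd;

proc sql;
  update sample S
    set val1=coalesce(val1,
        (select val1 from update_t U where U.key = S.key and U.date=S.date)),
        val2=coalesce(val2,
        (select val2 from update_t U where U.key = S.key and U.date=S.date))
    /* Only touch rows that have a missing value and fall inside the period. */
    where n(S.val1,S.val2) < 2
      and S.date between &sDate and &eDate;
quit;
```

Because the filter is on the outer table, rows outside the period are never passed to the subqueries.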

If you want a faster option, formats or hash tables are your friend. I'll show the format approach here.

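data update_formats;
  set update_t;
  length start $50;
  /* Lookup key: key and date joined with '|', matching the put() calls below */
  start=catx('|',key,date);
  /* One CNTLIN row per format: $VAL1F maps the key to val1, $VAL2F to val2 */
  label=val1;
  fmtname='$VAL1F';
  output;
  label=val2;
  fmtname='$VAL2F';
  output;
  /* Once only: add an OTHER row (hlo='o') to each format so unmatched keys map to a missing value */
  if _n_=1 then do;
    hlo='o';
    label=' ';
    start=' ';
    output;
    fmtname='$VAL1F';
    output;
  end;
run;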

proc sort data=update_formats;
by fmtname;
run;

proc format cntlin=update_formats;
quit;

data sample;
  modify sample;
  if n(val1,val2) < 2;  *where is slower for some reason;
  val1=coalesce(val1,input(put(catx('|',key,date),$VAL1F.),best12.));
  val2=coalesce(val2,input(put(catx('|',key,date),$VAL2F.),best12.));
run;

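This uses a format to convert key + date into val1 or val2. It will be faster than SQL update unless the number of rows in the update table is very high (1.5MM should be fine, although eventually formats start to slow down). The total time for this tends to be not much longer than the time needed to write the table. In this case (baseline: 2 seconds to write SAMPLE initially), loading the formats took 13 seconds and using them to write the new SAMPLE dataset took another 13 seconds. That is under 30 seconds in total, versus 4 minutes for the SQL update, and there is no need to create an index or sort the larger table.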
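The hash table alternative mentioned above is not shown in the answer. A minimal sketch of how it might look is below, assuming update_t fits in memory. The output dataset name sample_updated and the renamed variables u_val1 and u_val2 are hypothetical, introduced only for this illustration; no timings were measured for it.

```sas
data sample_updated;
  /* Define u_val1/u_val2 in the program data vector without reading any rows. */
  if 0 then set update_t(rename=(val1=u_val1 val2=u_val2));

  /* Load update_t into an in-memory hash once, keyed on key and date. */
  if _n_ = 1 then do;
    declare hash h(dataset: 'update_t(rename=(val1=u_val1 val2=u_val2))');
    h.defineKey('key', 'date');
    h.defineData('u_val1', 'u_val2');
    h.defineDone();
  end;

  set sample;

  /* Look up only the rows that have at least one missing value. */
  if n(val1, val2) < 2 then do;
    if h.find() = 0 then do;   /* 0 means a matching key/date was found */
      val1 = coalesce(val1, u_val1);
      val2 = coalesce(val2, u_val2);
    end;
  end;

  drop u_val1 u_val2;
run;
```

Unlike the modify step above, this sketch writes a new dataset rather than updating SAMPLE in place, and it needs no sort or index on either table.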

Optimizing an update statement that only fills in missing values in the master dataset
