Question

The dataset looks like this:

Code    Type     Rating
0001    NULL      1
0002    NULL      1
0003    NULL      1
0003    PA 1      3
0004    NULL      1
0004    PB 1      2
0005    AC 1      3
0005    NULL      6
0006    AC 1      2

I would like the output dataset to look like this:

Code    Type     Rating
0001    NULL      1
0002    NULL      1
0003    PA 1      4        
0004    PB 1      3        
0005    AC 1      9        
0006    AC 1      2

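For each Code, Type has at most two values. I want one unique record per Code, with Rating summed over that Code. The difficulty is Type: if a Code has only one Type value, that value should be passed to the output dataset. If it has two values (one of which must be NULL), the value that is not NULL should be passed to the output dataset.

The total number of observations is N > 100,000,000. Is there an efficient way to achieve this?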

Answer 1

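Provided the data is sorted as in your example, this can be done in a single data step. I've assumed the NULL values are actually missing; if they are not, change [if missing(type)] to [if type = 'NULL']. All the step does is sum the Rating values for each Code and then output the last record of each Code, keeping the non-null Type. If your data is not sorted or indexed on Code, you will need to sort it first, which will obviously add to the execution time.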

/* create input file */
data have;
infile datalines dsd; /* set DSD (comma-delimited) before INPUT reads the records */
input Code Type $ Rating;
datalines;
0001,,1
0002,,1
0003,,1
0003,PA 1,3
0004,,1
0004,PB 1,2
0005,AC 1,3
0005,,6
0006,AC 1,2
;
run;

/* create summarised dataset */
data want;
set have;
by code;
retain _type; /* temporary variable */
drop _type _rating_sum; /* keep the temporary variables out of the output */
if first.code then do;
    _type = type;
    _rating_sum = 0; /* reset sum */
end;
_rating_sum + rating; /* sum rating per Code */
if last.code then do;
    if missing(type) then type = _type; /* pick non-null value */
    rating = _rating_sum; /* insert sum */
    output;
end;
run;
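
If have is not already sorted or indexed on Code, a sort step along the following lines would have to run before the summarising data step. This is only a minimal sketch: have_sorted is a hypothetical dataset name, and the data step above would then need to read set have_sorted; instead of set have;.

```sas
/* sort by Code so that BY-group processing (first.code / last.code) works */
proc sort data=have out=have_sorted;
    by code;
run;
```

Writing to a separate output dataset leaves the original untouched; dropping the out= option would sort have in place instead.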

Answer 2

This is also easy to do in a single SQL step. Just use CASE ... WHEN ... END to blank out the NULLs, then MAX to pick up the non-null value.

data have;
input @1 Code 4.
      @9 Type $4.
      @19 Rating 1.;
datalines;
0001    NULL      1
0002    NULL      1
0003    NULL      1
0003    PA 1      3
0004    NULL      1
0004    PB 1      2
0005    AC 1      3
0005    NULL      6
0006    AC 1      2
;;;;
run;

proc sql;
create table want as
    select code, 
        max(case type when 'NULL' then '' else type end) as type,
        sum(Rating) as rating
        from have
        group by code;
quit;

If you want the NULLs back, you need to wrap the select in a select code, case type when ' ' then 'NULL' else type end as type, rating from ( ... ); though I would recommend leaving them blank.
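
Written out in full, the wrapped query might look like the sketch below. It assumes the have dataset from this answer, where Type is a 4-character column holding the literal string 'NULL'; want_null is a hypothetical table name.

```sas
proc sql;
create table want_null as
    /* outer query: turn blank types back into the string 'NULL' */
    select code,
        case type when ' ' then 'NULL' else type end as type,
        rating
    from (
        /* inner query: same aggregation as above */
        select code,
            max(case type when 'NULL' then '' else type end) as type,
            sum(rating) as rating
        from have
        group by code
    );
quit;
```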

Answer 3

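Given the comments, another possibility is a hash solution. This is memory constrained, so it may or may not be able to handle the actual data (each hash entry is not very big, but 100M rows probably means 60 or 70M entries in the hash table, and that many entries at 40 or 50 bytes each is still pretty large).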
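
To put rough numbers on that estimate, the arithmetic can be done in a throwaway data step. The row counts and bytes per entry are the guesses from the paragraph above, not measurements, and the real per-entry overhead of a SAS hash object may well be higher.

```sas
data _null_;
    /* assumed figures taken from the estimate above */
    low_gib  = 60e6 * 40 / 1024**3;  /* 60M entries at 40 bytes each */
    high_gib = 70e6 * 50 / 1024**3;  /* 70M entries at 50 bytes each */
    put low_gib= 6.2 high_gib= 6.2;  /* printed in GiB */
run;
```

That works out to somewhere between about 2.2 and 3.3 GiB of payload, which is why the approach depends on how much memory the session is allowed to use.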

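If the dataset is sorted by Code, this is almost certainly inferior to the ordinary data step approach, so it should only be used for unsorted data.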

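Concept:

Create a hash table keyed on code.
If the incoming record has a new code, add it to the hash table.
If the incoming record does not have a new code, take the retrieved values and add the rating to the sum. Check whether the type needs to be replaced.
Output to a dataset.

Code: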

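data _null_;
  if _n_=1 then do;
      if 0 then set have; /* define code, type and rating without reading any rows */
      declare hash h(ordered:'a'); /* ascending key order, so the output is sorted by code */
      h.defineKey('code');
      h.defineData('code','type','rating');
      h.defineDone();
  end;
  /* rename the incoming values so they do not clash with what find() retrieves */
  set have(rename=(type=type_in rating=rating_in)) end=eof;
  rc_1 = h.find(); /* 0 means this code is already in the hash table */
  if rc_1 eq 0 then do;
    /* keep the non-NULL type (Type holds the literal string 'NULL' here) */
    if type ne type_in and type='NULL' then type=type_in;
    rating=sum(rating,rating_in); /* add the incoming rating to the stored sum */
    h.replace();
  end;
  else do;
    /* first record for this code: store it as is */
    type=type_in;
    rating=rating_in;
    h.add();
  end;
  if eof then do;
    h.output(dataset:'want'); /* write one row per code */
  end;
run;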

Efficient way to select unique records in SAS
