Question

Suppose I have a data set D1 that looks like this:

ID   ATR1   ATR2   ATR3  
1     A      R     W
2     B      T     X
1     A      S     Y
2     C      T     E
3     D      U     I
1     T      R     W
2     C      X     X

From it I want to create a data set D2 that looks like this:

ID   ATR1   ATR2   ATR3  
1     A      R      W
2     C      T      X
3     D      U      I

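In other words, D2 contains one row for each unique ID in D1. For each ID in D2, the value of each of ATR1-ATR3 is the most frequent value of that variable among the D1 records with the same ID. For example, for ID = 1 in D2, ATR1 = A (the most frequent value).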

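I have a very clumsy solution. I simply sort a copy of D1 three times (for example, by ID and ATR1) and remove duplicates, then merge the three resulting data sets to get what I want. I suspect there is a more elegant way to do this, though. My real data set has about 20 such variables.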

Thanks

Answer 1

/* 
read and restructure so we end up with:

id attr_id value 
1 1 A
1 2 R
1 3 W
etc.
*/

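data a(keep=id attr_id value);
    length value $1;
    /* declaring the array before INPUT makes attr_1-attr_3 character variables of length 1 */
    array attrs_{*} $ 1 attr_1 - attr_3;
    infile cards;
    input id attr_1 - attr_3;
    /* write one output row per attribute: id, attribute number, value */
    do attr_id=1 to dim(attrs_);
        value = attrs_{attr_id};
        output;
    end;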
cards;
1     A      R     W
2     B      T     X
1     A      S     Y
2     C      T     E
3     D      U     I
1     T      R     W
2     C      X     X
;
run;

/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;

/* Sort so that the most frequent value per id and attr_id ends up at the bottom of its group.
   If there are ties, nothing here controls which of the tied values ends up last. */
proc sort data = freqs; 
by id attr_id count; 
run;

/* read and recreate the original structure. */ 
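data b(keep=id attr_1 - attr_3);
    retain attr_1 - attr_3;
    array attrs_{*} $ 1 attr_1 - attr_3;
    set freqs;
    by id attr_id;
    /* new id: clear the values retained from the previous id */
    if first.id then do;
        do i=1 to dim(attrs_);
            attrs_{i} = ' ';
        end;
    end;
    /* the last row of each id/attr_id group carries the highest count */
    if last.attr_id then do;
        attrs_{attr_id} = value;
    end;
    /* all attributes for this id are filled in: write one row */
    if last.id then do;
        output;
    end;
run;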
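
If the tie-breaking needs to be reproducible, one option is to add value as a final sort key. The sketch below is a variation on the sort step above, not part of the original answer. It assumes the freqs data set produced by the PROC FREQ step, and preferring the alphabetically first value on ties is an arbitrary choice made for illustration.

```sas
/* Variation on the sort step: break ties in count explicitly.
   Within each id/attr_id group, rows are ordered by ascending count and then
   by descending value, so the last row of the group has the highest count
   and, among equally frequent values, the alphabetically first one. */
proc sort data=freqs;
    by id attr_id count descending value;
run;
```

The final data step can stay as it is, because its BY statement names only id and attr_id and it still takes the last row of each group.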
