Suppose I have a dataset D1 that looks like this:
ID ATR1 ATR2 ATR3
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
From it I want to create a dataset D2 that looks like this:
ID ATR1 ATR2 ATR3
1 A R W
2 C T X
3 D U I
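In other words, D2 contains the unique IDs from D1. For each ID in D2, the value of each of ATR1-ATR3 is the most common value of that variable among the D1 records with the same ID. For example, for ID = 1 in D2, ATR1 = A (the most common value).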
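I have a very clumsy solution: I sort a copy of dataset "D1" three times (for example by ID and ATR1) and remove duplicates each time, then merge the three datasets to get what I want. However, I think there may be a more elegant way to do this. I have about 20 such variables in the original dataset.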
Thanks.
Answer 0 (score: 1)
/*
read and restructure so we end up with:
id attr_id value
1 1 A
1 2 R
1 3 W
etc.
*/
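data a(keep=id attr_id value);
    length value $1;
    /* declare the attributes as 1-character variables so INPUT reads them as text */
    array attrs_{*} $ 1 attr_1 - attr_3;
    infile cards;
    input id attr_1 - attr_3;
    /* write one output row per attribute of the current record */
    do attr_id=1 to dim(attrs_);
        value = attrs_{attr_id};
        output;
    end;
cards;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;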
/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
    tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;
/* sort so the most frequent value per id and attr_id ends up at the bottom of the group.
   if there are ties then it's a matter of luck which value we get */
proc sort data = freqs;
    by id attr_id count;
run;
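/* read and recreate the original structure. */
data b(keep=id attr_1 - attr_3);
    /* keep the attribute values across rows until the id is complete */
    retain attr_1 - attr_3;
    array attrs_{*} $ 1 attr_1 - attr_3;
    set freqs;
    by id attr_id;
    /* new id: clear whatever was retained from the previous id */
    if first.id then do;
        do i=1 to dim(attrs_);
            attrs_{i} = ' ';
        end;
    end;
    /* the last row of each attr_id group holds the highest count */
    if last.attr_id then do;
        attrs_{attr_id} = value;
    end;
    /* one output row per id, once all its attributes are filled in */
    if last.id then do;
        output;
    end;
run;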
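Since the question mentions about 20 such variables, the same idea can be sketched with PROC TRANSPOSE doing the restructuring in both directions, so no array bounds need to be hard-coded. This is only a sketch under assumptions, not a tested drop-in: the dataset names d1, long, freqs, top and d2, the helper variable row, and the variable names atr1-atr3 are made up for illustration, and the attributes are assumed to be character variables. The extra "descending value" sort key is also an addition: when counts tie, it makes the alphabetically first value win instead of leaving the choice to sort order.

```sas
/* Hypothetical input: the D1 table from the question. */
data d1;
    input id atr1 $ atr2 $ atr3 $;
    row + 1;   /* record counter, so each record is its own BY group below */
    datalines;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;

/* Wide to long: one row per record and attribute.
   For 20 variables, only this VAR list would change (e.g. atr1-atr20). */
proc transpose data=d1 out=long(rename=(_name_=attr col1=value));
    by row id;
    var atr1-atr3;
run;

/* Count how often each value occurs per id and attribute. */
proc freq data=long noprint;
    tables id*attr*value / out=freqs(keep=id attr value count);
run;

/* Highest count last; among equal counts, the alphabetically first value last. */
proc sort data=freqs;
    by id attr count descending value;
run;

/* Keep only the winning row of each id/attribute group. */
data top;
    set freqs;
    by id attr;
    if last.attr;
run;

/* Long to wide: the attribute names become the columns again. */
proc transpose data=top out=d2(drop=_name_);
    by id;
    id attr;
    var value;
run;
```

With the sample data there are no ties, so d2 should match the D2 table in the question. Rows with a missing attribute value are ignored by PROC FREQ by default, which means an id whose values for some attribute are all missing would end up with a missing value in that column.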