I have a dataset in which large blocks of observations recur, each repeat having the same length as the original block, for example:
data have;
input name $ identifier ;
cards;
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
adam 2
adam 3
adam 3
adam 7
/*remove*/
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
/*remove*/
adam 8
mary 1
mary 2
mary 3
mary 4
mary 5
mary 7
mary 6
adam 9
mary 1
mary 2
mary 3
;
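I want to remove the recurring block for mary whose identifiers appear in the same sequence as an earlier block. That block is marked with /*remove*/ above (the markers are annotations for this post, not lines in the actual data). The result should look like this: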
mary 1
mary 2
mary 4
mary 5
mary 6
mary 7
adam 2
adam 3
adam 7
adam 8
mary 1
mary 2
mary 3
mary 4
mary 5
mary 6
mary 7
adam 9
mary 1
mary 2
mary 3
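Thanks for your help! Someone suggested an approach using a hash table, but I suspect I may not have enough memory to run that code. Can this be done with data steps or proc sql?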
Answer 0 (score: 2)
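If the maximum number of records per group is small enough that the list of identifiers fits in one character variable (200 characters in the code below), you can build a string holding the list of identifiers in the group and use it as one of the keys in a HASH object.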
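data want ;
  /* First read of the group: collect its identifiers into one delimited string */
  do until (last.name);
    set have ;
    by name notsorted ;
    length taglist $200 ;
    taglist=catx('|',taglist,identifier);
  end;
  /* Create the hash object once; NAME plus TAGLIST is the lookup key */
  if _n_=1 then do;
    dcl hash h();
    h.defineKey('name','taglist');
    h.defineDone();
  end;
  /* ADD() returns nonzero when the key already exists, i.e. the block is a repeat */
  found = 0 ne h.add();
  /* Second read of the same group: write it out only if it was not seen before */
  do until (last.name);
    set have ;
    by name notsorted ;
    if not found then output;
  end;
  drop found taglist;
run;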
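One limitation of the step above is that taglist is declared as $200, so a group whose identifier list is longer than that will not be stored intact. A possible variation, offered only as a sketch, folds the identifiers into a fixed-width MD5 digest instead, which also keeps the hash key short. The names want_digest and digest are hypothetical and not part of the original answer, and the sketch assumes that an accidental MD5 collision between two different blocks is an acceptable risk.

```sas
data want_digest;
  /* digest is a hypothetical fixed-width stand-in for taglist */
  length digest $32;
  digest = ' ';
  /* First read of the group: chain each identifier into a running MD5 digest */
  do until (last.name);
    set have;
    by name notsorted;
    digest = put(md5(catx('|', digest, identifier)), $hex32.);
  end;
  /* Create the hash object once; NAME plus DIGEST is the lookup key */
  if _n_ = 1 then do;
    dcl hash h();
    h.defineKey('name', 'digest');
    h.defineDone();
  end;
  /* ADD() returns nonzero when this name/digest pair was already stored */
  found = (h.add() ne 0);
  /* Second read of the same group: keep it only on its first occurrence */
  do until (last.name);
    set have;
    by name notsorted;
    if not found then output;
  end;
  drop found digest;
run;
```

This still holds one hash entry per distinct block, so it reduces the size of each key rather than the number of keys.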
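If the keys are too large to fit in a hash object, you need to make multiple passes over the data. First find the groups. Then find the first occurrence of each distinct kind of group. Finally, generate the data for just those groups.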
data pass1 ;
  /* One output row per block: its sequence number, identifier list and row range */
  group + 1;
  first_obs=row+1;
  do until (last.name);
    set have ;
    by name notsorted ;
    length taglist $200 ;
    taglist=catx('|',taglist,identifier);
    row+1;
  end;
  last_obs=row;
  output;
  keep group name taglist first_obs last_obs;
run;
proc sql ;
  /* Keep only the earliest block for each name/taglist combination */
  create table pass2 as
    select group,first_obs,last_obs
    from pass1
    group by name,taglist
    having min(group)=group
    order by group
  ;
quit;
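data want;
  set pass2;
  /* Fetch the rows of each surviving block from HAVE by observation number */
  do obs=first_obs to last_obs;
    set have point=obs;
    output;
  end;
  /* group is left in the output here; uncomment it to drop it as well */
  drop /*group*/ first_obs last_obs ;
run;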
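Since the question asks about plain data steps, the PROC SQL step could presumably be replaced by a sort followed by BY-group processing. The following is a minimal sketch, assuming the pass1 dataset created above; pass1_sorted and pass2_alt are hypothetical names that do not appear in the original answer.

```sas
/* Order the blocks so the earliest group comes first within each name/taglist */
proc sort data=pass1 out=pass1_sorted;
  by name taglist group;
run;

/* Keep the first (lowest-numbered) block of each name/taglist combination */
data pass2_alt;
  set pass1_sorted;
  by name taglist;
  if first.taglist;
  keep group first_obs last_obs;
run;

/* Restore the original block order before reading the rows back from HAVE */
proc sort data=pass2_alt;
  by group;
run;
```

With this variant, the final data step would read pass2_alt in place of pass2. The sort only touches one row per block, so it should stay small compared with sorting the full dataset.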
Result (group is kept in the output, so the missing group 3 shows which block was dropped):
Obs group name identifier
1 1 mary 1
2 1 mary 2
3 1 mary 2
4 1 mary 4
5 1 mary 5
6 1 mary 7
7 1 mary 6
8 2 adam 2
9 2 adam 3
10 2 adam 3
11 2 adam 7
12 4 adam 8
13 5 mary 1
14 5 mary 2
15 5 mary 3
16 5 mary 4
17 5 mary 5
18 5 mary 7
19 5 mary 6
20 6 adam 9
21 7 mary 1
22 7 mary 2
23 7 mary 3