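A user on sasprofessionals.net could not group a dataset by several variables whose values are interchangeable across observations because they carry the same meaning.
In the sample dataset, observations 2, 3 and 7 are equivalent: each has A14, A14 and A10 as the values of Stat1 to Stat3, and only the order differs. These should be grouped and their Count values summed. Observations 5 and 6 form another group whose Count values should also be summed.
Sample dataset: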
Obs Stat1 Stat2 Stat3 Count
1 A14 A14 A14 53090
2 A14 A14 A10 6744
3 A14 A10 A14 5916
4 A01 A01 A01 4222
5 A10 A10 A10 3085
6 A10 A10 A10 2731
7 A10 A14 A14 2399
Desired output:
Obs Stat1 Stat2 Stat3 Count
1 A14 A14 A14 53090
4 A01 A01 A01 4222
6 A10 A10 A10 5816
7 A10 A14 A14 15059
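The actual dataset is larger and more complex. I do not know whether the user tried anything to solve the problem.
This question was originally posted on sasprofessionals.net and has been copied to StackOverflow for the benefit of the community. It has been edited to conform to StackOverflow Q&A standards.
Answer 0 (score: 2)
Here is my answer to the user's problem. In outline, I load Stat1-Stat3 into an array, sort the values within each row with the SORTC call routine, and then sum Count by a temporary ID built from the sorted Stat1-Stat3 values.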
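To see what the sorting step does to a single row, here is a minimal sketch. It is not part of the solution itself; the variable `key` and the hard-coded values (taken from observation 3 of the sample data) are only for illustration. CALL SORTC reorders the values of the listed character variables in place, so rows that differ only in the order of their values end up with the same key.

```sas
/* Illustrative sketch only: one row, values taken from observation 3 */
data _null_;
  length stat1-stat3 $3;          /* CALL SORTC needs variables of equal length */
  stat1 = 'A14';
  stat2 = 'A10';
  stat3 = 'A14';
  call sortc(of stat1-stat3);     /* sorts the three values ascending, in place */
  key = catx('/', stat1, stat2, stat3);   /* hypothetical name for the grouping key */
  put key=;                       /* writes key=A10/A14/A14 to the log */
run;
```

Observations 2 and 7 produce the same key, which is what lets the three rows be summed together.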
/* Load the data into a SAS dataset.                        */
/* Stat1-Stat3 are sorted within each row, and a temporary  */
/* ID is built from the sorted values.                      */
data have;
  input obs stat1 $ stat2 $ stat3 $ count;
  array stat{3} stat1-stat3;
  call sortc(of stat1-stat3);
  ID = CATX("/",stat1,stat2,stat3);
datalines;
1 A14 A14 A14 53090
2 A14 A14 A10 6744
3 A14 A10 A14 5916
4 A01 A01 A01 4222
5 A10 A10 A10 3085
6 A10 A10 A10 2731
7 A10 A14 A14 2399
;
/* Sort the dataset in preparation for the DATA step with a BY statement */
PROC SORT data=have;
  BY ID OBS;
RUN;
/* Summarise the dataset and output into the final dataset */
/* (reads HAVE, which PROC SORT above sorted in place)     */
DATA summed (drop=ID count);
  set have;
  by ID;
  retain sum 0;
  if first.ID then sum = 0;
  sum + count;
  if last.ID then output;
RUN;
/* Sort it back into the original order */
PROC SORT data=summed out=want;
  BY OBS;
RUN;
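Once Stat1-Stat3 have been sorted within each row, the summing step could also be done in a single query. The following is a sketch of that alternative using PROC SQL instead of the sort and BY-group DATA step above; the output names `want_sql` and `total` are my own, and it assumes the `have` dataset created by the first DATA step in this answer.

```sas
/* Alternative sketch (not the method above): aggregate with PROC SQL. */
/* Assumes HAVE already holds row-wise sorted stat1-stat3.             */
proc sql;
  create table want_sql as        /* want_sql: hypothetical output name */
  select stat1, stat2, stat3,
         sum(count) as total      /* total: hypothetical column name    */
  from have
  group by stat1, stat2, stat3;
quit;
```

Note that this version does not carry Obs through, so it cannot be sorted back into the original order the way the DATA step version can.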
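Answer 1 (score: 0)
Since I have been working through hash exercises, I decided to try this with hashes. Paul Dorfman has several papers that discuss using hash tables to sort arrays, e.g. Black Belt Hashigana.
Below, I use one hash table to sort the values horizontally (within each row), and then a second hash table to sum the counts by the sorted key. The data only needs to be read once, but given the size of the data I would certainly not claim an efficiency advantage in this case. I did not restore the data to its original sort order.
Edits/questions/suggestions are welcome, since this is part of my hash learning curve. :)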
data have;
input stat1 $ stat2 $ stat3 $ count;
datalines;
A14 A14 A14 53090
A14 A14 A10 6744
A14 A10 A14 5916
A01 A01 A01 4222
A10 A10 A10 3085
A10 A10 A10 2731
A10 A14 A14 2399
;
data want;
  length _stat $3;
  if _n_=1 then do;
    declare hash hstat(multidata:"y", ordered:"y");
    declare hiter hstatiter ("hstat" ) ;
    hstat.definekey('_stat');
    hstat.definedata('_stat');
    hstat.definedone();
    call missing(_stat);
    declare hash hsum(suminc: "count", ordered: "y");
    declare hiter hsumiter ("hsum" ) ;
    hsum.definekey("stat1","stat2","stat3");
    hsum.definedone();
  end;
  set have end=last;
  array stat{3};
  *load the array values into htable hstat to sort them;
  *then iterate over the hash, returning the values to array in sorted order;
  do _i=1 to dim(stat);
    hstat.add(key:stat{_i},data:stat{_i});
  end;
  do _i=1 to dim(stat);
    hstatiter.next();
    stat{_i}=_stat;
  end;
  _rc=hstatiter.next(); *hack- there is no next, this releases hiter lock so can clear hstat;
  hstat.clear();
  *now that the stat keys have been sorted, can use them as key in hash table hsum;
  *as data are loaded into/checked against the hash table, counts are summed;
  *Then if last, iterate over hsum writing it to output dataset;
  hsum.ref(); *This sums count as records are loaded/checked;
  if last then do;
    _rc = hsumiter.next();
    do while(_rc = 0);
      _rc = hsum.sum(sum: count);
      output ;
      _rc = hsumiter.next();
    end;
  end;
  drop _: ;
run;