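I'm working with some SAS data and trying to figure out how to find each record's sort position within a data step in as few steps as possible.
Here is an example: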
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
Proc Sort data=WORK.PLACES;
by STATE CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by STATE CITY;
ST_CITY_RNK = _N_;
run;
Proc Sort data=WORK.PLACES;
by CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by CITY;
CITY_RNK = _N_;
run;
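In this example, is there a way to compute ST_CITY_RNK and CITY_RNK without sorting multiple times? It feels like this should be possible with an ordered hash table, but I'm not sure how to go about it.
Thanks!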
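Answer 0 (score: 1)
A hash table is feasible. Temporary arrays work about as well and may be a little easier.
The main limitation of both is how you handle non-unique city names: Salem, OR and Salem, MA? At the state-city level this is mostly fine, although you could presumably find a state with more than one Lincoln or similar. With city alone, you will definitely find several Columbias, Lincolns, Charlestons, and so on. My solution gives all of them the same sort rank (and the next distinct city then skips ahead, by 6 or however many duplicates there were). The data step solution you posted above gives them unique ranks. A hash iterator could probably do either. You could adjust this to give unique ranks, but it would take some effort.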
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data sortrank;
*Init pair of arrays - the one that stores the original values, and one to mangle by sorting;
array states[32767] $ _temporary_;
*Combined state,city keys can need up to 8+1+40 characters, so allow 49 to avoid truncation;
array states_cities_sorted[32767] $49. _temporary_ (32767*'ZZZZZ');
array cities[32767] $40. _temporary_;
array cities_sorted[32767] $40. _temporary_ (32767*'ZZZZZ');
*Iterate over the dataset, load into arrays;
do _n_ = 1 by 1 until (Eof);
set places end=eof;
states[_n_] = state;
states_cities_sorted[_n_] = catx(',',state,city);
cities[_n_] = city;
cities_sorted[_n_] = city;
end;
*Sort the to-be-sorted arrays;
call sortc(of states_cities_sorted[*]);
call sortc(of cities_sorted[*]);
do _i = 1 to _n_;
*For each array element, look up the rank using `whichc`, looking for the value of the unsorted element in the sorted list;
city_rank = whichc(cities[_i],of cities_sorted[*]);
state_cities_rank = whichc(catx(',',states[_i],cities[_i]),of states_cities_sorted[*]);
*And put the array elements back in their proper variables;
city = cities[_i];
state= states[_i];
*And finally make a row output;
output;
end;
run;
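The tie behaviour described in this answer can be seen in isolation with a small sketch. The array contents below are hypothetical sample values, not the question's data, and the variable names are made up for illustration. `whichc` returns the position of the first match, so duplicate values share a rank and the next distinct value skips ahead.

```sas
data _null_;
  /* Hypothetical sample values containing one duplicate city */
  array sorted[4] $10 _temporary_ ('Salem', 'Lincoln', 'Salem', 'Boston');
  length lookup $10;

  /* Ascending sort: Boston, Lincoln, Salem, Salem */
  call sortc(of sorted[*]);

  lookup = 'Lincoln';
  rank_lincoln = whichc(lookup, of sorted[*]);  /* position 2 */

  lookup = 'Salem';
  rank_salem = whichc(lookup, of sorted[*]);    /* position 3 for both Salems */

  put rank_lincoln= rank_salem=;
run;
```

A value sorting after Salem would land at position 5 here, which is the "skip ahead" effect mentioned above.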
Answer 1 (score: 0)
For reference, here is a hash approach:
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data places;
set places;
if _n_ = 1 then do;
declare hash h1(ordered:'a',dataset:'places');
rc = h1.definekey('city');
rc = h1.definedata('city');
rc = h1.definedone();
declare hiter hi1('h1');
declare hash h2(ordered:'a',dataset:'places');
rc = h2.definekey('state','city');
rc = h2.definedata('state','city');
rc = h2.definedone();
declare hiter hi2('h2');
end;
t_city = city;
t_state = state;
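*Walk the city-ordered hash from the start, counting entries until the current city is reached;
*WHILE is tested before each pass, so the first item gets rank 1 and the loop cannot run past the end;
rc = hi1.first();
do city_rank = 1 by 1 while(t_city ne city);
rc = hi1.next();
end;
*Same walk over the hash ordered by state and city;
rc = hi2.first();
do state_city_rank = 1 by 1 while(t_city ne city or t_state ne state);
rc = hi2.next();
end;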
state = t_state;
city = t_city;
drop t_:;
run;
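As a rough illustration of what the iterator is doing in the step above, the sketch below loads an ordered hash and walks it once from the first item, printing each city with its position. The dataset name `demo_places` and the variable `pos` are hypothetical names chosen for this example. Because the hash is declared without `multidata`, duplicate keys are stored only once, so positions here count distinct cities.

```sas
/* Hypothetical copy of the question's sample data */
data demo_places;
  infile datalines delimiter=',';
  input state $ city :$40.;
  datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;

data _null_;
  /* Host variable for the hash key and data item */
  length city $40;
  call missing(city);

  /* Ordered hash keyed on city; items are kept in ascending key order */
  declare hash h(ordered:'a', dataset:'demo_places');
  rc = h.definekey('city');
  rc = h.definedata('city');
  rc = h.definedone();
  declare hiter hi('h');

  /* Walk the iterator once; pos is the 1-based position in key order */
  rc = hi.first();
  do pos = 1 by 1 while (rc = 0);
    put pos= city=;
    rc = hi.next();
  end;
run;
```

The answer's step repeats a walk like this for every input row, which is simple but scales with the number of rows times the number of distinct keys.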