Consider an example SAS data set with the following layout.
Price Num_items
100 10
120 15
130 20
140 25
150 30
I would like to divide the rows into 4 categories by defining a new variable named cat, so that the new data set looks like this:
Price Num_items Cat
100 10 1
120 15 1
130 20 2
140 25 3
150 30 4
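In addition, I would like to group them so that each category holds roughly the same number of items (for example, in the grouping above, group 1 has 25 items, group 2 has 20, group 3 has 25 and group 4 has 30). Note that the price column is sorted in ascending order (this is required).
I am just getting started with SAS, so any help would be much appreciated. I am not looking for a complete solution, but suggestions on how to build one would help.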
Answer 0 (score: 1)
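Cool question, subtly complex. I agree with @J_Lard that a data step with retained variables is probably the quickest way to accomplish this. If I understand your problem correctly, the code below should give you some ideas on how to approach it. Note that your mileage may vary depending on num_items and group_target.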
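As a minimal sketch of the retained-variable idea mentioned above, the step below keeps a running total across rows using a sum statement. The data set names example and running and the variable running_items are assumptions made for illustration; the five rows are the ones from the question.

```sas
/* The five rows from the question (data set name is hypothetical). */
data example;
    input price num_items;
    datalines;
100 10
120 15
130 20
140 25
150 30
;
run;

data running;
    set example;
    /* Sum statement: running_items starts at 0 and is retained
       automatically from one observation to the next. */
    running_items + num_items;
run;

proc print data=running noobs;
run;
```

For these rows running_items comes out as 10, 25, 45, 70 and 100. The categorization step in this answer relies on the same mechanism for cat and cat_num_items.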
Generate a similar, but larger, data set.
data have;
    do price=50 to 250 by 10;
        /*`_N_` stays 1 in this step, so the seed is fixed and every run gives the same random item counts.*/
        num_items = ceil(ranuni(_N_)*10)*5;
        output;
    end;
run;
Categorize.
/*Desired group size specification.*/
%let group_target = 50;
data want;
    set have;
    /*On the first record, initialize `cat` to 1 and `cat_num_items` to `num_items`. The sum statements retain both variables implicitly.*/
    if _N_=1 then do;
        cat + 1;
        cat_num_items + num_items;
    end;
    else do;
        /*If the item count for a new price puts the category count above the target, apply logic.*/
        if cat_num_items + num_items > &group_target. then do;
            /*If the current category count is closer to `group_target` without this record than with it, start a new category.*/
            if abs(&group_target. - cat_num_items) < abs(&group_target. - (cat_num_items+num_items)) then do;
                cat+1;
                cat_num_items = num_items;
            end;
            /*Otherwise keep it in the current category and increment the category count.*/
            else cat_num_items + num_items;
        end;
        /*Otherwise keep the record in the current category and increment the category count.*/
        else cat_num_items + num_items;
    end;
    drop cat_num_items;
run;
Check.
proc sql;
    create table check_want as
    select cat,
           sum(num_items) as cat_count
    from want
    group by cat;
quit;
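As a worked example on the question's own five rows, the sketch below derives the target from the data instead of hard-coding it. It assumes the target is simply the total item count divided by the desired number of groups (4), which is not something the answer above states. The names have_small, want_small and target4 are hypothetical, and the categorization logic is a condensed but equivalent form of the step above.

```sas
/* The five rows from the question (hypothetical data set name). */
data have_small;
    input price num_items;
    datalines;
100 10
120 15
130 20
140 25
150 30
;
run;

/* Assumption: target = total items / desired number of groups (4). */
proc sql noprint;
    select sum(num_items) / 4 into :target4 trimmed
    from have_small;
quit;

data want_small;
    set have_small;
    if _N_ = 1 then do;
        cat + 1;
        cat_num_items + num_items;
    end;
    /* Start a new category only when adding this row would exceed the
       target and the current count is closer to the target without it. */
    else if cat_num_items + num_items > &target4.
        and abs(&target4. - cat_num_items)
          < abs(&target4. - (cat_num_items + num_items)) then do;
        cat + 1;
        cat_num_items = num_items;
    end;
    else cat_num_items + num_items;
    drop cat_num_items;
run;

proc print data=want_small noobs;
run;
```

Here the total is 100, so target4 resolves to 25 and the rows are assigned cat values 1, 1, 2, 3, 4, with group totals of 25, 20, 25 and 30, matching the layout requested in the question. Because the assignment is greedy, other data may yield more or fewer than four categories, so it is worth running the check query afterwards.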