Question

Consider an example SAS dataset with the following layout.

Price  Num_items  
100    10   
120    15  
130    20  
140    25  
150    30

I want to split the rows into 4 categories by defining a new variable named cat, so that the new dataset looks like this:

Price  Num_items  Cat  
100    10         1  
120    15         1  
130    20         2  
140    25         3  
150    30         4

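In addition, I want the groups to contain roughly the same number of items (for example, in the grouping above, group 1 has 25, group 2 has 20, group 3 has 25, and group 4 has 30 observations). Note that the price column is sorted in ascending order (this is required).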

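I am just getting started with SAS, so any help would be much appreciated. I am not looking for a complete solution, but suggestions on how to build one would help.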

Answer 1

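Cool question, and subtly complex. I agree with @J_Lard that a data step with RETAIN is probably the quickest way to do this. If I understand your question correctly, the code below should give you some ideas on how to approach it. Note that your mileage will vary depending on num_items and group_target.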

Generate a similar but larger dataset.

data have;
    do price=50 to 250 by 10;
        /*The seed is `_N_`, which stays 1 for this whole step, so the random item counts are reproducible.*/
        num_items = ceil(ranuni(_N_)*10)*5;
        output;
    end;
run;

Categorize.

/*Desired group size specification.*/
%let group_target = 50;

data want;
    set have;
    /*On the first record, set `cat` to 1 and `cat_num_items` to `num_items`; the sum statements retain both implicitly.*/
    if _N_=1 then do;
        cat + 1;
        cat_num_items + num_items;
    end;
    else do;
        /*If the item count for a new price puts the category count above the target, apply logic.*/
        if cat_num_items + num_items > &group_target. then do;
            /*If placing the item into a new category puts the current cat count closer to the `group_target` than would keeping it, then put into new category.*/
            if abs(&group_target. - cat_num_items) < abs(&group_target. - (cat_num_items+num_items)) then do;
                cat+1;
                cat_num_items = num_items;
            end;
            /*Otherwise keep it in the current category and increment category count.*/
            else cat_num_items + num_items;
        end;
        /*Otherwise keep the item count in the current category and increment category count.*/
        else cat_num_items + num_items;
    end;
    drop cat_num_items;
run;

Check.

proc sql;
    create table check_want as
    select cat, sum(num_items) as cat_count
    from want
    group by cat;
quit;
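
As a rough sanity check, the same logic can be run on the five rows from the question. This is only a sketch: the dataset names have_small and want_small and the macro variable small_target are names I made up for illustration, and the target of 25 is an assumption chosen by eye from the desired group sizes. The two nested tests from the step above are folded into one condition, which behaves the same way.

```sas
/*The five rows from the question.*/
data have_small;
    input price num_items;
    datalines;
100 10
120 15
130 20
140 25
150 30
;
run;

/*Assumed target, chosen to match the group sizes asked for in the question.*/
%let small_target = 25;

data want_small;
    set have_small;
    /*Carry the category and its running item count across rows.*/
    retain cat cat_num_items;
    if _N_ = 1 then do;
        cat = 1;
        cat_num_items = num_items;
    end;
    /*Start a new category only when adding this row overshoots the target
      and the current total is closer to the target than the overshoot.*/
    else if cat_num_items + num_items > &small_target.
        and abs(&small_target. - cat_num_items)
            < abs(&small_target. - (cat_num_items + num_items)) then do;
        cat = cat + 1;
        cat_num_items = num_items;
    end;
    /*Otherwise the row stays in the current category.*/
    else cat_num_items = cat_num_items + num_items;
    drop cat_num_items;
run;

proc print data=want_small;
run;
```

Tracing this by hand, cat comes out as 1, 1, 2, 3, 4, which matches the layout in the question, with group totals of 25, 20, 25 and 30. With other data or another target the groups can end up noticeably uneven, so it is worth running the check query against the result.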
