根据BY组的特征保留或删除一组观察

时间:2016-11-13 03:40:50

标签: sas

几分钟前我回答了一个SAS问题,并意识到有一个概括可能比那个(here)更有用。我在StackOverflow中没有看到这个问题。

一般问题是:在你查看小组中的所有观察结果之前,如何根据BY组的某些特征来处理和保留整个BY组?

使用类似于早期问题的输入数据:

* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
    input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;

3 个答案:

答案 0 :(得分:0)

我的回答可能不是最有效的,特别是对于大型数据集,我有兴趣看到其他可能的答案。这是:

* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
    input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;

* I realize the data are already sorted, but I think it is better
* not to assume they are.;
proc sort data=x;
    by id id_2;
run;

data obstokeep;
    keep id id_2 value;
    retain startptr haszero;

    * This SET statement reads through the dataset in sequence and 
    * uses the CUROBS option to obtain the observation number. In
    * most situations, this will be the same as the _N_ automatic
    * variable, but CUROBS is probably safer.;
    set x curobs=myptr;
    by id id_2;

    * When this is the first observation in a BY-group, save the 
    * current observation number (pointer).
    * Also initialize a flag variable that will become 1 if any 
    * obs contains a VALUE of 0; 
    * The variables are in a RETAIN statement, so they keep their
    * values as the SET statement above is executed for each obs
    * in the BY-group.;
    if first.id_2
     then do;
        startptr=myptr;
        haszero=0;
     end;

    * This statement is executed for each observation. We check
    * whether VALUE is 0 and, if so, record that fact.;
    if value = 0
     then haszero=1;

    * At the end of the BY-group, we check to see if there were
    * any observations with VALUE = 0. If so, we go back using
    * another SET statement, re-read them via direct access, and
    * write them to the output dataset.
    * (Note that if VALUE order is not relevant, you can gain a bit
    * more efficiency by writing the current obs first, then going
    * back to get the rest.);
    if last.id_2 and haszero
     then do;
        * When LAST and FIRST at the same time, there is only one
        * obs, so no need to backtrack, just output and go on.;
        if first.id_2
         then output obstokeep;
         else do;
            * Here we assume that the observations are sequential
            * (which they will be for a sequential SET statement),
            * so we re-read these observations using another SET 
            * statement with the POINT option for direct access 
            * starting with the first obs of the by-group (the 
            * saved pointer) and ending with the current one (the
            * current pointer).;
            do i=startptr to myptr;
              set x point=i;
              output obstokeep;
            end;
         end;
     end;

run;

答案 1 :(得分:0)

Double DoW循环解决方案:

data have;
    input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;

data want;
do _n_ = 1 by 1 until(last.id_2);
    set have;
    by id id_2;
    flag = sum(flag,value=0);
end;
do _n_ = 1 to _n_;
    set have;
    if flag then output;
end;
drop flag;
run;

我使用〜{55}行对point方法进行了测试,发现性能没有明显差异。数据集使用:

data have;
do ID = 1 to 10000000;
    do id_2 = 1 to ceil(ranuni(1)*10);
        do value = floor(ranuni(2) * 5);
            output;
        end;
    end;
end;
run;

答案 2 :(得分:0)

proc sql;
   select a.*,b.value from (select id,id_2 from have where value=0)a left join have b
   on a.id=b.id and a.id_2=b.id_2;
quit;