SAS: How to delete observations when the dependent variable has more than five zeros

Time: 2017-03-17 20:48:59

Tags: sas filtering data-cleaning

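I have consumer panel data that records weekly spending at a retail store. The unique identifier is the household ID. I want to delete observations if more than five zeros occur in spending, that is, if the household bought nothing in five weeks. Once such a household is identified, I will delete all observations associated with that household ID. Does anyone know how to implement this in SAS? Thanks.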

1 Answer:

Answer 0: (score: 0)

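I think PROC SQL would work well here.

This could be done in one step with a more complex subquery, but it is probably better to break it into two steps:

  1. Count how many zeros each household ID has.

  2. Filter to only the household IDs that have 5 or fewer zeros.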

    proc sql;
      /* Step 1: count the zero-spending rows for each household */
      create table zero_cnt as
      select household_id,
             sum(case when spending = 0 then 1 else 0 end) as num_zeroes
      from original_data
      group by household_id;

      /* Step 2: keep only households with 5 or fewer zeros */
      create table wanted as
      select *
      from original_data
      where household_id in (select household_id from zero_cnt where num_zeroes <= 5);
    quit;
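
For comparison, the one-step version mentioned above might look like the following. This is only a sketch, assuming the same `original_data`, `household_id`, and `spending` names used in the two-step query; it moves the count into a `HAVING` clause inside the subquery instead of materializing `zero_cnt`.

```sas
proc sql;
  create table wanted as
  select *
  from original_data
  where household_id in
    /* households whose count of zero-spending rows is 5 or fewer */
    (select household_id
     from original_data
     group by household_id
     having sum(case when spending = 0 then 1 else 0 end) <= 5);
quit;
```

The two-step form is easier to debug, because you can inspect `zero_cnt` before filtering.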
    

Edit:

If the zeros must be consecutive, then the list of IDs to exclude has to be built differently.

    * Sort by ID and date;
    proc sort data = original_data out = sorted_data;  
    by household_id date;
    run;  
    

Use the LAG function to look at the previous spending amounts.

More information on LAG is available here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm

    data exclude;
      set sorted_data;
      by household_id;
      array prev{*} _L1-_L4;
      * LAG is called on every row so that its queues stay in step with the data;
      _L1 = lag(spending);
      _L2 = lag2(spending);
      _L3 = lag3(spending);
      _L4 = lag4(spending);

      * Create running count for the number of observations for each ID;
      if first.household_id then spend_cnt = 0;
      spend_cnt + 1;

      * Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 and output if they are all zero/missing;
      * The leading 0 makes SUM return 0 rather than missing when every value is missing. This assumes spending is never negative;
      if spend_cnt >= 5 then do;
        if sum(0, spending, of prev{*}) = 0 then output;
      end;
      keep household_id;
    run;
    

Then simply remove the IDs that appear in the "exclude" data set, using either a subquery or a match merge.

    proc sql;
      create table wanted as
      select *
      from original_data
      where household_id not in (select distinct household_id from exclude);
    quit;
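
The match-merge alternative could be sketched roughly as below. It assumes the `sorted_data` and `exclude` data sets created in the earlier steps; `exclude_ids`, `in_data`, and `in_excl` are names made up for this illustration.

```sas
* Reduce the exclusion list to one row per household;
proc sort data = exclude out = exclude_ids nodupkey;
  by household_id;
run;

* Keep rows from sorted_data whose household is not in the exclusion list;
data wanted;
  merge sorted_data (in = in_data)
        exclude_ids (in = in_excl);
  by household_id;
  if in_data and not in_excl;
run;
```

Both inputs must be sorted by `household_id` for the merge to work. `sorted_data` already is, since it was sorted by `household_id` and `date`.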