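I have consumer panel data that records weekly spending at retail stores. The unique identifier is the household ID. I want to delete observations when more than five zeros appear in spending, that is, when the household bought nothing in five weeks. Once such a household is identified, I will delete all observations associated with that household ID. Does anyone know how to implement this in SAS? Thanks.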
Answer 0 (score: 0)
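I think PROC SQL would work well here.
This could be done in one step with a more complex subquery, but it is probably better to break it into two steps.
Count how many zeros each household ID has.
Filter to only the household IDs with 5 or fewer zeros.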
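proc sql;
    /* Step 1: count the zero-spending weeks per household.
       GROUP BY already returns one row per ID, so DISTINCT is not needed. */
    create table zero_cnt as
    select household_id,
           sum(case when spending = 0 then 1 else 0 end) as num_zeroes
    from original_data
    group by household_id;

    /* Step 2: keep every row of the households with 5 or fewer zeros */
    create table wanted as
    select *
    from original_data
    where household_id in (select household_id from zero_cnt where num_zeroes <= 5);
quit;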
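For comparison, the single-step version mentioned above might look like the sketch below. It assumes the same `original_data`, `household_id`, and `spending` names as the two-step code; the output name `wanted_one_step` is made up here so it does not overwrite `wanted`.

```sas
proc sql;
    /* The HAVING clause filters on the per-household zero count,
       so no intermediate table is created */
    create table wanted_one_step as
    select *
    from original_data
    where household_id in
          (select household_id
           from original_data
           group by household_id
           having sum(case when spending = 0 then 1 else 0 end) <= 5);
quit;
```

The two-step form is easier to debug because `zero_cnt` can be inspected before filtering.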
Edit:
If the zeros must be consecutive, the list of IDs to exclude has to be built differently.
* Sort by ID and date;
proc sort data = original_data out = sorted_data;
by household_id date;
run;
Use the LAG function to look at the previous spending amounts.
More information about LAG is available at: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
data exclude;
    set sorted_data;
    by household_id;
    array prev{*} _L1-_L4;
    * LAG is called on every row so that its queues stay in step with the data;
    _L1 = lag(spending);
    _L2 = lag2(spending);
    _L3 = lag3(spending);
    _L4 = lag4(spending);
    * Create running count for the number of observations for each ID;
    if first.household_id then spend_cnt = 0;
    spend_cnt + 1;
    * Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 and output if they are all zero/missing;
    if spend_cnt >= 5 then do;
        * The leading 0 keeps the total at 0 rather than missing when all five values are missing;
        if sum(0, spending, of prev{*}) = 0 then output;
    end;
    keep household_id;
run;
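As an alternative to the four LAG calls, a running counter of consecutive zero weeks can build the same kind of list. This is only a sketch: it assumes `sorted_data` is sorted by `household_id` and `date` as above, it treats a missing `spending` value as a zero week, and the names `exclude_alt` and `zero_run` are made up for illustration.

```sas
data exclude_alt;
    set sorted_data;
    by household_id;
    /* Restart the streak at the first row of each household */
    if first.household_id then zero_run = 0;
    /* The sum statement retains zero_run across rows:
       extend the streak on a zero or missing week, otherwise reset it */
    if spending in (0, .) then zero_run + 1;
    else zero_run = 0;
    /* Write the ID when a streak reaches five weeks */
    if zero_run = 5 then output;
    keep household_id;
run;
```

Changing the `5` adjusts the streak length without adding or removing lag variables. A household with several separate five-week streaks appears more than once in the output, which does not matter for a `NOT IN` subquery.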
Then simply use a subquery or a match-merge to remove the IDs that appear in the "exclude" dataset.
proc sql;
    create table wanted as
    select *
    from original_data
    where household_id not in (select distinct household_id from exclude);
quit;
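The match-merge route mentioned above could be sketched roughly as follows. It assumes the exclusion list is the `exclude` dataset created by the data step and that `original_data` has the `date` column used in the earlier sort; `exclude_ids` is a made-up name for the de-duplicated list.

```sas
/* One row per excluded ID, so the merge is one-to-many */
proc sort data = exclude out = exclude_ids nodupkey;
    by household_id;
run;

proc sort data = original_data out = sorted_data;
    by household_id date;
run;

data wanted;
    merge sorted_data (in = in_data)
          exclude_ids (in = in_excl);
    by household_id;
    /* Keep the rows whose household ID is not in the exclusion list */
    if in_data and not in_excl;
run;
```

Both approaches should give the same rows; the merge needs sorted inputs, while the subquery does not.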