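I have a large panel dataset; a sample is given in the DATA step below.
For each ID, I want to record every "trigger" event, i.e. each row where a = 1, and then how long it takes until the next b = 1 appears. The final output should contain one row per trigger event.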
data have;
input id t a b ;
datalines;
1 1 0 0
1 2 0 0
1 3 1 0
1 4 0 0
1 5 0 1
1 6 1 0
1 7 0 0
1 8 0 0
1 9 0 0
1 10 0 1
2 1 0 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 1
2 6 0 1
2 7 0 1
2 8 0 1
2 9 1 0
2 10 0 1
3 1 0 0
3 2 0 0
3 3 0 0
3 4 0 0
3 5 0 0
3 6 0 0
3 7 1 0
3 8 0 0
3 9 0 0
3 10 0 0
;
run;
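Getting all the a = 1 and b = 1 events is of course no problem, but since this is a very large dataset with many events of both kinds for each ID, I am looking for a neat and concise solution. Any ideas?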
Answer 0: (score: 1)
An elegant DATA step approach can use nested DOW loops. Once you understand DOW loops, it is fairly simple.
data want(keep=id--diff);
  length id a_no a_t b_t diff 8;
  do until (last.id);                            * process each group;
    do a_no = 1 by 1 until (last.id);            * counter for each output;
      do until (output_condition or last.id);    * process each triggering state change without reading past the current id;
        set have;                                * read data;
        by id;                                   * set up first. and last. variables for the group;
        if a=1 then a_t = t;                     * detect and record start of trigger state;
        output_condition = (b=1 and t > a_t > 0);  * evaluate for proper end of trigger state;
      end;
      if output_condition then do;
        b_t = t;                                 * compute remaining info at output point;
        diff = b_t - a_t;
        output;
        a_t = .;                                 * reset trigger state tracking variables;
        b_t = .;
        diff = .;                                * avoid carrying a stale diff into a later unmatched trigger;
      end;
      else if not missing(a_t) then
        output;                                  * end of group reached with a pending trigger but no later b=1;
    end;
  end;
run;
Note: An SQL approach (not shown) could use a self-join within each group.
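For readers who have not met the idiom, a DOW loop wraps the SET statement in an explicit DO UNTIL (last.id) loop, so that one iteration of the DATA step processes a whole BY group. The following is a minimal sketch and not part of the solution above: it only counts the a = 1 rows per id in the sample have dataset, assuming have is sorted by id. The names trigger_counts and n_triggers are made up for illustration.

```sas
* Minimal DOW-loop sketch: count the a=1 rows for each id;
data trigger_counts(keep=id n_triggers);
  do until (last.id);                      * read every row of one id;
    set have;
    by id;
    n_triggers = sum(n_triggers, a = 1);   * a = 1 evaluates to 1 or 0;
  end;
  output;                                  * one row per id, written after the loop;
run;
```

Because n_triggers is not retained, it is reset to missing at the start of each DATA step iteration, i.e. for every id. That is why sum() is used here: it ignores the initial missing value.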
Answer 1: (score: 1)
Here is a fairly simple SQL approach that gives more or less the desired output:
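proc sql;
  create table want as
  select
    t1.id,
    t1.t as a_t,
    t2.t as b_t,
    t2.t - t1.t as diff
  from
    have(where = (a=1)) t1
    left join
    have(where = (b=1)) t2
    on  t1.id = t2.id
    and t2.t > t1.t
  group by t1.id, t1.t
  having diff = min(diff)   /* keep only the nearest later b=1 for each trigger */
  order by id, a_t          /* keep rows sorted so a later BY id step is safe */
  ;
quit;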
The only missing piece is a_no. Generating this kind of incrementing row ID consistently in SQL takes a lot of work, but it is trivial with an extra data step:
data want;
  set want;
  by id;
  if first.id then a_no = 0;
  a_no + 1;
run;
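As a quick sanity check on the sample data, the result could be printed as sketched below. The listing in the comment was worked out by hand from the sample have data, not copied from a SAS log, so treat it as an expectation to verify. The trigger for id 3 has no later b = 1, so the left join leaves b_t and diff missing for that row.

```sas
* Print the final table to compare it against the sample data;
proc print data=want noobs;
  var id a_no a_t b_t diff;
run;

/* Expected rows for the sample data (worked out by hand):
   id  a_no  a_t  b_t  diff
    1     1    3    5     2
    1     2    6   10     4
    2     1    2    5     3
    2     2    9   10     1
    3     1    7    .     .
*/
```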