Question

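I have a large panel data set; the DATA step below creates a small sample of it.

For each ID, I want to record every "trigger" event, i.e. each observation where a = 1, and then how long it takes until the next occurrence of b = 1. The final output should report that elapsed time for every trigger event.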

data have;
   input id t a b ;
datalines;
1 1 0 0
1 2 0 0
1 3 1 0
1 4 0 0
1 5 0 1
1 6 1 0
1 7 0 0
1 8 0 0
1 9 0 0
1 10 0 1
2 1 0 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 1
2 6 0 1
2 7 0 1
2 8 0 1
2 9 1 0
2 10 0 1
3 1 0 0
3 2 0 0
3 3 0 0
3 4 0 0
3 5 0 0
3 6 0 0
3 7 1 0
3 8 0 0
3 9 0 0
3 10 0 0
;
run;

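Getting all the a = 1 and b = 1 events is of course no problem, but since this is a very large data set with many occurrences of both events per ID, I am looking for a neat and concise solution. Any ideas?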

Answer 1

An elegant DATA step approach is to use nested DOW loops. It is straightforward once you understand how a DOW loop works.

data want(keep=id--diff);
  length id a_no a_t b_t diff 8;
  do until (last.id);                           * process each group;
    do a_no = 1 by 1 until(last.id);            * counter for each output;
      do until ( output_condition or last.id);  * process each triggering state change, never reading past the current group;

        SET have;                  * read data;
        by id;                     * setup first. last. variables for group;

        if a=1 then a_t = t;       * detect and record start of trigger state;

        output_condition = (b=1 and t > a_t > 0);  * evaluate for proper end of trigger state;
      end;

      if output_condition then do; 
        b_t = t;                     * compute remaining info at output point;
        diff = b_t - a_t;

        OUTPUT;

        a_t = .;       * reset trigger state tracking variables;
        b_t = .;
      end;
      else if not missing(a_t) then
        OUTPUT;        * a trigger was recorded but no later b = 1 was found;
    end;
  end;
run;
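
For readers new to the idiom, here is a minimal sketch of a single (non-nested) DOW loop on the same `have` data set. It only counts rows and trigger events per ID and is not part of the solution above; the names `per_id`, `n_rows` and `n_a` are made up for illustration, and `have` is assumed to be sorted by `id`.

```sas
/* Minimal single DOW loop: one pass of the implicit DATA step loop per id. */
data per_id(keep=id n_rows n_a);
  do until (last.id);          /* explicit loop reads every row of one group */
    set have;
    by id;                     /* creates first.id / last.id; needs sorted input */
    n_rows = sum(n_rows, 1);   /* accumulators keep their values because       */
    n_a    = sum(n_a, a);      /* control never returns to the top of the step */
  end;
  output;                      /* one row per id, written after its last row */
run;
```

Because `n_rows` and `n_a` are not retained, they are reset to missing when the next group starts, so no explicit reset is needed. For the sample data this should give 10 rows for each ID, with 2, 2 and 1 trigger events for IDs 1, 2 and 3. The nested version above relies on the same mechanism, with an extra loop level that stops each time a trigger is resolved.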

Note: An SQL approach (not shown) could use a self-join within each group.

Answer 2

Here is a fairly simple SQL approach that gives more or less the desired output:

proc sql;
create table want
  as select 
    t1.id, 
    t1.t as a_t, 
    t2.t as b_t, 
    t2.t - t1.t as diff
    from 
      have(where = (a=1)) t1 
      left join 
      have(where = (b=1)) t2
    on 
      t1.id = t2.id 
      and t2.t > t1.t
    group by t1.id, t1.t
    having diff = min(diff)
    ;
quit;

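The only missing piece is a_no. Generating this kind of incrementing row ID consistently in SQL takes quite a lot of work, but it is trivial with an extra data step: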

data want;
 set want;
 by id;
 if first.id then a_no = 0;
 a_no + 1;
run;
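
One caveat: the `by id` statement in that data step needs `want` to be in `id` order, and PROC SQL does not guarantee row order unless the query has an ORDER BY clause. A cautious sketch is to sort explicitly before numbering; sorting by `a_t` within `id` is an assumption here, chosen so that a_no follows the order of the trigger times.

```sas
/* Make the order explicit before the numbering step. */
proc sort data=want;
  by id a_t;      /* a_t as secondary key so a_no counts triggers in time order */
run;
```

Run this between the PROC SQL step and the data step that adds a_no. Adding `order by t1.id, t1.t` to the query itself should have the same effect.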

Identify first occurrence after a trigger event
