Calculating a rolling sum of one column over a time interval in SAS

Time: 2017-11-29 12:41:04

Tags: sas hashtable

I have a problem that I think needs only a small correction to work properly. I have a table (the desired output is the column 'sum_usage'):

id  opt    t_purchase            t_spent       bonus   usage sum_usage
a    1  10NOV2017:12:02:00  10NOV2017:14:05:00   100     9        15
a    1  10NOV2017:12:02:00  10NOV2017:15:07:33   100     0        15
a    1  10NOV2017:12:02:00  10NOV2017:13:24:50   100     6        6
b    1  10NOV2017:13:54:00  10NOV2017:14:02:58   100     3        10
a    1  10NOV2017:12:02:00  10NOV2017:20:22:07   100    12        27
b    1  10NOV2017:13:54:00  10NOV2017:13:57:12   100     7        7

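So, I need to sum all usage values from t_purchase until t_spent (for one id, opt combination, i.e. group by id, opt, there is just one unique t_purchase). Also, I have several million rows, so a hash table would be the best solution. I've tried: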

data want;
 if _n_=1 then do;
  if 0 then set have(rename=(usage=_usage));
  declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
  h.definekey('id','opt', 't_purchase', 't_spent');
  h.definedata('_usage');
  h.definedone();
 end;
set have;
sum_usage=0;
do i=intck('second', t_purchase, t_spent) to t_spent ;
 if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
end;
drop _usage i;
run;

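The fifth line from the bottom (do i=intck('second', t_purchase, t_spent) to t_spent) is surely incorrect, but I have no idea how to approach this. So the main problem is how to set up the time interval for this calculation. I already have a working hash-table step with the same keys but without the time interval, so it would be nice to write this one the same way, but that is not necessary.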

2 Answers:

Answer 0 (score: 1)

Personally, I would ditch the hash and use SQL.

Example data:

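data have;

/* the colon modifier reads each datetime up to the next blank,
   so the input does not depend on exact column positions */
input id $ opt
    t_purchase  :datetime20.
    t_spent     :datetime20.
    bonus   usage sum_usage;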

format 
    t_purchase  datetime20.
    t_spent     datetime20.;

datalines;
a    1  10NOV2017:12:02:00  10NOV2017:14:05:00   100     9        15
a    1  10NOV2017:12:02:00  10NOV2017:15:07:33   100     0        15
a    1  10NOV2017:12:02:00  10NOV2017:13:24:50   100     6        6
b    1  10NOV2017:13:54:00  10NOV2017:14:02:58   100     3        10
a    1  10NOV2017:12:02:00  10NOV2017:20:22:07   100    12        27
b    1  10NOV2017:13:54:00  10NOV2017:13:57:12   100     7       7
;

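I'm leaving your sum_usage column in for comparison.

Now, create a table of sums. The new value is sum_usage2. The self-join pairs each row with every row of the same id and opt whose t_spent falls between that row's t_purchase and t_spent: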

proc sql noprint;
create table sums as
select a.id,
       a.opt,
       a.t_purchase,
       a.t_spent,
       sum(b.usage) as sum_usage2
    from have as a,
         have as b
    where a.id = b.id
      and a.opt = b.opt
      and b.t_spent <= a.t_spent
      and b.t_spent >= a.t_purchase
    group by a.id, 
       a.opt,
       a.t_purchase,
       a.t_spent;
quit;

Now that you have the sums, join them back to the original table:

proc sql noprint;
create table want as
select a.*,
       b.sum_usage2
    from have as a
      left join
         sums as b
      on a.id = b.id
      and a.opt = b.opt
      and a.t_spent = b.t_spent
      and a.t_purchase = b.t_purchase;
quit;

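This produces the table you want. Alternatively, you can use a hash to look up the values and add the sum in a DATA step (which may be faster given the size).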

data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
    %create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;

rc = lk.find();

drop rc;
run;

The %create_hash() macro is available here: https://github.com/FinancialRiskGroup/SASPerformanceAnalytics
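
If you would rather not depend on the macro, the same lookup can be written with an explicit hash declaration. This is only a minimal sketch, assuming the sums table built by the SQL step above; the output name want_hash is made up for illustration and is not part of the original answer.

```sas
data want_hash;
  /* never executed; only puts the variables of SUMS (including sum_usage2) in the PDV */
  if 0 then set sums;

  if _n_ = 1 then do;
    /* load the precomputed sums once, keyed like the SQL join */
    declare hash lk(dataset: 'sums');
    lk.defineKey('id', 'opt', 't_purchase', 't_spent');
    lk.defineData('sum_usage2');
    lk.defineDone();
  end;

  set have;

  /* find() returns 0 on a match and copies sum_usage2 into the PDV;
     on a miss, clear the value retained from the previous row */
  if lk.find() ne 0 then call missing(sum_usage2);
run;
```

Because sum_usage2 comes in through a SET statement it is retained across iterations, which is why the miss branch resets it explicitly.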

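Answer 1 (score: 1)

I believe this question is a morph of your earlier one, in which you compute a rolling sum by doing a hash lookup for every second of a 3-hour period for each record in the data set. Hopefully you realize that the simplicity of that approach comes at a heavy cost: 3 * 3600 hash lookups per record, plus having to load the entire data vector into the hash.

The time log data shown has new records inserted at the top, so I presume the data is monotonically descending in time.

A DATA step can compute the rolling sum over a time window in a single pass over monotonic data. The technique uses "ring" arrays, in which index advancement is adjusted by modulus. One array holds the time and the other holds the metric (usage). The required array size is the maximum number of items that could occur within the time window.

Consider some generated sample data with time steps of 1 and 2 seconds and one jump of 200 seconds: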
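
The wrap-around indexing on its own can be sketched as follows. This is an illustration only: the array name ring, the size of 4, and the loop of six values are made up to show the modulus step, not taken from the answer's code.

```sas
data _null_;
  /* a 4-slot ring indexed 0..3 */
  array ring [0:3] _temporary_;
  head = -1;

  do value = 1 to 6;
    /* advance by one slot, wrapping back to 0 after slot 3 */
    head = mod(head + 1, 4);
    ring[head] = value;   /* the 5th and 6th values overwrite slots 0 and 1 */
    put value= head=;
  end;
run;
```

The head index runs 0, 1, 2, 3 and then wraps to 0, 1, so the oldest slots are reused once the ring is full. That reuse is why the ring must be at least as large as the largest number of items that can fall inside one time window.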

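%let have_count = 150;  /* not defined in the original post; 150 matches the observation count in the log further down */

data have;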
  time = '12oct2017:11:22:32'dt;
  usage = 0;
  do _n_ = 1 to &have_count;
     time + 2; *ceil(25*ranuni(123));
     if _n_ > 30 then time + -1;
     if _n_ = 145 then time + 200;
     usage = floor(180*ranuni(123));
     delta = time-lag(time);
     output;
  end;
run;

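Start with the case of a rolling sum computed from prior items, with time sorted ascending. (The descending case follows.)

The example parameters are a RING_SIZE of 16 and a TIME_WINDOW of 12 seconds.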

%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;

data want;
  array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
  array ring_time  [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);

  retain ring_tail 0 ring_head -1 span 0 span_usage 0;

  set have;
  by time ; * cause error if data not sorted per algorithm requirement;

  * unload from accumulated usage the tail items that fell out the window;
  do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
    span + -1;

    span_usage + -ring_usage(ring_tail);
    ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
  end;

  ring_head = mod ( ring_head + 1, &RING_SIZE );
  span + 1;

  if span > 1 and (ring_head = ring_tail) then do;
    _n_ = dim(ring_time);
    put 'ERROR: Ring array too small, size=' _n_;
    abort cancel;
  end;

  * update the ring array;
  ring_time(ring_head) = time;
  ring_usage(ring_head) = usage;

  span_usage + usage;

  drop ring_tail ring_head span;
run;

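For data sorted descending, you could jiggle things: sort ascending, compute the rolling sum, and re-sort descending.

What if such a jiggle can't be done, or you just want a single pass?

The items that take part in the rolling calculation then have to come from "lead" rows, that is, rows not yet read via SET. How is this possible? A second SET statement can be used to open a separate channel to the data set and thus obtain lead values.

Processing lead data requires a little more bookkeeping: premature overwrite and the diminished window at the end of the data need to be handled.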
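
Stripped down to a one-row look-ahead, the second-SET idea can be sketched like this. It assumes the generated have data set from this answer; the names lead_demo, next_time and last are invented for the illustration.

```sas
data lead_demo;
  /* primary channel: the current row */
  set have end=last;

  /* second channel on the same data set, started one row ahead */
  if not last then
    set have(firstobs=2 keep=time rename=(time=next_time));
  else
    call missing(next_time);   /* no row follows the last one */
run;
```

Skipping the second SET on the last row matters: reading past the end of a data set stops the DATA step, so the final row would otherwise never be written.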

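%let debug = *;  /* assumption: &debug is not defined in the original post; * turns each debug line into a comment statement, blank enables it */

data want2;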
  array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
  array ring_time  [-1:%eval(&RING_SIZE-1)] _temporary_;

  retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;

  set have;

&debug put / _N_ ':' time= ring_head=;

  * unload ring_head slotted item from sum;
  span + -1;
  span_usage + -ring_usage(ring_head);

  * advance ring_head slot by 1, the vacated slot will be overwritten by lead;
  ring_head = mod ( ring_head + 1, &RING_SIZE ); 

&debug put +2 ring_time(ring_head)= span= 'head';

  * load ring with lead values via a second SET of the same data;
  if not end2 then do;

    do until (_n_ > 1 or lead_index = 0 or end2);
      set have(keep=time usage rename=(time=t usage=u)) end=end2;  * <--- the second SET ;

      if end2 then guard_index = lead_index;

&debug if end2 then put guard_index=;

      ring_time(lead_index) = t;
      ring_usage(lead_index) = u;

&debug put +2 ring_time(lead_index)=  'lead';

      lead_index = mod ( lead_index + 1, &RING_SIZE);
    end;
  end;

  * advance ring_tail to cover the time window;
  if ring_tail ne guard_index then do;

      ring_tail_was = ring_tail;
      ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;

      do while (time - ring_time(ring_tail) <= &TIME_WINDOW);

          span + 1;
          span_usage + ring_usage(ring_tail);

&debug put +2 ring_time(ring_tail)= span= 'seek';

          ring_tail_was = ring_tail;
          ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;

          if ring_tail_was = guard_index then leave;

          if span > 1 and (ring_head = ring_tail) then do;
            _n_ = dim(ring_time);
            put 'ERROR: Ring array too small, size=' _n_;
            abort cancel;
          end;
      end;

      * seek went beyond window, back tail off to prior index;
      ring_tail = ring_tail_was;

  end;

&debug put +2 ring_time(ring_tail)= span= 'mark';

  drop lead_index t u ring_: guard_index span;

  format ring: span: usage 6.;
run;
options source;

Confirm that both methods produce the same result:

proc sort data=want2; by time;
run;

proc compare noprint data=want compare=want2 out=diff outnoequal;
  id time;
  var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.

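I have not benchmarked the ring-array approach against PROC EXPAND or hash.

Caveat: Dead-reckoning rolling values with +in and -out operations can accumulate round-off error when dealing with non-integer values.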
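
The effect can be seen in a few lines. This is a hedged illustration, not part of the original answer: the values 0.1 and 0.2 and the rounding unit of 1e-9 are arbitrary choices, and rounding the running total is only one possible mitigation.

```sas
data _null_;
  total = 0;

  /* two items enter the window ... */
  total + 0.1;
  total + 0.2;

  /* ... and the same two items leave it again */
  total + -0.1;
  total + -0.2;

  /* in binary floating point the result is not guaranteed to be exactly 0 */
  is_zero = (total = 0);

  /* rounding to a unit well below the data precision removes the residue */
  rounded = round(total, 1e-9);

  put total= e20. is_zero= rounded=;
run;
```

For a long-running window, another option is to recompute the sum from the ring contents from time to time instead of trusting the running total indefinitely.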