Question

I have a very large dataset (about 30 billion rows):

host_id | usr_id | src_id | vis_num | event_ts

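Any user belonging to a parent host can visit a page from a source (src_id), where a source is, for example, their phone, tablet, or computer (the device type cannot be identified). The column vis_num is the ordered visit number per host, per user, per source. The column event_ts captures the timestamp of each event within a visit, again per host, per user, per source. An example dataset for one host might look like this: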

   host_id | usr_id  | src_id |  vis_num  |      event_ts
----------------------------------------------------------------
   100     |   10    |  05    |     1     |  2017-08-01 14:52:34
   100     |   10    |  05    |     1     |  2017-08-01 14:56:00
   100     |   10    |  05    |     1     |  2017-08-01 14:58:09
   100     |   10    |  05    |     2     |  2017-08-01 17:08:10
   100     |   10    |  05    |     2     |  2017-08-01 17:16:07
   100     |   10    |  05    |     2     |  2017-08-01 17:23:25
   100     |   10    |  72    |     1     |  2017-07-29 20:03:01
   100     |   10    |  72    |     1     |  2017-07-29 20:04:10
   100     |   10    |  72    |     2     |  2017-07-29 20:45:17
   100     |   10    |  72    |     2     |  2017-07-29 20:56:46
   100     |   10    |  72    |     3     |  2017-07-30 09:30:15
   100     |   10    |  72    |     3     |  2017-07-30 09:34:19
   100     |   10    |  72    |     4     |  2017-08-01 18:16:57
   100     |   10    |  72    |     4     |  2017-08-01 18:26:00
   100     |   10    |  72    |     5     |  2017-08-02 07:53:33
   100     |   22    |  43    |     1     |  2017-07-06 11:45:48
   100     |   22    |  43    |     1     |  2017-07-06 11:46:12
   100     |   22    |  43    |     2     |  2017-07-07 08:41:11

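Within each source ID, a change in the visit number indicates a logout followed by a subsequent login. Note that activity from different sources can overlap in time.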
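To make the login/logout reading concrete, a minimal sketch (untested at this scale) that collapses the events into one row per visit might look like the following. It assumes the table is called tableA, as in the queries below, and that the first and last event of a visit approximate its login and logout times. The names login_ts and logout_ts are made up for illustration.

```sql
-- Sketch only: one row per (host, user, source, visit).
-- Assumption: the earliest event of a visit stands in for the login time
-- and the latest event for the logout time.
select host_id,
       usr_id,
       src_id,
       vis_num,
       min(event_ts) as login_ts,   -- hypothetical column name
       max(event_ts) as logout_ts   -- hypothetical column name
  from tableA
 group by host_id, usr_id, src_id, vis_num;
```

On the sample data above, user 10 on source 05 would collapse to two rows, one for each value of vis_num.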

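My goal is to count the (non-new) users who logged in at least twice within some time interval, say 45 days. My ultimate goal is to:

1) Identify all users who repeated the critical event at least twice within a given time period (45 days).

2) For those users, measure the time elapsed between the first and second time they completed the event.

3) Plot the cumulative distribution function, i.e., the percentage of users who performed the second event within different time intervals.

4) Identify the time interval within which 80% of users completed the second event. This is your product usage interval.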

See page 23 of:

http://usdatavault.com/library/Product-Analytics-Playbook-vol1-Mastering_Retention.pdf
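Steps 1 to 4 above could be sketched roughly as follows. This is an untested outline under stated assumptions, not a tuned query: the table is tableA, a "login" is taken to be the first event of each (host_id, usr_id, src_id, vis_num), logins are ordered per user across all sources, and the 45-day window is measured from each user's first login. The names logins, ranked, gaps, cdf, login_rank, gap and pct_users are illustrative only.

```sql
with logins as (
    -- Assumption: the earliest event of a visit is the login time.
    select host_id, usr_id, src_id, vis_num,
           min(event_ts) as login_ts
      from tableA
     group by host_id, usr_id, src_id, vis_num
),
ranked as (
    -- Order each user's logins across all of their sources and
    -- carry the next login time along on the same row.
    select host_id, usr_id, login_ts,
           row_number() over (partition by host_id, usr_id
                              order by login_ts) as login_rank,
           lead(login_ts) over (partition by host_id, usr_id
                                order by login_ts) as next_login_ts
      from logins
),
gaps as (
    -- Steps 1 and 2: users whose second login falls within 45 days
    -- of their first, with the elapsed time between the two.
    select host_id, usr_id,
           next_login_ts - login_ts as gap
      from ranked
     where login_rank = 1
       and next_login_ts - login_ts <= interval '45 days'
),
cdf as (
    -- Step 3: fraction of these users whose gap is at most this row's gap.
    select gap,
           cume_dist() over (order by gap) as pct_users
      from gaps
)
-- Step 4: smallest gap that covers at least 80% of these users.
select min(gap) as usage_interval
  from cdf
 where pct_users >= 0.8;
```

Selecting gap and pct_users from cdf instead of the final aggregate would give the points needed to plot the cumulative distribution. Because sessions on different sources can overlap, this sketch counts a login on a second device as a second login. Whether that is the right reading of the "critical event" is an open assumption.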

Here is what I tried:

with new_users as (

select host_id || ' ' || usr_id as host_usr_id,
       min(event_ts) as first_login_date

   from tableA
   group by 1
)
,

time_diffs as (
select a.host_id || ' ' || a.usr_id as host_usr_id,
       a.usr_id,
       a.src_id,
       a.event_ts,
       a.vis_num,
       b.first_login_date,  

   case when lag(a.vis_num) over 
                    (partition by a.host_id, a.usr_id, a.src_id 
                      order by a.event_ts) <> a.vis_num
        then a.event_ts -  lag(a.event_ts) over 
                                 (partition by a.host_id, a.usr_id, 
                                               a.src_id 
                                   order by a.event_ts)
        else null end 
          as time_diff                     


    from tableA a
    left join new_users b
    on b.host_usr_id = a.host_id || ' ' || a.usr_id


      where a.event_ts > current_date - interval '45 days'
      and a.event_ts > b.first_login_date + interval '45 days'

)

select count(distinct case when time_diff < interval '45 days'
                  and event_ts > first_login_date + interval '45 days'
                  then host_usr_id end) as cnt_45


   from time_diffs

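I have tried several other (very different) queries (see below), but performance is definitely an issue. Joining on date intervals is also a new concept for me. Any help is appreciated.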

Another approach:

with new_users as (

select host_id,
       usr_id,
       min(event_ts) as first_login_date

   from tableA
   group by 1,2

),

x_day_twice as (

select a.host_id, 
       a.usr_id,
       a.src_id,
       max(a.vis_num) - min(a.vis_num) + 1 as num_logins

    from tableA a
    left join new_users b
    on a.host_id || ' ' || a.usr_id = b.host_id || ' ' || b.usr_id
    and a.event_ts > b.first_login_date + interval '45 days'

where event_ts >= current_timestamp - interval '1 days' - interval '45 days'
  and first_login_date < current_date - 1 - 45

group by 1, 2, 3
)


select count(distinct case when num_logins > 1 
                           then host_id || ' ' || usr_id end)
   from x_day_twice

This is in Vertica SQL.

0 answers: