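BigQuery - Calculate user activity sessions from access history

Time: 2015-06-16 04:04:28

Tags: google-bigquery

I'm new to BQ and not sure how much running this query would cost.

I have a table that records every user access time, like this: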

user_id     access_time
-------------------------------------
user_a      2015-06-15 14:12:12
user_b      2015-06-15 14:12:12
user_a      2015-06-15 14:12:13
user_a      2015-06-15 14:12:19
user_a      2015-06-15 14:12:28
user_a      2015-06-15 19:32:15
user_a      2015-06-15 19:32:19

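I want to generate an active-session table that represents all of a user's activity windows. Each session has a start time and a duration.

A session expires if the next access does not arrive within 10 seconds.

An example of the session table: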

session_id    user_id    session_start_time    duration
------------------------------------------------------------
1             user_a     2015-06-15 14:12:12   16
2             user_b     2015-06-15 14:12:12   0
3             user_a     2015-06-15 19:32:15   4

BQ doesn't seem to support user-defined functions, so how can I achieve this with a single query?

Thanks in advance!

Update:

Fixed the example.

3 answers:

Answer 0 (score: 4)

To illustrate the approach with the data from the example, here is a query that lists each new session with its start time:

select user, ts start_time from (
  -- a row opens a new session when it is the user's first access
  -- or arrives more than 10 seconds after the previous one
  select user, ts, ifnull(seconds - prev_seconds > 10, true) new_session from (
    select user, ts, seconds,
      lag(seconds, 1) over(partition by user order by seconds) prev_seconds
    from
      (select user, ts, integer(ts/1000000) seconds from
        (select 'user_a' user, timestamp('2015-06-15 14:12:12') ts),
        (select 'user_b' user, timestamp('2015-06-15 14:12:12') ts),
        (select 'user_a' user, timestamp('2015-06-15 14:12:13') ts),
        (select 'user_a' user, timestamp('2015-06-15 14:12:19') ts),
        (select 'user_a' user, timestamp('2015-06-15 14:12:28') ts),
        (select 'user_a' user, timestamp('2015-06-15 19:32:15') ts),
        (select 'user_a' user, timestamp('2015-06-15 19:32:19') ts))))
where new_session

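To get the session durations, we can run another window function instead of doing a self-join. Basically, we first find the start and the end of each session, then compute the difference between them: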

-- a start row that is also an end row is a single-access session (duration 0);
-- otherwise the next remaining row for the user is the end of the same session
select user, ts, if(last_session, 0, next_seconds - seconds) duration
from (
select 
  user, new_session, last_session, ts, seconds,
  lead(seconds, 1) over(partition by user order by seconds) next_seconds
from (
select 
  user,
  ts,
  seconds,
  ifnull(seconds - prev_seconds > 10, true) new_session,
  ifnull(next_seconds - seconds > 10, true) last_session
from (
select 
  user, 
  ts, 
  seconds, 
  lag(seconds, 1) over(partition by user order by seconds) prev_seconds,
  lead(seconds, 1) over(partition by user order by seconds) next_seconds 
from
(select user, ts, integer(ts/1000000) seconds from
(select 'user_a' user,  timestamp('2015-06-15 14:12:12') ts),
(select 'user_b' user,  timestamp('2015-06-15 14:12:12') ts),
(select 'user_a' user,  timestamp('2015-06-15 14:12:13') ts),
(select 'user_a' user,  timestamp('2015-06-15 14:12:19') ts),
(select 'user_a' user,  timestamp('2015-06-15 14:12:28') ts),
(select 'user_a' user,  timestamp('2015-06-15 19:32:15') ts),
(select 'user_a' user,  timestamp('2015-06-15 19:32:19') ts))))
where new_session or last_session)
where new_session

This results in:

Row user    ts                       duration    
1   user_a  2015-06-15 14:12:12 UTC  16  
2   user_a  2015-06-15 19:32:15 UTC  4   
3   user_b  2015-06-15 14:12:12 UTC  0  
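
For checking the numbers outside BigQuery, here is a minimal Python sketch of the same gap rule (a new session starts when an access comes more than 10 seconds after the previous one). It is not BigQuery SQL, and the names `sessionize`, `GAP_SECONDS` and `rows` are made up for illustration:

```python
from collections import defaultdict
from datetime import datetime

GAP_SECONDS = 10  # a gap longer than this starts a new session


def sessionize(accesses, gap=GAP_SECONDS):
    """Return (user, session_start, duration_seconds), ordered by user, then start."""
    by_user = defaultdict(list)
    for user, ts in accesses:
        by_user[user].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))

    sessions = []
    for user in sorted(by_user):
        times = sorted(by_user[user])
        start = prev = times[0]
        for ts in times[1:]:
            # Same test as `seconds - prev_seconds > 10` in the query.
            if (ts - prev).total_seconds() > gap:
                # Close the running session at its last access.
                sessions.append((user, start, int((prev - start).total_seconds())))
                start = ts
            prev = ts
        # Close the user's final session.
        sessions.append((user, start, int((prev - start).total_seconds())))
    return sessions


rows = [
    ("user_a", "2015-06-15 14:12:12"),
    ("user_b", "2015-06-15 14:12:12"),
    ("user_a", "2015-06-15 14:12:13"),
    ("user_a", "2015-06-15 14:12:19"),
    ("user_a", "2015-06-15 14:12:28"),
    ("user_a", "2015-06-15 19:32:15"),
    ("user_a", "2015-06-15 19:32:19"),
]

for user, start, duration in sessionize(rows):
    print(user, start, duration)
```

On the sample rows this prints the same three sessions as the table above (16, 4 and 0 seconds).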

Answer 1 (score: 1)

It's a bit hard for me to answer without access to the dataset itself, but here is the logical flow I would implement:

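  1. For each event, use the LEAD() function to find the next access time (or LAG() for the previous one, which is what the new-session test compares against); compute the difference, and run an if statement on the result to flag the record as "new session" 1/0. Keep only the new-session records. This gives you a sub-table of all session start times
  2. Follow exactly the same steps, except for the new-session flagging, to get the duration of each access
  3. Join the two sub-tables, along the lines of:

    on a.user_id = b.user_id and b.access_time >= a.session_start_time and b.access_time < next_session_time

  4. Then simply sum for each user and session

  5. Probably not the most efficient approach (saving partial results to temporary tables would avoid running over all the data twice), but it should work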
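
The flow above can be sketched in Python as follows. This is only an illustration of the logic, not BigQuery SQL; the names `sessions_by_join`, `starts` and `steps` are hypothetical, and it assumes that the gap which ends a session counts as 0 toward the duration (the steps above do not say how that gap is treated):

```python
from datetime import datetime

GAP = 10  # seconds; a longer gap ends the session


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def sessions_by_join(accesses, gap=GAP):
    events = sorted((user, parse(ts)) for user, ts in accesses)

    starts = []  # step 1: (user, session_start_time)
    steps = []   # step 2: (user, access_time, seconds counted for this access)
    for i, (user, ts) in enumerate(events):
        prev = events[i - 1] if i > 0 and events[i - 1][0] == user else None
        nxt = events[i + 1] if i + 1 < len(events) and events[i + 1][0] == user else None
        if prev is None or (ts - prev[1]).total_seconds() > gap:
            starts.append((user, ts))
        to_next = (nxt[1] - ts).total_seconds() if nxt else None
        # Assumption: a gap that ends the session contributes 0 seconds.
        counted = int(to_next) if to_next is not None and to_next <= gap else 0
        steps.append((user, ts, counted))

    result = []
    for j, (user, start) in enumerate(starts):
        same_user_next = j + 1 < len(starts) and starts[j + 1][0] == user
        next_start = starts[j + 1][1] if same_user_next else None
        # Steps 3 and 4: join accesses to their session window, then sum.
        total = sum(
            d for u, ts, d in steps
            if u == user and ts >= start and (next_start is None or ts < next_start)
        )
        result.append((user, start, total))
    return result


rows = [
    ("user_a", "2015-06-15 14:12:12"),
    ("user_b", "2015-06-15 14:12:12"),
    ("user_a", "2015-06-15 14:12:13"),
    ("user_a", "2015-06-15 14:12:19"),
    ("user_a", "2015-06-15 14:12:28"),
    ("user_a", "2015-06-15 19:32:15"),
    ("user_a", "2015-06-15 19:32:19"),
]

for user, start, duration in sessions_by_join(rows):
    print(user, start, duration)
```

The inner `sum` rescans every access for each session, which mirrors the "not the most efficient" caveat in the last point.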

Answer 2 (score: 0)

OK, enlightened by Mosha's answer, I tried out this solution. The key points are:

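  1. Use a window function to annotate the rows of the table (previous access time and new-session flag).
  2. Exclude the rows that fall between the start and the end of a session.
  3. Use a window function again to compute the duration.
  4. Here is the script: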

    select user, 
      case
        when not new_session and end_of_session then seconds - start_time
        when new_session and end_of_session then 0
      end as duration,
      case 
        when not new_session and end_of_session then start_time
        when new_session and end_of_session then seconds
      end as session_start,
      seconds as session_end from
    (select *, lag(seconds, 1) over (partition by user order by seconds, prev_seconds) as start_time from
    (select user, seconds , new_session, ifnull(end_session_temp, true) end_of_session, prev_seconds from
    (select user, seconds , new_session, prev_seconds, lead(new_session, 1) over (partition by user order by seconds, prev_seconds) as end_session_temp from
    (select user, seconds, new_session, prev_seconds from 
    (select user, seconds, prev_seconds, ifnull(seconds - prev_seconds > 10, true) new_session from 
    (select user, ts, seconds, lag(seconds, 1) over(partition by user order by seconds) as prev_seconds from
    (select user, ts, integer(ts/1000000) seconds from
    (select 'user_a' user,  timestamp('2015-06-15 14:12:12') ts),
    (select 'user_b' user,  timestamp('2015-06-15 14:12:12') ts),
    (select 'user_a' user,  timestamp('2015-06-15 14:12:13') ts),
    (select 'user_a' user,  timestamp('2015-06-15 14:12:19') ts),
    (select 'user_a' user,  timestamp('2015-06-15 14:12:28') ts),
    (select 'user_a' user,  timestamp('2015-06-15 19:32:15') ts),
    (select 'user_a' user,  timestamp('2015-06-15 19:32:19') ts))))))
    where (new_session or end_session_temp  is null or end_session_temp)))
    where not (new_session and not end_of_session)
    

    The output is:

    Row         user        duration    session_start   session_end  
    1           user_b      0           1434377532      1434377532   
    2           user_a      16          1434377532      1434377548   
    3           user_a      4           1434396735      1434396739