Splitting rows into multiple sessions by time difference in pyspark

Time: 2016-12-01 12:27:24

Tags: sql apache-spark pyspark apache-spark-sql spark-dataframe

Here is some dummy data:

user  ts
--------
1     1
1     3
1     10
1     13
1     21
1     24

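For each user, whenever the gap between two adjacent timestamps is ≥ 6, the rows should be split into two sessions. So the data above should be divided like this: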

user    ts    diff
-------------------
1       1     None
1       3     2
-------------------
1       10    7
1       13    3
-------------------
1       21    8
1       24    3

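I know how to generate the diff column in pyspark with the window function shown below, but how do I split the rows into separate sessions per user in a pyspark way? Many thanks!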

select
   user,
   ts,
   (ts - lag(ts, 1) over (partition by user order by ts asc)) as diff
from user_event

1 Answer:

Answer 0: (score: 2)

You are off to the right start. The SQL would continue as:

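select user, ts, diff,
       -- flag each row that opens a new session (gap >= 6), then take a running sum of the flags
       sum(case when diff >= 6 then 1 else 0 end) over (partition by user order by ts) as session_grouping
from (select user, ts,
             -- gap to the previous event of the same user (null for the first event)
             ts - lag(ts, 1) over (partition by user order by ts asc) as diff
      from user_event
     ) ue;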