Calculating overlapping sessions in PySpark

Time: 2019-06-04 11:37:15

Tags: hive pyspark apache-spark-sql hiveql

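I have a dataset containing logs of user actions, which I have already sessionized (if a user accesses my system from the same IP at intervals of less than 20 minutes, all of their actions belong to the same session).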

As an example, say we have the following data, which in general can refer to multiple organizations and users:

    from pyspark.sql.types import TimestampType

    df = spark.createDataFrame(
        [
            ("org_1", 'user_1', '2018-12-20 19:55:29', 1, 1),
            ("org_1", 'user_1', '2018-12-20 19:55:30', 1, 0),
            ("org_1", 'user_1', '2018-12-20 19:55:31', 2, 1),
            ("org_1", 'user_1', '2018-12-20 19:55:32', 3, 1),
            ("org_1", 'user_1', '2018-12-20 19:55:33', 4, 1),
            ("org_1", 'user_1', '2018-12-20 19:55:34', 1, 0),
            ("org_1", 'user_1', '2018-12-20 19:55:35', 3, 0),
            ("org_1", 'user_1', '2018-12-20 19:55:36', 3, 0),
            ("org_1", 'user_1', '2018-12-20 19:55:37', 1, 0),
            ("org_1", 'user_1', '2018-12-20 19:55:38', 2, 0),
        ],
        ("org_id", "user_id", "ymd", "session_id", "new_session")
    )
    # Cast the string column to a proper timestamp
    df = df.withColumn("ymd", df['ymd'].cast(TimestampType()))

    +------+-------+---------------------+----------+-----------+
    |org_id|user_id|ymd                  |session_id|new_session|
    +------+-------+---------------------+----------+-----------+
    |org_1 |user_1 |2018-12-20 19:55:29.0|1         |1          |
    |org_1 |user_1 |2018-12-20 19:55:30.0|1         |0          |
    |org_1 |user_1 |2018-12-20 19:55:31.0|2         |1          |
    |org_1 |user_1 |2018-12-20 19:55:32.0|3         |1          |
    |org_1 |user_1 |2018-12-20 19:55:33.0|4         |1          |
    |org_1 |user_1 |2018-12-20 19:55:34.0|1         |0          |
    |org_1 |user_1 |2018-12-20 19:55:35.0|3         |0          |
    |org_1 |user_1 |2018-12-20 19:55:36.0|3         |0          |
    |org_1 |user_1 |2018-12-20 19:55:37.0|1         |0          |
    |org_1 |user_1 |2018-12-20 19:55:38.0|2         |0          |
    +------+-------+---------------------+----------+-----------+

I would like to calculate the number of overlapping sessions per organization/user/day. In this case, since all 4 sessions overlap with each other, the result would be 6 (one for each pair of sessions).

I was thinking of running a HIVE window function partitioned by organization, user, and day, and somehow counting the session_ids that are above the current session_id (which means they overlap it). Here is an example of what I am describing:

    +------+-------+---------------------+----------+-----------+--------------------+
    |org_id|user_id|ymd                  |session_id|new_session|overlapping_sessions|
    +------+-------+---------------------+----------+-----------+--------------------+
    |org_1 |user_1 |2018-12-20 19:55:29.0|1         |1          |0
    |org_1 |user_1 |2018-12-20 19:55:30.0|1         |0          |0
    |org_1 |user_1 |2018-12-20 19:55:31.0|2         |1          |0
    |org_1 |user_1 |2018-12-20 19:55:32.0|3         |1          |0
    |org_1 |user_1 |2018-12-20 19:55:33.0|4         |1          |0
    |org_1 |user_1 |2018-12-20 19:55:34.0|1         |0          | <- 3 (sessions 2, 3, 4 are above)
    |org_1 |user_1 |2018-12-20 19:55:35.0|3         |0          | <- 1 (session 4 is above)
    |org_1 |user_1 |2018-12-20 19:55:36.0|3         |0          |0 (since we've examined session_3 already)
    |org_1 |user_1 |2018-12-20 19:55:37.0|1         |0          |0 (since we've examined session_1 already)
    |org_1 |user_1 |2018-12-20 19:55:38.0|2         |0          | <- 2 (sessions 3, 4 are above)
    +------+-------+---------------------+----------+-----------+--------------------+

I could then sum up the overlapping_sessions column.

I cannot figure out how to achieve this. Any pointers?
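To make the target number concrete, here is a minimal plain-Python sketch (no Spark) of one way to read "overlapping". It assumes that a session spans from its first to its last logged event, and that two sessions overlap when those spans intersect. The function name `count_overlapping_sessions` and the grouping by calendar day are illustrative assumptions, not an established API. A session that crosses midnight would be split per day under this reading.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations

# Same sample events as above, without the new_session flag.
rows = [
    ("org_1", "user_1", "2018-12-20 19:55:29", 1),
    ("org_1", "user_1", "2018-12-20 19:55:30", 1),
    ("org_1", "user_1", "2018-12-20 19:55:31", 2),
    ("org_1", "user_1", "2018-12-20 19:55:32", 3),
    ("org_1", "user_1", "2018-12-20 19:55:33", 4),
    ("org_1", "user_1", "2018-12-20 19:55:34", 1),
    ("org_1", "user_1", "2018-12-20 19:55:35", 3),
    ("org_1", "user_1", "2018-12-20 19:55:36", 3),
    ("org_1", "user_1", "2018-12-20 19:55:37", 1),
    ("org_1", "user_1", "2018-12-20 19:55:38", 2),
]


def count_overlapping_sessions(rows):
    """Count overlapping session pairs per (org_id, user_id, day).

    Hypothetical helper: treats each session as the closed interval
    [first event, last event] within a single calendar day.
    """
    # (org, user, day, session) -> (first event time, last event time)
    bounds = {}
    for org, user, ts, session in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        key = (org, user, t.date().isoformat(), session)
        start, end = bounds.get(key, (t, t))
        bounds[key] = (min(start, t), max(end, t))

    # Collect the session intervals of each (org, user, day) group.
    groups = defaultdict(list)
    for (org, user, day, _session), span in bounds.items():
        groups[(org, user, day)].append(span)

    # Two closed intervals intersect when each starts before the other ends.
    return {
        group: sum(
            1
            for (s1, e1), (s2, e2) in combinations(spans, 2)
            if s1 <= e2 and s2 <= e1
        )
        for group, spans in groups.items()
    }


print(count_overlapping_sessions(rows))
# {('org_1', 'user_1', '2018-12-20'): 6}
```

Under the same assumptions, a Spark version could presumably follow the same two steps: aggregate the minimum and maximum `ymd` per session with `groupBy(...).agg(...)`, then self-join the result on organization, user, and day with the interval-intersection condition and count the pairs.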

0 answers:

No answers