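I have a dataset containing logs of user actions, which I have already sessionized (if a user accesses my system from the same IP at intervals of less than 20 minutes, all of their actions belong to the same session).
For the sake of example, assume we have the following data, which can refer to multiple orgs and users: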

from pyspark.sql.types import TimestampType

# Sample log: one row per user action, already tagged with a session_id.
# new_session is 1 on the first row of a session and 0 otherwise.
df = spark.createDataFrame(
    [
        ("org_1", 'user_1', '2018-12-20 19:55:29', 1, 1),
        ("org_1", 'user_1', '2018-12-20 19:55:30', 1, 0),
        ("org_1", 'user_1', '2018-12-20 19:55:31', 2, 1),
        ("org_1", 'user_1', '2018-12-20 19:55:32', 3, 1),
        ("org_1", 'user_1', '2018-12-20 19:55:33', 4, 1),
        ("org_1", 'user_1', '2018-12-20 19:55:34', 1, 0),
        ("org_1", 'user_1', '2018-12-20 19:55:35', 3, 0),
        ("org_1", 'user_1', '2018-12-20 19:55:36', 3, 0),
        ("org_1", 'user_1', '2018-12-20 19:55:37', 1, 0),
        ("org_1", 'user_1', '2018-12-20 19:55:38', 2, 0),
    ],
    ("org_id", "user_id", "ymd", "session_id", "new_session")
)
# Cast the string column to a proper timestamp.
df = df.withColumn("ymd", df['ymd'].cast(TimestampType()))

+------+-------+---------------------+----------+-----------+
|org_id|user_id|ymd                  |session_id|new_session|
+------+-------+---------------------+----------+-----------+
|org_1 |user_1 |2018-12-20 19:55:29.0|1         |1          |
|org_1 |user_1 |2018-12-20 19:55:30.0|1         |0          |
|org_1 |user_1 |2018-12-20 19:55:31.0|2         |1          |
|org_1 |user_1 |2018-12-20 19:55:32.0|3         |1          |
|org_1 |user_1 |2018-12-20 19:55:33.0|4         |1          |
|org_1 |user_1 |2018-12-20 19:55:34.0|1         |0          |
|org_1 |user_1 |2018-12-20 19:55:35.0|3         |0          |
|org_1 |user_1 |2018-12-20 19:55:36.0|3         |0          |
|org_1 |user_1 |2018-12-20 19:55:37.0|1         |0          |
|org_1 |user_1 |2018-12-20 19:55:38.0|2         |0          |
+------+-------+---------------------+----------+-----------+

I want to be able to count the number of overlapping sessions per org/user/day. In this case, since all 4 sessions overlap one another, the result would be 6.
I was thinking about running a HIVE window function partitioned by org, user, and day, and somehow counting the number of session_ids that are above the current session_id (which would mean they are overlapping). This is an example of what I am describing:

+------+-------+---------------------+----------+-----------+--------------------+
|org_id|user_id|ymd                  |session_id|new_session|overlapping_sessions|
+------+-------+---------------------+----------+-----------+--------------------+
|org_1 |user_1 |2018-12-20 19:55:29.0|1         |1          |0                   |
|org_1 |user_1 |2018-12-20 19:55:30.0|1         |0          |0                   |
|org_1 |user_1 |2018-12-20 19:55:31.0|2         |1          |0                   |
|org_1 |user_1 |2018-12-20 19:55:32.0|3         |1          |0                   |
|org_1 |user_1 |2018-12-20 19:55:33.0|4         |1          |0                   |
|org_1 |user_1 |2018-12-20 19:55:34.0|1         |0          |3                   | <- sessions 2, 3, 4 are above
|org_1 |user_1 |2018-12-20 19:55:35.0|3         |0          |1                   | <- session 4 is above
|org_1 |user_1 |2018-12-20 19:55:36.0|3         |0          |0                   | <- session 3 was already examined
|org_1 |user_1 |2018-12-20 19:55:37.0|1         |0          |0                   | <- session 1 was already examined
|org_1 |user_1 |2018-12-20 19:55:38.0|2         |0          |2                   | <- sessions 3, 4 are above
+------+-------+---------------------+----------+-----------+--------------------+

Then I could sum up the overlapping_sessions column (3 + 1 + 2 = 6 here).
I can't figure out how to implement this in Spark. Any pointers?
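
To pin down the number I am after, here is a minimal plain-Python sketch (not Spark, and not the window-function approach described above) that computes it by brute force. It assumes a session spans from its first to its last logged action, that two sessions overlap when each starts no later than the other ends, and that the day is taken from each row's timestamp. `count_overlaps` is a hypothetical helper name, and the `new_session` column is left out because the sketch does not need it.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations

# Same rows as the example data: (org_id, user_id, ymd, session_id)
rows = [
    ("org_1", "user_1", "2018-12-20 19:55:29", 1),
    ("org_1", "user_1", "2018-12-20 19:55:30", 1),
    ("org_1", "user_1", "2018-12-20 19:55:31", 2),
    ("org_1", "user_1", "2018-12-20 19:55:32", 3),
    ("org_1", "user_1", "2018-12-20 19:55:33", 4),
    ("org_1", "user_1", "2018-12-20 19:55:34", 1),
    ("org_1", "user_1", "2018-12-20 19:55:35", 3),
    ("org_1", "user_1", "2018-12-20 19:55:36", 3),
    ("org_1", "user_1", "2018-12-20 19:55:37", 1),
    ("org_1", "user_1", "2018-12-20 19:55:38", 2),
]


def count_overlaps(rows):
    """Count overlapping session pairs per (org_id, user_id, day)."""
    # Collect (first_seen, last_seen) for every session within its group.
    spans = defaultdict(dict)
    for org_id, user_id, ymd, session_id in rows:
        ts = datetime.strptime(ymd, "%Y-%m-%d %H:%M:%S")
        group = (org_id, user_id, ts.date())
        start, end = spans[group].get(session_id, (ts, ts))
        spans[group][session_id] = (min(start, ts), max(end, ts))

    # Assumption: two sessions overlap when each starts no later than
    # the other one ends.
    result = {}
    for group, sessions in spans.items():
        result[group] = sum(
            1
            for (s1, e1), (s2, e2) in combinations(sessions.values(), 2)
            if s1 <= e2 and s2 <= e1
        )
    return result


overlaps = count_overlaps(rows)
print(overlaps)
```

For the sample data this yields 6 for the single org_1/user_1/2018-12-20 group, which is the figure I would like to reproduce in Spark. Checking every pair is quadratic in the number of sessions per group, so this is only meant to define the expected result, not to scale.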