Problem: when attempting cascaded self-joins on a PySpark streaming DF, I get an empty result. A single self-join works fine, but cascading them does not. There are no aggregations before the joins.
Example: I am working with a streaming DF containing three columns: user_id, action_id and timestamp. I want to identify sequences of actions performed by a user, e.g. when a sequence of action_ids like a1, a2 and a3 is observed for the same user_id.
Code
Input: a Kafka source whose value field has the format:
{"action_id":"a1","timestamp":1583301386000,"user_id":"u0"}
{"action_id":"a2","timestamp":1583301387000,"user_id":"u0"}
{"action_id":"a3","timestamp":1583301388000,"user_id":"u0"}
A function that selects a specific action_id and suffixes the user_id and timestamp columns:
from pyspark.sql.functions import col, expr

def select_action_id(events_df, action_id, idx):
    # Keep only the rows for the given action_id and suffix the columns
    # so they can be joined against the previous step of the cascade.
    return events_df\
        .select(
            col("user_id").alias("user_id" + "_" + str(idx)),
            col("timestamp").alias("timestamp" + "_" + str(idx)))\
        .where(col("action_id") == action_id)
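As an illustration, selecting the a1 events would then look like this (the variable name is hypothetical):

# Events for action "a1", with columns renamed to user_id_0 and
# timestamp_0 so they can be joined in the next step of the cascade.
a1_df = select_action_id(events_df, "a1", 0)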
Cascaded self-joins to identify the sequence of action_ids:
def get_sequence(events_df, action_ids):
    joined_df = None
    for action_id in action_ids:
        action_df = select_action_id(events_df, action_id, 0)
        if joined_df is not None:
            # Join the next action onto the sequence built so far: same user,
            # event after the last matched one, and within 24 hours of the first.
            joined_df = joined_df.join(action_df,
                expr("""
                    user_id == user_id_0 AND
                    timestamp_0 >= final_timestamp AND
                    timestamp_0 <= initial_timestamp + interval 24 hours
                """))
            joined_df = joined_df\
                .drop("user_id_0", "final_timestamp")\
                .withColumnRenamed("timestamp_0", "final_timestamp")
        else:
            # First action in the sequence: initialise the rolling columns.
            joined_df = action_df
            joined_df = joined_df\
                .withColumnRenamed("timestamp_0", "final_timestamp")\
                .withColumnRenamed("user_id_0", "user_id")\
                .withColumn("initial_timestamp", col("final_timestamp"))
    return joined_df
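A minimal sketch of how get_sequence is invoked and inspected, assuming a console sink for debugging (the sink, output mode and options are illustrative):

sequence_df = get_sequence(events_df, ["a1", "a2", "a3"])

# Console sink just to inspect the joined stream while debugging.
query = sequence_df.writeStream\
    .format("console")\
    .outputMode("append")\
    .option("truncate", False)\
    .start()

query.awaitTermination()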
Expected result
+-------+-------------------+-------------------+
|user_id|initial_timestamp |final_timestamp |
+-------+-------------------+-------------------+
|u0 |2020-03-04 06:56:26|2020-03-04 06:56:28|
+-------+-------------------+-------------------+
Obtained result
+-------+-----------------+---------------+
|user_id|initial_timestamp|final_timestamp|
+-------+-----------------+---------------+
+-------+-----------------+---------------+
Additional information: the proposed solution works when action_ids = ["a1", "a2"] (i.e., when only one join is involved), but not when action_ids = ["a1", "a2", "a3"].
The timestamps are converted from unix epoch to UTC.
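That conversion is roughly the following (a sketch; the division by 1000 matches the millisecond values in the sample payload above):

# Interpret the epoch-millisecond value as a Spark timestamp before joining.
events_df = events_df.withColumn(
    "timestamp", (col("timestamp") / 1000).cast("timestamp"))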
This is my first Stack Overflow question :) I apologize if the description of this problem is not clear enough or violates any conventions.