Spark Structured Streaming cascading self-join (PySpark)

Asked: 2020-04-10 09:07:18

Tags: apache-spark pyspark spark-structured-streaming self-join

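Problem: When I try to perform a cascading self-join on a PySpark streaming DataFrame, I get an empty result. A single self-join works fine, but a cascade of self-joins does not. There are no aggregations before the joins.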

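Example: I am working with a streaming DF that has three columns: user_id, action_id, and timestamp. I want to identify sequences of actions performed by a user, for example to detect when a sequence of action_ids such as a1, a2, a3 is performed by the same user_id.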

Code

Input: a Kafka source whose value field has the following format:

{"action_id":"a1","timestamp":1583301386000,"user_id":"u0"}
{"action_id":"a2","timestamp":1583301387000,"user_id":"u0"}
{"action_id":"a3","timestamp":1583301388000,"user_id":"u0"}
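
For reference, the sample records above can be decoded outside Spark to check which instants the millisecond timestamps denote. This is a minimal sketch in plain Python (standard library only), not part of the streaming job; the helper name `parse_event` and the `SAMPLE_VALUES` list are hypothetical and exist only for illustration.

```python
import json
from datetime import datetime, timezone

# The three sample Kafka values from the question, copied verbatim.
SAMPLE_VALUES = [
    '{"action_id":"a1","timestamp":1583301386000,"user_id":"u0"}',
    '{"action_id":"a2","timestamp":1583301387000,"user_id":"u0"}',
    '{"action_id":"a3","timestamp":1583301388000,"user_id":"u0"}',
]


def parse_event(raw):
    """Decode one JSON value and convert its epoch-millisecond timestamp to UTC."""
    record = json.loads(raw)
    record["timestamp"] = datetime.fromtimestamp(
        record["timestamp"] / 1000, tz=timezone.utc
    )
    return record


events = [parse_event(value) for value in SAMPLE_VALUES]
for event in events:
    print(event["user_id"], event["action_id"], event["timestamp"].isoformat())
# First line printed: u0 a1 2020-03-04T05:56:26+00:00
```

In UTC the first record falls at 05:56:26. The 06:56:26 shown in the result tables of the question would be consistent with the timestamps being rendered in a UTC+1 session time zone, although the question does not say so.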

A function that selects a specific action_id and adds a suffix to the user_id and timestamp columns:

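from pyspark.sql.functions import col, expr

def select_action_id(events_df, action_id, idx):
    # Keep only the rows for the requested action, then suffix the column
    # names with idx so they stay unambiguous after a self-join.
    return events_df\
        .where(col("action_id") == action_id)\
        .select(
            col("user_id").alias("user_id" + "_" + str(idx)),
            col("timestamp").alias("timestamp" + "_" + str(idx)))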

A cascading self-join that identifies the sequence of action_ids:

def get_sequence(events_df, action_ids):

    joined_df = None

    for action_id in action_ids:
        action_df = select_action_id(events_df, action_id, 0)

        if joined_df is not None:
            # Join the next action onto the sequence built so far: same user,
            # not earlier than the previous action, and within 24 hours of
            # the first action.
            joined_df = joined_df.join(action_df,
                expr("""
                    user_id == user_id_0 AND
                    timestamp_0 >= final_timestamp AND
                    timestamp_0 <= initial_timestamp + interval 24 hours
                    """))

            # The newly matched timestamp becomes the end of the sequence.
            joined_df = joined_df\
                .drop("user_id_0", "final_timestamp")\
                .withColumnRenamed("timestamp_0", "final_timestamp")

        else:
            # First action: it is both the start and the end of the sequence.
            joined_df = action_df
            joined_df = joined_df\
                .withColumnRenamed("timestamp_0", "final_timestamp")\
                .withColumnRenamed("user_id_0", "user_id")\
                .withColumn("initial_timestamp", col("final_timestamp"))

    return joined_df
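
To make the intended join semantics concrete, here is a minimal batch sketch in plain Python that applies the same three conditions (same user, not earlier than the previous action, within 24 hours of the first action) to the three sample records. It is only a reference for what the cascade should return on static data, not a Spark implementation and not a fix for the streaming behaviour; the names `get_sequence_batch`, `ts`, `EVENTS`, and `WINDOW` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ts(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# The three sample records from the question.
EVENTS = [
    {"user_id": "u0", "action_id": "a1", "timestamp": ts(1583301386000)},
    {"user_id": "u0", "action_id": "a2", "timestamp": ts(1583301387000)},
    {"user_id": "u0", "action_id": "a3", "timestamp": ts(1583301388000)},
]

WINDOW = timedelta(hours=24)


def get_sequence_batch(events, action_ids):
    """Return (user_id, initial_timestamp, final_timestamp) per matched sequence."""
    rows = None
    for action_id in action_ids:
        # Equivalent of select_action_id: filter, keep user and timestamp.
        matches = [
            (e["user_id"], e["timestamp"])
            for e in events
            if e["action_id"] == action_id
        ]
        if rows is None:
            # First action starts the sequence: initial == final.
            rows = [(user, t, t) for user, t in matches]
        else:
            # Inner join on the same three conditions as the expr() above;
            # the matched timestamp replaces final_timestamp.
            rows = [
                (user, initial, t)
                for user, initial, final in rows
                for other_user, t in matches
                if other_user == user and final <= t <= initial + WINDOW
            ]
    return rows or []


result = get_sequence_batch(EVENTS, ["a1", "a2", "a3"])
for user, initial, final in result:
    print(user, initial.isoformat(), final.isoformat())
```

On the sample data this yields a single row for u0 that spans the a1 and a3 timestamps, which matches the shape of the expected result; an out-of-order list such as ["a1", "a3", "a2"] yields no rows.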

Expected result

+-------+-------------------+-------------------+
|user_id|initial_timestamp  |final_timestamp    |
+-------+-------------------+-------------------+
|u0     |2020-03-04 06:56:26|2020-03-04 06:56:28|
+-------+-------------------+-------------------+

Actual result

+-------+-----------------+---------------+
|user_id|initial_timestamp|final_timestamp|
+-------+-----------------+---------------+
+-------+-----------------+---------------+

Additional information: the proposed solution works when actions_ids = ["a1", "a2"] (i.e. only one join is involved), but not when actions_ids = ["a1", "a2", "a3"].

Timestamps are converted from Unix time to UTC.

This is my first Stack Overflow question :) I apologize if the description is not clear enough or if it violates any conventions.

0 answers:

No answers yet.