Last non-null value with a Spark window function

Date: 2019-06-17 19:25:03

Tags: apache-spark pyspark apache-spark-sql

We have a time-series table of user events that looks like this:

timestamp             user_id     event            ticke_type     error_type 
2019-06-06 14:33:31   user_a      choose_ticket    ticke_b        NULL
2019-06-06 14:34:31   user_b      choose_ticket    ticke_f        NULL
2019-06-06 14:36:31   user_a      booing_error     NULL           error_c  
2019-06-06 14:37:31   user_a      choose_ticket    ticke_h        NULL
2019-06-06 14:38:31   user_a      booing_error     NULL           error_d
2019-06-06 14:39:31   user_a      booing_error     NULL           error_e

Here is one use case we need to support:

To investigate which ticket type leads to certain booking errors, we need to look at the ticket type, which is only available on the earlier choose_ticket event.

So what we want is: for each booing_error event, find the same user's most recent preceding choose_ticket event and merge its ticket type into the booing_error event.

So ideally, the output we want is:

timestamp             user_id     event            ticke_type     error_type 
2019-06-06 14:36:31   user_a      booing_error     ticke_b        error_c  
2019-06-06 14:38:31   user_a      booing_error     ticke_h        error_d
2019-06-06 14:39:31   user_a      booing_error     ticke_h        error_e

The closest thing I could find is Spark add new column to dataframe with value from previous row, which lets us take an attribute from the immediately preceding event and apply it to the current one.

This almost works, except that when there are several such events in a row (booing_error in this example), only the first of them picks up the attribute it needs. For example, this is what we get with the solution from the SO link above (a sketch of that lag-based approach follows the table):

timestamp             user_id     event            ticke_type     error_type 
2019-06-06 14:36:31   user_a      booing_error     ticke_b        error_c  
2019-06-06 14:38:31   user_a      booing_error     ticke_h        error_d
2019-06-06 14:39:31   user_a      booing_error     NULL           error_e
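
For reference, here is a minimal pyspark sketch of that lag-based approach (my reconstruction, not the linked answer verbatim, assuming the sample data above is loaded into a DataFrame named df); it shows why only some booing_error rows pick up a ticket type:

from pyspark.sql import Window
from pyspark.sql.functions import col, lag

w = Window.partitionBy("user_id").orderBy("timestamp")

# lag(1) reads exactly one row back, so a booing_error row only picks up a
# ticket type when the row immediately before it is a choose_ticket; the
# 14:39:31 row instead sees the NULL ticke_type of the previous booing_error.
df.withColumn("prev_ticke_type", lag("ticke_type", 1).over(w)) \
  .filter(col("event") == "booing_error") \
  .show()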

To sum up: for a given row, how do I find the most recent preceding row that meets a certain condition and "cherry-pick" its attributes?

What is the best way to do this?

2 answers:

Answer 0 (score: 2):

org.apache.spark.sql.functions.last is what you are looking for. You can rename the closest column at the end to replace ticke_type.

scala> df.show
+-------------------+-------+-------------+----------+----------+
|          timestamp|user_id|        event|ticke_type|error_type|
+-------------------+-------+-------------+----------+----------+
|2019-06-06 14:33:31| user_a|choose_ticket|   ticke_b|      null|
|2019-06-06 14:34:31| user_b|choose_ticket|   ticke_f|      null|
|2019-06-06 14:36:31| user_a|booking_error|      null|   error_c|
|2019-06-06 14:37:31| user_a|choose_ticket|   ticke_h|      null|
|2019-06-06 14:38:31| user_a|booking_error|      null|   error_d|
|2019-06-06 14:39:31| user_a|booking_error|      null|   error_e|
+-------------------+-------+-------------+----------+----------+

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val overColumns = Window.partitionBy("user_id").orderBy("timestamp")
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@70dc8c9a

scala> df.withColumn("closest", 
  org.apache.spark.sql.functions.last("ticke_type", true).over(overColumns)).filter($"event" === "booking_error").show
+-------------------+-------+-------------+----------+----------+-------+
|          timestamp|user_id|        event|ticke_type|error_type|closest|
+-------------------+-------+-------------+----------+----------+-------+
|2019-06-06 14:36:31| user_a|booking_error|      null|   error_c|ticke_b|
|2019-06-06 14:38:31| user_a|booking_error|      null|   error_d|ticke_h|
|2019-06-06 14:39:31| user_a|booking_error|      null|   error_e|ticke_h|
+-------------------+-------+-------------+----------+----------+-------+

Answer 1 (score: 1):

Here is the pyspark version:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, last, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('2019-06-06 14:33:31', 'user_a', 'choose_ticket', 'ticke_b', None),
     ('2019-06-06 14:34:31', 'user_b', 'choose_ticket', 'ticke_f', None),
     ('2019-06-06 14:36:31', 'user_a', 'booing_error', None, 'error_c'),
     ('2019-06-06 14:37:31', 'user_a', 'choose_ticket', 'ticke_h', None),
     ('2019-06-06 14:38:31', 'user_a', 'booing_error', None, 'error_d'),
     ('2019-06-06 14:39:31', 'user_a', 'booing_error', None, 'error_e'),
     ],
    ("timestamp", "user_id", "event", "ticke_type", "error_type"))

df.show()

window_spec = Window.partitionBy(col("user_id")).orderBy(col("timestamp"))

# Forward-fill: keep the row's own ticke_type on choose_ticket events,
# otherwise take the last non-null ticke_type seen so far in the window.
df = df.withColumn('ticke_type_forwardfill',
                   when(col("event") == "choose_ticket", col("ticke_type"))
                   .otherwise(last("ticke_type", True).over(window_spec))) \
    .drop(col("ticke_type")) \
    .filter(col("event") == "booing_error")

df.show()

Result:

+-------------------+-------+-------------+----------+----------+
|          timestamp|user_id|        event|ticke_type|error_type|
+-------------------+-------+-------------+----------+----------+
|2019-06-06 14:33:31| user_a|choose_ticket|   ticke_b|      null|
|2019-06-06 14:34:31| user_b|choose_ticket|   ticke_f|      null|
|2019-06-06 14:36:31| user_a| booing_error|      null|   error_c|
|2019-06-06 14:37:31| user_a|choose_ticket|   ticke_h|      null|
|2019-06-06 14:38:31| user_a| booing_error|      null|   error_d|
|2019-06-06 14:39:31| user_a| booing_error|      null|   error_e|
+-------------------+-------+-------------+----------+----------+

+-------------------+-------+------------+----------+----------------------+
|          timestamp|user_id|       event|error_type|ticke_type_forwardfill|
+-------------------+-------+------------+----------+----------------------+
|2019-06-06 14:36:31| user_a|booing_error|   error_c|               ticke_b|
|2019-06-06 14:38:31| user_a|booing_error|   error_d|               ticke_h|
|2019-06-06 14:39:31| user_a|booing_error|   error_e|               ticke_h|
+-------------------+-------+------------+----------+----------------------+
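
As a side note on why the forward fill works: when a window has an ORDER BY but no explicit frame, Spark defaults to a frame from UNBOUNDED PRECEDING to CURRENT ROW. A minimal sketch that spells this out (equivalent here, since each user's timestamps are distinct):

from pyspark.sql import Window
from pyspark.sql.functions import col, last

# Explicit frame: everything from the start of the user's partition up to
# and including the current row, in timestamp order.
window_spec = (Window.partitionBy("user_id")
               .orderBy("timestamp")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# last(..., ignorenulls=True) then yields the most recent non-null
# ticke_type at or before each row.
df.withColumn("closest", last("ticke_type", True).over(window_spec)) \
  .filter(col("event") == "booing_error") \
  .show()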