We have a time-series database of user events, like this:
timestamp user_id event ticke_type error_type
2019-06-06 14:33:31 user_a choose_ticket ticke_b NULL
2019-06-06 14:34:31 user_b choose_ticket ticke_f NULL
2019-06-06 14:36:31 user_a booing_error NULL error_c
2019-06-06 14:37:31 user_a choose_ticket ticke_h NULL
2019-06-06 14:38:31 user_a booing_error NULL error_d
2019-06-06 14:39:31 user_a booing_error NULL error_e
Here is one use case we need: to investigate which ticket type is causing certain booking errors, we want to look up the ticket type, which is only available in the earlier choose_ticket event. In other words, for each booing_error event we want to find the previous choose_ticket event from the same user and merge its ticket type into the booing_error event.
So, ideally, the output we want is:
timestamp user_id event ticke_type error_type
2019-06-06 14:36:31 user_a booing_error ticke_b error_c
2019-06-06 14:38:31 user_a booing_error ticke_h error_d
2019-06-06 14:39:31 user_a booing_error ticke_h error_e
The closest thing I could find is Spark add new column to dataframe with value from previous row, which lets us take an attribute from the previous row and apply it to the current one. This almost works, except that when there are multiple consecutive events of the same kind (booing_error in this example), only the first of them picks up the needed attribute. For example, here is what we get with the solution from the SO link above:
timestamp user_id event ticke_type error_type
2019-06-06 14:36:31 user_a booing_error ticke_b error_c
2019-06-06 14:38:31 user_a booing_error ticke_h error_d
2019-06-06 14:39:31 user_a booing_error NULL error_e
To summarize: for a given row, how do I find the previous row that meets a certain condition and "cherry-pick" its attributes over? What is the best way to do this?
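For reference, the semantics I am after can be sketched in plain Python (no Spark): per user, carry the most recent ticket type forward and attach it to each error row. The function name is illustrative only:

```python
def fill_previous_ticket(rows):
    """rows: list of (timestamp, user_id, event, ticke_type, error_type)
    tuples, assumed sorted by timestamp. Returns the booing_error rows
    with the ticket type cherry-picked from the same user's most recent
    choose_ticket event (None if there was none)."""
    last_ticket = {}  # user_id -> most recent ticke_type seen
    out = []
    for ts, user, event, ticket, error in rows:
        if event == "choose_ticket":
            last_ticket[user] = ticket
        elif event == "booing_error":
            out.append((ts, user, event, last_ticket.get(user), error))
    return out
```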
Answer 0 (score: 2)
org.apache.spark.sql.functions.last is what you are looking for: with ignoreNulls = true it returns the most recent non-null value within the window frame. You can rename the closest column afterwards to replace ticke_type.
scala> df.show
+-------------------+-------+-------------+----------+----------+
| timestamp|user_id| event|ticke_type|error_type|
+-------------------+-------+-------------+----------+----------+
|2019-06-06 14:33:31| user_a|choose_ticket| ticke_b| null|
|2019-06-06 14:34:31| user_b|choose_ticket| ticke_f| null|
|2019-06-06 14:36:31| user_a|booking_error| null| error_c|
|2019-06-06 14:37:31| user_a|choose_ticket| ticke_h| null|
|2019-06-06 14:38:31| user_a|booking_error| null| error_d|
|2019-06-06 14:39:31| user_a|booking_error| null| error_e|
+-------------------+-------+-------------+----------+----------+
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val overColumns = Window.partitionBy("user_id").orderBy("timestamp")
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@70dc8c9a

scala> df.withColumn("closest",
     |   org.apache.spark.sql.functions.last("ticke_type", true).over(overColumns)
     | ).filter($"event" === "booking_error").show
+-------------------+-------+-------------+----------+----------+-------+
| timestamp|user_id| event|ticke_type|error_type|closest|
+-------------------+-------+-------------+----------+----------+-------+
|2019-06-06 14:36:31| user_a|booking_error| null| error_c|ticke_b|
|2019-06-06 14:38:31| user_a|booking_error| null| error_d|ticke_h|
|2019-06-06 14:39:31| user_a|booking_error| null| error_e|ticke_h|
+-------------------+-------+-------------+----------+----------+-------+
Answer 1 (score: 1)

Here is the pyspark version:
from pyspark.sql import Window
from pyspark.sql.functions import col, last, when

# `spark` is the active SparkSession
df = spark.createDataFrame(
    [('2019-06-06 14:33:31', 'user_a', 'choose_ticket', 'ticke_b', None),
     ('2019-06-06 14:34:31', 'user_b', 'choose_ticket', 'ticke_f', None),
     ('2019-06-06 14:36:31', 'user_a', 'booing_error', None, 'error_c'),
     ('2019-06-06 14:37:31', 'user_a', 'choose_ticket', 'ticke_h', None),
     ('2019-06-06 14:38:31', 'user_a', 'booing_error', None, 'error_d'),
     ('2019-06-06 14:39:31', 'user_a', 'booing_error', None, 'error_e'),
    ],
    ("timestamp", "user_id", "event", "ticke_type", "error_type"))
df.show()

window_spec = Window.partitionBy(col("user_id")).orderBy(col("timestamp"))

df = (df
      .withColumn('ticke_type_forwardfill',
                  when(col("event") == "choose_ticket", col("ticke_type"))
                  .otherwise(last("ticke_type", True).over(window_spec)))
      .drop(col("ticke_type"))
      .filter(col("event") == "booing_error"))
df.show()
Result:
+-------------------+-------+-------------+----------+----------+
| timestamp|user_id| event|ticke_type|error_type|
+-------------------+-------+-------------+----------+----------+
|2019-06-06 14:33:31| user_a|choose_ticket| ticke_b| null|
|2019-06-06 14:34:31| user_b|choose_ticket| ticke_f| null|
|2019-06-06 14:36:31| user_a| booing_error| null| error_c|
|2019-06-06 14:37:31| user_a|choose_ticket| ticke_h| null|
|2019-06-06 14:38:31| user_a| booing_error| null| error_d|
|2019-06-06 14:39:31| user_a| booing_error| null| error_e|
+-------------------+-------+-------------+----------+----------+
+-------------------+-------+------------+----------+----------------------+
| timestamp|user_id| event|error_type|ticke_type_forwardfill|
+-------------------+-------+------------+----------+----------------------+
|2019-06-06 14:36:31| user_a|booing_error| error_c| ticke_b|
|2019-06-06 14:38:31| user_a|booing_error| error_d| ticke_h|
|2019-06-06 14:39:31| user_a|booing_error| error_e| ticke_h|
+-------------------+-------+------------+----------+----------------------+
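A side note on the when/otherwise branch above: it is arguably redundant, because for a choose_ticket row the row's own non-null ticke_type is already the last non-null value in the frame, so a plain forward fill gives the same result. A plain-Python check of that equivalence (illustrative only, no Spark; function names are mine):

```python
def fill_with_branch(rows):
    """Mirrors the when/otherwise logic: choose_ticket rows keep their own
    ticket, other rows take the user's most recent choose_ticket value."""
    last = {}
    out = []
    for ts, user, event, ticket, error in rows:
        if event == "choose_ticket":   # the when() branch
            filled = ticket
            last[user] = ticket
        else:                          # the otherwise() branch
            filled = last.get(user)
        out.append((ts, user, event, filled, error))
    return out

def fill_plain(rows):
    """Plain forward fill: always take the last non-null ticket seen."""
    last = {}
    out = []
    for ts, user, event, ticket, error in rows:
        if ticket is not None:
            last[user] = ticket
        out.append((ts, user, event, last.get(user), error))
    return out
```

The two agree whenever choose_ticket rows are the only ones carrying a non-null ticke_type, as in the sample data.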