How to find the next occurrence from the current row in a DataFrame using Spark windowing?

Time: 2019-06-26 18:46:26

Tags: scala apache-spark windowing

I have the following DataFrame:

+------+----------+-------------+--------------------+---------+-----+----------+
|ID    |MEM_ID    | BFS         | SVC_DT             |TYP      |SEQ  |BFS_SEQ   |
+------+----------+-------------+--------------------+---------+-----+----------+
|105771|29378668  | BRIMONIDINE | 2019-02-04 00:00:00|PD       |1    |1         |
|105772|29378668  | BRIMONIDINE | 2019-04-04 00:00:00|PD       |2    |2         |
|105773|29378668  | BRIMONIDINE | 2019-04-17 00:00:00|RV       |3    |3         |
|105774|29378668  | TIMOLOL     | 2019-04-17 00:00:00|RV       |4    |1         |
|105775|29378668  | BRIMONIDINE | 2019-04-22 00:00:00|PD       |5    |4         |
|105776|29378668  | TIMOLOL     | 2019-04-22 00:00:00|PD       |6    |2         |
+------+----------+-------------+--------------------+---------+-----+----------+

For each row, I have to look forward from the current row for the next occurrence of type "PD" within the same BFS, and fill its associated ID into a new column named "NEXT_PD_TYP_ID".

The output I expect is:

+------+---------+-------------+--------------------+----+-----+---------+---------------+
|ID    |MEM_ID   | BFS         | SVC_DT             |TYP |SEQ  |BFS_SEQ  |NEXT_PD_TYP_ID |
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD  |1    |1        |105772         |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD  |2    |2        |105775         | 
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV  |3    |3        |105775         |
|105774|29378668 | TIMOLOL     | 2019-04-17 00:00:00|RV  |4    |1        |105776         |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD  |5    |4        |null           | 
|105776|29378668 | TIMOLOL     | 2019-04-22 00:00:00|PD  |6    |2        |null           |
+------+---------+-------------+--------------------+----+-----+---------+---------------+

Any help is appreciated.

I tried conditional aggregation, max(when), but because there are multiple "PD" rows, max returns only a single value for all rows.

There is no error message.
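
For reference, a minimal sketch of what such a max(when) attempt might look like (my reconstruction; the window spec is an assumption, not the asker's actual code). Without a frame that starts at the row after the current one, the aggregate spans the whole partition, so every row receives the same maximum:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical reconstruction of the failed attempt: aggregating over the
// whole BFS partition makes max return one value per partition, not the
// next occurrence relative to each row.
val byBfs = Window.partitionBy("BFS")
df.withColumn("NEXT_PD_TYP_ID", max(when($"TYP" === "PD", $"ID")).over(byBfs))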

1 Answer:

Answer 0 (score: 0)

I hope this helps. I created a new column containing the ID where TYP === "PD"; I called it TYPPDID. Then I used a window frame from the next row to unbounded following and took the first non-null TYPPDID. The final orderBy("ID") is only there to display the records in order.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Flag each PD row with its own ID in a helper column, TYPPDID
val df = Seq(
("105771", "BRIMONIDINE", "PD"),
("105772", "BRIMONIDINE", "PD"),
("105773", "BRIMONIDINE","RV"),
("105774", "TIMOLOL", "RV"),
("105775", "BRIMONIDINE", "PD"),
("105776", "TIMOLOL", "PD")
).toDF("ID", "BFS", "TYP").withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
df: org.apache.spark.sql.DataFrame = [ID: string, BFS: string ... 2 more fields]

scala> df.show
+------+-----------+---+-------+
|    ID|        BFS|TYP|TYPPDID|
+------+-----------+---+-------+
|105771|BRIMONIDINE| PD| 105771|
|105772|BRIMONIDINE| PD| 105772|
|105773|BRIMONIDINE| RV|   null|
|105774|    TIMOLOL| RV|   null|
|105775|BRIMONIDINE| PD| 105775|
|105776|    TIMOLOL| PD| 105776|
+------+-----------+---+-------+


scala> val overColumns = Window.partitionBy("BFS").orderBy("ID").rowsBetween(1, Window.unboundedFollowing)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@eb923ef


scala> df.withColumn("NEXT_PD_TYP_ID",first("TYPPDID", true).over(overColumns)).orderBy("ID").show(false)
+------+-----------+---+-------+--------------+
|ID    |BFS        |TYP|TYPPDID|NEXT_PD_TYP_ID|
+------+-----------+---+-------+--------------+
|105771|BRIMONIDINE|PD |105771 |105772        |
|105772|BRIMONIDINE|PD |105772 |105775        |
|105773|BRIMONIDINE|RV |null   |105775        |
|105774|TIMOLOL    |RV |null   |105776        |
|105775|BRIMONIDINE|PD |105775 |null          |
|105776|TIMOLOL    |PD |105776 |null          |
+------+-----------+---+-------+--------------+
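
To apply the same idea to the original schema, one would presumably partition by MEM_ID and BFS and order by SEQ instead of ID; a minimal sketch under those assumptions (df here stands for the full DataFrame from the question, and the helper column is dropped at the end):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumed extension to the full schema: one partition per member and drug,
// ordered by the existing SEQ column; the frame starts at the next row and
// runs to the end of the partition, so first(..., ignoreNulls = true) picks
// the nearest following PD row's ID.
val fullWindow = Window
  .partitionBy("MEM_ID", "BFS")
  .orderBy("SEQ")
  .rowsBetween(1, Window.unboundedFollowing)

val result = df
  .withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
  .withColumn("NEXT_PD_TYP_ID", first("TYPPDID", ignoreNulls = true).over(fullWindow))
  .drop("TYPPDID")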