How do I get the lag of a column in a Spark streaming DataFrame?

Date: 2017-08-08 21:17:37

Tags: scala apache-spark apache-spark-sql window-functions spark-structured-streaming

I am streaming data into my Spark Scala application in this format:

id    mark1 mark2 mark3 time
uuid1 100   200   300   Tue Aug  8 14:06:02 PDT 2017
uuid1 100   200   300   Tue Aug  8 14:06:22 PDT 2017
uuid2 150   250   350   Tue Aug  8 14:06:32 PDT 2017
uuid2 150   250   350   Tue Aug  8 14:06:52 PDT 2017
uuid2 150   250   350   Tue Aug  8 14:06:58 PDT 2017

I read it into the columns id, mark1, mark2, mark3, and time, with time also converted to a timestamp. I want to partition by id and get the lag of mark1, which gives the previous row's mark1 value, like this:

id    mark1 mark2 mark3 prev_mark time
uuid1 100   200   300   null      Tue Aug  8 14:06:02 PDT 2017
uuid1 100   200   300   100       Tue Aug  8 14:06:22 PDT 2017
uuid2 150   250   350   null      Tue Aug  8 14:06:32 PDT 2017
uuid2 150   250   350   150       Tue Aug  8 14:06:52 PDT 2017
uuid2 150   250   350   150       Tue Aug  8 14:06:58 PDT 2017

Call this DataFrame markDF. I tried:

val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))

but got an error saying that non-time-based windows cannot be applied on streaming/appending Datasets/DataFrames.

I also tried:

val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))

which also failed with an invalid-window error. Streaming time windows like window("timestamp", "10 minutes") cannot be used to compute the lag. I am very confused about how to do this. Any help would be great!

1 Answer:

Answer 0 (score: -1)

I would suggest you change the time column to String:

+-----+-----+-----+-----+----------------------------+
|id   |mark1|mark2|mark3|time                        |
+-----+-----+-----+-----+----------------------------+
|uuid1|100  |200  |300  |Tue Aug  8 14:06:02 PDT 2017|
|uuid1|100  |200  |300  |Tue Aug  8 14:06:22 PDT 2017|
|uuid2|150  |250  |350  |Tue Aug  8 14:06:32 PDT 2017|
|uuid2|150  |250  |350  |Tue Aug  8 14:06:52 PDT 2017|
|uuid2|150  |250  |350  |Tue Aug  8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+

root
 |-- id: string (nullable = true)
 |-- mark1: integer (nullable = false)
 |-- mark2: integer (nullable = false)
 |-- mark3: integer (nullable = false)
 |-- time: string (nullable = true)

After that, do the following:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))

(Note that ordering by this string format only happens to work here because every row shares the same date prefix; in general you would order by a proper timestamp column.)

This will give you the following output:

+-----+-----+-----+-----+----------------------------+---------+
|id   |mark1|mark2|mark3|time                        |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100  |200  |300  |Tue Aug  8 14:06:02 PDT 2017|null     |
|uuid1|100  |200  |300  |Tue Aug  8 14:06:22 PDT 2017|100      |
|uuid2|150  |250  |350  |Tue Aug  8 14:06:32 PDT 2017|null     |
|uuid2|150  |250  |350  |Tue Aug  8 14:06:52 PDT 2017|150      |
|uuid2|150  |250  |350  |Tue Aug  8 14:06:58 PDT 2017|150      |
+-----+-----+-----+-----+----------------------------+---------+
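For intuition, the semantics of lag("mark1", 1) over a window partitioned by id and ordered by time can be sketched in plain Scala on a static collection (this is not Spark API; the Rec case class and withPrevMark helper below are illustrative names only):

```scala
// Illustrative sketch of what lag("mark1", 1).over(Window.partitionBy("id")
// .orderBy("time")) computes, on an in-memory collection instead of a DataFrame.
case class Rec(id: String, mark1: Int, time: String)

def withPrevMark(rows: Seq[Rec]): Map[String, Seq[(Rec, Option[Int])]] =
  rows.groupBy(_.id).map { case (id, group) =>
    // Order within the partition; assumes the time strings sort chronologically,
    // which holds for the sample data (same date prefix on every row).
    val sorted = group.sortBy(_.time)
    // Pair each row with the previous row's mark1; the first row of a
    // partition gets None, matching lag's null for the first row.
    id -> sorted.zip(None +: sorted.init.map(r => Option(r.mark1)))
  }
```

Running this on the sample rows yields None for the first row of each id and the previous mark1 for the rest, mirroring the prev_mark column in the table above.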