I am streaming data into my Spark Scala application in this format:
id mark1 mark2 mark3 time
uuid1 100 200 300 Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:58 PDT 2017
I read it into the columns id, mark1, mark2, mark3, and time, and the time column is converted to a datetime format. I want to group by id and get the lag of mark1, which gives the mark1 value from the previous row, like this:
id mark1 mark2 mark3 prev_mark time
uuid1 100 200 300 null Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 100 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 null Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:58 PDT 2017
Call the DataFrame markDF. I tried:
val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
which fails, saying that non-time-based windows cannot be applied to streaming/append Datasets/DataFrames.
I also tried:
val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
which also fails with an invalid window error. A streaming window, which looks like:
window("timestamp", "10 minutes")
cannot be used to get the lag. I am very confused about how to do this. Any help would be great!
Answer 0 (score: -1)
I suggest you change the time column to String:
+-----+-----+-----+-----+----------------------------+
|id |mark1|mark2|mark3|time |
+-----+-----+-----+-----+----------------------------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+
root
|-- id: string (nullable = true)
|-- mark1: integer (nullable = false)
|-- mark2: integer (nullable = false)
|-- mark3: integer (nullable = false)
|-- time: string (nullable = true)
After that, do the following:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Note: ordering by a String time is only correct when lexicographic order
// matches chronological order, as it does for these same-day timestamps.
df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
This will give you the output:
+-----+-----+-----+-----+----------------------------+---------+
|id |mark1|mark2|mark3|time |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|null |
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|100 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|null |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|150 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|150 |
+-----+-----+-----+-----+----------------------------+---------+
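Note that the snippet above only runs on a static (batch) DataFrame; as the error in the question says, Spark rejects non-time-based windows on streaming Datasets. A commonly used workaround (a sketch only, assuming Spark 2.4+ where foreachBatch is available; streamDF and the console output here are placeholders, not from the question) is to apply the window inside foreachBatch, where each micro-batch arrives as a plain DataFrame:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// streamDF stands in for the streaming DataFrame from the question
// (columns id, mark1, mark2, mark3, time).
val query = streamDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Inside foreachBatch the micro-batch is a static DataFrame,
    // so a non-time-based window with lag is permitted.
    val window = Window.partitionBy("id").orderBy("time")
    batchDF
      .withColumn("prev_mark", lag("mark1", 1).over(window))
      .show(truncate = false)
  }
  .start()
```

Be aware that lag is then computed independently per micro-batch: the first row for a given id in each batch gets null even if an earlier row for that id arrived in a previous batch. Carrying state across batches would need a stateful operator such as flatMapGroupsWithState.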