This question relates to Spark 2.4.4.
I am performing a stream-static inner join, and the result is displayed as:
val orderDetailsJoined = orderItemsDF.join(ordersDF, Seq("CustomerID"), joinType = "inner")
+----------+-------+------+---------+--------+--------+------------+-----------------------+---------------------+---------------+-----------------------+
|CustomerID|OrderID|ItemID|ProductID|Quantity|Subtotal|ProductPrice|OrderItemsTimestamp |OrderDate |Status |OrdersTimestamp |
+----------+-------+------+---------+--------+--------+------------+-----------------------+---------------------+---------------+-----------------------+
|2 |33865 |84536 |957 |1 |299.98 |299.98 |2019-11-30 18:29:17.893|2014-02-18 00:00:00.0|COMPLETE |2019-11-30 18:29:19.331|
|2 |33865 |84537 |1073 |1 |199.99 |199.99 |2019-11-30 18:29:17.893|2014-02-18 00:00:00.0|COMPLETE |2019-11-30 18:29:19.331|
|2 |33865 |84538 |502 |1 |50.0 |50.0 |2019-11-30 18:29:17.893|2014-02-18 00:00:00.0|COMPLETE |2019-11-30 18:29:19.331|
Here, orderItemsDF is the streaming DataFrame and ordersDF is the static DataFrame.
Now I am trying to group the result by CustomerID and OrderID, as follows:
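For context, here is a minimal sketch of how the two DataFrames might be constructed; the source paths, formats, and column types are assumptions for illustration and are not taken from the original post.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StreamStaticJoin").getOrCreate()

// Streaming side: order items arriving continuously (source path and schema are assumed).
val orderItemsDF = spark.readStream.
  schema("CustomerID INT, ItemID INT, ProductID INT, Quantity INT, Subtotal DOUBLE, ProductPrice DOUBLE, OrderItemsTimestamp TIMESTAMP").
  json("/data/order_items/")   // hypothetical path

// Static side: orders loaded once as a batch DataFrame (path and schema are assumed).
val ordersDF = spark.read.
  schema("CustomerID INT, OrderID INT, OrderDate TIMESTAMP, Status STRING, OrdersTimestamp TIMESTAMP").
  json("/data/orders/")        // hypothetical path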
val aggResult = orderDetailsJoined.withWatermark("OrdersTimestamp", "2 minutes").
groupBy(window($"OrdersTimestamp", "1 minute"), $"CustomerID", $"OrderID").
agg(sum("Subtotal")).
select(col("CustomerID"), col("OrderID"), col("sum(Subtotal)").alias("Total Amount"))
But when I try to display the result as follows, it gives me a blank output:
val res = aggResult.writeStream.outputMode("append").format("console").trigger(Trigger.ProcessingTime("20 seconds")).option("truncate", "false").start()
res.awaitTermination()
-------------------------------------------
Batch: 1
-------------------------------------------
+----------+-------+------------+
|CustomerID|OrderID|Total Amount|
+----------+-------+------------+
+----------+-------+------------+
If I run
res.explain(true)
it says: No physical plan. Waiting for data.
Please help!
Answer 0 (score: 0)
tl;dr The OrdersTimestamp values do not appear to be advancing, so the 2-minute watermark and the 1-minute groupBy window cannot take effect.
You use OrdersTimestamp to tell Spark the event time. If, as in the three events you posted, it stays at 2019-11-30 18:29:19.331 and never moves forward, Spark simply waits until event time has passed 2019-11-30 18:29:19.331 + "1 minute" (the end of the group window) + "2 minutes" (the watermark for late events) before it passes the grouped results downstream.
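As a hedged sketch of two ways the grouped results could surface (these are suggestions based on general Structured Streaming semantics, not code from the original post): either let OrdersTimestamp keep advancing across batches so the watermark can move past each window's end, or switch to "update" output mode to see intermediate aggregates at every trigger.

import org.apache.spark.sql.streaming.Trigger

// Option 1 (assumption): emit intermediate aggregates on each trigger instead of
// waiting for the watermark to close the window. Update mode prints rows as they change.
val resUpdate = aggResult.writeStream.
  outputMode("update").
  format("console").
  trigger(Trigger.ProcessingTime("20 seconds")).
  option("truncate", "false").
  start()

// Option 2: keep append mode, but make sure later batches carry later OrdersTimestamp
// values. In append mode a group (window, CustomerID, OrderID) is only emitted once the
// watermark (max event time seen minus 2 minutes) passes the end of its 1-minute window.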