Aggregating the output of a stream-static inner join with Structured Streaming

Time: 2019-12-01 11:04:27

Tags: apache-spark apache-spark-sql spark-structured-streaming

This question concerns Spark 2.4.4.

I am performing a stream-static inner join and displaying the result:

val orderDetailsJoined = orderItemsDF.join(ordersDF, Seq("CustomerID"), joinType = "inner")

+----------+-------+------+---------+--------+--------+------------+-----------------------+---------------------+---------------+-----------------------+
|CustomerID|OrderID|ItemID|ProductID|Quantity|Subtotal|ProductPrice|OrderItemsTimestamp    |OrderDate            |Status         |OrdersTimestamp        |
+----------+-------+------+---------+--------+--------+------------+-----------------------+---------------------+---------------+-----------------------+
|2         |33865  |84536 |957      |1       |299.98  |299.98      |2019-11-30 18:29:17.893|2014-02-18 00:00:00.0|COMPLETE       |2019-11-30 18:29:19.331|
|2         |33865  |84537 |1073     |1       |199.99  |199.99      |2019-11-30 18:29:17.893|2014-02-18 00:00:00.0|COMPLETE       |2019-11-30 18:29:19.331|
|2         |33865  |84538 |502      |1       |50.0    |50.0        |2019-11-30 18:29:17.893|2014-02-18 00:00:00.0|COMPLETE       |2019-11-30 18:29:19.331|

where orderItemsDF is the streaming DataFrame and ordersDF is a static DataFrame.
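
For context, a stream-static join like this pairs a streaming source with a batch-loaded DataFrame. Below is a minimal sketch of one way such inputs could be wired up; the Kafka topic, bootstrap servers, Parquet path, and schema are assumptions for illustration, not details from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.master("local[*]").appName("StreamStaticJoin").getOrCreate()
import spark.implicits._

// Hypothetical schema for the streaming order-items feed.
val orderItemsSchema = new StructType().
  add("CustomerID", IntegerType).
  add("OrderID", IntegerType).
  add("ItemID", IntegerType).
  add("ProductID", IntegerType).
  add("Quantity", IntegerType).
  add("Subtotal", DoubleType).
  add("ProductPrice", DoubleType).
  add("OrderItemsTimestamp", TimestampType)

// Streaming side: order items arriving as JSON on an assumed Kafka topic
// (requires the spark-sql-kafka-0-10 package).
val orderItemsDF = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("subscribe", "order-items").
  load().
  select(from_json($"value".cast("string"), orderItemsSchema).as("v")).
  select("v.*")

// Static side: orders loaded once from an assumed Parquet path, presumably
// holding CustomerID, OrderDate, Status, and OrdersTimestamp columns.
val ordersDF = spark.read.parquet("/data/orders")

val orderDetailsJoined =
  orderItemsDF.join(ordersDF, Seq("CustomerID"), joinType = "inner")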

Now I am trying to group the result by CustomerID and OrderID, as follows:

val aggResult = orderDetailsJoined.withWatermark("OrdersTimestamp", "2 minutes").
      groupBy(window($"OrdersTimestamp", "1 minute"), $"CustomerID", $"OrderID").
      agg(sum("Subtotal")).
      select(col("CustomerID"), col("OrderID"), col("sum(Subtotal)").alias("Total Amount"))

But when I try to display the result as follows, I get empty output:

val res = aggResult.writeStream.
      outputMode("append").
      format("console").
      trigger(Trigger.ProcessingTime("20 seconds")).
      option("truncate", "false").
      start()
res.awaitTermination()

-------------------------------------------
Batch: 1
-------------------------------------------
+----------+-------+------------+
|CustomerID|OrderID|Total Amount|
+----------+-------+------------+
+----------+-------+------------+

If I run

res.explain(true)

it says: No physical plan. Waiting for data.

Please help!

1 Answer:

Answer 0 (score: 0)

tl;dr The OrdersTimestamp values do not appear to be advancing, so the 2-minute watermark and the 1-minute groupBy window never come into play.


You use OrdersTimestamp to tell Spark about event time. If, as in the three events you posted, it stays at 2019-11-30 18:29:19.331 and never moves forward, Spark simply waits: in append mode, a window's grouped result is handed downstream only once the event time advances past 2019-11-30 18:29:19.331 + "1 minute" (the group window) + "2 minutes" (the watermark for late events).
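
To make the mechanics concrete, here is a minimal, self-contained sketch (my own illustration, not code from the answer; the MemoryStream harness and the sample timestamps are assumptions) showing that in append mode a window's aggregate is emitted only after a later event pushes the watermark past the window's end:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[2]").appName("WatermarkDemo").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

// Events shaped like the joined rows: (CustomerID, OrderID, Subtotal, OrdersTimestamp).
val input = MemoryStream[(Int, Int, Double, Timestamp)]

val aggResult = input.toDS().toDF("CustomerID", "OrderID", "Subtotal", "OrdersTimestamp").
  withWatermark("OrdersTimestamp", "2 minutes").
  groupBy(window($"OrdersTimestamp", "1 minute"), $"CustomerID", $"OrderID").
  agg(sum("Subtotal").alias("Total Amount"))

val query = aggResult.writeStream.outputMode("append").format("console").start()

// Two events fall into the [18:29, 18:30) window. The watermark only reaches
// max event time - 2 minutes = 18:27:20, so append mode emits nothing yet.
input.addData((2, 33865, 299.98, Timestamp.valueOf("2019-11-30 18:29:19")))
input.addData((2, 33865, 199.99, Timestamp.valueOf("2019-11-30 18:29:20")))
query.processAllAvailable()

// A later event moves the watermark to 18:31:00, past the 18:30:00 window
// end, so the closed window's aggregate appears in a following micro-batch.
input.addData((7, 99999, 50.0, Timestamp.valueOf("2019-11-30 18:33:00")))
query.processAllAvailable()

As an aside, if the goal is just to see partial aggregates before the watermark closes a window, writing with outputMode("update") instead of "append" prints rows as they change; append by design withholds a window until it can no longer receive late data.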