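Spark Structured Streaming does not emit the aggregation for the last batch of input. The aggregates are delayed indefinitely, and the only way to make them appear is to send further input.
To test this, I run netcat in a terminal and type three values: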
nc -lk 9999
1
2
3
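Then I run the Scala snippet shown below in spark-shell.
Once the stream is ready, I type 1, 2, 3 and wait a while. No aggregates appear. Some time later, when I type 4, the aggregates for the first three batches (1, 2, 3) appear. The aggregate for input 4 never appears.
This behavior was observed in Spark 2.4.4, 2.4.0, and 2.3.0.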
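import org.apache.spark.sql.streaming._
import spark.implicits._
// window, col and avg come from org.apache.spark.sql.functions._,
// which spark-shell imports by default.
// Read lines from the netcat socket; includeTimestamp adds a "timestamp" column.
val df = spark.readStream.
  format("socket").
  option("host", "localhost").
  option("port", 9999).
  option("includeTimestamp", true).
  load()
// Group into 10-second tumbling windows with a 10-second watermark.
val wordCounts = df.
  withWatermark("timestamp", "10 seconds").
  groupBy(window($"timestamp", "10 seconds", "10 seconds"), col("value")).
  agg(avg("value")).
  select("window.start", "window.end", "*").
  drop("window")
// df.writeStream.format("console").outputMode("append").start()
// Print finalized windows to the console; append mode emits each window once.
wordCounts.writeStream.
  format("console").
  outputMode("append").
  start().
  awaitTermination()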
scala> :load /mqtt4/src/main/scala/SparkStreamingExample.scala
Loading /mqtt4/src/main/scala/SparkStreamingExample.scala...
import org.apache.spark.sql.streaming._
import spark.implicits._
2019-09-05 15:42:19 WARN TextSocketSourceProvider:66 - The socket source should not be used for production applications! It does not support recovery.
df: org.apache.spark.sql.DataFrame = [value: string, timestamp: timestamp]
wordCounts: org.apache.spark.sql.DataFrame = [start: timestamp, end: timestamp ... 2 more fields]
2019-09-05 15:42:20 WARN TextSocketSourceProvider:66 - The socket source should not be used for production applications! It does not support recovery.
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+
-------------------------------------------
Batch: 4
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+
-------------------------------------------
Batch: 5
-------------------------------------------
+-------------------+-------------------+-----+----------+
| start| end|value|avg(value)|
+-------------------+-------------------+-----+----------+
|2019-09-05 15:42:30|2019-09-05 15:42:40| 1| 1.0|
|2019-09-05 15:42:30|2019-09-05 15:42:40| 2| 2.0|
|2019-09-05 15:42:30|2019-09-05 15:42:40| 3| 3.0|
+-------------------+-------------------+-----+----------+
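To reason about the timing, here is a minimal sketch in plain Scala (no Spark) of how I understand append mode with a watermark. It assumes a window is emitted only once the watermark, taken as the largest event time seen so far minus the delay, has reached the window's end, and that the watermark moves only when new rows arrive. The object `WatermarkSketch`, the function `emitted`, and the timestamps are hypothetical and only for illustration. This is not Spark's implementation: it ignores late data and collapses micro-batches, whereas in Spark an updated watermark appears to take effect only in a later batch.

```scala
// Simplified model of append-mode emission with a watermark (not Spark's actual code).
object WatermarkSketch {
  val windowSec = 10L // tumbling window length, as in the query above
  val delaySec  = 10L // watermark delay, as in withWatermark above

  // Start (in seconds) of the tumbling window that contains event time t.
  def windowStart(t: Long): Long = t - (t % windowSec)

  // Feed event times in arrival order. For each event, return the starts of the
  // windows that become final (window end <= watermark) once that event is seen.
  def emitted(eventTimes: Seq[Long]): Seq[Set[Long]] = {
    var open = Set.empty[Long] // windows that still hold buffered rows
    var maxSeen = Long.MinValue
    eventTimes.map { t =>
      open += windowStart(t)
      maxSeen = math.max(maxSeen, t)
      val watermark = maxSeen - delaySec
      val done = open.filter(start => start + windowSec <= watermark)
      open --= done
      done
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical arrival times: 1, 2, 3 typed at seconds 31-33, then 4 at second 55.
    val out = emitted(Seq(31L, 32L, 33L, 55L))
    out.zipWithIndex.foreach { case (windows, i) =>
      println(s"after event ${i + 1}: finalized window starts = $windows")
    }
  }
}
```

In this model the first three events finalize nothing, because the watermark stays below the end of their window. The fourth event pushes the watermark past that window's end and finalizes it, while the fourth event's own window stays open until some later event arrives. That matches the pattern in the console output above.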