Why is the last batch not aggregated in Spark Structured Streaming?

Time: 2019-09-05 11:03:13

Tags: apache-spark spark-streaming

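Spark Structured Streaming does not aggregate the last batch of input. The aggregation is delayed indefinitely, and the only way to make the results appear is to send further input.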

To test this, I run netcat in a terminal:

nc -lk 9999
1
2
3

Then I run the Scala snippet shown below in spark-shell.

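Once the stream is ready, I type 1, 2, 3 and wait for a while. No aggregates appear. Some time later, when I type 4, the aggregates for the first three inputs appear. The aggregate for input 4 never appears.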

This behavior was observed in Spark 2.4.4, 2.4.0, and 2.3.0.

import org.apache.spark.sql.streaming._
import spark.implicits._
// window, col and avg come from org.apache.spark.sql.functions._,
// which spark-shell imports automatically.

// Read lines from the netcat socket; includeTimestamp adds a "timestamp" column.
val df = spark.readStream.
  format("socket").
  option("host", "localhost").
  option("port", 9999).
  option("includeTimestamp", true).
  load()

// Average per value over 10-second tumbling windows, with a 10-second watermark.
val wordCounts = df.
  withWatermark("timestamp", "10 seconds").
  groupBy(window($"timestamp", "10 seconds", "10 seconds"), col("value")).
  agg(avg("value")).
  select("window.start", "window.end", "*").
  drop("window")

// df.writeStream.format("console").outputMode("append").start()
wordCounts.writeStream.
  format("console").
  outputMode("append").
  start().
  awaitTermination()

scala> :load /mqtt4/src/main/scala/SparkStreamingExample.scala
Loading /mqtt4/src/main/scala/SparkStreamingExample.scala...
import org.apache.spark.sql.streaming._
import spark.implicits._
2019-09-05 15:42:19 WARN  TextSocketSourceProvider:66 - The socket source should not be used for production applications! It does not support recovery.
df: org.apache.spark.sql.DataFrame = [value: string, timestamp: timestamp]
wordCounts: org.apache.spark.sql.DataFrame = [start: timestamp, end: timestamp ... 2 more fields]
2019-09-05 15:42:20 WARN  TextSocketSourceProvider:66 - The socket source should not be used for production applications! It does not support recovery.
-------------------------------------------                                     
Batch: 0
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+

-------------------------------------------                                     
Batch: 1
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+

-------------------------------------------                                     
Batch: 2
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+

-------------------------------------------                                     
Batch: 3
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+

-------------------------------------------                                     
Batch: 4
-------------------------------------------
+-----+---+-----+----------+
|start|end|value|avg(value)|
+-----+---+-----+----------+
+-----+---+-----+----------+

-------------------------------------------                                     
Batch: 5
-------------------------------------------
+-------------------+-------------------+-----+----------+
|              start|                end|value|avg(value)|
+-------------------+-------------------+-----+----------+
|2019-09-05 15:42:30|2019-09-05 15:42:40|    1|       1.0|
|2019-09-05 15:42:30|2019-09-05 15:42:40|    2|       2.0|
|2019-09-05 15:42:30|2019-09-05 15:42:40|    3|       3.0|
+-------------------+-------------------+-----+----------+
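
The output above is consistent with how append mode and watermarks are generally understood to interact: a window is emitted only once the watermark (the maximum event time seen so far minus the 10-second delay) has reached the window's end, and the watermark only moves forward when newer data arrives. The sketch below is a simplified plain-Scala model of that rule, not Spark's actual implementation. The names `WatermarkSketch`, `windowStart` and `simulate` are hypothetical, and the event times (in seconds) are made up for illustration.

```scala
object WatermarkSketch {
  // Simplified model: all times are event-time seconds, not real timestamps.
  val windowSize = 10L // tumbling window length, as in window(..., "10 seconds", "10 seconds")
  val delay = 10L      // watermark delay, as in withWatermark("timestamp", "10 seconds")

  /** Start of the tumbling window that contains event time t. */
  def windowStart(t: Long): Long = t - (t % windowSize)

  /** Feeds event times batch by batch and returns, per batch, the starts of the windows emitted. */
  def simulate(batches: Seq[Seq[Long]]): Seq[Seq[Long]] = {
    var watermark = 0L
    var maxEventTime = 0L
    var open = Set.empty[Long] // windows that hold data but have not been emitted yet
    batches.map { batch =>
      // The watermark used by a batch is derived from data seen in earlier batches.
      watermark = math.max(watermark, maxEventTime - delay)
      open ++= batch.map(windowStart)
      if (batch.nonEmpty) maxEventTime = math.max(maxEventTime, batch.max)
      // Append mode: emit a window only once its end is at or below the watermark.
      val (closed, stillOpen) = open.partition(start => start + windowSize <= watermark)
      open = stillOpen
      closed.toList.sorted
    }
  }

  def main(args: Array[String]): Unit = {
    // Inputs 1, 2, 3 arrive at seconds 32, 34, 36; input 4 arrives at second 55.
    // The two empty batches stand for triggers that carry no new data.
    val batches = Seq(Seq(32L), Seq(34L), Seq(36L), Seq(55L), Seq.empty[Long], Seq.empty[Long])
    simulate(batches).zipWithIndex.foreach { case (emitted, i) =>
      println(s"batch ${i + 1} emits windows starting at: ${emitted.mkString(", ")}")
    }
  }
}
```

In this model the window starting at second 30 is emitted only in the batch after the event at second 55, because that later event is what pushes the watermark past second 40. The window starting at second 50 is never emitted, since nothing newer arrives to advance the watermark to second 60. That would match the observation that input 4 releases the earlier aggregates while its own aggregate never shows up.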

0 answers:

No answers