Don't understand update mode and watermark in Structured Streaming

Time: 2018-08-19 08:46:19

Tags: apache-spark spark-structured-streaming

I have the following code, which outputs:

number: 1, count: 1
number: 2, count: 1
number: 3, count: 2
number: 6, count: 2
number: 7, count: 1

I think number: 6, count: 2 should not be output, because those events are below the watermark, but I don't understand why it is output.

import java.sql.Timestamp

import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

object UpdateModeWithWatermarkTest {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("UpdateModeWithWatermarkTest")
      .config("spark.sql.shuffle.partitions", 1)
      .master("local[2]").getOrCreate()

    import spark.implicits._


    val inputStream = new MemoryStream[(Timestamp, Int)](1, spark.sqlContext)
    val now = 5000L

    val aggregatedStream = inputStream.toDS().toDF("created", "number")
      .withWatermark("created", "1 second")
      .groupBy("number")
      .count()

    val query = aggregatedStream.writeStream.outputMode("update")
      .foreach(new ForeachWriter[Row] {
        override def open(partitionId: Long, epochId: Long): Boolean = true

        override def process(value: Row): Unit = {
          println(s"number: ${value.getInt(0)}, count: ${value.getLong(1)}")
        }

        override def close(errorOrNull: Throwable): Unit = {}
      }).start()

    new Thread(new Runnable() {
      override def run(): Unit = {
        inputStream.addData(
          (new Timestamp(now + 5000), 1),
          (new Timestamp(now + 5000), 2),
          (new Timestamp(now + 5000), 3),
          (new Timestamp(now + 5000), 3)
        )
        while (!query.isActive) {
          Thread.sleep(50)
        }
        Thread.sleep(10000)

        // At this point, the watermark is (now + 5000) - 1 second = 9 seconds.
        // The following two events, (new Timestamp(4000L), 6) and (new Timestamp(now), 6),
        // are below the watermark, so they should be discarded and "number: 6, count: 2" should not be output.
        inputStream.addData((new Timestamp(4000L), 6))
        inputStream.addData(
          (new Timestamp(now), 6),
          (new Timestamp(11000), 7)
        )
      }
    }).start()

    query.awaitTermination(45000)


  }

}

2 Answers:

Answer 0: (score: 1)

It's actually not that hard.

A watermark, used together with windows, allows late-arriving data to be included in already-computed results for a limited period of time. Its premise is that it tracks a point in time before which no more late events are assumed to arrive; any events that do arrive before that point are discarded.

There is an excellent example at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time, complete with a nice diagram.
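
To complement the linked section, here is a minimal sketch of a windowed aggregation with a watermark, similar to what that page describes. It is not the asker's code: the socket input, the port, the 5-second window and the class name are all assumptions for illustration.

import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WindowedWatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowedWatermarkSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Read lines such as "2018-08-19 08:46:19,3" from a local socket (e.g. `nc -lk 9999`).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Parse each line into an event-time column and a number column.
    val events = lines.as[String].map { line =>
      val Array(ts, n) = line.split(",")
      (Timestamp.valueOf(ts.trim), n.trim.toInt)
    }.toDF("created", "number")

    // Grouping by a window over the event-time column is what lets the watermark
    // take effect: state for windows ending before (max event time seen - 1 second)
    // is dropped, and rows arriving late for only those expired windows are discarded.
    val windowedCounts = events
      .withWatermark("created", "1 second")
      .groupBy(window($"created", "5 seconds"), $"number")
      .count()

    windowedCounts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}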

Answer 1: (score: 0)

I think the official explanation of the output mode of structured streaming already answers your question.


Update Mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be outputted to the sink. More information to be added in future releases.

In your question, this means that data arriving within the 1 second will update the "count" field.
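
To make that concrete, here is a minimal, self-contained sketch of update mode; the rate source, the bucket column and the class name are my own choices for illustration, not part of the question. Each trigger writes only the rows whose aggregate changed since the previous trigger, rather than the whole result table.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object UpdateModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UpdateModeSketch")
      .master("local[2]")
      .getOrCreate()

    // The built-in rate source emits rows with columns `timestamp` and `value`.
    val rate = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // Bucket the values so the same keys keep receiving new rows across triggers.
    val counts = rate.groupBy((col("value") % 3).as("bucket")).count()

    // In update mode, each micro-batch writes only the buckets whose count
    // changed since the last trigger to the console sink.
    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}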