How can I know which events in a streaming batch were late?

Asked: 2017-09-03 11:15:29

Tags: apache-spark spark-structured-streaming

I use Apache Spark 2.2.0.

I'd like to know how many events in a streaming batch were late in Structured Streaming. Is there a way to learn the number, or (even better) exactly which events were late?
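My understanding of the rule Spark applies, as a pure-Scala sketch of the semantics (not Spark API; `Event` and `lateEvents` are names I made up): the watermark is the maximum event time seen so far minus the delay threshold, and a row whose event time falls behind the watermark is considered late (and is dropped in Append mode).

```scala
// Illustrative model of watermark-based lateness, not Spark API.
case class Event(eventTimeSec: Long, device: String, level: Int)

// watermark = max event time seen so far - delay threshold;
// events with an event time behind the watermark count as late.
def lateEvents(batch: Seq[Event], maxSeenSec: Long, delaySec: Long): Seq[Event] = {
  val watermarkSec = maxSeenSec - delaySec
  batch.filter(_.eventTimeSec < watermarkSec)
}

val batch = Seq(Event(100, "a", 1), Event(95, "b", 2), Event(80, "c", 3))
// With max event time 100 and a 10-second delay the watermark is 90,
// so only the event at 80 is late:
println(lateEvents(batch, maxSeenSec = 100, delaySec = 10)) // → List(Event(80,c,3))
```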

I use the following example to explore watermarks and late events.

import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'symbol and $"..." column syntax

val valuesPerDevice = spark.
  readStream.
  format("kafka").
  option("subscribe", "topic1").
  option("kafka.bootstrap.servers", "localhost:9092").
  load.
  withColumn("tokens", split('value cast "string", ",")). // <-- Kafka's value column is binary, so cast it first
  withColumn("seconds", 'tokens(0) cast "long").
  withColumn("event_time", to_timestamp(from_unixtime('seconds))). // <-- Event time has to be a timestamp
  withColumn("device", 'tokens(1)).
  withColumn("level", 'tokens(2) cast "int").
  withWatermark(eventTime = "event_time", delayThreshold = "10 seconds"). // <-- define watermark (before groupBy!)
  groupBy($"event_time"). // <-- use event_time for grouping
  agg(collect_list("level") as "levels", collect_list("device") as "devices").
  withColumn("event_time", to_timestamp($"event_time")) // <-- no-op: event_time is already a timestamp

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = valuesPerDevice.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime(5.seconds)).
  outputMode(OutputMode.Append). // <-- Append output mode
  start
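So far the only thing I have found to observe is `sq.lastProgress`, whose `eventTime` map reports per-batch event-time statistics (keys such as "min", "max", "avg" and "watermark", if I read the `StreamingQueryProgress` output correctly), but that gives aggregates, not the dropped rows themselves. The watermark itself only moves forward: after each micro-batch it becomes the maximum of its previous value and the batch's max event time minus the delay. A pure-Scala sketch of that progression (`advanceWatermark` is illustrative, not Spark API):

```scala
// Illustrative model of how the watermark advances across micro-batches;
// it is monotone, so a batch with older data never moves it backwards.
def advanceWatermark(current: Long, batchMaxSec: Long, delaySec: Long): Long =
  math.max(current, batchMaxSec - delaySec)

val batchMaxes = Seq(100L, 120L, 110L) // max event time per micro-batch
val watermarks =
  batchMaxes.scanLeft(0L)((wm, mx) => advanceWatermark(wm, mx, 10L)).tail
println(watermarks) // → List(90, 110, 110): the third batch cannot lower it
```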

0 answers:

No answers yet.