Streaming window problem in Spark Structured Streaming

Asked: 2019-05-13 10:42:47

Tags: apache-spark pyspark apache-kafka spark-streaming

I cannot get the windowing behavior I want in Spark Structured Streaming. I want to group the data I continuously receive from a Kafka source into sliding windows and count the number of records in each window. The problem is that every time new data arrives, writeStream emits the windowed DataFrame and updates the count of the current windows.

I create the windows with the following code:

# Imports needed by this snippet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_json, struct, window, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema of the topic to be consumed
jsonSchema = StructType([ StructField("State", StringType(), True) \
                        , StructField("Value", StringType(), True) \
                        , StructField("SourceTimestamp", StringType(), True) \
                        , StructField("Tag", StringType(), True)
                        ])


spark = SparkSession \
    .builder \
    .appName("StructuredStreaming") \
    .config("spark.default.parallelism", "100") \
    .getOrCreate()

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
  .option("subscribe", "SIMULATOR.SUPERMAN.TOTO") \
  .load() \
  .select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
  .select("data.*")

df = df.withColumn("time", current_timestamp())

Window = df \
    .withColumn("window", window("time", "4 seconds", "1 seconds")) \
    .groupBy("window").count() \
    .withColumn("time", current_timestamp())

# Write the results back to Kafka

query = Window.select(to_json(struct("count","window","time")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .outputMode("update") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint") \
    .start()
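For reference (this is not part of the original question): in Structured Streaming, a window's final result is emitted exactly once when the query uses the `append` output mode together with an event-time watermark via `withWatermark`. A hedged sketch of how the aggregation above might be rewritten follows; the 10-second allowed lateness is an assumed, illustrative value, and the `time` column is dropped from the output because it would otherwise be recomputed at emit time:

```python
# Sketch under assumptions: a watermark on the "time" column plus append mode
# makes Spark emit each window's count once, after the watermark (max observed
# event time minus the allowed lateness) passes the window's end.
windowed = df \
    .withWatermark("time", "10 seconds") \
    .groupBy(window("time", "4 seconds", "1 seconds")) \
    .count()

query = windowed.select(to_json(struct("count", "window")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint") \
    .outputMode("append") \
    .start()
```

Note that `append` mode delays output until the watermark guarantees a window can no longer change, so results appear later than in `update` mode, but each window is written only once.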

The windows are not sorted, and each one is re-emitted every time its count changes. How can we wait until a window has ended and stream its final count only once? Instead of this output:

{"count":21,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:18.000Z","end":"2019-05-13T09:39:22.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":37,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:21.939Z"}
{"count":18,"window":{"start":"2019-05-13T09:39:21.000Z","end":"2019-05-13T09:39:25.000Z"},"time":"2019-05-13T09:39:21.939Z"}

I would like this instead:

{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}

The expected output waits for each window to close, based on a comparison between the window's end timestamp and the current time.
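The "wait until the window closes" logic described above is exactly what a watermark implements. To make the mechanism concrete, here is a small pure-Python simulation (hypothetical helper names, not a Spark API) of sliding-window counting with append-style emission: a window's count is finalized only once the watermark, i.e. the maximum event time seen so far minus the allowed lateness, passes the window's end.

```python
# Hypothetical illustration (not Spark code): simulates how append mode with a
# watermark emits each sliding window's count exactly once, after the watermark
# passes the window's end.
from collections import Counter

def sliding_windows(ts, size, step):
    """All [start, start + size) windows with the given size/step containing ts."""
    first = (ts // step) * step - size + step
    return [(s, s + size) for s in range(max(first, 0), ts + 1, step) if s <= ts < s + size]

def run_stream(events, size=4, step=1, lateness=0):
    counts = Counter()   # per-window running counts (internal state, never emitted)
    emitted = []         # (window, final_count) pairs, each emitted exactly once
    max_ts = 0
    for ts in events:
        max_ts = max(max_ts, ts)
        for w in sliding_windows(ts, size, step):
            counts[w] += 1
        watermark = max_ts - lateness
        # Finalize every window whose end is at or before the watermark.
        closed = [w for w in counts if w[1] <= watermark]
        for w in sorted(closed):
            emitted.append((w, counts.pop(w)))
    return emitted

# Events at seconds 0, 1, 1, 2, 5: windows [0,4) and [1,5) close once
# event time 5 is seen, and each is emitted with its final count only.
out = run_stream([0, 1, 1, 2, 5])  # [((0, 4), 4), ((1, 5), 3)]
```

The trade-off is latency: a window's count only appears after the watermark has moved past its end, which is why Spark's `update` mode (used in the question) emits intermediate counts instead.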

0 Answers:

There are no answers yet.