Question

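I am using PySpark to read streaming data from Kafka and count how often each message occurs per fixed window, based on event time. At the end of each window, I want to write the message counts for that specific window to the output. Is this possible with Spark?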

Example streaming data:

{"timestamp": "2019-01-01 00:00:00", "message": "hello"}
{"timestamp": "2019-01-01 00:00:01", "message": "world"}
{"timestamp": "2019-01-01 00:00:02", "message": "hello"}
{"timestamp": "2019-01-01 00:00:03", "message": "world"}
{"timestamp": "2019-01-01 00:00:04", "message": "hello"}
{"timestamp": "2019-01-01 00:00:05", "message": "world"}

Example output for a fixed window of 5 seconds:

+-------+-----+
|message|count|
+-------+-----+
|hello  |3    |
+-------+-----+
|world  |3    |
+-------+-----+

Example code:

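from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = (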
    SparkSession
        .builder
        .appName('Message Counter')
        .getOrCreate()
)

stream = (
    spark
        .readStream
        .format('kafka')
        .option('kafka.bootstrap.servers', 'localhost:9092')
        .option('subscribe', 'test')
        .load()
        .selectExpr('CAST(key AS STRING)', 'CAST(value AS STRING)')
)

schema = (
    StructType()
        .add('timestamp', TimestampType())
        .add('message', StringType())
)

counts = (
    stream
        .select(from_json(col('value').cast('string'), schema).alias('test'))
        .select('test.*')
        .groupby(window('timestamp', '5 seconds'), 'message')
        .count()
        .select('message', 'count')
)

# TODO: Add output here.

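The problem I am having is triggering the output at the end of each window (rather than on every message) and emitting only the data for the current window. That means I also need to reset the data (the counts) after writing the output. (Alternatively, the counts could be updated incrementally on each input message to build a count matrix, like the output of pandas crosstab, where each row is a window and each column is a message.)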
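
To make the requirement concrete, the behaviour described above can be sketched in plain Python, independent of Spark. This is only an illustration of the semantics, not Spark API: `window_start`, `windowed_counts` and `SAMPLE_EVENTS` are hypothetical names, events are assumed to arrive in event-time order, and windows are assumed to be half-open, [start, end), which is the convention Spark's `window()` uses. A window is emitted once, when an event at or past its end shows up, and its counts are dropped afterwards.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=5)   # fixed (tumbling) window length from the question
EPOCH = datetime(1970, 1, 1)    # windows are aligned to the epoch (assumption)

# The sample messages from the question as (timestamp, message) pairs.
SAMPLE_EVENTS = [
    ("2019-01-01 00:00:00", "hello"),
    ("2019-01-01 00:00:01", "world"),
    ("2019-01-01 00:00:02", "hello"),
    ("2019-01-01 00:00:03", "world"),
    ("2019-01-01 00:00:04", "hello"),
    ("2019-01-01 00:00:05", "world"),
]


def window_start(ts):
    """Return the start of the half-open window [start, start + WINDOW) containing ts."""
    return ts - (ts - EPOCH) % WINDOW


def windowed_counts(events):
    """Yield (window_start, Counter) once per window, after that window has closed.

    Assumes events arrive in event-time order. A window counts as closed as
    soon as an event with a timestamp at or past the window's end is seen.
    Windows that are still open when the input ends are not emitted.
    """
    open_windows = defaultdict(Counter)
    for raw_ts, message in events:
        ts = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S")
        # Emit and forget every window that ended at or before this event.
        for start in sorted(w for w in open_windows if w + WINDOW <= ts):
            yield start, open_windows.pop(start)
        open_windows[window_start(ts)][message] += 1


for start, counts in windowed_counts(SAMPLE_EVENTS):
    print(start, dict(counts))
```

Under these assumptions the sample data produces a single closed window starting at 00:00:00 with hello = 3 and world = 2; the message at 00:00:05 belongs to the next window, which stays open until a later event arrives. In Structured Streaming, the closest built-in mechanism to this "emit once when the window closes" behaviour is, as far as I know, a watermark combined with the append output mode; that is not shown here.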

Thanks.

PySpark Structured Streaming: trigger output at the end of each window

0 answers: