Cannot read a Kafka topic from Spark Structured Streaming because of an offset mismatch

Date: 2019-10-21 21:16:53

Tags: apache-spark apache-kafka spark-structured-streaming

I am trying to read records from a Kafka topic with the following snippet:

    ds_raw = (spark
              .readStream
              .format("kafka")
              .option("subscribe", GLOBAL_ARGS.kafka_topic)
              .option("kafka.bootstrap.servers", GLOBAL_ARGS.kafka_brokers)
              .option("failOnDataLoss", False)
              .option("startingOffsets", "latest")
              .load())

and write the output to a Kafka sink topic:

    sink_topic = "{0}_sink".format(GLOBAL_ARGS.kafka_topic)

    query = (df_output
            .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
            .writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", GLOBAL_ARGS.kafka_brokers)
            .option("topic", sink_topic)
            .outputMode("update") 
            .trigger(processingTime=GLOBAL_ARGS.window_size)
            .option("checkpointLocation", "/checkpoint_tmp")
            .start())
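Since the query checkpoints to `/checkpoint_tmp`, the offsets Spark has committed can be inspected directly in the checkpoint's `offsets/` directory, which holds one file per micro-batch. A minimal parsing sketch, assuming the usual offset-log layout (a version line, a batch-metadata JSON line, then one JSON line per source); the sample contents below are illustrative, not taken from the actual checkpoint:

```python
import json

# Illustrative contents of a checkpoint offsets file, e.g. /checkpoint_tmp/offsets/42.
# Assumed layout: version line, batch-metadata line, then one line per source.
sample = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1571667413000,"conf":{}}
{"topic_1":{"0":0,"1":10552,"2":0,"3":0,"4":0,"5":0}}"""

def parse_offset_log(text):
    """Return (version, per-source offset maps) from one offsets file."""
    lines = text.splitlines()
    version = lines[0]
    _metadata = json.loads(lines[1])          # batch timestamp, watermark, conf
    sources = [json.loads(line) for line in lines[2:]]
    return version, sources

version, sources = parse_offset_log(sample)
print(version)                     # v1
print(sources[0]["topic_1"]["1"])  # 10552
```

Comparing the most recent file in `offsets/` against the broker's current offsets shows whether the checkpoint is pointing at data that still exists on the topic.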

I am getting the following error:

org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted.
=== Streaming Query ===
Identifier: [id = 811b7dff-aa61-4d3d-9784-394707180fb2, runId = 2fa6a3e4-1dc0-403a-bada-382237b61bd1]
Current Committed Offsets: {KafkaSource[Subscribe[topic_1]]: {"topic_1":{"2":0,"5":0,"4":0,"1":10552,"3":0,"0":0}}}
Current Available Offsets: {KafkaSource[Subscribe[topic_1]]: {"topic_1":{"2":0,"5":0,"4":0,"1":10565,"3":0,"0":0}}}

It looks like the committed offsets and the available offsets have different values (partition 1: 10552 committed vs. 10565 available).
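A gap between "Current Committed Offsets" and "Current Available Offsets" is not itself an error: it only means new records have arrived that the last completed batch has not processed yet. A small sketch of the per-partition lag implied by the two maps in the error message above:

```python
# Offset maps copied from the error message above.
committed = {"2": 0, "5": 0, "4": 0, "1": 10552, "3": 0, "0": 0}
available = {"2": 0, "5": 0, "4": 0, "1": 10565, "3": 0, "0": 0}

def per_partition_lag(committed, available):
    """Records available on the broker but not yet processed, per partition."""
    return {p: available[p] - committed.get(p, 0) for p in available}

lag = per_partition_lag(committed, available)
print({p: n for p, n in lag.items() if n > 0})  # {'1': 13}
```

Here only partition 1 is behind, by 13 records; the actual failure is whatever aborted the writing job, which this "Writing job aborted" summary truncates, so the full stack trace in the executor logs is the place to look.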

0 Answers:

No answers yet