Unable to read old Kafka stream using createStream

Date: 2018-01-19 16:43:18

Tags: python-2.7 apache-spark pyspark apache-kafka

I am able to read new messages that are pushed to the Kafka stream, but I cannot read the old ones.
When I start consuming from a new topic, how can I read all of the old messages that were already pushed to the stream?

from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createStream(
  ssc,
  'zookeeper1.sys.net:2181,zookeeper2.sys.net:2181,zookeeper3.sys.net:2181,'
  'zookeeper4.sys.net:2181,zookeeper5.sys.net:2181,zookeeper6.sys.net:2181',
  'spark-streaming24',
  {'TOPIC': 3},
  keyDecoder=lambda x: x,
  valueDecoder=lambda x: x
)

1 Answer:

Answer 0 (score: 0):

As far as I know, this is not possible with the receiver-based approach (createStream), but createDirectStream has an optional fromOffsets parameter, which takes a dictionary mapping TopicAndPartition -> offset:

from pyspark.streaming.kafka import TopicAndPartition

fromOffsets = {
  TopicAndPartition("topic", i): long(0) for i in range(n_partitions)
}
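
For illustration, a minimal sketch of passing that dictionary to createDirectStream; the broker addresses and the topic name below are placeholders, not values taken from the question:

from pyspark.streaming.kafka import KafkaUtils

# Direct stream starts reading each partition at the offsets given above (0 = beginning).
# "metadata.broker.list" and the topic name are placeholder values.
directKafkaStream = KafkaUtils.createDirectStream(
  ssc,
  ["topic"],
  {"metadata.broker.list": "broker1.sys.net:9092,broker2.sys.net:9092"},
  fromOffsets=fromOffsets
)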

Structured Streaming has an equivalent startingOffsets option:

df = spark \
  .readStream \
  .format("kafka") \
  .option("startingOffsets", "earliest") \
  ...
  .load()

or, using JSON:

df = spark \
  .readStream \
  .format("kafka") \
  .option("startingOffsets", """{"topic":{"0":0,"1":0}}""") \
  ...
  .load()
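
For reference, a self-contained sketch of the Structured Streaming variant, again with placeholder broker and topic names; the streaming query only starts once a sink is attached via writeStream:

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "broker1.sys.net:9092") \
  .option("subscribe", "topic") \
  .option("startingOffsets", "earliest") \
  .load()

# Cast the raw key/value bytes to strings and print each micro-batch to the console
query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
  .writeStream \
  .format("console") \
  .start()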