I am able to read new messages pushed to the Kafka stream, but I cannot read the old ones. When I start consuming a new topic, how do I read all of the old messages that were already pushed to the stream?
from pyspark.streaming.kafka import KafkaUtils

kafkaStream = KafkaUtils.createStream(
    ssc,
    'zookeeper1.sys.net:2181,zookeeper2.sys.net:2181,zookeeper3.sys.net:2181,zookeeper4.sys.net:2181,zookeeper5.sys.net:2181,zookeeper6.sys.net:2181',
    'spark-streaming24',
    {'TOPIC': 3},
    keyDecoder=lambda x: x,
    valueDecoder=lambda x: x)
Answer 0 (score: 0)
As far as I know this is not possible with the receiver-based approach, but createDirectStream has an optional fromOffsets parameter, which takes a dict of TopicAndPartition -> offset:
from pyspark.streaming.kafka import TopicAndPartition
fromOffsets = {
TopicAndPartition("topic", i): long(0) for i in range(n_partitions)
}
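For example, this dict can be passed to createDirectStream so the direct stream begins at offset 0 for every partition. The sketch below is only illustrative: the broker list, topic name, and partition count are placeholders, not values from the question.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

sc = SparkContext(appName="read-from-beginning")
ssc = StreamingContext(sc, batchDuration=10)

n_partitions = 3  # placeholder: the topic's actual partition count
fromOffsets = {
    TopicAndPartition("topic", i): long(0) for i in range(n_partitions)
}

# The direct stream talks to the Kafka brokers directly (no ZooKeeper quorum)
# and starts each partition at the offset supplied in fromOffsets.
directStream = KafkaUtils.createDirectStream(
    ssc,
    ["topic"],
    {"metadata.broker.list": "kafka1.sys.net:9092"},  # placeholder broker list
    fromOffsets=fromOffsets)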
Structured Streaming has an equivalent startingOffsets option:
df = spark \
  .readStream \
  .format("kafka") \
  .option("startingOffsets", "earliest") \
  ...
  .load()
Or using JSON with explicit per-partition offsets:
df = spark \
  .readStream \
  .format("kafka") \
  .option("startingOffsets", """{"topic":{"0":0,"1":0}}""") \
  ...
  .load()
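Filling in the elided options, a complete streaming read might look like the sketch below. The broker address and topic name are placeholders for illustration; note that for streaming queries startingOffsets defaults to "latest", so it has to be set explicitly to pick up old messages.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-beginning").getOrCreate()

# Read the topic from the first available offset instead of the default "latest".
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka1.sys.net:9092") \
    .option("subscribe", "topic") \
    .option("startingOffsets", "earliest") \
    .load()

# Kafka values arrive as binary; cast to string before writing to the console sink.
query = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()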