How to create an InputDStream with offsets in PySpark (using KafkaUtils.createDirectStream)?

Time: 2015-10-21 20:32:58

Tags: apache-spark apache-kafka pyspark

How do I use {{1}} with offsets for a specific topic in PySpark?

2 answers:

Answer 0 (score: 8)

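If you want to create an RDD from records in a Kafka topic, use a static set of offset ranges, each one a (topic, partition, start, until) tuple.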

First, import what you need:

from pyspark.streaming.kafka import KafkaUtils, OffsetRange

Then create a dictionary with the Kafka brokers:

kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
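The broker string is just a comma-separated list of host:port pairs. As a minimal sketch, assuming your hosts all share one port, it can be built from a list instead of typed by hand. The helper name `broker_list` is made up for this example and is not part of PySpark.

```python
def broker_list(hosts, port=9092):
    """Join host names into the comma-separated host:port string Kafka expects.

    Hypothetical helper for illustration; assumes every broker uses the same port.
    """
    return ",".join("%s:%d" % (host, port) for host in hosts)


# Same value as the hand-written dictionary, built from a list of hosts.
kafka_params = {"metadata.broker.list": broker_list(["host1", "host2", "host3"])}
print(kafka_params["metadata.broker.list"])  # host1:9092,host2:9092,host3:9092
```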

Then create the offset range object:

start = 0
until = 10
partition = 0
topic = 'topic'
offset = OffsetRange(topic, partition, start, until)
offsets = [offset]

Finally, create the RDD:

kafkaRDD = KafkaUtils.createRDD(sc, kafkaParams, offsets)
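One thing that is easy to miss: an offset range is half-open, so the start offset is included and the until offset is not. The following Spark-free sketch illustrates that arithmetic. `Range` is a hypothetical namedtuple standing in for `OffsetRange` so the example runs without a Spark installation; the second range (partition 1, offsets 5 to 8) is invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.streaming.kafka.OffsetRange, used only so
# this sketch runs without Spark. Field names mirror the variables in the answer.
Range = namedtuple("Range", ["topic", "partition", "start", "until"])


def record_count(ranges):
    """Total number of records covered, treating `until` as exclusive."""
    return sum(r.until - r.start for r in ranges)


def offsets_covered(r):
    """List the individual offsets a single range would read."""
    return list(range(r.start, r.until))


ranges = [Range("topic", 0, 0, 10), Range("topic", 1, 5, 8)]
total = record_count(ranges)
first = offsets_covered(ranges[0])
print(total)      # 13
print(first[-1])  # 9 (offset 10 itself is not read)
```

So with start = 0 and until = 10, the RDD should contain the ten records at offsets 0 through 9.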

To create a stream with offsets, do the following:

from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext

Then create the Spark Streaming context from your SparkContext:
ssc = StreamingContext(sc, 1)

Next, set up all the parameters:

kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
start = 0
partition = 0
topic = 'topic'

Then create the fromOffsets dictionary:

topicPartition = TopicAndPartition(topic, partition)
fromOffset = {topicPartition: long(start)}
# Note: on Python 2 the offset must be cast from int to long.
# On Python 3, long no longer exists; use int(start) instead.

Finally, create the stream:

directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams,
                                                  fromOffsets=fromOffset)
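The example above covers a single partition. For a topic with several partitions, the same dictionary needs one entry per partition. Here is a rough sketch of building it in a loop. The helper `build_from_offsets` and its `key_factory` parameter are hypothetical names, and plain (topic, partition) tuples are used as keys by default so the sketch runs without Spark; the starting offsets 0 and 42 are made up.

```python
def build_from_offsets(topic, partition_starts, key_factory=lambda t, p: (t, p)):
    """Map each partition of `topic` to its starting offset.

    partition_starts: dict of partition number -> starting offset.
    key_factory: builds the dictionary key; defaults to a plain tuple so the
    sketch runs without Spark (an assumption made for this example).
    """
    result = {}
    for partition, start in partition_starts.items():
        if start < 0:
            raise ValueError("offset must be non-negative, got %d" % start)
        result[key_factory(topic, partition)] = int(start)
    return result


from_offsets = build_from_offsets("topic", {0: 0, 1: 42})
print(from_offsets)  # {('topic', 0): 0, ('topic', 1): 42}
```

With PySpark available, passing `key_factory=TopicAndPartition` should produce a dictionary of the shape that `fromOffsets` expects, though that is untested here.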

Answer 1 (score: 3)

You can do it like this:

from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

topic = "test"
brokers = "localhost:9092"
partition = 0
start = 0
topicpartition = TopicAndPartition(topic, partition)
fromoffset = {topicpartition: int(start)}
# spark_streaming is assumed to be an existing StreamingContext
kafkaDStream = KafkaUtils.createDirectStream(
    spark_streaming, [topic],
    {"metadata.broker.list": brokers}, fromOffsets=fromoffset)

Note: Spark 2.2.0, Python 3.6
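The Python version matters for the offset cast: `long` is a builtin only on Python 2, while on Python 3 `int` is unbounded and `long` is gone. If the same script has to run on both, one possible approach is to pick the type once at startup. This is a sketch, not something either answer does; the name `offset_type` is made up.

```python
import sys

# Choose the integer type for Kafka offsets based on the interpreter version.
if sys.version_info[0] >= 3:
    offset_type = int
else:
    offset_type = long  # noqa: F821 (this branch only runs on Python 2)

start = 0
value = offset_type(start)
print(value)  # 0
```

The offset dictionary can then use `offset_type(start)` as its value on either interpreter.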