How can I write Spark-transformed data back to a Kafka broker using pyspark?

Asked: 2016-05-19 22:04:08

Tags: python-2.7 pyspark spark-streaming kafka-producer-api kafka-python

In my pyspark application, I intend to use Spark Streaming as a means of transforming Kafka messages "on the fly". Each such message is initially received on a specific Kafka topic, needs some transformation applied (say, substituting one string for another), and the transformed version then needs to be published on a different Kafka topic. The first part (receiving Kafka messages) seems to work fine:

from pyspark import SparkConf, SparkContext

from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
## Constants
APP_NAME = "PythonStreamingDirectKafkaWordCount"
## OTHER FUNCTIONS/CLASSES

def main():
    sc = SparkContext(appName=APP_NAME)
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    ...

    ssc.start()
    ssc.awaitTermination()
if __name__ == "__main__":
    main()

What is the correct syntax for putting something (say, a string) onto a different Kafka topic? Is such a method supposed to be provided by KafkaUtils, or is it available in some other way?

2 Answers:

Answer 0 (score: 0):

Inside a handler function we can do whatever we need with each record and then send it on to another Kafka topic:

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from kafka import KafkaProducer

# The producer is created on the driver; handler() collects each batch
# back to the driver before sending, so this only suits modest volumes.
producer = KafkaProducer(bootstrap_servers='localhost:9092')

def handler(message):
    records = message.collect()
    for record in records:
        # kafka-python expects bytes; on Python 2, str(record) is bytes
        producer.send('spark.out', str(record))
        # Flushing per record is safe but slow; once per batch is faster
        producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    kvs.foreachRDD(handler)

    ssc.start()
    ssc.awaitTermination()
if __name__ == "__main__":
    main()

To run this:

spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar s.py localhost:9092 test
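(Note that the `_2.10-1.6.1` suffix of the assembly jar encodes the Scala and Spark versions; it must match your installation.)

To sanity-check the output, here is a minimal sketch using kafka-python's KafkaConsumer (assuming the same local broker and the spark.out topic written to by the handler above) that tails the transformed records:

from kafka import KafkaConsumer

# Tail the output topic and print each transformed record as it arrives
consumer = KafkaConsumer('spark.out', bootstrap_servers='localhost:9092')
for msg in consumer:
    print(msg.value)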

Answer 1 (score: 0):

The correct approach, per the design patterns for using foreachRDD in the Spark documentation: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

from kafka import KafkaProducer

def kafka_sender(messages):
    # One producer per partition, created on the executor that owns it
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    for message in messages:
        # encode() already returns bytes; no extra bytes() wrapper needed
        producer.send('alerts', message[0].encode('utf-8'))
        # Uncomment to flush per message (lower latency, lower throughput):
        # producer.flush()

    # One flush per partition amortizes the cost over the whole batch
    producer.flush()

# On your DStream
sentiment_data.foreachRDD(lambda rdd: rdd.foreachPartition(kafka_sender))
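This creates a new KafkaProducer for every partition of every batch. The linked documentation suggests reusing connections across batches; one way to approximate that in Python (a sketch with hypothetical names get_producer / kafka_sender_cached, not code from the docs) is to cache a single producer per executor process:

from kafka import KafkaProducer

_producer = None  # one cached producer per executor process

def get_producer():
    # Create the producer lazily on first use, then reuse it for later
    # partitions handled by the same worker process
    global _producer
    if _producer is None:
        _producer = KafkaProducer(bootstrap_servers='localhost:9092')
    return _producer

def kafka_sender_cached(messages):
    producer = get_producer()
    for message in messages:
        producer.send('alerts', message[0].encode('utf-8'))
    producer.flush()

sentiment_data.foreachRDD(lambda rdd: rdd.foreachPartition(kafka_sender_cached))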