Create an RDD from the ConsumerRecord objects returned by a Kafka consumer

Date: 2018-09-14 09:18:54

Tags: python-3.x apache-spark pyspark apache-kafka spark-streaming

I want to create an RDD from a kafka-python consumer in my streaming application.

My code is:

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from kafka import KafkaConsumer


conf = (SparkConf()
       .setAppName("test"))

spark = SparkSession.builder \
     .appName(" ") \
     .config(conf=conf) \
     .getOrCreate()

sc = spark.sparkContext
ssc = StreamingContext(sc, 15)

topic = 'mytopic'

consumer = KafkaConsumer('mytopic', group_id='mytopic-groupid', bootstrap_servers=['localhost:9092'], auto_offset_reset='earliest')

rdd = sc.parallelize([i for i in consumer] )
print(rdd.collect())

# Start the computation
ssc.start()

# Wait for the computation to terminate
ssc.awaitTermination()

But when I try to collect and print it, nothing is displayed.
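As far as I can tell, iterating over a KafkaConsumer blocks indefinitely waiting for new messages, so the list comprehension [i for i in consumer] never finishes and the collect() line is never reached. A minimal sketch of a bounded read that I would expect to terminate, assuming kafka-python's consumer_timeout_ms option (the 5-second value is an arbitrary choice) and decoding the byte payloads:

from pyspark.sql import SparkSession
from kafka import KafkaConsumer

spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext

# consumer_timeout_ms stops the iterator once no message arrives
# within the given window, so the comprehension can finish.
consumer = KafkaConsumer('mytopic',
                         group_id='mytopic-groupid',
                         bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)  # assumption: 5 s idle timeout

# Keep only the decoded message payloads rather than whole
# ConsumerRecord objects.
records = [rec.value.decode('utf-8') for rec in consumer]
rdd = sc.parallelize(records)
print(rdd.take(5))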

When I just iterate the consumer on its own:

consumer = KafkaConsumer('mytopic', group_id='mytopic-groupid', bootstrap_servers=['localhost:9092'], auto_offset_reset='earliest')
for i in consumer:
    print(i)

it prints the consumed records:

ConsumerRecord(topic='mytopic', partition=1, offset=53797, timestamp=1536916141939, timestamp_type=0, key=None, value=b'[22/Feb/2018 11:57:39 -0800] INFO     134.25.20.69 root - "POST /notebook/api/check_status HTTP/1.1"', checksum=None, serialized_key_size=-1, serialized_value_size=104)
ConsumerRecord(topic='mytopic', partition=1, offset=53798, timestamp=1536916141942, timestamp_type=0, key=None, value=b'[22/Feb/2018 11:57:39 -0800] INFO     134.25.20.69 user12 - "POST /notebook/api/check_status HTTP/1.1"', checksum=None, serialized_key_size=-1, serialized_value_size=104)
ConsumerRecord(topic='mytopic', partition=1, offset=53799, timestamp=1536916141943, timestamp_type=0, key=None, value=b'[22/Feb/2018 11:57:40 -0800] INFO     134.25.20.69 jhon - "POST /notebook/api/check_status HTTP/1.1"', checksum=None, serialized_key_size=-1, serialized_value_size=104)

I know I can create a direct stream with KafkaUtils, but I want to understand why this doesn't work. Can someone explain why an RDD can't be created from an open consumer like this? What am I doing wrong?
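For comparison, the KafkaUtils direct-stream route I mentioned would look roughly like this sketch (assuming Spark 2.x with the spark-streaming-kafka-0-8 package on the classpath; the broker and topic values mirror the setup above):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="test")
ssc = StreamingContext(sc, 15)

# Each 15-second batch yields an RDD of (key, value) pairs pulled
# directly from the topic's partitions.
stream = KafkaUtils.createDirectStream(
    ssc, ['mytopic'], {'metadata.broker.list': 'localhost:9092'})

stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()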

0 Answers:

No answers yet.