I want to create an RDD from a python-kafka consumer in my streaming application.
My code is:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from kafka import KafkaConsumer
conf = (SparkConf()
        .setAppName("test"))
spark = SparkSession.builder \
    .appName(" ") \
    .config(conf=conf) \
    .getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 15)
topic = 'mytopic'
consumer = KafkaConsumer('mytopic', group_id='mytopic-groupid', bootstrap_servers=['localhost:9092'], auto_offset_reset='earliest')
rdd = sc.parallelize([i for i in consumer])
print(rdd.collect())
# Start the computation
ssc.start()
# Wait for the computation to terminate
ssc.awaitTermination()
But when I try to collect and print it, nothing shows up.
When I just run:
consumer = KafkaConsumer('mytopic', group_id='mytopic-groupid', bootstrap_servers=['localhost:9092'], auto_offset_reset='earliest')
for i in consumer:
    print(i)
it prints the consumer's contents:
ConsumerRecord(topic='mytopic', partition=1, offset=53797, timestamp=1536916141939, timestamp_type=0, key=None, value=b'[22/Feb/2018 11:57:39 -0800] INFO 134.25.20.69 root - "POST /notebook/api/check_status HTTP/1.1"', checksum=None, serialized_key_size=-1, serialized_value_size=104)
ConsumerRecord(topic='mytopic', partition=1, offset=53798, timestamp=1536916141942, timestamp_type=0, key=None, value=b'[22/Feb/2018 11:57:39 -0800] INFO 134.25.20.69 user12 - "POST /notebook/api/check_status HTTP/1.1"', checksum=None, serialized_key_size=-1, serialized_value_size=104)
ConsumerRecord(topic='mytopic', partition=1, offset=53799, timestamp=1536916141943, timestamp_type=0, key=None, value=b'[22/Feb/2018 11:57:40 -0800] INFO 134.25.20.69 jhon - "POST /notebook/api/check_status HTTP/1.1"', checksum=None, serialized_key_size=-1, serialized_value_size=104)
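For context, the `value` field of each record is a raw bytes log line; decoding one of the values shown above (copied verbatim from the first record) is straightforward:

```python
# The value field of a ConsumerRecord is raw bytes; this sample is
# copied from the first record printed above.
raw = b'[22/Feb/2018 11:57:39 -0800] INFO 134.25.20.69 root - "POST /notebook/api/check_status HTTP/1.1"'
line = raw.decode('utf-8')
print(line)  # [22/Feb/2018 11:57:39 -0800] INFO 134.25.20.69 root - "POST /notebook/api/check_status HTTP/1.1"
```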
I know I can create a direct stream with KafkaUtils, but I want to understand why this doesn't work. Why can't an RDD be created from an open stream, and what am I doing wrong?
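My current hypothesis (an assumption I would like confirmed, not something I have verified in kafka-python's internals): the consumer's iterator blocks waiting for new messages and never raises `StopIteration` on its own, so the list comprehension inside `sc.parallelize` never finishes. A stand-in generator reproduces that behavior:

```python
import itertools

def endless_consumer():
    # Stand-in for KafkaConsumer's iterator: it yields forever and
    # never raises StopIteration on its own.
    n = 0
    while True:
        yield n
        n += 1

# list(endless_consumer()) would hang, just like my list comprehension
# over the consumer. Bounding the iterator lets it complete:
bounded = list(itertools.islice(endless_consumer(), 3))
print(bounded)  # [0, 1, 2]
```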