I am trying to build a Spark Streaming app that consumes messages from a Kafka topic, where the messages are serialized with Avro, but I am running into problems with the Confluent message deserializer.
Following the instructions in Spark Python Avro Kafka Deserialiser, I got a Kafka consumer to deserialize the messages correctly, but I still cannot get the PythonStreamingDirectKafkaWordCount example to run.
Code:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
schema_registry_client = CachedSchemaRegistryClient(url='http://127.0.0.1:8081')
serializer = MessageSerializer(schema_registry_client)

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    sc.setLogLevel("WARN")

    kvs = KafkaUtils.createDirectStream(ssc, ["avrotest2"],
                                        {"metadata.broker.list": "localhost:9092"},
                                        valueDecoder=serializer.decode_message)
    lines = kvs.map(lambda x: x[1])
    lines.pprint()

    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
Spark submit CLI:
/opt/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars spark-streaming-kafka-0-8_2.11-2.0.0.jar smartbus-stream-app_avro2.py
lines.pprint() output:
{u'temperature': 21.0, u'max_capacity': 44, u'equip_type': u'Autocarro', u'id': u'CARRIS_502', u'tire_pressure': 4.974999904632568, u'humidity': 21.0, u'equip_category': u'Carruagem_Unica', u'users_out': 2.0, u'equip_brand': u'Volvo', u'battery_status': 99.5, u'equip_fuel': u'Biodiesel', u'fuel': 39.79999923706055, u'equip_model': u'3d', u'aqi_sensor': 5.0, u'seated_capacity': 32, u'users_in': 3.0, u'location': u'38.760780, -9.166853'}
Stack trace output:
2018-03-17 03:47:06 ERROR Executor:91 - Exception in task 0.0 in stage 36.0 (TID 29)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
process()
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2438, in pipeline_func
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2438, in pipeline_func
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 362, in func
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1857, in combineLocally
File "/opt/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
for k, v in iterator:
File "/root/ss_app/smartbus-stream-app_avro2.py", line 17, in <lambda>
counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
AttributeError: 'dict' object has no attribute 'split'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I can't really find more details about this. Could any expert shed some light on the issue?
Thanks in advance.
Answer 0 (score: 0)
The error is thrown because you are trying to split a dictionary. Inside the flatMap, each line arrives as a dict: the Confluent Avro deserializer has already decoded each message into a Python dict, as your lines.pprint() output shows. Calling line.split(" ") on a dict raises this error because a dict has no split attribute.
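A minimal sketch of the fix, using plain Python rather than a live Spark cluster (the sample record and field names below are taken from the lines.pprint() output in the question; the commented-out Spark lines are an illustrative assumption, not tested code):

```python
from collections import Counter

# A decoded Kafka value is already a dict, so line.split(" ") raises
# AttributeError. Two ways around it, shown on one sample record:
record = {u'equip_brand': u'Volvo', u'equip_type': u'Autocarro',
          u'id': u'CARRIS_502'}

# 1. Count on a field of the record instead of splitting it.
#    The Spark equivalent would be something like:
#    counts = lines.map(lambda rec: (rec['equip_brand'], 1)) \
#                  .reduceByKey(lambda a, b: a + b)
brand_counts = Counter([record['equip_brand']])

# 2. Or, if you really want a word count, stringify the record first:
#    lines.flatMap(lambda rec: " ".join(str(v) for v in rec.values()).split(" "))
words = " ".join(str(v) for v in record.values()).split(" ")
word_counts = Counter(words)

print(brand_counts)
print(word_counts)
```

Either way, the point is that the valueDecoder already gives you structured records, so the downstream transformations should treat them as dicts, not strings.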