Spark Streaming Kafka Consumer (Avro) - AttributeError: 'dict' object has no attribute 'split'

Time: 2018-03-17 03:35:43

Tags: python apache-spark pyspark apache-kafka avro

I am trying to build a Spark Streaming app that consumes messages from a Kafka topic, where the messages are formatted with Avro, but I am running into some problems with the Confluent message deserializer.

Following the instructions in Spark Python Avro Kafka Deserialiser, I got the Kafka consumer to deserialize the messages correctly, but in the end I cannot get the PythonStreamingDirectKafkaWordCount example to run.

Code:

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
schema_registry_client = CachedSchemaRegistryClient(url='http://127.0.0.1:8081')
serializer = MessageSerializer(schema_registry_client)

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    sc.setLogLevel("WARN")
    kvs = KafkaUtils.createDirectStream(ssc, ["avrotest2"], {"metadata.broker.list": "localhost:9092"}, valueDecoder=serializer.decode_message)
    lines = kvs.map(lambda x: x[1])
    lines.pprint()
    counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()

Spark submit CLI:

/opt/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars spark-streaming-kafka-0-8_2.11-2.0.0.jar smartbus-stream-app_avro2.py

lines.pprint() output:

{u'temperature': 21.0, u'max_capacity': 44, u'equip_type': u'Autocarro', u'id': u'CARRIS_502', u'tire_pressure': 4.974999904632568, u'humidity': 21.0, u'equip_category': u'Carruagem_Unica', u'users_out': 2.0, u'equip_brand': u'Volvo', u'battery_status': 99.5, u'equip_fuel': u'Biodiesel', u'fuel': 39.79999923706055, u'equip_model': u'3d', u'aqi_sensor': 5.0, u'seated_capacity': 32, u'users_in': 3.0, u'location': u'38.760780, -9.166853'}

StackTrace output:

2018-03-17 03:47:06 ERROR Executor:91 - Exception in task 0.0 in stage 36.0 (TID 29)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2438, in pipeline_func
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2438, in pipeline_func
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 362, in func
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1857, in combineLocally
  File "/opt/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
  File "/root/ss_app/smartbus-stream-app_avro2.py", line 17, in <lambda>
    counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
AttributeError: 'dict' object has no attribute 'split'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
        at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I can't really find any more details about this. Can any expert shed some light on the problem?

Thanks in advance

1 Answer:

Answer 0 (score: 0)

It is throwing this error because you are trying to split a dictionary. Inside the flatMap, each "line" arrives as a dict (the deserialized Avro record), and calling split on it raises this error because a dict has no split attribute.
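Since each DStream element is already a dict, operate on its fields instead of calling split() on the whole record. A minimal sketch of the idea, run as plain Python without Spark for illustration (the field name `equip_brand` is taken from the sample record shown in the question; adapt it to your schema):

```python
from collections import Counter

# Two records standing in for what valueDecoder=serializer.decode_message
# yields, trimmed to the fields used here.
records = [
    {u'equip_brand': u'Volvo', u'equip_fuel': u'Biodiesel'},
    {u'equip_brand': u'Volvo', u'equip_fuel': u'Diesel'},
]

# In the streaming job the same idea would read:
#   counts = lines.map(lambda rec: (rec['equip_brand'], 1)) \
#                 .reduceByKey(lambda a, b: a + b)
# Plain-Python equivalent of that map/reduceByKey:
counts = Counter(rec['equip_brand'] for rec in records)
print(counts)  # Counter({'Volvo': 2})
```

If you really do want a word count, first extract a string field from the dict (e.g. `rec['location']`) and split that, rather than the record itself.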