I'm trying to decode data from a Kafka topic that was encoded with Avro, using Spark Streaming (PySpark), and I'm getting the following error:
2019-05-19 14:00:46 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/robert.dempsey/Applications/spark-2.3.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 238, in main
    eval_type = read_int(infile)
  File "/Users/robert.dempsey/Applications/spark-2.3.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 692, in read_int
    raise EOFError
EOFError
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:332)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:471)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:454)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:286)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
    at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:440)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:249)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:172)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/robert.dempsey/Applications/spark-2.3.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 240, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/Users/robert.dempsey/Applications/spark-2.3.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 60, in read_command
    command = serializer._read_with_length(file)
  File "/Users/robert.dempsey/Applications/spark-2.3.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 171, in _read_with_length
    return self.loads(obj)
  File "/Users/robert.dempsey/Applications/spark-2.3.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 566, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/anaconda3/envs/pyspark_env/lib/python3.7/site-packages/avro/schema.py", line 173, in __setitem__
    % (key, value, self))
Exception: Attempting to map key 'name' to value <avro.schema.Field object at 0x1112bbb70> in ImmutableDict {}
All of the solutions I've found so far talk about using Confluent's Python library, which requires a Schema Registry, and I won't have one in my environment. That said, I did try that library, but since the messages weren't encoded with it, it didn't work.
Also, this exact decoding code works outside of a Spark context and I'm able to decode the messages, so I'm not sure whether this is a Spark problem or something else. From the Caused by block above, it looks like the failure happens while the worker unpickles the decoder function, which captures the parsed schema.
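For reference, the standalone check I'm describing is roughly the following sketch (the sample record is illustrative): it encodes a record with plain Avro binary encoding and decodes it back with the same schema, no Kafka or Spark involved.

# Minimal standalone round trip (sketch; sample record is illustrative):
# encode with plain Avro binary encoding, decode with the same schema.
import io
import avro.schema
from avro.io import BinaryEncoder, BinaryDecoder, DatumWriter, DatumReader

schema = avro.schema.Parse(open('avro/schemas/user.avsc').read())

# Serialize a sample record to raw Avro bytes (no Confluent wire format).
buf = io.BytesIO()
DatumWriter(schema).write(
    {"name": "Alyssa", "favorite_number": 256, "favorite_color": None},
    BinaryEncoder(buf))

# Read it back -- this works fine outside of Spark.
decoded = DatumReader(schema).read(BinaryDecoder(io.BytesIO(buf.getvalue())))
print(decoded)  # {'name': 'Alyssa', 'favorite_number': 256, 'favorite_color': None}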
My Python code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import avro.schema
from avro.io import BinaryDecoder, DatumReader
import io
import os

# Parse the Avro schema once, on the driver.
cwd = os.getcwd()
schema_path = os.path.join(cwd, 'avro/schemas/user.avsc')
schema = avro.schema.Parse(open(schema_path).read())

# Decode one raw Kafka message value (plain Avro binary, no Confluent wire format).
def decoder(msg):
    bytes_reader = io.BytesIO(msg)
    decoder = BinaryDecoder(bytes_reader)
    reader = DatumReader(schema)
    user = reader.read(decoder)
    return user

sc = SparkContext(appName="SparkStreamingTest")
ssc = StreamingContext(sc, 5)  # 5-second batches

broker = "localhost:9092"
kafka_params = {
    "bootstrap.servers": broker,
    "auto.offset.reset": "smallest",
    "group.id": "test.group"
}

kafka_stream = KafkaUtils.createDirectStream(ssc,
                                             topics=['test.avro'],
                                             kafkaParams=kafka_params,
                                             valueDecoder=decoder)
messages = kafka_stream.map(lambda x: x[1])
messages.pprint()
ssc.start()
ssc.awaitTermination()
I'm launching this Python file with a shell script:
#!/usr/bin/env bash
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 ./avro/avro_stream.py
My Avro schema, saved in a file:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
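For context, the message values on the topic are plain Avro binary (no Confluent wire format). A sketch of the producer side, assuming the kafka-python client (the client choice and record contents here are illustrative):

# Producer-side sketch (illustrative; assumes the kafka-python client).
# Each message value is one record serialized with plain Avro binary encoding.
import io
import avro.schema
from avro.io import BinaryEncoder, DatumWriter
from kafka import KafkaProducer

schema = avro.schema.Parse(open('avro/schemas/user.avsc').read())
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Serialize a record and send the raw bytes as the message value.
buf = io.BytesIO()
DatumWriter(schema).write(
    {"name": "Ben", "favorite_number": 7, "favorite_color": "red"},
    BinaryEncoder(buf))
producer.send('test.avro', buf.getvalue())
producer.flush()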
My goal is to be able to decode the Avro-encoded messages so that I can then convert them to JSON and run transformations over the data.
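That follow-on step would be something like this sketch: since the decoded values are plain Python dicts, converting to JSON is just a map over the stream.

# Sketch of the intended next step: the decoded values are plain dicts,
# so converting to JSON is a simple map over the DStream.
import json

json_messages = messages.map(lambda user: json.dumps(user))
json_messages.pprint()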
Thanks for any help!