PySpark reading an Avro file with a map complex type: java.io.InvalidClassException while getting task result

Asked: 2016-05-09 12:39:39

Tags: python apache-spark avro

I would like to share a problem I ran into while using the code from the avro_inputformat.py example:
# read the reader schema and pass it to the Avro input format
schema = open('test_schema_without_map.avsc').read()
conf = {"avro.schema.input.key": schema}
avro_image_rdd = sc.newAPIHadoopFile(
    input_file,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf
)

output = avro_image_rdd.map(lambda x: x[0]).collect()
for k in output:
    print "Image filename : %s" % k["filename"]

and when running it with

spark-submit --driver-class-path /opt/spark-1.6.1/lib/spark-examples-1.6.1-hadoop2.6.0.jar read_test_avro_file_with_map.py 

I get the following error:

Job aborted due to stage failure: Exception while getting task result: java.io.InvalidClassException: scala.collection.convert.Wrappers$MutableMapWrapper; no valid constructor

The failure occurs when reading an Avro file with the following schema:

{
    "namespace": "test.avro",
    "type": "record",
    "name": "TestImage",
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "data", "type": "bytes"},
        {"name": "metadata", "type": {"type": "map", "values": "string"}}
    ]
}
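Since an Avro schema is plain JSON, one quick way to catch malformed schema text (such as a stray trailing comma after the last field) before submitting the job is to parse it locally. This is a standard-library-only sketch; the schema string here is the one from the question:

```python
import json

# The reader schema as a string (normally read from the .avsc file).
schema_str = """
{
    "namespace": "test.avro",
    "type": "record",
    "name": "TestImage",
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "data", "type": "bytes"},
        {"name": "metadata", "type": {"type": "map", "values": "string"}}
    ]
}
"""

# json.loads raises ValueError if the schema text is not valid JSON,
# e.g. if a trailing comma is left after the closing bracket of "fields".
schema = json.loads(schema_str)
field_names = [f["name"] for f in schema["fields"]]
```

This does not validate Avro semantics (the avro package's schema parser would do that), but it rules out plain JSON syntax errors cheaply.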

However, the same code works fine when the schema does not contain the 'map' Avro complex type:

{
    "namespace": "test.avro",
    "type": "record",
    "name": "TestImage",
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "data", "type": "bytes"}
    ]
}

If anyone knows where the problem lies, please share your experience...

Versions:

  • spark 1.6.1
  • avro 1.8.0

The contents of the Avro file are:

records = [
{
    "filename": "input_filename_1",
    "metadata": {"a": "1", "b": "23"},
    "data": "1,2,3,4,5,6,7,8,9,0"
},
{
    "filename": "input_filename_2",
    "metadata": {"c": "11", "d": "213"},
    "data": "10,11,12,13,14,15"
}
]
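As a standard-library-only sanity check (the actual file would be written with the avro or fastavro package, not shown here), one can verify that each record above carries exactly the fields declared in the schema, and that every metadata value is a string, as the map's "values": "string" declaration requires:

```python
# The records from the question, as plain Python dicts.
records = [
    {
        "filename": "input_filename_1",
        "metadata": {"a": "1", "b": "23"},
        "data": "1,2,3,4,5,6,7,8,9,0"
    },
    {
        "filename": "input_filename_2",
        "metadata": {"c": "11", "d": "213"},
        "data": "10,11,12,13,14,15"
    }
]

declared_fields = {"filename", "data", "metadata"}
for record in records:
    # every record must have exactly the fields the schema declares
    assert set(record) == declared_fields
    # the map type declares string values, so each metadata value must be str
    assert all(isinstance(v, str) for v in record["metadata"].values())
```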

0 Answers:

There are no answers yet.