I am trying to convert a CSV file to Avro using Spark's API, as follows:
1) Read the CSV file with newAPIHadoopFile.
2) Save it as Avro with saveAsNewAPIHadoopFile.
When saving the file as Avro, I get the error below:
org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in org.srdevin.avro.topLevelRecord null of org.srdevin.avro.topLevelRecord
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:77)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:39)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Here is the code snippet (imports included for completeness):
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

val in  = "test.csv"
val out = "csvToAvroOutput"

// Parse the target Avro schema (test.avsc, shown below)
val schema = new Schema.Parser().parse(new File("/path/to/test.avsc"))

// Read the CSV as (Text, Text) key/value pairs
val hadoopRDD = sc.newAPIHadoopFile(in, classOf[KeyValueTextInputFormat],
  classOf[Text], classOf[Text])

val job = Job.getInstance
AvroJob.setOutputKeySchema(job, schema)

// Wrap each row in an AvroKey and write the RDD out as Avro
hadoopRDD.map(row => (new AvroKey(row), NullWritable.get()))
  .saveAsNewAPIHadoopFile(
    out,
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)
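For what it's worth, hadoopRDD is an RDD[(Text, Text)], so the map above wraps a raw Hadoop pair in the AvroKey rather than a record matching the schema; I suspect that mismatch is what trips the writer. A quick probe to see what the map receives (note that KeyValueTextInputFormat splits on the first tab, so a tab-less CSV line lands entirely in the key):

// Inspect the first element the map receives; with a plain CSV file the
// whole line shows up as the key and the value is empty.
hadoopRDD.take(1).foreach { case (k, v) =>
  println(s"key='$k' value='$v'")
}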
The schema, test.avsc:
{
  "type": "record",
  "name": "topLevelRecord",
  "namespace": "org.srdevin.avro",
  "aliases": ["MyRecord"],
  "fields": [{
    "name": "name",
    "type": ["string", "null"],
    "default": "null",
    "aliases": ["name"]
  }, {
    "name": "age",
    "type": ["string", "null"],
    "default": "null",
    "aliases": ["age"]
  }]
}
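My current understanding is that AvroKeyOutputFormat serializes whatever the AvroKey wraps against this schema, so the map step presumably has to build a GenericRecord per row first. A minimal sketch of what I think that mapping would look like, assuming each line holds comma-separated name,age columns (the column layout and the schema-string workaround are my assumptions, not tested code):

import org.apache.avro.generic.{GenericData, GenericRecord}

// Sketch: build a GenericRecord per CSV line before wrapping it in AvroKey.
// org.apache.avro.Schema is not serializable in Avro 1.7.x, so ship it as a
// string and re-parse it inside each partition.
val schemaString = schema.toString
val avroRDD = hadoopRDD.mapPartitions { rows =>
  val localSchema = new Schema.Parser().parse(schemaString)
  rows.map { case (line, _) =>                      // tab-less CSV: line is the key
    val cols = line.toString.split(",", -1)         // assumes "name,age" layout
    val record: GenericRecord = new GenericData.Record(localSchema)
    record.put("name", if (cols.length > 0) cols(0) else null)
    record.put("age",  if (cols.length > 1) cols(1) else null)
    (new AvroKey[GenericRecord](record), NullWritable.get())
  }
}

avroRDD would then go into the same saveAsNewAPIHadoopFile call as above.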
Spark version used: 2.1.0, Avro version: 1.7.6.
Thanks