Converting CSV (KeyValueTextInputFormat) to Avro (AvroKeyOutputFormat) with Spark saveAsNewAPIHadoopFile

Date: 2018-05-15 10:31:09

Tags: apache-spark

I am trying to convert a CSV file to Avro using Spark's API, as follows:

1) Read the CSV file using newAPIHadoopFile with KeyValueTextInputFormat (see the separator note just below this list)

2) Save it as Avro using saveAsNewAPIHadoopFile.
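
(A note on step 1: KeyValueTextInputFormat breaks each line at the first occurrence of a separator, which is a tab by default, so for a comma-separated file the separator presumably has to be configured first. A minimal sketch:)

import org.apache.hadoop.conf.Configuration

// KeyValueTextInputFormat splits each line at the first separator byte
// (tab by default); point it at the comma for CSV input
val conf = new Configuration()
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",")

// then pass the configuration when creating the RDD:
// sc.newAPIHadoopFile(in, classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text], conf)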

While saving the file as Avro, I get the error below:

org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in org.srdevin.avro.topLevelRecord null of org.srdevin.avro.topLevelRecord
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308)
    at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:77)
    at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:39)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Here is the code snippet:

import java.io.File

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

val in = "test.csv"
val out = "csvToAvroOutput"
val schema = new Schema.Parser().parse(new File("/path/to/test.avsc"))

val hadoopRDD = sc.newAPIHadoopFile(in, classOf[KeyValueTextInputFormat]
  , classOf[Text], classOf[Text])

val job = Job.getInstance
AvroJob.setOutputKeySchema(job, schema)

// row here is the raw (Text, Text) pair produced by KeyValueTextInputFormat
hadoopRDD.map(row => (new AvroKey(row), NullWritable.get()))
  .saveAsNewAPIHadoopFile(
    out,
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)
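
The NullPointerException is most likely because the map wraps the raw (Text, Text) tuple in AvroKey, so AvroKeyRecordWriter ends up trying to serialize a Scala tuple against the topLevelRecord schema. A minimal sketch of one possible fix, assuming the key column of each line maps to "name" and the value column to "age":

import org.apache.avro.generic.GenericData

// Schema is not serializable in Avro 1.7.x, so ship its JSON text to the executors
val schemaString = schema.toString

val records = hadoopRDD.mapPartitions { rows =>
  val rowSchema = new Schema.Parser().parse(schemaString) // parse once per partition
  rows.map { case (k, v) =>
    // build a GenericRecord that actually matches test.avsc
    val record = new GenericData.Record(rowSchema)
    record.put("name", k.toString) // assumes the key column is "name"
    record.put("age", v.toString)  // assumes the value column is "age"
    (new AvroKey[GenericRecord](record), NullWritable.get())
  }
}

records.saveAsNewAPIHadoopFile(
  out,
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[GenericRecord]],
  job.getConfiguration)

With this, every datum handed to AvroKeyOutputFormat is a GenericRecord conforming to the schema set via AvroJob.setOutputKeySchema, which is what the writer expects.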

Schema: test.avsc

{
  "type" : "record",
  "name" : "topLevelRecord",
  "namespace" : "org.srdevin.avro",
  "aliases": ["MyRecord"],
  "fields" : [ {
    "name" : "name",
    "type" : [ "string", "null"] ,
    "default": "null",
    "aliases": ["name"]
  }, {
    "name" : "age",
    "type" : [ "string" , "null" ],
    "default": "null",
    "aliases": ["age"]
  }]
}
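
As an aside on the schema itself: per the Avro spec, a union field's default is interpreted against the first branch of the union, so ["string", "null"] with "default": "null" makes the default the literal string "null", not a null value. If nullable fields with a null default are intended, the usual form would be:

{
  "name": "name",
  "type": ["null", "string"],
  "default": null
}

(This is unrelated to the NullPointerException above, but worth noting.)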

Spark version used: 2.1.0, Avro version: 1.7.6

Thanks

0 Answers

No answers yet.