Spark: saving a JavaPairRDD with saveAsHadoopFile and AvroOutputFormat

Time: 2018-06-27 12:01:46

Tags: java apache-spark java-pair-rdd

I am trying to save a JavaPairRDD to an Avro file with the following code:

JavaPairRDD<String, Float> j = existingRDD.mapToPair().combineByKey().mapToPair();

j.saveAsHadoopFile("/hdfsPath/avro/", String.class, Float.class, AvroOutputFormat.class);

But I get a NullPointerException on the second line (the saveAsHadoopFile call):

java.lang.NullPointerException
at java.io.StringReader.<init>(StringReader.java:50)
at org.apache.avro.Schema$Parser.parse(Schema.java:1012)
at org.apache.avro.Schema.parse(Schema.java:1064)
at org.apache.avro.mapred.AvroJob.getOutputSchema(AvroJob.java:143)
at org.apache.avro.mapred.AvroOutputFormat.getRecordWriter(AvroOutputFormat.java:153)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1191)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1183)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
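The trace hints at where the null comes from: `AvroJob.getOutputSchema` passes whatever it reads from the job configuration to `Schema.parse`, which wraps it in a `StringReader`. If no output schema was ever set, that value is presumably null, and `new StringReader(null)` throws. Below is a minimal sketch of that chain using only the JDK. The `Map` stands in for the Hadoop job configuration, the class and method names are hypothetical, and the key `avro.output.schema` is assumed to be the property that Avro's mapred API reads.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

final class MissingSchemaDemo {
    // Assumed name of the property that holds the output schema JSON.
    static final String OUTPUT_SCHEMA_KEY = "avro.output.schema";

    // Hypothetical stand-in for the schema lookup: reads the schema text
    // from the configuration and wraps it in a reader, as a parser would.
    static StringReader openSchema(Map<String, String> conf) {
        String schemaJson = conf.get(OUTPUT_SCHEMA_KEY); // null when never set
        return new StringReader(schemaJson);             // NullPointerException on null
    }

    // Returns true when opening the schema fails with a NullPointerException.
    static boolean failsWithNpe(Map<String, String> conf) {
        try {
            openSchema(conf);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("schema unset, NPE: " + failsWithNpe(conf));

        // A primitive Avro schema written as JSON text.
        conf.put(OUTPUT_SCHEMA_KEY, "\"string\"");
        System.out.println("schema set, NPE: " + failsWithNpe(conf));
    }
}
```

In the real job, the usual way to populate that property is `AvroJob.setOutputSchema(jobConf, schema)` on a `JobConf` that is then handed to the save call. Note also that `org.apache.avro.mapred.AvroOutputFormat` is declared over `AvroWrapper<T>` keys and `NullWritable` values, so `String.class` and `Float.class` would likely need to change as well.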

Perhaps I am not using saveAsHadoopFile correctly, because I get no error with either of the following:

j.saveAsTextFile("/hdfsPath/avro/");
//OR
j.saveAsHadoopFile("/user/cloudera/avro/", String.class, Float.class, TextOutputFormat.class);

The PairFunction passed to mapToPair returns Tuple2<String, Float>. I also tried writing my own class that extends AvroOutputFormat and passing it to saveAsHadoopFile in place of AvroOutputFormat.class:

public class CombineOutput extends AvroOutputFormat{
  String department;
  Float avgSal;
}

which I pass as

j.saveAsHadoopFile("/hdfsPath/avro/", String.class, Float.class, CombineOutput.class);

But it gives me the same NullPointerException.
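One likely reason the subclass changes nothing: fields added to an output format class do not describe a schema to Avro, and the inherited `getRecordWriter` still looks the schema up in the job configuration. The sketch below models this with stand-in classes. `BaseFormat` and `CombineOutputLike` are hypothetical and are not the Avro classes, and the property key is an assumption.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an output format that needs a schema from the config.
class BaseFormat {
    static final String OUTPUT_SCHEMA_KEY = "avro.output.schema"; // assumed key

    // Models the inherited writer setup: the schema comes from conf, not from fields.
    String describeWriter(Map<String, String> conf) {
        String schemaJson = conf.get(OUTPUT_SCHEMA_KEY);
        StringReader reader = new StringReader(schemaJson); // NullPointerException if unset
        reader.close();
        return getClass().getSimpleName() + " writing " + schemaJson;
    }
}

// Mirrors the shape of CombineOutput: extra fields, no overridden behavior.
class CombineOutputLike extends BaseFormat {
    String department;
    Float avgSal;
}

class SubclassSchemaDemo {
    // Returns the writer description, or "NPE" if the schema lookup failed.
    static String tryOpen(BaseFormat format, Map<String, String> conf) {
        try {
            return format.describeWriter(conf);
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Both the base class and the subclass fail while the schema is unset.
        System.out.println(tryOpen(new BaseFormat(), conf));
        System.out.println(tryOpen(new CombineOutputLike(), conf));

        // Once the configuration carries a schema, the subclass works unchanged.
        conf.put(BaseFormat.OUTPUT_SCHEMA_KEY, "\"float\"");
        System.out.println(tryOpen(new CombineOutputLike(), conf));
    }
}
```

In this model the fix lives in the configuration, not in the class hierarchy, which is consistent with both attempts in the question failing at the same frame.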

I could not find any resources on using saveAsHadoopFile with AvroOutputFormat in Java. Can someone help me?

I am using Spark 1.6.0.

0 answers:

No answers