How do I read an Avro file in Spark using newAPIHadoopFile?

Asked: 2016-09-13 13:05:27

Tags: java hadoop apache-spark

I am trying to read an Avro file in a Spark job. My Spark version is 1.6.0 (spark-core_2.10-1.6.0-cdh5.7.1).

Here is my Java code:

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadAvro"));
JavaPairRDD <NullWritable, Text> lines = sc.newAPIHadoopFile(args[0],AvroKeyValueInputFormat.class,AvroKey.class,AvroValue.class,new Configuration());

But I get a compile-time error:

  

The method newAPIHadoopFile(String, Class, Class, Class, Configuration) in the type JavaSparkContext is not applicable for the arguments (String, Class, Class, Class, Configuration)

So what is the correct way to use JavaSparkContext.newAPIHadoopFile() in Java?

1 Answer:

Answer 0: (score: 3)

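import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class Utils {
  // Reads Avro records as keys (values are NullWritable), then maps each record's
  // "key" and "value" fields to a pair; the casts to String and T are unchecked.
  public static <T> JavaPairRDD<String, T> loadAvroFile(JavaSparkContext sc, String avroPath) {
    JavaPairRDD<AvroKey, NullWritable> records = sc.newAPIHadoopFile(avroPath, AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, sc.hadoopConfiguration());
    return records.keys()
        .map(x -> (GenericRecord) x.datum())
        .mapToPair(pair -> new Tuple2<>((String) pair.get("key"), (T)pair.get("value")));
  }
}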

Use the utility like this:

JavaPairRDD<String, YourAvroClassName> records = Utils.<YourAvroClassName>loadAvroFile(sc, inputDir);
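
For context, newAPIHadoopFile is declared with three type parameters, <K, V, F extends InputFormat<K, V>>, and returns a JavaPairRDD<K, V>. The key class, the value class, the input format, and the type of the variable receiving the result therefore all have to agree. In the question's call, the variable is declared as JavaPairRDD<NullWritable, Text> while the class arguments are AvroKey.class and AvroValue.class, which is one visible mismatch. The sketch below only mimics the shape of that signature with stand-in types; InputFormat, PairRDD, RecordKey, NoValue, RecordKeyInputFormat, and the path "in.avro" are all hypothetical and are not the Hadoop, Avro, or Spark classes.

```java
public class SignatureSketch {

  // Hypothetical stand-ins for Hadoop's InputFormat and Spark's JavaPairRDD.
  public abstract static class InputFormat<K, V> {}

  public static class PairRDD<K, V> {
    public final Class<K> keyClass;
    public final Class<V> valueClass;

    public PairRDD(Class<K> keyClass, Class<V> valueClass) {
      this.keyClass = keyClass;
      this.valueClass = valueClass;
    }
  }

  // Hypothetical key and value classes.
  public static class RecordKey {}
  public static class NoValue {}

  // An input format that fixes K = RecordKey and V = NoValue.
  public static class RecordKeyInputFormat extends InputFormat<RecordKey, NoValue> {}

  // Same generic shape as JavaSparkContext.newAPIHadoopFile (minus the Configuration):
  // F must be an InputFormat<K, V>, and the result is typed by the same K and V.
  public static <K, V, F extends InputFormat<K, V>> PairRDD<K, V> newAPIHadoopFile(
      String path, Class<F> fClass, Class<K> kClass, Class<V> vClass) {
    return new PairRDD<>(kClass, vClass);
  }

  public static void main(String[] args) {
    // Compiles: format, key class, value class, and result type all agree.
    PairRDD<RecordKey, NoValue> ok = newAPIHadoopFile(
        "in.avro", RecordKeyInputFormat.class, RecordKey.class, NoValue.class);
    System.out.println(ok.keyClass.getSimpleName() + ", " + ok.valueClass.getSimpleName());

    // Does not compile: the declared pair type disagrees with the class arguments.
    // PairRDD<String, Integer> bad = newAPIHadoopFile(
    //     "in.avro", RecordKeyInputFormat.class, RecordKey.class, NoValue.class);
  }
}
```

The loadAvroFile helper above follows the same rule: it passes AvroKey.class and NullWritable.class and assigns the result to a JavaPairRDD<AvroKey, NullWritable>.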

You may also need to use KryoSerializer and register a custom KryoRegistrator:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator", "com.test.avro.MyKryoRegistrator");

public class MyKryoRegistrator implements KryoRegistrator {

  public static class SpecificInstanceCollectionSerializer<T extends Collection> extends CollectionSerializer {
    Class<T> type;
    public SpecificInstanceCollectionSerializer(Class<T> type) {
      this.type = type;
    }

    @Override
    protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
      return kryo.newInstance(this.type);
    }

    @Override
    protected Collection createCopy(Kryo kryo, Collection original) {
      return kryo.newInstance(this.type);
    }
  }


  Logger logger = LoggerFactory.getLogger(this.getClass());

  @Override
  public void registerClasses(Kryo kryo) {
    // Avro POJOs contain java.util.List which have GenericData.Array as their runtime type
    // because Kryo is not able to serialize them properly, we use this serializer for them
    kryo.register(GenericData.Array.class, new SpecificInstanceCollectionSerializer<>(ArrayList.class));
    kryo.register(YourAvroClassName.class);
  }
}
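
SpecificInstanceCollectionSerializer overrides create and createCopy so that a fresh instance of the configured class (ArrayList in the registration above) is built instead of an instance of the collection's own runtime class. The same idea can be illustrated without Kryo or Avro. This is only a plain-Java analogy: FixedTypeCopy and copyInto are hypothetical names, and reflection stands in for kryo.newInstance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class FixedTypeCopy {

  // Hypothetical helper: ignores the runtime class of `original` and always
  // instantiates `targetType` through its no-argument constructor.
  @SuppressWarnings({"unchecked", "rawtypes"})
  public static <E> Collection<E> copyInto(
      Class<? extends Collection> targetType, Collection<E> original) {
    try {
      Collection<E> fresh = targetType.getDeclaredConstructor().newInstance();
      fresh.addAll(original);
      return fresh;
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + targetType.getName(), e);
    }
  }

  public static void main(String[] args) {
    // Arrays.asList returns a list whose runtime class is not java.util.ArrayList.
    List<String> source = Arrays.asList("a", "b");
    Collection<String> copy = copyInto(ArrayList.class, source);
    System.out.println(copy.getClass().getName() + " " + copy);
  }
}
```

In the registrator, the analogous step is registering GenericData.Array with a serializer that always creates an ArrayList.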

Hope this helps.