Question

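In my MapReduce job, I use AvroParquetOutputFormat to write Parquet files using an Avro schema.

The application logic requires the Reducer to create multiple types of files, each with its own Avro schema.

The AvroParquetOutputFormat class has a static method, setSchema(), to set the Avro schema of the output. Looking at the code, AvroParquetOutputFormat uses AvroWriteSupport.setSchema(), which is also a static implementation.

Without extending AvroWriteSupport and hacking the logic, is there a simpler way to write outputs with multiple Avro schemas from AvroParquetOutputFormat in a single MR job?

Any pointers/inputs are highly appreciated.

Thanks & Regards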

MK

Answer 1

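This may be too late to answer, but I also ran into this problem and came up with a solution.

First of all, parquet-mr has no built-in support for something like a MultipleAvroParquetOutputFormat. To get similar behavior, I used MultipleOutputs.

For a map-only job, write your mapper like this: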

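import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
// Older parquet-mr releases use the package parquet.avro instead.
import org.apache.parquet.avro.AvroWriteSupport;
// Assumption: KafkaAvroDecoder is the Confluent class; adjust the import if yours differs.
import io.confluent.kafka.serializers.KafkaAvroDecoder;

public class EventMapper extends Mapper<LongWritable, BytesWritable, Void, GenericRecord> {

    // Must be initialized (for example in setup) before map() runs; not shown here.
    protected KafkaAvroDecoder deserializer;
    protected String outputPath = "";

    // Using MultipleOutputs to write custom named files
    protected MultipleOutputs<Void, GenericRecord> mos;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Configuration conf = context.getConfiguration();
        outputPath = conf.get(FileOutputFormat.OUTDIR);
        mos = new MultipleOutputs<Void, GenericRecord>(context);
    }

    @Override
    public void map(LongWritable ln, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // copyBytes() returns only the valid bytes; getBytes() may include buffer padding.
        GenericRecord record = (GenericRecord) deserializer.fromBytes(value.copyBytes());
        Schema schema = record.getSchema();

        // Set the schema before writing so that a writer created for this path picks it up.
        AvroWriteSupport.setSchema(context.getConfiguration(), schema);

        String mergeEventsPath = outputPath + "/" + schema.getName(); // Adding '/' will do no harm
        mos.write((Void) null, record, mergeEventsPath);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Closes every writer opened by MultipleOutputs.
        mos.close();
    }

}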

This creates a new RecordWriter for each schema and a new Parquet file whose name starts with the schema name, for example schema1-r-0000.parquet.
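To picture why setting the schema right before mos.write(...) is enough, here is a simplified, library-free model of that behaviour. It is only a sketch under stated assumptions: ToyWriter and ToyMultipleOutputs are hypothetical stand-ins (not Hadoop or parquet-mr classes), the schema is assumed to live under a single configuration key, a writer is assumed to be created once per base output path and to read the schema at creation time, and the file name pattern <base>-<m|r>-<five-digit partition>.parquet is my understanding of the usual naming (a compression codec may add its own extension).

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not Hadoop or parquet-mr APIs.
final class SchemaPerPathDemo {

    // Assumed key under which the single job-wide schema is stored.
    static final String SCHEMA_KEY = "parquet.avro.schema";

    // A writer remembers the schema that was in the configuration when it was created.
    static final class ToyWriter {
        final String fileName;
        final String schema;

        ToyWriter(String fileName, String schema) {
            this.fileName = fileName;
            this.schema = schema;
        }
    }

    // Caches one writer per base output path.
    static final class ToyMultipleOutputs {
        private final Map<String, String> conf;
        private final Map<String, ToyWriter> writers = new LinkedHashMap<>();

        ToyMultipleOutputs(Map<String, String> conf) {
            this.conf = conf;
        }

        ToyWriter write(String basePath) {
            // The schema is read only the first time a path is seen.
            return writers.computeIfAbsent(
                    basePath,
                    p -> new ToyWriter(fileName(p, 'm', 0), conf.get(SCHEMA_KEY)));
        }
    }

    // Assumed naming pattern: <base>-<m|r>-<five-digit partition>.parquet
    static String fileName(String base, char taskType, int partition) {
        return String.format(Locale.ROOT, "%s-%c-%05d.parquet", base, taskType, partition);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        ToyMultipleOutputs mos = new ToyMultipleOutputs(conf);

        conf.put(SCHEMA_KEY, "Click");
        ToyWriter click = mos.write("out/Click");

        conf.put(SCHEMA_KEY, "Purchase"); // overwrites the single key
        ToyWriter purchase = mos.write("out/Purchase");

        // The cached writer for out/Click keeps the schema it was created with.
        ToyWriter clickAgain = mos.write("out/Click");

        System.out.println(click.fileName + " uses schema " + click.schema);
        System.out.println(purchase.fileName + " uses schema " + purchase.schema);
        System.out.println("same writer reused: " + (click == clickAgain));
    }
}
```

One consequence of this model: each schema needs its own base output path, because a path that already has a writer keeps the schema it started with.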

This will also create the default part-r-0000x.parquet files based on the schema set in the driver. To avoid that, use LazyOutputFormat, like this:

LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
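The idea behind the lazy wrapper can be shown with a small library-free sketch. This is a toy model, not the Hadoop implementation: Writer, LazyWriter and openFile are hypothetical names, and the "file system" is just a list of file names. The assumption it illustrates is that the underlying writer (and therefore the default part file) is only created when the first record is written through it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: hypothetical stand-ins, not the Hadoop LazyOutputFormat implementation.
final class LazyWriterDemo {

    interface Writer {
        void write(String record);
    }

    // "Opening" a file is modelled by recording its name in the given list.
    static Writer openFile(List<String> fileSystem, String fileName) {
        fileSystem.add(fileName);
        return record -> { /* a real writer would append the record here */ };
    }

    // Defers creation of the real writer until the first record arrives.
    static final class LazyWriter implements Writer {
        private final Supplier<Writer> factory;
        private Writer delegate;

        LazyWriter(Supplier<Writer> factory) {
            this.factory = factory;
        }

        @Override
        public void write(String record) {
            if (delegate == null) {
                delegate = factory.get();
            }
            delegate.write(record);
        }
    }

    public static void main(String[] args) {
        List<String> fileSystem = new ArrayList<>();

        // Eager: the default file exists even though nothing is written to it.
        openFile(fileSystem, "part-m-00000.parquet");
        System.out.println("eager: " + fileSystem);

        fileSystem.clear();

        // Lazy: nothing is created until the first write.
        Writer lazy = new LazyWriter(() -> openFile(fileSystem, "part-m-00000.parquet"));
        System.out.println("lazy, before any write: " + fileSystem);
        lazy.write("record");
        System.out.println("lazy, after first write: " + fileSystem);
    }
}
```

In the mapper above, every record goes through MultipleOutputs and nothing is written to the default output, so under this model the default part file is never opened.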

Hope this helps.
