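How do I set the compression codec when using Avro as the output format of a mapper in a Hadoop MR job?
The old "mapred" API provides this method:
org.apache.avro.mapred.AvroJob.setOutputCodec(JobConf job, String codec)
However, the newer "mapreduce" API has no equivalent. How do I set the codec with the newer "mapreduce" API?
I naively tried to set the codec through the job configuration, but it had no effect: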
public int run(String[] args) throws Exception {
[..]
Job job = new Job(getConf());
job.setJarByClass(MapReduceExample.class);
job.setJobName("MRExample");
// hm .. this doesn't seem to work, output still has "null" codec
job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC,
CodecFactory.deflateCodec(6).toString());
job.setMapperClass(ExampleMapper.class);
[..]
AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.BYTES));
// here I was hoping to use something like
// AvroJob.setMapOutputCodec(job, "deflate")
[..]
return (job.waitForCompletion(true) ? 0 : 1);
}
When I open the generated Avro file with Python, the codec is still reported as 'null':
>>> from avro.datafile import DataFileReader
>>> from avro.io import DatumReader
>>> av_fh = open("output/part-r-00000.avro", "rb")
>>> av_rd = DataFileReader(av_fh, DatumReader())
>>> av_rd.codec
'null'
Answer 0 (score: 0)
It works when I change the following lines
job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC,
CodecFactory.deflateCodec(6).toString());
to
FileOutputFormat.setCompressOutput(job, true);
job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC,
DataFileConstants.DEFLATE_CODEC);
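For reference, the change above can be wrapped in a small helper. This is only a sketch under a few assumptions: the class AvroCodecConfig and the method enableDeflateOutput are hypothetical names, not part of Avro or Hadoop; AvroJob here is org.apache.avro.mapreduce.AvroJob; and the deflate level is assumed to be read from the key org.apache.avro.mapred.AvroOutputFormat.DEFLATE_LEVEL_KEY, which may differ between Avro versions. The original attempt most likely failed for two reasons: output compression was never enabled, and the codec property expects a codec name such as "deflate" rather than the string form of a CodecFactory.

```java
import org.apache.avro.file.DataFileConstants;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical helper class; not part of Avro or Hadoop.
public final class AvroCodecConfig {

    private AvroCodecConfig() {
    }

    // Enables deflate-compressed Avro output for a "mapreduce" API job.
    public static void enableDeflateOutput(Job job, int level) {
        // Switch output compression on; per the answer above, the codec
        // setting only took effect once this flag was set.
        FileOutputFormat.setCompressOutput(job, true);

        // Select the codec by name ("deflate").
        job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC,
                DataFileConstants.DEFLATE_CODEC);

        // Assumption: the Avro output format reads the deflate level
        // from this key; check it against the Avro version in use.
        job.getConfiguration().setInt(
                org.apache.avro.mapred.AvroOutputFormat.DEFLATE_LEVEL_KEY, level);
    }
}
```

In the run() method from the question, this would be called as AvroCodecConfig.enableDeflateOutput(job, 6) after the Job is created. Re-running the Python check from the question should then show whether av_rd.codec has changed from 'null'.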