Question

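I need to generate a larger number of output files from particular reducers. I have implemented a custom partitioner that routes the output data to the appropriate reducer. However, some of my reducers receive more than 20 GB of data while others receive only 15 MB. My question is how to split the 20 GB of data inside a reducer into smaller output files, for example 5 smaller output files per reducer, so that data processing in the reduce phase is faster.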

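I searched on Google and found that I have to use MultipleOutputs to solve my problem, but I am confused about how to use it. Please suggest how to implement this.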

I am reading data from HBase and writing it to text files.

Here is my driver code:

Job job = new Job(hbaseConf);
    job.setJarByClass(HBaseToFileDriver.class);
    job.setJobName("Importing Data from HBase to File:::" + args[0]);

    Scan scan = new Scan();
    scan.setFilter(new RowFilter(CompareOp.EQUAL, new SubstringComparator("Japan")));
    scan.setCaching(10000); // 1 is the default in Scan, which will be bad for MapReduce jobs
    scan.setCacheBlocks(false); // don't set to true for MR jobs
    scan.addFamily(Bytes.toBytes("cf"));

    TableMapReduceUtil.initTableMapperJob(args[0], // input table
        scan, // Scan instance to control CF and attribute selection
        MyMapper.class, // mapper class
        Text.class, // mapper output key
        IntWritable.class, // mapper output value
        job);
    job.setReducerClass(MyReducer.class); // reducer class
    job.setPartitionerClass(MyPartioner.class);
    job.setNumReduceTasks(6); // at least one, adjust as required
    //job.setInt("outputs.per.reducer", 4);

    FileOutputFormat.setOutputPath(job, new Path(args[1]));

Here is my Mapper code:

public class MyMapper extends TableMapper<Text, IntWritable> {

    private final IntWritable ONE = new IntWritable(1);
    private Text text = new Text();



    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {

    String FundamentalSeriesId = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("FundamentalSeriesId")));
    String FundamentalSeriesId_objectTypeId = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("FundamentalSeriesId_objectTypeId")));

    text.set(FundamentalSeriesId+"|^|"+FundamentalSeriesId_objectTypeId+"|!|");

    context.write(text, ONE);
    }
}

Here is my partitioner:

public class MyPartioner extends Partitioner<Text, IntWritable> {

    public int getPartition(Text key, IntWritable value, int setNumRedTask) {

    String str = key.toString();
    if (str.contains("Japan|2014")) {
        return 0;
    } else if (str.contains("Japan|2013")) {
        return 1;
    }  else if (str.contains("Japan|2012")) {
        return 2;
    } else if (str.contains("Japan|2011")) {
        return 3;
    } else if (str.contains("Japan|2010")) {
        return 4;
    }

        return 5;

    }

}

Answer 1

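If you launch only 6 reducers and distribute the data by year, then, depending on how the data is distributed, some reducers will end up with 20 GB to process while others get only 15 MB.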

If you use MultipleOutputFormat and still distribute the data into 6 buckets, you will end up in the same situation.

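You can either find another attribute to partition on in addition to Year, or increase the number of reducers and define the partitioner based on the hashCode of that attribute (even in that case, some reducers may still receive more data to process than others).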
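As a rough illustration of the second option, here is a minimal plain-Java sketch of the partition arithmetic: each year bucket from the question's partitioner is spread over several reducers by hashing the key. The class name `SaltedYearPartition`, the `SLOTS_PER_BUCKET` value of 5, and the assumption that the job would run with 30 reducers (6 buckets x 5 slots) are all hypothetical and not from the original post.

```java
// Plain-Java sketch of the partition arithmetic; it uses no Hadoop classes.
// All names and constants here are illustrative assumptions.
final class SaltedYearPartition {
    // Years in the same order as the question's partitioner (buckets 0..4).
    static final String[] YEARS = {"2014", "2013", "2012", "2011", "2010"};
    // Hypothetical: how many reducers each bucket is spread across.
    static final int SLOTS_PER_BUCKET = 5;

    // Same bucketing rule as the question's partitioner; index 5 is the catch-all.
    static int bucketFor(String key) {
        for (int i = 0; i < YEARS.length; i++) {
            if (key.contains("Japan|" + YEARS[i])) {
                return i;
            }
        }
        return YEARS.length;
    }

    // Picks one of the bucket's reducers from the key's hash code.
    // The same key always maps to the same partition.
    static int partitionFor(String key, int numReduceTasks) {
        int bucket = bucketFor(key);
        // Mask the sign bit so the modulo result is never negative.
        int slot = (key.hashCode() & Integer.MAX_VALUE) % SLOTS_PER_BUCKET;
        // The final modulo keeps the result in range if fewer reducers are configured.
        return (bucket * SLOTS_PER_BUCKET + slot) % numReduceTasks;
    }
}
```

In the real job, `getPartition(Text key, IntWritable value, int numPartitions)` could simply return `SaltedYearPartition.partitionFor(key.toString(), numPartitions)`. Note that this spreads every bucket evenly, including the small ones, and a skewed key distribution inside a year can still leave some reducers with more data than others.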

If you only want a reference, you can find a MultipleOutputs example at the link below:

http://hadooptutorial.info/mapreduce-multiple-outputs-use-case/
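If the goal is only to get several smaller files out of one reducer, one possible approach is to rotate the base output name passed to `MultipleOutputs.write(key, value, baseOutputPath)`. The sketch below shows just the name-rotation logic in plain Java; the class `RollingOutputName`, its `part` prefix, and the record-count threshold are hypothetical, and counting records is only a stand-in for tracking file size.

```java
// Plain-Java sketch of rotating output names; it uses no Hadoop classes.
// The class name and the record-count threshold are illustrative assumptions.
final class RollingOutputName {
    private final String prefix;
    private final long maxRecordsPerFile;
    private long recordsInCurrentFile = 0;
    private int fileIndex = 0;

    RollingOutputName(String prefix, long maxRecordsPerFile) {
        if (maxRecordsPerFile <= 0) {
            throw new IllegalArgumentException("maxRecordsPerFile must be positive");
        }
        this.prefix = prefix;
        this.maxRecordsPerFile = maxRecordsPerFile;
    }

    // Returns the base name for the next record and moves on to a
    // new name once the current file has reached its record limit.
    String next() {
        if (recordsInCurrentFile == maxRecordsPerFile) {
            fileIndex++;
            recordsInCurrentFile = 0;
        }
        recordsInCurrentFile++;
        return prefix + "_" + fileIndex;
    }
}
```

In a reducer, this would be used roughly as follows: create `new MultipleOutputs<Text, IntWritable>(context)` in `setup`, call `mos.write(key, value, names.next())` in `reduce`, and call `mos.close()` in `cleanup`. This only changes how the reducer's output is laid out on disk; the reducer with 20 GB still processes all 20 GB, so it does not by itself make the reduce phase faster.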
