
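I think your assumption that writing a custom partitioner lets you control how many reduce tasks are created is mistaken. Please see the explanation below.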

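The partitioner only decides which reducer a key and its list of values are sent to. The default implementation makes that decision from the key's hash value, as shown here:

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative,
        // then map the hash onto one of the configured reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }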
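To make the formula above concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name `PartitionDemo` and the helper `partitionFor` are hypothetical names introduced only for illustration. The helper applies the same expression as `HashPartitioner.getPartition`.

```java
// Hypothetical stand-alone demo; not part of Hadoop.
class PartitionDemo {

    // Same expression HashPartitioner uses: clear the sign bit,
    // then take the remainder modulo the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed job setting for this demo
        Object[] keys = {"a", "hadoop", Integer.valueOf(-7)};
        for (Object key : keys) {
            // The result is always in the range [0, numReduceTasks).
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Note that `numReduceTasks` is an input to the partitioner. The partitioner picks a slot among the reduce tasks the job was already configured with and cannot change how many there are.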

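The number of output files, then, depends on the number of reduce tasks you configure for the job. Suppose you configure 3 reduce tasks and write a custom partitioner that sends keys to only 2 of the reducers. In that case you will find an empty part-r-00002 output file for the third reducer, because it received no records after partitioning. You can avoid this empty part file by using LazyOutputFormat, which creates an output file only when the first record is written to it.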

For example:

    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
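The empty-partition scenario described above can be simulated in plain Java. This is only a sketch under stated assumptions: the class `TwoOfThreePartitionDemo`, its methods, and the sample keys are hypothetical, and no Hadoop classes are involved. It mimics a custom partitioner that only ever returns partition 0 or 1 while the job is assumed to have 3 reduce tasks.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation; not a real Hadoop Partitioner.
class TwoOfThreePartitionDemo {

    // Mimics a custom partitioner that uses at most 2 partitions,
    // regardless of how many reduce tasks the job is configured with.
    static int getPartition(String key, int numReduceTasks) {
        int usable = Math.min(2, numReduceTasks);
        return (key.hashCode() & Integer.MAX_VALUE) % usable;
    }

    // Counts how many keys land in each of the configured partitions.
    static int[] countPerPartition(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[getPartition(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "date", "elderberry");
        int[] counts = countPerPartition(keys, 3); // 3 reduce tasks assumed
        for (int i = 0; i < counts.length; i++) {
            System.out.println("partition " + i + ": " + counts[i] + " record(s)");
        }
    }
}
```

The last slot always ends up with zero records, yet it still exists because the job asked for 3 reduce tasks. In a real job that third reduce task would still run and, without LazyOutputFormat, would write an empty part file.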

I hope this clears up your doubt.

MapReduce: is the number of output files in HDFS determined by reduce tasks or by reduce method calls?

1 Answer: