Question

In my MapReduce program, I have to use a partitioner:

public class TweetPartitionner extends HashPartitioner<Text, IntWritable>{

    public int getPartition(Text a_key, IntWritable a_value, int a_nbPartitions) {
        if(a_key.toString().startsWith("#"))
            return 0;
        else
            return 1;
    }

}

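I have set the number of reduce tasks: job.setNumReduceTasks(2);

But I get the following error: java.io.IOException: Illegal partition for #rescinfo (1)

The a_nbPartitions parameter has the value 1.

I read this in another post, Hadoop: Number of reducer is not equal to what I have set in program

Running it in Eclipse seems to use the local job runner, which only supports 0 or 1 reducers. If you try to set more than one reducer, it ignores the setting and uses just one anyway.

I develop against Hadoop 0.20.2 installed on Cygwin, and of course I use Eclipse. What should I do?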

Answer 1

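Actually, you do not need a dedicated Hadoop cluster. You only have to tell Eclipse that you intend to run this job on your pseudo-distributed cluster rather than locally. To do that, add the following lines to your code: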

Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:9000");
conf.set("mapred.job.tracker", "localhost:9001");

After that, set the number of reducers to 2 with:

job.setNumReduceTasks(2);

And yes, you have to be very sure about your partitioning logic. You can visit this page, which shows how to write a custom partitioner.

HTH

Answer 2

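Unless you have a dedicated Hadoop cluster to run your job on, there is no way to have more than 1 reducer in local mode. You can configure Eclipse to submit your job to a Hadoop cluster, and then your configuration will be taken into account.

In any case, when writing your own partitioner you should always use return Math.min(i, a_nbPartitions-1), where i is the partition index you would otherwise return.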
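
As a rough illustration of the Math.min advice, here is a minimal sketch in plain Java with no Hadoop dependency. The class name PartitionClamp and the helpers desiredPartition and safePartition are hypothetical names made up for this example; the partitioning rule is the one from the question. In a real partitioner the same expression would go inside getPartition.

```java
public class PartitionClamp {

    // Hypothetical helper: the partition the question's logic would like to use
    // (0 for keys starting with "#", 1 for everything else).
    public static int desiredPartition(String key) {
        return key.startsWith("#") ? 0 : 1;
    }

    // Clamp to the last valid index so the result never exceeds numPartitions - 1,
    // even when the runner only provides a single reducer.
    public static int safePartition(String key, int numPartitions) {
        return Math.min(desiredPartition(key), numPartitions - 1);
    }

    public static void main(String[] args) {
        System.out.println(safePartition("#rescinfo", 2)); // 0
        System.out.println(safePartition("hello", 2));     // 1
        System.out.println(safePartition("hello", 1));     // 0 (clamped: only one reducer)
    }
}
```

With a single reducer every key ends up in partition 0, so the job runs in local mode without the Illegal partition error, while two reducers still get the intended split.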

Hadoop and the number of reducers in Eclipse
