Question

I have my Word Count program working, and now I want to find the maximum occurrences.

My WordCount output looks like this:

File1:Word1: x
File1:Word2: x

Here File is the file, Word is the word that was searched for, and x is the count.
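As a rough illustration of that format, the following plain-Java sketch (no Hadoop classes involved) splits one such line into its three parts. `CountRecord` and `parse` are hypothetical names introduced here, and the code assumes each line looks exactly like the sample above; if the real separator differs (for instance a tab), the split would need adjusting.

```java
// Hypothetical helper, not part of Hadoop: holds one parsed line of the first job's output.
final class CountRecord {
    final String file;
    final String word;
    final int count;

    CountRecord(String file, String word, int count) {
        this.file = file;
        this.word = word;
        this.count = count;
    }

    // Parses a line such as "File1:Word2: 10".
    // Assumes exactly two ':' separators and no ':' inside the file name or the word.
    static CountRecord parse(String line) {
        String[] pieces = line.split(":");
        if (pieces.length != 3) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        return new CountRecord(
                pieces[0].trim(),
                pieces[1].trim(),
                Integer.parseInt(pieces[2].trim()));
    }

    public static void main(String[] args) {
        CountRecord r = CountRecord.parse("File1:Word2: 10");
        System.out.println(r.file + " / " + r.word + " / " + r.count);
    }
}
```

The count sits in the third piece of the split, so any later step that needs it has to read `pieces[2]` and not only `pieces[1]`.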

I want to get the maximum of these word counts. So, taking my example:

File1:Word1: 4
File1:Word2: 10
File2:Word1: 4
File2:Word2: 1

I want Word2 (for File1) and Word1 (for File2) each to be incremented by 1, because each of them has the highest count in its file.
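To pin down the expected result, here is a minimal plain-Java sketch (deliberately not Hadoop code) of the computation described above: find the word with the highest count in each file, then tally how often each word wins. `MaxWordPerFile` and `tallyWinners` are hypothetical names, the line format is assumed to match the sample, and ties are resolved in favour of the first word seen for a file, which is an assumption the question does not settle.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MaxWordPerFile {
    // Counts, for each word, in how many files it has the highest count.
    // Assumes lines like "File1:Word2: 10"; on a tie the first word seen for a file wins.
    static Map<String, Integer> tallyWinners(List<String> lines) {
        Map<String, String> bestWord = new LinkedHashMap<>();   // file -> word with the highest count so far
        Map<String, Integer> bestCount = new LinkedHashMap<>(); // file -> that highest count
        for (String line : lines) {
            String[] pieces = line.split(":");
            String file = pieces[0].trim();
            String word = pieces[1].trim();
            int count = Integer.parseInt(pieces[2].trim());
            Integer current = bestCount.get(file);
            if (current == null || count > current) {
                bestCount.put(file, count);
                bestWord.put(file, word);
            }
        }
        // Each file contributes exactly one winner, so the tallies sum to the number of files.
        Map<String, Integer> winners = new LinkedHashMap<>();
        for (String word : bestWord.values()) {
            winners.merge(word, 1, Integer::sum);
        }
        return winners;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "File1:Word1: 4",
                "File1:Word2: 10",
                "File2:Word1: 4",
                "File2:Word2: 1");
        System.out.println(tallyWinners(lines)); // prints {Word2=1, Word1=1}
    }
}
```

In MapReduce terms the first loop corresponds to grouping records by file and the second to grouping the winners by word, so the same logic may need two passes over the data, or a file-keyed reduce followed by a word-keyed one.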

Unfortunately, I am having trouble getting the output I want.

My map function looks like this:

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
        throws IOException { 

    String parsedLine = value.toString();
    String[] pieces = parsedLine.split(":");
    StringTokenizer tokenizer = new StringTokenizer(pieces[1]);

    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        outputCollector.collect(new Text(token), ONE);
    }
}

My reduce looks like this:

private int maximum = 0;

@Override
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
        throws IOException {

    Text occuredKey = new Text();

    int total = 0;
    while (values.hasNext()) {
        total += values.next().get();
    }

    if (total > maximum) {
        maximum = total;
        occuredKey.set(key);
    }
    outputCollector.collect(occuredKey, new IntWritable(total));
}

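I have tried a few things:

Putting the keys (e.g. Word1, Word2) in a map, but the key did not work.
Iterating in my map and, whenever the word was found, adding it to a list, then comparing the list sizes.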

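My understanding is that the output of the first job is the input of the second job, but that seems off, because I cannot get the counts from the first job.

Any help is much appreciated; I have been stuck on this for a while.

To be clear about the output: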

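I have 60 files, each containing the same 5 words that the Word Count job searches for. So the output of my first job has 60 x 5 records in total. The second job should take the 5 words and count how many times each word has the highest count among a file's 5 records. My output should therefore be 5 records, and the counts in those 5 records should sum to 60.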

How do I get the maximum word count in Hadoop?
