Hadoop inverted index with per-file counts

Time: 2014-05-01 15:59:28

Tags: java hadoop mapreduce

I have two files as input:

fileA.txt:

learn hadoop
learn java

fileB.txt:

hadoop java
eclipse eclipse

Expected output:

learn   fileA.txt:2

hadoop  fileA.txt:1 , fileB.txt:1

java    fileA.txt:1 , fileB.txt:1

eclipse fileB.txt:2

My reduce method:

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    Set<Text> outputValues = new HashSet<Text>();
    while (values.hasNext()) {
        // Copy the value, because Hadoop reuses the same Text instance
        Text value = new Text(values.next());
        // The set drops duplicates, which also discards the counts
        outputValues.add(value);
    }
    boolean isfirst = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!isfirst) {
            toReturn.append("/");
        }
        isfirst = false;
        toReturn.append(outputIter.next().toString());
    }
    output.collect(key, new Text(toReturn.toString()));
}

I need help with the counting part (the number of occurrences of each word per file).

I managed to print:

learn   fileA.txt

hadoop  fileA.txt / fileB.txt

java    fileA.txt / fileB.txt

eclipse fileB.txt

But I cannot get it to print the count for each file.

Any help is greatly appreciated.

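1 answer:

Answer 0: (score: 1)

As far as I can tell, this should print what you want, provided your mapper emits the file name once for every occurrence of a word: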

@Override
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Count how many times each file name arrives for this word
    Map<String, Integer> fileToCnt = new HashMap<String, Integer>();
    while (values.hasNext()) {
        String file = values.next().toString();
        Integer current = fileToCnt.get(file);
        if (current == null) {
            current = 0;
        }
        fileToCnt.put(file, current + 1);
    }
    // Join the entries as "file:count, file:count"
    boolean isfirst = true;
    StringBuilder toReturn = new StringBuilder();
    for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
        if (!isfirst) {
            toReturn.append(", ");
        }
        isfirst = false;
        toReturn.append(entry.getKey()).append(":").append(entry.getValue());
    }
    output.collect(key, new Text(toReturn.toString()));
}
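
To check the counting logic without a cluster, the flow can be simulated in plain Java. The following is only a rough sketch and not Hadoop code: the class InvertedIndexSketch and the methods mapPhase, reducePhase and buildIndex are hypothetical names made up for illustration, and it assumes the mapper emits one (word, file name) pair per occurrence. It uses a TreeMap instead of a HashMap so that the file names come out in a stable order, and the " , " separator from the expected output in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone simulation; none of these names come from Hadoop.
public class InvertedIndexSketch {

    // Stand-in for the map phase: one (word, fileName) pair per occurrence.
    public static List<String[]> mapPhase(String fileName, String contents) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : contents.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new String[] {word, fileName});
            }
        }
        return pairs;
    }

    // Stand-in for the reduce phase: count the file names seen for one word.
    public static String reducePhase(Iterable<String> fileNames) {
        // TreeMap keeps file names sorted, so the output order is deterministic.
        Map<String, Integer> fileToCnt = new TreeMap<>();
        for (String file : fileNames) {
            fileToCnt.merge(file, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
            if (sb.length() > 0) {
                sb.append(" , ");
            }
            sb.append(entry.getKey()).append(':').append(entry.getValue());
        }
        return sb.toString();
    }

    // Stand-in for the shuffle: group file names by word, then reduce each group.
    public static Map<String, String> buildIndex(Map<String, String> files) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String[] pair : mapPhase(file.getKey(), file.getValue())) {
                grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, String> index = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            index.put(group.getKey(), reducePhase(group.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("fileA.txt", "learn hadoop\nlearn java");
        files.put("fileB.txt", "hadoop java\neclipse eclipse");
        buildIndex(files).forEach(
                (word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

With the two sample files from the question, this should give the same per-file counts as the expected output. In a real job the same idea would apply inside reduce: HashMap iteration order is unspecified, so switching to a TreeMap is one way to get a stable order of file names.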