I have two files as input:
fileA.txt:
learn hadoop
learn java
fileB.txt:
hadoop java
eclipse eclipse
Expected output:
learn fileA.txt:2
hadoop fileA.txt:1 , fileB.txt:1
java fileA.txt:1 , fileB.txt:1
eclipse fileB.txt:2
My reduce method:
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        Set<Text> outputValues = new HashSet<Text>();
        while (values.hasNext()) {
            Text value = new Text(values.next());
            // delete duplicates
            outputValues.add(value);
        }
        boolean isfirst = true;
        StringBuilder toReturn = new StringBuilder();
        Iterator<Text> outputIter = outputValues.iterator();
        while (outputIter.hasNext()) {
            if (!isfirst) {
                toReturn.append("/");
            }
            isfirst = false;
            toReturn.append(outputIter.next().toString());
        }
        output.collect(key, new Text(toReturn.toString()));
    }
I need help with the counting part (counting each word per file).
So far I have managed to print:
learn fileA.txt
hadoop fileA.txt / fileB.txt
java fileA.txt / fileB.txt
eclipse fileB.txt
but I can't print the count for each file.
Any help is much appreciated.
Answer 0 (score: 1)
As far as I can tell, this should print what you want:
    // Needs java.util.HashMap and java.util.Map in addition to the existing imports.
    @Override
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Count how many times each file name shows up for this word,
        // instead of collapsing duplicates in a Set.
        Map<String, Integer> fileToCnt = new HashMap<String, Integer>();
        while (values.hasNext()) {
            String file = values.next().toString();
            Integer current = fileToCnt.get(file);
            if (current == null) {
                current = 0;
            }
            fileToCnt.put(file, current + 1);
        }
        // Join the entries as "file:count, file:count".
        // HashMap iteration order is unspecified, so the file order may vary.
        boolean isfirst = true;
        StringBuilder toReturn = new StringBuilder();
        for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
            if (!isfirst) {
                toReturn.append(", ");
            }
            isfirst = false;
            toReturn.append(entry.getKey()).append(":").append(entry.getValue());
        }
        output.collect(key, new Text(toReturn.toString()));
    }
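
If you want to check the counting logic without running a Hadoop job, here is a minimal plain-Java sketch of the same idea. It assumes the mapper emits the file name once per occurrence of a word, so the reducer sees "fileA.txt" twice for "learn". The class PerFileCount and the helper countPerFile are hypothetical names made up for this illustration and are not part of Hadoop. It also swaps the HashMap for a TreeMap, which is my own change to get a stable, sorted file order in the output.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PerFileCount {

    // Hypothetical helper (not a Hadoop API): counts how often each file name
    // occurs among the values of one key and joins them as "file:count".
    static String countPerFile(Iterator<String> fileNames) {
        // TreeMap keeps file names sorted, so the output order is deterministic.
        Map<String, Integer> fileToCnt = new TreeMap<String, Integer>();
        while (fileNames.hasNext()) {
            // Start at 1 for a new file name, otherwise add 1 to the old count.
            fileToCnt.merge(fileNames.next(), 1, Integer::sum);
        }
        StringBuilder joined = new StringBuilder();
        for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
            if (joined.length() > 0) {
                joined.append(", ");
            }
            joined.append(entry.getKey()).append(":").append(entry.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Values the reducer would receive per word, assuming one
        // file name is emitted for every occurrence of that word.
        List<String> learn = Arrays.asList("fileA.txt", "fileA.txt");
        List<String> hadoop = Arrays.asList("fileB.txt", "fileA.txt");
        System.out.println("learn " + countPerFile(learn.iterator()));
        System.out.println("hadoop " + countPerFile(hadoop.iterator()));
    }
}
```

In the real reducer you would call toString() on each Text value, as in the method above, before counting. If your mapper emits each file name only once per word, every count will come out as 1, so the per-occurrence emit is what makes this work.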