Question

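The Mapper reads lines from a file... How can I emit key/value pairs once, after the whole file has been scanned, rather than once per line?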

Answer 1

With the new mapreduce API, you can override the Mapper.cleanup(Context) method and call Context.write(K, V) there, just as you would in the map method.

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  // Runs once per map task, after the last call to map()
  context.write(new Text("key"), new Text("value"));
}

With the old mapred API you can override the close() method instead, but you need to keep a reference to the OutputCollector that is passed to the map method:

private OutputCollector cachedCollector = null;

public void map(LongWritable key, Text value, OutputCollector outputCollector, Reporter reporter) throws IOException {
  if (cachedCollector == null) {
    cachedCollector = outputCollector;
  }

  // ...
}

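@Override
public void close() throws IOException {
  // cachedCollector is still null if map() was never called (empty input)
  if (cachedCollector != null) {
    cachedCollector.collect(outputKey, outputValue);
  }
}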

Answer 2

Do you need one key/value for a whole file, or one for multiple files?

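Case #1: use WholeFileInputFormat. You will receive the complete file contents as a single record. You can split that into records, process them all, and emit the final key/value at the end of processing.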

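Case #2: use the same FileInputFormat. Store all key/values in temporary storage. At the end, go through the temporary storage, emit whichever key/values you want, and suppress the ones you don't.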
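To make the "store now, emit at the end" idea in case #2 concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class EndOfFileEmitSketch, the word-count logic, and the minCount threshold are all made up for illustration; in a real job the map() body would live in your Mapper and the emitting loop would go in cleanup() (new API) or close() (old API), calling context.write() or collect() instead of filling a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free stand-in for a mapper: map() only accumulates,
// cleanup() is the single place where key/value pairs are emitted.
class EndOfFileEmitSketch {

    // Temporary storage; a TreeMap keeps the emitted keys in sorted order.
    private final Map<String, Integer> wordCounts = new TreeMap<>();

    // Called once per input line, like Mapper.map(); emits nothing.
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                wordCounts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Called once after the last line, like Mapper.cleanup().
    // Returns the pairs that would be written, formatted as "key<TAB>value".
    List<String> cleanup(int minCount) {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            if (e.getValue() >= minCount) { // suppress the pairs we do not want
                emitted.add(e.getKey() + "\t" + e.getValue());
            }
        }
        return emitted;
    }

    // Mimics the order in which the framework drives a mapper over one file.
    static List<String> run(List<String> lines, int minCount) {
        EndOfFileEmitSketch mapper = new EndOfFileEmitSketch();
        for (String line : lines) {
            mapper.map(line);
        }
        return mapper.cleanup(minCount);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b a c"), 2));
    }
}
```

With the two sample lines and a threshold of 2, the pairs for a and b are emitted and c is suppressed. Keep in mind that everything buffered this way has to fit in the map task's memory, so this only suits inputs whose intermediate state stays small.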

Answer 3

As an alternative to Chris's answer, you can achieve this by overriding run() in the Mapper class (new API):

public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {

  //map method here

  // Override the run()
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    // Emit your last <key, value> here
    context.write(lastOutputKey, lastOutputValue);
    cleanup(context);
  }
}

To make sure each mapper gets exactly one file to process, you have to create your own version of FileInputFormat and override isSplitable(), as follows:

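// Declared abstract because the record reader is left to a concrete subclass
public abstract class NonSplittableFileInputFormat<K, V> extends FileInputFormat<K, V> {

  // New-API signature; the old mapred API takes (FileSystem fs, Path filename)
  @Override
  protected boolean isSplitable(JobContext context, Path filename) {
    return false;
  }
}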
