Question

My Hadoop job needs to know the input path that each record came from.

For example, suppose I am running a job over a collection of S3 objects:

s3://bucket/file1
s3://bucket/file2
s3://bucket/file3

I would like to reduce over key-value pairs such as

s3://bucket/file1    record1
s3://bucket/file1    record2
s3://bucket/file2    record1
...

Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to do it than using a custom input format?

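I know that in a mapper this information is accessible from the MapContext (How to get the input file name in the mapper in a Hadoop program?), but I am using Apache Crunch, and I cannot control whether any given step runs as a Map or a Reduce. I can, however, reliably control the InputFormat, so that seems to me to be the place to do this.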

Answer 1

Please take a look at my blog article to customize inputsplit and recordreader.

The code in that blog sets the key and value as follows (lines 69-70 of the record reader code):

value = new Text(line);
key = new LongWritable(splitstart);

In your case you would need to set the key as follows instead. I have not tested this.

key = new Text(fsplit.getPath().toString());  // the key field must be declared as Text rather than LongWritable
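The idea above (keep the line-reading logic, but hand back the file path as the key) can be sketched without Hadoop on the classpath. The following is only a plain-Java model of the delegation pattern, not the blog's code and not Hadoop's API: `SimpleRecordReader`, `LineReader`, `PathKeyedReader`, and `PathKeyedDemo` are hypothetical names, and the interface merely imitates the `nextKeyValue` / `getCurrentKey` / `getCurrentValue` shape of Hadoop's `RecordReader`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the shape of Hadoop's RecordReader<K, V>.
interface SimpleRecordReader<K, V> {
    boolean nextKeyValue() throws IOException;
    K getCurrentKey();
    V getCurrentValue();
}

// Stand-in for a line-oriented reader: the key is the record index, the value is the line.
// (Hadoop's own LineRecordReader uses the byte offset as the key instead.)
class LineReader implements SimpleRecordReader<Long, String> {
    private final BufferedReader in;
    private long index = -1;
    private String line;

    LineReader(String contents) {
        this.in = new BufferedReader(new StringReader(contents));
    }

    public boolean nextKeyValue() throws IOException {
        line = in.readLine();
        if (line == null) {
            return false;
        }
        index++;
        return true;
    }

    public Long getCurrentKey() { return index; }
    public String getCurrentValue() { return line; }
}

// Wrapper that ignores the delegate's key and emits the input path for every record.
class PathKeyedReader implements SimpleRecordReader<String, String> {
    private final String path;
    private final SimpleRecordReader<?, String> delegate;

    PathKeyedReader(String path, SimpleRecordReader<?, String> delegate) {
        this.path = path;
        this.delegate = delegate;
    }

    public boolean nextKeyValue() throws IOException { return delegate.nextKeyValue(); }
    public String getCurrentKey() { return path; }
    public String getCurrentValue() { return delegate.getCurrentValue(); }
}

class PathKeyedDemo {
    // Drains one "file" and returns "path<TAB>record" strings.
    static List<String> readAll(String path, String contents) {
        List<String> out = new ArrayList<>();
        PathKeyedReader reader = new PathKeyedReader(path, new LineReader(contents));
        try {
            while (reader.nextKeyValue()) {
                out.add(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String pair : readAll("s3://bucket/file1", "record1\nrecord2\n")) {
            System.out.println(pair);
        }
    }
}
```

In a real job the same structure would presumably be a `FileInputFormat<Text, Text>` subclass whose `createRecordReader` returns a `RecordReader<Text, Text>` that delegates to `LineRecordReader` and, in `initialize`, takes the path from `((FileSplit) split).getPath()`. That wiring is an untested assumption here, as in the answer itself.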
