Question

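I need to know the row index within the partition of the input file I am working on. I could force this in the original file by concatenating the row index to the data, but I would rather do it in Hadoop. I have this in my mapper...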

String id = context.getConfiguration().get("mapreduce.task.partition");

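But "id" is 0 in every case. "Hadoop: The Definitive Guide" mentions that properties such as the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". As far as I can tell, it does not actually go into how to access that information.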

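I went through the documentation for the Context object, and it looks like the approach above is the way to do it, and the code compiles. But since every value I get is 0, I am not sure I am actually using the right thing, and I have not been able to find any details online that would help figure this out.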

The code used for testing...

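import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Test {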

public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String id = context.getConfiguration().get("mapreduce.task.partition");
        context.write(new Text("Test"), new Text(id + "_" + value.toString()));
    }
}


public static class TestReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

        for(Text value : values) {
            context.write(key, value);
        }
    }
}


public static void main(String[] args) throws Exception {

    if(args.length != 2) {
        System.err.println("Usage: Test <input path> <output path>");
        System.exit(-1);
    }

    Job job = Job.getInstance(); // the no-arg Job() constructor is deprecated in Hadoop 2.x
    job.setJarByClass(Test.class);
    job.setJobName("Test");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(TestMapper.class);
    job.setReducerClass(TestReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Answer 1

There are two options:

1. Use the offset instead of the line number
2. Track the line number in the mapper

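For the first, the LongWritable key tells you the byte offset of the line you are processing. Unless all your lines are exactly the same length, you cannot compute the line number from the offset, but it does let you determine the ordering, if that is useful.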
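To make the first option concrete, here is a minimal sketch in plain Java (no Hadoop involved) of the offsets that would show up as the LongWritable keys. It assumes UTF-8 text with '\n' line endings; the class name LineOffsets and the method offsets are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: computes the byte offset at which each line starts,
// which is what a line-oriented reader would pass to map() as the key.
class LineOffsets {

    // Assumes UTF-8 encoding and '\n' line endings.
    static List<Long> offsets(String text) {
        List<Long> result = new ArrayList<>();
        long pos = 0;
        for (String line : text.split("\n")) {
            result.add(pos);
            // Advance past the line's bytes plus one byte for the '\n'.
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Lines of different lengths: offsets increase but are not line numbers.
        System.out.println(offsets("1 2 3\n10 20 30\n7 8\n")); // prints [0, 6, 15]
    }
}
```

The offsets grow with the line position, so sorting by them restores the file order, but 0, 6, 15 cannot be turned into 0, 1, 2 without knowing every line's length.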

The second option is to track it in the mapper. You could change your code to:

public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Lines seen so far by this mapper instance (starts at 0 for each map task).
    private long currentLineNum = 0;
    // Reused output key, so a new Text is not allocated for every record.
    private final Text test = new Text("Test");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        context.write(test, new Text(currentLineNum + "_" + value));
        currentLineNum++;
    }
}
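One caveat worth checking: a counter kept in a mapper field counts lines within that map task's input split, not within the whole file, so it only matches the file's line number when the file is processed as a single split. A rough plain-Java simulation of that behaviour follows (no Hadoop involved; the class name SplitCounterDemo and the hard-coded splits are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics one counter per split, the way each map task
// gets its own Mapper instance with its own currentLineNum field.
class SplitCounterDemo {

    static List<String> tagLines(List<List<String>> splits) {
        List<String> out = new ArrayList<>();
        for (List<String> split : splits) {
            long currentLineNum = 0; // fresh counter for every split
            for (String line : split) {
                out.add(currentLineNum + "_" + line);
                currentLineNum++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A three-line "file" that happened to be divided into two splits.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));
        System.out.println(tagLines(splits)); // prints [0_a, 1_b, 0_c]
    }
}
```

The third line is tagged 0 rather than 2, so with multi-split inputs the counter alone does not identify a line uniquely.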

Answer 2

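You could also represent the matrix as rows of tuples and include the row and column in each tuple, so that you get that information when you read in the file. If the file you are using is just space- or comma-separated values that make up a 2D array, it will be very hard to figure out in the mapper which row (line) you are currently working on.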
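As a sketch of what that preprocessing might look like, the snippet below rewrites a small dense matrix as one "row,col,value" line per cell. The class name MatrixTuples, the comma-separated layout, and the zero-based indices are assumptions for illustration, not a required format.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: flattens a 2D array into self-describing tuple lines,
// so every record carries its own row and column.
class MatrixTuples {

    static List<String> toTuples(int[][] matrix) {
        List<String> lines = new ArrayList<>();
        for (int r = 0; r < matrix.length; r++) {
            for (int c = 0; c < matrix[r].length; c++) {
                lines.add(r + "," + c + "," + matrix[r][c]);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        int[][] matrix = { {1, 2}, {3, 4} };
        // Prints one tuple per line: 0,0,1 then 0,1,2 then 1,0,3 then 1,1,4
        for (String line : toTuples(matrix)) {
            System.out.println(line);
        }
    }
}
```

With input in this shape, a mapper can parse the row and column out of each value and no longer depends on where the line sits in the file.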

Getting the partition ID of an input file in Hadoop