Question

Suppose I have a large amount of input in the following format:

1,2,6,4
4,5,18,7
9,1,3,5
......

The output should be its transpose:

1 4 9 ..
2 5 1 ..
6 18 3 ..
4 7 5 ..

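In this case the row numbers are not specified. The column numbers can be obtained while parsing each line. Assume the file is very large and will be split across multiple mappers. Because no row number is given, there is no way to tell in what order the output of each mapper should be placed. So, is it possible to preprocess the input file with another MapReduce program that adds row numbers before the file is sent to the mappers?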

Answer 1

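When you use TextInputFormat, you get the position in the input file (the byte offset at which the line starts) as the LongWritable key. Although it is not the actual row number, you can use it to order the values within each column when writing the output. So the whole MapReduce job looks like this: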

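// Imports for the enclosing source file.
import java.io.IOException;
import java.util.TreeMap;

// Older Commons Lang releases use org.apache.commons.lang.StringUtils instead.
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class TransposeMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        long column = 0;
        long somethingLikeRow = key.get(); // byte offset of this line in the input file
        for (String num : value.toString().split(",")) {
            // Key by column index; carry the offset along so the reducer can restore row order.
            context.write(new LongWritable(column), new Text(somethingLikeRow + "\t" + num));
            ++column;
        }
    }
}

public static class TransposeReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Stores the values of one column sorted by their positions in the input file.
        TreeMap<Long, String> row = new TreeMap<Long, String>();
        for (Text text : values) {
            String[] parts = text.toString().split("\t"); // somethingLikeRow, value
            row.put(Long.valueOf(parts[0]), parts[1]);
        }
        // Apache Commons Lang is used here for the concatenation.
        String rowString = StringUtils.join(row.values(), ' ');
        context.write(new Text(rowString), NullWritable.get());
    }
}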
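
To see why the byte offset is enough to restore the row order, the same idea can be tried locally without Hadoop. The following is only a minimal sketch under stated assumptions: the class name TransposeSketch and its transpose method are hypothetical, the offsets are computed by hand to imitate what TextInputFormat supplies as keys, and the records are deliberately reversed to stand in for mappers finishing in an arbitrary order.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of the map/shuffle/reduce steps; not Hadoop code.
public class TransposeSketch {

    static List<String> transpose(String input) {
        // Step 1: pair each line with its starting byte offset,
        // imitating the LongWritable key that TextInputFormat provides.
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : input.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<Long, String>(offset, line));
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }

        // Step 2: deliver the records out of order, as separate mappers might.
        Collections.reverse(records);

        // Step 3: "map" and "shuffle": group values by column index,
        // keeping each value keyed by the offset of the line it came from.
        Map<Long, TreeMap<Long, String>> columns = new TreeMap<>();
        for (Map.Entry<Long, String> record : records) {
            if (record.getValue().isEmpty()) {
                continue; // ignore blank lines
            }
            long column = 0;
            for (String num : record.getValue().split(",")) {
                columns.computeIfAbsent(column, c -> new TreeMap<>())
                       .put(record.getKey(), num);
                ++column;
            }
        }

        // Step 4: "reduce": each TreeMap iterates in offset order,
        // so every output row lists its values in original row order.
        List<String> rows = new ArrayList<>();
        for (TreeMap<Long, String> column : columns.values()) {
            rows.add(String.join(" ", column.values()));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : transpose("1,2,6,4\n4,5,18,7\n9,1,3,5\n")) {
            System.out.println(row);
        }
    }
}
```

Because offsets grow monotonically with the line position, sorting by offset gives the same order as sorting by a true row number, which is why no preprocessing pass is needed in this sketch.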

Matrix transpose with MapReduce when row numbers are not specified
