Question

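I am reading about MapReduce and I am puzzled by one particular scenario. Suppose we have several files (for example fileA, fileB, fileC), each consisting of multiple integers. If we want to sort the numbers from all the files to produce something like this: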

23 fileA
34 fileB
35 fileA
60 fileA
60 fileC

how would the map and reduce phases work?

This is what I have so far, but it is not quite right:

(fileName, fileContent) -> (map to) (Number, fileName)
Sort the intermediate key/value pairs and get (Number, (list of){fileName1, fileName2...})

Reduce the intermediate pairs and get

(Number, fileName1)
(Number, fileName2)

and so on.

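The problem is that during the sort phase the file names may not come out in alphabetical order, so the reduce step will not generate the correct output. Can someone offer some insight into the right approach for this scenario?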

Answer 1

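The best way to achieve this is with a secondary sort. You need to sort on both the key (the number, in your case) and the value (the file name, in your case). In Hadoop, the mapper output is sorted by key only.

This can be done with a composite key: a key that combines the number and the file name. For example, for the first record the key would be (23, fileA) rather than just (23).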
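To make the composite-key ordering concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The names `CompositeKeySketch`, `NumberFileKey` and `sortRecords` are hypothetical and used only for illustration. In a real job the key class would presumably implement Hadoop's `WritableComparable` and the framework would perform the sort. Here an in-memory list sort stands in for that step.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CompositeKeySketch {

    // Hypothetical composite key: the number plus the file it came from.
    // A plain Comparable is used here so the ordering can be checked locally.
    static final class NumberFileKey implements Comparable<NumberFileKey> {
        final int number;
        final String fileName;

        NumberFileKey(int number, String fileName) {
            this.number = number;
            this.fileName = fileName;
        }

        // Order by number first, then break ties by file name.
        @Override
        public int compareTo(NumberFileKey other) {
            int byNumber = Integer.compare(number, other.number);
            return byNumber != 0 ? byNumber : fileName.compareTo(other.fileName);
        }

        @Override
        public String toString() {
            return number + " " + fileName;
        }
    }

    // Parses "number fileName" lines into composite keys and sorts them,
    // standing in for the sort the framework applies to map output keys.
    static List<String> sortRecords(List<String> lines) {
        List<NumberFileKey> keys = new ArrayList<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            keys.add(new NumberFileKey(Integer.parseInt(tokens[0]), tokens[1]));
        }
        keys.sort(Comparator.naturalOrder());

        List<String> sorted = new ArrayList<>();
        for (NumberFileKey key : keys) {
            sorted.add(key.toString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> input =
                List.of("34 fileB", "35 fileA", "60 fileC", "60 fileA", "23 fileA");
        sortRecords(input).forEach(System.out::println);
    }
}
```

Because the file name is part of the key, the two records with the number 60 are ordered by file name (fileA before fileC) instead of by arrival order.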

You can read about secondary sort here: https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch01.html

You can also look at the "Secondary Sort" section of the book "Hadoop: The Definitive Guide".

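For the sake of simplicity, I have written a program that achieves the same thing.

In this program, the mapper output is sorted by key by default. I wrote logic to sort the values on the reducer side. So it takes care of sorting both the keys and the values, and produces the desired output.

Here is the program: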

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.*;

public class SortedValue {


    // Parses lines of the form "<number> <fileName>" and emits (fileName, number).
    public static class SortedValueMapper
            extends Mapper<LongWritable, Text , Text, IntWritable>{

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            String[] tokens = value.toString().split(" ");

            // Skip lines that do not contain exactly two tokens.
            if(tokens.length == 2) {
                context.write(new Text(tokens[1]), new IntWritable(Integer.parseInt(tokens[0])));
            }
        }
    }

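    // Receives (fileName, numbers) and emits one (number, fileName) pair per value,
    // in ascending numeric order.
    public static class SortedValueReducer
            extends Reducer<Text, IntWritable, IntWritable, Text> {

        // Numbers collected so far, keyed by file name.
        Map<String, ArrayList<Integer>> valueMap = new HashMap<String, ArrayList<Integer>>();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {

            String keyStr = key.toString();
            ArrayList<Integer> storedValues = valueMap.get(keyStr);

            // Create the list the first time this file name is seen.
            if (storedValues == null) {
                storedValues = new ArrayList<Integer>();
                valueMap.put(keyStr, storedValues);
            }

            // Copy the values out of the iterable so they can be sorted.
            for (IntWritable value : values) {
                storedValues.add(value.get());
            }

            // Sort this file's numbers and emit them as (number, fileName).
            Collections.sort(storedValues);
            for (Integer val : storedValues) {
                context.write(new IntWritable(val), key);
            }
        }
    }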

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "CompositeKeyExample");
        job.setJarByClass(SortedValue.class);
        job.setMapperClass(SortedValueMapper.class);
        job.setReducerClass(SortedValueReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/in/in1.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out/"));

        System.exit(job.waitForCompletion(true) ? 0:1);

    }
}

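Mapper logic:

It parses each line, assuming the key and value are separated by a space character (" ").
If the line contains 2 tokens, it emits (fileName, integer value). For example, for the first record it emits (fileA, 23).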

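Reducer logic:

It puts the (key, value) pairs into a HashMap, where the key is the file name and the value is the list of integers for that file. For example, for fileA the stored values will be 23, 35 and 60.
Finally, it sorts the values for the given key and, for each value, emits (value, key) from the reducer. For example, for fileA the output records are: (23, fileA), (35, fileA) and (60, fileA).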

I ran this program on the following input:

34 fileB
35 fileA
60 fileC
60 fileA
23 fileA

I got the following output:

23      fileA
35      fileA
60      fileA
34      fileB
60      fileC
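
The mapper and reducer logic described above can be traced without a cluster. The following is a rough, Hadoop-free sketch of the same data flow. The class name `SortedValueTrace` and the method `run` are hypothetical, a `TreeMap` stands in for the framework's sort of the map output keys, and a single reducer is assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedValueTrace {

    // Mimics the job on in-memory lines: map each line to (fileName, number),
    // group by file name in sorted key order, sort each group, and emit
    // "number<TAB>fileName" records.
    static List<String> run(List<String> lines) {
        // TreeMap keeps the file names sorted, as the shuffle does for keys.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] tokens = line.split(" ");
            if (tokens.length == 2) {
                grouped.computeIfAbsent(tokens[1], k -> new ArrayList<>())
                       .add(Integer.parseInt(tokens[0]));
            }
        }

        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            List<Integer> values = entry.getValue();
            Collections.sort(values); // reducer-side sort of the values
            for (int value : values) {
                output.add(value + "\t" + entry.getKey());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input =
                List.of("34 fileB", "35 fileA", "60 fileC", "60 fileA", "23 fileA");
        run(input).forEach(System.out::println);
    }
}
```

On the five input lines above, this trace yields the same records in the same order as the job output: grouped by file name first, then sorted by number within each file.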
