Hadoop MapReduce Hashing Program

Date: 2014-09-23 09:37:33

Tags: java hadoop mapreduce

I have written a MapReduce program in Hadoop that hashes every record of a file, appends the hashed value to each record as an additional attribute, and then writes the output to the Hadoop file system. Here is the code I wrote:

public class HashByMapReduce
{
    public static class LineMapper extends Mapper<Text, Text, Text, Text>
    {
        private Text word = new Text();

        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException
        {
            key.set("single");
            String line = value.toString();
            word.set(line);
            context.write(key, word);
        }
    }

    public static class LineReducer extends Reducer<Text, Text, Text, Text>
    {
        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException
        {
            String translations = "";
            for (Text val : values)
            {
                translations = val.toString() + "," + String.valueOf(hash64(val.toString())); // Point of Error

                result.set(translations);
                context.write(key, result);
            }
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Hashing");
        job.setJarByClass(HashByMapReduce.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The logic of this code is that each line is read by the map method, which assigns every value to a single key; the values are then passed to the same reduce method, where each value is passed to the hash64() function.

However, I see that an empty (null) value is being passed to the hash function, and I cannot figure out why. Thanks in advance.

1 Answer:

Answer 0 (score: 2)

The cause of the problem is most likely your use of KeyValueTextInputFormat. From the Yahoo tutorial:

  InputFormat:          Description:       Key:                     Value:

  TextInputFormat       Default format;    The byte offset          The line contents 
                        reads lines of     of the line                            
                        text files

  KeyValueInputFormat   Parses lines       Everything up to the     The remainder of                      
                        into key,          first tab character      the line
                        val pairs 

It splits each input line on the tab character. I guess your lines contain no tab, so the key received by LineMapper is the entire line and nothing is passed as the value (I am not sure whether it is null or empty).
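
To illustrate the split, here is a minimal standalone sketch (not Hadoop's actual record reader, and the sample lines are made up) of how a key/value pair is derived from a line:

    // Minimal sketch (assumption: simplified model of KeyValueTextInputFormat's
    // default behaviour, which splits each line at the first tab character).
    public class KeyValueSplitDemo {

        // Returns {key, value} the way the input format would derive them.
        static String[] split(String line) {
            int tab = line.indexOf('\t');
            if (tab == -1) {
                // No tab in the line: the whole line becomes the key,
                // and the value is left empty.
                return new String[] { line, "" };
            }
            return new String[] { line.substring(0, tab), line.substring(tab + 1) };
        }

        public static void main(String[] args) {
            String withTab = "id1\tsome record";  // hypothetical line containing a tab
            String withoutTab = "some record";    // hypothetical line without a tab

            System.out.println(java.util.Arrays.toString(split(withTab)));    // [id1, some record]
            System.out.println(java.util.Arrays.toString(split(withoutTab))); // [some record, ]
        }
    }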

Looking at your code, I think it is better to use the TextInputFormat class as the input format: it produces the line offset as the key and the whole line as the value. That should solve your problem.

Edit: I ran your code with the following changes and it seems to work fine:

  1. Changed the input format to TextInputFormat and changed the Mapper declaration accordingly.
  2. Added the appropriate setMapOutputKeyClass & setMapOutputValueClass calls to the job. They are not mandatory, but omitting them often causes problems at run time.
  3. Removed your key.set("single") and added a private key field to the Mapper.
  4. Since you did not provide the details of the hash64 method, I used String.toUpperCase for testing.
  5. If the problem still persists, then I am sure your hash method is not handling null well (see the sketch after the full code below).

    Full code:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HashByMapReduce {

        public static class LineMapper extends
                Mapper<LongWritable, Text, Text, Text> {
            private Text word = new Text();
            private Text outKey = new Text("single");

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                word.set(line);
                context.write(outKey, word);
            }
        }

        public static class LineReducer extends Reducer<Text, Text, Text, Text> {
            private Text result = new Text();

            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                String translations = "";
                for (Text val : values) {
                    translations = val.toString() + ","
                            + val.toString().toUpperCase(); // Point of Error

                    result.set(translations);
                    context.write(key, result);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "Hashing");
            job.setJarByClass(HashByMapReduce.class);
            job.setMapperClass(LineMapper.class);
            job.setReducerClass(LineReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
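
Regarding point 5 above: hash64 was not shown, so the following is only a hypothetical sketch of a null-safe version of such a helper; FNV-1a 64-bit is used here purely as a stand-in for whatever hash the original code uses:

    // Hypothetical null-safe hashing helper; the FNV-1a hash below is only a
    // placeholder for the unshown hash64 implementation.
    public class SafeHash {

        public static long hash64(String input) {
            if (input == null || input.isEmpty()) {
                // Decide explicitly what empty/null records hash to instead of
                // letting the hash code fail with a NullPointerException.
                return 0L;
            }
            long hash = 0xcbf29ce484222325L;      // FNV-1a 64-bit offset basis
            for (int i = 0; i < input.length(); i++) {
                hash ^= input.charAt(i);
                hash *= 0x100000001b3L;           // FNV-1a 64-bit prime
            }
            return hash;
        }

        public static void main(String[] args) {
            System.out.println(hash64("some record")); // normal case
            System.out.println(hash64(null));          // guarded case, prints 0
        }
    }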