Hadoop MapReduce program for removing duplicate records

Date: 2015-11-13 20:51:23

Tags: hadoop

Can someone help me write a mapper and reducer that merge these two files and then remove the duplicate records?

Here are the two text files:

file1.txt
2012-3-1a
2012-3-2b
2012-3-3c
2012-3-4d
2012-3-5a
2012-3-6b
2012-3-7c
2012-3-3c

and file2.txt:

2012-3-1b
2012-3-2a
2012-3-3b
2012-3-4d
2012-3-5a
2012-3-6c
2012-3-7d
2012-3-3c

3 Answers:

Answer 0 (score: 2)

A simple word count program will do this job for you. The only change you need to make is to set the reducer's output value to NullWritable.get().
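
As an illustration of that suggestion, here is a minimal sketch (the class names DedupWordCountStyle, LineMapper and DedupReducer are mine, not part of the answer): the mapper emits each whole line as the key, and the reducer writes every distinct key exactly once with NullWritable.get() as its value.

    import java.io.IOException;
    
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    
    // Word-count-style deduplication: the whole line is the map output key,
    // so the shuffle groups identical lines into a single reducer call.
    public class DedupWordCountStyle {
    
      public static class LineMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Emit the line itself as the key; the value carries no data.
          context.write(value, NullWritable.get());
        }
      }
    
      public static class DedupReducer
          extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
          // All duplicates of a line share this key, so writing it once removes them.
          context.write(key, NullWritable.get());
        }
      }
    }

Wire these into a driver exactly as for word count, with Text as the output key class and NullWritable as the output value class.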

Answer 1 (score: 0)

Is there a common key in the two files that identifies when two records match? If so: the mapper input is the standard TextInputFormat, the mapper output key is the common key, and the mapper output value is the whole record. In the reducer there is no need to iterate over the values; just write a single value per key. (A sketch of this variant follows at the end of this answer.)

If a match or duplicate can only be established when the full record matches: the mapper input is again the standard TextInputFormat, the mapper output key is the whole record, and the mapper output value is NullWritable. In the reducer there is no need to iterate over the values; just take the single key instance and write it. The reducer output key is the reducer input key, and the reducer output value is NullWritable.
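
A minimal sketch of the first variant, assuming the first whitespace-separated field of each record is the common key (the question does not say what the key actually is, so the extraction logic below is purely illustrative):

    import java.io.IOException;
    
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    
    // "Common key" variant: group records by a shared key, then keep a
    // single record per key in the reducer.
    public class CommonKeyDedup {
    
      public static class KeyedMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
    
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String record = value.toString();
          // Hypothetical key extraction: the first whitespace-separated field
          // is treated as the common key; adapt this to the real data.
          String commonKey = record.split("\\s+", 2)[0];
          outKey.set(commonKey);
          context.write(outKey, value);
        }
      }
    
      public static class FirstValueReducer
          extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          // No need to iterate over all values: write the first record for
          // this key and drop the rest as duplicates.
          context.write(values.iterator().next(), NullWritable.get());
        }
      }
    }

The second variant is essentially the word-count-style sketch shown under Answer 0.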

Answer 2 (score: 0)

Below is code that removes duplicate lines from large text data; it hashes each line for efficiency:

DRMapper.java

    import com.google.common.hash.Hashing;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    
    class DRMapper extends Mapper<LongWritable, Text, Text, Text> {
    
      private Text hashKey = new Text();
      private Text mappedValue = new Text();
    
      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        // Key by a MurmurHash3 digest of the line so that identical lines are
        // grouped into the same reducer call; the full line rides along as the value.
        hashKey.set(Hashing.murmur3_32().hashString(line, StandardCharsets.UTF_8).toString());
        mappedValue.set(line);
        context.write(hashKey, mappedValue);
      }
    
    }

DRReducer.java

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    import java.io.IOException;
    
    public class DRReducer extends Reducer<Text, Text, Text, NullWritable> {
      @Override
      public void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        // All lines with the same hash key arrive in a single group, so writing
        // only the first non-empty line discards the duplicates.
        if (values.iterator().hasNext()) {
          Text value = values.iterator().next();
          if (!value.toString().isEmpty()) {
            context.write(value, NullWritable.get());
          }
        }
      }
    }

DuplicateRemover.java

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    
    public class DuplicateRemover {
      private static final int DEFAULT_NUM_REDUCERS = 210;
    
      public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.println("Usage: DuplicateRemover <input path> <output path>");
          System.exit(-1);
        }
    
   
        Job job = Job.getInstance();
        job.setJarByClass(DuplicateRemover.class);
        job.setJobName("Duplicate Remover");
    
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
        job.setMapperClass(DRMapper.class);
        job.setReducerClass(DRReducer.class);
    
        // Map output: (hash of line, line); final output: (line, nothing).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
    
        job.setNumReduceTasks(DEFAULT_NUM_REDUCERS);
    
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

To compile:

javac -encoding UTF8 -cp $(hadoop classpath) *.java
jar cf dr.jar *.class

Assuming the input text files are in in_folder, run it as follows:

hadoop jar dr.jar in_folder out_folder
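
Once the job completes, the deduplicated output can be inspected with something like the following (assuming the default TextOutputFormat naming of part-r-* files):

hadoop fs -cat out_folder/part-r-*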