Can someone help me write a mapper and reducer to merge these two files and then remove the duplicated records?
Here are the two text files:
file1.txt
2012-3-1a
2012-3-2b
2012-3-3c
2012-3-4d
2012-3-5a
2012-3-6b
2012-3-7c
2012-3-3c
and file2.txt:
2012-3-1b
2012-3-2a
2012-3-3b
2012-3-4d
2012-3-5a
2012-3-6c
2012-3-7d
2012-3-3c
Answer 0 (score: 2)
A simple word-count program will do the job for you. The only change you need to make is to set the reducer's output value to NullWritable.get(), so each distinct line is written exactly once; a minimal sketch of that change follows.
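For illustration, a minimal sketch of that idea, assuming the usual word-count skeleton (the class names are placeholders, and each class would normally live in its own file):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit each whole line as the key; no per-line payload is needed.
class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}

// Reducer: identical lines arrive grouped under one key, so writing the key
// once per group is exactly the deduplication step.
class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    public void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}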
Answer 1 (score: 0)
Do the two files share a common key that can be used to decide whether two records match? If so: mapper input: the standard TextInputFormat; mapper output key: the common key; mapper output value: the entire record. In the reducer there is no need to iterate over the values; just write a single value per key (sketched after the next paragraph).
If a record only counts as a match or a duplicate when the complete record is identical, then: mapper input: the standard TextInputFormat; mapper output key: the entire record; mapper output value: NullWritable. In the reducer there is again no need to iterate over the values; take the single key instance and write it out. Reducer output key: the reducer input key; reducer output value: NullWritable.
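For illustration, a minimal sketch of the first (common-key) variant, assuming the shared key is the leading, space-delimited field of each record; the class names and the split logic are placeholders. The second variant has the same shape, with the whole record as the key and NullWritable as the value:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: key each record by its shared field so that matching records meet
// in the same reduce call; the whole record travels as the value.
class KeyedDedupMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        // Hypothetical key extraction: the text before the first space.
        int sep = record.indexOf(' ');
        outKey.set(sep >= 0 ? record.substring(0, sep) : record);
        context.write(outKey, value);
    }
}

// Reducer: keep one record per key by writing only the first value.
class KeyedDedupReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        context.write(values.iterator().next(), NullWritable.get());
    }
}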
Answer 2 (score: 0)
Below is code that removes duplicate lines from large text data, using a hash of each line as the shuffle key for efficiency:
DRMapper.java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import com.google.common.hash.Hashing;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class DRMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text hashKey = new Text();
    private final Text mappedValue = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Key each line by its Murmur3 hash so that identical lines land in
        // the same reducer; the full line travels as the value.
        hashKey.set(Hashing.murmur3_32().hashString(line, StandardCharsets.UTF_8).toString());
        mappedValue.set(line);
        context.write(hashKey, mappedValue);
    }
}
DRReducer.java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DRReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All lines with the same hash arrive grouped under one key, so
        // writing only the first non-empty value drops the duplicates.
        // (A 32-bit hash collision would also collapse two distinct lines.)
        Iterator<Text> iterator = values.iterator();
        if (iterator.hasNext()) {
            Text value = iterator.next();
            if (!value.toString().isEmpty()) {
                context.write(value, NullWritable.get());
            }
        }
    }
}
DuplicateRemover.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DuplicateRemover {

    private static final int DEFAULT_NUM_REDUCERS = 210;

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: DuplicateRemover <input path> <output path>");
            System.exit(-1);
        }

        Job job = Job.getInstance(); // preferred over the deprecated new Job()
        job.setJarByClass(DuplicateRemover.class);
        job.setJobName("Duplicate Remover");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DRMapper.class);
        job.setReducerClass(DRReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(DEFAULT_NUM_REDUCERS);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile and package:
javac -encoding UTF8 -cp $(hadoop classpath) *.java
jar cf dr.jar *.class
Assuming the input text files are in in_folder, run it as follows:
hadoop jar dr.jar in_folder out_folder
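The deduplicated lines end up in the part files under out_folder (one per reducer). Assuming a standard HDFS setup, they can be inspected with, for example:

hdfs dfs -cat out_folder/part-r-*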