Before jumping straight to the question, I would like to talk a bit about sequence files in Hadoop.
A SequenceFile is a flat file consisting of binary key/value pairs (although you can always use the Text data type for the key). In fact, Hadoop internally uses SequenceFiles to store the temporary outputs of map tasks.
Another purpose of SequenceFiles is "packing": since Hadoop's design favors large files (the default HDFS block size is 64 MB), many small files can be packed into one large SequenceFile that is then used as the input for MapReduce computation.
This is not a precise definition, just a brief description of sequence files in Hadoop.
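To make the "packing" idea concrete, here is a minimal sketch, assuming the Hadoop 2.x SequenceFile.Writer option API; the output path and the two sample records are made up purely for illustration. Each small file becomes one record, with its name as the key and its raw bytes as the value.
package com.sqence;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFilePackingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/packed.seq"); // hypothetical output path
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            // In a real job these bytes would be the contents of the small files.
            writer.append(new Text("image-001.jpg"), new BytesWritable(new byte[]{1, 2, 3}));
            writer.append(new Text("image-002.jpg"), new BytesWritable(new byte[]{4, 5, 6}));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}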
Now, let's look at an example.
Let's assume we have lots of small images and we need to remove the duplicate ones to save secondary storage space. We can't go searching for similar file names and separating them out by hand: first, it's painful, and second, it takes a lot of time. So who will do this for us?
MapReduce!!
What we can do is compute the MD5 / SHA-1 / SHA-256 hash of each image file, emit it as the key from the mapper, and end up with an output that is the list of images with the duplicates removed. Let's look at the code!!
BinaryFilesToHadoopSequenceFile (with its inner BinaryFilesToHadoopSequenceFileMapper): converts the binary input files into a Hadoop SequenceFile.
package com.sqence;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class BinaryFilesToHadoopSequenceFile {

    public static class BinaryFilesToHadoopSequenceFileMapper
            extends Mapper<Object, Text, Text, BytesWritable> {

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is the HDFS URI of one binary (image) file.
            String uri = value.toString();
            Configuration conf = context.getConfiguration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(new Path(uri));
                ByteArrayOutputStream bout = new ByteArrayOutputStream();
                byte[] buffer = new byte[1024 * 1024];
                int bytesRead;
                // Write only the bytes actually read, and stop at end of file.
                while ((bytesRead = in.read(buffer, 0, buffer.length)) > 0) {
                    bout.write(buffer, 0, bytesRead);
                }
                // Emit: key = file URI, value = raw file contents.
                context.write(value, new BytesWritable(bout.toByteArray()));
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: BinaryFilesToHadoopSequenceFile <in path for URI list file> <out path for sequence file>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Binary files to SequenceFile");
        job.setJarByClass(BinaryFilesToHadoopSequenceFile.class);
        job.setMapperClass(BinaryFilesToHadoopSequenceFileMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, out);
        // Remove any previous output so the job can be rerun.
        out.getFileSystem(conf).delete(out, true);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
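A quick note on running this first job: because the mapper opens each file from the URI it finds on an input line, the job's input is simply a text file that lists the HDFS URI of one image per line (that is what TextInputFormat feeds to the mapper). A hypothetical invocation, with made-up paths and jar name, would look like: hadoop jar imagedup.jar com.sqence.BinaryFilesToHadoopSequenceFile /user/me/image-uris.txt /user/me/images-seq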
After converting the binary files into a Hadoop SequenceFile, we will use MapReduce to search for the duplicate images.
ImageDuplicatesMapper: here the MD5 hash of each image file is emitted from the mapper as the key, and the URI of the image file is emitted as the value.
package com.sqence;

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageDuplicatesMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Get the MD5 for this specific file. copyBytes() is used because
        // getBytes() returns the backing array, which may be padded beyond
        // the valid length of the record.
        String md5Str;
        try {
            md5Str = calculateMd5(value.copyBytes());
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
            context.setStatus("Internal error - can't find the algorithm for calculating the md5");
            return;
        }
        Text md5Text = new Text(md5Str);
        // Emit the MD5 as the key and the file URI as the value, so that
        // duplicates are grouped together for the reduce function.
        context.write(md5Text, key);
    }

    // MD5 calculator: returns the digest as a lower-case hex string.
    static String calculateMd5(byte[] imageData) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(imageData);
        byte[] hash = md.digest();
        // Convert the byte array to hex.
        StringBuilder hexString = new StringBuilder();
        for (int i = 0; i < hash.length; i++) {
            hexString.append(Integer.toString((hash[i] & 0xff) + 0x100, 16).substring(1));
        }
        return hexString.toString();
    }
}
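One detail worth calling out in the mapper above: the digest is computed over value.copyBytes() rather than value.getBytes(), because BytesWritable.getBytes() returns the internal backing array, which can be longer than the record's actual length; hashing the padded array could make identical images produce different hashes.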
From the mapper phase we get the MD5 hash of an image as the key and the URI of the image file as the value. These go on to the reduce phase. Here is the code:
package com.sqence;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ImageDupsReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The key here is the MD5 hash, while the values are all the image files
        // associated with it. For each MD5 value we keep only one file (the first).
        Text imageFilePath = null;
        for (Text filePath : values) {
            imageFilePath = filePath;
            break; // only the first one
        }
        // In the result file the key is again the image file path.
        context.write(imageFilePath, key);
    }
}
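With the default TextOutputFormat, each line of the reducer's output is therefore one kept image URI followed by its MD5 hash, separated by a tab, so the result file is effectively the de-duplicated list of images.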
Last but not least, the **ImageDriver** class:
package com.sqence;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ImageDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Get the configuration object (populated by ToolRunner) and set the job name.
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "Image duplicate finder");

        // Set the class names.
        job.setJarByClass(ImageDriver.class);
        job.setMapperClass(ImageDuplicatesMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setReducerClass(ImageDupsReducer.class);

        // Set the output key/value data types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Accept the HDFS input and output directories at run time.
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, out);
        // Remove any previous output so the job can be rerun.
        out.getFileSystem(conf).delete(out, true);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new ImageDriver(), args);
        System.exit(res);
    }
}
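To put the two pieces together, you would first run BinaryFilesToHadoopSequenceFile to produce the SequenceFile, and then run ImageDriver with that SequenceFile's directory as its input path and a fresh output directory, for example (paths and jar name are hypothetical): hadoop jar imagedup.jar com.sqence.ImageDriver /user/me/images-seq /user/me/unique-images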
Hopefully you are still with me. I had just learned the SequenceFile format and coded up this problem statement as a practice exercise. The program works as expected and gives accurate output.
Thanks for the question! Pardon me if the way I have approached it is not quite what you were looking for.
Thanks in advance for your help. :)