I have a lot of zip archives (GB-sized) and I want to decompress them using a map-only job. My mapper class looks like
import java.util.zip.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.OutputCollector;
import java.io.*;

public class DecompressMapper extends Mapper<LongWritable, Text, LongWritable, Text>
{
    private static final int BUFFER_SIZE = 4096;

    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Context context) throws IOException
    {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        this.unzip(fileName, new File(fileName).getParent() + File.separator + "/test_poc");
    }

    public void unzip(String zipFilePath, String destDirectory) throws IOException {
        File destDir = new File(destDirectory);
        if (!destDir.exists()) {
            destDir.mkdir();
        }
        ZipInputStream zipIn = new ZipInputStream(new FileInputStream(zipFilePath));
        ZipEntry entry = zipIn.getNextEntry();
        // iterates over entries in the zip file
        while (entry != null) {
            String filePath = destDirectory + File.separator + entry.getName();
            if (!entry.isDirectory()) {
                // if the entry is a file, extracts it
                extractFile(zipIn, filePath);
            } else {
                // if the entry is a directory, make the directory
                File dir = new File(filePath);
                dir.mkdir();
            }
            zipIn.closeEntry();
            entry = zipIn.getNextEntry();
        }
        zipIn.close();
    }

    private void extractFile(ZipInputStream zipIn, String filePath) throws IOException {
        BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filePath));
        byte[] bytesIn = new byte[BUFFER_SIZE];
        int read = 0;
        while ((read = zipIn.read(bytesIn)) != -1) {
            bos.write(bytesIn, 0, read);
        }
        bos.close();
    }
}
and my driver class
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DecompressJob extends Configured implements Tool {

    public static void main(String[] args) throws Exception
    {
        int res = ToolRunner.run(new Configuration(), new DecompressJob(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception
    {
        Job conf = Job.getInstance(getConf());
        conf.setJobName("MapperOnly");
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(DecompressMapper.class);
        conf.setNumReduceTasks(0);
        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);
        return conf.waitForCompletion(true) ? 0 : 1;
    }
}
It seems my mapper class is not working properly. I do not get the unzipped files in the expected directory. Any help is appreciated. Thanks...
Answer 0 (score: 2)
There are a few issues with the above code.

We need to be careful when writing MapReduce programs, because Hadoop uses a completely different file system: the mapper here writes with java.io.File to the task node's local disk rather than to HDFS. We also need to keep that in mind when writing code, and never mix the MR1 (org.apache.hadoop.mapred) and MR2 (org.apache.hadoop.mapreduce) APIs, which the mapper above does by importing mapred.FileSplit and mapred.OutputCollector while extending the mapreduce Mapper.
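As a rough sketch only (not a tested fix, and with everything outside the question's class name being an illustrative assumption), an MR2-only version of that mapper would take just a Context in map() and read the archive through the HDFS FileSystem API instead of java.io.File:

import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;   // mapreduce, not mapred

public class DecompressMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    // MR2 signature: only (key, value, context). The extra OutputCollector parameter in
    // the question's version means its map() never overrides this method and is never called.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        Path zipPath = split.getPath();

        // The input file lives in HDFS, so open it through the Hadoop FileSystem API,
        // not through java.io.FileInputStream on the task node's local disk.
        FileSystem fs = zipPath.getFileSystem(context.getConfiguration());
        try (ZipInputStream zin = new ZipInputStream(fs.open(zipPath))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // ... extract the entry bytes to a destination of your choice ...
                zin.closeEntry();
            }
        }
    }
}

With @Override in place, the signature mistake in the original code would have been caught at compile time, because a map() that also takes an OutputCollector does not override Mapper.map().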
Answer 1 (score: 2)
Well, there is no built-in way to unzip files in the Hadoop file system, but after a long time researching I figured out how to unzip them directly inside HDFS. The precondition is that you copy the zip file to some HDFS location and then run a MapReduce job. Since Hadoop does not understand a zip-file input format, we need to customise the Mapper and Reducer so that we control what the mapper emits and what the reducer consumes.

Note that this MapReduce job will run with a single Mapper, because when customising the RecordReader class provided by Hadoop we disable splitting, i.e. isSplitable returns false. So the MapReduce job gets the file name as the key and the contents of the uncompressed file as the value. When the reducer consumes it, I make the output key null so that only the unzipped content remains, and the number of reducers is set to one so that everything is dumped into a single part file.
We all know that Hadoop cannot handle zip files by itself, but Java can with its java.util.zip classes: ZipInputStream reads the archive and ZipEntry represents each compressed file inside it. So we write a custom ZipFileInputFormat class that extends FileInputFormat.
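Before the Hadoop wrapper, here is a small self-contained sketch of that plain java.util.zip reading pattern (the file name sample.zip is just a placeholder, not a file from the job above):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipReadDemo {
    public static void main(String[] args) throws IOException {
        // Iterate over the entries of a local zip archive and count their uncompressed bytes
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream("sample.zip"))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] buf = new byte[8192];
                long size = 0;
                int n;
                while ((n = zin.read(buf)) > 0) {
                    size += n;
                }
                System.out.println(entry.getName() + " -> " + size + " uncompressed bytes");
                zin.closeEntry();
            }
        }
    }
}

The ZipFileInputFormat below wraps exactly this mechanism behind Hadoop's InputFormat contract: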
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ZipFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    /** See the comments on the setLenient() method */
    private static boolean isLenient = false;

    /**
     * ZIP files are not splittable, so always return false and let a single
     * mapper process the whole archive.
     */
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    /**
     * Create the ZipFileRecordReader to parse the file
     */
    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException,
            InterruptedException {
        return new ZipFileRecordReader();
    }

    /**
     * Whether to silently ignore corrupt entries instead of failing the job.
     *
     * @param lenient
     */
    public static void setLenient(boolean lenient) {
        isLenient = lenient;
    }

    public static boolean getLenient() {
        return isLenient;
    }
}
Note that createRecordReader returns a ZipFileRecordReader, the customised version of Hadoop's RecordReader class we were just discussing. Now let's walk through the RecordReader class.
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ZipFileRecordReader extends RecordReader<Text, BytesWritable> {

    /** InputStream used to read the ZIP file from the FileSystem */
    private FSDataInputStream fsin;

    /** ZIP file parser/decompressor */
    private ZipInputStream zip;

    /** Uncompressed file name */
    private Text currentKey;

    /** Uncompressed file contents */
    private BytesWritable currentValue;

    /** Used to indicate progress */
    private boolean isFinished = false;

    /**
     * Initialise and open the ZIP file from the FileSystem
     */
    @Override
    public void initialize(InputSplit inputSplit,
            TaskAttemptContext taskAttemptContext) throws IOException,
            InterruptedException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = taskAttemptContext.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        // Open the stream
        fsin = fs.open(path);
        zip = new ZipInputStream(fsin);
    }

    /**
     * Each ZipEntry is decompressed and readied for the Mapper. The contents of
     * each file is held *in memory* in a BytesWritable object.
     *
     * If the ZipFileInputFormat has been set to Lenient (not the default),
     * certain exceptions will be gracefully ignored to prevent a larger job
     * from failing.
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        ZipEntry entry = null;
        try {
            entry = zip.getNextEntry();
        } catch (ZipException e) {
            if (ZipFileInputFormat.getLenient() == false)
                throw e;
        }
        // Sanity check
        if (entry == null) {
            isFinished = true;
            return false;
        }
        // Filename
        currentKey = new Text(entry.getName());
        if (currentKey.toString().endsWith(".zip")) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] temp1 = new byte[8192];
            while (true) {
                int bytesread1 = 0;
                try {
                    bytesread1 = zip.read(temp1, 0, 8192);
                } catch (EOFException e) {
                    if (ZipFileInputFormat.getLenient() == false)
                        throw e;
                    return false;
                }
                if (bytesread1 > 0)
                    bos.write(temp1, 0, bytesread1);
                else
                    break;
            }
            zip.closeEntry();
            currentValue = new BytesWritable(bos.toByteArray());
            return true;
        }
        // Read the file contents
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] temp = new byte[8192];
        while (true) {
            int bytesRead = 0;
            try {
                bytesRead = zip.read(temp, 0, 8192);
            } catch (EOFException e) {
                if (ZipFileInputFormat.getLenient() == false)
                    throw e;
                return false;
            }
            if (bytesRead > 0)
                bos.write(temp, 0, bytesRead);
            else
                break;
        }
        zip.closeEntry();
        // Uncompressed contents
        currentValue = new BytesWritable(bos.toByteArray());
        return true;
    }

    /**
     * Rather than calculating progress, we just keep it simple
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return isFinished ? 1 : 0;
    }

    /**
     * Returns the current key (name of the zipped file)
     */
    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }

    /**
     * Returns the current value (contents of the zipped file)
     */
    @Override
    public BytesWritable getCurrentValue() throws IOException,
            InterruptedException {
        return currentValue;
    }

    /**
     * Close quietly, ignoring any exceptions
     */
    @Override
    public void close() throws IOException {
        try {
            zip.close();
        } catch (Exception ignore) {
        }
        try {
            fsin.close();
        } catch (Exception ignore) {
        }
    }
}
For convenience I have left a few comments in the source code so you can easily see how the files are read and written through an in-memory buffer. Now the Mapper class for this input format can be written as
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        String filename = key.toString();
        // We only want to process .txt files
        if (filename.endsWith(".txt") == false)
            return;
        // Prepare the content
        String content = new String(value.getBytes(), "UTF-8");
        context.write(new Text(content), one);
    }
}
Let's quickly write the Reducer for it as well:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // context.write(key, new IntWritable(sum));
        context.write(new Text(key), null);
    }
}
Let's quickly configure the job for this Mapper and Reducer:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.saama.CustomisedMapperReducer.MyMapper;
import com.saama.CustomisedMapperReducer.MyReducer;
import com.saama.CustomisedMapperReducer.ZipFileInputFormat;
import com.saama.CustomisedMapperReducer.ZipFileRecordReader;

public class MyJob {

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(MyJob.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setInputFormatClass(ZipFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        ZipFileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}
Note that in the job class we have configured the InputFormatClass to be the ZipFileInputFormat class and the OutputFormatClass to be the TextOutputFormat class.
Mavenize the project and let the dependencies build the code, then export the jar file and deploy it on the Hadoop cluster. This was tested and deployed on CDH 5.5 (YARN). The contents of the POM file are
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.mithun</groupId>
    <artifactId>CustomisedMapperReducer</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>CustomisedMapperReducer</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0</version>
        </dependency>

        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-mapper-asl</artifactId>
            <version>1.9.13</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>