Please go easy on me, as I am new to Hadoop and MapReduce.
I have a .tar.gz file that I am trying to read with MapReduce by writing a custom InputFormat that uses CompressionCodecFactory.
I read some documentation on the Internet saying that CompressionCodecFactory can be used to read a .tar.gz file, so I implemented it in my code.
The output I get after running the code is absolute garbage.
My input file looks like this:
"MAY 2013 KOTZEBUE, AK"
"RALPH WIEN MEMORIAL AIRPORT (PAOT)"
"Lat:66° 52'N Long: 162° 37'W Elev (Ground) 30 Feet"
"Time Zone : ALASKA WBAN: 26616 ISSN#: 0197-9833"
01,21,0,11,-11,3,11,54,0," ",4, ,0.0,0.00,30.06,30.09,10.2,36,10.0,25,360,22,360,01
02,25,3,14,-9,5,12,51,0," ",4, ,0.0,0.00,30.09,30.11,6.1,34,7.7,16,010,14,360,02
03,21,1,11,-12,7,11,54,0," ",4, ,0.0,0.00,30.14,30.15,5.0,28,6.0,17,270,16,270,03
04,20,8,14,-10,11,13,51,0,"SN BR",4, ,.001,.0001,30.09,30.11,8.6,26,9.2,20,280,15,280,04
05,29,19,24,-1,21,23,41,0,"SN BR",5, ,0.6,0.06,30.11,30.14,8.1,20,8.5,22,240,20,240,05
06,27,19,23,-3,21,23,42,0,"SN BR",4, ,0.1,0.01,30.14,30.15,8.7,19,9.4,18,200,15,200,06
The output I get is gibberish:
��@(���]�OX}�s���{Fw8OP��@ig@���e�1L'�����sAm�
��@���Q�eW�t�Ruk�@��AAB.2P�V�� \L}��+����.֏9U]N �)(���d��i(��%F�S<�ҫ ���EN��v�7�Y�%U�>��<�p���`]ݹ�@�#����9Dˬ��M�X2�'��\R��\1- ���V\K1�c_P▒W¨P[ÖÍãÏ2¨▒;O
Here is the custom InputFormat and RecordReader code:
InputFormat:
public class SZ_inptfrmtr extends FileInputFormat<Text, Text>
{
    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split,
            JobConf job_run, Reporter reporter) throws IOException {
        return new SZ_recordreader(job_run, (FileSplit) split);
    }
}
RecordReader:
public class SZ_recordreader implements RecordReader<Text, Text>
{
    FileSplit split;
    JobConf job_run;
    boolean processed = false;
    CompressionCodecFactory compressioncodec = null; // a factory that finds the correct codec for a given file name

    public SZ_recordreader(JobConf job_run, FileSplit split)
    {
        this.split = split;
        this.job_run = job_run;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? split.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
        FSDataInputStream in = null;
        if (!processed)
        {
            byte[] bytestream = new byte[(int) split.getLength()];
            Path path = split.getPath();
            compressioncodec = new CompressionCodecFactory(job_run);
            CompressionCodec code = compressioncodec.getCodec(path);
            // the factory inspects the file name on the path and returns the matching codec
            System.out.println(code);
            FileSystem fs = path.getFileSystem(job_run);
            try
            {
                in = fs.open(path);
                IOUtils.readFully(in, bytestream, 0, bytestream.length);
                System.out.println("the input is " + in + in.toString());
                key.set(path.getName());
                value.set(bytestream, 0, bytestream.length);
            }
            finally
            {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }
}
Can someone please point out the flaw here?
Answer 0 (score: 3)
There is a codec for .gz, but there is no codec for .tar. Your .tar.gz is being decompressed into a .tar, but that is still a tarball, not something the Hadoop system can understand.
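In other words, Hadoop's codec machinery can only strip the gzip layer; something else has to walk the tar archive inside. Below is a minimal sketch of that two-step read, assuming the Apache Commons Compress library is on the classpath; the class and method names here are illustrative, not from the original post:

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class TarGzDump {
    // Opens a .tar.gz: the Hadoop codec strips the gzip layer,
    // then commons-compress walks the tar entries inside it.
    public static void dump(Configuration conf, Path path) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path); // matches ".gz"
        InputStream gunzipped = codec.createInputStream(fs.open(path));
        TarArchiveInputStream tar = new TarArchiveInputStream(gunzipped);
        try {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) continue;
                byte[] content = new byte[(int) entry.getSize()];
                IOUtils.readFully(tar, content, 0, content.length); // reads the current entry only
                System.out.println(entry.getName() + ": " + content.length + " bytes");
            }
        } finally {
            IOUtils.closeStream(tar);
        }
    }
}

A RecordReader built along these lines could emit one (file name, contents) pair per tar entry, instead of the raw compressed bytes the posted code hands to the mapper.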
Answer 1 (score: 0)
Your code may be getting stuck at the mapper-to-reducer handoff. To work with compressed files in MapReduce, you need to set a few configuration options for your job. These must be set in the driver class:
conf.setBoolean("mapred.output.compress", true);//Compress The Reducer Out put
conf.setBoolean("mapred.compress.map.output", true);//Compress The Mapper Output
conf.setClass("mapred.output.compression.codec",
codecClass,
CompressionCodec.class);//Compression codec for Compresing mapper output
The only difference between a MapReduce job with compressed IO and one without is these three annotated lines.
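For context, a sketch of where those lines sit in a driver built on the old mapred API; the class name, paths, and the GzipCodec choice are placeholders, not from the answer:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CompressedJobDriver.class);
        conf.setJobName("compressed-io");

        // the three compression settings from above
        conf.setBoolean("mapred.output.compress", true);
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.output.compression.codec",
                GzipCodec.class, CompressionCodec.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}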
"I read some documentation on the Internet saying that CompressionCodecFactory can be used to read a .tar.gz file, so I implemented it in my code."
Compression codecs can indeed do a better job, and there are plenty of codecs for this purpose; LzopCodec and SnappyCodec are the usual choices when the data may be large. You can find LzopCodec's Git here: https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/LzopCodec.java
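As an illustration, a hedged fragment that swaps SnappyCodec in for the intermediate map output; it assumes the native Snappy libraries are installed on the cluster, and uses the old mapred property name (the wrapper class here is made up for the example):

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.JobConf;

public class SnappyMapOutput {
    public static JobConf configure(JobConf conf) {
        // compress only the intermediate map output; the job output is untouched
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}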