How to run two map tasks in parallel to read two files

Time: 2014-05-01 19:53:20

Tags: hadoop mapreduce

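Please go easy on me; I have only been working with Hadoop and MapReduce for 3 months.

I have 2 files of 120 MB each. The data in each file is completely unstructured but follows a common pattern. Because the structure of the data varies, the default LineInputFormat does not meet my requirements.

So, when reading the files, I override the isSplitable() method and return false to prevent splitting. That way a single mapper gets one complete file, and I can run my logic and meet the requirement.

My machine can run two mappers in parallel. By disabling splitting, however, I am degrading performance: the mappers run one after the other, one per file, instead of two mappers running in parallel.

My question is how to run two mappers in parallel, one per file, so that performance improves.

For example: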

When split was allowed:
    file 1: split 1 (1st mapper) || split 2 (2nd mapper)------ 2 min 
    file 2: split 1 (1st mapper) || split 2 (2nd mapper)------ 2 min

    Total Time for reading two files =====  4 min

When Split not allowed:
    file 1: no parallel jobs so (1st mapper)---------4 min
    file 2: no parallel jobs so (1st mapper)---------4 min

    Total Time to read two files ===== 8 min (Performance degraded)

What I want
    File 1 (1st Mapper) || file 2 (2nd Mapper) ------4 min

    Total time to read two files ====== 4 min 
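The timings above can be reproduced with a small scheduling sketch. This is not Hadoop code: `MapSlotTimeline` and `totalMinutes` are hypothetical names, and the sketch assumes that each split takes 2 minutes, each unsplit file takes 4 minutes, and every task goes to whichever slot becomes free first.

```java
import java.util.PriorityQueue;

// Hypothetical helper, not part of Hadoop: estimates the wall-clock time
// when tasks are handed, in order, to whichever slot becomes free first.
final class MapSlotTimeline {

    static int totalMinutes(int[] taskMinutes, int slots) {
        // Each entry is the minute at which one slot becomes free.
        PriorityQueue<Integer> freeAt = new PriorityQueue<>();
        for (int i = 0; i < slots; i++) {
            freeAt.add(0); // every slot is free at minute 0
        }
        int finish = 0;
        for (int minutes : taskMinutes) {
            int start = freeAt.poll(); // earliest free slot
            int end = start + minutes;
            freeAt.add(end);
            finish = Math.max(finish, end);
        }
        return finish;
    }

    public static void main(String[] args) {
        // Split allowed: four 2-minute splits on two slots.
        System.out.println(totalMinutes(new int[] {2, 2, 2, 2}, 2));
        // Split not allowed, mappers run one by one.
        System.out.println(totalMinutes(new int[] {4, 4}, 1));
        // Wanted: one whole-file mapper per slot.
        System.out.println(totalMinutes(new int[] {4, 4}, 2));
    }
}
```

Under those assumptions the three calls correspond to the 4 min, 8 min and 4 min totals in the example.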

Basically, I want two different mappers to read the two files at the same time.

Please help me achieve this.

Below are my custom InputFormat and custom RecordReader code.

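import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class NSI_inputformatter extends FileInputFormat<NullWritable, Text> {

    // Never split a file: each mapper receives one complete file.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    // Hand the whole-file split to the custom record reader.
    @Override
    public RecordReader<NullWritable, Text> getRecordReader(InputSplit split,
            JobConf job_run, Reporter reporter) throws IOException {
        return new NSI_record_reader(job_run, (FileSplit) split);
    }
}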

Record reader:

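import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class NSI_record_reader implements RecordReader<NullWritable, Text>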
{
FileSplit split;
JobConf job_run;
String text;
public boolean processed=false;
public NSI_record_reader(JobConf job_run, FileSplit split)
{
    //System.out.println("Inside the NSI_record_reader constructor");
    this.split=split;
    this.job_run=job_run;

    //System.out.println(split.toString());
}
@Override
public boolean next(NullWritable key, Text value) throws IOException {
    // Emit the entire file as a single record, exactly once.
    if (!processed)
    {
        // The (int) cast limits this reader to files smaller than 2 GB.
        byte [] content_add=new byte[(int)(split.getLength())];
        Path file=split.getPath();
        FileSystem fs=file.getFileSystem(job_run);
        FSDataInputStream input=null;


        try{
            input=fs.open(file);
            System.out.println("the input is " + input);
            IOUtils.readFully(input, content_add, 0, content_add.length);
            value.set(content_add, 0, content_add.length);
        }
        finally
        {
            IOUtils.closeStream(input);

        }
        processed=true;
        return true;
    }

    return false;
}

@Override
public void close() throws IOException {
    // TODO Auto-generated method stub

}

@Override
public NullWritable createKey() {
    System.out.println("Inside createKey() method of NSI_record_reader");
    // TODO Auto-generated method stub
    return  NullWritable.get();
}

@Override
public Text createValue() {
    System.out.println("Inside createValue() method of NSI_record_reader");
    // TODO Auto-generated method stub
    return new Text();
}

@Override
public long getPos() throws IOException {
    // TODO Auto-generated method stub
    System.out.println("Inside getPos() method of NSI_record_reader");
    return processed ? split.getLength() : 0;
}

@Override
public float getProgress() throws IOException {
    // TODO Auto-generated method stub
    System.out.println("Inside getProgress() method of NSI_record_reader");
    return processed ? 1.0f : 0.0f;
}

}

Sample input:

<Dec 12, 2013 1:05:56 AM CST> <Error> <HTTP> <BEA-101017>       <[weblogic.servlet.internal.WebAppServletContext@42e87d99 - appName: 'Agile', name:    '/Agile', context-path: '/Agile', spec-version: 'null'] Root cause of ServletException.
  javax.servlet.jsp.JspException: Connection reset by peer: socket write error
at com.agile.ui.web.taglib.common.FormTag.writeFormHeader(FormTag.java:498)
at com.agile.ui.web.taglib.common.FormTag.doStartTag(FormTag.java:429)
at jsp_servlet._default.__login_45_cms._jspService(__login_45_cms.java:929)
at weblogic.servlet.jsp.JspBase.service(JspBase.java:34)
at weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:227)
Truncated. see log file for complete stacktrace
>
Retrieving the value for the attribute Page Two.Validation Status for the Object 769630
Retrieving the value for the attribute Page Two.Pilot Required for the Object 769630
Retrieving the value for the attribute Page Two.NPO Contact for the Object 769630
<Dec 12, 2013 1:12:13 AM CST> <Warning> <Socket> <BEA-000449> <Closing socket as no         data read from it during the configured idle timeout of 0 secs> 

Thanks.

1 answer:

Answer 0 (score: 1)

You can try setting the property -D mapred.min.split.size=209715200. In that case FileInputFormat should not split your files, because they are smaller than mapred.min.split.size.
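To see why that value should keep each 120 MB file in one piece, here is a minimal stand-alone sketch of the split-size rule that the old-API FileInputFormat applies to splittable files: the split size is max(minSize, min(goalSize, blockSize)), and a file is cut only while the remaining bytes exceed 1.1 times the split size. `SplitSizeSketch` and `countSplits` are hypothetical names, the sketch does not call Hadoop, and the 64 MB block size and the map-task hint of 2 are assumptions about the cluster, not values taken from the question.

```java
// Hypothetical stand-alone sketch; it mirrors, but does not call, the
// split-size rule of the old-API FileInputFormat.
final class SplitSizeSketch {

    static final long MB = 1024L * 1024L;
    static final double SPLIT_SLOP = 1.1; // the last split may be up to 10% larger

    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits for one non-empty, splittable file.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 120 * MB;
        long blockSize = 64 * MB;            // assumption: HDFS block size
        long goalSize = (2 * fileLength) / 2; // total input / assumed hint of 2 map tasks

        long defaultSplit = computeSplitSize(goalSize, 1L, blockSize);
        long raisedSplit = computeSplitSize(goalSize, 209715200L, blockSize);

        System.out.println("default min size: " + countSplits(fileLength, defaultSplit) + " splits per file");
        System.out.println("min size 209715200: " + countSplits(fileLength, raisedSplit) + " split per file");
    }
}
```

With these assumed values the default minimum gives two splits per file, matching the "split 1 || split 2" case in the question, while the raised minimum gives one split per file.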