EMR cluster bootstrap failure (timeout) on most of my attempts to initialize a cluster

Time: 2016-06-16 06:00:03

Tags: java amazon-web-services emr amazon-emr

I am writing an application made up of 4 chained MapReduce jobs that runs on Amazon EMR. I use the JobFlow interface to chain the jobs. Each job is contained in its own class and has its own main method. All of them are packaged into a single .jar stored in S3, and the cluster is launched from a small local application on my laptop that configures the JobFlowRequest and submits it to EMR.

On most of my attempts to start the cluster, it fails with the error message Terminated with errors On the master instance (i-<cluster number>), bootstrap action 1 timed out executing. I looked for information on this problem, and all I could find is that this error is thrown when the combined bootstrap time of the cluster exceeds 45 minutes. However, the failure happens only about 15 minutes after the request is submitted to EMR, regardless of the requested cluster size, whether it is 4 EC2 instances, 10, or even 20. That makes no sense to me; what am I missing?

Some technical specifications:

- The project is compiled with Java 1.7.79
- The requested EMR image is 4.6.0, which runs Hadoop 2.7.2
- I am using the AWS SDK for Java v1.10.64

Here is my local main method, which sets up and submits the JobFlowRequest:

import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.*;

public class ExtractRelatedPairs {

public static void main(String[] args) throws Exception {

    if (args.length != 1) {
        System.err.println("Usage: ExtractRelatedPairs: <k>");
        System.exit(1);
    }
    int outputSize = Integer.parseInt(args[0]);
    if (outputSize < 0) {
        System.err.println("k should be positive");
        System.exit(1);
    }

    AWSCredentials credentials = null;
    try {
        credentials = new ProfileCredentialsProvider().getCredentials();
    } catch (Exception e) {
        throw new AmazonClientException(
                "Cannot load the credentials from the credential profiles file. " +
                        "Please make sure that your credentials file is at the correct " +
                        "location (~/.aws/credentials), and is in valid format.",
                e);
    }

    AmazonElasticMapReduce mapReduce = new AmazonElasticMapReduceClient(credentials);

    HadoopJarStepConfig jarStep1 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase1")
          .withArgs("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/5gram/data/", "hdfs:///output1/");



    StepConfig step1Config = new StepConfig()
            .withName("Phase 1")
            .withHadoopJarStep(jarStep1)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    HadoopJarStepConfig jarStep2 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase2")
            .withArgs("shdfs:///output1/", "hdfs:///output2/");

    StepConfig step2Config = new StepConfig()
            .withName("Phase 2")
            .withHadoopJarStep(jarStep2)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    HadoopJarStepConfig jarStep3 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase3")
            .withArgs("hdfs:///output2/", "hdfs:///output3/", args[0]);

    StepConfig step3Config = new StepConfig()
            .withName("Phase 3")
            .withHadoopJarStep(jarStep3)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    HadoopJarStepConfig jarStep4 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase4")
            .withArgs("hdfs:///output3/", "s3n://dsps162assignment2benasaf/output4");

    StepConfig step4Config = new StepConfig()
            .withName("Phase 4")
            .withHadoopJarStep(jarStep4)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType(InstanceType.M1Small.toString())
            .withSlaveInstanceType(InstanceType.M1Small.toString())
            .withHadoopVersion("2.7.2")
            .withEc2KeyName("AWS")
            .withKeepJobFlowAliveWhenNoSteps(false)
            .withPlacement(new PlacementType("us-east-1a"));

    RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
            .withName("extract-related-word-pairs")
            .withInstances(instances)
            .withSteps(step1Config, step2Config, step3Config, step4Config)
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withServiceRole("EMR_DefaultRole")
            .withReleaseLabel("emr-4.6.0")
            .withLogUri("s3n://dsps162assignment2benasaf/logs/");

    System.out.println("Submitting the JobFlow Request to Amazon EMR and running it...");
    RunJobFlowResult runJobFlowResult = mapReduce.runJobFlow(runFlowRequest);
    String jobFlowId = runJobFlowResult.getJobFlowId();
    System.out.println("Ran job flow with id: " + jobFlowId);

}
}
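
For what it's worth, one way to see the exact failure reason EMR reports (including the bootstrap timeout message) is to poll the cluster after submission. Below is a minimal sketch, not part of the original program, using the DescribeCluster call from the same SDK; the polling interval and the terminal states it checks for are illustrative choices:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.model.Cluster;
import com.amazonaws.services.elasticmapreduce.model.ClusterStatus;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;

public class ClusterStatusPoller {

    // Polls the cluster every 30 seconds and prints its state and the
    // state-change reason reported by EMR (e.g. the bootstrap timeout message).
    public static void waitAndReport(AmazonElasticMapReduce emr, String clusterId)
            throws InterruptedException {
        while (true) {
            Cluster cluster = emr.describeCluster(
                    new DescribeClusterRequest().withClusterId(clusterId)).getCluster();
            ClusterStatus status = cluster.getStatus();
            System.out.println("State: " + status.getState());
            if (status.getStateChangeReason() != null) {
                System.out.println("Reason: " + status.getStateChangeReason().getCode()
                        + " - " + status.getStateChangeReason().getMessage());
            }
            String state = status.getState();
            if ("TERMINATED".equals(state) || "TERMINATED_WITH_ERRORS".equals(state)
                    || "WAITING".equals(state)) {
                break;
            }
            Thread.sleep(30000);
        }
    }
}

Calling ClusterStatusPoller.waitAndReport(mapReduce, jobFlowId) right after runJobFlow would print the same "bootstrap action 1 timed out" reason that the console shows.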

1 Answer:

Answer 0 (score: -1)

A while back I ran into a similar issue, where even a vanilla EMR 4.6.0 cluster failed to come up and threw a timeout error during the bootstrap step.

I ended up just creating a cluster in another/new VPC in a different region and it worked fine, which leads me to believe there may be a problem either with the original VPC itself or with the software in 4.6.0.
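
If the fix really is a new VPC, the JobFlowRequest from the question can be pointed at it explicitly by giving the instances config a subnet in that VPC instead of an availability-zone placement. A sketch under that assumption (the subnet ID is a placeholder; everything else mirrors the question's settings):

import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;

public class VpcInstancesConfig {

    // Builds the same instance group as in the question, but pins the cluster
    // to a specific subnet (and therefore a specific VPC). The availability-zone
    // Placement is omitted because the subnet already determines the zone.
    public static JobFlowInstancesConfig inSubnet(String subnetId) {
        return new JobFlowInstancesConfig()
                .withInstanceCount(10)
                .withMasterInstanceType(InstanceType.M1Small.toString())
                .withSlaveInstanceType(InstanceType.M1Small.toString())
                .withEc2KeyName("AWS")
                .withKeepJobFlowAliveWhenNoSteps(false)
                .withEc2SubnetId(subnetId);
    }
}

Passing, say, inSubnet("subnet-xxxxxxxx") in place of the instances object keeps the rest of the RunJobFlowRequest unchanged.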

Also, regarding the VPC, the issue was specifically with setting up and resolving DNS names for the newly created cluster nodes, even though older EMR versions did not have this problem.
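
If that DNS issue is the culprit, one possible (unverified) workaround is to make sure the VPC has both DNS support and DNS hostnames enabled before launching the cluster, so that new nodes get resolvable private DNS names. A sketch using the EC2 client from the same SDK, with a placeholder VPC ID; EC2 accepts only one attribute per ModifyVpcAttribute call, hence the two requests:

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.ModifyVpcAttributeRequest;

public class EnableVpcDns {

    public static void main(String[] args) {
        AmazonEC2 ec2 = new AmazonEC2Client(new ProfileCredentialsProvider());

        // Enable DNS resolution inside the VPC (placeholder VPC ID).
        ec2.modifyVpcAttribute(new ModifyVpcAttributeRequest()
                .withVpcId("vpc-xxxxxxxx")
                .withEnableDnsSupport(true));

        // Enable DNS hostnames for instances launched in the VPC.
        ec2.modifyVpcAttribute(new ModifyVpcAttributeRequest()
                .withVpcId("vpc-xxxxxxxx")
                .withEnableDnsHostnames(true));
    }
}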