我正在编写一个由4个链式MapReduce作业组成的应用程序,该作业在Amazon EMR上运行。我使用JobFlow界面链接作业。每个作业都包含在自己的类中,并且有自己的main
方法。所有这些都打包成一个保存在S3中的.jar
,并且从我的笔记本电脑上的一个小型本地应用程序初始化集群,该应用程序配置JobFlowRequest
并将其提交给EMR。
对于我为启动集群而进行的大多数尝试,它都会失败,并显示错误消息Terminated with errors On the master instance (i-<cluster number>), bootstrap action 1 timed out executing
。我查找了有关此问题的信息,我所能找到的是,如果群集的组合引导时间超过45分钟,则抛出此异常。但是,这仅在请求提交给EMR后大约15分钟发生,而忽略了请求的群集大小,无论是4个EC2实例,10个甚至是20个。这对我来说毫无意义,我错过了什么?
一些技术规格: - 该项目使用Java 1.7.79编译 - 请求的EMR映像是4.6.0,它使用Hadoop 2.7.2 - 我正在使用AWS SDK for Java v.1.10.64
这是我的本地主要方法,它设置并提交JobFlowRequest
:
import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.*;
public class ExtractRelatedPairs {
public static void main(String[] args) throws Exception {
if (args.length != 1) {
System.err.println("Usage: ExtractRelatedPairs: <k>");
System.exit(1);
}
int outputSize = Integer.parseInt(args[0]);
if (outputSize < 0) {
System.err.println("k should be positive");
System.exit(1);
}
AWSCredentials credentials = null;
try {
credentials = new ProfileCredentialsProvider().getCredentials();
} catch (Exception e) {
throw new AmazonClientException(
"Cannot load the credentials from the credential profiles file. " +
"Please make sure that your credentials file is at the correct " +
"location (~/.aws/credentials), and is in valid format.",
e);
}
AmazonElasticMapReduce mapReduce = new AmazonElasticMapReduceClient(credentials);
HadoopJarStepConfig jarStep1 = new HadoopJarStepConfig()
.withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
.withMainClass("Phase1")
.withArgs("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/5gram/data/", "hdfs:///output1/");
StepConfig step1Config = new StepConfig()
.withName("Phase 1")
.withHadoopJarStep(jarStep1)
.withActionOnFailure("TERMINATE_JOB_FLOW");
HadoopJarStepConfig jarStep2 = new HadoopJarStepConfig()
.withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
.withMainClass("Phase2")
.withArgs("shdfs:///output1/", "hdfs:///output2/");
StepConfig step2Config = new StepConfig()
.withName("Phase 2")
.withHadoopJarStep(jarStep2)
.withActionOnFailure("TERMINATE_JOB_FLOW");
HadoopJarStepConfig jarStep3 = new HadoopJarStepConfig()
.withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
.withMainClass("Phase3")
.withArgs("hdfs:///output2/", "hdfs:///output3/", args[0]);
StepConfig step3Config = new StepConfig()
.withName("Phase 3")
.withHadoopJarStep(jarStep3)
.withActionOnFailure("TERMINATE_JOB_FLOW");
HadoopJarStepConfig jarStep4 = new HadoopJarStepConfig()
.withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
.withMainClass("Phase4")
.withArgs("hdfs:///output3/", "s3n://dsps162assignment2benasaf/output4");
StepConfig step4Config = new StepConfig()
.withName("Phase 4")
.withHadoopJarStep(jarStep4)
.withActionOnFailure("TERMINATE_JOB_FLOW");
JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
.withInstanceCount(10)
.withMasterInstanceType(InstanceType.M1Small.toString())
.withSlaveInstanceType(InstanceType.M1Small.toString())
.withHadoopVersion("2.7.2")
.withEc2KeyName("AWS")
.withKeepJobFlowAliveWhenNoSteps(false)
.withPlacement(new PlacementType("us-east-1a"));
RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
.withName("extract-related-word-pairs")
.withInstances(instances)
.withSteps(step1Config, step2Config, step3Config, step4Config)
.withJobFlowRole("EMR_EC2_DefaultRole")
.withServiceRole("EMR_DefaultRole")
.withReleaseLabel("emr-4.6.0")
.withLogUri("s3n://dsps162assignment2benasaf/logs/");
System.out.println("Submitting the JobFlow Request to Amazon EMR and running it...");
RunJobFlowResult runJobFlowResult = mapReduce.runJobFlow(runFlowRequest);
String jobFlowId = runJobFlowResult.getJobFlowId();
System.out.println("Ran job flow with id: " + jobFlowId);
}
}
答案 0 :(得分:-1)
前段时间,我遇到了类似的问题,即使4.6.0的Vanilla EMR集群未能通过启动,因此它在引导步骤中抛出超时错误。
我最后只是在不同区域的另一个/新VPC上创建了一个集群并且工作正常,因此它让我相信原始VPC本身或4.6.0中的软件可能存在问题
此外,关于VPC,它特别为新创建的群集节点设置了问题并解析了DNS名称,即使旧版本的EMR没有出现此问题