Question

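I am trying to create an EMR cluster from Python using the boto library. I have tried several things, but the end result is always "Shut down as step failed". I also tried running the wordcount sample code that Amazon provides, and it still fails.

When I checked the logs, I found that EMR cannot find the mapper at the location I gave it:

s3n://elasticmapreduce/samples/wordcount/wordSplitter.py": error=2, No such file or directory

That led me to the following reply from Amazon, which I found on another site: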

Hello,

Starting with AMI 3.x, which uses Hadoop 2, EMR Hadoop streaming will support the standard Hadoop-style reference for streaming jobs going forward.

This means that S3-referenced mappers and reducers need to be put into the "-files" argument.

For example,

elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate

becomes:

elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --mapper wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate

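Now I would like to find out whether this solution works for me, but I don't understand how to set the -files flag and the arguments he mentions when using boto.

This is my current code: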

self._steps.append(StreamingStep(
    name=step_description,
    mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
    reducer='aggregate',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test'))

conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)

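-------------- EDIT ---------------
It looks like that reply is the answer. I lowered ami_version to 2.4.11, which still runs Hadoop 1 rather than Hadoop 2, and the same code now works. I don't know whether I really need the latest Hadoop version, probably not, but it bothers me that I am not using the latest version Amazon offers.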

-------------- EDIT 2 ---------------
Found the solution; see the answer below.


Answer 1

Found the solution:

from boto.emr.step import StreamingStep

# Create a list and insert two elements:
# the first element is the argument name '-files';
# the second is the full S3 path of both the mapper and the reducer, separated by a comma.
# Putting the flag and the paths into a single element fails.
step_args = []
step_args.append('-files')
step_args.append('s3://<map_full_path>/<map_script_name>,s3://<reduce_full_path>/<reduce_script_name>')

# Pass step_args to StreamingStep; mapper and reducer are now just the script file names.
self._steps.append(StreamingStep(
    name=step_description,
    mapper='<map_script_name>',
    reducer='<reduce_script_name>',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test',
    step_args=step_args))

conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
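
If you switch between AMI versions, the same idea can be wrapped in a small helper. This is only a sketch, not part of boto: the function names (is_s3_path, uses_hadoop2_style, streaming_kwargs) are made up for illustration, and it assumes ami_version is a plain dotted number such as "3.6.0". It follows the rule from the Amazon reply quoted in the question: from AMI 3.x on, scripts stored in S3 go into -files and the mapper/reducer become bare file names, while anything that is not an S3 path (for example the built-in aggregate reducer) is passed through unchanged.

```python
import posixpath


def is_s3_path(value):
    """True if the value looks like a script stored in S3."""
    return value.startswith(('s3://', 's3n://'))


def uses_hadoop2_style(ami_version):
    """True for AMI 3.x and later (assumes a dotted numeric version string)."""
    return int(ami_version.split('.')[0]) >= 3


def streaming_kwargs(mapper, reducer, ami_version):
    """Build the mapper, reducer and step_args values for a streaming step."""
    if not uses_hadoop2_style(ami_version):
        # On older AMIs the full S3 path can be used directly.
        return {'mapper': mapper, 'reducer': reducer, 'step_args': None}

    # Only scripts that live in S3 are shipped through -files.
    remote = [p for p in (mapper, reducer) if is_s3_path(p)]

    def local_name(path):
        # The task sees the shipped script under its bare file name.
        return posixpath.basename(path) if is_s3_path(path) else path

    return {
        'mapper': local_name(mapper),
        'reducer': local_name(reducer),
        # The flag and the comma-separated paths are two separate elements.
        'step_args': ['-files', ','.join(remote)] if remote else None,
    }
```

The resulting dictionary can then be unpacked into the call from the answer, e.g. StreamingStep(name=step_description, input=..., output=..., **streaming_kwargs(mapper, reducer, "3.6.0")).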

Hope this helps someone...

Creating an EMR cluster with Boto fails
