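I'm trying to create an EMR cluster from Python using the boto library. I've tried several things, but the result is always "Shut down as step failed". I also tried running the wordcount sample code that Amazon provides, and it still fails.
When I checked the logs, I found that EMR could not find the mapper at the location I gave it:
s3n://elasticmapreduce/samples/wordcount/wordSplitter.py": error=2, No such file or directory
That led me to this reply from Amazon, which I found on another site: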
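Hello,
Starting with AMI 3.x, which uses Hadoop 2, and going forward, EMR Hadoop streaming will support the standard Hadoop-style reference for streaming jobs.
This means that mappers and reducers referenced in S3 need to be passed in the "-files" argument.
For example,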
elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate
becomes:
elastic-mapreduce --create --ami-version 3.0.1 --instance-type m1.large --log-uri s3n://mybucket/logs --stream --arg "-files" --arg "s3://elasticmapreduce/samples/wordcount/wordSplitter.py" --mapper wordSplitter.py --input s3://mybucket/input/alice.txt --output s3://mybucket/output --reducer aggregate
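I'd like to check whether this solution works for me, but I don't understand how to set the -files flag and the arguments mentioned in the reply when using boto.
This is my current code: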
self._steps.append(StreamingStep(
    name=step_description,
    mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
    reducer='aggregate',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test'))
conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
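-------------- EDIT ---------------
It looks like that reply explains the failure: I lowered ami_version to 2.4.11, the last AMI version before the Hadoop 2 based 3.x line, and the same code now works.
I don't know whether I actually need the latest Hadoop version, probably not, but it bothers me that I'm not using the latest version Amazon offers.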
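For anyone who wants to keep one code path for both AMI lines, here is a minimal sketch of a version check. It only encodes the rule from the Amazon reply above (AMI 3.x and later need "-files"). The function name needs_files_arg is made up for illustration, and it assumes ami_version is a dotted numeric string such as "3.6.0".

```python
def needs_files_arg(ami_version):
    """Return True when the AMI major version is 3 or higher.

    Hypothetical helper: per the Amazon reply, AMI 3.x (Hadoop 2) and later
    expect S3-hosted mapper/reducer scripts to be passed through -files.
    Assumes a dotted numeric version string; anything else raises ValueError.
    """
    major = int(ami_version.split(".")[0])
    return major >= 3


for version in ("2.4.11", "3.0.1", "3.6.0"):
    print(version, needs_files_arg(version))
```

With a check like this, the step could be built the old way for 2.x AMIs and with "-files" for 3.x AMIs.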
-------------- EDIT2 ---------------
Found the solution:
from boto.emr.step import StreamingStep

# Create a list with two elements:
# the first element is the argument name '-files',
# the second is the full S3 path of both the mapper and the reducer, separated by a comma.
# Putting the flag and the paths into a single string fails; they must be separate elements.
step_args = list()
step_args.append('-files')
step_args.append('s3://<map_full_path>/<map_script_name>,s3://<reduce_full_path>/<reduce_script_name>')
# Pass step_args to the StreamingStep and refer to the scripts by file name only
self._steps.append(StreamingStep(
    name=step_description,
    mapper='<map_script_name>',
    reducer='<reduce_script_name>',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test',
    step_args=step_args))
conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
Hope this helps someone...
Answer 0 (score: 1)
Found the solution:
from boto.emr.step import StreamingStep

# Create a list with two elements:
# the first element is the argument name '-files',
# the second is the full S3 path of both the mapper and the reducer, separated by a comma.
# Putting the flag and the paths into a single string fails; they must be separate elements.
step_args = list()
step_args.append('-files')
step_args.append('s3://<map_full_path>/<map_script_name>,s3://<reduce_full_path>/<reduce_script_name>')
# Pass step_args to the StreamingStep and refer to the scripts by file name only
self._steps.append(StreamingStep(
    name=step_description,
    mapper='<map_script_name>',
    reducer='<reduce_script_name>',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://'+'test',
    step_args=step_args))
conn.run_jobflow(
    availability_zone='us-east-1b',
    name=job_description,
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    log_uri='s3://'+"logs",
    ami_version="3.6.0",
    steps=self._steps,
    bootstrap_actions=self._actions,
    visible_to_all_users=True
)
Hope this helps someone...
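If you build several steps, the "-files" bookkeeping can be factored out. The following is only a sketch in plain Python (no boto calls), and the helper name build_streaming_args is made up for illustration. It collects the S3-hosted scripts into one comma-separated "-files" value and returns the bare file names to use as mapper and reducer. Anything that is not an s3:// or s3n:// URI, such as the built-in aggregate reducer, is passed through unchanged.

```python
import posixpath

# URI prefixes treated as "script stored in S3" (an assumption of this sketch)
S3_SCHEMES = ("s3://", "s3n://")


def build_streaming_args(mapper, reducer):
    """Return (step_args, mapper_name, reducer_name) for a streaming step.

    Hypothetical helper: S3 URIs go into a single comma-separated -files
    value and are replaced by their file names; other values are kept as-is.
    """
    files = []
    names = []
    for script in (mapper, reducer):
        if script.startswith(S3_SCHEMES):
            files.append(script)
            names.append(posixpath.basename(script))
        else:
            names.append(script)
    # The flag and its value stay separate list elements, as in the answer
    step_args = ["-files", ",".join(files)] if files else []
    return step_args, names[0], names[1]


step_args, mapper_name, reducer_name = build_streaming_args(
    "s3://elasticmapreduce/samples/wordcount/wordSplitter.py", "aggregate")
print(step_args, mapper_name, reducer_name)
```

The three returned values would then be passed as step_args, mapper and reducer when constructing the StreamingStep shown above.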