Error loading a file into the EMR distributed cache with elastic-mapreduce

Date: 2015-05-08 00:40:17

Tags: python ruby hadoop mapreduce emr

I am launching a cluster with the following command.

./elastic-mapreduce --create \
 --stream \
 --cache  s3n://bucket_name/code/totalInstallUsers#totalInstallUsers \
 --input s3n://bucket_name/input \
 --output s3n://bucket_name/output \
 --mapper s3n://bucket_name/code/mapper.py \
 --reducer s3n://bucket_name \
 --jobflow-role EMR_EC2_DefaultRole \
 --service-role EMR_DefaultRole \
 --debug \
 --log-uri s3n://bucket_name/logs

I always get the error message below. If I remove the --cache option, the cluster launches successfully.

Error: undefined method `each' for #<String:0x00000002c28ba0>
	/home/ubuntu/data_processing/commands.rb:806:in `steps'
	/home/ubuntu/data_processing/commands.rb:1232:in `block in enact'
	/home/ubuntu/data_processing/commands.rb:1232:in `map'
	/home/ubuntu/data_processing/commands.rb:1232:in `enact'
	/home/ubuntu/data_processing/commands.rb:49:in `block in enact'
	/home/ubuntu/data_processing/commands.rb:49:in `each'
	/home/ubuntu/data_processing/commands.rb:49:in `enact'
	/home/ubuntu/data_processing/commands.rb:2422:in `create_and_execute_commands'
	/home/ubuntu/data_processing/elastic-mapreduce-cli.rb:13:in `'
	/usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
	/usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
	./elastic-mapreduce:6:in `'

The reason I use --cache is that I want to open the data file from mapper.py with:

with open('./totalInstallUsers', 'r') as infile:

Can anyone give me a clue? Thanks.

1 Answer:

Answer 0 (score: 1)

Posting the solution I ended up with, in the hope that it helps others. With the AWS CLI for EMR, the command looks like this:

aws emr create-cluster \
    --name "cluster--name" \
    --enable-debugging \
    --log-uri s3://bucket-name/logs \
    --ami-version 3.7.0 \
    --use-default-roles \
    --ec2-attributes KeyName=your-key \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --auto-terminate \
    --steps file://./streaming.json

And streaming.json looks like:
[ 
    { 
    "Type": "STREAMING", 
    "Name": "Streaming program", 
    "ActionOnFailure": "TERMINATE_CLUSTER", 
    "Args": [ 
            "-files","s3://bucket-name/code/mapper.py,s3://bucket-name/code/reducer.py", 
            "-mapper","mapper.py", 
            "-reducer","reducer.py", 
            "-input","s3://bucket-name/input", 
            "-output","s3://bucket-name/output", 
            "-cacheFile", "s3://bucket_name/code/data-file-name#new-file-name" 
            ] 
    } 
]