AWS EMR cannot find mapper file in S3 bucket - No such file or directory

Date: 2018-05-22 18:55:52

Tags: amazon-web-services amazon-s3 amazon-emr

I am trying to run the following command:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose

But Hadoop cannot find the wordSplitter.py file. I get the following error:

Caused by: java.io.IOException: Cannot run program "/mnt/yarn/usercache/hadoop/appcache/application_1116934618409_1114/container_1116934618409_1114_01_000018/./wordSplitter.py": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:208)
        ... 23 more

I tried modifying the activity command to pass the full S3 path of the wordSplitter.py file as the mapper:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper s3://foobar/hadoop-samples/wordSplitter.py -reducer aggregate -verbose

But I got the same error:

Caused by: java.io.IOException: Cannot run program "s3://foobar/hadoop-samples/wordSplitter.py": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:208)
        ... 23 more

The activity command does work when I use the public wordSplitter.py location provided by AWS, as shown below:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -input s3://elasticmapreduce/samples/wordcount/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose
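For context, a streaming mapper like wordSplitter.py reads lines from standard input and writes records to standard output; when paired with the built-in `aggregate` reducer, each record takes the form `LongValueSum:<key>\t<count>`. The following is a minimal illustrative sketch of such a mapper, not the exact AWS sample script:

```python
#!/usr/bin/env python
# Minimal streaming-mapper sketch: emit one "LongValueSum:<word>\t1" record
# per whitespace-separated word, the format the "aggregate" reducer sums up.
import sys


def map_line(line):
    """Return one aggregate-style record per word in the line."""
    return ["LongValueSum:%s\t1" % word for word in line.split()]


if __name__ == "__main__":
    for line in sys.stdin:
        for record in map_line(line):
            print(record)
```

The `aggregate` reducer groups records by key and sums the trailing counts, so this mapper plus `-reducer aggregate` yields a word count.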

But I cannot keep using the public AWS location s3://elasticmapreduce/samples/wordcount/wordSplitter.py. I now need to pull the file from my own bucket, and for some reason it cannot be retrieved from there.

Update

I found this AWS Developer Forum thread. I had indeed saved the Python script on my Windows machine before uploading it to S3. Following the forum's suggestion, I deleted the file and copied it directly from the source location into my bucket with the command aws s3 cp s3://elasticmapreduce/samples/wordcount/wordSplitter.py s3://foobar/hadoop-samples/wordSplitter.py, but that did not fix the problem: I still get exactly the same error. (Note: I ran the aws s3 cp command from within my EMR cluster, i.e., I logged into the box and ran it there.)

Update 2:

The EMR cluster can find the wordSplitter.py file with the command hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose when I remove the SecurityConfiguration setting from the cluster. I need to keep the SecurityConfiguration setting, so I am not sure what I need to change on the S3 bucket so the EMR cluster can access the file for use as a mapper.
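One possible direction (a hedged sketch, not a confirmed fix for this cluster): if the SecurityConfiguration enables, for example, SSE-KMS encryption for EMRFS, the cluster's EC2 instance profile role needs both S3 read access to the bucket and decrypt access to the KMS key. The key ARN below is a placeholder, and the bucket name comes from the question; whether this matches the actual SecurityConfiguration is an assumption:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadMapperScript",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::foobar",
        "arn:aws:s3:::foobar/hadoop-samples/*"
      ]
    },
    {
      "Sid": "DecryptIfBucketUsesSseKms",
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:DescribeKey"],
      "Resource": "arn:aws:kms:REGION:ACCOUNT-ID:key/KEY-ID"
    }
  ]
}
```

Such a policy would be attached to the instance profile role the cluster uses (e.g., EMR_EC2_DefaultRole).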

1 answer:

Answer 0 (score: 0)

For anyone still dealing with this, the following fixed it for me:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py#wordsplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordsplitter.py -reducer aggregate -verbose

This follows the Apache Hadoop documentation on streaming: "The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file." Multiple files can be separated by commas.

I was running a streaming job on AWS EMR (v6.3.0) using hadoop streaming with the following arguments:

hadoop-streaming -files s3://archit47bucket/Step1RelativeFrequency/mapperwithlogs.py#mapper.py,s3://archit47bucket/Step1RelativeFrequency/reducerwithlogs.py#reducer.py -mapper mapper.py -reducer reducer.py -input s3://archit47bucket/Step1RelativeFrequency/Transaction.dat -output s3://archit47bucket/Step1RelativeFrequency/NOP1

For some reason, specifying an S3 location with -mapper does not seem to work, but the -files argument with a #alias, as above, solved my problem.

https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UseCase_Streaming.html