I am trying to run the following command
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose
but Hadoop cannot find the wordSplitter.py file. I get the following error:
Caused by: java.io.IOException: Cannot run program "/mnt/yarn/usercache/hadoop/appcache/application_1116934618409_1114/container_1116934618409_1114_01_000018/./wordSplitter.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:208)
... 23 more
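(A quick sanity check before changing anything else: confirm that the object really exists at that key and is readable from a cluster node. These are standard AWS CLI calls, using the same bucket and key as in the command above.)
aws s3 ls s3://foobar/hadoop-samples/wordSplitter.py
aws s3 cp s3://foobar/hadoop-samples/wordSplitter.py /tmp/wordSplitter.py && head -n 3 /tmp/wordSplitter.py
(If the ls shows nothing or the cp fails, the problem is the object or its permissions rather than the streaming flags.)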
I tried modifying the activity command to use the full S3 path to the wordSplitter.py file:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper s3://foobar/hadoop-samples/wordSplitter.py -reducer aggregate -verbose
but I got the same error:
Caused by: java.io.IOException: Cannot run program "s3://foobar/hadoop-samples/wordSplitter.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:208)
... 23 more
This activity command works when I use the public wordSplitter.py location provided by AWS, as shown below:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -input s3://elasticmapreduce/samples/wordcount/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose
But I cannot keep using the public AWS location s3://elasticmapreduce/samples/wordcount/wordSplitter.py. I now need to pull the file from my own bucket, and for some reason the job cannot fetch it from my personal bucket.
Update
I found this AWS Developer Forum thread. I had in fact saved the Python script on my Windows machine before uploading it to S3, so, following the forum's suggestion, I deleted the file and copied it directly from the source location into my S3 bucket with the following command:
aws s3 cp s3://elasticmapreduce/samples/wordcount/wordSplitter.py s3://foobar/hadoop-samples/wordSplitter.py
However, that did not resolve the problem; I still get exactly the same error. (Note: I ran the aws s3 cp command from within my EMR cluster, i.e., I SSH'd into the box and ran it there.)
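Since the script originally passed through a Windows machine, one thing worth ruling out is CRLF line endings: a carriage return at the end of the #! shebang line makes the kernel look for an interpreter named "python\r", which produces exactly this kind of "No such file or directory" error even though the script is present. A quick check with standard shell tools (same bucket and key as above):
aws s3 cp s3://foobar/hadoop-samples/wordSplitter.py - | head -n 1 | cat -v
If the first line ends in ^M, the file has Windows line endings; stripping them (for example with dos2unix or tr -d '\r') and re-uploading rules this cause out.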
Update 2:
The EMR cluster can find the wordSplitter.py file with the command
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose
when I remove the SecurityConfiguration setting from the EMR cluster. I need to keep the SecurityConfiguration setting, so I am not sure what I need to change on the S3 bucket side to let the EMR cluster access the file for use as the mapper.
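(A hedged guess at narrowing this down, not something confirmed in this thread: if the SecurityConfiguration enforces S3 encryption with a KMS key, the cluster's EC2 instance profile also needs kms:Decrypt on that key, or reads of the object fail even though the plain S3 permissions look fine. A direct copy from a cluster node, run under the instance profile, separates the two cases:)
aws s3 cp s3://foobar/hadoop-samples/wordSplitter.py /tmp/wordSplitter.py
(If that copy fails with an AccessDenied or KMS error, the fix belongs in the role and key policies rather than in the streaming command.)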
Answer 0 (score 0):
For anyone still dealing with this, the following solved it for me:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://foobar/hadoop-samples/wordSplitter.py#wordSplitter.py -input s3://foobar/hadoop-samples/input -output s3://foobar/wordcount/output/ -mapper wordSplitter.py -reducer aggregate -verbose
This matches the Apache Hadoop streaming documentation: "The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file." Multiple files can be separated by commas.
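One detail that is easy to miss: the name after # is the exact, case-sensitive name the symlink gets on the Linux task nodes, and the -mapper value must match it exactly. A generic sketch with a hypothetical bucket and script name:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files s3://mybucket/scripts/MyMapper.py#mapper.py -mapper mapper.py -reducer aggregate -input s3://mybucket/input -output s3://mybucket/output -verbose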
I run streaming jobs on AWS EMR (v6.3.0) with hadoop streaming and the following arguments:
hadoop-streaming -files s3://archit47bucket/Step1RelativeFrequency/mapperwithlogs.py#mapper.py,s3://archit47bucket/Step1RelativeFrequency/reducerwithlogs.py#reducer.py -mapper mapper.py -reducer reducer.py -input s3://archit47bucket/Step1RelativeFrequency/Transaction.dat -output s3://archit47bucket/Step1RelativeFrequency/NOP1
For some reason, specifying an S3 location directly with -mapper does not seem to work, but the -files argument above solved my problem.
https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UseCase_Streaming.html