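I'm new to Spark and can't reproduce the example in the EMR documentation for submitting a basic user application with spark-submit via the AWS CLI. It seems to run without error but produces no output. Is there something wrong with the syntax I'm using for add-steps in the workflow below?
The goal is to count the words in a document in S3, in this case 1000 words of lorem ipsum: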
$ aws s3 cp s3://projects/wordcount/input/some_document.txt - | head -n1
Lorem ipsum dolor sit amet, consectetur adipiscing [... etc.]
Copied from the docs, the Python script is as follows:
$ aws s3 cp s3://projects/wordcount/wordcount.py -
from __future__ import print_function
from pyspark import SparkContext
import sys

if __name__ == "__main__":
    # Expect exactly two arguments: the input path and the output path.
    if len(sys.argv) != 3:
        print("Usage: wordcount ", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="WordCount")
    text_file = sc.textFile(sys.argv[1])
    # Split each line on spaces, emit (word, 1) pairs, then sum the counts per word.
    counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()
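As a rough local check of what the script computes, the same pipeline can be sketched in plain Python without Spark. The helper name count_words is illustrative only and does not come from the EMR docs; the sketch assumes the input fits in memory.

```python
from collections import Counter


def count_words(lines):
    """Mimic flatMap(split(" ")) -> map((word, 1)) -> reduceByKey(add) locally."""
    counts = Counter()
    for line in lines:
        # Same tokenization as the Spark script: split on single spaces only,
        # so punctuation stays attached and repeated spaces yield empty strings.
        for word in line.split(" "):
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = ["Lorem ipsum dolor sit amet,", "lorem ipsum"]
    print(count_words(sample))
```

Because the split is on a single space and is case-sensitive, "Lorem" and "lorem" are counted separately, just as they would be in the Spark job.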
The destination folder (currently empty):
$ aws s3 ls s3://projects/wordcount/output
PRE output/
Following the docs, I have a running cluster with logging enabled:
aws emr create-cluster --name TestSparkCluster \
--release-label emr-5.11.0 --applications Name=Spark \
--enable-debugging --log-uri s3://projects/wordcount/log \
--instance-type m3.xlarge --instance-count 3 --use-default-roles
The return message shows that it was created: {"ClusterId": "j-XXXXXXXXXXXXX"}
Following the example directly, I submitted the add-steps call as:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://projects/wordcount/wordcount.py,\
s3://projects/wordcount/input/some_document.txt,\
s3://projects/wordcount/output/],\
ActionOnFailure=CONTINUE
This launches { "StepIds":["s-YYYYYYYYYYY"] }
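For comparison, the same step could be expressed programmatically. This is only a sketch: it assumes boto3 is installed and that emr_client is boto3.client("emr"); the function names build_spark_step and submit_step are hypothetical. The command-runner.jar and spark-submit arguments mirror what the step log further down shows EMR running for the CLI call above.

```python
def build_spark_step(script_uri, input_uri, output_uri, name="SparkWordCountApp"):
    """Return a step definition mirroring the CLI call above (a sketch)."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar is what the step log shows EMR invoking.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                "--conf", "spark.yarn.submit.waitAppCompletion=false",
                "--num-executors", "2",
                "--executor-cores", "2",
                "--executor-memory", "1g",
                script_uri, input_uri, output_uri,
            ],
        },
    }


def submit_step(emr_client, cluster_id, step):
    """Submit one step; emr_client is assumed to be boto3.client("emr")."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]
```

Note that with spark.yarn.submit.waitAppCompletion=false in cluster mode, spark-submit returns once the application has been submitted, so the step's exit status may not reflect whether the Spark application itself succeeded.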
The output folder is empty. Why?
I verified in the EMR console that step s-YYYYYYYYYYY, identified as SparkWordCountApp, has Status:Completed.
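In the console, I checked the controller log file and the stderr output (shown below) to verify that the step completed with exit status 0.
The Spark documentation uses a slightly different syntax. Rather than naming the script as the first position of the argument list, it says:
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.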
However, the example uses sys.argv, where sys.argv[0] is wordcount.py.
Controller log file:
Stderr log file:
2018-01-24T15:54:05.945Z INFO Ensure step 3 jar file command-runner.jar
2018-01-24T15:54:05.945Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=us-west-2
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
PYTHON_INSTALL_LAYOUT=amzn
HOSTNAME=ip-172-31-12-232
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-29XVS3IGMSK1
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1
INFO ProcessRunner started child process 20797 :
hadoop 20797 3347 0 15:54 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/
2018-01-24T15:54:09.956Z INFO HadoopJarStepRunner.Runner: startRun() called for s-29XVS3IGMSK1 Child Pid: 20797
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 16 seconds
2018-01-24T15:54:24.072Z INFO Step created jobs:
2018-01-24T15:54:24.072Z INFO Step succeeded with exitCode 0 and took 16 seconds
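Answer 0 (score: 2)
It turns out the problem was caused by the destination folder already existing (even though it was empty). Deleting the output folder made the example work.
I figured this out by reading the step logs in S3 rather than the instance logs in the EMR console. In those logs I saw org.apache.hadoop.mapred.FileAlreadyExistsException, which clued me in.
Pre-existing S3 folders have not been a problem for other write tasks I've done (e.g. PigStorage), so I wasn't expecting this.
I'll leave this question up in the (unlikely) event that someone else runs into the same thing.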
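One way to guard against this before submitting a step is to check whether anything already lives under the output prefix. The following is a minimal sketch, assuming boto3 is installed and that s3_client is boto3.client("s3"); the helper names split_s3_uri and output_exists are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")


def output_exists(s3_client, output_uri):
    """True if any object (including a folder placeholder) is under the prefix."""
    bucket, prefix = split_s3_uri(output_uri)
    # MaxKeys=1 keeps the call cheap: one matching key is enough to answer.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0
```

Usage would look like output_exists(boto3.client("s3"), "s3://projects/wordcount/output/"). Keep the trailing slash on the URI, since a bare prefix such as wordcount/output would also match unrelated keys like wordcount/output2. If the check returns True, either delete the prefix or point the job at a fresh output path.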
Answer 1 (score: 0)
The following add-steps call worked for me, run from the master node:
aws emr add-steps --cluster-id yourclusterid --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,2,--executor-cores,2,--executor-memory,1g,s3://yourbucketname/yourcode.py,s3://yourbucketname/yourinputfile.txt,s3://yourbucketname/youroutputfile.out],ActionOnFailure=CONTINUE