Launching a Python application with spark-submit on AWS EMR

Asked: 2018-01-24 16:19:53

Tags: python apache-spark pyspark amazon-emr spark-submit

I'm new to Spark and unable to reproduce the example in the EMR documentation for submitting a basic user application with spark-submit via the AWS CLI. It appears to run without error but produces no output. Is there a problem with my add-steps syntax in the workflow below?

Example script

The goal is to count the words in a document stored in S3, in this case 1,000 words of lorem ipsum:

$ aws s3 cp s3://projects/wordcount/input/some_document.txt - | head -n1
Lorem ipsum dolor sit amet, consectetur adipiscing [... etc.]

Copied from the documentation, the Python script is as follows:

$ aws s3 cp s3://projects/wordcount/wordcount.py -
from __future__ import print_function
from pyspark import SparkContext
import sys
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: wordcount  ", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="WordCount")
    text_file = sc.textFile(sys.argv[1])
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()
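
As an aside not in the original post, the script can be smoke-tested against a local Spark installation before involving EMR; the file names here are illustrative:

$ spark-submit --master 'local[2]' wordcount.py input.txt output_dir
$ head output_dir/part-00000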

Target folder (currently empty):

$ aws s3 ls s3://projects/wordcount/output
                           PRE output/

Creating the cluster

The doc's create-cluster command works; I have a running cluster with logging enabled:

aws emr create-cluster --name TestSparkCluster \
--release-label emr-5.11.0 --applications Name=Spark \
--enable-debugging --log-uri s3://projects/wordcount/log \
--instance-type m3.xlarge --instance-count 3 --use-default-roles

The return message shows the cluster was created: {"ClusterId": "j-XXXXXXXXXXXXX"}
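
One optional check before adding steps, not part of the original workflow: confirm the cluster has reached the WAITING state:

$ aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Status.State' --output text
WAITING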

Adding the step

Following the example directly, I submit the add-steps call as:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://projects/wordcount/wordcount.py,\
s3://projects/wordcount/input/some_document.txt,\
s3://projects/wordcount/output/],\
ActionOnFailure=CONTINUE

This launches { "StepIds": ["s-YYYYYYYYYYY"] }
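
Rather than refreshing the console, the step's progress can also be watched from the CLI, an aside not in the original workflow:

$ aws emr describe-step --cluster-id j-XXXXXXXXXXXXX --step-id s-YYYYYYYYYYY \
    --query 'Step.Status.State' --output text
COMPLETED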

Question

The output folder is empty. Why?

In the EMR console I verified that the step s-YYYYYYYYYYY, identified as SparkWordCountApp, shows Status: Completed.

In the console, I check the controller log file and stderr output (controller log shown below) to verify that the step completed with exit status 0.

In the Spark documentation, a slightly different syntax is used. Rather than naming the script as the first positional of the argument list, it says:

    For Python applications, simply pass a .py file in the place of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

However, the example reads its arguments with sys.argv, where sys.argv[0] is wordcount.py itself.
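
For what it's worth, the two syntaxes agree: the .py file takes the place of the JAR, and everything after it becomes the application's arguments. A concrete mapping, illustrative rather than taken from the docs:

$ spark-submit wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/
# inside wordcount.py:
#   sys.argv[0] == 'wordcount.py'                                     (the script itself)
#   sys.argv[1] == 's3://projects/wordcount/input/some_document.txt'  (input)
#   sys.argv[2] == 's3://projects/wordcount/output/'                  (output)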

Additional info: logs

Controller log file:

2018-01-24T15:54:05.945Z INFO Ensure step 3 jar file command-runner.jar
2018-01-24T15:54:05.945Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-west-2
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-172-31-12-232
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-29XVS3IGMSK1
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1
INFO ProcessRunner started child process 20797 :
hadoop   20797  3347  0 15:54 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/
2018-01-24T15:54:09.956Z INFO HadoopJarStepRunner.Runner: startRun() called for s-29XVS3IGMSK1 Child Pid: 20797
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 16 seconds
2018-01-24T15:54:24.072Z INFO Step created jobs: 
2018-01-24T15:54:24.072Z INFO Step succeeded with exitCode 0 and took 16 seconds

2 Answers:

Answer 0 (score: 2)

It turns out the problem was caused by the target folder already existing, even though it was empty. Deleting the output folder made the example work.
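
Concretely, the fix before each re-run, using the paths from the question, amounts to removing the output prefix:

$ aws s3 rm s3://projects/wordcount/output/ --recursive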

I figured this out by reading the step logs in S3 rather than the instance logs in the EMR console; in those logs I saw org.apache.hadoop.mapred.FileAlreadyExistsException, which pointed me to the cause.

Pre-existing S3 folders have not been a problem for other write tasks I've done (e.g. PigStorage), so I wasn't expecting this.
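
Note that the RDD saveAsTextFile API has no overwrite option. If overwriting is preferable to deleting the folder by hand, one alternative, sketched here rather than taken from the example above, is the DataFrame writer and its save mode:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
lines = spark.read.text(sys.argv[1])             # one row per line, in column 'value'
counts = (lines.rdd
          .flatMap(lambda row: row.value.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
          .toDF(["word", "count"]))
counts.write.mode("overwrite").csv(sys.argv[2])  # succeeds even if the folder exists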

I'm posting this answer in the (unlikely) event that anyone else runs into it.

Answer 1 (score: 0)

The following add-steps command worked for me, run from the master node:

aws emr add-steps --cluster-id yourclusterid \
--steps Type=Spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://yourbucketname/yourcode.py,\
s3://yourbucketname/yourinputfile.txt,\
s3://yourbucketname/youroutputfile.out],\
ActionOnFailure=CONTINUE
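
Once the step finishes, saveAsTextFile leaves the results as part files under the output prefix; assuming the placeholder paths above, they can be inspected with:

$ aws s3 ls s3://yourbucketname/youroutputfile.out/
$ aws s3 cp s3://yourbucketname/youroutputfile.out/part-00000 - | head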