Submitting a PySpark job to an Amazon EMR cluster from the terminal

Posted: 2020-06-17 08:14:53

Tags: pyspark amazon-emr spark-submit

I have connected to an Amazon EMR master node over SSH, and I want to submit a Spark job written in Python from the terminal (both the simple word-count script and sample.txt are already on the EMR server). How do I do this, and what is the syntax?

word_count.py is as follows:

from pyspark import SparkConf, SparkContext
from operator import add
import sys

## Constants
APP_NAME = "HelloWorld of Big Data"

## Other functions/classes

def main(sc, filename):
    textRDD = sc.textFile(filename)
    words = textRDD.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
    wordcount = words.reduceByKey(add).collect()
    for wc in wordcount:
        print(wc[0], wc[1])

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # With an s3a:// path, the S3A connector reads these property names
    # (the older fs.s3.awsAccessKeyId/awsSecretAccessKey keys apply to s3://)
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "XXXX")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YYYY")
    filename = "s3a://bucket_name/sample.txt"
    # filename = sys.argv[1]
    # Execute main functionality
    main(sc, filename)
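The flatMap/map/reduceByKey(add) pipeline above simply sums a count of 1 per word occurrence. As a quick sanity check of that logic, the same aggregation can be sketched in plain Python without Spark (the sample text here is made up for illustration):

```python
from collections import Counter

def word_count(text):
    # Split on single spaces, mirroring the split(' ') in the Spark job,
    # then count occurrences of each word.
    words = text.split(' ')
    return dict(Counter(words))

counts = word_count("big data hello big data")
print(counts)  # {'big': 2, 'data': 2, 'hello': 1}
```

Spark performs the same summation, but distributed: the map step emits (word, 1) pairs across partitions, and reduceByKey merges them with `add`.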

1 Answer:

Answer 0 (score: 0)

You can run the following command:

spark-submit s3://your_bucket/your_program.py

If you need to run the script with Python 3, run the following command before calling spark-submit:

export PYSPARK_PYTHON=python3.6

Remember to upload the program to the bucket before running spark-submit.
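Putting the answer's steps together, a typical session on the EMR master node might look like the sketch below. The bucket name and script name are placeholders, and this assumes the AWS CLI is available on the node (it is preinstalled on standard EMR images):

```shell
# Copy the script to S3 (alternatively, pass a local path to spark-submit)
aws s3 cp word_count.py s3://your_bucket/word_count.py

# Have PySpark use Python 3 for the driver and executors
export PYSPARK_PYTHON=python3

# Submit the job
spark-submit s3://your_bucket/word_count.py
```

If the script and input file already live on the master node, `spark-submit ./word_count.py` with a local input path works as well when the job runs with `local[*]` as its master.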