I have connected to an Amazon EMR master node over SSH, and I want to submit a Spark job written in Python from the terminal (both the simple word-count file and sample.txt are already on the EMR server). How do I do this? What is the syntax?
word_count.py is as follows:
from pyspark import SparkConf, SparkContext
from operator import add
import sys

## Constants
APP_NAME = "HelloWorld of Big Data"

## OTHER FUNCTIONS/CLASSES
def main(sc, filename):
    textRDD = sc.textFile(filename)
    words = textRDD.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
    wordcount = words.reduceByKey(add).collect()
    for wc in wordcount:
        print(wc[0], wc[1])

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # For s3a:// paths the s3a property names must be used;
    # fs.s3.awsAccessKeyId/awsSecretAccessKey only apply to s3:// URLs.
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "XXXX")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YYYY")
    filename = "s3a://bucket_name/sample.txt"
    # filename = sys.argv[1]
    # Execute main functionality
    main(sc, filename)
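For reference, the flatMap/map/reduceByKey pipeline above computes an ordinary word count. A minimal plain-Python equivalent (no Spark required; the two-line sample text is a made-up stand-in for sample.txt) shows what each stage does:

```python
from collections import Counter

# Hypothetical in-memory stand-in for the lines of sample.txt.
lines = ["hello big data", "hello spark"]

# flatMap(lambda x: x.split(' ')): split every line on spaces
# into one flat stream of words.
words = [w for line in lines for w in line.split(' ')]

# map(lambda x: (x, 1)) followed by reduceByKey(add): count
# the occurrences of each distinct word.
wordcount = Counter(words)

for word, count in wordcount.items():
    print(word, count)
```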
Answer 0 (score: 0)
You can run the following command:
spark-submit s3://your_bucket/your_program.py
If you need to run the script with Python 3, run the following command before calling spark-submit:
export PYSPARK_PYTHON=python3.6
Remember to upload the program to the bucket before running spark-submit.
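Putting the answer together, a minimal sketch of the full workflow on the EMR master node. The bucket and script names are placeholders; the upload and submit steps are composed into a printed command here because actually running them requires a live cluster with the AWS CLI and spark-submit on the PATH:

```shell
# Use Python 3 for the driver and executors (set before spark-submit).
export PYSPARK_PYTHON=python3.6

# 1) Upload the script to the bucket; 2) submit it — spark-submit on EMR
# can read the application directly from S3.
CMD="aws s3 cp word_count.py s3://your_bucket/word_count.py && spark-submit s3://your_bucket/word_count.py"
echo "$CMD"
```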