I need to submit a py file using the Apache Spark hidden REST API. When I follow the arturmkrtchyan tutorial, I cannot find any example or documentation on how to submit a py file.
Does anyone have any ideas? Is it possible to substitute a py file for the jar:
curl -X POST http://spark-cluster-ip:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "myAppArgument1" ],
  "appResource" : "file:/path/to/py/file/file.py",
  "clientSparkVersion" : "1.5.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.mycompany.MyJob",
  "sparkProperties" : {
    "spark.submit.pyFiles" : "/path/to/py/file/file.py",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://spark-cluster-ip:6066"
  }
}'
Or is there another way to do it?
Answer 0 (score: 3)
The approach is actually similar to the one described in the link you shared.
Here is an example.
Let's first define the Python script we need to run. I took the Spark Pi example, spark_pi.py:
from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    # Estimate Pi by sampling random points in the unit square
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
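If you want to sanity-check the script before going through the REST API, you can first run it with a plain spark-submit, for example locally (the trailing 10 is the optional partitions argument):

spark-submit --master "local[*]" /home/eliasah/Desktop/spark_pi.py 10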
Before running the job, you need to make sure that /tmp/spark-events already exists, since spark.eventLog.enabled is set and this is Spark's default event log directory.
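If it does not exist yet, you can create it up front, for example:

mkdir -p /tmp/spark-events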
Now you can submit the following request:
curl -X POST http://[spark-cluster-ip]:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "/home/eliasah/Desktop/spark_pi.py" ],
  "appResource" : "file:/home/eliasah/Desktop/spark_pi.py",
  "clientSparkVersion" : "2.2.1",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "org.apache.spark.deploy.SparkSubmit",
  "sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "Simple App",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://[spark-master]:6066"
  }
}'
As you can see, we provide the script's file path both as the application resource (appResource) and as the first application argument (appArgs).
PS: Replace [spark-cluster-ip] and [spark-master] with the values that correspond to your Spark cluster.
This will produce the following result:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20180522165321-0001",
  "serverSparkVersion" : "2.2.1",
  "submissionId" : "driver-20180522165321-0001",
  "success" : true
}
You can also check the Spark UI to monitor your job.
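If you prefer to monitor over REST as well, the same hidden API exposes a status endpoint that takes the submissionId returned above, for example:

curl http://[spark-master]:6066/v1/submissions/status/driver-20180522165321-0001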
To pass arguments to the input script, you can add them to the appArgs property:
"appArgs": [ "/home/eliasah/Desktop/spark_pi.py", "arg1" ]