How to set SPARK_MAJOR_VERSION and HADOOP_USER_NAME in the Airflow SparkSubmitOperator?

Date: 2019-03-05 18:58:35

Tags: apache-spark airflow

I am trying to call spark-submit through the SparkSubmitOperator, and I have to set SPARK_MAJOR_VERSION and HADOOP_USER_NAME before the spark-submit is executed. Can someone help me?

I am trying to run in YARN mode and have passed env_vars, but SPARK_MAJOR_VERSION is still not set. Log output:

[2019-03-11 21:07:03,525] {base_hook.py:83} INFO - Using connection to: id: spark_default. Host: yarn://XXXX, Port: 8088, Schema: None, Login: peddnade, Password: XXXXXXXX, extra: {u'queue': u'priority', u'namespace': u'default', u'spark-home': u'/usr/'}
[2019-03-11 21:07:03,526] {logging_mixin.py:95} INFO - [2019-03-11 21:07:03,526] {spark_submit_hook.py:283} INFO - Spark-Submit cmd: [u'/usr/bin/spark-submit', '--master', 'yarn://XX:8088', '--conf', 'spark.dynamicAllocation.enabled=true', '--conf', 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1', '--conf', 'spark.app.name=RDM', '--conf', 'spark.yarn.queue=priority', '--conf', 'spark.shuffle.service.enabled=true', '--conf', 'spark.yarn.appMasterEnv.SPARK_MAJOR_VERSION=2', '--conf', 'spark.yarn.appMasterEnv.HADOOP_USER_NAME=ppeddnade', '--files', '/usr/hdp/current/spark-client/conf/hive-site.xml', '--jars', '/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar', '--num-executors', '4', '--total-executor-cores', '4', '--executor-cores', '4', '--executor-memory', '5g', '--driver-memory', '10g', '--name', u'airflow-spark-example', '--class', 'com.hilton.eim.job.SubmitSparkJob', '--queue', u'priority', '/home/ppeddnade/XX.jar', u'XX']
[2019-03-11 21:07:03,542] {logging_mixin.py:95} INFO - [2019-03-11 21:07:03,542] {spark_submit_hook.py:415} INFO - Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
[2019-03-11 21:07:03,542] {logging_mixin.py:95} INFO - [2019-03-11 21:07:03,542] {spark_submit_hook.py:415} INFO - Spark1 will be picked by default
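Judging from the "Spark-Submit cmd" line above, env_vars in YARN mode appear to be translated into --conf spark.yarn.appMasterEnv.KEY=VALUE flags, so they reach the YARN Application Master rather than the local spark-submit wrapper. For context, a minimal sketch of a task definition that would produce a command like the one logged (paths, class name, and connection id are taken from the log; everything else is assumed):

```python
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Sketch of a task matching the logged spark-submit command (Airflow 1.x).
spark_task = SparkSubmitOperator(
    task_id="submit_spark_job",
    conn_id="spark_default",
    application="/home/ppeddnade/XX.jar",
    java_class="com.hilton.eim.job.SubmitSparkJob",
    name="airflow-spark-example",
    num_executors=4,
    executor_cores=4,
    executor_memory="5g",
    driver_memory="10g",
    # In YARN mode these become --conf spark.yarn.appMasterEnv.<KEY>=<VALUE>,
    # which is exactly what appears in the log above.
    env_vars={
        "SPARK_MAJOR_VERSION": "2",
        "HADOOP_USER_NAME": "ppeddnade",
    },
)
```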

1 answer:

Answer 0: (score: 0)

SparkSubmitOperator provides an env_vars parameter for setting your environment variables (it is also available in SparkSubmitHook):

    :param env_vars: Environment variables for spark-submit. It
                     supports yarn and k8s mode too. (templated)
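To make the "yarn and k8s" part concrete, here is a paraphrased sketch (not verbatim library code) of how the hook turns env_vars into spark-submit flags. The YARN branch matches the --conf entries in the question's log; the k8s property name and the client-mode fallback are assumptions based on the hook's source:

```python
def _env_vars_to_conf_flags(env_vars, is_yarn, is_kubernetes):
    """Paraphrased sketch of SparkSubmitHook's env_vars handling."""
    flags = []
    if is_yarn:
        template = "spark.yarn.appMasterEnv.{}={}"      # seen in the question's log
    elif is_kubernetes:
        template = "spark.kubernetes.driverEnv.{}={}"   # assumption from hook source
    else:
        # Assumption: in standalone/local client mode the variables are
        # exported into the spark-submit subprocess environment instead.
        return flags
    for key, value in env_vars.items():
        flags += ["--conf", template.format(key, value)]
    return flags
```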


You can try to infer its usage from test_spark_submit_hook.py:

hook = SparkSubmitHook(conn_id='spark_standalone_cluster_client_mode',
                       env_vars={"bar": "foo"})

Even though you haven't asked for it, you may also want to perform spark-submit against a remote cluster; for that, have a look at the available options.
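
If you go that route, the connection itself can carry the cluster details. A sketch of registering one programmatically (host, port, and extras mirror the shape of the spark_default connection in the question's log; the conn_id and host are made up):

```python
from airflow import settings
from airflow.models import Connection

# Sketch: a Spark connection pointing at a remote YARN master (example values).
conn = Connection(
    conn_id="spark_remote_yarn",
    conn_type="spark",
    host="yarn://remote-master-host",
    port=8088,
    extra='{"queue": "priority", "spark-home": "/usr/"}',
)
session = settings.Session()
session.add(conn)
session.commit()
```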