We are trying to use Airflow's SparkSubmitOperator with a containerized example Python application (following the spark-submit example from @CTiPKA). We were able to run the application both outside of Airflow and inside a DAG using the BashOperator. The problem with the SparkSubmitOperator comes from invoking PYSPARK_PYTHON, the relative ("soft") link to the Python path of the packaged environment, which the plain spark-submit call used before.
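For context, the invocation that works for us outside of Airflow looks roughly like the following; --master yarn is inferred from the spark.yarn.* settings we use, and the other flags are reconstructed from the DAG further down, so treat it as a sketch rather than the exact command:

PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit \
    --master yarn \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python \
    --conf spark.yarn.appMasterEnv.NLTK_DATA=./ \
    --archives /somepath/airflow/workflows/nltk_app/nltk_env.zip#NLTK,/somepath/airflow/workflows/nltk_app/tokenizers.zip#tokenizers,/somepath/airflow/workflows/nltk_app/taggers.zip#taggers \
    /somepath/airflow/workflows/nltk_app/spark_nltk_sample.py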
We tried the following:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import bdap
airflowConfig = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 12, 4),
}

dag = DAG(
    'nltk_app', default_args=airflowConfig, schedule_interval=timedelta(1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

print_path_env_task = BashOperator(
    task_id='print_path_env',
    bash_command='echo $PATH',
    dag=dag)

spark_submit_task = SparkSubmitOperator(
    # spark_binary='spark-submit',
    # driver_class_path='PYSPARK_PYTHON=./NLTK/nltk_env/bin/python',
    env_vars={
        'PYSPARK_PYTHON': './NLTK/nltk_env/bin/python',
    },
    task_id='spark_submit_job',
    conn_id='spark_default',
    application='/somepath/airflow/workflows/nltk_app/spark_nltk_sample.py',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='2',
    name='nltk_app_in_dag',
    verbose=True,
    driver_memory='1g',
    conf={
        'spark.yarn.appMasterEnv.PYSPARK_PYTHON': './NLTK/nltk_env/bin/python',
        'spark.yarn.appMasterEnv.NLTK_DATA': './'
    },
    archives='/somepath/airflow/workflows/nltk_app/nltk_env.zip#NLTK,'
             '/somepath/airflow/workflows/nltk_app/tokenizers.zip#tokenizers,'
             '/somepath/airflow/workflows/nltk_app/taggers.zip#taggers',
    dag=dag,
)
t1.set_upstream(print_path_env_task)
spark_submit_task.set_upstream(t1)
The above results in an error, because exporting PYTHON_PATH in the shell is not the same as calling PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit inline with the relative path. Apparently spark-submit does some magic in that inline invocation which I still have not managed to figure out.
I am hoping that someone else has already tried this, has successfully used the SparkSubmitOperator with relative links like these, and can give us a hint on how to use them correctly. Falling back to the BashOperator feels unintuitive and a bit "hacky" when there is a dedicated SparkSubmitOperator.
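For completeness, the BashOperator workaround that does run inside the DAG is essentially just the command above wrapped in a task; the task name here is a placeholder we made up, and the duplicated command string is exactly what feels redundant next to the dedicated operator:

# Works, but duplicates the whole spark-submit command line as a string.
# Task id is a placeholder; the command is the same invocation shown earlier.
bash_spark_submit_task = BashOperator(
    task_id='bash_spark_submit_job',
    bash_command=(
        'PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit '
        '--master yarn '
        '--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python '
        '--conf spark.yarn.appMasterEnv.NLTK_DATA=./ '
        '--archives /somepath/airflow/workflows/nltk_app/nltk_env.zip#NLTK,'
        '/somepath/airflow/workflows/nltk_app/tokenizers.zip#tokenizers,'
        '/somepath/airflow/workflows/nltk_app/taggers.zip#taggers '
        '/somepath/airflow/workflows/nltk_app/spark_nltk_sample.py'
    ),
    dag=dag,
)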