Airflow SparkSubmitOperator with a containerized Python application

Time: 2020-05-07 06:35:58

Tags: python apache-spark airflow spark-submit airflow-operator

We are trying to use Airflow's SparkSubmitOperator to run a containerized sample Python application, following the Python guide below and the SparkSubmitOperator example from @CTiPKA in the thread below.

We are able to run the application with the BashOperator, both outside of Airflow and inside the DAG. The problem with the SparkSubmitOperator comes from the "soft" link PYSPARK_PYTHON to the Python path of the packaged environment, which is set inline before the call to spark-submit.
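For context, the shell invocation that works looks roughly like the sketch below. The paths are taken from the DAG further down; the --master yarn flag and the exact argument order are assumptions.

# Sketch of the working shell call: PYSPARK_PYTHON is set inline before
# spark-submit. The --master yarn flag is an assumption on our part.
PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit \
    --master yarn \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python \
    --conf spark.yarn.appMasterEnv.NLTK_DATA=./ \
    --archives /somepath/airflow/workflows/nltk_app/nltk_env.zip#NLTK,/somepath/airflow/workflows/nltk_app/tokenizers.zip#tokenizers,/somepath/airflow/workflows/nltk_app/taggers.zip#taggers \
    /somepath/airflow/workflows/nltk_app/spark_nltk_sample.py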

We tried the following:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta

airflowConfig = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 12, 4),
}

dag = DAG(
    'nltk_app', default_args=airflowConfig, schedule_interval=timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

print_path_env_task = BashOperator(
    task_id='print_path_env',
    bash_command='echo $PATH',
    dag=dag)

spark_submit_task = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    application='/somepath/airflow/workflows/nltk_app/spark_nltk_sample.py',
    # Point PYSPARK_PYTHON at the interpreter inside the unpacked archive
    # (a relative path, resolved where the archive is extracted).
    env_vars={
        'PYSPARK_PYTHON': './NLTK/nltk_env/bin/python',
    },
    # Earlier attempts, left for reference:
    # spark_binary='spark-submit',
    # driver_class_path='PYSPARK_PYTHON=./NLTK/nltk_env/bin/python',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='2',
    name='nltk_app_in_dag',
    verbose=True,
    driver_memory='1g',
    conf={
        'spark.yarn.appMasterEnv.PYSPARK_PYTHON': './NLTK/nltk_env/bin/python',
        'spark.yarn.appMasterEnv.NLTK_DATA': './',
    },
    # Ship the packed conda env and the NLTK data to the cluster; the
    # '#NAME' fragment sets the directory name the archive is unpacked under.
    archives='/somepath/airflow/workflows/nltk_app/nltk_env.zip#NLTK,'
             '/somepath/airflow/workflows/nltk_app/tokenizers.zip#tokenizers,'
             '/somepath/airflow/workflows/nltk_app/taggers.zip#taggers',
    dag=dag,
)

t1.set_upstream(print_path_env_task)
spark_submit_task.set_upstream(t1)

The above results in an error, because exporting PYSPARK_PYTHON in the shell is not the same as passing the relative path inline, as in PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit .... Apparently spark-submit does some magic in this inline call that I still have not been able to crack.

I hope someone else has already tried this, has successfully used the SparkSubmitOperator with relative links of this kind, and can give us a hint on how to use them correctly. Falling back to the BashOperator when there is a dedicated SparkSubmitOperator seems unintuitive and a bit "hacky".
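For reference, the BashOperator fallback that does work for us looks roughly like the sketch below; it simply wraps the shell invocation shown earlier, and the exact command line is an assumption reconstructed from the same paths.

# Sketch of the working BashOperator fallback; the command line is an
# assumption, reconstructed from the shell invocation shown earlier.
spark_submit_bash_task = BashOperator(
    task_id='spark_submit_via_bash',
    bash_command=(
        'PYSPARK_PYTHON=./NLTK/nltk_env/bin/python spark-submit '
        '--master yarn '
        '--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python '
        '--conf spark.yarn.appMasterEnv.NLTK_DATA=./ '
        '--archives /somepath/airflow/workflows/nltk_app/nltk_env.zip#NLTK,'
        '/somepath/airflow/workflows/nltk_app/tokenizers.zip#tokenizers,'
        '/somepath/airflow/workflows/nltk_app/taggers.zip#taggers '
        '/somepath/airflow/workflows/nltk_app/spark_nltk_sample.py'
    ),
    dag=dag,
)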

0 Answers

There are no answers yet.