
Time: 2019-11-05 11:34:44

Tags: python airflow airflow-operator

How can I retrieve the yarn_application_id from a SparkSubmitHook? I tried using a custom operator and the task_instance property, but I think I am missing something...

from airflow import DAG


def task_failure_callback(context):
    task_instance = context.get('task_instance')  # Need to access yarn_application_id here
    operator = task_instance.operator
    application_id = operator.yarn_application_id
    return ...

default_args = {
    'start_date': ...,
    'on_failure_callback': task_failure_callback,
}

with DAG(DAG_ID, default_args=default_args, catchup=CATCHUP, schedule_interval=SCHEDULE_INTERVAL) as dag:
    ...


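# Import paths below assume Airflow 1.10, which was current when this was posted.
from airflow.contrib.hooks.spark_submit_hook import SparkSubmitHook
from airflow.exceptions import AirflowException
from airflow.utils.log.logging_mixin import LoggingMixin


class CustomSparkSubmitHook(SparkSubmitHook, LoggingMixin):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)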

    def submit_with_context(self, context, application="", **kwargs):
        # Build spark submit cmd
        # Run cmd as subprocess
        # Process spark submit log
        # Check spark-submit return code. In Kubernetes mode, also check the value
        # of exit code in the log, as it may differ.

        # We want the Airflow job to wait until the Spark driver is finished
        if self._should_track_driver_status:
            if self._driver_id is None:
                raise AirflowException(
                    "No driver id is known: something went wrong when executing "
                    "the spark submit command"
                )

            # We start with the SUBMITTED status as initial status
            self._driver_status = "SUBMITTED"

            # Attempt to export yarn_application_id through the context (unsuccessful)
            context['yarn_application_id'] = self.yarn_application_id

            # Start tracking the driver status (blocking function)

    @property
    def yarn_application_id(self):
        return self._yarn_application_id
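
The idea in the snippets above, capturing the YARN application ID while the job runs and reading it back in a failure callback, can be sketched without Airflow installed. This is only an illustration under stated assumptions: `FakeSparkTask` and `extract_yarn_application_id` are hypothetical names, the log line format is assumed, and the callback assumes the context exposes the operator object under a `task` key. Whether an attribute set during execution is still visible in a real Airflow callback depends on the Airflow version and is not verified here.

```python
import re
from typing import Iterable, Optional

# Assumed ID format, e.g. "application_1572950084000_0042".
_APP_ID_PATTERN = re.compile(r"(application_\d+_\d+)")


def extract_yarn_application_id(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first YARN application ID found in the log lines, if any."""
    for line in log_lines:
        match = _APP_ID_PATTERN.search(line)
        if match:
            return match.group(1)
    return None


class FakeSparkTask:
    """Hypothetical stand-in for an operator wrapping a spark-submit hook."""

    def __init__(self) -> None:
        self.yarn_application_id: Optional[str] = None

    def execute(self, log_lines: Iterable[str]) -> None:
        # Store the ID on the task object before the failure surfaces.
        self.yarn_application_id = extract_yarn_application_id(log_lines)
        raise RuntimeError("simulated Spark failure")


def task_failure_callback(context: dict) -> Optional[str]:
    # Read the ID from the task object (assumed to be under the "task" key).
    task = context.get("task")
    return getattr(task, "yarn_application_id", None)


# Usage: simulate a failing run, then call the callback with a fake context.
task = FakeSparkTask()
try:
    task.execute(["INFO Client: Submitted application application_1572950084000_0042"])
except RuntimeError:
    recovered_id = task_failure_callback({"task": task})
```

In this sketch the callback reads the ID from the task object rather than from a key written into the context dictionary, since values written into one context object are not guaranteed to appear in the context later passed to the callback.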

0 answers:
