Airflow DAG - tasks immediately go into the "up_for_retry" state ("start_date" is 1 day ago)

Asked: 2021-05-17 09:18:41

Tags: celery airflow airflow-scheduler

I don't know whether I am missing some knowledge of the Airflow scheduler, or whether this is a potential bug in Airflow.

Here is the situation:

  • My DAG's start date is set to "start_date": airflow.utils.dates.days_ago(1), (a sketch of what this evaluates to follows the task instance details below)
  • I uploaded the DAG to the folder that Airflow scans for DAGs
  • I then switched the DAG on (it is "off" by default)
  • The tasks in the pipeline immediately go into "up_for_retry", and you cannot actually see what was attempted before.
  • Airflow version info: Version: 1.10.14. It runs on Kubernetes on Azure
  • Celery executor with Redis
  • The task instance details are listed below:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency | Reason
Task Instance State | Task is in the 'up_for_retry' state which is not a valid state for execution. The task must be cleared in order to be run.
Not In Retry Period | Task is not ready for retry yet but will be retried automatically. Current date is 2021-05-17T09:06:57.239015+00:00 and task will be retried at 2021-05-17T09:09:50.662150+00:00.
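
For reference, a minimal sketch of what that dynamic start_date evaluates to, based on my understanding of days_ago() and the 1.10.x catchup default (the printed value is illustrative):

import airflow.utils.dates

# days_ago(1) resolves to midnight (00:00 UTC) of the previous day each time
# the DAG file is parsed, e.g. 2021-05-16T00:00:00+00:00 on the day the DAG
# was switched on.
print(airflow.utils.dates.days_ago(1))

# With schedule_interval="0 3 * * *" and catchup enabled (the 1.10.x default),
# the interval that started yesterday at 03:00 has already closed, so the
# scheduler queues a run for it as soon as the DAG is unpaused.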

Am I missing something that would tell me whether this is a bug or expected behaviour?

In addition, here is the DAG definition, as requested.

import airflow
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.models import Variable

dag_args = {
    "owner": "our_project_team_name",
    "retries": 1,
    "email": ["ouremail_address_replaced_by_this_string"],
    "email_on_failure": True,
    "email_on_retry": True,
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
}
# Implement cluster reuse on Databricks, pick from light, medium, heavy cluster type based on workloads
clusters = Variable.get("our_project_team_namejob_cluster_config", deserialize_json=True)

databricks_connection = "our_company_databricks"
adl_connection = "our_company_wasb"

pipeline_name = "process_our_data_from_boomi"

dag = DAG(dag_id=pipeline_name, default_args=dag_args, schedule_interval="0 3 * * *")

notebook_dir = "/Shared/our_data_name"
lib_path_sub = ""
lib_name_dev_plus_branch = ""
atlas_library = {
    "whl": f"dbfs:/python-wheels/atlas{lib_path_sub}/atlas_library-0{lib_name_dev_plus_branch}-py3-none-any.whl"
}

create_our_data_name_source_data_from_boomi_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_source_data_from_boomi",
        "base_parameters": {"Extraction_date": "{{ ds_nodash  }}"},
    },
}

create_our_data_name_standardized_table_from_source_xml_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_standardized_table_from_source_xml",
        "base_parameters": {"Extraction_date": "{{ ds_nodash  }}"},
    },
}

create_our_data_name_enriched_table_from_standardized_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_enriched",
        "base_parameters": {"Extraction_date": "{{ ds_nodash  }}"},
    },
}

layer_1_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_source",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_source_data_from_boomi_notebook_params,
    libraries=[atlas_library],
)

layer_2_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_standardized",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_standardized_table_from_source_xml_notebook_params,
    libraries=[
        {"maven": {"coordinates": "com.databricks:spark-xml_2.11:0.5.0"}},
        {"pypi": {"package": "inflection"}},
        atlas_library,
    ],
)

layer_3_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_enriched",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_enriched_table_from_standardized_notebook_params,
    libraries=[atlas_library],
)

layer_1_task >> layer_2_task >> layer_3_task

1 Answer:

Answer 0 (score: 0)

After getting some help from @AnandVidvat on trying a retries=0 experiment and on swapping the operators for DummyOperator or PythonOperator, I can confirm that the problem is not related to the DatabricksOperator or to Airflow version 1.10.x, i.e. it is not an Airflow bug.
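
As a rough illustration, the retries=0 / DummyOperator experiment boiled down to a stripped-down DAG along these lines (the dag_id and task_id here are made up; the point is that retries is 0 and the Databricks operator is replaced by a no-op):

import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

debug_args = {
    "owner": "min_test",
    # retries disabled, so the very first failure (if any) is no longer
    # masked by an automatic retry
    "retries": 0,
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
}

debug_dag = DAG(
    dag_id="min_test_debug_retries_0",  # hypothetical dag_id
    default_args=debug_args,
    schedule_interval="0 3 * * *",
)

# no-op task standing in for DatabricksSubmitRunOperator
noop = DummyOperator(task_id="noop", dag=debug_dag)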

To summarise: when the DAG has meaningful operators, my setup fails on the first execution without leaving any task log, and then works fine during the retry (the task log hides the fact that it was retried, because the failure produced no log).

To reduce the total run time, the workaround/hotfix until the real cause is found is to set retry_delay to 10 seconds, as sketched below (the default is 5 minutes, which makes the DAG run unnecessarily long).
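
A minimal sketch of that hotfix, assuming the only change to the original default_args is adding the retry_delay key:

from datetime import timedelta

dag_args = {
    # ... the other defaults stay exactly as in the original DAG ...
    "retries": 1,
    # shorten the gap between the silent first failure and the retry;
    # the default retry_delay of 5 minutes only stretches the total run time
    "retry_delay": timedelta(seconds=10),
}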

The next step is to find out what causes the first failure by checking the logs on the scheduler or worker pods in our current setup (Azure K8s, PostgreSQL, Redis, Celery executor).

P.S. I used the DAG below for testing and to reach this conclusion.

import airflow
from airflow import DAG

from airflow.operators.python_operator import PythonOperator
import time
from pprint import pprint

dag_args = {
    "owner": "min_test",
    "retries": 1,
    "email": ["c243d70b.domain.onmicrosoft.com@emea.teams.ms"],
    "email_on_failure": True,
    "email_on_retry": True,
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
}

pipeline_name = "min_test_debug_airflow_baseline_PythonOperator_1_retry"

dag = DAG(
    dag_id=pipeline_name,
    default_args=dag_args,
    schedule_interval="0 3 * * *",
    tags=["min_test_airflow"],
)


def my_sleeping_function(random_base):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)


def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return "Whatever you return gets printed in the logs"


run_this = PythonOperator(
    task_id="print_the_context",
    provide_context=True,
    python_callable=print_context,
    dag=dag,
)

# Generate 3 sleeping tasks, sleeping 0, 0.1 and 0.2 seconds respectively
for i in range(3):
    task = PythonOperator(
        task_id="sleep_for_" + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={"random_base": float(i) / 10},
        dag=dag,
    )

    task.set_upstream(run_this)