Apache Airflow timeout error when dynamically creating tasks in a DAG

Date: 2019-07-02 18:53:57

Tags: python airflow directed-acyclic-graphs

In my old DAG, I created tasks like so:

    start_task = DummyOperator(task_id = "start_task")
    t1 = PythonOperator(task_id = "t1", python_callable = get_t1)
    t2 = PythonOperator(task_id = "t2", python_callable = get_t2)
    t3 = PythonOperator(task_id = "t3", python_callable = get_t3)
    t4 = PythonOperator(task_id = "t4", python_callable = get_t4)
    t5 = PythonOperator(task_id = "t5", python_callable = get_t5)
    t6 = PythonOperator(task_id = "t6", python_callable = get_t6)
    t7 = PythonOperator(task_id = "t7", python_callable = get_t7)
    t8 = PythonOperator(task_id = "t8", python_callable = get_t8)
    t9 = PythonOperator(task_id = "t9", python_callable = get_t9)
    t10 = PythonOperator(task_id = "t10", python_callable = get_t10)
    t11 = PythonOperator(task_id = "t11", python_callable = get_t11)
    end_task = DummyOperator(task_id = "end_task")
    start_task >> [t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11] >> end_task

Each task runs a different query, and all of the tasks run concurrently. I reworked my code because much of it was redundant and could be moved into functions. In my new code, I also attempt to create the tasks dynamically by reading each task's query and metadata from a .json file.
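
For reference, load_info() just needs to return the .json contents as a list of dicts, one per task; something like this sketch (the filename qc_checks.json is illustrative, not the real path, and the exact keys beyond column are assumptions):

    import json

    def load_info():
        # Illustrative path -- the real location of the metadata file isn't shown here.
        with open("qc_checks.json") as f:
            # Expected shape: a list of dicts, one per task, e.g.
            # [{"column": "t1", "query": "SELECT ..."}, ...]
            return json.load(f)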

New code:

    with DAG("dva_event_analysis_dag", default_args = DEFAULT_ARGS, schedule_interval = None, catchup = False) as dag:
        loaded_info = load_info()  # function call to load .json data into a list
        start_task = DummyOperator(task_id = "start_task")
        end_task = DummyOperator(task_id = "end_task")
        tasks = []  # empty list to append tasks to in for loop
        for x in loaded_info:
            qce = QCError(**x)
            id = qce.column
            task = PythonOperator(task_id = id, python_callable = create_task(qce))
            tasks.append(task)
        start_task >> tasks >> end_task

This new code looks fine, but it prevents me from running airflow initdb. After I run the command, the terminal just waits and never completes, until I eventually kill it with CTRL + C, at which point it finally gives me an error.

(Note: the query shown in the error output was just the first query from the .json.) Seeing as my previous DAG never ran into this error, I assume it is caused by the dynamic task creation, but I need help determining exactly what is causing it.

Things I have tried:

  • Running each query individually in the Airflow webserver Admin Ad-Hoc (they all run fine)
  • Writing a test script that runs locally and prints the contents of the .json, to verify that everything is formatted correctly, etc.

1 Answer:

Answer 0 (score: 1)

I managed to get airflow initdb to run (but I have not yet verified the job itself, and will update with its status later).

It turns out that you cannot include arguments when defining the PythonOperator the way I did before:

 task = PythonOperator(task_id = id, python_callable = create_task(qce))

Passing qce into create_task is what caused the error: python_callable expects a reference to a function, but create_task(qce) calls the function immediately, every time the DAG file is parsed, and hands its return value to the operator instead of giving Airflow something to call at execution time. To pass arguments into your tasks, see here.
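
A minimal standalone illustration of the difference, with toy names rather than my real DAG (Airflow 1.x imports):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def greet(name, **kwargs):
        print("hello", name)

    with DAG("toy_dag", start_date = datetime(2019, 7, 1), schedule_interval = None) as dag:
        # Wrong: greet("world") executes right now, while the file is being
        # parsed, and python_callable ends up as its return value (None).
        # bad = PythonOperator(task_id = "bad", python_callable = greet("world"))

        # Right: pass the function itself; Airflow calls it at execution time
        # with the arguments supplied through op_kwargs.
        good = PythonOperator(
            task_id = "good",
            python_callable = greet,
            op_kwargs = {"name": "world"},
        )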

For those of you who want to see the fix for my exact case, here it is:

with DAG("dva_event_analysis_dag", default_args = DEFAULT_ARGS, schedule_interval = None, catchup = False) as dag:
    loaded_info = load_info()
    start_task = DummyOperator(task_id = "start_task")
    end_task = DummyOperator(task_id = "end_task")
    tasks = []
    for x in loaded_info:
        id = x["column"]
        task = PythonOperator(task_id = id, provide_context = True, python_callable = create_task, op_kwargs = x)
        tasks.append(task)
    start_task >> tasks >> end_task
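
One note on the callable itself: with provide_context = True (Airflow 1.x), the task's context is passed to the function as extra keyword arguments on top of op_kwargs, so create_task has to accept both. Its body isn't shown above, but its signature would look something like this (the column and query parameter names are assumptions based on the .json keys):

    def create_task(column, query, **context):
        # column and query arrive via op_kwargs (one entry of the .json);
        # execution_date, ti, etc. arrive via **context because of
        # provide_context = True.
        print("running QC check for column", column)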

Update (7/03/2019): The job status is success. This was indeed the fix for my error. Hopefully this helps others who run into similar issues.