Question

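My question is about a DAG that dynamically defines a group of parallel tasks based on the number of rows in a MySQL table, and that table is deleted and rebuilt by the upstream tasks. The difficulty is that one of my upstream tasks TRUNCATEs this table to clear it before rebuilding it. That task is sherlock_join_and_export_task. When it runs, the row count drops to zero and my dynamically generated tasks cease to be defined. When the table is restored, the graph's structure is restored too, but the tasks no longer execute. Instead, they show up as black boxes in the tree view: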

After sherlock_join_and_export_task deletes the table referenced in the line count = worker.count_online_table(), the DAG looks like this:

After sherlock_join_and_export_task completes, this is what the DAG looks like:

None of these tasks are queued or executed, though. The DAG just keeps running and nothing happens.

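Is this a case where I should use a sub-DAG? Any insights on how to set that up, or on how to rewrite the existing DAG? I'm running this on AWS ECS with the LocalExecutor. Code below for reference: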

from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

BATCH_SIZE = 75000

from preprocessing.marketing.minimalist.table_builder import OnlineOfflinePreprocess

worker = OnlineOfflinePreprocess()

def partial_process_flow(batch_size, offset):
    worker = OnlineOfflinePreprocess()
    worker.import_offline_data()
    worker.import_online_data(batch_size, offset)
    worker.merge_aurum_to_sherlock()
    worker.upload_table('aurum_to_sherlock')

def batch_worker(batch_size, offset, DAG):
    return PythonOperator(
        task_id="{0}_{1}".format(offset, batch_size),
        python_callable=partial_process_flow,
        op_args=[batch_size, offset],
        dag=DAG)

DAG = DAG(
  dag_id='minimalist_data_preproc',
  start_date=datetime(2018, 1, 7, 2, 0, 0, 0), # EC2 time; equal to 11pm Mexico time
  max_active_runs=1,
  concurrency=4,
  schedule_interval='0 9 * * *', # 4am Mexico time
  catchup=False
)

clear_table_task = PythonOperator(
    task_id='clear_table_task',
    python_callable=worker.clear_marketing_table,
    op_args=['aurum_to_sherlock'],
    dag=DAG
)

sherlock_join_and_export_task = PythonOperator(
    task_id='sherlock_join_and_export_task',
    python_callable=worker.join_online_and_send_to_galileo,
    dag=DAG
)

sherlock_join_and_export_task >> clear_table_task

count = worker.count_online_table()
if count == 0:
    sherlock_join_and_export_task >> batch_worker(-99, -99, DAG) #..dummy task for when left join deleted
else:
    format_table_task = PythonOperator(
        task_id='format_table_task',
        python_callable=worker.format_final_table,
        dag=DAG
    )

    build_attributions_task = PythonOperator(
        task_id='build_attributions_task',
        python_callable=worker.build_attribution_weightings,
        dag=DAG
    )

    update_attributions_task = PythonOperator(
        task_id='update_attributions_task',
        python_callable=worker.update_attributions,
        dag=DAG
    )

    first_task = batch_worker(BATCH_SIZE, 0, DAG)
    clear_table_task >> first_task
    for offset in range(BATCH_SIZE, count, BATCH_SIZE):
        first_task >> batch_worker(BATCH_SIZE, offset, DAG) >> format_table_task

    format_table_task >> build_attributions_task >> update_attributions_task

Here is a simplified outline of what the DAG is doing:

...

def batch_worker(batch_size, offset, DAG):
    #..A function that dynamically generates tasks based on counting the reference table
    return dag_task

worker = ClassMethodsForDAG()
count = worker.method_that_counts_reference_table()

if count == 0:
    delete_and_rebuild_reference_table_task >> batch_worker(-99, -99, DAG) 
else:
    first_task = batch_worker(BATCH_SIZE, 0, DAG)
    clear_table_task >> first_task
    for offset in range(BATCH_SIZE, count, BATCH_SIZE):
        first_task >> batch_worker(BATCH_SIZE, offset, DAG) >> downstream_task
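
To make the failure mode concrete: the count is read when the DAG file is parsed, not when the tasks run, so the set of task ids depends on whatever the table holds at parse time. The sketch below mirrors the branching above without importing Airflow. planned_task_ids is a hypothetical helper introduced only for illustration; it is not part of the original DAG.

```python
BATCH_SIZE = 75000

def planned_task_ids(count, batch_size=BATCH_SIZE):
    """Return the batch task ids the DAG file would define for a given row count."""
    if count == 0:
        # Mirrors the dummy task created by batch_worker(-99, -99, DAG).
        return ["-99_-99"]
    # Mirrors first_task at offset 0 plus the loop over the remaining offsets.
    offsets = [0] + list(range(batch_size, count, batch_size))
    return ["{0}_{1}".format(offset, batch_size) for offset in offsets]

# Parsed while the table is populated: three batch tasks are defined.
print(planned_task_ids(200000))
# Parsed right after the TRUNCATE: only the dummy task is defined.
print(planned_task_ids(0))
```

A DAG run that started under the first set of ids and is re-parsed under the second no longer contains the tasks it was expected to execute, which is consistent with the behaviour described in the question.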

Answer 1

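I fought with this use case for a long time. In short, a DAG that is built from the state of a changing resource, especially a database table, does not fly well in Airflow.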

My solution was to write a small custom operator, a subclass of TriggerDagRunOperator, that runs the query and then triggers a DAG run for each subprocess.
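
As a rough illustration of that idea, the sketch below separates the "one trigger per batch" logic from Airflow itself. The names batch_confs and trigger_batches, the run id format, and the trigger callable are all assumptions made for this example; in a real operator the callable would wrap whatever DAG-run triggering mechanism the installed Airflow version provides.

```python
def batch_confs(count, batch_size):
    """Build one payload per batch; each would parameterize one triggered DAG run."""
    return [
        {"offset": offset, "batch_size": batch_size}
        for offset in range(0, count, batch_size)
    ]

def trigger_batches(count, batch_size, trigger):
    """Call `trigger` once per batch and return the run ids that were requested."""
    run_ids = []
    for conf in batch_confs(count, batch_size):
        # Hypothetical run id format, chosen only for readability.
        run_id = "batch_{offset}_{batch_size}".format(**conf)
        trigger(run_id, conf)
        run_ids.append(run_id)
    return run_ids

# Stand-in trigger that only records the requests instead of starting DAG runs.
requested = []
run_ids = trigger_batches(200000, 75000,
                          lambda run_id, conf: requested.append((run_id, conf)))
```

Because the row count is queried when the operator executes rather than when the DAG file is parsed, the structure of the parent DAG stays fixed no matter how many batches are needed.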

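It makes the "join" downstream of the process more interesting, but in my use case I was able to work around it with another DAG that polls and short-circuits unless all the subprocesses for a given day have completed. In other cases partition sensors can do the trick.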
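
The polling check can be reduced to a single predicate. This is a minimal sketch under stated assumptions: all_batches_done is a hypothetical helper, and the states mapping stands in for however the run states are actually looked up (for example a query against the metadata database). The state strings are illustrative.

```python
def all_batches_done(expected_run_ids, states):
    """Return True only when every expected run has finished successfully.

    `states` maps run id -> state string; a run missing from the mapping
    counts as not done.
    """
    return all(states.get(run_id) == "success" for run_id in expected_run_ids)

expected = ["batch_0_75000", "batch_75000_75000"]

# One batch still running: the downstream join should short-circuit.
print(all_batches_done(expected, {"batch_0_75000": "success",
                                  "batch_75000_75000": "running"}))
# Every batch succeeded: the downstream join may proceed.
print(all_batches_done(expected, {"batch_0_75000": "success",
                                  "batch_75000_75000": "success"}))
```

A predicate like this could serve as the condition of a short-circuiting task at the head of the joining DAG.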

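I have several use cases like this (iterative DAG triggers based on a dynamic source), and after a lot of fighting to make dynamic SubDAGs work (a lot), I switched to this "trigger subprocess" strategy and have been doing well since.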

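Note: this can create a large number of DAG runs for a single target DAG. That makes the UI challenging in some places, but it is workable (I have started querying the database directly, because I am not ready to write a plugin that handles the UI side).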

Answer 2

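Looking at your DAG, I think you have implemented a non-idempotent process, which Airflow is not really designed for. Instead of truncating and updating the table that the DAG definition is built on, you should probably leave the tasks configured and update only start_date / end_date to enable and disable them for scheduling at the task level. Alternatively, run all of the tasks on every iteration and have your script check the table, so that a disabled job just runs a "hello world".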
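
The second suggestion can be sketched as a fixed set of batch slots, each of which checks the table at run time and does nothing when its slot is not needed. This is a sketch under assumptions: MAX_BATCHES, run_batch_slot, and the count_rows and process callables are hypothetical names, and the upper bound would have to be chosen to cover the largest expected table.

```python
MAX_BATCHES = 10  # assumed upper bound on the number of batches ever needed

def run_batch_slot(slot, batch_size, count_rows, process):
    """Process the batch for `slot` only if the table currently has rows for it."""
    offset = slot * batch_size
    if offset >= count_rows():
        # Nothing to do for this slot on this run.
        return "skipped"
    process(batch_size, offset)
    return "processed"

# Stand-ins for the real row count and batch job.
processed_offsets = []
results = [
    run_batch_slot(slot, 75000,
                   count_rows=lambda: 200000,
                   process=lambda size, offset: processed_offsets.append(offset))
    for slot in range(4)
]
print(results)
```

With this layout the DAG file would define the same MAX_BATCHES tasks on every parse, so truncating the table changes what each task does but not which tasks exist.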

How to set up a DAG when downstream task definitions depend on upstream results
