Question

I have a simple DAG of three operators. The first one is a PythonOperator with our own functionality; the other two are standard operators from airflow.contrib (FileToGoogleCloudStorageOperator and GoogleCloudStorageToBigQueryOperator). They run in sequence. Our custom task produces a number of files, typically between 2 and 5, depending on the parameters. Each of these files has to be processed separately by the subsequent tasks. That means I want several downstream branches, but how many exactly is not known before the DAG runs.

How would you approach this problem?

Update:

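Using the BranchPythonOperator that jhnclvr mentioned in another reply as a point of departure, I created an operator that skips or continues executing a branch depending on a condition. This approach is feasible only because the highest possible number of branches is known and sufficiently small.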

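The operator:

from datetime import datetime

from airflow import settings
from airflow.models import TaskInstance
from airflow.operators.python_operator import PythonOperator
from airflow.utils.state import State


class SkipOperator(PythonOperator):
    """Marks the direct downstream tasks as skipped when the callable returns False."""

    def execute(self, context):
        # Run the wrapped python_callable and capture its return value.
        boolean = super(SkipOperator, self).execute(context)
        session = settings.Session()
        for task in context['task'].downstream_list:
            if boolean is False:
                # Record a SKIPPED task instance for this run's execution date.
                ti = TaskInstance(
                    task, execution_date=context['ti'].execution_date)
                ti.state = State.SKIPPED
                ti.start_date = datetime.now()
                ti.end_date = datetime.now()
                session.merge(ti)
        session.commit()
        session.close()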
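As a rough usage sketch, the condition handed to such a gate operator could be a plain predicate that answers "did the upstream task produce a file for branch i?". The code below is only an illustration in pure Python, not the original poster's usage code: `MAX_BRANCHES`, `branch_has_work`, and `plan_branches` are hypothetical names, and the bound of 5 is an assumption taken from the "between 2 and 5" files mentioned in the question. In a DAG, one gate per pre-declared branch would presumably receive the predicate as its `python_callable`, with the branch index passed through `op_args`.

```python
MAX_BRANCHES = 5  # assumption: upper bound on the files the first task can emit


def branch_has_work(index, file_names):
    """Hypothetical predicate for the gate in front of branch `index`.

    Returns False when fewer than index + 1 files were produced, which is
    the signal a skip-style gate would use to skip its downstream tasks.
    """
    return index < len(file_names)


def plan_branches(file_names, max_branches=MAX_BRANCHES):
    """Map every pre-declared branch to its file, or to None if it is skipped."""
    if len(file_names) > max_branches:
        # The fixed-branch approach cannot handle more files than branches.
        raise ValueError("more files than pre-declared branches")
    return {
        i: file_names[i] if branch_has_work(i, file_names) else None
        for i in range(max_branches)
    }


if __name__ == "__main__":
    # Three files produced: branches 0-2 run, branches 3-4 are skipped.
    print(plan_branches(["a.csv", "b.csv", "c.csv"]))
```

The design trade-off is that every possible branch has to be declared up front, so the unused ones show up as skipped in each run.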

Answer 1

See a similar (but different) question here

Basically, you cannot add tasks to a DAG while it is running. You need to know ahead of time how many tasks you want to add.

You could process the N files using a single operator.
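A minimal sketch of the single-operator idea, assuming the per-file work can be expressed as one Python function: the callable below loops over however many files exist at run time, so the DAG shape never changes. `process_all_files` and `process_one` are hypothetical names, not part of Airflow; a function like this could serve as the `python_callable` of a single PythonOperator.

```python
def process_all_files(file_paths, process_one):
    """Apply `process_one` to every path and return {path: result}.

    All files are attempted before failing, so one bad file does not hide
    problems in the others; any failure still fails the whole task.
    """
    results = {}
    failures = {}
    for path in file_paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:  # collect the error and keep going
            failures[path] = exc
    if failures:
        raise RuntimeError("failed files: %s" % sorted(failures))
    return results


if __name__ == "__main__":
    # Stand-in processing step: report the length of each file name.
    print(process_all_files(["a.csv", "bb.csv"], len))
```

The cost of this approach is granularity: the files share one task instance, so a retry reprocesses all of them rather than only the one that failed.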

Or, if you have another separate DAG that processes a file, you could trigger that DAG N times, passing the name of the file in the conf.
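The "trigger N times" idea can be sketched as building one trigger request per file, each with its own run ID and a conf carrying the file name. This is an assumption-laden illustration rather than Airflow's API: `build_trigger_requests`, `trigger_for_each_file`, the `file_name` conf key, and the injected `trigger` callable (with keyword arguments `dag_id`, `run_id`, `conf`) are all hypothetical, standing in for whatever mechanism actually starts the run.

```python
def build_trigger_requests(dag_id, file_names, run_prefix):
    """Build one request per file; each gets a distinct run_id and its own conf."""
    return [
        {
            "dag_id": dag_id,
            # Run IDs must differ between runs of the same DAG.
            "run_id": "%s_%d" % (run_prefix, i),
            "conf": {"file_name": name},  # hypothetical conf key
        }
        for i, name in enumerate(file_names)
    ]


def trigger_for_each_file(dag_id, file_names, run_prefix, trigger):
    """Call the injected `trigger` once per file and return the requests sent."""
    requests = build_trigger_requests(dag_id, file_names, run_prefix)
    for request in requests:
        trigger(**request)
    return requests


if __name__ == "__main__":
    # Stand-in trigger that only prints what it would start.
    def print_trigger(dag_id, run_id, conf):
        print(dag_id, run_id, conf)

    trigger_for_each_file("process_file", ["a.csv", "b.csv"], "manual", print_trigger)
```

The triggered DAG would then read the file name back from its run's conf and process just that file, so each file gets its own independently retryable run.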

See here for an example of the TriggerDagRunOperator.

See here for the DAG that would be triggered.

And lastly, see this post, from which the above examples are taken.
