Question

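I have a DAG that creates a cluster, starts computation tasks on it, and tears the cluster down once they complete. I want to limit the concurrency of the computation tasks running on this cluster to a fixed number. Logically, then, I need a pool that is exclusive to the cluster created by a task. I don't want interference with other DAGs or with different runs of the same DAG.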

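I thought I could solve this by creating a pool dynamically from a task after the cluster is created, and deleting it once the computation tasks have finished. I thought I could template the pool parameter of the computation tasks so that they use this dynamically created pool.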

# execute registers a pool and returns with the pool name
create_pool = CreatePoolOperator(
    slots=4,
    task_id='create_pool',
    dag=self
)

# the pool parameter is templated
computation = ComputeOperator(
    task_id=compute_subtask_name,
    pool="{{ ti.xcom_pull(task_ids='create_pool') }}",
    dag=self
)

create_pool >> computation

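However, with this setup the computation tasks are never triggered. So I think the pool parameter is saved on the task instance before it is templated. I would like to hear your thoughts on how to achieve the desired behavior.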

Answer 1

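Instead of trying to get a dynamic pool to work, see whether the concurrency attribute on airflow.models.DAG does the trick. It limits how many tasks of that DAG can be running at the same time.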

Answer 2

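This answer may irritate some people, but it is still a viable path and therefore worth documenting. The core feature that makes Airflow more powerful than its competitors is that everything is defined in code. At the end of the day, if Airflow does not give us a feature, we can always build it ourselves in Python.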

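You want to be able to pool tasks in a DAG, but only for a particular DAG run. So try creating a custom pool just on your tasks. Here is some pseudocode off the top of my head: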

import time

# Airflow 1.x import path, matching the provide_context argument used below
from airflow.operators.python_operator import PythonOperator

# Shared turn queue. As in the original sketch this is a plain in-process list.
# Airflow typically runs each task in its own process, so real code would have
# to keep this list in shared storage instead.
tasksPoolQueue = []


def taskOnesFunction(**context):
    while True:
        if tasksPoolQueue and tasksPoolQueue[0] == "taskOnesTurn":
            print("Do some work it's your turn")

            # Remove this entry from the front of the list
            # so that the next value becomes the first value in the list
            tasksPoolQueue.pop(0)

            return 0
        else:
            time.sleep(10)


def taskTwosFunction(**context):
    while True:
        if tasksPoolQueue and tasksPoolQueue[0] == "taskTwosTurn":
            print("Do some work it's your turn")

            # Remove this entry from the front of the list
            # so that the next value becomes the first value in the list
            tasksPoolQueue.pop(0)

            return 0
        else:
            time.sleep(10)


def createLogicalOrderingOfTaskPoolQueue(**context):
    # foobar is a placeholder for whatever condition decides the ordering
    foobar = True

    if foobar:
        tasksPoolQueue.extend(["taskOnesTurn", "taskTwosTurn"])
    else:
        tasksPoolQueue.extend(["taskTwosTurn", "taskOnesTurn"])

    return 0


determine_pool_queue_ordering = PythonOperator(
    task_id='determine_pool_queue_ordering',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=createLogicalOrderingOfTaskPoolQueue,
    op_args=[])

task1 = PythonOperator(
    task_id='task1',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=taskOnesFunction,
    op_args=[])

task2 = PythonOperator(
    task_id='task2',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=taskTwosFunction,
    op_args=[])

determine_pool_queue_ordering.set_downstream(task1)
determine_pool_queue_ordering.set_downstream(task2)

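Hopefully everyone can follow my pseudocode. I don't know the best way to create a custom pool without introducing a race condition, so this list-queue idea is what I came up with at first glance. The main point is that task1 and task2 will both be running at the same time, but inside their functions I can make sure nothing meaningful happens until the function gets past the if statement that keeps it from running the real code.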

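The first task uses the list to set dynamically which tasks run first and in what order. Every function that needs to be in this custom pool then refers to that list. Because each if statement is only true when its task name is first in the list, this essentially means that only one task can do its work at a time. The first task in the list removes itself from the list once it has finished what it needs to do. The other tasks sleep while they wait for their task name to become first in the list.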

So just build some custom logic similar to mine.
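The turn-taking logic described above can be tried out without Airflow. The following is a minimal simulation in which threads stand in for Airflow tasks; all names (turn_queue, set_ordering, run_when_first, simulate) are made up for this sketch. Threads share memory, which Airflow tasks generally do not, so a real version would need to keep the queue in some shared store.

```python
import threading
import time

# Stand-in for the shared turn queue; the lock guards access from several threads.
turn_queue = []
queue_lock = threading.Lock()
work_log = []  # records the order in which tasks did their work


def set_ordering(task_one_first=True):
    """Plays the role of the determine_pool_queue_ordering task."""
    order = ["task1", "task2"] if task_one_first else ["task2", "task1"]
    with queue_lock:
        turn_queue.extend(order)


def run_when_first(task_name, poll_seconds=0.01):
    """Poll until task_name is at the head of the queue, work, then dequeue."""
    while True:
        with queue_lock:
            my_turn = bool(turn_queue) and turn_queue[0] == task_name
        if my_turn:
            work_log.append(task_name)  # the "real work" would happen here
            with queue_lock:
                turn_queue.pop(0)  # hand the turn to the next task
            return
        time.sleep(poll_seconds)


def simulate(task_one_first=True):
    """Start both tasks at once and return the order in which they worked."""
    turn_queue.clear()
    work_log.clear()
    set_ordering(task_one_first)
    threads = [
        threading.Thread(target=run_when_first, args=(name,))
        for name in ("task2", "task1")
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return list(work_log)


if __name__ == "__main__":
    print(simulate(task_one_first=True))  # ['task1', 'task2']
```

Both threads start together, yet the work always happens in queue order, because only the task at the head of the queue gets past the check. Note that a waiting task still occupies a worker slot while it sleeps.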

Answer 3

Here is an operator that creates a pool if it does not already exist.

from airflow.api.common.experimental.pool import get_pool, create_pool
from airflow.exceptions import PoolNotFound
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class CreatePoolOperator(BaseOperator):
    # it's pool blue, get it?
    ui_color = '#b8e9ee'

    @apply_defaults
    def __init__(
            self,
            name,
            slots,
            description='',
            *args, **kwargs):
        super(CreatePoolOperator, self).__init__(*args, **kwargs)
        self.description = description
        self.slots = slots
        self.name = name

    def execute(self, context):
        try:
            # get_pool raises PoolNotFound when the pool does not exist
            pool = get_pool(name=self.name)
            if pool:
                self.log.info(f'Pool exists: {pool}')
                return
        except PoolNotFound:
            # create the pool
            pool = create_pool(name=self.name, slots=self.slots, description=self.description)
            self.log.info(f'Created pool: {pool}')

Deleting the pool can be done in a similar way.

Creating dynamic pools in Airflow
