Question

We usually start Airflow DAGs with the trigger_dag CLI command. For example:

airflow trigger_dag my_dag --conf '{"field1": 1, "field2": 2}'

We access this conf in our operators using context['dag_run'].conf
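For readers unfamiliar with this pattern, here is a minimal sketch of a callable that reads the conf. The name `read_fields` and the stand-in context are hypothetical; in Airflow 1.10 such a callable would typically be passed to a PythonOperator with provide_context=True so that it receives `dag_run` in its keyword arguments.

```python
from types import SimpleNamespace


def read_fields(**context):
    """Return (field1, field2) from the conf the run was triggered with."""
    # conf may be None (scheduled run) or {} (manual trigger without --conf),
    # so fall back to an empty dict before looking up keys.
    conf = context["dag_run"].conf or {}
    return conf.get("field1"), conf.get("field2")


# Stand-in for the context Airflow passes to the callable; not a real DagRun.
fake_context = {"dag_run": SimpleNamespace(conf={"field1": 1, "field2": 2})}
print(read_fields(**fake_context))
```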

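Sometimes, when the DAG breaks at some task, we would like to "update" the conf and restart the broken task (and its downstream dependencies) with this new conf. For example: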

New conf -> {"field1": 3, "field2": 4}

Is it possible to "update" the dag_run conf with a new JSON string like this?

I would be interested in hearing thoughts on this, other solutions, or potential ways to avoid this situation in the first place.

Using Apache Airflow v1.10.3

Thank you very much.

Answer 1

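Updating conf after a dag run has been created isn't as straightforward as reading from it, because conf is read from the dag_run metadata table every time it is used after the dag run has been created. While Variables have methods to both write to and read from the metadata table, dag runs only let you read.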

I agree that Variables are a useful tool, but when you have k=v pairs that you only want to use for a single run, they get complicated and messy.

Here is an operator that lets you update a dag_run's conf after instantiation (tested in v1.10.10):

#! /usr/bin/env python3
"""Operator to overwrite a dag run's conf after creation."""


import os
from typing import Dict

from airflow.models import BaseOperator
from airflow.utils.db import provide_session
from airflow.utils.decorators import apply_defaults
from airflow.utils.operator_helpers import context_to_airflow_vars


class UpdateConfOperator(BaseOperator):
    """Updates an existing DagRun's conf with `given_conf`.

    Args:
        given_conf: A dictionary of k:v values to update a DagRun's conf with. Templated.
        replace: Whether or not `given_conf` should replace conf (True)
                 or be used to update the existing conf (False).
                 Defaults to True.

    """

    template_fields = ("given_conf",)
    ui_color = "#ffefeb"

    @apply_defaults
    def __init__(self, given_conf: Dict, replace: bool = True, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.given_conf = given_conf
        self.replace = replace

    @staticmethod
    def update_conf(given_conf: Dict, replace: bool = True, **context) -> None:
        @provide_session
        def save_to_db(dag_run, session):
            session.add(dag_run)
            session.commit()
            dag_run.refresh_from_db()

        dag_run = context["dag_run"]
        # When there's no conf provided,
        # conf will be None if scheduled or {} if manually triggered
        if replace or not dag_run.conf:
            dag_run.conf = given_conf
        elif dag_run.conf:
            # Note: dag_run.conf.update(given_conf) doesn't work
            dag_run.conf = {**dag_run.conf, **given_conf}

        save_to_db(dag_run)

    def execute(self, context):
        # Export context to make it available for callables to use.
        airflow_context_vars = context_to_airflow_vars(context, in_env_var_format=True)
        self.log.debug(
            "Exporting the following env vars:\n%s",
            "\n".join(["{}={}".format(k, v) for k, v in airflow_context_vars.items()]),
        )
        os.environ.update(airflow_context_vars)

        self.update_conf(given_conf=self.given_conf, replace=self.replace, **context)

Example usage:

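from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# Assumes UpdateConfOperator (defined above) is importable or in this file.

CONF = {"field1": 3, "field2": 4}
with DAG(
    "some_dag",
    # schedule_interval="*/1 * * * *",
    schedule_interval=None,
    start_date=days_ago(1),  # tasks need a start_date; any past date works here
    max_active_runs=1,
    catchup=False,
) as dag:
    # Overwrite the run's conf, then print it from a downstream task.
    t_update_conf = UpdateConfOperator(
        task_id="update_conf", given_conf=CONF,
    )
    t_print_conf = BashOperator(
        task_id="print_conf",
        bash_command="echo {{ dag_run['conf'] }}",
    )
    t_update_conf >> t_print_conf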
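For the situation in the question, where a run has already failed and the conf should be changed from outside the DAG, the same idea can be sketched as a standalone helper. This is only an illustration under assumptions, not something tested in this thread: `parse_conf` and `overwrite_run_conf` are hypothetical names, and the sketch assumes direct access to the metadata database through `airflow.utils.db.create_session` as in Airflow 1.10.

```python
import json
from typing import Any, Dict


def parse_conf(conf_json: str) -> Dict[str, Any]:
    """Parse a --conf style JSON string and insist on a JSON object."""
    conf = json.loads(conf_json)
    if not isinstance(conf, dict):
        raise ValueError("conf must be a JSON object, e.g. '{\"field1\": 3}'")
    return conf


def overwrite_run_conf(dag_id: str, run_id: str, conf_json: str) -> None:
    """Replace the conf of an existing DagRun (hypothetical helper)."""
    # Imported here so parse_conf stays usable without Airflow installed.
    from airflow.models import DagRun
    from airflow.utils.db import create_session

    new_conf = parse_conf(conf_json)
    with create_session() as session:  # commits when the block exits cleanly
        dag_run = (
            session.query(DagRun)
            .filter(DagRun.dag_id == dag_id, DagRun.run_id == run_id)
            .one()
        )
        # Assign a new dict instead of mutating the old one in place,
        # mirroring the note in the operator above.
        dag_run.conf = new_conf
```

After calling something like overwrite_run_conf("my_dag", "<run_id>", '{"field1": 3, "field2": 4}'), the failed task and its downstream tasks could be cleared so that they rerun and pick up the new conf.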

Answer 2

This seems like a good use case for Airflow Variables. If you read your configs from Variables, you can easily view and modify the configuration inputs from the Airflow UI itself.

You can even get creative and automate that config update (now stored in a Variable) from another Airflow task before re-running the task / DAG. See "With code, how do you update and airflow variable".
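A minimal sketch of this Variables approach, under assumptions: the Variable key `my_dag_conf` and the helper names are hypothetical, and letting the Variable override the triggered conf is just one possible policy.

```python
from typing import Any, Dict, Optional

VARIABLE_KEY = "my_dag_conf"  # hypothetical Variable name


def effective_conf(
    dag_run_conf: Optional[Dict[str, Any]],
    variable_conf: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge both sources; values from the Variable win on conflicts."""
    return {**(dag_run_conf or {}), **(variable_conf or {})}


def load_conf(**context) -> Dict[str, Any]:
    """Callable for a task: combine dag_run.conf with Variable overrides."""
    # Imported here so effective_conf stays usable without Airflow installed.
    from airflow.models import Variable

    overrides = Variable.get(VARIABLE_KEY, default_var={}, deserialize_json=True)
    return effective_conf(context["dag_run"].conf, overrides)


def store_overrides(new_conf: Dict[str, Any]) -> None:
    """Write new overrides, e.g. from another task before a rerun."""
    from airflow.models import Variable

    Variable.set(VARIABLE_KEY, new_conf, serialize_json=True)
```

With this layout, editing the Variable in the UI (or calling store_overrides) before clearing the failed task changes what load_conf returns on the rerun. As the other answer points out, a Variable is shared across runs, so the overrides would likely need to be reset afterwards.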

Is it possible to update/overwrite the Airflow ['dag_run'].conf?