Update an Airflow DAG daily/periodically based on the DAG file's definition code

Time: 2019-10-24 21:49:54

Tags: airflow

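Is there a way to update an Airflow DAG daily/periodically based on the DAG file's definition code? For example, to refresh date values that are used in the DAG definition.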

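For context: I have an Airflow DAG that fetches new table rows from a remote database every day and moves them to a local database. To fetch only the latest rows from the remote, we have a function that gets the latest import date from the local database. The DAG is currently defined like this: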

...
def get_latest_date(tablename):
    # get latest import date from local table
    ....

for table in tables: # type list(dict)

    task_1 = BashOperator(
        task_id='task_1_%s' % table["tablename"],
        bash_command='bash %s/task_1.sh %s' % (PROJECT_HOME, table["latest_date"]),
        execution_timeout=timedelta(minutes=30),
        dag=dag)

    task_2 = BashOperator(
        task_id='task_2_%s' % table["tablename"],
        bash_command='bash %s/task_2.sh' % PROJECT_HOME,
        execution_timeout=timedelta(minutes=30),
        dag=dag)

    task_1 >> task_2

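where table is a dict, and one of its fields, constructed earlier in the code, is the string representation of the latest date for the given table. When I print the supposed latest date inside the task_1.sh script, I find that the date is not updated daily. I need a way to rebuild the tables list every day so that it picks up the correct date values.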

1 Answer:

Answer 0: (score: 0)

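With the following code, you can fetch latest_date dynamically from the local database for each table and pass it to a BashOperator through Airflow XCom. The date is produced by a task at run time, and bash_command is a templated field that Airflow renders when the task executes, so the value is looked up on every DAG run instead of being fixed when the DAG file is parsed.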

from airflow import DAG
import airflow
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.bash_operator import BashOperator
import logging

from datetime import datetime, timedelta

args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

dag = DAG(
    dag_id='example_dag',
    default_args=args,
    schedule_interval=None,  # None means manual triggering only; set e.g. '@daily' to run every day
)


def get_latest_date(**kwargs):
    # get latest import date from local table
    logging.info("Table Name: {0}".format(kwargs['table_name']))
    # datetime.today() is used below for demonstration only; replace it with your actual logic that reads the latest date from your local DB
    latest_date = (datetime.today() - timedelta(days=kwargs['date_diff'])).strftime('%d-%m-%Y')
    logging.info("Latest Date: {0}".format(latest_date))
    # push the latest date to this task's XCom
    kwargs['ti'].xcom_push(key='latest_date', value=latest_date)

    return latest_date

start_task = DummyOperator(task_id='Start_Task', dag=dag)
end_task = DummyOperator(task_id='End_Task', dag=dag)

# the dictionaries in this list no longer need a latest_date entry
tables_list = [{'tablename': 'table1'}, {'tablename': 'table2'}, {'tablename': 'table3'}, {'tablename': 'table4'}]
# idx (the index) is used as the date difference so that each task gets a different latest_date value; this is for demonstration only
for idx, table in enumerate(tables_list): # type list(dict)

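    # ShortCircuitOperator skips the downstream tasks if the callable returns a falsy value;
    # get_latest_date returns a non-empty string here, so task_1 and task_2 always run
    get_latest_date_task = ShortCircuitOperator(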
        task_id='Get_Latest_Date_In_Table_{0}'.format(table['tablename']),
        provide_context=True,
        python_callable=get_latest_date,
        op_kwargs={
            'table_name': table['tablename'],
            'date_diff': idx
        },
        dag=dag)

    # Build the Jinja expression once in xcom_str and reuse it in bash_command, or embed it directly in bash_command (as task_2 does).
    # Note: str.format() collapses each '{{' and '}}' below to a single brace; task_1 re-adds the outer braces so that Jinja sees '{{ ... }}'.
    xcom_str = "{{ ti.xcom_pull(task_ids='Get_Latest_Date_In_Table_{}', key='latest_date') }}".format(table['tablename'])
    task_1 = BashOperator(
        task_id='task_1_{0}'.format(table['tablename']),
        bash_command='echo "{' + xcom_str + '}"',                
        execution_timeout=timedelta(minutes=30),
        dag=dag)

    task_2 = BashOperator(
        task_id='task_2_{0}'.format(table['tablename']),
        bash_command='echo "{{ ti.xcom_pull("Get_Latest_Date_In_Table_' + table['tablename'] + '", key="latest_date") }}"',
        execution_timeout=timedelta(minutes=30),
        dag=dag)

    start_task >> get_latest_date_task >> task_1 >> task_2 >> end_task
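
The xcom_str line above depends on how str.format() treats braces, which is easy to get wrong when building Jinja expressions per table. Below is a minimal sketch of that one step in plain Python, with no Airflow import. The helper name xcom_pull_template is hypothetical and not part of the answer or of Airflow; it only builds the text that would be handed to bash_command.

```python
def xcom_pull_template(task_id, key):
    """Build a Jinja expression (as text) that pulls `key` from `task_id`.

    Hypothetical helper for illustration only; it is not an Airflow API.
    """
    # In str.format(), '{{' and '}}' are escapes for one literal brace each,
    # so four braces are needed to emit the two that Jinja expects.
    return "{{{{ ti.xcom_pull(task_ids='{0}', key='{1}') }}}}".format(task_id, key)


tables_list = [{'tablename': 'table1'}, {'tablename': 'table2'}]

for table in tables_list:
    upstream_id = 'Get_Latest_Date_In_Table_{0}'.format(table['tablename'])
    template = xcom_pull_template(upstream_id, 'latest_date')
    # The result is still plain text at this point; nothing is looked up
    # until a template engine renders it.
    print('echo "' + template + '"')
```

Writing the four braces in one place yields the same text as the answer's approach of letting format() strip one pair and then concatenating the outer '{' and '}' by hand, but keeps the escaping visible in a single expression.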