Airflow Task does not move onto dependency but re-runs task

时间:2018-06-04 17:43:16

标签: airflow airflow-scheduler

I have an Airflow Workflow that consists of three tasks; with the second task dependent on the first and the third task dependent on the second.

If I run the DAG via the webserver, the first task completes but then begins to re-run instead of triggering the second task. One thing to keep in mind is that the first task does take more than 130 seconds to run. Is this happening because of the duration of the first task?

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta, datetime


default_args = {
    'owner': 'David',
    'depends_on_past': True,
    'start_date': datetime(2018,5,18),
    'email': ['email_address'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'DCM_Floodlight_Report_API',
    default_args=default_args,
    description='Pull ABG DCM Floodlight report. Then upload into S3 bucket.',
    schedule_interval='30 14 * * *')

t1 = BashOperator(
    task_id='Pull_DCM_Report',
    bash_command='python "/Users/run_report.py" 2737542 134267867', dag=dag)

t2 = BashOperator(
    task_id='Cleanse_File',
    bash_command='python "/Users/cleanse_file.py"',dag=dag)

t3 = BashOperator(
    task_id='S3_Bucket_Creation_Upload_File',
    bash_command='python "/Users/aws_s3_creation&load.py"',dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t2)

2 个答案:

答案 0 :(得分:2)

我认为您的任务的运行时不是问题。 - 此行为很可能是由let compareFunc = (initial, arrayToCheck) => { let initial_digits = initial.toString().split(''); initial_digits.sort(); arrayToCheck.forEach(function(item) { let item_digits = item.toString().split(''); item_digits.sort(); if(item_digits.toString() === initial_digits.toString()) { console.log(item + ' is a permutation of ' + initial); } }) } compareFunc(321, [1,2,123,213]);参数引起的,默认为catchup

https://airflow.apache.org/scheduler.html#backfill-and-catchup

这意味着Airflow正在为True和当前时间之间的每个计划间隔安排第一项任务。 您可以在UI中查看树视图,以查看是否正在安排多个DagRun。如果你只是测试你的DAG,我建议你在测试时将schedule_interval设置为start_date,然后再安排它运行过去或将来的日期。

答案 1 :(得分:0)

尝试不使用重试逻辑,看看它是如何执行的。使用这些默认参数和dag信息:

`default_args = {
'owner': 'David',
'depends_on_past': False,
'start_date': datetime(2018,5,18),
'email': ['email_address'],
'email_on_failure': True,
'email_on_retry': True
}

dag = DAG(
 dag_id='DCM_Floodlight_Report_API',
 default_args=default_args,
 catchup=False,
 description='Pull ABG DCM Floodlight report. Then upload into S3 bucket.',
 schedule_interval='30 14 * * *')

我添加了catchup并将其设置为False并将depends_on_past更改为False。我也删除了重试逻辑。这可能会解决您的问题 - 请告诉我们!