使用ExternalTask​​Sensor进行DAG任务依赖时,Airflow 1.10.2 ETL运行缓慢吗?

时间:2019-05-21 15:26:38

标签: celery airflow airflow-scheduler

我需要使用Airflow 1.10.2 + CeleryExecutor来运行两个DAG。第一个DAG(DAG1)是从s3到Redshift(3个多小时)的长期数据加载。我的第二个DAG(DAG2)对DAG1加载的数据执行计算。我想在DAG2中包含一个ExternalTask​​Sensor,以便在加载数据后可靠地执行计算。理论上是如此简单!

通过确保两个DAG被安排为同时启动(两个DAG的时间表均为“ 0 8 * * *”),我可以成功地使DAG2等待DAG1完成,并且DAG2取决于DAG1中的最终任务。但是,当我介绍传感器时,在DAG1上的ETL出现了很大的延迟。我起初虽然是因为我的原始实现使用了mode="poke",但据我了解它锁定了一个工作程序。但是,即使我在文档https://airflow.readthedocs.io/en/stable/_modules/airflow/sensors/base_sensor_operator.html中阅读时将其更改为mode="reschedule",我仍然看到大量的ETL延迟。

我在下面的DAG2中使用ExternalTask​​Sensor代码:

wait_for_data_load = ExternalTaskSensor(
    dag=dag,
    task_id="wait_for_data_load",
    external_dag_id="dag1",
    external_task_id="dag1_final_task_id",
    mode="reschedule",
    poke_interval=1800,  # check every 30 min
    timeout=43200,  # timeout after 12 hours (catch delayed data load runs)
    soft_fail=False  # if the task fails, we assume a failure
)

如果代码正常运行,我希望传感器能够快速检查DAG1是否已完成,如果没有完成,则按poke_interval的规定重新安排30分钟的时间,从而不会延迟DAG1 ETL。如果DAG1在12小时后未能完成,则DAG2将停止戳戳并失败。

相反,DAG1中的每个任务都经常出现错误,例如(Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?),即使任务已成功完成(有一定延迟)。在发送此错误之前,尽管(再次)我可以看到任务成功,但我在Sentry日志中看到一行显示Executor reports dag1.data_table_temp_redshift_load execution_date=2019-05-20 08:00:00+00:00 as failed for try_number 1

DAG2上的日志看起来也有些奇怪。我看到重复的尝试以相同的时间间隔记录下来,例如以下摘录:

--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------

[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag1. wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1. dag1_final_task_id on 2019-05-20T08:00:00+00:00 ... 
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load   RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load   """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0

[2019-05-21 08:01:48,125] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2. wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:01:48,311] {{models.py:1571}} INFO - 
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------

[2019-05-21 08:01:48,417] {{models.py:1593}} INFO - Executing <Task(ExternalTaskSensor): wait_for_data_load> on 2019-05-20T08:00:00+00:00
[2019-05-21 08:01:48,419] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run dag2 wait_for_data_load 2019-05-20T08:00:00+00:00 --job_id 572075 --raw -sd DAGS_FOLDER/dag2.py --cfg_path /tmp/tmp4g2_27c7']
[2019-05-21 08:02:02,543] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:02,542] {{settings.py:174}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800, pid=28219
[2019-05-21 08:02:12,000] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:11,996] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2019-05-21 08:02:15,840] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:15,827] {{models.py:273}} INFO - Filling up the DagBag from /usr/local/airflow/dags/dag2.py
[2019-05-21 08:02:16,746] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:16,745] {{dag2.py:40}} INFO - Waiting for the dag1_final_task_id operator to complete in the dag1 DAG
[2019-05-21 08:02:17,199] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load [2019-05-21 08:02:17,198] {{cli.py:520}} INFO - Running <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [running]> on host 11d93b0b0c2d
[2019-05-21 08:02:17,708] {{external_task_sensor.py:91}} INFO - Poking for dag1.dag1_final_task_id on 2019-05-20T08:00:00+00:00 ... 
[2019-05-21 08:02:17,890] {{models.py:1784}} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2019-05-21 08:02:17,892] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.2) or chardet (3.0.4) doesn't match a supported version!
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load   RequestsDependencyWarning)
[2019-05-21 08:02:17,893] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load /usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
[2019-05-21 08:02:17,894] {{base_task_runner.py:101}} INFO - Job 572075: Subtask wait_for_data_load   """)
[2019-05-21 08:02:22,597] {{logging_mixin.py:95}} INFO - [2019-05-21 08:02:22,589] {{jobs.py:2527}} INFO - Task exited with return code 0
[2019-05-21 08:33:31,875] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1359}} INFO - Dependencies all met for <TaskInstance: dag2.wait_for_data_load 2019-05-20T08:00:00+00:00 [queued]>
[2019-05-21 08:33:31,903] {{models.py:1571}} INFO - 
--------------------------------------------------------------------------------
Starting attempt 1 of 4
--------------------------------------------------------------------------------

尽管所有日志都显示Starting attempt 1 of 4,但我确实看到了大约每30分钟的尝试记录,但是我看到了每个时间间隔的多个日志(每30分钟间隔打印了10多个相同日志)。

通过搜索,我发现其他人在生产流程https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff中使用了传感器,这使我认为有解决方案,或者我正在实施错误的解决方案。但是我也看到气流项目中与此问题相关的未解决问题,所以该项目中可能还有更深层次的问题吗?我还在Apache Airflow 1.10.3: Executor reports task instance ??? finished (failed) although the task says its queued. Was the task killed externally?

上找到了相关但未答复的帖子

此外,我们正在使用以下配置设置:

# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16

# Are DAGs paused by default at creation
dags_are_paused_at_creation = True

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16

1 个答案:

答案 0 :(得分:0)

这些症状实际上是由DAG1主体中的Variable.set()调用引起的,DAG2随后使用该调用来检索动态生成的dag_id的DAG1。 Variable.set()全部引起错误(在工作日志中发现)。如here所述,调度程序会在每次心跳时轮询DAG定义,以更新DAG以保持最新状态。这意味着每次心跳都会出错,从而导致大量的ETL延迟。