Airflow ETL DAG errors every day
Our Airflow installation uses the CeleryExecutor. The concurrency settings are:
[core]
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
We run the DAG once a day. It follows a pattern of roughly 21 tasks executing in parallel: each checks whether its data exists in HDFS, sleeps for 10 minutes, and finally uploads to S3.
Some of the tasks hit the following error:
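To make the shape of the DAG concrete, here is a plain-Python sketch of the pattern described above (not the actual DAG code): the table names, the check_hdfs_data function, and the S3 paths are hypothetical stand-ins, and the 10-minute sleep is shortened so the sketch runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# ~21 parallel branches, one per table (names are hypothetical).
TABLES = [f"table_{i}" for i in range(21)]

def check_hdfs_data(table):
    # Stand-in for the real HDFS existence check (a sensor in the DAG).
    return True

def process(table, wait_seconds=0.01):
    # check -> sleep -> upload, the per-table pattern from the question.
    if not check_hdfs_data(table):
        raise RuntimeError(f"no data for {table}")
    time.sleep(wait_seconds)           # the real DAG sleeps 10 minutes
    return f"s3://bucket/{table}"      # stand-in for the S3 upload

# dag_concurrency = 16 caps how many branches run at once.
with ThreadPoolExecutor(max_workers=16) as pool:
    uploaded = list(pool.map(process, TABLES))
```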
2019-05-12 00:00:46,209 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_dct_order_item_15 execution_date=2019-05-11 04:00:00+00:00 as failed for try_number 1
2019-05-12 00:00:46,212 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_dct_order_item_15 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,212 INFO - Filling up the DagBag from /opt/DataLoader/airflow/dags/wh_hdfs_to_s3.py
2019-05-12 00:00:46,425 INFO - Using connection to: id: wh_aws_mysql. Host: db1.prod.coex.us-east-1.aws.owneriq.net, Port: None, Schema: WAREHOUSE_MOST, Login: whuser, Password: XXXXXXXX, extra: {}
2019-05-12 00:00:46,557 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_dct_order_item_15 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
None
2019-05-12 00:00:46,558 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,561 WARNING - section/key [smtp/smtp_user] not found in config
2019-05-12 00:00:46,640 INFO - Sent an alert email to [u'wh-report-admin@owneriq.com']
2019-05-12 00:00:46,679 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_tbldimmostlineitem_105 execution_date=2019-05-11 04:00:00+00:00 as failed for try_number 1
2019-05-12 00:00:46,682 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_tbldimmostlineitem_105 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
2019-05-12 00:00:46,682 INFO - Filling up the DagBag from /opt/DataLoader/airflow/dags/wh_hdfs_to_s3.py
2019-05-12 00:00:46,686 INFO - Using connection to: id: wh_aws_mysql. Host: db1.prod.coex.us-east-1.aws.owneriq.net, Port: None, Schema: WAREHOUSE_MOST, Login: whuser, Password: XXXXXXXX, extra: {}
2019-05-12 00:00:46,822 ERROR - Executor reports task instance <TaskInstance: wh_hdfs_to_s3.check_hdfs_data_tbldimmostlineitem_105 2019-05-11 04:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
None
2019-05-12 00:00:46,822 INFO - Marking task as UP_FOR_RETRY
2019-05-12 00:00:46,826 WARNING - section/key [smtp/smtp_user] not found in config
2019-05-12 00:00:46,902 INFO - Sent an alert email to [u'wh-report-admin@owneriq.com']
2019-05-12 00:00:46,918 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_tbldimdatasourcetag_135 execution_date=2019-05-11 04:00:00+00:00 as success for try_number 1
2019-05-12 00:00:46,921 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_flight_69 execution_date=2019-05-11 04:00:00+00:00 as success for try_number 1
2019-05-12 00:00:46,923 INFO - Executor reports wh_hdfs_to_s3.check_hdfs_data_tbldimariamode_93 execution_date=2019-05-11 04:00:00+00:00 as success for try_number 1
The error occurs randomly among these tasks. When it happens, the task instance's state is immediately set to up_for_retry and there are no logs on the worker nodes. After the retries, the tasks execute and eventually finish.
This problem sometimes causes us significant ETL delay. Does anyone know how to solve it?
Answer 0 (score: 0)
I was seeing very similar symptoms in my DagRuns. I thought it was due to ExternalTaskSensor and concurrency issues, given the queued-and-killed task language, which looked like this: Executor reports task instance <TaskInstance: dag1.data_table_temp_redshift_load 2019-05-20 08:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
But when I looked at the worker logs, I saw an error caused by setting a Variable at the top level of my DAG file. The issue is described in "duplicate key value violates unique constraint when adding path variable in airflow dag": the scheduler periodically re-parses the dagbag to pick up any changes dynamically, so the error fired on every heartbeat and caused significant ETL delays.
Are you running any logic in your wh_hdfs_to_s3 DAG (or others) that could cause errors or delays / these symptoms?
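The mechanism this answer describes can be demonstrated without Airflow at all. In the hedged sketch below, a fake in-memory module stands in for the Airflow metadata database, and a temporary file stands in for a DAG file whose top level performs a write (like a Variable set). "Parsing" the file three times, as the scheduler would over three heartbeats, fires the side effect three times.

```python
import importlib.util
import os
import sys
import tempfile
import types

# Fake shared store standing in for the Airflow metadata DB.
store = types.ModuleType("counter_store")
store.WRITES = 0
sys.modules["counter_store"] = store

# A "DAG file" whose top level performs a write on every parse.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("import counter_store\ncounter_store.WRITES += 1\n")
    dag_path = f.name

def parse_dag_file(path):
    # Mimics the scheduler re-parsing the DAG file on a heartbeat.
    spec = importlib.util.spec_from_file_location("dag_under_parse", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

for _ in range(3):       # three scheduler heartbeats
    parse_dag_file(dag_path)

os.unlink(dag_path)
```

After three "heartbeats" the store has been written three times; in a real deployment each of those writes hits the metadata database, which is why top-level writes in DAG files cause errors and delay on every parse.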
Answer 1 (score: 0)
We ran into a similar problem, which was solved by the … option.
Answer 2 (score: 0)
We have solved this problem. Let me answer my own question:
We have 5 Airflow worker nodes. After installing Flower to monitor the tasks dispatched to these nodes, we found that the failing tasks were always sent to one specific node. We tried running the tasks on other nodes with the airflow test command, and they worked. The root cause turned out to be a broken Python package on that specific node.
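One lightweight way to hunt for that kind of per-node package drift is to print the installed versions of the packages your DAGs import on every worker and diff the output between nodes. A minimal sketch (the package list is an example; substitute whatever your tasks actually import):

```python
from importlib.metadata import PackageNotFoundError, version

def report(packages):
    # One "name==version" line per package, or a MISSING marker if the
    # distribution is not installed on this node.
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"{name} MISSING")
    return lines

# Run this on each worker node and diff the results.
print("\n".join(report(["pip", "setuptools"])))
```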