Given RAM and CPU constraints, how to proactively control DAGs in Airflow

Time: 2019-01-09 11:33:46

Tags: python load airflow

I have become quite familiar with Airflow's programming features by working through many examples. What I still can't figure out is how it performs its work without overloading the CPU or RAM. Is there a way to control the load so that resources aren't exhausted?

I know one way to reduce how often the scheduler re-parses and schedules DAG files: change the values of `min_file_process_interval` and `scheduler_heartbeat_sec` to an interval of around a minute. Although this reduces the constant CPU usage, once the interval elapses (i.e., after a minute) usage suddenly jumps back to the 95% it showed at startup. Is there at least a way to keep it from consuming more than, say, 70% of the CPU?
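For reference, the tuning described above lives in the `[scheduler]` section of `airflow.cfg`. A minimal sketch, assuming the one-minute interval mentioned in the question; the values are illustrative, not recommendations:

```ini
[scheduler]
# Seconds to wait before re-parsing the same DAG file
# (fewer parses = lower steady-state CPU usage)
min_file_process_interval = 60

# Seconds between scheduler heartbeats; each heartbeat
# triggers a scheduling pass over the DAGs
scheduler_heartbeat_sec = 60
```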

EDIT

Also, when the scheduler_heartbeat interval elapses, I see all of my Python scripts being executed again. Is this how it is supposed to work? I thought it would only pick up new DAGs after the interval; otherwise this seems wasteful.
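This behavior follows from how the scheduler works: on each parse interval it re-executes every DAG file top to bottom, so any top-level code runs every time. A minimal sketch of the pattern that keeps per-parse load low, using Airflow 1.x import paths; the DAG id and the `heavy_setup` helper are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path


def heavy_setup():
    # Expensive queries, API calls, or file scans belong inside a task,
    # which runs only when the task is actually scheduled...
    pass


# ...because everything at module level runs on EVERY scheduler parse
# of this file, i.e. once per min_file_process_interval.
dag = DAG(
    dag_id="parse_friendly_example",  # hypothetical
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

PythonOperator(task_id="do_heavy_setup", python_callable=heavy_setup, dag=dag)
```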

1 Answer:

Answer 0 (score: 3)

There are a few techniques you can use to control the number of processes Airflow runs.

  1. Use Pools. You can assign a pool in the DAG setup, or add it directly to your operator so that whoever creates a DAG has that detail hidden from them (see the sketch after this list).
  2. For backfilling tasks, there are the DAG-level parameters `concurrency` and `max_active_runs`, which are set when you initialize a DAG (also shown below).
  3. Distribute your compute if you are using the CeleryExecutor, which can execute tasks on remote machines. [I haven't tried this myself, but I have heard success stories.] A hedged config sketch follows the examples below.
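A minimal sketch of points 1 and 2 together, using Airflow 1.x import paths; the DAG id, pool name (`cpu_bound`), and bash command are hypothetical, and the pool itself must be created beforehand via the Airflow UI or CLI:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

dag = DAG(
    dag_id="resource_limited_example",  # hypothetical
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    concurrency=4,      # at most 4 task instances of this DAG run at once
    max_active_runs=1,  # at most 1 DAG run active at a time (helps backfills)
)

# Tasks assigned to a pool compete for its fixed number of slots,
# so CPU-heavy tasks queue up instead of all starting together.
heavy = BashOperator(
    task_id="heavy_task",
    bash_command="python run_job.py",  # hypothetical command
    pool="cpu_bound",                  # pool created via UI/CLI beforehand
    dag=dag,
)
```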
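And for point 3, a sketch of the `airflow.cfg` changes that switch to the CeleryExecutor, assuming Airflow 1.10-style config keys; the broker and backend URLs are placeholders for your own Redis and Postgres hosts:

```ini
[core]
executor = CeleryExecutor

[celery]
# Placeholder connection strings; point these at your own services
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://user:pass@pg-host/airflow
# Tasks each worker machine may run in parallel
worker_concurrency = 8
```

Each remote machine then runs `airflow worker` to pull tasks from the broker, spreading the CPU and memory load across hosts.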

These are the ones I have used. You'll still have to be smart about how you allocate pools and concurrency limits to control CPU spikes and memory issues.