Is there a way to update, rather than overwrite, worker_env for a Dask YarnCluster from within a script?

Asked: 2019-08-16 21:30:06

Tags: dask

In my Dask-Yarn configuration file, ~/.config/dask/yarn.yaml, I set the worker environment variables as follows:

yarn:

  name: dask                 # Application name
  queue: default             # Yarn queue to deploy to
  deploy-mode: remote        # The deploy mode to use (either remote or local)
  environment: /dask_yarn.tar.gz          # Path to conda packed environment
  user: ''                     # The user to submit the application on behalf of

  worker:                   # Specifications of worker containers
    count: 0                # Number of workers to start on initialization
    restarts: -1            # Allowed number of restarts, -1 for unlimited
    env: {"ARROW_LIBHDFS_DIR": "/usr/hdp/lib"}                 # A map of environment variables to set on the worker

Now, in my script, I want to set an additional environment variable on the workers, one that is derived within the script itself, e.g.

cluster = YarnCluster(worker_env={"env_var": env_val})

where env_val is derived in the script before this statement. However, this call overwrites the configuration previously specified in ~/.config/dask/yarn.yaml. I don't want to hard-code ARROW_LIBHDFS_DIR in the script, and I can't set the new variable in ~/.config/dask/yarn.yaml because it is only derived while the script runs. So, is there a way to update the worker environment from within the script without overwriting it?

1 Answer:

Answer 0 (score: 1)

There is no constructor option for this, but you can do it by accessing dask's configuration directly:

import dask
from dask_yarn import YarnCluster

# Get the existing worker_env mapping (use `.copy()` so as not to mutate the config in place)
worker_env = dask.config.get("yarn.worker.env", {}).copy()
# Add the script-derived environment variable
worker_env["env_var"] = env_val
# Create your cluster with the merged mapping
cluster = YarnCluster(worker_env=worker_env, ...)
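
Alternatively, a sketch not taken from the original answer: since dask-yarn reads its constructor defaults from the same dask configuration, you could merge the new variable into the config itself before creating the cluster. This assumes YarnCluster falls back to yarn.worker.env when no worker_env argument is passed, which is how dask-yarn documents its defaults:

import dask
from dask_yarn import YarnCluster

# dask.config.set merges dotted keys into the nested config rather than
# replacing the whole yarn.worker.env mapping, so ARROW_LIBHDFS_DIR from
# yarn.yaml is preserved alongside the new variable.
dask.config.set({"yarn.worker.env.env_var": env_val})

# With no explicit worker_env, the cluster picks up the merged config.
cluster = YarnCluster()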