In my Dask-Yarn configuration file, i.e. ~/.config/dask/yarn.yaml,
I set the worker environment variables as follows:
yarn:
  name: dask                        # Application name
  queue: default                    # Yarn queue to deploy to
  deploy-mode: remote               # The deploy mode to use (either remote or local)
  environment: /dask_yarn.tar.gz    # Path to conda packed environment
  user: ''                          # The user to submit the application on behalf of
  worker:                           # Specifications of worker containers
    count: 0                        # Number of workers to start on initialization
    restarts: -1                    # Allowed number of restarts, -1 for unlimited
    env: {"ARROW_LIBHDFS_DIR": "/usr/hdp/lib"}  # A map of environment variables to set on the worker
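The nesting in the YAML above is what lets dask.config address individual fields with dotted keys such as "yarn.worker.env" (used in the answer below). As a rough illustration of how such a key resolves, here is a sketch with a plain dict standing in for the parsed YAML; the helper `get_dotted` is hypothetical, written only to show the traversal that dask.config.get performs for you:

```python
# Illustration only: a plain dict mirroring the parsed yarn.yaml above.
config = {
    "yarn": {
        "name": "dask",
        "worker": {
            "count": 0,
            "restarts": -1,
            "env": {"ARROW_LIBHDFS_DIR": "/usr/hdp/lib"},
        },
    }
}

def get_dotted(cfg, key, default=None):
    """Walk a nested dict along a dot-separated key (hypothetical helper)."""
    node = cfg
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

print(get_dotted(config, "yarn.worker.env"))
# → {'ARROW_LIBHDFS_DIR': '/usr/hdp/lib'}
```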
Now, in my script, I want to set an additional environment variable on the workers, one that is derived inside the script itself, e.g.
cluster = YarnCluster(worker_env={"env_var": env_val})
where env_val
is computed earlier in the same script. But this statement overwrites the configuration previously specified in ~/.config/dask/yarn.yaml.
I don't want to hard-code ARROW_LIBHDFS_DIR
in the script, and I can't set the new variable in ~/.config/dask/yarn.yaml
either, because its value is only derived while the script runs. So is there a way to update the worker environment from the script without overwriting it?
Answer 0 (score: 1)
There is no constructor option for this, but you can do it by accessing dask's configuration directly:
import dask
from dask_yarn import YarnCluster

# Get the existing worker_env field (use `.copy()` so as not to mutate
# the global configuration in place)
worker_env = dask.config.get("yarn.worker.env", {}).copy()

# Add the environment variable derived earlier in the script
worker_env["env_var"] = env_val

# Create your cluster with the merged environment
cluster = YarnCluster(worker_env=worker_env, ...)
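The `.copy()` step is the crux of the answer: mutating the dict returned by dask.config.get in place would alter the global configuration for anything else reading it. A minimal sketch of the copy-then-merge pattern, with a plain dict standing in for the value the config lookup would return (the variable names here are illustrative, not part of any API):

```python
# Stand-in for the dict dask.config.get("yarn.worker.env") would return.
configured_env = {"ARROW_LIBHDFS_DIR": "/usr/hdp/lib"}

# Copy first so the original mapping is not mutated, then layer the
# script-derived variable on top.
worker_env = configured_env.copy()
worker_env["env_var"] = "derived-at-runtime"  # hypothetical derived value

print(worker_env)     # contains both the configured and the new variable
print(configured_env) # unchanged: still only ARROW_LIBHDFS_DIR
```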