Multiple simultaneous get_dataset calls on a dask worker

Time: 2017-10-03 05:54:57

Tags: python dask dask-distributed

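TL;DR
If multiple queries arrive while another query is already downloading the dataset they need, will Dask try to download the dataset multiple times? Or will it recognize that the download is "in flight" and automatically wait for it to finish?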

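Background
If I have a freshly booted worker (no datasets loaded in memory) and my function asks for a dataset as input, the dataset is downloaded to the worker on demand. A simple scenario: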

(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Executes query

However, what if I have the following situation:

(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Receives query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(5) Receives another query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(6) Execute queries

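Does Dask try to download the dataset multiple times, or does it recognize that it is "in flight" and automatically wait for it to finish?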

I have read the source code, but dataset publishing/listing is still a black box to me.

1 answer:

Answer 0 (score: 0)

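Each call to client.get_dataset is independent, so multiple concurrent requests will result in redundant work. That said, you should not store anything in a published dataset beyond metadata (for example, a dask collection that points to remote futures), so when used correctly this download should only take a few milliseconds.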
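
If a function running on a worker really does perform an expensive load, one option is to de-duplicate the work yourself rather than rely on Dask to detect an in-flight request. The following is only a minimal sketch of that idea using the standard library, under the assumption that the concurrent queries run as threads inside one worker process. The names `get_once`, `slow_load`, and `"my-dataset"` are hypothetical and are not part of the Dask API.

```python
import threading
import time

_cache = {}                      # name -> already loaded data
_locks = {}                      # name -> lock guarding the load for that name
_locks_guard = threading.Lock()  # protects the _locks dict itself
load_count = 0                   # how many times the expensive load actually ran


def slow_load(name):
    """Hypothetical stand-in for an expensive download."""
    global load_count
    time.sleep(0.2)
    load_count += 1  # only ever called while holding the per-name lock
    return f"data for {name}"


def get_once(name, loader=slow_load):
    """Return the data for `name`, running `loader` at most once per name."""
    with _locks_guard:
        lock = _locks.setdefault(name, threading.Lock())
    with lock:
        # Callers arriving during the load block on the lock above,
        # then find the result in the cache instead of loading again.
        if name not in _cache:
            _cache[name] = loader(name)
        return _cache[name]


# Three "queries" ask for the same dataset at the same time.
results = []
threads = [
    threading.Thread(target=lambda: results.append(get_once("my-dataset")))
    for _ in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(load_count, results)
```

The per-name lock lets loads of different datasets proceed in parallel while requests for the same dataset wait for the first one. This cache is local to a single process, so each worker process would still perform its own load once.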