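TL;DR
If multiple queries arrive while another query is already downloading the dataset they need, will Dask try to download the dataset multiple times? Or will it recognize that the download is "in flight" and automatically wait for it to finish?
Background
If I have a freshly booted worker (with no datasets loaded in memory) and my function requires a dataset as input, the dataset is downloaded onto the worker on demand. A simple scenario: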
(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Executes query
However, suppose I have the following situation:
(1) Worker boots
(2) Receives query which needs a dataset
(3) Downloads dataset (takes X seconds)
(4) Receives query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(5) Receives another query which needs the same dataset which is currently downloading - will it download it again or detect in-flight?
(6) Executes queries
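Does Dask try to download the dataset multiple times, or does it recognize that the download is "in flight" and automatically wait for it to finish?
I have read the source code, but dataset publishing/listing is still a black box to me.
Answer 0 (score: 0)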
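Each call to client.get_dataset is independent, so multiple requests will result in redundant work. That said, you should not store anything in a published dataset beyond metadata (such as a Dask collection that points to remote futures), so when used correctly this download should take only a few milliseconds.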
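If the load really is expensive and you want the "detect in-flight and wait" behavior from the question, one option is to deduplicate in your own code. The following is a minimal sketch of that general pattern using only the Python standard library; it is not something Dask does for you and it uses no Dask API. The names slow_download, load_once, run_queries, and stats are all hypothetical, and slow_download simply sleeps to stand in for a real download.

```python
import threading
import time
from concurrent.futures import Future

_lock = threading.Lock()
_in_flight = {}            # dataset name -> Future for the (possibly pending) data
stats = {"downloads": 0}   # counts real downloads, for illustration only


def slow_download(name):
    """Hypothetical stand-in for an expensive dataset download."""
    with _lock:
        stats["downloads"] += 1
    time.sleep(0.2)  # simulate the X seconds the download takes
    return f"data for {name}"


def load_once(name):
    """Return the dataset, downloading it at most once under concurrent calls."""
    with _lock:
        fut = _in_flight.get(name)
        owner = fut is None
        if owner:
            # First caller registers a Future so later callers can wait on it.
            fut = Future()
            _in_flight[name] = fut
    if owner:
        try:
            fut.set_result(slow_download(name))
        except Exception as exc:
            with _lock:
                _in_flight.pop(name, None)  # allow a later retry
            fut.set_exception(exc)
    # Non-owners block here until the owner has finished the download.
    return fut.result()


def run_queries(name, n=3):
    """Simulate n queries arriving while the first download is in progress."""
    results = [None] * n

    def query(i):
        results[i] = load_once(name)

    threads = [threading.Thread(target=query, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling run_queries("trades", 3) starts three concurrent "queries" for the same dataset; only the first one triggers slow_download, and the other two wait on the same Future. This only deduplicates within a single process, so it would need to run inside each worker (or wherever the load happens) to be useful.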