Question

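I want to use Dask to process roughly 5,000 batch tasks that store their results in a relational database. Once they have all finished, I want to run a final task that queries the database and generates a result file (which will be stored in AWS S3).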

So it looks more or less like this:

from dask import bag, delayed
from dask.distributed import Client

batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)

client = Client('the_scheduler:8786')
client.compute(graph)

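This works, but toward the end of processing many workers sit idle, and I would like to shut them down (and save some money on AWS EC2). If I do that, however, the scheduler "forgets" that those tasks were already completed and tries to run them again on the remaining workers.

I understand that this is actually a feature, not a bug, since Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3. But is there any way to tell Dask to just orchestrate the distributed processing graph and not worry about state management?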

Answer 1

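I recommend that you simply forget the futures once they have completed. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular, it uses the as_completed iterator. Once the client holds no reference to a finished future, the scheduler no longer needs to keep its result, so shutting down an idle worker should not cause that task to be rerun.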

from dask.distributed import Client, as_completed
client = Client('the_scheduler:8786')

futures = client.map(process_batch_and_store_results_in_database, my_batches())

seq = as_completed(futures)
del futures # now only reference to the futures is within seq

for future in seq:
    pass  # let future be garbage collected
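
The snippet above stops after the batches finish and leaves out the final step from the question. As a rough sketch of the whole flow, the following uses the standard library's concurrent.futures so it runs without a cluster; it is an analogy, not Dask code. The names run_batches_then_finalize, process_batch, summarize, and the in-memory database list are hypothetical stand-ins. A dask.distributed Client also offers submit, and dask.distributed.as_completed accepts a list of futures, so passing those in should work the same way, though that is an untested assumption here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batches_then_finalize(executor, batches, process_batch, finalize,
                              as_completed_fn=as_completed):
    """Run every batch, let each finished future go, then run the final step."""
    futures = [executor.submit(process_batch, batch) for batch in batches]
    seq = as_completed_fn(futures)
    del futures  # drop our own list; seq hands the futures back one at a time

    completed = 0
    for future in seq:
        future.result()  # re-raise any exception raised by the batch
        completed += 1

    # Every batch has finished, so the final step can safely read the database.
    return completed, finalize()


# Hypothetical stand-in for the relational database.
database = []


def process_batch(batch):
    # "Store" the batch result as a side effect and return nothing,
    # mirroring tasks that write to a database.
    database.append(sum(batch))


def summarize():
    # Stand-in for read_database_and_store_bundled_result_into_s3.
    return sum(database)


with ThreadPoolExecutor(max_workers=4) as pool:
    completed, total = run_batches_then_finalize(
        pool, [[1, 2], [3, 4], [5, 6]], process_batch, summarize)

print(completed, total)
```

With Dask, the final step could likewise be submitted after the loop, for example with client.submit(read_database_and_store_bundled_result_into_s3), since by then nothing depends on the scheduler remembering the batch results.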
