Dask function that sleeps without occupying a worker?

Time: 2018-04-30 22:02:17

Tags: dask dask-distributed

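In some kinds of data pipelines it is useful to wait for an external process to finish, for example by watching for a file to be written.
Implementing this naively in dask produces a long-running task that blocks a worker for its entire duration.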

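import time
from os.path import exists

from dask import delayed

def wait_for_file(filename='some_filename', max_wait_time=600):
    start_time = time.time()
    # poll until the file appears or the timeout expires
    while True:
        if time.time() - start_time > max_wait_time:
            raise Exception('Timeout')
        if exists(filename):
            return filename
        time.sleep(0.1)

# process_file is the downstream step, defined elsewhere
file_exists = delayed(wait_for_file)()
res = delayed(process_file)(file_exists)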
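
The blocking described above can be reproduced without a cluster. The sketch below is only an analogy: it assumes a worker with a single task slot behaves like a one-thread `ThreadPoolExecutor` from the standard library, and `slow_wait` and `quick_task` are hypothetical stand-ins for `wait_for_file` and any unrelated task.

```python
import time
from concurrent.futures import ThreadPoolExecutor

events = []

def slow_wait(seconds=0.3):
    # Stand-in for wait_for_file: holds its thread while it waits.
    time.sleep(seconds)
    events.append("wait finished")

def quick_task():
    # An unrelated task that could have run immediately.
    events.append("quick task ran")

# One thread stands in for a worker with a single task slot.
with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(slow_wait)
    pool.submit(quick_task)

print(events)  # ['wait finished', 'quick task ran']
```

The quick task cannot start until the waiting task releases the only thread, which is the cost the question is trying to avoid.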

How can I make this code not block the worker?

1 answer:

Answer 0: (score: 1)

Using secede and rejoin, as described at http://dask.pydata.org/en/latest/futures.html#submit-tasks-from-tasks, you can write the waiting function as follows:

import time
from os.path import exists

import distributed

def wait_for_file(filename='some_filename', max_wait_time=600):
    start_time = time.time()
    # leave the worker's thread pool so another task can use this slot
    distributed.secede()
    try:
        while True:
            if time.time() - start_time > max_wait_time:
                raise Exception('Timeout')
            if exists(filename):
                return filename
            time.sleep(0.1)
    finally:
        # rejoin the pool of dask executor threads exactly once: finally
        # runs on the successful return as well as on a timeout or error,
        # so your client still learns when this function call failed
        distributed.rejoin()
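
One way to make sure `rejoin` runs exactly once per `secede` is to wrap the pair in a context manager. The following is a minimal sketch, not part of the dask API: `seceded`, `poll_interval`, and the `secede`/`rejoin` keyword hooks are hypothetical names introduced here so the helper can be tried without a cluster. When the hooks are left unset it falls back to `distributed.secede` and `distributed.rejoin`, and it raises `TimeoutError` instead of a bare `Exception`.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def seceded(secede=None, rejoin=None):
    """Leave the worker's thread pool for the duration of the with-block."""
    if secede is None or rejoin is None:
        # Imported lazily so the helper can be exercised without dask.
        import distributed
        secede = secede or distributed.secede
        rejoin = rejoin or distributed.rejoin
    secede()
    try:
        yield
    finally:
        # Runs once, whether the block returned normally or raised.
        rejoin()


def wait_for_file(filename='some_filename', max_wait_time=600,
                  poll_interval=0.1, secede=None, rejoin=None):
    deadline = time.monotonic() + max_wait_time
    with seceded(secede, rejoin):
        while not os.path.exists(filename):
            if time.monotonic() > deadline:
                raise TimeoutError(
                    f'{filename} did not appear within {max_wait_time} s')
            time.sleep(poll_interval)
    return filename
```

On a running cluster the function would be used unchanged, for example through `delayed(wait_for_file)()` as in the question, since the default hooks resolve to the `distributed` functions inside the task.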