Question

我有一些Python代码使用ThreadPoolExecutor来销售昂贵的工作，我想跟踪哪些已经完成，所以如果我必须重新启动这个系统，我不必重做已经完成的东西。在单线程环境中，我可以标记我在架子上所做的事情。这是多线程环境中这个想法的天真端口：

from concurrent.futures import ThreadPoolExecutor
import subprocess
import shelve


def do_thing(done, x):
    # Don't let the command run in the background; we want to be able to tell when it's done
    _ = subprocess.check_output(["some_expensive_command", x])
    done[x] = True


futs = []
with shelve.open("done") as done:
    with ThreadPoolExecutor(max_workers=18) as executor:
        for x in things_to_do:
            if done.get(x, False):
                continue
            futs.append(executor.submit(do_thing, done, x))
            # Can't run `done[x] = True` here--have to wait until do_thing finishes
        for future in futs:
            future.result()

    # Don't want to wait until here to mark stuff done, as the whole system might be killed at some point
    # before we get through all of things_to_do

我能逃脱这个吗？ documentation for shelve不包含任何有关线程安全的保证，因此我不这么认为。

那么处理这个问题的简单方法是什么？我认为也许在done[x] = True中坚持future.add_done_callback会做到这一点，但that will often run in the same thread as the future itself。也许有一个与ThreadPoolExecutor很好地配合的锁定机制？对我来说，写一个睡眠然后检查已完成的未来的循环对我来说似乎更清晰。

Answer 1

虽然你仍然在最外层with上下文管理器中，done搁置只是一个普通的python对象 - 它只在上下文管理器关闭时写入磁盘并且它运行其__exit__方法。因此，由于GIL（只要您使用CPython），它就像任何其他python对象一样是线程安全的。

具体来说，重新分配done[x] = True是线程安全的/将以原子方式完成。

值得注意的是，虽然搁置的__exit__方法将在Ctrl-C之后运行，但如果python进程突然结束，它就不会胜利，并且搁置赢了＆＃39;保存到磁盘。

为防止出现此类故障，我建议使用基于文件的轻量级线程安全数据库，例如sqllite3。

Python中

1 个答案: