Question

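I am writing a Python program that runs an external model in parallel a set number of times to map out data in a parameter space. Because of how the external model is written (for good reasons, I promise), I have to make a new copy of the model folder if I want to run it concurrently. I made copies of my main model folder foo and called them foo0, foo1, foo2 and foo3. Now I want each thread to go into its own specific directory, make some changes, run the model, write to a main file, and then move on to the next run. Each model run can take anywhere from 30 to 200 seconds, hence the benefit of running in parallel rather than in series.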

import subprocess
import threading
from joblib import Parallel

def run_model(base_path):
    #Make some changes using a random number generator in the folder
    ....
    #Run the model using bash on windows. Note the str(threading.get_ident()) is my
    #attempt to get the thread 0,1,2,3
    subprocess.run(['bash','-c', base_path + str(threading.get_ident()) + '/Model.exe'])
    #Write some input for the run to a main file that will store all runs
    with open('Inputs.txt','a') as file:
        with open(base_path + str(threading.get_ident()) + '/inp.txt') as inp_file:
            for i,line in enumerate(inp_file):
                if i == 5:
                    file.write(line)

Parallel(n_jobs=4, backend="threading")(run_model('Models/foo') for i in range(0,10000))

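However, I keep getting a FileNotFoundError, because the thread ID keeps changing and the corresponding folder does not exist. The model is large, so copying it for every new thread ID (into a folder named something like foo + thread_id) is both slow and wasteful of disk space. Is there a way to restrict a given copy of the model to a given thread, so that no other thread uses it at the same time?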

Answer 1

You can structure your program like this:

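1. First, the main thread finds the folders that need processing and puts them into a thread-safe queue. At this stage, make sure the queue contains only unique items. Use a synchronization primitive around the queue so that only one thread can access it at a time.
2. The worker threads are started.
3. The worker threads take work items off the synchronized, thread-safe queue and process them.
4. When no work is left in the queue, the threads finish and are joined.
5. When all threads have been joined, the work is done.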

The picture looks like this:

   Queue of unique dirs is constructed.
                  ||
                  \/            Consumer 0
                              /
                             / /Consumer 1
Queue(DirD->DirC->DirB->DirA)    ...
                             \ \Consumer i
                  ||          \  ...
                  \/            Consumer n

           Dirs are processed.
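
The steps above can be sketched roughly as follows. This is a minimal illustration rather than a drop-in solution: process_dirs, handle_dir and record are hypothetical names, and record merely stands in for "change the inputs, run Model.exe, append to Inputs.txt". Python's queue.Queue already does its own locking, so no extra primitive is needed around it.

```python
import queue
import threading

def process_dirs(dirs, n_workers, handle_dir):
    """Process each unique directory exactly once using n_workers threads."""
    work = queue.Queue()
    # dict.fromkeys drops duplicates while keeping the original order
    for d in dict.fromkeys(dirs):
        work.put(d)

    def worker():
        while True:
            try:
                d = work.get_nowait()
            except queue.Empty:
                return  # no work left: the thread ends and can be joined
            handle_dir(d)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Demo with a stand-in handler (hypothetical; replace with the real model run)
seen = []
seen_lock = threading.Lock()

def record(d):
    with seen_lock:  # protect the shared list, as one would protect Inputs.txt
        seen.append(d)

process_dirs(["Models/foo0", "Models/foo1", "Models/foo2", "Models/foo0"], 2, record)
```

Because every directory appears in the queue only once and each get hands an item to exactly one worker, no two workers can end up in the same directory.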

Answer 2

Just dedicate one process to each directory:

import subprocess
from multiprocessing import Process, Lock

def run_model(base_path, iterations, output_lock):
    for x in range(iterations):
        #Make some changes using a random number generator in the folder
        ...
        #Run the model using bash on windows.
        subprocess.run(['bash','-c', base_path + '/Model.exe'])
        #Write some input for the run to a main file that will store all runs
        with output_lock:
            with open('Inputs.txt','a') as out_file:
                with open(base_path + '/inp.txt') as inp_file:
                    for i,line in enumerate(inp_file):
                        if i == 5:
                            out_file.write(line)

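if __name__ == '__main__':
    # The guard is required on Windows, where child processes re-import this module
    N = 4
    total_runs = 10000
    process_list = list()
    output_lock = Lock()
    for x in range(N):
        # One process per model copy: Models/foo0 ... Models/foo3
        arguments = ("Models/foo%s" % x, total_runs // N, output_lock)
        p = Process(target=run_model, args=arguments)
        p.daemon = True
        p.start()
        process_list.append(p)
    # Wait for every worker process to finish
    for p in process_list:
        p.join()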

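I took the liberty of renaming the output file handle so that it does not shadow the built-in file type (a built-in in Python 2). I also added a lock to protect the output file.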
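
One detail worth noting: the per-process run count above comes from dividing total_runs by N, which silently drops the remainder whenever total_runs is not a multiple of N. A small sketch of one way to spread the runs evenly is shown here; split_runs is a hypothetical helper, not part of the answer's code.

```python
def split_runs(total_runs, n_workers):
    """Return a list of per-worker run counts that sums to total_runs."""
    base, extra = divmod(total_runs, n_workers)
    # The first `extra` workers each take one additional run
    return [base + (1 if i < extra else 0) for i in range(n_workers)]

counts = split_runs(10, 4)
```

Each entry of the returned list could then be passed as the iterations argument of the matching process.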

Answer 3

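In most cases you should not rely on threading.get_ident(), with one exception: telling whether two thread instances are actually the same thread. The original implementation is not that one legitimate use, which is why it ran into trouble.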

Try changing the call run_model('Models/foo') into some other form, for example:

    run_model('Models/foo', i%4)
    run_model('Models/foo{}'.format(i%4))

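whichever you prefer. Then modify the body of run_model() to use this new parameter, so that each worker operates on the intended folder. I think this should mostly solve the problem.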

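More importantly, though, the body of your code lacks synchronization. Some lock() or wait() mechanism is needed between the calls that start new threads; otherwise you will end up in another mess, for example 10,000 threads created at once, or 2,500 threads accessing the same file. :)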
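
To illustrate the synchronization point: i%4 on its own does not stop runs i and i+4 from overlapping when run times differ (30 to 200 seconds in the question). One possible approach, sketched here under assumptions, is to keep the free folders in a queue and have each run check one out and hand it back afterwards. run_all, run_one and fake_run are hypothetical names, and fake_run only simulates a model run.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_all(folders, total_runs, run_one):
    """Call run_one(folder) total_runs times; no folder is used by two threads at once."""
    free = queue.Queue()
    for f in folders:
        free.put(f)

    def task(_):
        folder = free.get()       # blocks until some folder is free
        try:
            return run_one(folder)
        finally:
            free.put(folder)      # hand the folder back for the next run

    # One worker per folder, so a free folder always exists for a running task
    with ThreadPoolExecutor(max_workers=len(folders)) as pool:
        return list(pool.map(task, range(total_runs)))

# Demo: fake_run records any case where a folder is entered while still in use
in_use = set()
clashes = []
state_lock = threading.Lock()

def fake_run(folder):
    with state_lock:
        if folder in in_use:
            clashes.append(folder)
        in_use.add(folder)
    time.sleep(0.001)             # stand-in for the 30 to 200 second model run
    with state_lock:
        in_use.discard(folder)
    return folder

results = run_all(["Models/foo%d" % k for k in range(4)], 20, fake_run)
```

The pool caps the number of threads at the number of folders, which also avoids creating thousands of threads at once. Writes to the shared Inputs.txt would still need their own lock, as in Answer 2.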

How can I prevent a parallel for loop in Python from accessing the same folder twice?
