Question

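I am doing a large number of computations and writing the results to a file. I am trying to parallelize the computations with multiprocessing.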

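The problem is that I am writing to a single output file, and all the workers write to that file as well. I am quite new to multiprocessing and wonder how to make this work.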

A very simple sketch of the code is given below:

from multiprocessing import Pool

fout_=open('test'+'.txt','w')

def f(x):
    fout_.write(str(x) + "\n")


if __name__ == '__main__':
    p = Pool(5)
    p.map(f, [1, 2, 3])

The result I want is a file containing:

1 2 3

However, right now I get an empty file. Any suggestions? I greatly appreciate any help :)!

Answer 1

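multiprocessing.Pool spawns processes, and writing to a common file from those processes without locking can lead to data loss. As you said, what you want to parallelize is the computation, and multiprocessing.Pool can be used for exactly that part.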

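Below is a solution that does the computation in parallel and then writes the results to the file from the parent process. Hope it helps: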

from multiprocessing import Pool

# library for time 
import datetime

# file in which you want to write 
fout = open('test.txt', 'w')

# function for your calculations; the loop is only there to make it time-consuming
def calc(x):
    x = x**2
    sum = 0
    for i in range(0, 1000000):
        sum += i
    return x

# function to write in txt file, it takes list of item to write
def f(res):
    global fout
    for x in res:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    qs = datetime.datetime.now()
    arr = [1, 2, 3, 4, 5, 6, 7]
    p = Pool(5)
    res = p.map(calc, arr)
    # write the calculated list in file
    f(res)
    qe = datetime.datetime.now()
    print((qe-qs).total_seconds()*1000)
    # to compare the improvement using multiprocessing, iterative solution
    qs = datetime.datetime.now()
    for item in arr:
        x = calc(item)
        fout.write(str(x)+"\n")
    qe = datetime.datetime.now()
    print((qe-qs).total_seconds()*1000)
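
A variant of the same idea, as a minimal sketch rather than a definitive implementation: the parent process is still the only writer, but it opens the file inside the main guard and writes each result as it arrives through Pool.imap instead of waiting for the full list. It assumes Python 3 (Pool used as a context manager); the names write_results and run are made up for illustration and are not part of the answer above.

```python
from multiprocessing import Pool


def calc(x):
    # Stand-in for the expensive computation done in a worker process.
    return x ** 2


def write_results(results, path):
    # Only the parent process calls this, so no locking is needed.
    with open(path, 'w') as fout:
        for value in results:
            fout.write(str(value) + "\n")


def run(values, path, workers=5):
    # imap yields results in input order as they become available.
    with Pool(workers) as pool:
        write_results(pool.imap(calc, values), path)


if __name__ == '__main__':
    run([1, 2, 3, 4, 5, 6, 7], 'test.txt')
```

Because the file is opened only inside write_results, worker processes never touch it, whichever start method the platform uses.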

Answer 2

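You should not have all the workers/processes write to a single file. They can all read from one file (which may slow things down, because workers wait for one another to finish reading), but writing to the same file will cause conflicts and potentially corruption.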

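As mentioned in the comments, write to separate files and then merge them into one file in a single process. This small program illustrates the idea, based on the program in your post: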

from multiprocessing import Pool

def f(args):
    ''' Perform computation and write
    to separate file for each '''
    x = args[0]
    fname = args[1]
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

def fcombine(orig, dest):
    ''' Combine files with names in 
    orig into one file named dest '''
    with open(dest, 'w') as fout:
        for o in orig:
            with open(o, 'r') as fin:
                for line in fin:
                    fout.write(line)

if __name__ == '__main__':
    # Each sublist is a combination
    # of arguments - number and temporary output
    # file name
    x = range(1,4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    args = list(zip(x,names))

    p = Pool(3)
    p.map(f, args)

    p.close()
    p.join()

    fcombine(names, 'final.txt')

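It runs f for each argument combination, which in this case is a value of x and a temporary file name. It uses a nested list of argument combinations because pool.map does not accept multiple arguments. There are other ways around this, especially in newer Python versions.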
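
One of those other ways, sketched under the assumption that Python 3.3 or later is available: Pool.starmap unpacks each tuple into separate arguments, so f can take x and fname directly. The helper build_args is a hypothetical name introduced here only to keep the example short.

```python
from multiprocessing import Pool


def f(x, fname):
    # Same job as f above, but with two ordinary parameters.
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")


def build_args(values):
    # Pair every value with its own temporary file name.
    return [(v, 'temp_' + str(v) + '.txt') for v in values]


if __name__ == '__main__':
    with Pool(3) as p:
        # starmap calls f(x, fname) once for each (x, fname) tuple.
        p.starmap(f, build_args(range(1, 4)))
```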

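For each argument combination, a pool member creates a separate file and writes its output to it. In practice your output will be longer; you can simply add another function that computes it and call it from f. Also, there is no need to use Pool(5) for 3 arguments (although I assume only three workers would be active anyway).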

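The reasons for calling close() and join() are explained in detail in this post. It turns out (in the comments of the linked post) that {{1}} is blocking, so here you do not need them for the original reason (waiting until all workers have finished and only then writing the combined output file from a single process). I would still use them in case other parallel features are added later.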
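
To illustrate when close() and join() do matter, here is a minimal sketch that uses a non-blocking call. Pool.map_async is not used in the answer; it is shown only as an assumption about what "other parallel features" might look like, and square is a made-up stand-in for real work.

```python
from multiprocessing import Pool


def square(x):
    # Trivial stand-in for a real computation.
    return x * x


if __name__ == '__main__':
    p = Pool(3)
    # map_async returns immediately with a handle to the pending results.
    pending = p.map_async(square, [1, 2, 3])
    p.close()  # no more tasks will be submitted to the pool
    p.join()   # wait until the worker processes have finished
    print(pending.get())  # [1, 4, 9]
```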

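In the last step, fcombine collects all the temporary files and copies them into one file. It is a bit too nested; if you decide, for instance, to delete the temporary files after copying, you may want to move the body of the with block in fcombine, or the for loop beneath it, into a separate function, for both readability and functionality.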
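
A possible shape for that refactoring, as a sketch only: the copying moves into a small helper, and an optional flag deletes the temporary files once the merge is done. The helper name append_file and the cleanup parameter are assumptions introduced here, not part of the original program.

```python
import os
import shutil


def append_file(src, fout):
    # Copy the whole content of src onto the end of the open file fout.
    with open(src, 'r') as fin:
        shutil.copyfileobj(fin, fout)


def fcombine(orig, dest, cleanup=False):
    # Merge the files named in orig into dest, in the given order.
    with open(dest, 'w') as fout:
        for name in orig:
            append_file(name, fout)
    if cleanup:
        # Delete the temporary files only after dest has been closed.
        for name in orig:
            os.remove(name)
```

Calling fcombine(names, 'final.txt', cleanup=True) in place of the original call would then merge the files and remove the temporaries in one step.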

Writing to file in Pool multiprocessing (Python 2.7)
