Spawning multiple processes to write different files in Python

Asked: 2016-11-07 19:55:41

Tags: python

The idea is to write N files using N processes.

The data for each output file comes from several input files, which are held in a dictionary whose values are lists, like this:

dic = {'file1':['data11.txt', 'data12.txt', ..., 'data1M.txt'],
       'file2':['data21.txt', 'data22.txt', ..., 'data2M.txt'],
       ...
       'fileN':['dataN1.txt', 'dataN2.txt', ..., 'dataNM.txt']}

So file1 = data11 + data12 + ... + data1M, and so on.

So my code looks like this:

jobs = []
for d in dic:
    outfile = str(d)+"_merged.txt"
    with open(outfile, 'w') as out:
        p = multiprocessing.Process(target = merger.merger, args=(dic[d], name, out))
        jobs.append(p)
        p.start()
        out.close()

And merger.py looks like this:

def merger(files, name, outfile):
    time.sleep(2)
    sys.stdout.write("Merging %n...\n" % name)

    # the reason for this step is that all the different files have a header
    # but I only need the header from the first file.
    with open(files[0], 'r') as infile:
        for line in infile:
            print "writing to outfile: ", name, line
            outfile.write(line) 
    for f in files[1:]:
        with open(f, 'r') as infile:
            next(infile) # skip first line
            for line in infile:
                outfile.write(line)
    sys.stdout.write("Done with: %s\n" % name)

I do see the files appear in the folder they should go to, but they are empty. No header, nothing. I put the print statements in there to check that everything was working, but nothing gets printed.

Help!

2 answers:

Answer 0 (score: 2)

Since the worker processes run in parallel with the main process that created them, the file named out is closed before the workers can write to it. This happens even if you remove out.close(), because of the with statement. Instead, pass each process the filename and let each process open and close the file itself.
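
For example, the spawning loop might become something like this (a minimal sketch; it assumes merger.merger is changed to take the output filename as its last argument and open the file itself, and it uses the dictionary key d as the name):

jobs = []
for d in dic:
    outfile = str(d) + "_merged.txt"
    # pass the filename only; the worker opens and closes the file
    p = multiprocessing.Process(target=merger.merger, args=(dic[d], d, outfile))
    jobs.append(p)
    p.start()

# wait for all the workers to finish
for p in jobs:
    p.join()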

Answer 1 (score: 2)

The problem is that you don't close the file in the child, so the internally buffered data is lost. You could either open the file in the child or wrap the whole thing in a try/finally block to make sure the file gets closed. A potential advantage of opening it in the parent is that you can handle file errors there. I'm not saying that's compelling, just an option.

def merger(files, name, outfile):
    try:
        time.sleep(2)
        sys.stdout.write("Merging %n...\n" % name)

        # the reason for this step is that all the different files have a header
        # but I only need the header from the first file.
        with open(files[0], 'r') as infile:
            for line in infile:
                print "writing to outfile: ", name, line
                outfile.write(line) 
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile) # skip first line
                for line in infile:
                    outfile.write(line)
        sys.stdout.write("Done with: %s\n" % name)
    finally:
        outfile.close()
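
If you open the file in the child instead, merger might look something like this (a sketch, assuming the caller passes the output filename rather than an open file object):

import sys
import time

def merger(files, name, outfile_name):
    time.sleep(2)
    sys.stdout.write("Merging %s...\n" % name)
    # the child owns the file: the with block guarantees flush and close
    with open(outfile_name, 'w') as outfile:
        # keep the header from the first file only
        with open(files[0], 'r') as infile:
            for line in infile:
                outfile.write(line)
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile)  # skip the header line
                for line in infile:
                    outfile.write(line)
    sys.stdout.write("Done with: %s\n" % name)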

<强>更新

关于父/子文件描述符以及子文件中发生的情况,存在一些混淆。如果程序退出时文件仍处于打开状态,则底层C库不会将数据刷新到磁盘。理论上说,正常运行的程序会在退出之前关闭事物。这是一个孩子因为没有关闭文件而丢失数据的例子。

import multiprocessing as mp
import os
import time

if os.path.exists('mytestfile.txt'):
    os.remove('mytestfile.txt')

def worker(f, do_close=False):
    time.sleep(2)
    print('writing')
    f.write("this is data")
    if do_close:
        print("closing")
        f.close()


print('without close')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, False))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())

print('with close')
os.remove('mytestfile.txt')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, True))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())

I ran it on Linux and got:

without close
writing
file data: 
with close
writing
closing
file data: this is data