Writing to the same file from multiple processes (avoiding locking)

Asked: 2019-06-12 11:12:16

Tags: python-3.x multiprocessing locking

I am using multiprocessing to run a script over a number of csv files.
Whenever a line matches the regex, that line is written to one or more new files (the new file name equals the match).
I have noticed problems when writing to the same file from different processes (file locking). How can I solve this?

My code:

import re
import glob
import os
import multiprocessing

pattern ='abc|def|ghi|jkl|mno'
regex = re.compile(pattern, re.IGNORECASE)

def process_files(file):
    res_path = r'd:\results'
    # Read each input file line by line; read-only access is enough here.
    with open(file, 'r') as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                res_file = os.path.join(res_path, match + '.csv') 
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():

    p = multiprocessing.Pool()
    for file in glob.iglob(r'D:\csv_files\**\*.csv', recursive=True):
        p.apply_async(process_files, [file])

    p.close()
    p.join()

if __name__ == '__main__':
    main()

Thanks!

1 answer:

Answer 0 (score: 3):

Make the file names unique for each subprocess:

def process_files(file, id):
    res_path = r'd:\results'
    with open(file, 'r') as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                # Tag the output file with the worker's id so that no two
                # processes ever append to the same file.
                filename = "{}_{}.csv".format(match, id)
                res_file = os.path.join(res_path, filename)
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():

    p = multiprocessing.Pool()
    for id, file in enumerate(glob.iglob(r'D:\csv_files\**\*.csv', recursive=True)):
        p.apply_async(process_files, [file, id])

    p.close()
    p.join()

Then you will have to add some code to merge the different "<match>_<id>.csv" files into single "<match>.csv" files.
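A minimal sketch of that merge step (the merge_results function is an assumption, built around the "{}_{}.csv" naming scheme from the answer's format string):

import glob
import os
from collections import defaultdict

def merge_results(res_path=r'd:\results'):
    # Group the per-process partial files by the match they belong to;
    # the worker id sits after the last underscore in the file name.
    groups = defaultdict(list)
    for part in glob.glob(os.path.join(res_path, '*_*.csv')):
        match = os.path.basename(part)[:-len('.csv')].rsplit('_', 1)[0]
        groups[match].append(part)

    for match, parts in groups.items():
        with open(os.path.join(res_path, match + '.csv'), 'w') as out:
            for part in sorted(parts):
                with open(part, 'r') as pf:
                    out.write(pf.read())
                os.remove(part)  # drop the partial file once it is merged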

This is to avoid concurrent writes to the same file: either you have no file locking and end up with corrupted data, or you have file locking, which then slows down the process and defeats the whole point of parallelizing it.
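For comparison, this is roughly what the lock-based alternative would look like (a sketch, assuming a shared multiprocessing.Lock handed to each worker through the pool's initializer; as argued above, the serialized writes can erase the gains from parallelism):

import glob
import multiprocessing
import os
import re

regex = re.compile('abc|def|ghi|jkl|mno', re.IGNORECASE)
lock = None

def init_worker(l):
    # Pool workers cannot take a Lock as a task argument, so the shared
    # lock is handed over once, when each worker process starts.
    global lock
    lock = l

def process_files(file):
    res_path = r'd:\results'
    with open(file, 'r') as ifile:
        for line in ifile:
            for match in set(regex.findall(line)):
                res_file = os.path.join(res_path, match + '.csv')
                with lock:  # serializes every append across all workers
                    with open(res_file, 'a') as rf:
                        rf.write(line)

def main():
    p = multiprocessing.Pool(initializer=init_worker,
                             initargs=(multiprocessing.Lock(),))
    for file in glob.iglob(r'D:\csv_files\**\*.csv', recursive=True):
        p.apply_async(process_files, [file])
    p.close()
    p.join()

if __name__ == '__main__':
    main()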