I am using multiprocessing to run a script over a number of csv files.
If a line matches a regex, that line is written to one or more new files (the new file name equals the match).
I have noticed problems when different processes write to the same file (file locking). How can I solve this?
My code:
import re
import glob
import os
import multiprocessing

pattern = 'abc|def|ghi|jkl|mno'
regex = re.compile(pattern, re.IGNORECASE)

def process_files(file):
    res_path = r'd:\results'
    with open(file, 'r+', buffering=1) as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                res_file = os.path.join(res_path, match + '.csv')
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():
    p = multiprocessing.Pool()
    for file in glob.iglob(r'D:\csv_files\**\*.csv', recursive=True):
        p.apply_async(process_files, [file])
    p.close()
    p.join()

if __name__ == '__main__':
    main()
Thanks!
Answer 0 (score: 3)
Make the file names unique for each subprocess:
def process_files(file, id):
    res_path = r'd:\results'
    with open(file, 'r') as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                filename = "{}_{}.csv".format(match, id)
                res_file = os.path.join(res_path, filename)
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():
    p = multiprocessing.Pool()
    for id, file in enumerate(glob.iglob(r'D:\csv_files\**\*.csv', recursive=True)):
        p.apply_async(process_files, [file, id])
    p.close()
    p.join()
You will then have to add some code to merge the different "&lt;match&gt;_&lt;id&gt;.csv" files into a single "&lt;match&gt;.csv" file per match.
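A minimal sketch of that merge step, assuming the results directory contains only the part files produced above and that the match strings themselves contain no underscore (`merge_results` is a hypothetical helper name, not part of the original answer):

```python
import glob
import os

def merge_results(res_path):
    # Combine the per-process "<match>_<id>.csv" parts into one
    # "<match>.csv" per match, then remove the part files.
    for part in glob.glob(os.path.join(res_path, '*_*.csv')):
        match = os.path.basename(part).rsplit('_', 1)[0]  # strip the "_<id>" suffix
        with open(part) as src, open(os.path.join(res_path, match + '.csv'), 'a') as dst:
            dst.write(src.read())
        os.remove(part)
```

Run it once in the parent process after `p.join()` returns, so no worker is still appending to a part file.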
The point is to avoid concurrent writes to the same file: either you have no file lock and end up with corrupted data, or you have a file lock, which slows the process down and defeats the purpose of parallelizing it.
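For comparison, a sketch of the locking alternative the answer argues against: a shared `multiprocessing.Lock` passed to each worker serializes every append, which protects the data but makes the workers wait on each other (the `init_worker` and `safe_append` helper names are hypothetical):

```python
import multiprocessing

def init_worker(lock):
    # Pool initializer: stash the shared lock in each worker process.
    global write_lock
    write_lock = lock

def safe_append(res_file, line):
    # Only one process may append at a time -- safe, but serialized.
    with write_lock:
        with open(res_file, 'a') as rf:
            rf.write(line)

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    p = multiprocessing.Pool(initializer=init_worker, initargs=(lock,))
    # workers would call safe_append(res_file, line) instead of a bare write
    p.close()
    p.join()
```

Every matched line then pays the cost of acquiring the lock, which is why the per-process file names above are usually the better trade.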