I am using multiprocessing to run a script over a number of csv files.
If a line matches a regex, that line is written to one or more new files (the new file name equals the match).
I have noticed problems when different processes write to the same file (file locking). How can I solve this?
My code:
import re
import glob
import os
import multiprocessing

pattern = 'abc|def|ghi|jkl|mno'
regex = re.compile(pattern, re.IGNORECASE)

def process_files(file):
    res_path = r'd:\results'
    with open(file, 'r+', buffering=1) as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                res_file = os.path.join(res_path, match + '.csv')
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():
    p = multiprocessing.Pool()
    for file in glob.iglob(r'D:\csv_files\**\*.csv', recursive=True):
        p.apply_async(process_files, [file])
    p.close()
    p.join()

if __name__ == '__main__':
    main()
Thanks!
Answer 0 (score: 3)
Make the file names unique for each subprocess:
def process_files(file, id):
    res_path = r'd:\results'
    with open(file, 'r') as ifile:
        for line in ifile:
            matches = set(regex.findall(line))
            for match in matches:
                filename = "{}_{}.csv".format(match, id)
                res_file = os.path.join(res_path, filename)
                with open(res_file, 'a') as rf:
                    rf.write(line)

def main():
    p = multiprocessing.Pool()
    for id, file in enumerate(glob.iglob(r'D:\csv_files\**\*.csv', recursive=True)):
        p.apply_async(process_files, [file, id])
    p.close()
    p.join()
You will then have to add some code to merge the different "&lt;match&gt;_&lt;id&gt;.csv" files into a single "&lt;match&gt;.csv" file per match.
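A minimal sketch of that merge step, assuming the results directory contains only the part files produced above and that the match strings themselves contain no underscore (`merge_results` is a hypothetical helper name, not part of the original answer):

```python
import glob
import os

def merge_results(res_path):
    # Combine the per-process "<match>_<id>.csv" parts into one
    # "<match>.csv" per match, then remove the part files.
    for part in glob.glob(os.path.join(res_path, '*_*.csv')):
        match = os.path.basename(part).rsplit('_', 1)[0]  # strip the "_<id>" suffix
        with open(part) as src, open(os.path.join(res_path, match + '.csv'), 'a') as dst:
            dst.write(src.read())
        os.remove(part)
```

Run it once in the parent process after `p.join()` returns, so no worker is still appending to a part file.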
The point is to avoid concurrent writes to the same file: either you have no file lock and end up with corrupted data, or you have a file lock, which slows the process down and defeats the purpose of parallelizing it.
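For comparison, a sketch of the locking alternative the answer argues against: a shared `multiprocessing.Lock` passed to each worker serializes every append, which protects the data but makes the workers wait on each other (the `init_worker` and `safe_append` helper names are hypothetical):

```python
import multiprocessing

def init_worker(lock):
    # Pool initializer: stash the shared lock in each worker process.
    global write_lock
    write_lock = lock

def safe_append(res_file, line):
    # Only one process may append at a time -- safe, but serialized.
    with write_lock:
        with open(res_file, 'a') as rf:
            rf.write(line)

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    p = multiprocessing.Pool(initializer=init_worker, initargs=(lock,))
    # workers would call safe_append(res_file, line) instead of a bare write
    p.close()
    p.join()
```

Every matched line then pays the cost of acquiring the lock, which is why the per-process file names above are usually the better trade.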