I am compressing files. A single process is fine for a few of them, but I'm compressing thousands, which can (and has) taken days, so I want to speed it up with multiprocessing. I've read that I should avoid having multiple processes reading files at the same time, and I'm guessing I shouldn't have multiple processes writing at once either. This is my current method that runs singly:
import tarfile, bz2, os

def compress(folder):
    "compresses a folder into a file"
    bz_file = bz2.BZ2File(folder + '.tbz', 'w')
    with tarfile.open(mode='w', fileobj=bz_file) as tar:
        for fn in os.listdir(folder):
            # read each file in the folder and do some pre-processing
            # that will make the compressed file much smaller than without
            tar.addfile( processed file )
    bz_file.close()
    return
This takes a folder and compresses all of its contents into a single file, which makes them easier to handle and more organized. If I just toss this into a pool, I'd have several processes reading and writing at the same time, so I want to avoid that. I can rework it so only one process is reading the files, but then I still have multiple processes writing:
import multiprocessing as mp
import tarfile, bz2, os

def compress(file_list):
    folder = file_list[0]
    bz_file = bz2.BZ2File(folder + '.tbz', 'w')
    with tarfile.open(mode='w', fileobj=bz_file) as tar:
        for i in file_list[1:]:
            # preprocess file data
            tar.addfile(processed data)
    bz_file.close()
    return

cpu_count = mp.cpu_count()
p = mp.Pool(cpu_count)
for subfolder in os.listdir(main_folder):
    # read all files in subfolder into memory, place into file_list
    # place file_list into fld_list until fld_list contains cpu_count
    # file lists, then pass to p.map(compress, fld_list)
This still has many processes writing compressed files at once; just the act of telling tarfile what kind of compression to use starts writing to the hard drive. I can't read all of the files I need to compress into memory, since I don't have that much RAM, so it also has the problem that I'm restarting Pool.map many times.
How can I read and write files in a single process, but do all the compression in several processes, while avoiding restarting multiprocessing.Pool multiple times?
Answer 0 (score: 4)
Rather than using multiprocessing.Pool, you should use multiprocessing.Queue and create an inbox and an outbox.
Start a single process to read the files in and put the data on the inbox queue, and cap the size of that queue so you don't end up filling your RAM. The example here compresses single files, but it can be adjusted to handle whole folders at once; a folder-level variant is sketched after the reader below.
import os

def reader(inbox, input_path, num_procs):
    "process that reads in files to be compressed and puts to inbox"
    for fn in os.listdir(input_path):
        path = os.path.join(input_path, fn)
        # read in each file, put its name and data into the inbox
        with open(path, 'r') as src:
            lines = src.readlines()
        inbox.put([fn, lines])  # fn from listdir is already a bare file name
    # everything has been read in; add a finished notice for each compressor
    for i in range(num_procs):
        inbox.put(None)  # when a compressor sees a None, it will stop
    inbox.close()
    return
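As mentioned above, the reader can be adjusted to queue whole folders instead of single files. A minimal sketch, assuming each subfolder of input_path should become one archive; folder_reader is a hypothetical name, and the compressor would need a matching change to add every member file to the tar:

def folder_reader(inbox, input_path, num_procs):
    "hypothetical variant: queue one whole subfolder per message"
    for sub in os.listdir(input_path):
        sub_path = os.path.join(input_path, sub)
        if not os.path.isdir(sub_path):
            continue
        files = []
        for fn in os.listdir(sub_path):
            with open(os.path.join(sub_path, fn), 'r') as src:
                files.append((fn, src.readlines()))
        inbox.put([sub, files])  # one message per folder
    for i in range(num_procs):
        inbox.put(None)
    inbox.close()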
But that's only half the problem; the other part is compressing the file without having to write it to disk. We give the compression function a BytesIO object instead of an open file (tarfile needs a binary file object, which is why BytesIO rather than StringIO); tarfile writes into it, and once compressed we'd put the BytesIO object on the outbox queue.

Except we can't do that, because BytesIO objects can't be pickled, and only pickleable objects can go into a queue. However, BytesIO's getvalue function can hand the contents back in a pickleable form (plain bytes), so grab the contents with getvalue, close the BytesIO object, and put the contents on the outbox instead.
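A quick illustration of that constraint (a standalone snippet, not part of the pipeline):

import pickle
from io import BytesIO

pickle.dumps(b'compressed bytes')  # fine: bytes are pickleable
try:
    pickle.dumps(BytesIO(b'compressed bytes'))
except TypeError as e:
    print(e)  # "cannot pickle '_io.BytesIO' object" or similar, by Python version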
from io import BytesIO
import tarfile

def compressHandler(inbox, outbox):
    "process that pulls from inbox, compresses and puts to outbox"
    supplier = iter(inbox.get, None)  # stops when it gets a None
    while True:
        try:
            data = next(supplier)     # grab data from inbox
            pressed = compress(data)  # compress it
            outbox.put(pressed)       # put into outbox
        except StopIteration:
            outbox.put(None)  # finished compressing, inform the writer
            return            # and quit

def compress(data):
    "compress one file's data in memory, return (name, bytes)"
    fname, lines = data  # see reader def for package order
    bz_file = BytesIO()  # in-memory buffer; tarfile needs a binary file object
    with tarfile.open(mode='w:bz2', fileobj=bz_file) as tar:
        content = ''.join(lines).encode('utf-8')
        info = tarfile.TarInfo(fname)  # store the file name
        info.size = len(content)       # addfile writes exactly info.size bytes
        tar.addfile(info, BytesIO(content))  # compress
    data = bz_file.getvalue()
    bz_file.close()
    return (fname, data)  # the writer needs the name to build the output path
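For example, a quick sanity check with hypothetical data in the same [name, lines] package the reader produces:

name, blob = compress(['notes.txt', ['line one\n', 'line two\n']])
print(name, len(blob))  # 'notes.txt' and the size of the in-memory .tbz bytes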
Then the writer process pulls things off the outbox queue and writes them to disk. This function needs to know how many compression processes were started, so it knows to stop only once it has heard that every one of them has stopped.
import os

def writer(outbox, output_path, num_procs):
    "single process that writes compressed files to disk"
    num_fin = 0
    while True:
        if num_fin >= num_procs:
            break  # all compression processes have finished
        tardata = outbox.get()
        if tardata is None:
            num_fin += 1  # a compression process has finished
            continue
        fn, data = tardata
        name = os.path.join(output_path, fn) + '.tbz'
        with open(name, 'wb') as dst:
            dst.write(data)
    return
Finally, there's the setup to put it all together:
import multiprocessing as mp
import os

def setup():
    fld = 'file/path'
    # multiprocessing setup
    num_procs = mp.cpu_count()
    # inbox and outbox queues
    inbox = mp.Queue(4 * num_procs)  # limit size so reading doesn't fill RAM
    outbox = mp.Queue()
    # one process to read (don't reuse the function name as the variable name,
    # or the local assignment shadows the function before target= can see it)
    reader_p = mp.Process(target=reader, args=(inbox, fld, num_procs))
    reader_p.start()
    # n processes to compress
    compressors = [mp.Process(target=compressHandler, args=(inbox, outbox))
                   for i in range(num_procs)]
    for c in compressors: c.start()
    # one process to write
    writer_p = mp.Process(target=writer, args=(outbox, fld, num_procs))
    writer_p.start()
    writer_p.join()  # the writer exits last, so waiting on it waits for everything
    print('done!')
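One detail the snippet leaves implicit: on platforms that spawn new interpreters instead of forking (Windows, and macOS by default on recent Pythons), the module's entry point must be guarded, or every child process will re-import the module and try to launch its own pipeline:

if __name__ == '__main__':
    setup()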