I'm trying to share an existing object across multiple processes using the proxy methods described here. My multiprocessing idiom is the worker/queue setup, modeled after the fourth example here.

The code needs to do some calculations on data that are stored in rather large files on disk. I have a class that encapsulates all of the I/O interactions; once it reads a file from disk, it saves the data in memory for the next time a task needs to use the same data (which happens often).

I thought I had everything working from reading the examples linked above. Here is a mock-up of the code that just uses random numpy arrays to simulate the disk I/O:
import numpy
from multiprocessing import Process, Queue, current_process, Lock
from multiprocessing.managers import BaseManager

nfiles = 200
njobs = 1000

class BigFiles:

    def __init__(self, nfiles):
        # Start out with nothing read in.
        self.data = [ None for i in range(nfiles) ]
        # Use a lock to make sure only one process is reading from disk at a time.
        self.lock = Lock()

    def access(self, i):
        # Get the data for a particular file
        # In my real application, this function reads in files from disk.
        # Here I mock it up with random numpy arrays.
        if self.data[i] is None:
            with self.lock:
                self.data[i] = numpy.random.rand(1024,1024)
        return self.data[i]

    def summary(self):
        return 'BigFiles: %d, %d Storing %d of %d files in memory'%(
            id(self),id(self.data),
            (len(self.data) - self.data.count(None)),
            len(self.data) )

# I'm using a worker/queue setup for the multiprocessing:
def worker(input, output):
    proc = current_process().name
    for job in iter(input.get, 'STOP'):
        (big_files, i, ifile) = job
        data = big_files.access(ifile)

        # Do some calculations on the data
        answer = numpy.var(data)

        msg = '%s, job %d'%(proc, i)
        msg += '\n Answer for file %d = %f'%(ifile, answer)
        msg += '\n ' + big_files.summary()
        output.put(msg)

# A class that returns an existing file when called.
# This is my attempted workaround for the fact that Manager.register needs a callable.
class ObjectGetter:
    def __init__(self, obj):
        self.obj = obj
    def __call__(self):
        return self.obj

def main():
    # Prior to the place where I want to do the multiprocessing,
    # I already have a BigFiles object, which might have some data already read in.
    # (Here I start it out empty.)
    big_files = BigFiles(nfiles)
    print 'Initial big_files.summary = ',big_files.summary()

    # My attempt at making a proxy class to pass big_files to the workers
    class BigFileManager(BaseManager):
        pass
    getter = ObjectGetter(big_files)
    BigFileManager.register('big_files', callable = getter)
    manager = BigFileManager()
    manager.start()

    # Set up the jobs:
    task_queue = Queue()
    for i in range(njobs):
        ifile = numpy.random.randint(0, nfiles)
        big_files_proxy = manager.big_files()
        task_queue.put( (big_files_proxy, i, ifile) )

    # Set up the workers
    nproc = 12
    done_queue = Queue()
    process_list = []
    for j in range(nproc):
        p = Process(target=worker, args=(task_queue, done_queue))
        p.start()
        process_list.append(p)
        task_queue.put('STOP')

    # Log the results
    for i in range(njobs):
        msg = done_queue.get()
        print msg
    print 'Finished all jobs'
    print 'big_files.summary = ',big_files.summary()

    # Shut down the workers
    for j in range(nproc):
        process_list[j].join()
    task_queue.close()
    done_queue.close()

main()
This works, in the sense that it computes everything correctly, and it is caching the data that gets read along the way. The only problem I'm having is that at the end, the big_files object doesn't have any of the files loaded. The final msg returned is:
Process-2, job 999
 Answer for file 198 = 0.083406
 BigFiles: 4303246400, 4314056248 Storing 198 of 200 files in memory
But then after it's all done, we have:
Finished all jobs
big_files.summary = BigFiles: 4303246400, 4314056248 Storing 0 of 200 files in memory
So my question is: what happened to all the stored data? It claims to be using the same self.data according to id(self.data), but the list is empty now.
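One detail that may be relevant here: with BaseManager, the registered callable runs inside the manager's own server process, and every method call made through a proxy is executed in that server process too, not in the process that originally built the object. Below is a minimal sketch for observing this (the Probe/Getter/ProbeManager names are made up for illustration; it assumes the Unix fork start method, so nothing needs to be pickled when the manager starts):

```python
import os
from multiprocessing.managers import BaseManager

class Probe:
    # Tiny stand-in for BigFiles: just reports the pid it executes in.
    def where(self):
        return os.getpid()

class Getter:
    # Same idea as ObjectGetter in the question: return an existing object.
    def __init__(self, obj):
        self.obj = obj
    def __call__(self):
        return self.obj

class ProbeManager(BaseManager):
    pass

probe = Probe()
ProbeManager.register('probe', callable=Getter(probe))

manager = ProbeManager()
manager.start()
proxy = manager.probe()

main_pid = os.getpid()
server_pid = proxy.where()   # this call executes in the manager's server process
print('main pid:   %d' % main_pid)
print('server pid: %d' % server_pid)
manager.shutdown()
```

If the two pids differ, the proxy is operating on a copy of the object living in the manager's process, not on the one held by the main process.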
I want the end state of big_files to contain all the saved data that it accumulated along the way, since I actually have to repeat this entire process many times, and I don't want to redo all the (slow) I/O each time.

I assume it must have something to do with my ObjectGetter class. The examples for using BaseManager only show how to make a new object that will be shared, not how to share an existing object. Am I doing something wrong with how I get the existing big_files object? Can anyone suggest a better way to do this step?

Thanks much!