The following code works, but it is very slow because a large dataset is being passed to the processes. In my actual implementation, creating the processes and sending the data takes nearly as long as the calculation itself, so by the time the second process is created the first one is almost finished with its computation, making the parallelization pointless.

The code is the same as in this question, Multiprocessing has cutoff at 992 integers being joined as result, with the suggested change working and implemented below. However, I have run into what I suspect is a common problem for others as well: pickling large amounts of data takes a long time.

I have seen answers that pass shared-memory arrays using multiprocessing.Array. I have an array of ~4000 indices, but each index holds a dictionary with 200 key/value pairs. The data is only read by each process; some calculation is done, and then a matrix (4000x3) is returned (no dicts).

Answers like Is shared readonly data copied to different processes for Python multiprocessing? use map. Is it possible to keep the structure below and still implement shared memory? Is there an efficient way to send the data to each process given an array of dicts, for example by wrapping the dicts in some manager and then placing that inside a multiprocessing.Array?
import multiprocessing

def main():
    data = {}
    total = []
    for j in range(0,3000):
        total.append(data)
        for i in range(0,200):
            data[str(i)] = i
    CalcManager(total,start=0,end=3000)

def CalcManager(myData,start,end):
    print 'in calc manager'

    #Multiprocessing
    #Set the number of processes to use.
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Set up an empty list to store our processes
    procs = []

    #Divide up the data for the set number of processes
    interval = (end-start)/nprocs
    new_start = start

    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        print 'starting processes'
        new_end = new_start + interval
        #Make sure we don't go past the size of the data
        if new_end > end:
            new_end = end
        #Slice out this process's share of the data
        data = myData[new_start:new_end]

        #Create the process and pass it the data and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(data,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()

        #Increment our next start to the current end
        new_start = new_end+1

    print 'finished starting'

    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print result

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

#Multiprocess handling
def multiProcess(data,start,end,result_q,proc_num):
    print 'started process'

    results = []
    temp = []
    for i in range(0,22):
        results.append(temp)
        for j in range(0,3):
            temp.append(j)

    result_q.put(results)
    return

if __name__== '__main__':
    main()
SOLVED

Simply placing the list of dictionaries into a manager solved the problem.
manager = Manager()
d = manager.list(myData)
It appears that the manager holding the list also manages the dictionaries contained in that list. Startup is somewhat slow, so it seems the data is still being copied, but the copy is done once at the beginning, and the data is then sliced inside the process itself.
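A rough, standalone way to check that one-time copy cost (the ~4000 x 200 shape is taken from the description above; the timing will of course vary by machine):

import time
from multiprocessing import Manager

if __name__ == '__main__':
    # data of roughly the shape described above: ~4000 dicts of 200 pairs
    total = [dict((str(i), i) for i in range(200)) for j in range(4000)]
    manager = Manager()
    t0 = time.time()
    d = manager.list(total)    # the whole list is pickled into the manager once
    print 'copy into manager took %.2fs' % (time.time() - t0)

The full revised code: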
import multiprocessing
import multiprocessing.sharedctypes as mt
from multiprocessing import Process, Lock, Manager
from ctypes import Structure, c_double

def main():
    data = {}
    total = []
    for j in range(0,3000):
        total.append(data)
        for i in range(0,100):
            data[str(i)] = i
    CalcManager(total,start=0,end=500)

def CalcManager(myData,start,end):
    print 'in calc manager'
    print type(myData[0])
    manager = Manager()
    d = manager.list(myData)

    #Multiprocessing
    #Set the number of processes to use.
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Set up an empty list to store our processes
    procs = []

    #Divide up the data for the set number of processes
    interval = (end-start)/nprocs
    new_start = start

    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        new_end = new_start + interval
        #Make sure we don't go past the size of the data
        if new_end > end:
            new_end = end
        #Slice out this process's share of the data
        data = myData[new_start:new_end]

        #Create the process and pass it the manager proxy and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(d,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()

        #Increment our next start to the current end
        new_start = new_end+1

    print 'finished starting'

    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print len(result)

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

#Multiprocess handling
def multiProcess(data,start,end,result_q,proc_num):
    #print 'started process'
    results = []
    temp = []

    data = data[start:end]
    for i in range(0,22):
        results.append(temp)
        for j in range(0,3):
            temp.append(j)

    print len(data)
    result_q.put(results)
    return

if __name__ == '__main__':
    main()
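For reference, the map-based answers mentioned in the question would express the same read-only fan-out roughly like this sketch (the work function is a placeholder; the pool handles distributing and pickling the inputs):

import multiprocessing

def work(chunk):
    # placeholder computation: one 3-wide row per input dict
    return [[0, 1, 2] for item in chunk]

if __name__ == '__main__':
    total = [dict((str(i), i) for i in range(100)) for j in range(3000)]
    pool = multiprocessing.Pool(processes=3)
    chunks = [total[i::3] for i in range(3)]   # three roughly equal slices
    matrices = pool.map(work, chunks)          # each chunk is pickled to its worker
    pool.close()
    pool.join()
    print len(matrices)                        # 3 partial result matrices

Note that this still pickles each chunk out to its worker, so by itself it does not avoid the copying cost the question is about.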
Answer 0 (score: 2)
Looking at your question, I assume the following:

- For each item in myData, you want an output returned (a matrix of some sort)
- You created a JoinableQueue (tasks), probably for holding the input, but are not sure how to use it

The code:

import logging
import multiprocessing

def create_logger(logger_name):
    ''' Create a logger that logs to the console '''
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)

    # create console handler and set appropriate level
    ch = logging.StreamHandler()
    formatter = logging.Formatter("%(processName)s %(funcName)s() %(levelname)s: %(message)s")
    ch.setFormatter(formatter)
    logger.addHandler(ch)

    return logger

def main():
    global logger
    logger = create_logger(__name__)
    logger.info('Main started')
    data = []
    for i in range(0,100):
        data.append({str(i):i})

    CalcManager(data,start=0,end=50)
    logger.info('Main ended')

def CalcManager(myData,start,end):
    logger.info('CalcManager started')
    #Initialize the multiprocessing queues: one for input, one for results
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()

    # Add tasks
    for i in range(start, end):
        tasks.put(myData[i])

    # Create processes to do the work
    nprocs = 3
    for i in range(nprocs):
        logger.info('starting processes')
        p = multiprocessing.Process(target=worker,args=(tasks,results))
        p.daemon = True
        p.start()

    # Wait for task completion, i.e. until the tasks queue is empty
    try:
        tasks.join()
    except KeyboardInterrupt:
        logger.info('Cancel tasks')

    # Print out the results
    print 'RESULTS'
    while not results.empty():
        result = results.get()
        print result

    logger.info('CalcManager ended')

def worker(tasks, results):
    while True:
        try:
            task = tasks.get()   # one row of input
            task['done'] = True  # simulate work being done
            results.put(task)    # save the result to the output queue
        finally:
            # JoinableQueue: for every get(), we need a task_done()
            tasks.task_done()

if __name__== '__main__':
    main()
Some notes:

- I use the logging module because it offers some advantages; for one, the format string above tags each message with the process name and function, which keeps interleaved output from several workers readable.
- CalcManager is essentially a task manager, which does the following: it populates the input queue (tasks), creates the worker processes, waits for the tasks to complete, and prints out the results.
- worker is where the work is done. Each worker runs forever (a while True loop); each time through the loop it gets one unit of input, processes it, and puts the result on the output queue. After a task is done it calls task_done(), so that the main process knows when all jobs are finished. I put task_done in the finally clause to make sure it runs even if an error occurs during processing.
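One detail of this design: the workers are daemonized, so they are simply killed when the main process exits rather than being joined. A common alternative (a hypothetical sketch here, not part of the answer above) is to push one sentinel per worker so each loop can exit cleanly and the processes can be joined:

import multiprocessing

def worker(tasks, results):
    while True:
        task = tasks.get()
        if task is None:          # sentinel: no more work is coming
            tasks.task_done()
            break
        try:
            task['done'] = True   # simulate work being done
            results.put(task)
        finally:
            tasks.task_done()     # one task_done() per get()

if __name__ == '__main__':
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()
    for i in range(10):
        tasks.put({str(i): i})
    nprocs = 3
    procs = [multiprocessing.Process(target=worker, args=(tasks, results))
             for i in range(nprocs)]
    for p in procs:
        p.start()
    for i in range(nprocs):       # one sentinel per worker
        tasks.put(None)
    tasks.join()                  # returns once every get() is matched by task_done()
    for p in procs:
        p.join()                  # workers exit on their sentinels
    while not results.empty():
        print results.get()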
Answer 1 (score: 2)

You may see some improvement by using a multiprocessing.Manager to store your list in a manager server, and having each child process access items from the dicts by pulling them from that one shared list, rather than copying a slice to each child process:
def CalcManager(myData,start,end):
    print 'in calc manager'
    print type(myData[0])
    manager = Manager()
    d = manager.list(myData)

    nprocs = 3
    result_q = multiprocessing.Queue()
    procs = []

    interval = (end-start)/nprocs
    new_start = start

    for i in range(nprocs):
        new_end = new_start + interval
        if new_end > end:
            new_end = end
        p = multiprocessing.Process(target=multiProcess,
                                    args=(d, new_start, new_end, result_q, i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end
        new_start = new_end+1

    print 'finished starting'

    for i in range(nprocs):
        result = result_q.get()
        print len(result)

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()
This copies your entire data list to the Manager process before any of the workers are created. The Manager returns a Proxy object that allows shared access to the list. You then just pass the Proxy to the workers, which means their startup time will be greatly reduced, since there is no longer any need to copy slices of the data list. The downside is that accessing the list will be slower in the children, because each access has to go to the manager process over IPC. Whether or not this really helps performance depends very much on what work you are doing on the list in your worker processes, but it is worth a try, since it requires very few code changes.
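Since every proxy access is an IPC round trip, how the children read the list matters a great deal. A minimal sketch of the trade-off, assuming the same d proxy as above:

def multiProcess(d, start, end, result_q, proc_num):
    # One proxy call: the whole slice is pickled and sent over IPC once.
    local = d[start:end]

    # By contrast, indexing the proxy inside a loop pays one manager
    # round trip per item:
    #   for i in range(start, end):
    #       item = d[i]

    result_q.put(len(local))

The solved code above does exactly this one-shot slice inside multiProcess (data = data[start:end]), which keeps the per-item IPC cost down.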