I have a program that tries to predict email conversion for every email sent over the course of a week (usually 7 sends). The output is 7 different files, each with prediction scores for every customer. Running these serially can take close to 8 hours, so I tried to parallelize them with multiprocessing. That speeds things up, but I've noticed that after a process finishes it appears to hold on to its memory, until eventually there is none left and a process gets killed by the system without completing its task.
I based my code on the 'manual pool' example in this answer, because I need to limit the number of processes that start at once due to memory constraints. What I would like is that when a process finishes, it releases its memory back to the system, freeing up room for the next worker.
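(For reference, the behaviour I'm after is roughly what multiprocessing.Pool gives you with maxtasksperchild=1, where each worker process is discarded and replaced after a single task; a rough sketch, reusing the work_loop and training_dict names from my code below:)

from multiprocessing import Pool

if __name__ == '__main__':
    # maxtasksperchild=1 replaces each worker process after one task,
    # so its memory is returned to the OS before the next key is handled
    pool = Pool(processes=4, maxtasksperchild=1)
    pool.map(work_loop, training_dict.keys())
    pool.close()
    pool.join()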
Here is the code that handles the concurrency:
import itertools
from multiprocessing import Manager, Process

def work_controller(in_queue, out_list):
    while True:
        key = in_queue.get()
        print key
        if key == None:
            return

        work_loop(key)
        out_list.append(key)

if __name__ == '__main__':
    num_workers = 4
    manager = Manager()
    results = manager.list()
    work = manager.Queue(num_workers)

    processes = []
    for i in xrange(num_workers):
        p = Process(target=work_controller, args=(work, results))
        processes.append(p)
        p.start()

    # training_dict is built earlier in the script (not shown)
    iters = itertools.chain([key for key in training_dict.keys()])
    for item in iters:
        work.put(item)

    for p in processes:
        print "Joining Worker"
        p.join()
And here is the actual work code, in case it helps:
import datetime
import pickle
import pandas as pd
# train_dataframe, data_cleanse, imbalance, test_file and output_file
# are defined elsewhere in the project (not shown)

def work_loop(key):
    with open('email_training_dict.pkl', 'rb') as f:
        training_dict = pickle.load(f)
    df_test = pd.DataFrame.from_csv(test_file)
    outdict = {}
    target = 'is_convert'

    df_train = train_dataframe(key)
    features = data_cleanse(df_train, df_test)

    # MAIN PREDICTION
    print 'Start time: {}'.format(datetime.datetime.now()) + '\n'

    # train/test by mailer
    X_train = df_train[features]
    X_test = df_test[features]
    y_train = df_train[target]

    # run model fit
    clf = imbalance.ImbalanceClassifier()
    clf = clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)

    outdict[key] = clf.y_vote
    print outdict[key]

    print 'Time Complete: {}'.format(datetime.datetime.now()) + '\n'

    with open(output_file, 'wb') as f:
        pickle.dump(outdict, f)
Answer 0 (score: 1)
I assume that, like the example you linked, you are using a Queue.Queue() as your queue object. This is a blocking queue, which means a call to queue.get() will either return an element, or wait/block until it can return an element.
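To illustrate the difference, here is a tiny standalone sketch (plain Queue.Queue; the item names are only for illustration) of a blocking get() versus a non-blocking get(False):

import Queue

q = Queue.Queue()
q.put('mailer_1')

print q.get()           # an item is available, so this returns immediately

try:
    q.get(False)        # non-blocking: the queue is now empty, so ...
except Queue.Empty:     # ... Queue.Empty is raised instead of waiting forever
    print 'queue is empty'

# q.get()               # a blocking get here would hang until another put()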
Try changing your work_controller function to the following:
import Queue  # needed for the Queue.Empty exception

def work_controller(in_queue, out_list):
    while True:  # return when the queue is empty
        try:
            key = in_queue.get(False)  # pass False so the get does not block
        except Queue.Empty:
            return

        print key
        work_loop(key)
        out_list.append(key)
While the above solves the blocking problem, it creates another one: at the start of the threads' lives there are no items in in_queue, so they will return immediately.
To work around this, I suggest adding a flag that indicates whether it is okay to terminate:
import itertools
import Queue
from multiprocessing import Manager, Process

global ok_to_end  # put this flag in a global space

def work_controller(in_queue, out_list):
    while True:  # return when the queue is empty
        try:
            key = in_queue.get(False)  # pass False so the get does not block
        except Queue.Empty:
            if ok_to_end:  # consult the flag before ending
                return
            continue       # queue is empty but more work may arrive; try again

        print key
        work_loop(key)
        out_list.append(key)

if __name__ == '__main__':
    num_workers = 4
    manager = Manager()
    results = manager.list()
    work = manager.Queue(num_workers)

    processes = []
    ok_to_end = False  # termination flag
    for i in xrange(num_workers):
        p = Process(target=work_controller, args=(work, results))
        processes.append(p)
        p.start()

    iters = itertools.chain([key for key in training_dict.keys()])
    for item in iters:
        work.put(item)

    ok_to_end = True  # termination flag set to True after the queue is filled

    for p in processes:
        print "Joining Worker"
        p.join()
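As an aside, your original worker already checks for a None sentinel (if key == None: return), so another option, sketched here with the names from your script, is to skip the flag and instead enqueue one None per worker once all of the real keys have been put on the queue:

# Sentinel sketch: keep the original blocking work_controller and its
# `if key == None: return` check, and terminate each worker explicitly.
for item in training_dict.keys():
    work.put(item)

for _ in xrange(num_workers):
    work.put(None)      # one sentinel per worker process

for p in processes:
    print "Joining Worker"
    p.join()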