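I'm using pandas to compute statistics over a large amount of data, but a full run ends up taking several hours, and I get new data frequently. I've already tried optimizing it, but I want it to be faster, so I'm trying to make it use multiple processes. The problem is that I need to do some interim work on each result as soon as it finishes, and the examples I've seen for multiprocessing.Process and Pool all wait for everything to finish before using the results.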
Here is a heavily trimmed-down version of the code I'm using now. The part I want to move into separate processes is generateAnalytics().
for counter, symbol in enumerate(queuelist): # queuelist
    if needQueueLoad: # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
        log.info('Shutting down analyticsRunner thread')
        break
    dfDay = generateAnalytics(symbol) # slow running function (15s+)
    astore[analyticsTable(symbol)] = dfDay # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
    dfLatest.loc[symbol] = dfDay.iloc[-1] # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)
    log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
    # do some stuff to update progress GUI
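I can't figure out how to have the last few lines work on each result while the remaining processing carries on, and I would appreciate suggestions.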
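I'm considering running it in a Pool and having the processes add their results to a Queue (rather than returning them), then having a while loop in the main process that handles results as they arrive in the queue. Is that a reasonable approach? Something like: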
mpqueue = multiprocessing.Queue()
pool = multiprocessing.Pool()
pool.map(generateAnalytics, [queuelist, mpqueue])
while not needQueueLoad: # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
    while not mpqueue.empty():
        dfDay = mpqueue.get()
        astore[analyticsTable(symbol)] = dfDay # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
        dfLatest.loc[symbol] = dfDay.iloc[-1] # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)
        log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
        # do some stuff to update GUI that shows progress
    sleep(0.1)
    # do some bookkeeping to see if queue has finished
pool.join()
Answer 0 (score: 2)
Using a Queue looks like a reasonable approach, with two comments.
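First, since your code suggests you are using a GUI, it is probably better to check for results in a timeout function or an idle function rather than in a while loop. Checking for results in a while loop will block the GUI's event loop.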
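As a rough illustration of the timeout-function idea, here is a minimal sketch of a non-blocking "drain" step that a GUI timer could call repeatedly. The names `drain_results`, `handle_result` and `max_items` are hypothetical and not from the question; the sketch only assumes the queue offers `get_nowait()` and raises `queue.Empty` when nothing is waiting, which holds for both `queue.Queue` and `multiprocessing.Queue`.

```python
import queue


def drain_results(result_queue, handle_result, max_items=50):
    """Handle results that are already waiting, without ever blocking.

    Hypothetical helper: meant to be called from a GUI timer or idle
    callback. Returns the number of results handled in this call.
    """
    handled = 0
    while handled < max_items:  # cap the work done per call so the GUI stays responsive
        try:
            item = result_queue.get_nowait()
        except queue.Empty:
            break  # nothing else is ready; return control to the event loop
        handle_result(item)
        handled += 1
    return handled
```

The question does not say which GUI toolkit is in use. With tkinter, for instance, a callback could call `drain_results(...)` and then reschedule itself with `root.after(100, callback)`; other toolkits have their own timer or idle hooks. In this setting `handle_result` would be where the store write, the `dfLatest` update and the progress update happen.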
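Second, if the worker processes need to return a lot of data to the main process through the queue, that will add significant overhead, because every object put on a multiprocessing Queue is pickled, sent through a pipe, and unpickled on the other side. You may want to consider using shared memory or even an intermediate file.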
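One way the intermediate-file suggestion might look is sketched below: the worker writes its result to disk and sends back only a small `(symbol, path)` tuple, and the main process loads the file when it is ready to use it. `save_result` and `load_result` are hypothetical helper names, and plain `pickle` stands in for whatever on-disk format suits the real DataFrames. Whether this actually beats sending the data through the queue depends on the data size and the storage, so it is worth measuring.

```python
import os
import pickle
import tempfile


def save_result(symbol, result, out_dir):
    """Worker side (hypothetical helper): write the result to a file in
    out_dir and return only the small (symbol, path) pair."""
    fd, path = tempfile.mkstemp(prefix="result_", suffix=".pkl", dir=out_dir)
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(result, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return symbol, path


def load_result(path, delete=True):
    """Main-process side (hypothetical helper): read the result back and,
    by default, remove the intermediate file afterwards."""
    with open(path, "rb") as fh:
        result = pickle.load(fh)
    if delete:
        os.remove(path)
    return result
```

In this arrangement the worker would call `save_result(symbol, dfDay, out_dir)` at the end of its computation and put the returned tuple on the queue, so only a short string crosses the process boundary.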