Python multiprocessing - how to act on intermediate results

Time: 2015-08-17 09:31:21

Tags: python multiprocessing

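I'm using pandas to compute statistics on a large amount of data, but it ends up running for several hours, and I frequently receive new data. I've already tried optimizing, but I'd like it to be faster still, so I'm trying to make it use multiple processes. The problem is that I need to do some interim work on each result as soon as it completes, and the multiprocessing.Pool examples I've seen all wait for everything to finish before using the results.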

Here's a heavily trimmed version of the code I'm using right now. The part I want to move into separate processes is generateAnalytics().

for counter, symbol in enumerate(queuelist):  # queuelist
    if needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
        log.info('Shutting down analyticsRunner thread')
        break
    dfDay = generateAnalytics(symbol)  # slow running function (15s+)
    astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
    dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)

    log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
    # do some stuff to update progress GUI 

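I can't figure out how to run the last few lines on each result while the remaining work is still in progress, and would appreciate suggestions.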

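I'm considering running this in a Pool and having the processes put their results on a Queue (rather than returning them), then having a while loop in the main process that handles results as they arrive on the queue. Is that a reasonable approach? Something like: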

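import multiprocessing
from time import sleep

def analyticsWorker(args):  # hypothetical wrapper: sends (symbol, result) through the queue instead of returning it
    symbol, queue = args
    queue.put((symbol, generateAnalytics(symbol)))

mpqueue = multiprocessing.Manager().Queue()  # a plain multiprocessing.Queue cannot be passed to Pool workers as a task argument
pool = multiprocessing.Pool()
pool.map_async(analyticsWorker, [(symbol, mpqueue) for symbol in queuelist])  # map() would block until every task is done
pool.close()  # no more tasks will be submitted

counter = 0
while not needQueueLoad and counter < len(queuelist):  # needQueueLoad is set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
    while not mpqueue.empty():
        symbol, dfDay = mpqueue.get()
        astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
        dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)
        counter += 1
        log.info('Processed {}/{} securities in queue.'.format(counter, len(queuelist)))
        # do some stuff to update GUI that shows progress
    sleep(0.1)
pool.join()  # waits for any tasks that are still running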

1 answer:

Answer 0 (score: 2)

Using a Queue looks like a reasonable approach, with two remarks.

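  1. Since your code suggests you are using a GUI, it is probably better to check for results in a timeout function or idle function rather than in a while loop. Polling for results in a while loop will block the GUI's event loop.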

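  2. If the worker processes need to return large amounts of data to the main process through the queue, that adds considerable overhead. You may want to consider using shared memory or even intermediate files.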
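
Both remarks can be sketched together. This is only an illustration under stated assumptions, not code from the question: save_result, drain_results, handle_result, out_dir and max_items are hypothetical names, results are pickled to files named after the symbol (so symbols are assumed to be valid file names), and only the file path travels through the queue. drain_results never blocks, so it could be called from a GUI timeout or idle callback instead of a while loop.

```python
import os
import pickle
import queue
import tempfile


def save_result(symbol, result, out_dir):
    """Worker side: write the result to an intermediate file, return its path.

    Only this short path needs to go through the queue, not the data itself.
    Assumes `symbol` is usable as a file name (an assumption of this sketch).
    """
    path = os.path.join(out_dir, "{}.pkl".format(symbol))
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return path


def drain_results(result_queue, handle_result, max_items=50):
    """Main-process side: handle the results that are ready, without blocking.

    Expects (symbol, path) tuples on the queue. Returns how many results were
    handled in this call, so the caller can keep a running progress count.
    """
    handled = 0
    while handled < max_items:  # cap the work done per call to keep the GUI responsive
        try:
            symbol, path = result_queue.get_nowait()
        except queue.Empty:
            break  # nothing more is ready; hand control back to the event loop
        with open(path, "rb") as fh:
            handle_result(symbol, pickle.load(fh))
        os.remove(path)  # the intermediate file is no longer needed
        handled += 1
    return handled
```

In a worker, the call might look like result_queue.put((symbol, save_result(symbol, generateAnalytics(symbol), out_dir))). In the main process, a periodic GUI callback (for example one scheduled with tkinter's after(), if that is the toolkit in use) would call drain_results and then reschedule itself.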