Question

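I'm using pandas to compute statistics on a large amount of data, but it ends up running for several hours, and I get new data frequently. I've already tried optimizing it, but I want it to be faster, so I'm trying to make it use multiple processes. The problem I'm having is that I need to do some interim work with each result as soon as it is finished, and the examples I've seen for multiprocessing.Process and Pool all wait for everything to finish before working with the results.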

This is the heavily redacted code I'm using now. The part I want to move into separate processes is generateAnalytics().

for counter, symbol in enumerate(queuelist):  # queuelist
    if needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
        log.info('Shutting down analyticsRunner thread')
        break
    dfDay = generateAnalytics(symbol)  # slow running function (15s+)
    astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
    dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)

    log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
    # do some stuff to update progress GUI

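I can't figure out how to make the last few lines work with the results as they come in while the processing continues, and I would appreciate suggestions.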

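I'm considering running it all in a Pool and having the processes add their results to a Queue (rather than return them), then having a while loop sit in the main process and pull results off the queue as they come in. Would that be a reasonable way to do it? Something like: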

mpqueue = multiprocessing.Queue()
pool = multiprocessing.Pool()
pool.map(generateAnalytics, [queuelist, mpqueue])

while not needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
    while not mpqueue.empty():
        dfDay = mpqueue.get()
        astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
        dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)    
        log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
        # do some stuff to update GUI that shows progress            
    sleep(0.1)
    # do some bookkeeping to see if queue has finished
pool.join()

Answer 1

Using a Queue looks like a reasonable approach, with two remarks.

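1. Since your code suggests that you are using a GUI, it would probably be better to check for results in a timeout function or idle function instead of in a while loop. Using a while loop to check for results will block the GUI's event loop.
2. If the worker processes need to return a lot of data to the main process through the Queue, this will add significant overhead. You might want to consider using shared memory or even an intermediate file.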
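
As a rough illustration of the first remark, here is a minimal sketch of workers pushing results onto a queue while the main process drains it without blocking. It is not the asker's actual code: generate_analytics, worker, drain, and the sample symbols are hypothetical stand-ins, and the sleep-based loop only imitates what a GUI timer or idle callback would do. It assumes a Manager queue, because a plain multiprocessing.Queue cannot be passed to Pool workers as a task argument.

```python
import multiprocessing as mp
import queue  # provides queue.Empty, raised by get_nowait()
import time


def generate_analytics(symbol):
    # Hypothetical stand-in for the slow generateAnalytics() in the question.
    return symbol, len(symbol)


def worker(symbol, results):
    # Runs in a worker process: compute, then push (symbol, value) onto the queue.
    results.put(generate_analytics(symbol))


def drain(results, handle):
    # Handle everything currently queued and return immediately.
    # In a GUI, this is what a timeout or idle callback would call.
    handled = 0
    while True:
        try:
            symbol, value = results.get_nowait()
        except queue.Empty:
            return handled
        handle(symbol, value)
        handled += 1


if __name__ == "__main__":
    symbols = ["AAA", "BB", "C"]  # made-up sample data
    latest = {}
    with mp.Manager() as manager:
        results = manager.Queue()  # proxy object that can be sent to Pool workers
        with mp.Pool(2) as pool:
            pending = pool.starmap_async(worker, [(s, results) for s in symbols])
            while True:
                finished = pending.ready()  # check first so the last drain misses nothing
                drain(results, latest.__setitem__)
                if finished:
                    break
                time.sleep(0.05)  # stands in for returning control to the event loop
            pending.get()  # re-raises any exception from a worker
    print(latest)
```

Because drain never blocks, it can be scheduled repeatedly by whatever timer mechanism the GUI toolkit offers. The second remark still applies: every result put on the queue is pickled and copied between processes, so large DataFrames may be cheaper to pass through an intermediate file.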
