Does a multiprocessing Pool hang when mapping across a dataframe?

Time: 2017-01-19 18:31:21

Tags: python multithreading pandas numpy

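I am trying to split a pandas dataframe into chunks and then run a function across each chunk in parallel (based on this example). The regular non-chunked version works fine (if slowly), but for some reason the chunked version fails completely: the pool sits at 0% CPU usage and the script never finishes. I have put together a quick reproducible example below, in case anyone can see why this doesn't work.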

import pandas as pd
from multiprocessing import Pool
import numpy as np
import time

def samplefunction(dfinputlist):
    dfinputlist=dfinputlist*2
    return dfinputlist

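def parallelize_dataframe(df, func):
    # Split the dataframe row-wise into 2 chunks
    df_split = np.array_split(df, 2)
    # Start a pool of 4 worker processes (only 2 chunks, so at most 2 are busy)
    pool = Pool(4)
    # Apply func to each chunk in a worker and stitch the results back together
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df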

if __name__ == "__main__":
    # 100 million rows x 4 columns of random integers in [0, 50)
    dfinputlist = pd.DataFrame(np.random.randint(0, 50, size=(100000000, 4)), columns=list('ABCD'))
    # Time the plain, single-process version
    start = time.time()
    dfinputlist = samplefunction(dfinputlist)
    print('Finished Non-Parallel Version after ' + str(time.time() - start) + ' seconds.')
    # Time the chunked, multi-process version (this is the call that never returns)
    start = time.time()
    output = parallelize_dataframe(dfinputlist, samplefunction)
    print('Finished Parallel Version after ' + str(time.time() - start) + ' seconds.')
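
One way to narrow this down might be to run the same split / map / concat pattern on a much smaller frame first. The sketch below is not a confirmed fix, only a scaled-down variant under a few assumptions: the names `double_chunk`, `split_rows` and `parallel_apply` are made up for illustration, the frame has 1,000 rows instead of 100 million, and the chunks are cut with positional `iloc` slicing rather than `np.array_split`. Since `Pool.map` pickles every chunk to send it to a worker and pickles every result on the way back, a small case that completes would point toward the data size rather than the pattern itself.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def double_chunk(chunk):
    # Same work as samplefunction in the question: double every value.
    return chunk * 2


def split_rows(df, n_chunks):
    # Hypothetical helper: cut df into n_chunks contiguous row-wise pieces.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def parallel_apply(df, func, n_chunks=2, n_workers=2):
    # Each chunk is pickled, sent to a worker process, and the result is
    # pickled back before pd.concat reassembles the pieces in order.
    chunks = split_rows(df, n_chunks)
    with Pool(n_workers) as pool:
        return pd.concat(pool.map(func, chunks))


if __name__ == "__main__":
    # Assumption: 1,000 rows is small enough that transfer cost is negligible.
    small = pd.DataFrame(np.random.randint(0, 50, size=(1000, 4)), columns=list("ABCD"))
    result = parallel_apply(small, double_chunk)
    print(result.shape)
```

If this finishes promptly, increasing the row count step by step should show at what size the pool stops making progress.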

0 answers:

No answers