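I am trying to split a pandas DataFrame into chunks and then run a function across each chunk in parallel (based on this example). The regular, non-chunked version works fine (though slowly), but for some reason the chunked version fails completely: the pool sits at 0% CPU usage and the script never finishes. I have put together a quick reproducible example below, in case anyone can explain why this does not work.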
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd


def samplefunction(dfinputlist):
    # Toy workload: double every value in the frame.
    dfinputlist = dfinputlist * 2
    return dfinputlist


def parallelize_dataframe(df, func):
    # Split the frame into 2 chunks and map func over them with a pool of 4 workers.
    df_split = np.array_split(df, 2)
    pool = Pool(4)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df


if __name__ == "__main__":
    dfinputlist = pd.DataFrame(np.random.randint(0, 50, size=(100000000, 4)), columns=list('ABCD'))

    # Non-parallel version: works, but is slow.
    start = time.time()
    dfinputlist = samplefunction(dfinputlist)
    print('Finished non-parallel version after ' + str(time.time() - start) + ' seconds.')

    # Parallel version: this is the part that never finishes.
    start = time.time()
    output = parallelize_dataframe(dfinputlist, samplefunction)
    print('Finished parallel version after ' + str(time.time() - start) + ' seconds.')
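
To narrow this down, it may help to run the same pattern on a much smaller frame first. The sketch below is only a diagnostic variant, not a confirmed fix. It rests on the fact that `pool.map` pickles every chunk to send it to a worker, so chunk size might matter here; whether that is the actual cause of the hang is an assumption. The names `double_chunk` and `split_rows`, the 100,000-row size, and the choice of 4 chunks are all introduced here for illustration and are not part of the original script.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def double_chunk(chunk):
    # Same toy workload as samplefunction: double every value.
    return chunk * 2


def split_rows(df, n_chunks):
    # Hypothetical helper: cut the frame into n_chunks contiguous row slices.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def parallelize_dataframe(df, func, n_chunks=4, n_workers=4):
    chunks = split_rows(df, n_chunks)
    # The context manager cleans up the pool even if a worker raises.
    with Pool(n_workers) as pool:
        results = pool.map(func, chunks)
    return pd.concat(results)


if __name__ == "__main__":
    # Assumed small size, chosen only so the test finishes quickly.
    small = pd.DataFrame(np.random.randint(0, 50, size=(100_000, 4)), columns=list("ABCD"))
    out = parallelize_dataframe(small, double_chunk)
    # Compare against the serial result to confirm the chunks were reassembled correctly.
    print(out.equals(small * 2))
```

If this small version completes while the full-size one still hangs, that would point toward the amount of data being sent to the workers rather than the pool setup itself.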