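Multiprocessing with pandas: pass a df to a function, modify the df in the function, and return the modified df to the parent process

Asked: 2019-05-02 00:37:51

Tags: pandas api dataframe multiprocessing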

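I am trying to use multiprocessing to improve the performance of my API calls.

Any suggestions would be greatly appreciated.

Here is the general idea of the program (the real one has more detail):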

data = pd.read_csv(filename, sep=";", converters={i: str for i in range(0, 156)})

for index, series in data.iterrows():
   #
   # this is where the api calls and calculations happen
   #

data.to_csv(filename, index=False, columns=headers)

For this example, let's say the dataframe looks like this (× 10,000+ rows):

data['Client_Code'] = 'ABCD'
data['Mode'] = 'Air'
data['Account_Number'] = 'ABC123'
data['Invoice Number'] = '987654321'
data['Tracking_Number'] = '1357924680' 
data['Delivered'] = ''

So far I have this code, which splits the data into chunks and creates the pool:

num_processes = multiprocessing.cpu_count()
chunk_size = int(data.shape[0]/num_processes)
chunks = [data.loc[data.index[i:i + chunk_size]] for i in range(0, data.shape[0], chunk_size)]

def func(x):
   for index, series in x.iterrows():
      #simulates api call
      x.at[index,'Delivered'] = 'Yes'
   print(x['Delivered'])
   return x


if __name__ == '__main__':
   pool = multiprocessing.Pool(processes=num_processes)
   result = pool.map(func, chunks)
   pool.close()
   pool.join()
   print(data['Delivered'])

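1) Am I going about this the right (most efficient) way?

2) How do I get the data back from the function, so that I can call data.to_csv(filename, index=False, columns=headers)?

The print statement inside the function shows the updated values, but the one outside does not.

Thanks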
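
A note on question 2, offered as a sketch rather than a definitive fix: each worker process receives a pickled copy of its chunk, so changes made inside the function never reach `data` in the parent. The modified chunks come back as the return value of `pool.map`, and `pd.concat` can stitch them together again. The names `mark_delivered`, `split_rows`, and `process_in_parallel` below are hypothetical, the two-column frame is made-up sample data, and the assignment of `'Yes'` stands in for the real API call.

```python
import multiprocessing

import numpy as np
import pandas as pd


def mark_delivered(chunk):
    """Worker: runs in a child process on a pickled copy of the chunk."""
    chunk = chunk.copy()
    for index in chunk.index:
        # Stand-in for the real API call (hypothetical).
        chunk.at[index, "Delivered"] = "Yes"
    return chunk


def split_rows(frame, n_parts):
    """Split a frame into at most n_parts contiguous row blocks."""
    n_parts = max(1, min(n_parts, len(frame)))
    bounds = np.linspace(0, len(frame), n_parts + 1, dtype=int)
    return [frame.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def process_in_parallel(frame, n_processes):
    """Run mark_delivered over the chunks and reassemble the results."""
    chunks = split_rows(frame, n_processes)
    with multiprocessing.Pool(processes=n_processes) as pool:
        results = pool.map(mark_delivered, chunks)
    # The parent's frame is untouched; the modified rows live in results.
    return pd.concat(results)


if __name__ == "__main__":
    # Made-up sample data in the shape described in the question.
    data = pd.DataFrame({
        "Tracking_Number": ["1357924680"] * 10,
        "Delivered": [""] * 10,
    })
    data = process_in_parallel(data, n_processes=2)
    print(data["Delivered"])
```

In the code from the question, the equivalent change would be `data = pd.concat(result)` after `pool.join()`, followed by the `to_csv` call. On question 1, if the per-row work is mostly waiting on the network, a thread pool may also be worth trying, though that is an assumption about the workload and not something the post confirms.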

0 answers:

No answers