I'm trying to use multiprocessing to improve the performance of API calls.
Any advice would be greatly appreciated.
Here is the general idea of the program; more details follow below:
import pandas as pd

data = pd.read_csv(filename, sep=";", converters={i: str for i in range(0, 156)})
for index, series in data.iterrows():
    #
    # this is where the API calls and calculations happen
    #
    pass
data.to_csv(filename, index=False, columns=headers)
For this example, let's say the dataframe looks like this (x 10,000+ rows):
data['Client_Code'] = 'ABCD'
data['Mode'] = 'Air'
data['Account_Number'] = 'ABC123'
data['Invoice Number'] = '987654321'
data['Tracking_Number'] = '1357924680'
data['Delivered'] = ''
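For reference, a self-contained frame with these columns can be built like this (using the placeholder values above, repeated to stand in for the 10,000+ real rows):

```python
import pandas as pd

# Build a sample frame matching the columns above; the values are the
# placeholders from this post, repeated to simulate 10,000+ rows.
n_rows = 10_000
data = pd.DataFrame({
    'Client_Code': ['ABCD'] * n_rows,
    'Mode': ['Air'] * n_rows,
    'Account_Number': ['ABC123'] * n_rows,
    'Invoice Number': ['987654321'] * n_rows,
    'Tracking_Number': ['1357924680'] * n_rows,
    'Delivered': [''] * n_rows,
})
print(data.shape)  # (10000, 6)
```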
So far, I have the following, which creates the pool:
import multiprocessing

num_processes = multiprocessing.cpu_count()
chunk_size = int(data.shape[0]/num_processes)
chunks = [data.loc[data.index[i:i + chunk_size]] for i in range(0, data.shape[0], chunk_size)]
def func(x):
    for index, series in x.iterrows():
        # simulates the API call
        x.at[index, 'Delivered'] = 'Yes'
    print(x['Delivered'])
    return x
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=num_processes)
    result = pool.map(func, chunks)
    pool.close()
    pool.join()
    print(data['Delivered'])
1) Am I doing this the right (most efficient) way?
2) How do I get the data back out of the function, so that I can then run data.to_csv(filename, index=False, columns=headers)?
The print statement inside the function works, but the one outside does not.
Thanks