I'm trying to use multiprocessing to improve the performance of API calls.
Any advice would be greatly appreciated.
Here is the general idea of the program; more details follow below:
import pandas as pd

data = pd.read_csv(filename, sep=";", converters={i: str for i in range(0, 156)})
for index, series in data.iterrows():
    #
    # this is where the API calls and calculations happen
    #
    pass
data.to_csv(filename, index=False, columns=headers)
For this example, let's say the dataframe looks like this (x 10,000+ rows):
data['Client_Code'] = 'ABCD'
data['Mode'] = 'Air'
data['Account_Number'] = 'ABC123'
data['Invoice Number'] = '987654321'
data['Tracking_Number'] = '1357924680'
data['Delivered'] = ''
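For reference, a self-contained frame with these columns can be built like this (using the placeholder values above, repeated to stand in for the 10,000+ real rows):

```python
import pandas as pd

# Build a sample frame matching the columns above; the values are the
# placeholders from this post, repeated to simulate 10,000+ rows.
n_rows = 10_000
data = pd.DataFrame({
    'Client_Code': ['ABCD'] * n_rows,
    'Mode': ['Air'] * n_rows,
    'Account_Number': ['ABC123'] * n_rows,
    'Invoice Number': ['987654321'] * n_rows,
    'Tracking_Number': ['1357924680'] * n_rows,
    'Delivered': [''] * n_rows,
})
print(data.shape)  # (10000, 6)
```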
So far, I have the following, which creates the pool:
import multiprocessing

num_processes = multiprocessing.cpu_count()
chunk_size = int(data.shape[0]/num_processes)
chunks = [data.loc[data.index[i:i + chunk_size]] for i in range(0, data.shape[0], chunk_size)]
def func(x):
    for index, series in x.iterrows():
        # simulates the API call
        x.at[index, 'Delivered'] = 'Yes'
    print(x['Delivered'])
    return x
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=num_processes)
    result = pool.map(func, chunks)
    pool.close()
    pool.join()
    print(data['Delivered'])
1) Am I doing this the right (most efficient) way?
2) How do I get the data back out of the function, so that I can then run data.to_csv(filename, index=False, columns=headers)?
The print statement inside the function works, but the one outside does not.
Thanks