My code flow looks like this:
import pandas as pd
import threading
import helpers

upload_thread = None  # no upload has been started yet

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        # check whether the previous upload is still running
        if isinstance(upload_thread, threading.Thread):
            if upload_thread.is_alive():  # isAlive() was removed in Python 3.9
                print('waiting for the last upload op to finish')
                upload_thread.join()
        # start the upload in another thread, so the loop can continue on the next chunk
        upload_thread = threading.Thread(target=helpers.uploading, kwargs=kwargs)
        upload_thread.start()
It works, but the problem is that running it with threads makes it slower!
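My idea for the code flow was:
1. Process a chunk of data.
2. When that is done, upload it in the background.
3. While it uploads, advance the loop to the next step, i.e. process the next chunk of data.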
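To make the intended flow concrete, here is a minimal sketch of the steps above. `fake_prepare` and `fake_upload` are hypothetical stand-ins for `prepare_df` and `helpers.uploading` (they are not the real functions), and the sleep only simulates an upload that waits on I/O. At most one upload is in flight at a time, as in my loop.

```python
import threading
import time


def fake_prepare(chunk):
    # Hypothetical stand-in for prepare_df: some CPU work on the chunk.
    return [x * 2 for x in chunk]


def fake_upload(data, uploaded):
    # Hypothetical stand-in for helpers.uploading: simulate waiting on I/O.
    time.sleep(0.01)
    uploaded.append(data)


def run_pipeline(chunks):
    uploaded = []
    upload_thread = None  # no upload started yet
    for chunk in chunks:
        ready = fake_prepare(chunk)  # step 1: process the chunk
        if upload_thread is not None:
            upload_thread.join()  # wait for the previous upload to finish
        # steps 2 and 3: upload in the background, then move on to the next chunk
        upload_thread = threading.Thread(target=fake_upload, args=(ready, uploaded))
        upload_thread.start()
    if upload_thread is not None:
        upload_thread.join()  # do not forget the last upload
    return uploaded


print(run_pipeline([[1, 2], [3, 4], [5, 6]]))
```

Because each upload is joined before the next one starts, the uploads happen in chunk order. Only the preparation of chunk N+1 overlaps with the upload of chunk N.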
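In theory this sounds fine, but after a lot of trials and timing, I believe the threading is slowing the code down.
I'm sure I messed something up, so please help me figure out what is wrong.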
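Also, the function 'helpers.uploading' returns important results to me. How can I get those results? Ideally, I need to append the result of each iteration to a results list. Without threads, it would be something like: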
import pandas as pd
import helpers

results = []
for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        result = helpers.uploading(**kwargs)
        results.append(result)
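For comparison, one way the threaded version might collect return values is with `concurrent.futures.ThreadPoolExecutor`, whose `submit()` returns a `Future` that carries the function's result. This is only a sketch under assumptions: `fake_prepare` and `fake_upload` are hypothetical stand-ins for `prepare_df` and `helpers.uploading`, and a single worker is used so that only one upload runs at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_prepare(chunk):
    # Hypothetical stand-in for prepare_df.
    return [x + 1 for x in chunk]


def fake_upload(data):
    # Hypothetical stand-in for helpers.uploading: returns a result we care about.
    return len(data)


def process_all(chunks):
    results = []
    # One worker thread: at most one upload is in flight at a time.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # Future of the upload currently in flight, if any
        for chunk in chunks:
            ready = fake_prepare(chunk)
            if pending is not None:
                # result() blocks until that upload finishes and returns its value
                results.append(pending.result())
            pending = pool.submit(fake_upload, ready)
        if pending is not None:
            results.append(pending.result())  # collect the last upload
    return results


print(process_all([[1, 2, 3], [4], []]))
```

The results end up in the same order as the chunks, because each `Future` is read before the next upload is submitted. If `fake_upload` raises, `result()` re-raises that exception in the main thread.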
Thanks!