python - A separate thread writing data in parallel makes my code slower, but why?

Asked: 2019-02-06 14:20:05

Tags: python multithreading parallel-processing multiprocessing python-multithreading

My code flow looks like this:

import pandas as pd
import threading
import helpers

# files, prepare_df, and kwargs are defined elsewhere
upload_thread = None  # no upload is running before the first chunk

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        # wait for the previous upload, if any, before starting a new one
        if isinstance(upload_thread, threading.Thread):
            if upload_thread.is_alive():
                print('waiting for the last upload op to finish')
                upload_thread.join()

        # starts the upload in another thread, so the loop can continue on the next chunk
        upload_thread = threading.Thread(target=helpers.uploading, kwargs=kwargs)
        upload_thread.start()

It works. The problem is that running it with a thread makes it slower!

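My idea for the code flow was:

  1. Process a chunk of data

  2. When that is done, upload it in the background

  3. While it uploads, let the loop move on to the next step, i.e. processing the next chunk of data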

In theory this sounds good, but after a lot of trials and timing, I believe the thread is slowing the code flow down.
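To make the timing concrete, here is a minimal sketch of how the main thread's time could be split between preparing chunks and waiting on the previous upload. The function `timed_pipeline` and the stand-ins `fake_prepare` and `fake_upload` are hypothetical and not part of the original code; the stand-ins only sleep, which simulates I/O-bound work.

```python
import threading
import time


def timed_pipeline(chunks, prepare, upload):
    """Prepare on the main thread, upload on a background thread, and record
    how long the main thread spends preparing versus waiting on uploads."""
    prepare_s = 0.0
    wait_s = 0.0
    upload_thread = None
    start = time.perf_counter()

    for chunk in chunks:
        t0 = time.perf_counter()
        ready = prepare(chunk)
        prepare_s += time.perf_counter() - t0

        # Wait for the previous upload (if any) before starting the next one.
        if upload_thread is not None:
            t0 = time.perf_counter()
            upload_thread.join()
            wait_s += time.perf_counter() - t0

        upload_thread = threading.Thread(target=upload, args=(ready,))
        upload_thread.start()

    # Wait for the final upload so it is included in the total.
    if upload_thread is not None:
        t0 = time.perf_counter()
        upload_thread.join()
        wait_s += time.perf_counter() - t0

    total_s = time.perf_counter() - start
    return {"prepare_s": prepare_s, "wait_s": wait_s, "total_s": total_s}


# Hypothetical stand-ins: time.sleep simulates I/O-bound work.
def fake_prepare(chunk):
    time.sleep(0.05)
    return chunk


def fake_upload(ready):
    time.sleep(0.05)


timings = timed_pipeline(range(4), fake_prepare, fake_upload)
print(timings)
```

If `prepare_s` measured this way comes out much larger than in the non-threaded run, one possible explanation (not confirmed by anything in this post) is that both steps are CPU-bound Python code competing for the GIL, in which case a thread cannot make them overlap.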

I'm sure I messed something up. Please help me find out what the problem is.

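Also, the function 'helpers.uploading' returns important results to me. How can I get those results? Ideally, I need to append the result of each iteration to a results list. Without threads, it would look something like this: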

import pandas as pd
import helpers

results = []

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        result = helpers.uploading(**kwargs)
        results.append(result)
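One way the threaded version might collect the same results is with `concurrent.futures.ThreadPoolExecutor` from the standard library, whose `submit` returns a `Future` that holds the return value. The sketch below keeps at most one upload in flight, like the threaded loop above; `run_with_results`, `fake_prepare`, and `fake_upload` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_results(chunks, prepare, upload):
    """Overlap prepare (main thread) with upload (worker thread) and
    return the upload results in chunk order."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # Future for the upload currently in flight
        for chunk in chunks:
            ready = prepare(chunk)
            if pending is not None:
                # result() blocks until the previous upload finishes and
                # re-raises any exception raised inside the worker.
                results.append(pending.result())
            pending = pool.submit(upload, ready)
        if pending is not None:
            results.append(pending.result())
    return results


# Hypothetical stand-ins for prepare_df and helpers.uploading.
def fake_prepare(chunk):
    return chunk * 2


def fake_upload(ready):
    return f"uploaded {ready}"


results = run_with_results([1, 2, 3], fake_prepare, fake_upload)
print(results)  # ['uploaded 2', 'uploaded 4', 'uploaded 6']
```

A bare `threading.Thread` discards the target's return value, which is why a `Future` (or a shared `queue.Queue`) is needed to get results back.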

Thanks!

0 answers:

No answers