Question

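I need to speed up a Python script that reads a large CSV file in chunks, does some processing, and then saves the processed rows to a database. Processing 10,000 rows and persisting them each take a comparable amount of time (about 1.5 sec). The times fluctuate, of course: sometimes the processing is faster, sometimes the persisting.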

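Unfortunately, processing the records cannot easily be parallelized because the processing is historical (the records are stock trades, and the calculations depend on previous activity). What can be done, however, is to process one chunk in parallel with persisting the results of the previous chunk. This should roughly halve the total time.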

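import pandas as pd

rows_from_previous_chunk = None
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # the following two tasks should run in parallel
    persist(rows_from_previous_chunk)  # mostly I/O waiting (nothing to save on the first pass)
    rows_to_save = process(chunk)      # this is Python, not C
    # wait for both of the above to finish
    rows_from_previous_chunk = rows_to_save
persist(rows_from_previous_chunk)      # save the results of the final chunk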

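My question is which approach is recommended for doing the above. I can think of a few:

Given that one task is mostly waiting on I/O, there is a chance I can use multithreading without running into GIL contention.
A second option is to use Dask, specifically Delayed. However, given how short each task is (under 2 seconds), I am not sure this is the best approach.
A third option is to have one process read and process the rows, then send them through a bounded queue to a separate process that saves them to the database. multiprocessing.Queue() comes to mind.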

Any advice is appreciated. I am a long-time Java programmer who recently switched to Python and is learning to live with the GIL, hence the question.

Answer 1

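Dask does add overhead, but it is small compared to your typical task time of 2s. To keep the order, you can make each task depend on the previous one. Here is a stab at it: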

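import dask
import dask.dataframe as dd

@dask.delayed
def process_save(rows_from_previous_chunk, chunk):
    # persist the previous result (if any), then process the current chunk
    if rows_from_previous_chunk is not None:  # plain truthiness is ambiguous for a DataFrame
        persist(rows_from_previous_chunk)
    return process(chunk)

# dd.read_csv partitions by bytes (blocksize), not by rows (chunksize);
# blocksize is an assumed variable standing in for the original chunksize
parts = dd.read_csv(filename, blocksize=blocksize).to_delayed()

prev = None
for chunk in parts:
    prev = process_save(prev, chunk)  # each task depends on the previous one
out = dask.delayed(persist)(prev)     # persist the result of the last chunk
dask.compute(out)

out.visualize()  # should look interesting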
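
For comparison, the overlap described in the question can also be sketched with the standard library alone. This is a different technique from the Dask chain above: a single worker thread persists the previous chunk while the main thread processes the next one, and order is kept because each save is awaited before the next one is submitted. The `process` and `persist` functions here are hypothetical stand-ins (a running total and an in-memory list), not the real calculation or database call.

```python
from concurrent.futures import ThreadPoolExecutor


def process(chunk, state):
    # Hypothetical history-dependent processing: a running total across chunks.
    rows = []
    for value in chunk:
        state["total"] += value
        rows.append(state["total"])
    return rows


def persist(rows, saved):
    # Hypothetical stand-in for the database write (mostly I/O in real code).
    saved.extend(rows)


def run_pipeline(chunks):
    saved = []
    state = {"total": 0}
    pending = None  # future for the save that is currently in flight
    with ThreadPoolExecutor(max_workers=1) as pool:
        for chunk in chunks:
            rows = process(chunk, state)  # overlaps with the in-flight save
            if pending is not None:
                pending.result()  # wait for the previous save; re-raises its errors
            pending = pool.submit(persist, rows, saved)
        if pending is not None:
            pending.result()  # wait for the final save
    return saved
```

Whether this actually overlaps in practice depends on how much of the save is real I/O waiting rather than Python-level work that holds the GIL.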

Answer 2

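This may depend on your database, but if one exists for it, the simplest way may be to use an async library such as aiomysql or asyncpg, which lets you run the insert queries in the background.

The I/O-bound part can execute without holding the GIL, so the Python part of your code will be able to continue.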
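
A minimal sketch of this idea with asyncio, under some loud assumptions: `persist` is a hypothetical stand-in that only sleeps (with asyncpg it would be something like an awaited `conn.executemany(...)`), and `process` is a hypothetical running total. One detail worth noting is that CPU-bound processing called directly inside a coroutine blocks the event loop, so the sketch runs it in a worker thread with `asyncio.to_thread` (Python 3.9+) to let the background insert make progress.

```python
import asyncio


async def persist(rows, saved):
    # Hypothetical stand-in for an awaited database insert.
    await asyncio.sleep(0.01)
    saved.extend(rows)


def process(chunk, state):
    # Hypothetical history-dependent processing: a running total across chunks.
    rows = []
    for value in chunk:
        state["total"] += value
        rows.append(state["total"])
    return rows


async def run_pipeline(chunks):
    saved = []
    state = {"total": 0}
    pending = None  # task for the insert that is currently in flight
    for chunk in chunks:
        # Process in a thread so the event loop stays free to drive the insert.
        rows = await asyncio.to_thread(process, chunk, state)
        if pending is not None:
            await pending  # keep inserts in order, one at a time
        pending = asyncio.create_task(persist(rows, saved))
    if pending is not None:
        await pending  # wait for the final insert
    return saved
```

It would be driven with `asyncio.run(run_pipeline(chunks))`, where `chunks` is any iterable of chunks.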

Answer 3

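I ended up with the following approach. Interestingly, multithreading did not work as expected: passing the DataFrame to another queue for saving still blocked the main thread from continuing its work. I am not 100% sure what was going on, but in the interest of time I switched to processes, and that works. The code below is somewhat simplified for clarity; in reality I used multiple db worker processes.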

import multiprocessing

from sqlalchemy import create_engine  # assumption: create_engine is SQLAlchemy's

# this function runs in a separate process, saving each df asynchronously
def save(queue):
    db_engine = create_engine(...)
    while True:
        df = queue.get()
        if df is None:  # sentinel: nothing more to save
            break
        df.to_sql(schema="...", name="...", con=db_engine, if_exists="append", chunksize=1000, index=False)
        queue.task_done()

if __name__ == '__main__':

    queue = multiprocessing.JoinableQueue(maxsize=2)
    worker = multiprocessing.Process(name="db_worker", target=save, args=(queue,))
    worker.daemon = True
    worker.start()

    # inside the main loop to process the df
        queue.put(df_to_save)

    # at the end
    queue.join()     # wait for the last save job to finish before terminating the main process
    queue.put(None)  # sentinel: tell the worker to leave its loop
    worker.join()

Asynchronous persistence in Python
