How can I optimize the workflow with a Pool or Queue for large batch processing?

Asked: 2016-12-15 23:10:01

Tags: python csv multiprocessing python-multiprocessing

I'm trying to execute a function on every row of a CSV file as fast as possible. My code works, but I know it would be faster if I made better use of the multiprocessing library.

import csv
from multiprocessing import Process

processes = []

def execute_task(task_details):
    # work is done here, may take 1 second, may take 10
    # send output to another function
    pass

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row in r:
        p = Process(target=execute_task, args=(row,))
        processes.append(p)
        p.start()

for p in processes:
    p.join()

I think I should put the tasks into a Queue and process them with a Pool, but none of the examples use a Queue the way I assumed, and I can't figure out how to map a Pool over a Queue that keeps growing.

2 Answers:

Answer 0 (score: 0)

I did something similar using Pool workers.

    import csv
    from multiprocessing import Pool, cpu_count

    def initializer(arg1, arg2):
        # Do something to initialize (if necessary)
        pass

    def process_csv_data(data):
        # Do something with the data
        pass

    pool = Pool(cpu_count(), initializer=initializer, initargs=(arg1, arg2))

    with open("csv_data_file.csv", "rb") as f:
        csv_obj = csv.reader(f)
        for row in csv_obj:
            pool.apply_async(process_csv_data, (row,))

    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the outstanding tasks to finish

However, as pvg commented under your question, you probably want to think about how to batch the data: going row by row may not be the right level of granularity.
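For instance, here is a minimal sketch of one way to batch; the process_batch body and the batch size of 100 are placeholders you would tune, not something from the original code:

    import csv
    from itertools import islice
    from multiprocessing import Pool, cpu_count

    def process_batch(rows):
        # Placeholder: handle a whole batch of rows in one task,
        # amortizing the per-task dispatch overhead.
        for row in rows:
            pass  # the real per-row work goes here

    def batches(reader, size=100):
        # Yield lists of up to `size` rows from the csv reader.
        while True:
            batch = list(islice(reader, size))
            if not batch:
                return
            yield batch

    if __name__ == "__main__":
        with open("csv_data_file.csv", newline="") as f, Pool(cpu_count()) as pool:
            reader = csv.reader(f)
            # One task per batch instead of one task per row.
            for _ in pool.imap_unordered(process_batch, batches(reader)):
                pass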

You may also want to profile/test to find the bottleneck. For example, if disk access is the limiting factor, you may not benefit from parallelizing at all.
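As a rough first check (a sketch, not a real profiler run; time_pass and the no-op lambda are illustrative), you could time a pure read pass against a read-plus-work pass over the same file. If the two numbers are close, disk I/O dominates and more workers will not help:

    import csv
    import time

    def time_pass(work):
        # Time one full pass over the file, applying `work` to every row.
        start = time.perf_counter()
        with open("csv_data_file.csv", newline="") as f:
            for row in csv.reader(f):
                work(row)
        return time.perf_counter() - start

    read_only = time_pass(lambda row: None)       # cost of I/O and parsing alone
    read_and_work = time_pass(process_csv_data)   # I/O plus the actual work
    print("read: %.2fs, read+work: %.2fs" % (read_only, read_and_work))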

multiprocessing.Queue is a way of exchanging objects among the processes, so it's not what you would use to get your tasks done.
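To illustrate the distinction, here is a minimal hand-rolled sketch of what using a Queue would actually look like (the worker count of 4 and the None sentinel protocol are assumptions): the Queue only carries rows to the workers, and you still have to start, feed, and join the workers yourself, which is exactly the bookkeeping Pool does for you.

    import csv
    from multiprocessing import Process, Queue

    def worker(q):
        # Pull rows off the queue until the sentinel arrives.
        while True:
            row = q.get()
            if row is None:   # sentinel: no more work
                break
            # process the row here

    if __name__ == "__main__":
        q = Queue()
        workers = [Process(target=worker, args=(q,)) for _ in range(4)]
        for w in workers:
            w.start()
        with open("csv_data_file.csv", newline="") as f:
            for row in csv.reader(f):
                q.put(row)
        for _ in workers:
            q.put(None)       # one sentinel per worker
        for w in workers:
            w.join()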

Answer 1 (score: 0)

To me it looks like what you are actually trying to speed up is

import csv

def check(row):
    # do the checking; result_of_check is a placeholder for your real result
    return (row, result_of_check)

with open('twentyThousandLines.csv', newline='') as file:
    r = csv.reader(file)
    for row, result in map(check, r):
        print(row, result)

which can be done with
#from multiprocessing import Pool # if CPU-bound (but even then not always)
from multiprocessing.dummy import Pool # if IO-bound
import csv


def check(row):
    # do the checking; result_of_check is a placeholder for your real result
    return (row, result_of_check)

if __name__ == "__main__": # in case you are using processes on Windows
    with open('twentyThousandLines.csv', newline='') as file:
        r = csv.reader(file)
        with Pool() as p: # before Python 3.3 you should call close() and join() explicitly
            # chunksize=10 is just a guess - you have to experiment a bit to find the best value
            for row, result in p.imap_unordered(check, r, chunksize=10):
                print(row, result)

Creating processes takes some time (especially on Windows), so in most cases using threads via multiprocessing.dummy is faster (and multiprocessing is not entirely trivial either, see the Guidelines).
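If you want to see that overhead for yourself, here is a small sketch (the pool size of 4 and the trivial abs task are arbitrary choices) that times spinning each kind of pool up and down:

import time
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool

def startup_cost(pool_cls):
    # Measure how long it takes just to create the pool, run a trivial
    # job (forcing the workers to actually start), and tear it down.
    start = time.perf_counter()
    with pool_cls(4) as pool:
        pool.map(abs, range(4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("processes:", startup_cost(ProcessPool))
    print("threads:  ", startup_cost(ThreadPool))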