I'm trying to execute a function on every row of a CSV file as quickly as possible. My code works, but I know it would be faster if I made better use of the multiprocessing library.
import csv
from multiprocessing import Process

processes = []

def execute_task(task_details):
    # work is done here, may take 1 second, may take 10
    # send output to another function
    pass

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row in r:
        p = Process(target=execute_task, args=(row,))
        processes.append(p)
        p.start()

for p in processes:
    p.join()
I think I should put the tasks into a Queue and process them with a Pool, but all the examples make Queue look different from what I assumed, and I can't figure out how to map a Pool over a constantly growing Queue.
Answer 0 (score: 0)
I did something similar using Pool workers.
import csv
from multiprocessing import Pool, cpu_count

def initializer(arg1, arg2):
    # Do something to initialize (if necessary)
    pass

def process_csv_data(data):
    # Do something with the data
    pass

pool = Pool(cpu_count(), initializer=initializer, initargs=(arg1, arg2))
with open("csv_data_file.csv", "rb") as f:
    csv_obj = csv.reader(f)
    for row in csv_obj:
        pool.apply_async(process_csv_data, (row,))
pool.close()  # no more tasks will be submitted
pool.join()   # wait for the outstanding work to finish
However, as pvg commented under your question, you may want to think about batching the data. Going row by row is probably not the right level of granularity.
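As a rough sketch of one way to batch (the chunk size of 100 and the process_csv_batch worker are assumptions, not anything from the original post), you could hand the pool lists of rows instead of single rows:

import csv
from itertools import islice
from multiprocessing import Pool, cpu_count

def process_csv_batch(rows):
    # hypothetical worker that handles a whole batch of rows in one task
    for row in rows:
        pass  # do something with each row

def batches(reader, size=100):
    # yield lists of up to `size` rows from the csv reader
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    pool = Pool(cpu_count())
    with open("csv_data_file.csv") as f:
        for batch in batches(csv.reader(f), size=100):
            pool.apply_async(process_csv_batch, (batch,))
    pool.close()
    pool.join()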
You may also want to profile/test to find out where the bottleneck is. For example, if disk access is what limits you, you won't gain much from parallelizing.
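A crude way to check that, as a sketch (the file name is taken from your question, the timing helper is mine, and it assumes your execute_task function is defined in the same script): time a pass that only reads the file against a serial pass that also does the work.

import csv
import time

def timed(label, fn):
    # minimal timing helper for comparing the two passes
    start = time.perf_counter()
    fn()
    print(label, time.perf_counter() - start, "seconds")

def just_read():
    with open("twentyThousandLines.csv") as f:
        for row in csv.reader(f):
            pass  # I/O and parsing only

def read_and_work():
    with open("twentyThousandLines.csv") as f:
        for row in csv.reader(f):
            execute_task(row)  # your real per-row work from the question

timed("read only: ", just_read)
timed("read + work:", read_and_work)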
multiprocessing.Queue is a way of exchanging objects among the processes, so it is not what you use to get your tasks done.
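For illustration only, a small sketch of what multiprocessing.Queue is for: the parent puts rows on the queue and worker processes pull them off (the worker count of 4 and the None sentinel convention are my assumptions):

import csv
from multiprocessing import Process, Queue

def worker(q):
    # pull rows off the shared queue until the parent sends the None sentinel
    while True:
        row = q.get()
        if row is None:
            break
        # process the row here

if __name__ == "__main__":
    q = Queue()
    procs = [Process(target=worker, args=(q,)) for _ in range(4)]
    for p in procs:
        p.start()
    with open("twentyThousandLines.csv") as f:
        for row in csv.reader(f):
            q.put(row)      # whichever worker is free picks this up
    for _ in procs:
        q.put(None)         # one sentinel per worker so they all stop
    for p in procs:
        p.join()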
Answer 1 (score: 0)
To me it looks like what you actually want to speed up is
import csv

def check(row):
    # do the checking
    return (row, result_of_check)

with open('twentyThousandLines.csv', 'rb') as file:
    r = csv.reader(file)
    for row, result in map(check, r):
        print(row, result)
which can be done with

import csv
#from multiprocessing import Pool  # if CPU-bound (but even then not always)
from multiprocessing.dummy import Pool  # if IO-bound

def check(row):
    # do the checking
    return (row, result_of_check)

if __name__ == "__main__":  # in case you are using processes on windows
    with open('twentyThousandLines.csv', 'rb') as file:
        r = csv.reader(file)
        with Pool() as p:  # before python 3.3 you should do close() and join() explicitly
            for row, result in p.imap_unordered(check, r, chunksize=10):  # chunksize is just a guess - you have to experiment a bit to find the best value
                print(row, result)
Creating processes takes some time (especially on Windows), so in most cases using threads via multiprocessing.dummy is faster (and multiprocessing is not completely trivial either - see the Guidelines).
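If you want to see the difference for your own workload, one possible sketch (the stand-in check function and the file name are assumptions) is to time the same imap_unordered loop once with each pool type:

import csv
import time
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool

def check(row):
    # stand-in for the real check; replace with your actual work
    return (row, len(row))

def timed_run(pool_cls):
    # run the same imap_unordered loop with the given pool type and time it
    start = time.perf_counter()
    with open("twentyThousandLines.csv") as f:
        with pool_cls() as p:
            for _ in p.imap_unordered(check, csv.reader(f), chunksize=10):
                pass
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed_run(ThreadPool), "seconds")
    print("processes:", timed_run(ProcessPool), "seconds")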