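I have a program that loads data and processes it. Both loading and processing take time, and I would like to do them in parallel.
Here is the synchronous version of my program (the "loading" and "processing" run sequentially and, for the sake of the example, are trivial operations):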
import time

def data_loader():
    for i in range(4):
        time.sleep(1)  # Simulated loading time
        yield i

def main():
    start = time.time()
    for data in data_loader():
        time.sleep(1)  # Simulated processing time
        processed_data = -data*2
        print(f'At t={time.time()-start:.3g}, processed data {data} into {processed_data}')

if __name__ == '__main__':
    main()
Running this outputs:
At t=2.01, processed data 0 into 0
At t=4.01, processed data 1 into -2
At t=6.02, processed data 2 into -4
At t=8.02, processed data 3 into -6
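Each loop iteration takes 2 s: 1 s for loading and 1 s for processing.
Now I would like an asynchronous version in which loading and processing happen concurrently (so that the loader prepares the next item while the processor works on the current one). The first statement should then print after 2 s, and each subsequent one 1 s later. The expected output would look like: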
At t=2.01, processed data 0 into 0
At t=3.01, processed data 1 into -2
At t=4.02, processed data 2 into -4
At t=5.02, processed data 3 into -6
Ideally, only the contents of the main function would need to change (the data_loader code should not have to care that it may be used asynchronously).
Answer 0 (score: 3)
The utilities in the multiprocessing module may be what you need.
import time
import multiprocessing

def data_loader():
    for i in range(4):
        time.sleep(1)  # Simulated loading time
        yield i

def process_item(item):
    time.sleep(1)  # Simulated processing time
    return (item, -item*2)  # Return the original too.

def main():
    start = time.time()
    with multiprocessing.Pool() as p:
        data_iterator = data_loader()
        for (data, processed_data) in p.imap(process_item, data_iterator):
            print(f'At t={time.time()-start:.3g}, processed data {data} into {processed_data}')

if __name__ == '__main__':
    main()
This outputs:
At t=2.03, processed data 0 into 0
At t=3.03, processed data 1 into -2
At t=4.04, processed data 2 into -4
At t=5.04, processed data 3 into -6
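Depending on your requirements, you may find .imap_unordered() to be faster. It is also worth knowing that a thread-based version of Pool is available as multiprocessing.dummy.Pool. It can be useful for avoiding IPC overhead if your data is large and your processing is not done in Python (so that the GIL is not a bottleneck).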
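As a rough sketch of the thread-based variant mentioned above, the following swaps multiprocessing.Pool for multiprocessing.dummy.Pool and optionally uses imap_unordered(). The function name run_threaded, the pool size of 4 and the shortened delays are assumptions made for illustration only; they do not come from the original answer.

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool with the same API


def data_loader(n=4, delay=0.1):
    for i in range(n):
        time.sleep(delay)  # Simulated loading time (shortened for the sketch)
        yield i


def process_item(item, delay=0.1):
    time.sleep(delay)  # Simulated processing time (shortened for the sketch)
    return (item, -item * 2)


def run_threaded(n=4, ordered=True):
    """Hypothetical helper: run the pipeline on threads and collect results."""
    with Pool(4) as pool:
        # imap yields results in input order; imap_unordered yields them
        # as soon as each one is ready.
        mapper = pool.imap if ordered else pool.imap_unordered
        return list(mapper(process_item, data_loader(n)))


if __name__ == '__main__':
    print(run_threaded())
```

Because the workers are threads, nothing is pickled or sent between processes, but pure-Python CPU-bound work would still be serialized by the GIL.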
Answer 1 (score: 1)
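The key to your problem lies in the actual processing of the data. I don't know what you do with the data in your real program, but it must be an asynchronous operation for asynchronous programming to help. If you are doing active, blocking, CPU-bound processing, you are probably better off offloading it to a separate process so that you can use multiple CPU cores and run things concurrently. If the actual processing is really just the consumption of some asynchronous service, it can be wrapped very efficiently in a single-threaded asyncio event loop.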
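In your example you use time.sleep() to simulate processing. Since that example operation can be done asynchronously (using asyncio.sleep() instead), the conversion is simple: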
import itertools
import asyncio

async def data_loader():
    for i in itertools.count(0):
        await asyncio.sleep(1)  # Simulated loading time
        yield i

async def process(data):
    await asyncio.sleep(1)  # Simulated processing time
    processed_data = -data*2
    print(f'At t={loop.time()-start:.3g}, processed data {data} into {processed_data}')

async def main():
    tasks = []
    async for data in data_loader():
        tasks.append(loop.create_task(process(data)))
    await asyncio.wait(tasks)  # wait for all remaining tasks

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    start = loop.time()
    loop.run_until_complete(main())
    loop.close()
The result is what you expected:
At t=2, processed data 0 into 0
At t=3, processed data 1 into -2
At t=4, processed data 2 into -4
...
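Keep in mind that this only works because time.sleep() has an asynchronous alternative in the form of asyncio.sleep(). Check the operations you are using to see whether they can be written in asynchronous form.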
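When an operation has no asynchronous counterpart, one possible workaround is to run the blocking call in a worker thread from inside the event loop. The sketch below is not part of the original answer: it assumes Python 3.9+ for asyncio.to_thread, and the names pipeline, process_item and _DONE, as well as the shortened delays, are made up for illustration. It leaves the blocking data_loader generator untouched.

```python
import asyncio
import time

_DONE = object()  # Hypothetical sentinel marking the end of the loader


def data_loader(n=4, delay=0.1):
    for i in range(n):
        time.sleep(delay)  # Blocking, simulated loading time
        yield i


def process_item(item, delay=0.1):
    time.sleep(delay)  # Blocking, simulated processing time
    return -item * 2


async def pipeline(n=4):
    """Overlap blocking loading and blocking processing using threads."""
    loader = data_loader(n)
    tasks = []
    while True:
        # Advance the blocking generator in a worker thread so the event
        # loop stays free to run the processing tasks already started.
        item = await asyncio.to_thread(next, loader, _DONE)
        if item is _DONE:
            break
        tasks.append(asyncio.create_task(asyncio.to_thread(process_item, item)))
    # gather returns the results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


if __name__ == '__main__':
    print(asyncio.run(pipeline()))
```

This only helps when the blocking calls release the GIL (sleeping, I/O, many C extensions); CPU-bound pure-Python work would still call for separate processes.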
Answer 2 (score: 0)
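Here is a solution that lets you wrap the data loader with an iter_asynchronously function. It solves the problem for now. (Note, however, that one issue remains: if the data loader is faster than the processing loop, the queue will grow indefinitely. This could easily be solved by adding a wait in _async_queue_manager when the queue gets too big, but sadly Queue.qsize() is not supported on Mac!)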
import time
from multiprocessing import Queue, Process

class PoisonPill:
    pass

def _async_queue_manager(gen_func, queue: Queue):
    for item in gen_func():
        queue.put(item)
    queue.put(PoisonPill)  # Signal that the generator is exhausted

def iter_asynchronously(gen_func):
    """ Given a generator function, make it asynchronous. """
    q = Queue()
    p = Process(target=_async_queue_manager, args=(gen_func, q))
    p.start()
    while True:
        item = q.get()
        if item is PoisonPill:
            break
        else:
            yield item
    p.join()  # The queue is drained at this point, so joining is safe

def data_loader():
    for i in range(4):
        time.sleep(1)  # Simulated loading time
        yield i

def main():
    start = time.time()
    for data in iter_asynchronously(data_loader):
        time.sleep(1)  # Simulated processing time
        processed_data = -data*2
        print(f'At t={time.time()-start:.3g}, processed data {data} into {processed_data}')

if __name__ == '__main__':
    main()
This now outputs as desired:
At t=2.03, processed data 0 into 0
At t=3.03, processed data 1 into -2
At t=4.04, processed data 2 into -4
At t=5.04, processed data 3 into -6
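The unbounded-queue caveat in this answer can probably be handled without Queue.qsize(): a queue created with a maxsize makes put() block once it is full, which throttles a fast loader automatically. The sketch below illustrates the idea with threading.Thread and queue.Queue rather than the multiprocessing classes used in the answer, so it is a swapped-in variant, not the author's code. The names iter_bounded and _SENTINEL and the default maxsize=2 are assumptions. multiprocessing.Queue accepts the same maxsize argument, so the same change should carry over to the process-based version.

```python
import queue
import threading
import time

_SENTINEL = object()  # Hypothetical end-of-stream marker


def iter_bounded(gen_func, maxsize=2):
    """Run gen_func in a background thread, buffering at most maxsize items."""
    q = queue.Queue(maxsize=maxsize)

    def producer():
        for item in gen_func():
            q.put(item)  # Blocks while the queue already holds maxsize items
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item


def data_loader():
    for i in range(4):
        time.sleep(0.1)  # Simulated loading time (shortened for the sketch)
        yield i


if __name__ == '__main__':
    for data in iter_bounded(data_loader):
        time.sleep(0.1)  # Simulated processing time
        print(f'processed data {data} into {-data * 2}')
```

A small maxsize keeps memory use bounded while still letting the loader work ahead of the consumer by a few items.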