Purpose of the question: to learn more about ways of implementing concurrency in Python / to experiment.

Context: I want to count all the words in all the files that match a given pattern. The idea is that I can call `count_words('/foo/bar/*.txt')` and all the words (i.e. strings separated by one or more whitespace characters) will be counted.

In the implementation, I am looking for ways to implement `count_words` using concurrency. So far I have managed to use `multiprocessing` and `asyncio`.

Do you see any other ways of performing the same task?

I am not using `threading` because I noticed that the performance improvement was not that impressive due to the limitations of the Python GIL.
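For reference, the thread-based variant I experimented with looked roughly like the following sketch (using `concurrent.futures`; the name `threaded_count_words` is just what I called it here):

```python
import concurrent.futures
from pathlib import Path


def count_words(file):
    with open(file) as f:
        return sum(len(line.split()) for line in f)


def threaded_count_words(path, glob_pattern, workers=8):
    # Threads share one interpreter, so the GIL serializes the
    # CPU-bound counting; only the file I/O overlaps.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_words, path.glob(glob_pattern)))
```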
```python
import asyncio
import multiprocessing
import time
from pathlib import Path
from pprint import pprint


def count_words(file):
    with open(file) as f:
        return sum(len(line.split()) for line in f)


async def count_words_for_file(file):
    with open(file) as f:
        return sum(len(line.split()) for line in f)


def async_count_words(path, glob_pattern):
    event_loop = asyncio.get_event_loop()
    try:
        print("Entering event loop")
        for file in list(path.glob(glob_pattern)):
            result = event_loop.run_until_complete(count_words_for_file(file))
            print(result)
    finally:
        event_loop.close()


def multiprocess_count_words(path, glob_pattern):
    with multiprocessing.Pool(processes=8) as pool:
        results = pool.map(count_words, list(path.glob(glob_pattern)))
    pprint(results)


def sequential_count_words(path, glob_pattern):
    for file in list(path.glob(glob_pattern)):
        print(count_words(file))


if __name__ == '__main__':
    benchmark = []
    path = Path("../data/gutenberg/")

    # no need for benchmark on sequential_count_words, it is very slow!
    # sequential_count_words(path, "*.txt")

    start = time.time()
    async_count_words(path, "*.txt")
    benchmark.append(("async version", time.time() - start))

    start = time.time()
    multiprocess_count_words(path, "*.txt")
    benchmark.append(("multiprocess version", time.time() - start))

    print(*benchmark)
```
To simulate a large number of files, I downloaded a few books from Project Gutenberg (http://gutenberg.org/) and created multiple copies of the same file with the following command:

```sh
for i in {000..99}; do cp 56943-0.txt $(openssl rand -base64 12)-$i.txt; done
```
Answer 0 (score: 0)
`async def` does not magically make function calls concurrent. In asyncio you need to explicitly give up execution with `await` to allow other coroutines to run concurrently while one is waiting. That is to say, your current `count_words_for_file` is still executed sequentially.

You may want to introduce aiofiles, which defers the blocking file I/O into threads, allowing concurrent file I/O across different coroutines. Even then, the CPU-bound code that counts the words still runs sequentially in the same main thread. To parallelize that, you still need multiple processes and multiple CPUs (or multiple machines; check out Celery).
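One way to get real parallelism for the counting itself while staying inside asyncio is to push the CPU-bound work into a process pool via `loop.run_in_executor`. A minimal sketch (the name `count_all` is mine, not from the question):

```python
import asyncio
import concurrent.futures
from pathlib import Path


def count_words(file):
    with open(file) as f:
        return sum(len(line.split()) for line in f)


async def count_all(path, glob_pattern):
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        # Each file is counted in a separate process, so the CPU-bound
        # splitting runs in parallel across cores while the event loop
        # merely awaits the futures.
        tasks = [loop.run_in_executor(pool, count_words, f)
                 for f in path.glob(glob_pattern)]
        return await asyncio.gather(*tasks)
```

It can then be driven with `asyncio.run(count_all(path, "*.txt"))`.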
Besides that, there is a problem in your asyncio code: the `for ... run_until_complete` loop again makes the calls run sequentially. You need `loop.create_task()` to start them all concurrently, and `asyncio.wait()` to join the results.
```python
import aiofiles

...

async def count_words_for_file(file):
    async with aiofiles.open(file) as f:
        # note: sum() cannot consume a plain "async for" generator
        # expression, so materialize a list comprehension first
        rv = sum([len(line.split()) async for line in f])
    print(rv)
    return rv


async def async_count_words(path, glob_pattern):
    await asyncio.wait([count_words_for_file(file)
                        for file in list(path.glob(glob_pattern))])
    # asyncio.wait() calls loop.create_task() for you for each coroutine

...

if __name__ == '__main__':
    ...
    loop = asyncio.get_event_loop()

    start = time.time()
    loop.run_until_complete(async_count_words(path, "*.txt"))
    benchmark.append(("async version", time.time() - start))
```