I have a simple class that leverages an async generator to retrieve a list of URLs:
import aiohttp
import asyncio
import logging
import sys

LOOP = asyncio.get_event_loop()
N_SEMAPHORE = 3

FORMAT = '[%(asctime)s] - %(message)s'
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)


class ASYNC_GENERATOR(object):
    def __init__(self, n_semaphore=N_SEMAPHORE, loop=LOOP):
        self.loop = loop
        self.semaphore = asyncio.Semaphore(n_semaphore)
        self.session = aiohttp.ClientSession(loop=self.loop)

    async def _get_url(self, url):
        """
        Sends an http GET request to an API endpoint
        """
        async with self.semaphore:
            async with self.session.get(url) as response:
                logger.info(f'Request URL: {url} [{response.status}]')
                read_response = await response.read()
                return {
                    'read': read_response,
                    'status': response.status,
                }

    def get_routes(self, urls):
        """
        Wrapper around _get_url (multiple urls asynchronously)
        This returns an async generator
        """
        # Asynchronous http GET requests
        coros = [self._get_url(url) for url in urls]
        futures = asyncio.as_completed(coros)
        for future in futures:
            yield self.loop.run_until_complete(future)

    def close(self):
        self.session._connector.close()
When I execute this main portion of the code:
if __name__ == '__main__':
    ag = ASYNC_GENERATOR()
    urls = [f'https://httpbin.org/get?x={i}' for i in range(10)]
    responses = ag.get_routes(urls)
    for response in responses:
        response = next(ag.get_routes(['https://httpbin.org/get']))
    ag.close()
The logs print out:
[2018-05-15 12:59:49,228] - Request URL: https://httpbin.org/get?x=3 [200]
[2018-05-15 12:59:49,235] - Request URL: https://httpbin.org/get?x=2 [200]
[2018-05-15 12:59:49,242] - Request URL: https://httpbin.org/get?x=6 [200]
[2018-05-15 12:59:49,285] - Request URL: https://httpbin.org/get?x=5 [200]
[2018-05-15 12:59:49,290] - Request URL: https://httpbin.org/get?x=0 [200]
[2018-05-15 12:59:49,295] - Request URL: https://httpbin.org/get?x=7 [200]
[2018-05-15 12:59:49,335] - Request URL: https://httpbin.org/get?x=8 [200]
[2018-05-15 12:59:49,340] - Request URL: https://httpbin.org/get?x=4 [200]
[2018-05-15 12:59:49,347] - Request URL: https://httpbin.org/get?x=1 [200]
[2018-05-15 12:59:49,388] - Request URL: https://httpbin.org/get?x=9 [200]
[2018-05-15 12:59:49,394] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,444] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,503] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,553] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,603] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,650] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,700] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,825] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,875] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,922] - Request URL: https://httpbin.org/get [200]
Since responses is an async generator, I expected it to yield one response from the async generator (which should only send the request when actually yielded), send a separate request to the endpoint without the x parameter, and then yield the next response from the async generator. It should flip back and forth between a request with the x parameter and a request without it. Instead, it yields all of the responses from the async generator with the x parameter, followed by all of the https requests without a parameter.
Something similar happens when I do:
ag = ASYNC_GENERATOR()
urls = [f'https://httpbin.org/get?x={i}' for i in range(10)]
responses = ag.get_routes(urls)
next(responses)
response = next(ag.get_routes(['https://httpbin.org/get']))
ag.close()
The logs print:

[2018-05-15 13:08:38,643] - Request URL: https://httpbin.org/get?x=8 [200]
[2018-05-15 13:08:38,656] - Request URL: https://httpbin.org/get?x=1 [200]
[2018-05-15 13:08:38,681] - Request URL: https://httpbin.org/get?x=3 [200]
[2018-05-15 13:08:38,695] - Request URL: https://httpbin.org/get?x=4 [200]
[2018-05-15 13:08:38,717] - Request URL: https://httpbin.org/get?x=6 [200]
[2018-05-15 13:08:38,741] - Request URL: https://httpbin.org/get?x=2 [200]
[2018-05-15 13:08:38,750] - Request URL: https://httpbin.org/get?x=0 [200]
[2018-05-15 13:08:38,773] - Request URL: https://httpbin.org/get?x=9 [200]
[2018-05-15 13:08:38,792] - Request URL: https://httpbin.org/get?x=7 [200]
[2018-05-15 13:08:38,803] - Request URL: https://httpbin.org/get?x=5 [200]
[2018-05-15 13:08:38,826] - Request URL: https://httpbin.org/get [200]

Instead, what I want is:

[2018-05-15 13:08:38,643] - Request URL: https://httpbin.org/get?x=8 [200]
[2018-05-15 13:08:38,826] - Request URL: https://httpbin.org/get [200]
There are times when I want to retrieve all of the responses before doing anything else. However, there are also times when I want to interject and make intermediate requests before yielding the next item from the generator (i.e., the generator yields pages of paginated search results, and I want to process further links from each page before moving on to the next page).

What do I need to change to achieve the desired result?
Answer (score: 3)
Setting aside the technical question of whether responses is an async generator (it's not, as Python uses the term), your problem lies with as_completed. as_completed starts a batch of coroutines in parallel and provides the means to obtain their results as they complete. That the futures run in parallel is not at all obvious from the documentation, but it makes sense if you consider that the original concurrent.futures.as_completed works on thread-based futures, which have no choice but to run in parallel. Conceptually, the same is true of asyncio futures.
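To see this in isolation, here is a minimal sketch (not from the original post; asyncio.sleep stands in for the network calls, and the names are illustrative) showing that as_completed schedules everything up front and hands back results in completion order, not submission order:

import asyncio

async def job(name, delay):
    await asyncio.sleep(delay)  # stand-in for a network request
    return name

async def demo():
    coros = [job('slow', 0.3), job('medium', 0.2), job('fast', 0.1)]
    for future in asyncio.as_completed(coros):
        # prints fast, medium, slow - the order of completion
        print(await future)

asyncio.get_event_loop().run_until_complete(demo())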
Your code obtains only the first (fastest-arriving) result and then starts doing something else, also using asyncio. The remaining coroutines passed to as_completed are not frozen merely because no one is collecting their results - they are doing their jobs in the background and, once done, are ready to be awaited (in your case by the code inside as_completed, which you access with loop.run_until_complete()). I would venture a guess that the URL without parameters takes longer to retrieve than the URL with just the x parameter, which is why it gets printed after all the other coroutines.
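The background completion can be sketched with the same run_until_complete pattern the question uses (again with asyncio.sleep as a stand-in, under the 2018-era asyncio API): collecting only the first result does not stop the others.

import asyncio

async def job(name, delay):
    await asyncio.sleep(delay)
    print(f'{name} done')  # fires even if nobody collects the result
    return name

loop = asyncio.get_event_loop()
futures = asyncio.as_completed([job('slow', 0.3), job('fast', 0.1)])
first = loop.run_until_complete(next(futures))  # collects only 'fast'
# 'slow' was already scheduled by as_completed; it resumes whenever the loop runs again:
loop.run_until_complete(asyncio.sleep(0.5))     # 'slow done' is printed during this sleep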
In other words, those log lines are printed because asyncio is doing its job and delivering the parallel execution you asked for! If you don't want parallel execution, then don't ask for it - execute the requests serially:
def get_routes(self, urls):
    for url in urls:
        yield self.loop.run_until_complete(self._get_url(url))
But this is a poor way to use asyncio - its main loop is not reentrant, so to ensure composability you almost certainly want the loop to be spun only once, at the top level. That is usually done with a construct like loop.run_until_complete(main()) or loop.run_forever(). As Martijn pointed out, you can get the desired behavior, while retaining the nice generator API, by making get_routes an actual async generator:
async def get_routes(self, urls):
    for url in urls:
        result = await self._get_url(url)
        yield result
Now you can have a main() coroutine that looks like this:
async def main():
    ag = ASYNC_GENERATOR()
    urls = [f'https://httpbin.org/get?x={i}' for i in range(10)]
    responses = ag.get_routes(urls)
    async for response in responses:
        # simulate `next` with async iteration
        async for other_response in ag.get_routes(['https://httpbin.org/get']):
            break
    ag.close()

LOOP.run_until_complete(main())
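Two side notes, offered as a sketch rather than as the answer's method: on Python 3.6+ the async-for-plus-break idiom can be written directly with the async generator's __anext__ method, and with aiohttp 3.x the supported cleanup is awaiting session.close() rather than poking the private _connector (on Python 3.7+, asyncio.run(main()) would also replace the explicit loop management):

async def main():
    ag = ASYNC_GENERATOR()
    responses = ag.get_routes([f'https://httpbin.org/get?x={i}' for i in range(10)])
    async for response in responses:
        # direct async equivalent of next(...): fetch one item from a fresh generator
        other = await ag.get_routes(['https://httpbin.org/get']).__anext__()
    # aiohttp 3.x public API, instead of session._connector.close():
    await ag.session.close()

LOOP.run_until_complete(main())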