Effects produced by an async generator in Python asyncio

Date: 2018-05-15 17:16:59

Tags: python python-3.x python-asyncio aiohttp

I have a simple class that leverages an async generator to retrieve a list of URLs:

import aiohttp
import asyncio
import logging
import sys

LOOP = asyncio.get_event_loop()
N_SEMAPHORE = 3

FORMAT = '[%(asctime)s] - %(message)s'
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

class ASYNC_GENERATOR(object):
    def __init__(self, n_semaphore=N_SEMAPHORE, loop=LOOP):
        self.loop = loop
        self.semaphore = asyncio.Semaphore(n_semaphore)
        self.session = aiohttp.ClientSession(loop=self.loop)

    async def _get_url(self, url):
        """
        Sends an http GET request to an API endpoint
        """

        async with self.semaphore:
            async with self.session.get(url) as response:
                logger.info(f'Request URL: {url} [{response.status}]')
                read_response = await response.read()

                return {
                    'read': read_response,
                    'status': response.status,
                }

    def get_routes(self, urls):
        """
        Wrapper around _get_url (multiple urls asynchronously)

        This returns an async generator
        """

        # Asynchronous http GET requests
        coros = [self._get_url(url) for url in urls]
        futures = asyncio.as_completed(coros)
        for future in futures:
            yield self.loop.run_until_complete(future)

    def close(self):
        self.session._connector.close()

When I execute this main section of the code:

if __name__ == '__main__':
    ag = ASYNC_GENERATOR()
    urls = [f'https://httpbin.org/get?x={i}' for i in range(10)]
    responses = ag.get_routes(urls)
    for response in responses:
        response = next(ag.get_routes(['https://httpbin.org/get']))
    ag.close()

The log prints out:

[2018-05-15 12:59:49,228] - Request URL: https://httpbin.org/get?x=3 [200]
[2018-05-15 12:59:49,235] - Request URL: https://httpbin.org/get?x=2 [200]
[2018-05-15 12:59:49,242] - Request URL: https://httpbin.org/get?x=6 [200]
[2018-05-15 12:59:49,285] - Request URL: https://httpbin.org/get?x=5 [200]
[2018-05-15 12:59:49,290] - Request URL: https://httpbin.org/get?x=0 [200]
[2018-05-15 12:59:49,295] - Request URL: https://httpbin.org/get?x=7 [200]
[2018-05-15 12:59:49,335] - Request URL: https://httpbin.org/get?x=8 [200]
[2018-05-15 12:59:49,340] - Request URL: https://httpbin.org/get?x=4 [200]
[2018-05-15 12:59:49,347] - Request URL: https://httpbin.org/get?x=1 [200]
[2018-05-15 12:59:49,388] - Request URL: https://httpbin.org/get?x=9 [200]
[2018-05-15 12:59:49,394] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,444] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,503] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,553] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,603] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,650] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,700] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,825] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,875] - Request URL: https://httpbin.org/get [200]
[2018-05-15 12:59:49,922] - Request URL: https://httpbin.org/get [200]

Since responses is an async generator, I expected it to yield one response from the async generator (which should only send the request at the moment it is actually yielded), send a separate request to the endpoint without the x parameter, and then yield the next response from the async generator. It should bounce back and forth between a request with the x parameter and one without. Instead, it yields all of the responses from the async generator with the x parameter, followed by all of the requests without the parameter.

Something similar happens when I do:

ag = ASYNC_GENERATOR()
urls = [f'https://httpbin.org/get?x={i}' for i in range(10)]
responses = ag.get_routes(urls)
next(responses)
response = next(ag.get_routes(['https://httpbin.org/get']))
ag.close()

The log prints:

[2018-05-15 13:08:38,643] - Request URL: https://httpbin.org/get?x=8 [200]
[2018-05-15 13:08:38,656] - Request URL: https://httpbin.org/get?x=1 [200]
[2018-05-15 13:08:38,681] - Request URL: https://httpbin.org/get?x=3 [200]
[2018-05-15 13:08:38,695] - Request URL: https://httpbin.org/get?x=4 [200]
[2018-05-15 13:08:38,717] - Request URL: https://httpbin.org/get?x=6 [200]
[2018-05-15 13:08:38,741] - Request URL: https://httpbin.org/get?x=2 [200]
[2018-05-15 13:08:38,750] - Request URL: https://httpbin.org/get?x=0 [200]
[2018-05-15 13:08:38,773] - Request URL: https://httpbin.org/get?x=9 [200]
[2018-05-15 13:08:38,792] - Request URL: https://httpbin.org/get?x=7 [200]
[2018-05-15 13:08:38,803] - Request URL: https://httpbin.org/get?x=5 [200]
[2018-05-15 13:08:38,826] - Request URL: https://httpbin.org/get [200]

Instead, what I want is for only the single yielded request with the x parameter to be sent, followed by the one request without it.

There are times when I want to retrieve all of the responses before doing anything else. However, there are also times when I want to interject and make intermediate requests before yielding the next item from the generator (i.e., the generator yields results from paginated search results, and I want to follow further links from each page before moving on to the next page).

What do I need to change to achieve the desired result?

1 Answer:

Answer 0 (score: 3)

Leaving aside the technical question of whether responses is an async generator (it isn't, as Python uses the term), your problem lies in as_completed. as_completed starts a batch of coroutines in parallel and provides the means to obtain their results as they complete. That the futures run in parallel is not entirely obvious from the documentation, but it makes sense if you consider that the original concurrent.futures.as_completed works on thread-based futures, which have no choice but to run in parallel. Conceptually, the same is true of asyncio futures.
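This parallel start can be seen without any HTTP at all. A minimal sketch (not from the original post; the task names and delays are made up for illustration):

```python
import asyncio

async def job(name, delay):
    # Stand-in for an HTTP request; sleeping simulates network latency.
    await asyncio.sleep(delay)
    return name

async def main():
    # as_completed schedules every coroutine up front; results are
    # then handed back in completion order, not submission order.
    coros = [job('slow', 0.2), job('fast', 0.05), job('medium', 0.1)]
    return [await fut for fut in asyncio.as_completed(coros)]

loop = asyncio.new_event_loop()
print(loop.run_until_complete(main()))  # ['fast', 'medium', 'slow']
```

The whole batch finishes in roughly 0.2 s (the slowest job), not the 0.35 s a serial run would take, which is exactly the parallelism described above.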

Your code obtains only the first (fastest-arriving) result and then starts doing something else, also using asyncio. The remaining coroutines passed to as_completed are not frozen just because no one is collecting their results - they are doing their jobs in the background and, once done, are ready to be awaited (in your case by the code inside as_completed, which you access with loop.run_until_complete()). I would venture a guess that the URL without parameters takes longer to retrieve than the URL with just the x parameter, which is why it is printed after all the other coroutines.

In other words, those log lines are being printed because asyncio is doing its job and providing the parallel execution you requested! If you don't want parallel execution, then don't ask for it - execute the requests serially:

def get_routes(self, urls):
    for url in urls:
        # Run each request to completion before yielding its result.
        yield self.loop.run_until_complete(self._get_url(url))

But this is a poor way to use asyncio - its main loop is non-reentrant, so to ensure composability, you almost certainly want the loop to be spun only once, at the top level. This is typically done with a construct like loop.run_until_complete(main()) or loop.run_forever(). As Martijn pointed out, you could achieve that, while retaining the nice generator API, by making get_routes an actual async generator:

async def get_routes(self, urls):
    for url in urls:
        result = await self._get_url(url)
        yield result

Now you can have a main() coroutine that looks like this:

async def main():
    ag = ASYNC_GENERATOR()
    urls = [f'https://httpbin.org/get?x={i}' for i in range(10)]
    responses = ag.get_routes(urls)
    async for response in responses:
        # simulate `next` with async iteration
        async for other_response in ag.get_routes(['https://httpbin.org/get']):
            break
    ag.close()

LOOP.run_until_complete(main())
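As a side note, the inner async for/break in main() is only there to simulate next(). An async generator also exposes __anext__(), which advances it one step directly. A minimal self-contained sketch (not from the original answer; the numbers() generator is a made-up stand-in for get_routes):

```python
import asyncio

async def numbers():
    # Minimal async generator standing in for get_routes().
    for i in range(3):
        await asyncio.sleep(0)
        yield i

async def main():
    gen = numbers()
    # __anext__() advances the async generator one step, playing
    # the role that next() plays for ordinary generators.
    first = await gen.__anext__()
    second = await gen.__anext__()
    return first, second

loop = asyncio.new_event_loop()
print(loop.run_until_complete(main()))  # (0, 1)
```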