aiohttp.TCPConnector (with limit argument) vs asyncio.Semaphore for limiting the number of concurrent connections

Date: 2017-08-18 13:03:48

Tags: python async-await python-3.5 python-asyncio aiohttp

I thought I would learn the new Python async/await syntax, and more specifically the asyncio module, by writing a simple script that lets you download multiple resources in one go.

But now I am stuck.

While researching, I came across two options for limiting the number of concurrent requests:

  1. Pass an aiohttp.TCPConnector (with a limit argument) to the aiohttp.ClientSession, or
  2. use an asyncio.Semaphore.

Is there a preferred option, or can they be used interchangeably if all you want is to limit the number of concurrent connections? Are they (roughly) equal in terms of performance?

Both also seem to have a default value of 100 concurrent connections/operations. If I only use the Semaphore as a limit, will aiohttp internally and silently lock me to 100 concurrent connections?

This is all very new to me and unclear. Please feel free to point out any misunderstandings on my part or flaws in my code.

Here is my current code, containing both options (which one should I remove?):

Bonus questions:

  1. How do I handle (preferably retry x times) coros that raised an error?
  2. What is the best way to save the returned data (notify my DataHandler) as soon as a coro finishes? I don't want everything saved only at the end, because I could start processing the results as soon as they come in.

      import asyncio
      from tqdm import tqdm
      import uvloop
      from aiohttp import ClientSession, TCPConnector, BasicAuth
      
      # You can ignore this class
      class DummyDataHandler(DataHandler):
          """Takes data and stores it somewhere"""
      
          def __init__(self, *args, **kwargs):
              super().__init__(*args, **kwargs)
      
          def take(self, origin_url, data):
              return True
      
          def done(self):
              return None
      
      class AsyncDownloader(object):
          def __init__(self, concurrent_connections=100, silent=False, data_handler=None, loop_policy=None):
      
              self.concurrent_connections = concurrent_connections
              self.silent = silent
      
              self.data_handler = data_handler or DummyDataHandler()
      
              self.sending_bar = None
              self.receiving_bar = None
      
              asyncio.set_event_loop_policy(loop_policy or uvloop.EventLoopPolicy())
              self.loop = asyncio.get_event_loop()
              self.semaphore = asyncio.Semaphore(concurrent_connections)
      
          async def fetch(self, session, url):
              # This is option 1: The semaphore, limiting the number of concurrent coros,
              # thereby limiting the number of concurrent requests.
              async with self.semaphore:
                  async with session.get(url) as response:
                      # Bonus Question 1: What is the best way to retry a request that failed?
                      resp_task = asyncio.ensure_future(response.read())
                      self.sending_bar.update(1)
                      resp = await resp_task
      
                      await response.release()
                      if not self.silent:
                          self.receiving_bar.update(1)
                      return resp
      
          async def batch_download(self, urls, auth=None):
              # This is option 2: Limiting the number of open connections directly via the TCPConnector
              conn = TCPConnector(limit=self.concurrent_connections, keepalive_timeout=60)
              async with ClientSession(connector=conn, auth=auth) as session:
                  await asyncio.gather(*[asyncio.ensure_future(self.download_and_save(session, url)) for url in urls])
      
          async def download_and_save(self, session, url):
              content_task = asyncio.ensure_future(self.fetch(session, url))
              content = await content_task
              # Bonus Question 2: This is blocking, I know. Should this be wrapped in another coro
              # or should I use something like asyncio.as_completed in the download function?
              self.data_handler.take(origin_url=url, data=content)
      
          def download(self, urls, auth=None):
              if isinstance(auth, tuple):
                  auth = BasicAuth(*auth)
              print('Running on concurrency level {}'.format(self.concurrent_connections))
              self.sending_bar = tqdm(urls, total=len(urls), desc='Sent    ', unit='requests')
              self.sending_bar.update(0)
      
              self.receiving_bar = tqdm(urls, total=len(urls), desc='Received', unit='requests')
              self.receiving_bar.update(0)
      
              tasks = self.batch_download(urls, auth)
              self.loop.run_until_complete(tasks)
              return self.data_handler.done()
      
      
      ### call like so ###
      
      URL_PATTERN = 'https://www.example.com/{}.html'
      
      def gen_url(lower=0, upper=None):
          for i in range(lower, upper):
              yield URL_PATTERN.format(i)   
      
      ad = AsyncDownloader(concurrent_connections=30)
      data = ad.download([g for g in gen_url(upper=1000)])
      

2 Answers:

Answer 0 (Score: 1)

Is there a preferred option?

Yes, see below:

Will aiohttp internally silently lock me to 100 concurrent connections?

Yes, unless you specify another limit, the default value of 100 applies. You can see it in the source here: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/connector.py#L1084
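
For illustration, a minimal sketch of the difference (the value 30 is just an example, not taken from the answer):

    from aiohttp import TCPConnector

    conn_default = TCPConnector()          # no limit given: at most 100 concurrent connections
    conn_limited = TCPConnector(limit=30)  # explicit cap of 30 concurrent connections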

Are they (roughly) equal in terms of performance?

No (though any performance difference should be negligible): aiohttp.TCPConnector checks for an available connection anyway, regardless of whether it is wrapped in a Semaphore, so using a Semaphore here is just unnecessary overhead.
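
To make that concrete, here is a minimal sketch (my own, not from the answer) that relies on the connector limit alone, reusing the example URL pattern from the question:

    import asyncio
    from aiohttp import ClientSession, TCPConnector

    async def fetch(session, url):
        async with session.get(url) as response:
            return await response.read()

    async def main(urls):
        # The connector itself caps the number of concurrent connections,
        # so no Semaphore is needed.
        conn = TCPConnector(limit=30)
        async with ClientSession(connector=conn) as session:
            return await asyncio.gather(*(fetch(session, url) for url in urls))

    urls = ['https://www.example.com/{}.html'.format(i) for i in range(100)]
    results = asyncio.get_event_loop().run_until_complete(main(urls))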

How do I handle (preferably retry x times) coros that raised an error?

I don't think there is a standard way to do this, but one solution is to wrap your call in a method like this:

    async def retry_requests(...):
        for i in range(5):
            try:
                return await session.get(...)
            except aiohttp.ClientResponseError:
                pass
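
A slightly fuller sketch along the same lines (the function name, retry count, and delay are my own choices, not from the answer), which re-raises once the attempts are exhausted:

    import asyncio
    import aiohttp

    async def fetch_with_retry(session, url, tries=3, delay=1):
        # Retry a GET request up to `tries` times, sleeping `delay` seconds
        # between attempts; re-raise the last error if all attempts fail.
        for attempt in range(tries):
            try:
                async with session.get(url) as response:
                    response.raise_for_status()
                    return await response.read()
            except aiohttp.ClientError:
                if attempt == tries - 1:
                    raise
                await asyncio.sleep(delay)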

Answer 1 (Score: 0)

How do I handle (preferably retry x times) coros that raised an error?

I created a Python decorator to handle that:

    from functools import wraps
    import logging
    import time

    def retry(exceptions, tries=3, delay=2, backoff=2):
        """
        Retry calling the decorated function using an exponential backoff. This
        is required in case requesting the Braze API produces any exceptions.

        Args:
            exceptions: The exception to check. may be a tuple of
                exceptions to check.
            tries: Number of times to try (not retry) before giving up.
            delay: Initial delay between retries in seconds.
            backoff: Backoff multiplier (e.g. value of 2 will double the delay
                each retry).
        """

        def deco_retry(func):
            @wraps(func)
            def f_retry(*args, **kwargs):
                mtries, mdelay = tries, delay
                while mtries > 1:
                    try:
                        return func(*args, **kwargs)
                    except exceptions as e:
                        msg = '{}, Retrying in {} seconds...'.format(e, mdelay)
                        if logging:
                            logging.warning(msg)
                        else:
                            print(msg)
                        time.sleep(mdelay)
                        mtries -= 1
                        mdelay *= backoff
                return func(*args, **kwargs)

            return f_retry

        return deco_retry
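
Since the decorator as written is synchronous (it uses time.sleep), here is a minimal usage sketch with a blocking helper; fetch_page and the URL are hypothetical examples, not part of the answer:

    from urllib.request import urlopen
    from urllib.error import URLError

    @retry(exceptions=(URLError,), tries=3, delay=2, backoff=2)
    def fetch_page(url):
        # Hypothetical blocking helper: a URLError raised here triggers the
        # exponential-backoff retries defined by the decorator above.
        with urlopen(url, timeout=10) as response:
            return response.read()

    data = fetch_page('https://www.example.com/0.html')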