asyncio web scraping 101: fetching multiple urls with aiohttp

Time: 2016-03-10 20:45:13

Tags: python python-3.x web-scraping python-asyncio aiohttp

In a previous question, one of the aiohttp authors suggested a way to fetch multiple urls with aiohttp using the new async with syntax from Python 3.5:

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.wait([loop.create_task(fetch(session, url))
                                  for url in urls])
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
        # do something with the_results

However, when one of the session.get(url) requests breaks (as above, because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com), the error is not handled and the whole thing breaks.

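I looked for ways to insert tests on the result of session.get(url), such as finding a place for a try ... except ... or for an if response.status != 200:, but I just don't understand how to work with async with, await and the various objects.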

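Since async with is still very new, there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all, one of the first things most people want to test with asyncio is fetching multiple resources concurrently.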

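Goal

The goal is that we can inspect the_results and quickly see either:

  • this url failed (and why: status code, maybe the exception name), or
  • this url worked, and here is a useful response object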

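2 answers:

Answer 0: (score: 13)

I would use gather instead of wait. With return_exceptions=True, gather returns exceptions as objects instead of raising them. You can then check each result to see whether it is an instance of some exception.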

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is false, that would raise
    )

    # for testing purposes only
    # gather returns results in the order of coros
    for idx, url in enumerate(urls):
        print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))

Test:

$ python test.py
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK
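
To get closer to the goal stated in the question, the gathered results can be paired back up with their URLs. The following is only a minimal sketch: fake_fetch is a hypothetical stand-in that avoids the network, and summarize is a hypothetical helper. Neither is part of aiohttp.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(session, url): no network is used.
    # URLs containing "bad" simulate a connection failure.
    await asyncio.sleep(0)
    if "bad" in url:
        raise ConnectionError("cannot connect to " + url)
    return "<html>" + url + "</html>"


def summarize(urls, results):
    # gather() keeps the order of its arguments, so zip() pairs each
    # URL with its own result or exception object.
    report = {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            report[url] = ("ERR", type(result).__name__)
        else:
            report[url] = ("OK", result)
    return report


async def main(urls):
    results = await asyncio.gather(
        *(fake_fetch(url) for url in urls),
        return_exceptions=True,  # exceptions come back as values
    )
    return summarize(urls, results)


urls = ["http://bad.invalid", "http://good.invalid"]
report = asyncio.run(main(urls))
for url, (state, detail) in report.items():
    print(url, state, detail)
```

Keeping the exception's class name next to the URL gives the "why" for failed URLs without re-raising anything.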

Answer 1: (score: 4)

I am far from an asyncio expert, but you want to catch the error, which here means catching the socket error:

import socket

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                print(response.status == 200)
                return await response.text()
        except socket.error as e:
            # the failed request returns None instead of raising
            print(e.strerror)

Running the code and printing the_results:

Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
True
True
({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())

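You can see that we caught the error, and that the remaining calls still succeed and return the html.

We should really catch OSError, because socket.error has been a deprecated alias of OSError since Python 3.3: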
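
The tuple printed above is the (done, pending) pair that asyncio.wait returns, and each element of done is a Task rather than a plain result. As a rough sketch of how those tasks could be inspected, assuming a hypothetical fake_fetch stand-in in place of the real network call:

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(session, url): no network is used.
    # Hosts ending in ".invalid" simulate a connection failure.
    await asyncio.sleep(0)
    if url.endswith(".invalid"):
        raise OSError("cannot connect to " + url)
    return "<html>" + url + "</html>"


async def fetch_all(urls):
    # wait() returns unordered sets, so remember which task belongs
    # to which URL.
    tasks = {asyncio.create_task(fake_fetch(url)): url for url in urls}
    done, pending = await asyncio.wait(list(tasks))
    outcome = {}
    for task in done:
        error = task.exception()  # None if the task finished normally
        outcome[tasks[task]] = error if error is not None else task.result()
    return outcome


outcome = asyncio.run(fetch_all(["http://example.invalid",
                                 "http://example.com"]))
for url, value in outcome.items():
    print(url, repr(value))
```

Calling task.exception() before task.result() avoids re-raising the stored exception while looping over the finished tasks.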

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                return await response.text()
        except OSError as e:
            print(e)
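
The alias can be checked directly in the interpreter. The short sketch below also illustrates one reason print(e) is the safer choice here: strerror is only populated when the exception was built from an errno and a message, so it can be None. The example messages are made up for illustration.

```python
import errno
import socket

# On Python 3.3+ the two names refer to the same class.
alias_is_oserror = socket.error is OSError

# Built from a single message: errno and strerror stay None.
message_only = OSError("Name or service not known")

# Built from (errno, message): strerror carries the message.
with_errno = OSError(errno.ECONNREFUSED, "Connection refused")

strerror_values = (message_only.strerror, with_errno.strerror)
print(alias_is_oserror, strerror_values)
```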

If you also want to check that the response status is 200, put your if inside the try; you can then use the reason attribute to get more information:

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                if response.status != 200:
                    print(response.reason)
                return await response.text()
        except OSError as e:
            print(e.strerror)
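
Printing inside fetch makes the failures visible, but the caller still only receives the body or None. One possible variation is to return a small record per URL instead. This is only a sketch: FetchResult, FakeResponse and FakeSession are hypothetical names, and the fake classes mimic just the status, reason, text() and async with behaviour used above so that the example runs without aiohttp or a network.

```python
import asyncio
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
FetchResult = namedtuple("FetchResult", "url ok detail")


class FakeResponse:
    """Stand-in exposing only status, reason and text()."""

    def __init__(self, status, reason, body):
        self.status, self.reason, self._body = status, reason, body

    async def text(self):
        return self._body

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


class FakeSession:
    """Stand-in session whose get() is scripted per URL."""

    def __init__(self, scripted):
        self._scripted = scripted

    def get(self, url):
        outcome = self._scripted[url]
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


async def fetch(session, url):
    try:
        async with session.get(url) as response:
            if response.status != 200:
                detail = "{} {}".format(response.status, response.reason)
                return FetchResult(url, False, detail)
            return FetchResult(url, True, await response.text())
    except OSError as e:
        return FetchResult(url, False, type(e).__name__)


async def main():
    scripted = {
        "http://down.invalid": OSError("Name or service not known"),
        "http://missing.invalid": FakeResponse(404, "Not Found", ""),
        "http://up.invalid": FakeResponse(200, "OK", "<html></html>"),
    }
    session = FakeSession(scripted)
    return await asyncio.gather(*(fetch(session, url) for url in scripted))


results = asyncio.run(main())
for result in results:
    print(result)
```

Because fetch never raises for the handled cases, every entry in results has the same shape, and the caller can filter on the ok field.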