Question

I want to use asyncio to fetch web pages.

However, when I run the code below, no pages are retrieved.

The code is:

import aiofiles
import aiohttp
from aiohttp import ClientSession
import asyncio

async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        html = await res.text(encoding='GB18030')
        return 0, html
    except:
        return 1, []

async def main_get_webpage(urls):
    webpage = []
    connector = aiohttp.TCPConnector(limit=60)
    async with ClientSession(connector=connector) as session:
        tasks = [get_webpage(url, session) for url in urls]
        result = await asyncio.gather(*tasks)
        for status, data in result:
            print(status)
            if status == 0:
                webpage.append(data)
        return webpage

if __name__ == '__main__':
    urls = ['https://lcdsj.fang.com/house/3120178164/fangjia.htm', 'https://mingliugaoerfuzhuangyuan0551.fang.com/house/2128242324/fangjia.htm']
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop = asyncio.get_event_loop()
    webpage = loop.run_until_complete(main_get_webpage(urls))

I expected main_get_webpage(urls) to print two zeros.

However, it prints two ones.

What is wrong with my code?

How can I fix the problem?

Thank you very much.

Answer 1

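What is wrong with my code?

The problem is that your try: ... except: hides the source of the problem. If you remove the except clause, you will see an error message that reveals the underlying issue: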

UnicodeDecodeError: 'gb18030' codec can't decode byte 0xb7 in position 47676: illegal multibyte sequence
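If you would rather keep the handler, it can be narrowed so that it reports what it caught. This is a minimal sketch that assumes the same session.request and res.text calls as in the question. Returning the exception object instead of [] is a choice made here for illustration and is not part of the original code.

```python
async def get_webpage(url, session):
    """Fetch url and decode it, reporting the real exception on failure."""
    try:
        res = await session.request(method="GET", url=url)
        html = await res.text(encoding="GB18030")
        return 0, html
    except Exception as exc:
        # Unlike a bare except, this does not swallow KeyboardInterrupt,
        # and it shows why the request or the decoding failed.
        print(f"{url}: {exc!r}")
        return 1, exc
```

Run against the URLs from the question, this should report the decoding error for each page rather than only a status of 1.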

The page is not encoded as GB18030. It declares itself as GB2312 (the predecessor of GB18030), but decoding it with that encoding fails as well.
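One way to check several candidate encodings against the same bytes is sketched below. The helper name try_encodings and the sample bytes are made up for illustration; the sample is not taken from the actual page.

```python
def try_encodings(raw, candidates=("gb18030", "gbk", "gb2312")):
    """Map each candidate encoding to None (decoded fine) or the error raised."""
    outcome = {}
    for name in candidates:
        try:
            raw.decode(name)
            outcome[name] = None
        except UnicodeDecodeError as exc:
            outcome[name] = exc
    return outcome


# Hypothetical sample: valid GB2312 text followed by 0xb7 and a space,
# which is not a legal multibyte sequence in any of the three codecs.
sample = "中文".encode("gb2312") + b"\xb7 "
for name, err in try_encodings(sample).items():
    print(name, "ok" if err is None else f"fails at byte {err.start}")
```

With a real page, raw would be the bytes of the response body, and the list of candidates could be extended with any other codec Python supports.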

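How can I fix the problem?

Depending on what you want to do with the text of the page, you have several options:

Find an encoding supported by Python that works for the given page. This would be ideal, but I could not find one with a brief search. (Checking which encoding Chrome thinks the page uses did not help either: it reports GBK, which also produces an error at character 47676.)
Decode the page with a more lenient error handler, for example res.text(encoding='GB18030', errors='replace'). This gives a good approximation of the text, with undecodable bytes shown as the Unicode replacement character (U+FFFD). It is a good choice if you need to search the page for substrings or analyze it as text and do not care about the odd stray character.
Give up on decoding the page as text and just use await res.read() to get the bytes. This is the best option if you need to archive, cache, or index the page.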
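The second and third options can be combined: read the body once as bytes, then decode it leniently when text is needed. This is a minimal sketch, assuming the session object from the question. decode_leniently is a hypothetical helper name, and returning both the bytes and the text is a choice made here for illustration.

```python
def decode_leniently(raw, encoding="gb18030"):
    """Decode raw bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


async def get_webpage(url, session):
    """Return (raw_bytes, text) for url without failing on bad bytes."""
    res = await session.request(method="GET", url=url)
    raw = await res.read()  # the untouched bytes, suitable for archiving
    return raw, decode_leniently(raw)  # an approximate text for searching
```

Keeping the raw bytes means a better decoding can still be applied later if the right encoding turns up.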

Cannot fetch web pages with aiohttp ClientSession
