My script runs into an error when it is supposed to run asynchronously

Time: 2018-12-12 16:06:03

Tags: python python-3.x web-scraping python-asyncio aiohttp

I have written a Python script that uses the asyncio and aiohttp libraries together to asynchronously parse the names out of the pop-up boxes that open when the contact-info button of the different agencies listed in a table on this website is clicked. The site spreads that tabular content across 513 pages.

When I used asyncio.get_event_loop() I ran into the error too many file descriptors in select(), but in this thread I found the suggestion to use asyncio.ProactorEventLoop() to avoid it, so I switched to the latter. Even with that change, though, the script only collects names from a few pages before raising the following error. How can I fix this?

raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]
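
The switch suggested in that thread boils down to this (just a minimal sketch of the loop change; my full attempt follows):

import asyncio

# On Windows the proactor loop is not bound by select()'s
# file-descriptor limit, which is what the linked thread relies on.
loop = asyncio.ProactorEventLoop()
asyncio.set_event_loop(loop)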

This is my attempt so far:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def get_links(url):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await process_docs(text)
            return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
                sauce = BeautifulSoup(text,"lxml")
                try:
                    name = sauce.select_one("p > b").text
                except Exception: name = ""
                print(name)

if __name__ == '__main__':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))

In short, the process_docs() function collects the data-id numbers from each page so they can be plugged into the AID parameter of the https://www.tursab.org.tr/en/displayAcenta?AID={} link, from which the names in the pop-up boxes are then collected. One such id is 8757, so one qualified link is https://www.tursab.org.tr/en/displayAcenta?AID=8757.
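
To make that concrete, here is a minimal stand-alone sketch of that step (the HTML fragment is a made-up stand-in for one row of the real table):

from bs4 import BeautifulSoup

lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

# Made-up fragment standing in for one row of the #acentaTbl table.
html = '<table id="acentaTbl"><tr data-id="8757"><td>Some agency</td></tr></table>'

soup = BeautifulSoup(html, "lxml")
ids = [row.get("data-id") for row in soup.select("#acentaTbl tr[data-id]")]
print([lead_link.format(i) for i in ids])
# -> ['https://www.tursab.org.tr/en/displayAcenta?AID=8757']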

Incidentally, if I lower the highest page number used in the links variable to around 20 or 30, the script runs through smoothly.

1 Answer:

Answer 0 (score: 3):

async def get_links(url):
    async with asyncio.Semaphore(10):

You cannot do it like this: it means a new semaphore instance is created on every call of the function, whereas you need a single semaphore instance shared by all requests. Change the code this way (a small self-contained demonstration of the difference follows the snippet):

sem = asyncio.Semaphore(10)  # module level

async def get_links(url):
    async with sem:
        # ...


async def fetch_again(link):
    async with sem:
        # ...
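
To see why the placement matters, here is a tiny self-contained demonstration (the counters and the sleep are only illustrative): with a semaphore created inside each call, the intended limit of 2 is never enforced and all 20 tasks end up running at once; move the semaphore to module level and the peak drops to 2.

import asyncio

running = 0   # tasks currently holding a "slot"
peak = 0      # highest concurrency observed

async def per_call_worker():
    # A fresh Semaphore(2) per call: every task acquires its own
    # instance, so nothing is actually shared or limited.
    global running, peak
    async with asyncio.Semaphore(2):
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.1)
        running -= 1

async def main():
    await asyncio.gather(*(per_call_worker() for _ in range(20)))
    print("peak concurrency:", peak)  # prints 20, not 2

asyncio.run(main())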

Once the semaphore is used properly, you can also return to the default loop:

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(...)

Finally, you should change get_links(url) and fetch_again(link) so that the parsing happens outside of the semaphore. That way the semaphore is released as soon as possible, before it is needed again inside process_docs(text).

Final code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

sem = asyncio.Semaphore(10)

async def get_links(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
    result = await process_docs(text)
    return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
    sauce = BeautifulSoup(text,"lxml")
    try:
        name = sauce.select_one("p > b").text
    except Exception:
        name = "o"
    print(name)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))