Question

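I've written a script in Python using the aiohttp library together with asyncio to parse the content of a website asynchronously. I've tried to apply the logic in the following script the way it is usually applied in scrapy.

However, when I execute the script, it behaves like synchronous libraries such as requests or urllib.request do. As a result, it is very slow and defeats the purpose.

I know I can get around this by defining all the next-page links up front in the link variable. But am I not already doing the task the right way with my existing script?

In the script, the processing_docs() function collects the links to the individual posts and passes the cleaned-up links to the fetch_again() function, which fetches the title from each target page. processing_docs() also contains logic that collects the next_page link and hands it to the fetch() function to repeat the same process. This next_page call is what makes the script slow, whereas we usually do the same thing in scrapy and get the expected performance.

My question is: how can I get the same speedup while keeping the existing logic intact?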

import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link,title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        await fetch_again(session,title)

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link,next_page[0].attrib['href'])
        await fetch(page_link)

async def fetch_again(session,url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()

Answer 1

The whole point of using asyncio is that you can run multiple fetches concurrently (overlapping with each other). Let's look at your code:

for title in titles:
    await fetch_again(session,title)

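This part means that each new fetch_again is started only after the previous one has been awaited (finished). If you do things this way then yes, there is no difference from using a synchronous approach.

To get the full benefit of asyncio, start multiple fetches concurrently using asyncio.gather: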

await asyncio.gather(*[
    fetch_again(session,title)
    for title 
    in titles
])

You'll see a significant speedup.

You can go further and start the fetch for the next page concurrently with the fetch_again calls for the titles:
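The difference can be observed without touching the network. The following is a minimal sketch, assuming asyncio.sleep as a stand-in for a request; fake_fetch, run_sequential, run_concurrent and timed are hypothetical names used only for this illustration.

```python
import asyncio
import time


async def fake_fetch(i, delay=0.1):
    # Stand-in for a network request (assumption): asyncio.sleep hands
    # control back to the event loop while "waiting" for the response.
    await asyncio.sleep(delay)
    return i


async def run_sequential(n):
    # Each call is awaited to completion before the next one starts.
    return [await fake_fetch(i) for i in range(n)]


async def run_concurrent(n):
    # All coroutines are scheduled together; gather returns the results
    # in the same order as its arguments.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


def timed(coro_fn, n):
    # Run one of the two variants and measure the wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return list(result), time.perf_counter() - start


if __name__ == "__main__":
    _, sequential = timed(run_sequential, 5)
    _, concurrent = timed(run_concurrent, 5)
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

With five simulated requests, the sequential variant should take roughly the sum of the delays, while the concurrent one should take roughly a single delay.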

async def processing_docs(session, html):
        coros = []

        tree = fromstring(html)

        # titles:
        titles = [
            urljoin(link,title.attrib['href']) 
            for title 
            in tree.cssselect(".summary .question-hyperlink")
        ]

        for title in titles:
            coros.append(
                fetch_again(session,title)
            )

        # next_page:
        next_page = tree.cssselect("div.pager a[rel='next']")
        if next_page:
            page_link = urljoin(link,next_page[0].attrib['href'])

            coros.append(
                fetch(page_link)
            )

        # await:
        await asyncio.gather(*coros)
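The same pattern can be tried on a simulated site to see how the pages and posts overlap. This is only a sketch under stated assumptions: PAGES is a hypothetical in-memory site, and fetch_post and fetch_page are illustrative stand-ins for fetch_again and fetch/processing_docs, with asyncio.sleep replacing the real requests.

```python
import asyncio
import time

# Hypothetical in-memory "site": page number -> (posts on the page, next page)
PAGES = {
    1: (["a", "b"], 2),
    2: (["c", "d"], 3),
    3: (["e"], None),
}

seen = []


async def fetch_post(post):
    await asyncio.sleep(0.05)  # simulated request latency (assumption)
    seen.append(post)


async def fetch_page(page):
    await asyncio.sleep(0.05)  # simulated request latency (assumption)
    posts, next_page = PAGES[page]
    coros = [fetch_post(post) for post in posts]
    if next_page is not None:
        # The next page is crawled alongside this page's posts
        # instead of after them.
        coros.append(fetch_page(next_page))
    await asyncio.gather(*coros)


def crawl(start_page=1):
    # Run the whole crawl and report what was collected and how long it took.
    seen.clear()
    start = time.perf_counter()
    asyncio.run(fetch_page(start_page))
    return sorted(seen), time.perf_counter() - start


if __name__ == "__main__":
    posts, elapsed = crawl()
    print(posts, f"{elapsed:.2f}s")
```

There are eight simulated requests of 0.05 s each here, so a strictly sequential crawl would need about 0.4 s; because each page's posts overlap with the next page, the total should be closer to the length of the page chain plus one post.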

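Important note

While this approach lets you do things much faster, you may want to limit the number of requests in flight at any one time, to avoid heavy resource usage on both your machine and the server.

You can use asyncio.Semaphore for this purpose: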

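semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        # Hold the semaphore only while the request itself is in flight.
        # Holding it across processing_docs() would keep one slot busy for
        # every nested page and deadlock once the chain of pages is
        # deeper than the limit.
        async with semaphore:
            async with session.get(url) as response:
                text = await response.text()
        result = await processing_docs(session, text)
        return result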
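To check that a semaphore really caps concurrency, the peak number of simultaneously running tasks can be recorded. The following is a minimal sketch, assuming asyncio.sleep as a simulated request; limited_task, run_limited and the state dictionary are hypothetical names introduced only for this illustration.

```python
import asyncio


async def limited_task(semaphore, state):
    async with semaphore:
        # Everything inside this block counts against the limit.
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated request (assumption)
        state["active"] -= 1


async def run_limited(n_tasks, limit):
    # The semaphore is created inside the running event loop.
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(
        *(limited_task(semaphore, state) for _ in range(n_tasks))
    )
    return state["peak"]


if __name__ == "__main__":
    print("peak concurrency:", asyncio.run(run_limited(50, 10)))
```

Note that only the work done while the semaphore is held is limited, so in the real script any request that should count toward the cap needs to be issued inside an `async with semaphore:` block.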

Script performs very slowly even when it runs asynchronously
