Question

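Situation: I am trying to send an HTTP request to every domain listed in a specific file that I have already downloaded, and to get the destination URL I am redirected to.

Problem: I followed a tutorial, but I get far fewer responses than expected. I get around 100 responses per second, whereas the tutorial lists 100,000 responses per minute. The script also gets slower and slower after a few seconds, until I only get 1 response every 5 seconds.

Already tried: At first I thought the problem was that I was running it on a Windows server. After trying the script on my own computer, I found it was only slightly faster, not much more. On another Linux server it behaved the same as on my computer (Unix, macOS).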

import asyncio
import glob
import os
import re

from aiohttp import ClientSession

work_dir = os.path.dirname(__file__)

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception:
        pass

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)


async def run():
    os.chdir(work_dir)
    for file in glob.glob("cdx-*"):
        print("Opening: " + file)
        opened_file = file
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(40000)
        with open(work_dir + '/' + file) as infile:
            seen = set()
            async with ClientSession() as session:
                for line in infile:
                    regex = re.compile(r'://(.*?)/')
                    domain = regex.search(line).group(1)
                    domain = domain.lower()

                    if domain not in seen:
                        seen.add(domain)

                        task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                        tasks.append(task)

                    del line
                responses = asyncio.gather(*tasks)
                await responses
            infile.close()
            del seen
            del file


loop = asyncio.get_event_loop()

future = asyncio.ensure_future(run())
loop.run_until_complete(future)

I really don't know how to solve this, especially since I'm very new to Python... but I have to get it working somehow :(

Answer 1

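It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.

To change this, define run along these lines: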

async def run():
    os.chdir(work_dir)
    async with ClientSession() as session:
        sem = asyncio.Semaphore(40000)
        seen = set()
        pending_tasks = set()
        for f in glob.glob("cdx-*"):
            print("Opening: " + f)
            with open(f) as infile:
                lines = list(infile)
            for line in lines:
                match = re.search(r'://(.*?)/', line)
                if match is None:
                    # skip lines that do not contain a URL
                    continue
                domain = match.group(1).lower()
                if domain in seen:
                    continue
                seen.add(domain)
                task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                pending_tasks.add(task)
                # ensure that each task removes itself from the pending set
                # when done, so that the set doesn't grow without bounds
                task.add_done_callback(pending_tasks.remove)
        # await the remaining tasks (asyncio.wait raises ValueError on an empty set)
        if pending_tasks:
            await asyncio.wait(pending_tasks)
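
The done-callback bookkeeping used in run above can be tried in isolation. The following is only a minimal sketch: fake_fetch is a hypothetical stand-in for a real request, and discard is used instead of remove purely as a precaution, since it does not raise if the task is already gone.

```python
import asyncio


async def fake_fetch(i):
    # Hypothetical stand-in for a real HTTP request.
    await asyncio.sleep(0.01 * (i % 3))
    return i


async def main():
    pending_tasks = set()
    for i in range(10):
        task = asyncio.ensure_future(fake_fetch(i))
        pending_tasks.add(task)
        # Each task drops itself from the set as soon as it finishes.
        task.add_done_callback(pending_tasks.discard)
    # asyncio.wait raises ValueError when given an empty set.
    if pending_tasks:
        await asyncio.wait(pending_tasks)
    # Number of tasks still tracked once everything has finished.
    return len(pending_tasks)


remaining = asyncio.run(main())
print(remaining)
```

Because every task removes itself when done, the set only ever holds tasks that are still in flight, which keeps memory use proportional to the outstanding work rather than to the total number of domains.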

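Another important thing: silencing all exceptions in fetch() is bad practice, because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while: fetch is raising exceptions and you never see them. Instead of pass, use something like print(f'failed to get {url}: {e}'), where e is the object you get from except Exception as e.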
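
As a rough sketch of that suggestion, fetch could look something like the following. The session argument is assumed to be an aiohttp ClientSession, as in the question; the extra message for non-200 statuses is my own addition, not something the original code did.

```python
import asyncio


async def fetch(url, session):
    """Fetch url and report failures instead of hiding them."""
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                return await response.read()
            # Assumption: non-200 responses are worth reporting too.
            print(f'unexpected status {response.status} for {url}')
    except Exception as e:
        # Timeouts, DNS errors and plain typos now show up in the output.
        print(f'failed to get {url}: {e!r}')
    return None
```

Using {e!r} rather than {e} is a small design choice: some exceptions have an empty message, and the repr at least shows the exception type.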

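Several other remarks:

There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened with a with statement. with is designed specifically to do the closing automatically for you.
The code added domains to the seen set, but also processed a domain that had already been seen. This version skips domains for which it has already spawned a task.
You can create a single ClientSession and use it for the entire run.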
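
To illustrate the point about skipping already-seen domains, the extraction step can be pulled into a small helper that is easy to check on its own. This is only a sketch: unique_domains and the sample lines are made up for illustration, and the real cdx-* files may well look different.

```python
import re

# Compiled once, rather than on every line as in the question.
DOMAIN_RE = re.compile(r'://(.*?)/')


def unique_domains(lines):
    """Yield each lower-cased domain the first time it appears."""
    seen = set()
    for line in lines:
        match = DOMAIN_RE.search(line)
        if match is None:
            # Line contains no URL; skip it.
            continue
        domain = match.group(1).lower()
        if domain in seen:
            continue
        seen.add(domain)
        yield domain


# Hypothetical input lines, not taken from a real cdx file.
sample = [
    "http://Example.com/a\n",
    "https://example.com/b\n",
    "no url here\n",
    "http://example.org/\n",
]
print(list(unique_domains(sample)))
```

Inside run, the loop body would then reduce to creating one task per domain yielded by the helper.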

Python aiohttp (with asyncio) sends requests very slowly
