Question

I am downloading some information from web pages whose URLs have the following form:

http://example.com?p=10
http://example.com?p=20
...

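The catch is that I don't know how many pages there are. At some point I will get an error message from the server, or I may decide to stop because I already have enough data. I want to run the requests in parallel.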

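import requests

def generator_query(step=10):
    # Yield page URLs indefinitely; the consumer decides when to stop.
    i = 0
    while True:
        yield "http://example.com?p=%d" % i
        i += step

def task(url):
    t = requests.get(url).text
    if not t:  # empty body: we are past the last page
        return None
    return t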

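I could implement this with queues and a producer/consumer pattern, but I wonder whether there is a higher-level way to do it, for example with the concurrent.futures module.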

Non-concurrent example:

results = []
for url in generator_query():
    result = task(url)
    if result is None:  # stop after the last page
        break
    results.append(result)

Answer 1

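You can use ThreadPoolExecutor from concurrent.futures. A usage example is given here.
You need to break out of that example's for loop when you get an invalid answer from the server (the except branch), or when you feel you have enough data (for example, by counting the valid replies in the else branch).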
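
A minimal sketch of how that could look with the open-ended generator from the question, assuming an empty response marks the end. The names fetch_until_empty and fake_task, and the cutoff at p=50, are hypothetical; fake_task only stands in for the real requests.get(url).text call so that the sketch runs without a network.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def generator_query(step=10):
    # Endless URL generator, as in the question.
    i = 0
    while True:
        yield "http://example.com?p=%d" % i
        i += step

def fetch_until_empty(task, urls, workers=4):
    """Run task over urls in batches of `workers`; stop at the first empty result."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        while True:
            # Pull only a bounded batch, so the endless generator is never exhausted.
            batch = list(itertools.islice(urls, workers))
            if not batch:
                return results
            # executor.map yields results in submission order.
            for text in executor.map(task, batch):
                if not text:  # past the last page: stop here
                    return results
                results.append(text)

def fake_task(url):
    # Hypothetical stand-in for requests.get(url).text:
    # pages up to p=50 have content, later pages are empty.
    page = int(url.rsplit("=", 1)[1])
    return "page %d" % page if page <= 50 else ""

pages = fetch_until_empty(fake_task, generator_query())
```

At most one batch of requests beyond the last page is wasted. An exception raised inside task surfaces while iterating over executor.map, so the inner loop can be wrapped in try/except if server errors should also end the run.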

Answer 2

You can use aiohttp for this:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def coro(step):
    url = 'https://example.com?p={}'.format(step)
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)

async def main():
    # asyncio.wait() no longer accepts bare coroutines (Python 3.11+), so use gather().
    await asyncio.gather(*(coro(i * 10) for i in range(10)))

if __name__ == '__main__':
    asyncio.run(main())

As for page errors, I don't know the site you are working with, so you may have to figure that out yourself. Maybe try...except?
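
A rough sketch of the try...except idea, using only asyncio so that it runs without a network. PageError, fake_fetch and safe_fetch are hypothetical names; with aiohttp you would presumably catch aiohttp.ClientError (or check response.status) instead of the made-up PageError.

```python
import asyncio

class PageError(Exception):
    """Hypothetical error standing in for a failed request."""

async def fake_fetch(page):
    # Stand-in for the aiohttp request: pages above 30 fail.
    await asyncio.sleep(0)
    if page > 30:
        raise PageError("no page %d" % page)
    return "page %d" % page

async def safe_fetch(page):
    # Turn a failed request into None instead of aborting the whole batch.
    try:
        return await fake_fetch(page)
    except PageError:
        return None

async def main():
    # gather() returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(safe_fetch(i * 10) for i in range(6)))

results = asyncio.run(main())
```

The caller can then drop the None entries or treat the first one as the end of the data.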

Note: if your Python version is newer than 3.5, this may raise an SSL certificate verification error.

Using concurrent.futures with generator input
