Web scraping with concurrent.futures

Date: 2018-07-31 12:29:19

Tags: python web-scraping concurrent.futures

To whoever is reading this: thank you for taking the time to look at this question.

I am currently trying to build a fast web scraper so that I can scrape a large number of pages.

Here is the code I currently have:

import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ProcessPoolExecutor, as_completed

def parse(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    return soup.find_all('a')

# URLs is a list of page URLs, defined elsewhere
with ProcessPoolExecutor(max_workers=4) as executor:
    start = time.time()
    futures = [executor.submit(parse, url) for url in URLs]
    results = []
    for result in as_completed(futures):
        results.append(result)
    end = time.time()
    print("Time Taken: {:.6f}s".format(end - start))

This brings back results for a site (e.g. www.google.com), but my problem is that I don't know how to see the data it brings back: all I get is Future objects.
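The distinction is easy to reproduce without any network access. The following sketch uses a hypothetical `work` function in place of `parse`: `as_completed` yields `Future` objects, and the actual return value only comes out when `.result()` is called on each one.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(n):
    # dummy stand-in for a scraping task
    return n * 2

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(work, n) for n in range(3)]
    results = [f for f in as_completed(futures)]  # these are Future objects, not data

# every item in results is a Future; .result() unwraps the return value
values = sorted(f.result() for f in results)
print(values)  # [0, 2, 4]
```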

Could someone please explain, or show me, how to do this?

Thanks for any help.

1 answer:

Answer 0 (score: 1)

You can also do it with a dict comprehension, as shown below.

with ProcessPoolExecutor(max_workers=4) as executor:
    start = time.time()
    # map each Future back to the URL it was created from
    futures = {executor.submit(parse, url): url for url in URLs}
    for future in as_completed(futures):
        link = futures[future]
        try:
            data = future.result()
        except Exception as e:
            print(e)
        else:
            print("Link: {}, data: {}".format(link, data))
    end = time.time()
    print("Time Taken: {:.6f}s".format(end - start))
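As a side note, if you only need the results in the same order as the input list, `Executor.map` skips the Future-handling step entirely and yields return values directly. A minimal sketch with a hypothetical `parse` stand-in (no network; for I/O-bound work like HTTP requests, a thread pool is usually sufficient):

```python
from concurrent.futures import ThreadPoolExecutor

def parse(url):
    # hypothetical stand-in for the real requests + BeautifulSoup parse
    return "links-from:" + url

URLs = ["http://example.com/a", "http://example.com/b"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map returns results (not Futures), in the same order as URLs
    results = list(executor.map(parse, URLs))

print(results)
```

The trade-off is that `map` yields results in input order, whereas `as_completed` yields them as soon as each one finishes.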