Multiprocessing a beautifulsoup4 function for better performance

Date: 2020-03-12 08:56:45

Tags: python beautifulsoup multiprocessing

As is: I built a function that takes a URL as an argument, scrapes the page, and puts the parsed information into a list. Alongside this I have a list of URLs, and I am mapping the parse function over the list of URLs, iterating over each one. The problem is that I have roughly 7,000-8,000 links, so parsing them iteratively takes a long time. This is the current iterative solution:

import itertools as it

mapped_parse_links = map(parse, my_new_list)
all_parsed = list(it.chain.from_iterable(mapped_parse_links))

"parse" is the scraping function and "my_new_list" is the list of URLs.

To be: I want to implement multiprocessing so that, instead of iterating over the list of URLs, multiple CPUs pick up links at the same time and parse the information with the parse function. I tried the following:

import itertools as it
import multiprocessing
with multiprocessing.Pool() as p:
    mapped_parse_links = p.map(parse, my_new_list)
    all_parsed = list(it.chain.from_iterable(mapped_parse_links))

I have also tried other solutions using the Pool function, but all of them run forever. Can someone point me toward how to solve this? Thanks.
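(A general note, not from the original post: one possible reason a multiprocessing.Pool never seems to finish is creating the pool at module level without an `if __name__ == '__main__':` guard; under the spawn start method the workers re-import the module, which can re-trigger the pool creation. A minimal guarded sketch, with placeholder definitions standing in for the real `parse` and `my_new_list`:)

import itertools as it
import multiprocessing

def parse(url):
    # Placeholder for the actual beautifulsoup4 scraping function,
    # which returns a list of parsed items per URL
    return [url]

my_new_list = ['http://example.com/']  # placeholder for the real URL list

if __name__ == '__main__':
    # Guarding the pool keeps spawned worker processes from
    # re-executing this block when they re-import the module
    with multiprocessing.Pool() as p:
        mapped_parse_links = p.map(parse, my_new_list)
        all_parsed = list(it.chain.from_iterable(mapped_parse_links))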

1 answer:

Answer 0: (score: 0)

Taken from the docs for concurrent.futures with a few minor changes:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

if __name__ == '__main__':
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                # Do something with the scraped data here
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))

You will have to substitute your parse function for load_url.
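For instance, a minimal sketch of that substitution, assuming `parse` takes a single URL and returns a list of parsed items as in the question, and using executor.map so the result handling stays close to the original code (the placeholder definitions stand in for the question's real function and URL list):

import concurrent.futures
import itertools as it

def parse(url):
    # Placeholder for the actual beautifulsoup4 scraping function
    return [url]

my_new_list = ['http://example.com/']  # placeholder for the real URL list

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # executor.map preserves input order, like the built-in map
        mapped_parse_links = executor.map(parse, my_new_list)
        all_parsed = list(it.chain.from_iterable(mapped_parse_links))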