Multiprocessing a beautifulsoup4 function for better performance

Date: 2020-03-12 08:56:45

Tags: python beautifulsoup multiprocessing

As is: I built a function that takes a URL as an argument, scrapes the page, and puts the parsed information into a list. Alongside this I have a list of URLs, and I am mapping the parse function over the list of URLs, iterating over each one. The problem is that I have roughly 7,000-8,000 links, so parsing them iteratively takes a long time. This is the current iterative solution:

import itertools as it

mapped_parse_links = map(parse, my_new_list)
all_parsed = list(it.chain.from_iterable(mapped_parse_links))

"parse" is the scraping function and "my_new_list" is the list of URLs.

To be: I want to implement multiprocessing so that, instead of iterating over the list of URLs, multiple CPUs pick up links at the same time and parse the information with the parse function. I tried the following:

import itertools as it
import multiprocessing
with multiprocessing.Pool() as p:
    mapped_parse_links = p.map(parse, my_new_list)
    all_parsed = list(it.chain.from_iterable(mapped_parse_links))

I have also tried other solutions using the Pool function, but all of them run forever. Can someone point me toward how to solve this? Thanks.
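(A general note, not from the original post: one possible reason a multiprocessing.Pool never seems to finish is creating the pool at module level without an `if __name__ == '__main__':` guard; under the spawn start method the workers re-import the module, which can re-trigger the pool creation. A minimal guarded sketch, with placeholder definitions standing in for the real `parse` and `my_new_list`:)

import itertools as it
import multiprocessing

def parse(url):
    # Placeholder for the actual beautifulsoup4 scraping function,
    # which returns a list of parsed items per URL
    return [url]

my_new_list = ['http://example.com/']  # placeholder for the real URL list

if __name__ == '__main__':
    # Guarding the pool keeps spawned worker processes from
    # re-executing this block when they re-import the module
    with multiprocessing.Pool() as p:
        mapped_parse_links = p.map(parse, my_new_list)
        all_parsed = list(it.chain.from_iterable(mapped_parse_links))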

1 answer:

Answer 0: (score: 0)

Taken from the docs for concurrent.futures with a few minor changes:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

if __name__ == '__main__':
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                # Do something with the scraped data here
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))

You will have to substitute your parse function for load_url.
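For instance, a minimal sketch of that substitution, assuming `parse` takes a single URL and returns a list of parsed items as in the question, and using executor.map so the result handling stays close to the original code (the placeholder definitions stand in for the question's real function and URL list):

import concurrent.futures
import itertools as it

def parse(url):
    # Placeholder for the actual beautifulsoup4 scraping function
    return [url]

my_new_list = ['http://example.com/']  # placeholder for the real URL list

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # executor.map preserves input order, like the built-in map
        mapped_parse_links = executor.map(parse, my_new_list)
        all_parsed = list(it.chain.from_iterable(mapped_parse_links))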