I built a function that takes a URL as an argument, scrapes the page, and puts the parsed information into a list. Alongside this I have a list of URLs, and I map the URL list onto the parser function, iterating over each URL. The problem is that I have roughly 7000-8000 links, so parsing them iteratively takes a very long time. This is the current iterative solution:
import itertools as it

mapped_parse_links = map(parse, my_new_list)
all_parsed = list(it.chain.from_iterable(mapped_parse_links))
Here parse is the scraping function and my_new_list is the list of URLs.
I want to implement multiprocessing so that, instead of iterating through the URL list one link at a time, multiple CPUs pick up links simultaneously and parse the information with the parse function. I tried the following:
import multiprocessing

with multiprocessing.Pool() as p:
    mapped_parse_links = p.map(parse, my_new_list)
    all_parsed = list(it.chain.from_iterable(mapped_parse_links))
I have also tried other solutions using Pool, but they all run forever. Can someone point me toward how to solve this? Thanks.
Answer 0 (score: 0)
Taken, with minor changes, from the docs for concurrent.futures:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

if __name__ == '__main__':
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                # Do something with the scraped data here
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
You will have to substitute your parse function for load_url.
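For the asker's setup specifically, a minimal sketch of that substitution might look like the one below. It keeps the original map-then-chain shape by using executor.map instead of the built-in map. The parse body and the URL list are stand-ins for the ones in the question, and max_workers=20 is an illustrative guess, not a prescribed value. Scraping is I/O-bound, so a thread pool generally fits better than multiprocessing.Pool, which additionally requires the function and its results to be picklable.

import concurrent.futures
import itertools as it
import urllib.request

def parse(url):
    # Placeholder for the asker's scraping function: fetch the page and
    # return a list of parsed items (here just the first bytes, for illustration)
    with urllib.request.urlopen(url, timeout=60) as conn:
        return [conn.read(100)]

my_new_list = ['http://www.bbc.co.uk/', 'http://www.cnn.com/']  # stand-in URLs

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        # Same shape as map(parse, my_new_list), but the calls run concurrently
        mapped_parse_links = executor.map(parse, my_new_list)
        all_parsed = list(it.chain.from_iterable(mapped_parse_links))
    print(len(all_parsed))

One caveat: executor.map re-raises the first exception it encounters while you iterate over the results, so with thousands of links the submit/as_completed pattern above, with its per-future try/except, degrades more gracefully when individual URLs fail.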