Speeding up scraping data from a given list of websites with threads

Date: 2015-10-18 04:16:24

Tags: multithreading python-3.x web-scraping

I have written a program that scrapes information from a given list of websites (100 links). Currently, my program does this sequentially; that is, it checks one site at a time. The skeleton of my program is as follows.

for j in range(num_of_links):
    try:  # if an error occurs, this jumps to the next website in the list
        site_exist(j)          # a function to check if the site exists
        get_url_with_info(j)   # a function to get the links inside the website
    except Exception as e:
        print(str(e))
filter_result_info(links_with_info)  # function that filters the results

Needless to say, this process is very slow. Is it therefore possible to implement threading so that my program finishes the job faster, with 4 concurrent jobs each scraping 25 of the links? Can you point me to a reference on how to do this?
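The 4-jobs-of-25-links split described above can be sketched with plain `threading.Thread` objects, one per chunk. This is a minimal sketch, not the question's actual code: `scrape_one` is a hypothetical stand-in for the `site_exist`/`get_url_with_info` pair.

```python
import threading


def scrape_one(link):
    # placeholder for the question's site_exist / get_url_with_info work
    return link.upper()


def scrape_chunk(chunk, results):
    # each worker thread processes its own slice of the link list
    for link in chunk:
        try:
            results.append(scrape_one(link))
        except Exception as e:
            print(e)


def scrape_all(links, n_workers=4):
    results = []  # list.append is thread-safe in CPython
    size = -(-len(links) // n_workers)  # ceiling division: links per worker
    threads = [threading.Thread(target=scrape_chunk,
                                args=(links[i:i + size], results))
               for i in range(0, len(links), size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until every worker finishes
    return results
```

With 100 links this starts 4 threads of 25 links each; note the results come back in whatever order the threads finish.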

2 answers:

Answer 0 (score: 1)

What you want is a pool of threads:

from concurrent.futures import ThreadPoolExecutor


def get_url(url):
    try:
        if site_exists(url):
            return get_url_with_info(url)
        else:
            return None
    except Exception as error: 
        print(error)


with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns a lazy iterator; converting it to a list waits
    # until all URLs have been retrieved
    list_of_results = list(pool.map(get_url, list_of_urls))

filter_result_info(list_of_results)  # note that some results might be None
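If you would rather handle each result as soon as it finishes instead of in input order, `submit` plus `as_completed` is an alternative to `map`. A minimal sketch; `fetch` here is a placeholder, not the question's actual function:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url):
    # placeholder for the real per-URL scraping work
    return "info:" + url


urls = ["a", "b", "c", "d"]
results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future per URL immediately
    futures = [pool.submit(fetch, u) for u in urls]
    for fut in as_completed(futures):  # yields futures as they complete
        results.append(fut.result())
```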

Answer 1 (score: 0)

Threading will not speed this up. Multiprocessing may be what you want.

Multiprocessing vs Threading Python