Multithreaded web scraping

Time: 2019-07-30 10:35:04

Tags: python python-3.x multithreading

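I wrote some code to collect web links. It takes about 2 minutes 20 seconds to run, which is a long time given that it is only one function in my program. I would like to make it more efficient. I have considered multithreading, but I am having trouble understanding it well enough to apply it to this code.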

import re

import requests

BASE_URL = "https://www.gsmarena.com/"

def get_manufacturer():
    manufacturers = requests.get(BASE_URL)
    # Grab the line of the front page that holds the manufacturer list
    res = re.findall(r"<li><a href=\"samsung-phones-9.php\">.+\n", manufacturers.text)
    manufacturer_links = re.findall(r"<li><a href=\"(.+?)\">", res[0])
    final_list = []
    for link in manufacturer_links:
        final_list.append(BASE_URL + link)
    # find pages: one sequential HTTP request per URL
    # (final_list grows during this loop, so the appended page URLs are fetched too)
    for url in final_list:
        req = requests.get(url)
        res2 = re.findall(r"<strong>1</strong>(.+)</div>", req.text)
        for k in res2:
            if k is not None:
                pages = re.findall(r"<a href=\"(.+?)\">.</a>", res2[0])
                for page in pages:
                    final_list.append(BASE_URL + page)
    return final_list

1 answer:

Answer 0 (score: 0)

You can run a for loop in parallel, as in the following example:

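import multiprocessing as mul

def calcIntOfnth(i, ppStr, c, znot):
    ...  # per-item work; the body was omitted in the original answer

if __name__ == "__main__":  # guard needed on platforms that spawn worker processes
    # ppStr, c, znot and k must be defined by the caller before this point
    with mul.Pool(mul.cpu_count()) as pool:
        results = pool.starmap(calcIntOfnth, [(i, ppStr, c, znot) for i in range(k)])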

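You will need to rewrite the body of your for loop as a function and then run that function in parallel, using a Pool object or something similar.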
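
Applied to the code in the question, that advice might look like the sketch below. The loop body there mostly waits on HTTP responses, so this sketch swaps in a thread pool (`concurrent.futures.ThreadPoolExecutor`) rather than `multiprocessing.Pool`; that choice is an assumption about where the time goes, not something the answer states. The names `fetch_text`, `find_pages`, `get_all_pages`, and the default of 16 workers are illustrative and do not come from the original post; the two regular expressions are the ones from the question.

```python
import re
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://www.gsmarena.com/"
# Pagination patterns taken from the question's code.
PAGER_RE = re.compile(r'<strong>1</strong>(.+)</div>')
PAGE_LINK_RE = re.compile(r'<a href="(.+?)">.</a>')


def fetch_text(url):
    """Download one page and return its HTML (hypothetical helper)."""
    # Imported lazily so the parsing logic can be tried without the dependency.
    import requests
    return requests.get(url, timeout=10).text


def find_pages(url, fetch=fetch_text):
    """Former loop body: return the extra page URLs linked from one page."""
    pagers = PAGER_RE.findall(fetch(url))
    if not pagers:
        return []
    return [BASE_URL + href for href in PAGE_LINK_RE.findall(pagers[0])]


def get_all_pages(manufacturer_urls, fetch=fetch_text, max_workers=16):
    """Run find_pages for every URL in a thread pool and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as manufacturer_urls.
        page_lists = list(
            executor.map(lambda url: find_pages(url, fetch), manufacturer_urls)
        )
    final_list = list(manufacturer_urls)
    for pages in page_lists:
        final_list.extend(pages)
    return final_list
```

Here `get_all_pages` would receive the manufacturer URLs built in the first half of `get_manufacturer`. The `fetch` parameter exists only so the parsing can be exercised with canned HTML, and a smaller `max_workers` is the gentler choice if the site throttles clients.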