Question

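I need to extract some attributes from web pages (in my example just one: the text description of an app). The problem is time! With the code below, requesting a page, extracting part of the HTML, and saving it takes about 1.2-1.8 seconds per page. That is a lot of time. Is there a way to make it faster? I have a lot of pages; x can be as high as 200000. I'm using Jupyter.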

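    import urllib3
    from bs4 import BeautifulSoup

    http = urllib3.PoolManager()  # assumed: `http` is a urllib3 PoolManager

    Description = []
    for x in range(len(M)):  # M is the list of page URLs
        response = http.request('GET', M[x])
        soup = BeautifulSoup(response.data, "lxml")
        t = str(soup.find("div", attrs={"class": "section__description"}))
        Description.append(t)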

Thanks

Answer 1

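You should consider inspecting the page. If the page relies on a REST API, you can scrape it by fetching the content you need directly from the API. That is much more efficient than extracting the content from the HTML. To consume the API, check out the Requests library for Python.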
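As a rough illustration of what that could look like, here is a minimal sketch. The endpoint `https://example.com/api/apps/<id>`, the `description` JSON field, and the helper name `fetch_description` are all hypothetical; the real URL and field names have to be read from the browser's network tab while inspecting the page.

```python
def fetch_description(app_id, session=None, base_url="https://example.com/api/apps"):
    """Fetch one app's description from a (hypothetical) JSON API."""
    if session is None:
        # Imported lazily so the helper can also be driven by a stand-in session.
        import requests
        session = requests.Session()
    # A timeout keeps one slow page from stalling the whole run.
    response = session.get(f"{base_url}/{app_id}", timeout=10)
    response.raise_for_status()
    # Assumed response shape: {"description": "..."}; adjust to the real API.
    return response.json().get("description")
```

When scraping many pages, passing one shared `requests.Session()` as `session` lets the connection to the server be reused instead of being reopened for every request.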

Answer 2

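As I said in my comment, I would split this across multiple processes. You can put your code into a function and use multiprocessing like this: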

    from multiprocessing import Pool

    import urllib3
    from bs4 import BeautifulSoup

    http = urllib3.PoolManager()  # assumed: same `http` object as in the question

    def web_scrape(url):
        response = http.request('GET', url)
        soup = BeautifulSoup(response.data, "lxml")
        t = str(soup.find("div", attrs={"class": "section__description"}))
        return t

    if __name__ == '__main__':
        # M is your list of urls
        M = ["https:// ... ", " ... ", " ... "]
        # 5, or however many processes you think is appropriate
        # (start with how many cores you have, maybe)
        p = Pool(5)
        description = p.map(web_scrape, M)  # already a list, in the same order as M
        p.close()
        p.join()

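What happens here is that your list of URLs is distributed across multiple processes, each running your scrape function. All the results are then merged and end up in description. This should be much faster than processing one URL at a time, as you do now.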

For more details: https://docs.python.org/2/library/multiprocessing.html
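Since each page fetch spends most of its time waiting on the network, a pool of threads may give a similar speedup and can be easier to run from a Jupyter notebook than separate processes. The following is a minimal sketch of that swapped-in variant, using the standard library's `concurrent.futures` instead of `multiprocessing`; the helper name `scrape_all` and the default of 16 workers are illustrative choices, not something from the answer above.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls, scrape_one, max_workers=16):
    """Apply scrape_one to every URL using a pool of threads.

    Results come back in the same order as urls. If scrape_one raises
    for some URL, that exception is re-raised while collecting results.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(scrape_one, urls))
```

With the `web_scrape` function from the answer, this would be called as `description = scrape_all(M, web_scrape)`. The best worker count depends on the site and your connection, so it is worth timing a few values on a small sample of URLs first.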

Web scraping. How to make it faster?

2 answers: