Web scraping in Python for practice

Time: 2019-06-02 09:22:45

Tags: python web-scraping

I'm scraping a website for practice, trying to get each product's title and description. I have already managed to get the product title, but I'm stuck on how to get the description.

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20card')

# first=True returns only the first product container
containers = r.html.find('.item-container', first=True)
# print(containers.html)

# The title is read from the title attribute of the branding image
title = containers.find('.item-branding img', first=True).attrs['title']
# print(title)

# .html returns the whole <a> element, markup included
description = containers.find('.item-title', first=True).html
print(description)

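The problem is the description: I want only the text inside the a tag, which holds the product description, but I can't work out how to extract it. Any help would be appreciated.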

From this:

<a class="item-title" href="https://www.newegg.com/evga-geforce-rtx-2080-ti-11g-p4-2281-kr/p/N82E16814487418?Item=N82E16814487418" title="View Details"><i class="icon-premier icon-premier-xsm"/>EVGA GeForce RTX 2080 Ti DirectX 12 11G-P4-2281-KR BLACK EDITION GAMING Video Card, Dual HDB Fans &amp; RGB LED</a>

I want to grab this:

EVGA GeForce RTX 2080 Ti DirectX 12 11G-P4-2281-KR BLACK EDITION GAMING Video Card, Dual HDB Fans &amp; RGB LED

1 answer:

Answer 0: (score: 0)

I suggest using BeautifulSoup to parse the page content. Your code would look like this:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get('https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20card')
soup = BeautifulSoup(r.content, "lxml")

# Match on the class attribute via an attrs dict
containers = soup.find("div", {"class": "item-container"})
# Index 1 assumes the second lazy-loaded image in the container carries the title
title = containers.find_all("img", {"class": "lazy-img"})[1]["title"]
# get_text() returns only the text of the <a>, without its markup
description = containers.find("a", {"class": "item-title"}).get_text()
print(description)
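
The underlying point is that you want the text of the a element rather than its markup. With requests_html alone, reading the element's .text property instead of .html should give the same result. To illustrate the idea without a network request or third-party packages, here is a minimal sketch that uses only the standard library's html.parser on the snippet from the question. The class name AnchorTextExtractor and the variable names are made up for this example, and it assumes the anchors of interest carry the item-title class as in the snippet.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside <a class="item-title">."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self._inside = False   # True while we are inside a matching <a>
        self._chunks = []      # text pieces of the current <a>
        self.titles = []       # one entry per matching <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "item-title" in classes:
            self._inside = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.titles.append("".join(self._chunks).strip())
            self._inside = False

    def handle_data(self, data):
        # Nested tags such as <i/> produce no data, so only the text is kept
        if self._inside:
            self._chunks.append(data)


# The anchor element quoted in the question
snippet = (
    '<a class="item-title" '
    'href="https://www.newegg.com/evga-geforce-rtx-2080-ti-11g-p4-2281-kr/p/N82E16814487418?Item=N82E16814487418" '
    'title="View Details">'
    '<i class="icon-premier icon-premier-xsm"/>'
    'EVGA GeForce RTX 2080 Ti DirectX 12 11G-P4-2281-KR BLACK EDITION GAMING Video Card, '
    'Dual HDB Fans &amp; RGB LED</a>'
)

parser = AnchorTextExtractor()
parser.feed(snippet)
parser.close()
print(parser.titles[0])
```

Note that the parser decodes the &amp; entity, so the printed description contains a plain & character.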

Hope this helps.