Web scraping in Python: copying a specific part of the HTML for each web page

Time: 2021-04-19 01:57:49

Tags: python web-scraping beautifulsoup

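I am building a web scraper with requests-html and Beautiful Soup (I am new to both). For one web page (https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html), I am trying to scrape one part of the page, and I will then repeat the same extraction for other products. The HTML looks like: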

<span class="js-enhanced-ecommerce-data hidden" data-product-title="Illamasqua Expressionist Artistry Palette" data-product-id="12024086" data-product-category="" data-product-is-master-product-id="false" data-product-master-product-id="12024086" data-product-brand="Illamasqua" data-product-price="£39.00" data-product-position="1">
</span>

I want to select data-product-brand="Illamasqua", specifically the value Illamasqua. I am not sure how to get it with requests-html or BeautifulSoup. I tried:

r.html.find("span.data-product-brand", first=True)

But this did not work. Any help would be appreciated.

2 answers:

Answer 0 (score: 2)

Since you tagged beautifulsoup, here is a solution using that package:

from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
soup = BeautifulSoup(page.content, "html.parser")

# Several elements on this page carry this class string, and each one has a
# data-product-brand attribute (which I think is what you want in the end);
# in this case there are three. Loop through them and print each brand:
for tag in soup.find_all(class_="js-enhanced-ecommerce-data hidden"):
    print(tag.get('data-product-brand'))

# If the first match is always the one you want, you can just do this:
print(soup.find(class_="js-enhanced-ecommerce-data hidden").get('data-product-brand'))
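
Since the goal is to repeat this for other products, the lookup above can be wrapped in a small helper. The following is only a minimal sketch under stated assumptions: `extract_brand` and `SAMPLE_HTML` are hypothetical names introduced here, the sample markup is a shortened copy of the snippet from the question rather than a live page (so it runs without network access), and it assumes other product pages use the same `js-enhanced-ecommerce-data` span.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question; a real run would pass
# page.content from requests.get(...) instead of this string.
SAMPLE_HTML = """
<span class="js-enhanced-ecommerce-data hidden"
      data-product-title="Illamasqua Expressionist Artistry Palette"
      data-product-id="12024086"
      data-product-brand="Illamasqua"
      data-product-price="£39.00">
</span>
"""


def extract_brand(html):
    """Return the first data-product-brand value found in html, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Matching on one class name is less brittle than the exact class string.
    tag = soup.find("span", class_="js-enhanced-ecommerce-data")
    if tag is None:
        return None
    # .get() returns None instead of raising if the attribute is missing.
    return tag.get("data-product-brand")


brand = extract_brand(SAMPLE_HTML)
missing = extract_brand("<p>no product data here</p>")
print(brand, missing)
```

For live pages, something like `extract_brand(requests.get(url).content)` could be called once per product URL; returning `None` for pages without the span keeps a loop over many products from crashing on the first odd page.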

Answer 1 (score: 1)

You can directly fetch the element that has the specified data attribute:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
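# [data-product-brand] is a CSS attribute selector; first=True returns one element
span = r.html.find('[data-product-brand]', first=True)
# print the attribute value (the brand) rather than the whole element
print(span.attrs['data-product-brand'])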

Without first=True there are 3 results; I guess you only need the first one.
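
The selector tried in the question, `span.data-product-brand`, looks for a `<span>` whose class is `data-product-brand`, which is why it matched nothing; `[data-product-brand]` matches on the attribute instead. Below is a minimal sketch of the difference. It uses BeautifulSoup's `select_one` on an inline, shortened copy of the question's snippet, an assumption made so that it runs offline; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the span from the question (not fetched live).
html = (
    '<span class="js-enhanced-ecommerce-data hidden" '
    'data-product-brand="Illamasqua" data-product-price="£39.00"></span>'
)
soup = BeautifulSoup(html, "html.parser")

# "span.name" is a class selector: no span here has class "data-product-brand".
by_class = soup.select_one("span.data-product-brand")

# "span[name]" is an attribute selector: it matches a span carrying the attribute.
by_attr = soup.select_one("span[data-product-brand]")

print(by_class)
print(by_attr["data-product-brand"])
```

The same attribute selector string should also work in the `r.html.find(...)` call shown above, since both libraries accept CSS selectors.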