Question

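So I'm trying to scrape the name, brand, and price of perfumes from the Sephora website. But I noticed that only the first 12 of the 60 perfumes show up (there are 60 perfumes on one page). I tried printing the length of "item_container" and it shows 60, but from the 12th item onward, markup with a different structure starts to appear. I have checked their HTML structure, but I can't figure out why my code doesn't work for the rest. I also tried changing the class to a more specific one, for example: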

perfume_containers = soup.find_all('div', class_="css-12egk0t")

to

perfume_containers = soup.find_all('div', class_="css-ix8km1")

But it gave me the same result, or returned nothing at all.
HTML code of item that is not showing up

HTML code of item that works

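Here is my code. I'm only showing the part that extracts the brand, because the whole thing would be too long. Please send some help! Thank you!!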

import pandas as pd
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.sephora.com/shop/perfume')
soup = BeautifulSoup(source.content, 'html.parser')
perfume_containers = soup.find_all('div', class_="css-12egk0t")
brands = []
for container in perfume_containers:
    # The brand
    brand = container.find('span', class_='css-ktoumz')
    try:
        brands.append(brand.text)
    except AttributeError:
        # find() returned None: this container has no brand span
        continue

Answer 1

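The problem here is that, as you can see from the HTML attribute data-lload="comp", some of the data is loaded after the page loads, whereas the 12 items you scraped successfully have the attribute data-lload="false". BeautifulSoup parses the HTML exactly as it appears when you view the page source, and in the source you can see that only 12 items are loaded, so the rest are presumably loaded some other way (perhaps via AJAX). In this case, though, I found that the items are actually delivered in the source as JSON, in a script tag near the bottom of the source, so you don't really need to scrape the markup anymore. You can access the JSON directly, as follows: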

import pandas as pd
from bs4 import BeautifulSoup
import requests
import json

source = requests.get('https://www.sephora.com/shop/perfume')
soup = BeautifulSoup(source.content, 'html.parser')

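# The product data is embedded as JSON in the script tag with id="linkJSON"
scriptContent = soup.find(id="linkJSON").text

catalog = json.loads(scriptContent)

# Here the product list sits in the fourth entry of the parsed JSON
products = catalog[3]['props']['products']

extracted = []
for p in products:
    extracted.append({
        'brand': p['brandName'],
        'displayName': p['displayName'],
        'price': p['currentSku']['listPrice'],
    })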

print(extracted)
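
Hard-coding catalog[3] ties the code to the current order of entries in the embedded JSON, which may change. As a minimal sketch, assuming the parsed JSON is a list of entries and that the product list still lives under props -> products, you could search for the entry instead of indexing it. The helper names find_products and to_rows are made up for illustration, and the sample data below is invented so the snippet runs without a network request; it is not real Sephora output.

```python
import json


def find_products(catalog):
    """Return the first 'products' list found under a 'props' key, or []."""
    for entry in catalog:
        if not isinstance(entry, dict):
            continue
        props = entry.get('props')
        if isinstance(props, dict) and isinstance(props.get('products'), list):
            return props['products']
    return []


def to_rows(products):
    """Flatten product dicts into brand/displayName/price rows.

    Missing keys become None instead of raising KeyError.
    """
    rows = []
    for p in products:
        sku = p.get('currentSku') or {}
        rows.append({
            'brand': p.get('brandName'),
            'displayName': p.get('displayName'),
            'price': sku.get('listPrice'),
        })
    return rows


# Invented sample standing in for the text of the linkJSON script tag.
sample_script_text = json.dumps([
    {'class': 'Header'},
    {'class': 'ProductGrid', 'props': {'products': [
        {'brandName': 'Brand A', 'displayName': 'Perfume One',
         'currentSku': {'listPrice': '$50.00'}},
        {'brandName': 'Brand B', 'displayName': 'Perfume Two',
         'currentSku': {}},
    ]}},
])

rows = to_rows(find_products(json.loads(sample_script_text)))
print(rows)
```

With the real page, you would pass json.loads(scriptContent) to find_products in place of the sample. An empty result then signals that the JSON layout has changed, rather than failing with an IndexError or KeyError.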

Using Beautiful Soup to find target "items"
