Question

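I'm new to web scraping, but I have enough command of requests, BeautifulSoup and Selenium to extract data from a website. My problem is that I'm trying to scrape data from a website where the URL does not change when I click the page number for the next page.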

Website URL ==> https://www.ellsworth.com/products/adhesives/

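I also tried using the Google Developer Tools, but without success. I'd appreciate it if someone could guide me with code. [Screenshot: Google Developer Tools showing the GET request]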

Here is my code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import requests
itemproducts = pd.DataFrame()
# Selenium 4 expects the driver path to be wrapped in a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('https://www.ellsworth.com/products/adhesives/')
base_url = 'https://www.ellsworth.com'
html= driver.page_source
s = BeautifulSoup(html,'html.parser')
data = []

href_link = s.find_all('div',{'class':'results-products-item-image'})
for links in href_link:
    href_link_a = links.find('a')['href']
    data.append(base_url+href_link_a)
# url = 'https://www.ellsworth.com/products/adhesives/silicone/dow-838-silicone-adhesive-sealant-white-90-ml-tube/'

for c in data:
    driver.get(c)
    html_pro = driver.page_source
    soup = BeautifulSoup(html_pro,'html.parser')
    title = soup.find('span',{'itemprop':'name'}).text.strip()
    part_num = soup.find('span',{'itemprop':'sku'}).text.strip()
    manfacture = soup.find('span',{'class':'manuSku'}).text.strip()
    manfacture_ = manfacture.replace('Manufacturer SKU:', '').strip()
    pro_det = soup.find('div',{'class':'product-details'})
    p = pro_det.find_all('p')
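    # Default to empty strings so a missing paragraph does not leave stale or undefined values.
    d = p[1].text.strip() if len(p) > 1 else ''
    # p is a list of tags, so p.text raises an error; using the first paragraph here is an assumption.
    c = p[0].text.strip() if p else ''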
    table = pro_det.find('table',{'class':'table'})
    tr = table.find_all('td')
    typical = tr[1].text.strip()
    brand = tr[3].text.strip()
    color = tr[5].text.strip()
    image = soup.find('img',{'itemprop':'image'})['src']
    image_ = base_url + image
    png_url = title +('.jpg')
    img_data = requests.get(image_).content
    with open(png_url,'wb') as fh:
        fh.write(img_data)

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    itemproducts = pd.concat([itemproducts, pd.DataFrame([{'Product Title':title,
                                     'Part Number':part_num,
                                     'SKU':manfacture_,
                                     'Description d':d,
                                     'Description c':c,
                                     'Typical':typical,
                                     'Brand':brand,
                                     'Color':color,
                                     'Image URL':image_}])], ignore_index=True)

Answer 1

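The page content is rendered dynamically, but if you check the XHR tab under Network in the developer tools, you can find the API request URL. I've shortened the URL a bit, and it still works fine.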

Here's how to get the list of the first 10 products from page 1:

import requests

start = 0
n_items = 10

api_request_url = f"https://www.ellsworth.com/api/catalogSearch/search?sEcho=1&iDisplayStart={start}&iDisplayLength={n_items}&DefaultCatalogNode=Adhesives&_=1497895052601"

data = requests.get(api_request_url).json()

print(f"Found: {data['iTotalRecords']} items.")

for item in data["aaData"]:
    print(item)

This gives you a nice JSON response with all the data for each product, which should get you started.

['Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can', 'P-1-INSA-LUTE-ADHESIVE', 'P-1 INSA-LUTE ADHESIVE', '$72.82', '/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-no.-p-1-powder-off-white-1-qt-can/', '/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white-1qt_170x170.jpg', 'Adhesives-Ceramic', '[{"qty":"1-2","price":"$72.82","customerPrice":"$72.82","eachPrice":"","custEachPrice":"","priceAmount":"72.820000000","customerPriceAmount":"72.820000000","currency":"USD"},{"qty":"3-15","price":"$67.62","customerPrice":"$67.62","eachPrice":"","custEachPrice":"","priceAmount":"67.620000000","customerPriceAmount":"67.620000000","currency":"USD"},{"qty":"16+","price":"$63.36","customerPrice":"$63.36","eachPrice":"","custEachPrice":"","priceAmount":"63.360000000","customerPriceAmount":"63.360000000","currency":"USD"}]', '', '', '', 'P1-Q', '1000', 'true', 'Presentation of packaged goods may vary. For special packaging requirements, please call (877) 454-9224', '', '', '']
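Each entry in aaData is a plain list, so it helps to give the positions names. Here is a minimal sketch; the field names and the meaning of each index are my guesses from the sample row above, not something the API documents, and parse_row and BASE_URL are names I made up.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.ellsworth.com"


def parse_row(row):
    """Turn one aaData row into a dict (index meanings are assumed from the sample)."""
    return {
        "title": row[0],
        "price": row[3],
        # The API returns site-relative paths, so join them with the base URL.
        "product_url": urljoin(BASE_URL, row[4]),
        "image_url": urljoin(BASE_URL, row[5]),
        "category": row[6],
        # Index 7 holds the price tiers as a JSON-encoded string.
        "price_tiers": json.loads(row[7]) if row[7] else [],
    }
```

With something like this you can build the product and image URLs directly from the API response, without opening each product page in Selenium first.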

To get the next 10 items, change the value of iDisplayStart to 10. And if you want more items per request, just change iDisplayLength, for example to 20.
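In other words, iDisplayStart is an item offset, not a page number. A small sketch of the mapping, assuming 1-based page numbers; page_url is a helper name I made up, and the query parameters are copied from the URL above.

```python
from urllib.parse import urlencode

API_URL = "https://www.ellsworth.com/api/catalogSearch/search"


def page_url(page, n_items=10):
    """Build the API URL for a 1-based page number."""
    query = {
        "sEcho": 1,
        # Page 1 starts at offset 0, page 2 at n_items, and so on.
        "iDisplayStart": (page - 1) * n_items,
        "iDisplayLength": n_items,
        "DefaultCatalogNode": "Adhesives",
        "_": "1497895052601",
    }
    return f"{API_URL}?{urlencode(query)}"


print(page_url(2))
```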

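In the demo I replaced these values with start and n_items, but you can easily automate this, because the total number of items found comes back with the response as iTotalRecords.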
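One way to automate it is to keep requesting pages until the offset reaches iTotalRecords. This is only a sketch under a few assumptions: fetch_page and iter_items are names I made up, the page size of 50 is arbitrary (I have not checked what limit the server enforces), and the fetch function is passed in as a parameter so the paging logic can be tried without hitting the site.

```python
API_URL = "https://www.ellsworth.com/api/catalogSearch/search"


def fetch_page(start, n_items):
    """Request one page of results and return the decoded JSON."""
    import requests  # imported here so iter_items can be used without it

    params = {
        "sEcho": 1,
        "iDisplayStart": start,
        "iDisplayLength": n_items,
        "DefaultCatalogNode": "Adhesives",
        "_": "1497895052601",
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_items(fetch=fetch_page, n_items=50):
    """Yield every row, advancing the offset until iTotalRecords is reached."""
    start = 0
    while True:
        data = fetch(start, n_items)
        rows = data["aaData"]
        if not rows:
            # Guard against looping forever if the server returns an empty page.
            break
        yield from rows
        start += len(rows)
        if start >= int(data["iTotalRecords"]):
            break


# Usage (makes real HTTP requests):
# for item in iter_items():
#     print(item[0])
```

If you scrape the whole catalog this way, consider adding a short time.sleep between requests to go easy on the server.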

Scraping data from a website where the URL does not change
