我是Web爬网的新手,但是对请求(BeautifulSoup和Selenium)有足够的命令,可以从网站提取数据。现在的问题是,当我单击下一页的页码时,我试图从 URL不变的网站上抓取数据。
网站URL ==> https://www.ellsworth.com/products/adhesives/
我也尝试使用Google Developer工具,但无法成功。如果有人用代码指导我,将不胜感激。 Google Developer show Get Request
这是我的代码
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import requests
itemproducts = pd.DataFrame()
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.ellsworth.com/products/adhesives/')
base_url = 'https://www.ellsworth.com'
html= driver.page_source
s = BeautifulSoup(html,'html.parser')
data = []
href_link = s.find_all('div',{'class':'results-products-item-image'})
for links in href_link:
href_link_a = links.find('a')['href']
data.append(base_url+href_link_a)
# url = 'https://www.ellsworth.com/products/adhesives/silicone/dow-838-silicone-adhesive-sealant-white-90-ml-tube/'
for c in data:
driver.get(c)
html_pro = driver.page_source
soup = BeautifulSoup(html_pro,'html.parser')
title = soup.find('span',{'itemprop':'name'}).text.strip()
part_num = soup.find('span',{'itemprop':'sku'}).text.strip()
manfacture = soup.find('span',{'class':'manuSku'}).text.strip()
manfacture_ = manfacture.replace('Manufacturer SKU:', '').strip()
pro_det = soup.find('div',{'class':'product-details'})
p = pro_det.find_all('p')
try:
d = p[1].text.strip()
c = p.text.strip()
except:
pass
table = pro_det.find('table',{'class':'table'})
tr = table.find_all('td')
typical = tr[1].text.strip()
brand = tr[3].text.strip()
color = tr[5].text.strip()
image = soup.find('img',{'itemprop':'image'})['src']
image_ = base_url + image
png_url = title +('.jpg')
img_data = requests.get(image_).content
with open(png_url,'wb') as fh:
fh.write(img_data)
itemproducts=itemproducts.append({'Product Title':title,
'Part Number':part_num,
'SKU':manfacture_,
'Description d':d,
'Description c':c,
'Typical':typical,
'Brand':brand,
'Color':color,
'Image URL':image_},ignore_index=True)
答案 0 :(得分:2)
页面的内容是动态呈现的,但是如果您在开发人员工具中的“网络”下检查XHR选项卡,则可以获取API请求网址。我稍微缩短了URL,但仍然可以正常使用。
在这里,您可以从第1页获取前10种产品的列表:
import requests
start = 0
n_items = 10
api_request_url = f"https://www.ellsworth.com/api/catalogSearch/search?sEcho=1&iDisplayStart={start}&iDisplayLength={n_items}&DefaultCatalogNode=Adhesives&_=1497895052601"
data = requests.get(api_request_url).json()
print(f"Found: {data['iTotalRecords']} items.")
for item in data["aaData"]:
print(item)
这为您提供了一个不错的JSON响应,其中包含每个产品的所有数据,应该可以让您入门。
['Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can', 'P-1-INSA-LUTE-ADHESIVE', 'P-1 INSA-LUTE ADHESIVE', '$72.82', '/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-no.-p-1-powder-off-white-1-qt-can/', '/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white-1qt_170x170.jpg', 'Adhesives-Ceramic', '[{"qty":"1-2","price":"$72.82","customerPrice":"$72.82","eachPrice":"","custEachPrice":"","priceAmount":"72.820000000","customerPriceAmount":"72.820000000","currency":"USD"},{"qty":"3-15","price":"$67.62","customerPrice":"$67.62","eachPrice":"","custEachPrice":"","priceAmount":"67.620000000","customerPriceAmount":"67.620000000","currency":"USD"},{"qty":"16+","price":"$63.36","customerPrice":"$63.36","eachPrice":"","custEachPrice":"","priceAmount":"63.360000000","customerPriceAmount":"63.360000000","currency":"USD"}]', '', '', '', 'P1-Q', '1000', 'true', 'Presentation of packaged goods may vary. For special packaging requirements, please call (877) 454-9224', '', '', '']
如果要获取下10个项目,则必须将iDisplayStart
的值修改为10
。而且,如果您希望每个请求有更多项目,只需将iDisplayLength
更改为20
。
在演示中,我将这些值替换为start
和n_items
,但是您可以轻松地自动执行此操作,因为找到的所有项目的数量都随响应一起出现,例如iTotalRecords
。