How to get all pages when web scraping

Date: 2020-07-15 02:23:42

Tags: python python-3.x list selenium-webdriver web-scraping

I am trying to get a list of all the shoes from every page of this site, https://www.dickssportinggoods.com/f/all-mens-footwear, but I don't know what else my code needs. Basically, I want to pick one shoe brand and list its shoes across all pages of the site. For example, if I select New Balance, I want to print a list of all the shoes for the brand I chose. Here is my code below:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
Url = 'https://www.dickssportinggoods.com/f/all-mens-footwear'
uClient = uReq(Url)
Page = uClient.read()
uClient.close()
page_soup = soup(Page, "html.parser")
for i in page_soup.findAll("div", {"class":"rs-facet-name-container"}):
    print(i.text)

3 Answers:

Answer 0: (score: 0)

You can click the filter button and check all the brands you want. You just need to call driver.find_element_by_xpath(). If you are using Selenium, you should already know this.

Answer 1: (score: 0)

The site updates its elements with JS scripts, so you won't be able to do this with BeautifulSoup alone; you have to use browser automation.

The code below does not work properly because the element is updated after a few milliseconds: the page first shows all brands, then updates to show only the selected ones, so use automation instead.

Failing code:

from bs4 import BeautifulSoup as soup
import time
from urllib.request import urlopen as uReq
Url = 'https://www.dickssportinggoods.com/f/all-mens-footwear'
url_st = 'https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&filterFacets=X_BRAND'
brands_name = ['New Balance']  # brands to filter on

for idx, br in enumerate(brands_name):
    if idx==0:
        url_st += '%3A'+ '%20'.join(br.split(' '))
    else: 
        url_st += '%2C' + '%20'.join(br.split(' '))

uClient = uReq(url_st)
time.sleep(4)
Page = uClient.read()
uClient.close()

page_soup = soup(Page, "html.parser") 
for match in page_soup.find_all('div', class_='rs_product_description d-block'):
    print(match.text)
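As an aside, the manual '%3A' / '%20' / '%2C' joins used to build the facet string can be replaced with urllib.parse.quote, which percent-encodes the separators for you. A minimal standalone sketch, using the same X_BRAND facet name:

```python
from urllib.parse import quote

def build_filter_facets(brands):
    # Join the brands with commas, prefix the X_BRAND facet name, and
    # let quote() percent-encode ':' -> %3A, ' ' -> %20, ',' -> %2C.
    return "filterFacets=" + quote("X_BRAND:" + ",".join(brands))

print(build_filter_facets(["New Balance"]))
# filterFacets=X_BRAND%3ANew%20Balance
```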

代码:(硒+ bs4)

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install())#, chrome_options=chrome_options)
driver.set_window_size(1024, 600)
driver.maximize_window()

brands_name = ['New Balance']

filter_facet ='filterFacets=X_BRAND'
for idx, br in enumerate(brands_name):
    if idx==0:
        filter_facet += '%3A'+ '%20'.join(br.split(' '))
    else: 
        filter_facet += '%2C' + '%20'.join(br.split(' '))

url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&{filter_facet}"        
driver.get(url)
time.sleep(4)
page_soup = soup(driver.page_source, 'html.parser')  
elem = driver.find_element_by_class_name('close')
if elem:
    elem.click()
for match in page_soup.find_all('div', class_='rs_product_description d-block'):
    print(match.text)
    
page_num = page_soup.find_all('a', class_='rs-page-item')
pnum = [int(pn.text) for pn in page_num if pn.text!='']
if len(pnum)>=2:
    for pn in range(1, len(pnum)):
        url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber={pn}&{filter_facet}"
        driver.get(url)
        time.sleep(2)
        page_soup = soup(driver.page_source, "html.parser") 
        for match in page_soup.find_all('div', class_='rs_product_description d-block'):
            print(match.text)

New Balance Men's 410v6 Trail Running Shoes
New Balance Men's 623v3 Training Shoes
.
.
.
New Balance Men's Fresh Foam Beacon Running Shoes
New Balance Men's Fresh Foam Cruz v2 SockFit Running Shoes
New Balance Men's 470 Running Shoes
New Balance Men's 996v3 Tennis Shoes
New Balance Men's 1260 V7 Running Shoes
New Balance Men's Fresh Foam Beacon Running Shoes
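The pagination step above collects the numbered rs-page-item anchors and skips the empty-text ones (the prev/next arrows). Against a static snippet it behaves like this (the class name is taken from the code above; the markup itself is assumed):

```python
from bs4 import BeautifulSoup

# Static stand-in for the pagination bar returned by the site.
html = """
<a class="rs-page-item">1</a>
<a class="rs-page-item">2</a>
<a class="rs-page-item">3</a>
<a class="rs-page-item"></a>
"""

page_soup = BeautifulSoup(html, "html.parser")
# Keep only anchors with a numeric label; the empty one is an arrow.
pnum = [int(a.text) for a in page_soup.find_all("a", class_="rs-page-item") if a.text != ""]
print(pnum)  # [1, 2, 3]
```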

I have commented out headless Chrome because when you open the browser a dialog button appears, and after closing it you can get the product details. In headless automation you can't do that (couldn't solve it; my grasp of Selenium isn't that good).

Don't forget to install webdriver_manager with: pip install webdriver_manager

Answer 2: (score: 0)

The page uses JavaScript to create the links you want, so you can't scrape them directly; you need to replicate the page's request. In this case the page is sending a POST request:

Request URL: https://prod-catalog-product-api.dickssportinggoods.com/v1/search
Request Method: POST
Status Code: 200 OK
Remote Address: [2600:1400:d:696::25db]:443
Referrer Policy: no-referrer-when-downgrade

Inspect the request headers with your browser's developer tools so you can mimic the POST request.

This is the URL the POST request is sent to:

https://prod-catalog-product-api.dickssportinggoods.com/v1/search

This is the POST data the browser is sending:

{selectedCategory: "12301_1714863", selectedStore: "1406", selectedSort: 1,…}
isFamilyPage: true
pageNumber: 0
pageSize: 48
searchTypes: []
selectedCategory: "12301_1714863"
selectedFilters: {X_BRAND: ["New Balance"]}   #<--- this is the info that you want to get
selectedSort: 1
selectedStore: "1406"
storeId: 15108
totalCount: 3360

The page may also require headers, so make sure to mimic the request your browser sends.
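A minimal sketch of replicating that POST with the requests library. The payload mirrors the capture above; values like selectedStore and storeId came from that particular session and may need updating, and the headers are an assumed browser-like minimum, not confirmed from the site:

```python
import requests

URL = "https://prod-catalog-product-api.dickssportinggoods.com/v1/search"

def build_payload(brands, page_number=0):
    # Mirrors the captured browser payload shown above.
    return {
        "isFamilyPage": True,
        "pageNumber": page_number,
        "pageSize": 48,
        "searchTypes": [],
        "selectedCategory": "12301_1714863",
        "selectedFilters": {"X_BRAND": brands},
        "selectedSort": 1,
        "selectedStore": "1406",
        "storeId": 15108,
        "totalCount": 3360,
    }

def fetch_page(brands, page_number=0):
    # The API may reject requests without browser-like headers.
    headers = {"Content-Type": "application/json", "User-Agent": "Mozilla/5.0"}
    resp = requests.post(URL, json=build_payload(brands, page_number), headers=headers)
    resp.raise_for_status()
    return resp.json()

payload = build_payload(["New Balance"])
print(payload["selectedFilters"])  # {'X_BRAND': ['New Balance']}
```

Calling fetch_page(["New Balance"], 0), then incrementing page_number until the results run out, gives you every page without driving a browser.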