Selenium / BeautifulSoup - Python - Looping Through Multiple Pages

Time: 2018-12-28 23:14:15

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

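I have spent most of my day researching and testing the best way to loop through a set of products on a retailer's website.

While I can successfully collect the set of products (and their attributes) on the first page, I am still stuck on the best way to loop through the site's pages to continue scraping.

As the code below shows, I tried to use a "while" loop and Selenium to click the site's "next page" button and then continue collecting products.

The problem is that my code still does not get past page 1.

Am I making a silly mistake here? I have read 4 or 5 similar examples on this site, but none of them was specific enough to provide a solution for this case.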

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products.clear()
hyperlinks.clear()
reviewCounts.clear()
starRatings.clear()

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1


html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)

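2 Answers:

Answer 0 (score: 1)

You need to parse the page each time you click through to the next one. So you want to include the parsing inside the while loop; otherwise the prod_containers object never changes, and you keep iterating over the first page even after the browser has clicked through to the next page.

Second, the way you have it, the while loop will never stop, because you set pageCounter to 0 but never increment it, so it will always be less than your maxPageCount.

I fixed those 2 things in the code and ran it; it appears to have worked and parsed pages 1 through 5.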

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1

prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter +=1
    print(pageCounter)
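
To see why the parse has to sit inside the loop, here is a minimal sketch that runs without a browser. FakeDriver, click_next, scrape_stale, scrape_fresh and the sample pages are hypothetical stand-ins made up for illustration (they are not part of Selenium), and a regular expression replaces BeautifulSoup only to keep the example dependency-free.

```python
import re


# Hypothetical stand-in for a Selenium driver so the loop can run without a browser.
class FakeDriver:
    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    @property
    def page_source(self):
        # Like driver.page_source: always reflects the page currently shown.
        return self._pages[self._index]

    def click_next(self):
        # Like clicking the "next page" button; stays put on the last page.
        self._index = min(self._index + 1, len(self._pages) - 1)


# Simplified pattern for the product containers in the made-up sample pages.
ITEM_PATTERN = re.compile(r'<li class="products_grid">(.*?)</li>')


def scrape_stale(driver, page_count):
    """Parse once before the loop: every pass sees page 1 again."""
    items = ITEM_PATTERN.findall(driver.page_source)
    collected = []
    for _ in range(page_count):
        collected.extend(items)
        driver.click_next()
    return collected


def scrape_fresh(driver, page_count):
    """Parse inside the loop: each pass sees the page the driver is on."""
    collected = []
    for _ in range(page_count):
        collected.extend(ITEM_PATTERN.findall(driver.page_source))
        driver.click_next()
    return collected


# Made-up sample data standing in for three result pages.
PAGES = [
    '<li class="products_grid">Shirt A</li><li class="products_grid">Shirt B</li>',
    '<li class="products_grid">Shirt C</li>',
    '<li class="products_grid">Shirt D</li>',
]

stale = scrape_stale(FakeDriver(PAGES), len(PAGES))
fresh = scrape_fresh(FakeDriver(PAGES), len(PAGES))
print(stale)
print(fresh)
```

The stale version repeats the first page's products once per iteration, while the fresh version collects each page's products once. Driving the loop with range() also sidesteps the forgotten counter increment, although a while loop with an explicit increment works just as well.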

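Answer 1 (score: 0)

Well, this snippet will not run when executed on its own from a .py file. I am guessing you were running it in IPython or a similar environment where these variables were already initialized and the libraries already imported.

First, you need to import the regex package:

import re

Also, all of those clear() statements are unnecessary, since you initialize all of those lists anyway (in fact, Python will still raise an error, because the lists have not been defined yet when you call clear on them).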
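
A small sketch of that failure, using the list name from the question. It assumes a fresh interpreter in which products has not been assigned yet; error_message is a name introduced only for this illustration.

```python
# Calling clear() on a name that has not been assigned yet raises NameError.
try:
    products.clear()
    error_message = None
except NameError as exc:
    error_message = str(exc)

# Creating the list is all that is needed; a new list is already empty.
products = []

print(error_message)
print(products)
```

Because a newly created list is already empty, dropping the clear() calls and keeping only the assignments gives the same starting state without the error.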

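You also need to initialize html_soup before using it:

html_soup = BeautifulSoup(driver.page_source, 'html.parser')

Finally, you must set a value for counterProduct before referencing it in your code:

counterProduct = 0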

With these fixes applied, the code works correctly.