Scraping the next page with Selenium

Date: 2019-06-18 16:54:59

Tags: python selenium

When I navigate to the link below and find the pagination at the bottom of the page: https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&sort=Boosted

I can only scrape the first 4 or so pages before the script stops running.

I have tried the xpath, css_selector and WebDriverWait options.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

pages_remaining = True
page = 2  # starts at page 2 since page one is already scraped by the first loop

while pages_remaining:

    # scrape code

    try:
        # wait until the link for the next page number is clickable, then click it
        wait = WebDriverWait(browser, 20)
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, str(page)))).click()

        print(browser.current_url)
        page += 1

    except TimeoutException:
        pages_remaining = False

Current results in the console:

https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=2&sort=Boosted

 https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=3&sort=Boosted

 https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=4&sort=Boosted

2 Answers:

Answer 0 (score: 1)

This solution uses BeautifulSoup, since I'm not very familiar with Selenium.

Try creating a new variable with your number of pages. As you can see, the URL changes when you go to the next page, so you can simply manipulate the given URL. See the code sample below.

from requests import get  # assuming 'get' here is requests.get

# Define variable pages first
pages = [str(i) for i in range(1, 53)]  # 53 because you have 52 pages

for page in pages:
    response = get("https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page=" + page + "&sort=Boosted")
    # Rest of your code

This snippet should do the job for the remaining pages. Hope it helps, although it may not be exactly what you were looking for.
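For illustration, here is a minimal self-contained sketch of the same URL-manipulation approach, assuming requests and BeautifulSoup are installed; the "article" tag used to count products is a hypothetical placeholder, since the site's real markup is not shown in this question.

from requests import get
from bs4 import BeautifulSoup

base_url = ("https://shop.nordstrom.com/c/sale-mens-clothing"
            "?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing"
            "&page={}&sort=Boosted")

for page in range(1, 53):  # 52 pages, as above
    response = get(base_url.format(page))
    soup = BeautifulSoup(response.text, "html.parser")
    # 'article' is a hypothetical placeholder; inspect the live page
    # to find the real element that wraps each product
    items = soup.find_all("article")
    print(page, len(items))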

If you have any questions, post them below. ;)

Cheers.

Answer 1 (score: 1)

You can loop through the page numbers until no more results are shown, just by changing the URL:

from bs4 import BeautifulSoup
from selenium import webdriver

base_url = "https://m.shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page={}&sort=Boosted"

driver = webdriver.Chrome()

page = 1
soup = BeautifulSoup("", "html.parser")

# Will loop until there are no more results
while "Looks like we don’t have exactly what you’re looking for." not in soup.text:
    print(base_url.format(page))
    # Go to page
    driver.get(base_url.format(page))
    soup = BeautifulSoup(driver.page_source, "html.parser")

    ### your extracting code

    page += 1
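As a design note, matching the exact "no results" message is fragile, since the site can change that wording at any time. A possibly more robust variant, reusing driver and base_url from the snippet above, is to stop as soon as a page yields zero product elements; the "article" tag below is again a hypothetical placeholder for whatever element actually wraps each product.

page = 1
while True:
    driver.get(base_url.format(page))
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # 'article' is a hypothetical placeholder; inspect the real page
    # markup to find the element that wraps each product
    items = soup.find_all("article")
    if not items:
        break  # an empty page means we ran past the last page of results

    ### your extracting code

    page += 1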