When I navigate to the link below and find the pagination at the bottom of the page: https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&sort=Boosted
I can only scrape the first 4 or so pages before the script stops.
I have tried the xpath, css_selector, and WebDriverWait options.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome()
browser.get("https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&sort=Boosted")

pages_remaining = True
page = 2  # starts at page 2 since page one is scraped already with the first loop

while pages_remaining:
    # scrape code
    try:
        wait = WebDriverWait(browser, 20)
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, str(page)))).click()
        print(browser.current_url)
        page += 1
    except TimeoutException:
        pages_remaining = False
Current results in the console:
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=2&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=3&sort=Boosted
https://shop.nordstrom.com/c/sale-mens-designer-clothing-accessories-shoes?breadcrumb=Home%2FSale%2FMen%2FDesigner&page=4&sort=Boosted
Answer 0 (score: 1)
This solution takes the BeautifulSoup route, since I'm not too familiar with Selenium.
Try creating a new variable holding your page count. As you can see, the URL changes when you move to the next page, so just manipulate the given URL. See the code sample below.
from requests import get

# Define variable pages first
pages = [str(i) for i in range(1, 53)]  # 53 because you have 52 pages

for page in pages:
    response = get("https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page=" + page + "&sort=Boosted")
    # Rest of your code
This snippet should do the job on the remaining pages. Hope it helps, although it may not be exactly what you were looking for.
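For completeness, here is one way the loop could be filled in end to end. This is only a sketch: the "article" selector is a made-up placeholder, so inspect the page and substitute the site's real product-tile markup.

from requests import get
from bs4 import BeautifulSoup

pages = [str(i) for i in range(1, 53)]

for page in pages:
    response = get("https://shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page=" + page + "&sort=Boosted")
    soup = BeautifulSoup(response.text, "html.parser")
    # Placeholder selector: replace "article" with the actual product-tile tag/class
    for item in soup.select("article"):
        print(item.get_text(strip=True))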
If you have any questions, post them below. ;)
Cheers.
Answer 1 (score: 1)
You can loop through the page numbers until no more results show up, just by changing the URL:
from bs4 import BeautifulSoup
from selenium import webdriver

base_url = "https://m.shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page={}&sort=Boosted"

driver = webdriver.Chrome()
page = 1
soup = BeautifulSoup("", "html.parser")

# Will loop until there are no more results
while "Looks like we don’t have exactly what you’re looking for." not in soup.text:
    print(base_url.format(page))
    # Go to page
    driver.get(base_url.format(page))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    ### your extracting code
    page += 1
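A caveat on that stopping condition: it matches the site's exact message, curly apostrophes included, so it breaks silently if the copy changes, and driver.get can return before client-side rendering has finished. Below is a slightly more defensive sketch; the partial-string match, the readyState wait, and the max_pages cap are my own additions, not part of the original answer.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

base_url = "https://m.shop.nordstrom.com/c/sale-mens-clothing?origin=topnav&breadcrumb=Home%2FSale%2FMen%2FClothing&page={}&sort=Boosted"
no_results = "have exactly what you’re looking for"  # partial match is less brittle

driver = webdriver.Chrome()
page = 1
max_pages = 100  # safety cap so a missed match cannot loop forever

while page <= max_pages:
    driver.get(base_url.format(page))
    # Wait until the browser reports the document fully loaded
    WebDriverWait(driver, 10).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )
    soup = BeautifulSoup(driver.page_source, "html.parser")
    if no_results in soup.text:
        break
    ### your extracting code
    page += 1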