Question

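I'm trying to iterate over the books on a website. Each page should give me only 20 results before I move on to the next page.

I read an element to get the total number of pages (num_page), which gives me the maximum number of pages to iterate over.

The problem in my code is that the nested loop (the one that locates the anchor nodes) doesn't just return the 20 URLs from a single page; it keeps looping over the same page.

I'm not 100% sure the nested loop is what's going wrong, so any pointers would be very helpful.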

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

options = webdriver.FirefoxOptions()
options.add_argument("--headless")

driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)
#driver = webdriver.Chrome(executable_path=chromedriver, options=options)
print("Browsing to Wordery")
driver.get('https://wordery.com/search?viewBy=grid&resultsPerPage=20&page=1&leadTime[]=any&interestAge[]=Babies')
#print((driver.page_source).encode('utf-8'))
driver.implicitly_wait(3)

#Get total pages
num_page = driver.find_element_by_xpath('//span[@class="js-pnav-max"]')


#iterate through pages grabbing links
for i in range(int(num_page.text)):
    
    #locate anchor nodes
    lists = driver.find_elements_by_xpath("//a[@class='c-book__title']")
    links = []
    for lis in lists:
        
        # Fetch and store the links
        links.append(lis.get_attribute('href'))
        with open('search_results_urls.txt', 'a') as filehandle:
            filehandle.write('%s\n' % lis.get_attribute('href'))
            print(lis.get_attribute('href'))
    
    page_ = i + 1
    click_next = driver.find_element_by_xpath('//a[@class="o-layout__item o-link--arrow js-pnav-next u-utils-pnav__next"]').click()

driver.quit()

Oddly, it goes through the first page 33 times (the page has only 20 items, so it repeats them) and then raises the following error:

selenium.common.exceptions.StaleElementReferenceException: Message: The element reference of <a class="c-book__title" href="/peppa-pig-practise-with-peppa-wipe-clean-first-letters-peppa-pig-9780723292081"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

I tested the following with the page-navigation loop removed, and it works as expected.

lists = driver.find_elements_by_xpath("//a[@class='c-book__title']")
links = [link.get_attribute('href') for link in lists]

with open('search_results_urls.txt', 'a') as filehandle:
    for link in links:
        filehandle.write(link)
        print(link)

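As soon as I put it inside the page loop, it loops over the same page's URLs.

Here is my latest code, which incorporates input from the answers below. I'm still struggling with it looping over the first page multiple times.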

#Get total pages
num_page = driver.find_element_by_xpath('//span[@class="js-pnav-max"]')

for i in range(int(num_page.text)):
    driver.implicitly_wait(10)
    lists = driver.find_elements_by_xpath("//a[@class='c-book__title']")
    links = [link.get_attribute('href') for link in lists]
    
    with open('search_results_urls.txt', 'a') as filehandle:
        for link in links:
            filehandle.write(link + "\n")
            print(link + "\n")
    
    click_next = driver.find_element_by_xpath('//a[@class="o-layout__item o-link--arrow js-pnav-next u-utils-pnav__next"]').click()

Answer 1

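The reason you are getting the same links is that you assigned the element list outside the loop, so after the page refreshes you are still holding the earlier links.

Put the lookup inside the for loop. Use WebDriverWait() and wait for presence_of_all_elements_located(), so that after you click through to the next page the lookup is synchronized with the page refresh.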

num_page = driver.find_element_by_xpath('//span[@class="js-pnav-max"]')
for i in range(int(num_page.text)):
    # Re-locate the links on every iteration so no stale references are reused.
    lists = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@class='c-book__title']")))
    links = [link.get_attribute('href') for link in lists]
    with open('search_results_urls.txt', 'a') as filehandle:
        for link in links:
            # Write one URL per line.
            filehandle.write(link + "\n")
            print(link)

    driver.find_element_by_xpath('//a[@class="o-layout__item o-link--arrow js-pnav-next u-utils-pnav__next"]').click()
    # Give the page some time to refresh before the next lookup.
    time.sleep(2)
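
The fixed two-second sleep is a guess at how long the next page takes to load. A minimal sketch of a polling alternative follows, assuming that the first book link's href changes once the next page has rendered. wait_for_change and its parameters are hypothetical names, not part of Selenium.

```python
import time


def wait_for_change(read_value, old_value, timeout=20.0, poll=0.5):
    """Poll read_value() until it returns something other than old_value.

    Hypothetical helper (not a Selenium API). Any exception raised by
    read_value() is treated as "page not ready yet", which covers a stale
    element reference raised while the page is being replaced.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = read_value()
        except Exception:
            value = old_value  # not ready yet; keep polling
        if value != old_value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not change within %.1f s" % timeout)
        time.sleep(poll)
```

Inside the page loop this could be called with a function that re-locates the first c-book__title link and returns its href, passing the href seen before the click as old_value.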

The WebDriverWait loop needs the following imports.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time  # for time.sleep
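
Another option avoids clicking altogether. The search URL in the question already carries a page=1 query parameter, so a sketch like the following builds one URL per page. It assumes the site serves page N when page=N is requested, which the question does not confirm. page_url is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url with its `page` query parameter set to `page`.

    Hypothetical helper: assumes the site honours the `page` parameter.
    """
    parts = urlsplit(base_url)
    # Keep every other parameter (including repeated ones) in original order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    # urlencode percent-encodes characters such as "[]" in parameter names.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Each URL could then be loaded with driver.get(page_url(base, n)) for n from 1 up to the page count, with the links re-located after every load, so no element reference survives a navigation.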
