StaleElementReferenceException when crawling a website with Selenium

Time: 2017-11-08 15:05:11

Tags: java selenium selenium-webdriver web-crawler

I am using Selenium to crawl a web page that contains a large number of internal links. I have the following code:

import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;

import java.util.ArrayList;
import java.util.List;

public class WebScrapper {
    //list to save visited links
    static List<String> linkAlreadyVisited = new ArrayList<String>();
    static String userName = "user";
    static String password = "passw";
    static String mainPage = "https://web/";
    WebDriver driver;
    // public WebDriver driver = new FirefoxDriver();
    String loginPage = "https://web/Login";

    public WebScrapper(WebDriver driver) {
        this.driver = driver;
    }

    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.gecko.driver", "E:\\geckodriver.exe");
        WebDriver driver = new FirefoxDriver();
        WebScrapper webSrcapper = new WebScrapper(driver);
        webSrcapper.openTestSite();
        webSrcapper.login(userName, password);
        driver.navigate().to(mainPage);
        driver.get(mainPage);
        // start recursive linkText
        new WebScrapper(driver).linkTest();
    }

    public static boolean isElementStale(WebElement e) {
        try {
            e.isDisplayed();
            return false;
        } catch (StaleElementReferenceException ex) {
            return true;
        }
    }

    public void linkTest() {
        // loop over all the a elements in the page
        for (WebElement link : driver.findElements(By.tagName("a"))) {
            // Check if link is displayed and not previously visited
            if (!isElementStale(link) && !linkAlreadyVisited.contains(link.getText())) {
                // add link to list of links already visited
                linkAlreadyVisited.add(link.getText());
                System.out.println(link.getText());
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                try {
                    link.click();
                } catch (Exception ex) {
                    // String id = link.getAttribute("id");
                    ((JavascriptExecutor) driver).executeScript("$('#id').click();");
                }
                // call recursiveLinkTest on the new page
                new WebScrapper(driver).linkTest();
            } else {
                continue;
            }
        }
        driver.navigate().back();
    }

    /**
     * Open the test website.
     */
    public void openTestSite() {
        driver.navigate().to(loginPage);
    }

    public void login(String username, String Password) {
        WebElement userName_editbox = driver.findElement(By.id("IDToken1"));
        WebElement password_editbox = driver.findElement(By.id("IDToken2"));
        WebElement submit_button = driver.findElement(By.name("Login.Submit"));
        userName_editbox.sendKeys(username);
        password_editbox.sendKeys(Password);
        submit_button.click();
    }
}

This code works for about an hour, and then I get a StaleElementReferenceException. Since the page has a lot of links, I can simply ignore the offending link; following every single link is not important to me.

So I tried to escape this exception with a continue; statement in the else clause, but it does not work. My question is: why? I just want to move on to the next link.

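Also, because the code takes a long time to run, I cannot tell exactly what is happening. I have noticed that the code sometimes behaves differently between runs: the order of the links is not always the same, so I cannot easily debug it or check one specific link.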

I have tried different solutions that I found on this site, for example calling Thread.sleep before and after clicking the link.

Can anyone help me solve this problem?

2 Answers:

Answer 0: (score: 0)

This should work. Note that I wrote it in the browser rather than in an actual IDE, so please forgive any typos.

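What I changed: we use only a single instance of the WebScrapper class, which makes things easier. We no longer search depth first; instead we search breadth first using a Queue<String>. Every time we find a new link, its URL is added to the Queue<String> links. Afterwards we work through the queue link by link and crawl each one.

You may need to make some adjustments, because we no longer call link.click(). For example, you may have to fix up a link before adding it (say, if the https://stackoverflow.com prefix is missing).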

import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;

import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class WebScrapper {
    public static final long WEBSITE_LOAD_TIME = 1000;
    public static final String loginPage = "https://web/Login";

    private final WebDriver driver;
    private final String userName;
    private final String password;
    // URLs that were found but have not been crawled yet
    public Queue<String> links = new LinkedList<>();
    // linked so we can later determine which links were visited in which order
    public Set<String> visitedLinks = new LinkedHashSet<>();

    public WebScrapper(WebDriver driver, String userName, String password) {
        this.driver = driver;
        this.userName = userName;
        this.password = password;
    }

    public static void main(String[] args) {
        System.setProperty("webdriver.gecko.driver", "E:\\geckodriver.exe");

        WebDriver driver = new FirefoxDriver();

        WebScrapper webScrapper = new WebScrapper(driver, "user", "passw");
        webScrapper.login();
        webScrapper.start("https://web/");
    }

    public void start(String page) {
        links.add(page);
        while (!links.isEmpty()) {
            crawlPage(links.poll());
        }
    }

    public void crawlPage(String address) {
        // the same URL can be queued several times; add() returns false if it was already crawled
        if (!visitedLinks.add(address)) {
            return;
        }
        System.out.println("visiting page \"" + address + "\"");
        driver.navigate().to(address);
        try {
            Thread.sleep(WEBSITE_LOAD_TIME);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        for (WebElement link : driver.findElements(By.tagName("a"))) {
            try {
                String linkAddress = link.getAttribute("href");
                // anchors without an href attribute return null
                if (linkAddress != null && !visitedLinks.contains(linkAddress)) {
                    System.out.println("found link \"" + linkAddress + "\"");
                    links.add(linkAddress);
                }
            } catch (StaleElementReferenceException e) {
                System.out.println("link became stale and is therefore ignored.");
            }
        }
    }

    public void login() {
        driver.navigate().to(loginPage);
        WebElement userName_editbox = driver.findElement(By.id("IDToken1"));
        WebElement password_editbox = driver.findElement(By.id("IDToken2"));
        WebElement submit_button = driver.findElement(By.name("Login.Submit"));

        userName_editbox.sendKeys(userName);
        password_editbox.sendKeys(password);
        submit_button.click();
    }
}

If I made any typos or logic errors, feel free to edit my answer. I cannot test my solution at the moment.
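The link fix-up mentioned in this answer (resolving a link against the page it was found on before queueing it) could look roughly like the sketch below. It uses only java.net.URI, so it runs without a browser. The class name LinkNormalizer and the method normalize are hypothetical and not part of the answer's code. Dropping the fragment and skipping non-http(s) links are extra assumptions about what a crawler usually wants.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical helper, not part of the answer's code.
public class LinkNormalizer {

    /**
     * Resolves a raw href against the URL of the page it was found on and
     * drops the fragment, so "/a#top" and "/a" count as the same page.
     * Returns empty for null, blank, malformed, or non-http(s) links.
     */
    public static Optional<String> normalize(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            URI resolved = new URI(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            if (!isHttp) {
                return Optional.empty(); // e.g. mailto: or javascript: links
            }
            // Strip the fragment on the string form to leave any escaping untouched.
            String text = resolved.toString();
            int hash = text.indexOf('#');
            return Optional.of(hash >= 0 ? text.substring(0, hash) : text);
        } catch (URISyntaxException | IllegalArgumentException e) {
            return Optional.empty(); // malformed page URL or href
        }
    }

    public static void main(String[] args) {
        String page = "https://web/section/index";
        System.out.println(normalize(page, "/Login"));
        System.out.println(normalize(page, "page2#top"));
        System.out.println(normalize(page, "mailto:someone@example.com"));
    }
}
```

In crawlPage, the value of link.getAttribute("href") could be passed through normalize(address, ...) and queued only when a result is present. WebDriver often returns an already resolved URL for href, so in practice the sketch mainly helps with fragments and with links that should not be followed at all.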

Answer 1: (score: 0)

I can offer a better solution that finds a fresh WebElement most of the time. I created a wrapper function around driver.findElement().

// To handle StaleElementReferenceException.
// driver, wait and logMessage belong to the enclosing class (not shown).
public WebElement findFreshElement(By locator) {
    WebElement webElement = null;
    int attempts = 0;
    while (attempts < 10) {
        try {
            wait.hardWait(2);
            webElement = driver.findElement(locator);
            webElement.isDisplayed();
            break;
        } catch (StaleElementReferenceException e) {
            logMessage("⚠ Stale Element Reference Exception ... Refinding element after 2 seconds.. ");
            attempts += 1;
        } catch (NoSuchElementException e) {
            logMessage("❌ [ELEMENT NOT FOUND] : You might have to update the locator:-" + locator);
            attempts += 1;
        }
    }
    return webElement;
}

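wait.hardWait(2) is a wrapper method around Thread.sleep. Because it relies on a hard wait, this is not the best solution, but it is better than checking the element's state. The method actually returns a fresh element reference.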
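The retry idea behind the wrapper above can be isolated from Selenium, which makes it easy to try out without a browser. The following is a minimal sketch under some assumptions: the class name Retry and the method withRetry are hypothetical, and IllegalStateException stands in for StaleElementReferenceException (both are unchecked exceptions). Unlike the wrapper, this sketch pauses only after a failed attempt and rethrows the last failure once the attempts are used up, instead of returning a possibly null or stale reference.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the answer's code.
public class Retry {

    /**
     * Calls action until it returns without throwing an exception of type
     * retryOn, or until maxAttempts calls have been made. The pause happens
     * only between failed attempts, never before the first one.
     */
    public static <T> T withRetry(Supplier<T> action,
                                  Class<? extends RuntimeException> retryOn,
                                  int maxAttempts,
                                  long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (!retryOn.isInstance(e)) {
                    throw e; // not the kind of failure we retry
                }
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // keep the interrupt flag
                        throw last;
                    }
                }
            }
        }
        throw last; // every attempt failed with the retried exception type
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice with the stand-in exception, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("stale");
            }
            return "fresh";
        }, IllegalStateException.class, 10, 10);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "fresh after 3 calls"
    }
}
```

With Selenium on the classpath, the supplier would be something like () -> driver.findElement(locator) and the retried type StaleElementReferenceException.class. Looking the element up again inside the supplier is what produces the fresh reference.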