Selenium WebDriver for the Java API: findElement yields different results

Time: 2019-06-22 13:21:27

Tags: java selenium selenium-webdriver selenium-chromedriver

I am using Selenium WebDriver for Java to crawl this page:

https://www.immowelt.at/liste/wien/wohnungen/mieten?sort=relevanz

In my code, the method

WebElement.findElement(...)

yields different results, as follows:

1.) My source code:

package at.home.digest.services;

import java.util.ArrayList;
import java.util.List;


import org.apache.commons.lang3.StringUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import at.home.digest.model.HomeToDeal;

public class ImmoweltBot {

    public static final String URL = "https://www.immowelt.at/";
    public static final String queryURL = URL + "/liste/wien/wohnungen/mieten?sort=relevanz";


    public static void main (String [] args) throws Exception {

        System.setProperty("webdriver.chrome.driver", "C:\\Temp\\chromedriver.exe");

        String URLPage = StringUtils.EMPTY;
        int page = 1;
        int totalNumberOfEntities = 6000;
        int numberOfEntitiesFound = 0;

        List<WebElement> elemnts = new ArrayList<>();

        WebDriver webDriver = new ChromeDriver();

        outer:
        while (numberOfEntitiesFound < totalNumberOfEntities){

        webDriver.get(queryURL + URLPage);


        WebDriverWait wait = new WebDriverWait(webDriver, 5);
        By searchResults = By.xpath("//*[contains(@class, 'clear relative js-listitem')]");

        JavascriptExecutor js = (JavascriptExecutor)webDriver;
        webDriver.manage().window().maximize();
        js.executeScript("window.scrollBy(0,1000)");

        final int totalNumberOfKeyDowns = 190;
        int keyDownTries = 0;
        while ((++keyDownTries < totalNumberOfKeyDowns)) {
            elemnts = wait.until(ExpectedConditions.presenceOfAllElementsLocatedBy(searchResults));
            webDriver.findElement(By.tagName("body")).sendKeys(Keys.DOWN);

        }

        WebElement elem = webDriver.findElement(By.xpath("//*[contains(@class, 'ellipsis margin_none')]"));
        totalNumberOfEntities = Utils.parseNumber(elem.getText()).intValue();

        for (int i = 0; i < elemnts.size(); i++) {
            WebElement divListItemClear = elemnts.get(i);
            HomeToDeal homeToRent = new HomeToDeal();
            String exposeURL = divListItemClear.findElement(By.tagName("a")).getAttribute("href");
            homeToRent.setURL(exposeURL);

            WebElement listContentClear = divListItemClear.findElement(By.xpath("//*[contains(@class, 'listcontent clear')]"));
            WebElement h2Elem = listContentClear.findElement(By.tagName("h2"));
            String text = h2Elem.getText();
            homeToRent.setDescription(text);

            System.out.println(homeToRent);
        }

        URLPage = "&cp="+ (++page);
        numberOfEntitiesFound+=elemnts.size();
     }
    }

}

My problem is that the line

String exposeURL = divListItemClear.findElement(By.tagName("a")).getAttribute("href");

works as expected and gives me the URL of the next element (on each new iteration of the loop), but the lines

WebElement listContentClear = divListItemClear.findElement(By.xpath("//*[contains(@class, 'listcontent clear')]"));
        WebElement h2Elem = listContentClear.findElement(By.tagName("h2"));
        String text = h2Elem.getText();

give me the same value for the h2 HTML element every time, and it is always the value of the first element found.

Any idea what I am doing wrong?

Thanks!

1 answer:

Answer 0 (score: 1)

You have fallen victim to a classic mistake that many people make when using XPath with Selenium. The WebDriver implementations follow the XPath specification when locating elements, which means that a locator beginning with // always starts at the top of the document. This is true even when you call findElement on a WebElement instance. In the code you cite as being in error, what you want is the following:

WebElement listContentClear = divListItemClear.findElement(By.xpath(".//*[contains(@class, 'listcontent clear')]"));
WebElement h2Elem = listContentClear.findElement(By.tagName("h2"));
String text = h2Elem.getText();

Note the . at the beginning of the locator, which makes the current node the context node. Since you are mostly finding elements based on values in their CSS class, using CSS selectors instead of XPath would avoid this problem in this case.
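The effect of the leading dot can be reproduced without a browser. The following is a minimal sketch that uses the XPath engine bundled with the JDK (javax.xml.xpath) rather than Selenium, on the assumption that both follow the same XPath rule for expressions starting with //. The class name RelativeXPathDemo, the headings helper, and the two-item PAGE document are made up for illustration and are not the real immowelt markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {

    // Hypothetical, heavily simplified stand-in for the result list.
    static final String PAGE =
        "<body>"
        + "<div class='listitem'><div class='listcontent clear'><h2>First flat</h2></div></div>"
        + "<div class='listitem'><div class='listcontent clear'><h2>Second flat</h2></div></div>"
        + "</body>";

    /** Evaluates the locator once per list item, with that item as the context node. */
    static String[] headings(String locator) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                    "//div[@class='listitem']", doc, XPathConstants.NODESET);
            String[] result = new String[items.getLength()];
            for (int i = 0; i < items.getLength(); i++) {
                // NODE returns the first match in document order, like findElement.
                Node match = (Node) xpath.evaluate(locator, items.item(i), XPathConstants.NODE);
                result[i] = match.getTextContent();
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Absolute: searches from the document root for every item.
        System.out.println(String.join(", ",
                headings("//*[contains(@class, 'listcontent clear')]")));
        // prints "First flat, First flat"

        // Relative: searches only below the current item.
        System.out.println(String.join(", ",
                headings(".//*[contains(@class, 'listcontent clear')]")));
        // prints "First flat, Second flat"
    }
}
```

The first call shows the symptom from the question (the same first heading every time), and the second shows the behaviour after adding the dot.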

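As an aside, I think these locators are somewhat brittle, because the class attribute does not guarantee the order of the class values. In other words, as far as the browser is concerned, <div class="listcontent clear"> is semantically equivalent to <div class="clear listcontent">. If the page rendered the element as the latter rather than the former, a CSS selector such as .listcontent.clear would find both renderings, whereas the XPath you are using would not.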
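The class-order point can be checked in the same browser-free way. The JDK has no CSS selector engine, so this sketch swaps in a token-based XPath (the concat/normalize-space idiom) as a stand-in for what a CSS class selector does; in Selenium itself one would presumably write By.cssSelector(".listcontent.clear") instead. ClassOrderDemo, countMatches, and the PAGE document are hypothetical names and data used only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassOrderDemo {

    // Two elements carrying the same classes in a different order (made-up markup).
    static final String PAGE =
        "<body>"
        + "<div class='listcontent clear'>A</div>"
        + "<div class='clear listcontent'>B</div>"
        + "</body>";

    // Substring match, as in the question: depends on the order of the class values.
    static final String SUBSTRING_XPATH = "//*[contains(@class, 'listcontent clear')]";

    // Token match: each class is tested on its own, so the order does not matter.
    static final String TOKEN_XPATH =
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' listcontent ')"
        + " and contains(concat(' ', normalize-space(@class), ' '), ' clear ')]";

    /** Counts the elements in PAGE that the given XPath expression selects. */
    static int countMatches(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(SUBSTRING_XPATH)); // prints 1
        System.out.println(countMatches(TOKEN_XPATH));     // prints 2
    }
}
```

The substring locator misses the element whose classes are reversed, while the token-based form finds both, which is the behaviour the answer attributes to a CSS selector.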