I am using Selenium WebDriver for Java to scrape this page:
https://www.immowelt.at/liste/wien/wohnungen/mieten?sort=relevanz
In my code, the method
WebElement.findElement(...)
produces different results, as follows:
1.) My source code:
package at.home.digest.services;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang3.StringUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import at.home.digest.model.HomeToDeal;

public class ImmoweltBot {

    public static final String URL = "https://www.immowelt.at/";
    public static final String queryURL = URL + "/liste/wien/wohnungen/mieten?sort=relevanz";

    public static void main (String [] args) throws Exception {
        System.setProperty("webdriver.chrome.driver", "C:\\Temp\\chromedriver.exe");
        String URLPage = StringUtils.EMPTY;
        int page = 1;
        int totalNumberOfEntities = 6000;
        int numberOfEntitiesFound = 0;
        List<WebElement> elemnts = new ArrayList<>();
        WebDriver webDriver = new ChromeDriver();
        outer:
        while (numberOfEntitiesFound < totalNumberOfEntities){
            webDriver.get(queryURL + URLPage);
            WebDriverWait wait = new WebDriverWait(webDriver, 5);
            By searchResults = By.xpath("//*[contains(@class, 'clear relative js-listitem')]");
            JavascriptExecutor js = (JavascriptExecutor)webDriver;
            webDriver.manage().window().maximize();
            js.executeScript("window.scrollBy(0,1000)");
            final int totalNumberOfKeyDowns = 190;
            int keyDownTries = 0;
            while ((++keyDownTries < totalNumberOfKeyDowns)) {
                elemnts = wait.until(ExpectedConditions.presenceOfAllElementsLocatedBy(searchResults));
                webDriver.findElement(By.tagName("body")).sendKeys(Keys.DOWN);
            }
            WebElement elem = webDriver.findElement(By.xpath("//*[contains(@class, 'ellipsis margin_none')]"));
            totalNumberOfEntities = Utils.parseNumber(elem.getText()).intValue();
            for (int i = 0; i < elemnts.size(); i++) {
                WebElement divListItemClear = elemnts.get(i);
                HomeToDeal homeToRent = new HomeToDeal();
                String exposeURL = divListItemClear.findElement(By.tagName("a")).getAttribute("href");
                homeToRent.setURL(exposeURL);
                WebElement listContentClear = divListItemClear.findElement(By.xpath("//*[contains(@class, 'listcontent clear')]"));
                WebElement h2Elem = listContentClear.findElement(By.tagName("h2"));
                String text = h2Elem.getText();
                homeToRent.setDescription(text);
                System.out.println(homeToRent);
            }
            URLPage = "&cp="+ (++page);
            numberOfEntitiesFound+=elemnts.size();
        }
    }
}
My problem is that the line
String exposeURL = divListItemClear.findElement(By.tagName("a")).getAttribute("href");
works as expected and gives me the URL of each successive element (a new one on every iteration of the loop), but the lines
WebElement listContentClear = divListItemClear.findElement(By.xpath("//*[contains(@class, 'listcontent clear')]"));
WebElement h2Elem = listContentClear.findElement(By.tagName("h2"));
String text = h2Elem.getText();
give me the same value of the h2 element every time, and it is always the value of the first element found.
Any idea what I am doing wrong?
Thanks!
Answer 0 (score: 1)
You have fallen victim to a classic mistake that many people make when using XPath with Selenium. WebDriver implementations follow the XPath specification when locating elements, which means that a locator beginning with "//" always refers to the top of the document. This is true even when you call findElement on a WebElement instance. In the code you cite as being in error, what you want is the following:
WebElement listContentClear = divListItemClear.findElement(By.xpath(".//*[contains(@class, 'listcontent clear')]"));
WebElement h2Elem = listContentClear.findElement(By.tagName("h2"));
String text = h2Elem.getText();
Note the "." at the beginning of the locator, which makes the current node the context node. Since you are primarily looking for elements based on values in their CSS class, this is a case where using a CSS selector instead of XPath would avoid the problem.
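The difference between an absolute and a context-relative XPath can be reproduced without a browser. The sketch below is only an illustration: it uses the JDK's built-in javax.xml.xpath engine rather than Selenium, and the markup, the class name XPathContextDemo and the helper names are made up for the example. It evaluates both forms of the expression with the second list item as the context node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathContextDemo {

    // Hypothetical markup, loosely modelled on the result list in the question.
    public static final String MARKUP =
        "<body>"
        + "<div class='js-listitem'><div class='listcontent clear'><h2>First flat</h2></div></div>"
        + "<div class='js-listitem'><div class='listcontent clear'><h2>Second flat</h2></div></div>"
        + "</body>";

    private static final XPath XPATH = XPathFactory.newInstance().newXPath();

    // Parses the markup into a DOM document; checked exceptions are wrapped for brevity.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates the expression with the given context node and returns
    // the text of the first matching node, or null if nothing matches.
    public static String firstText(String expression, Node context) {
        try {
            Node match = (Node) XPATH.evaluate(expression, context, XPathConstants.NODE);
            return match == null ? null : match.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(MARKUP);
        // The second list item plays the role of divListItemClear.
        Node secondItem = doc.getDocumentElement().getLastChild();

        // "//" starts at the document root, so the context node is ignored.
        System.out.println(firstText("//h2", secondItem));   // First flat
        // ".//" starts at the context node, so only its descendants are searched.
        System.out.println(firstText(".//h2", secondItem));  // Second flat
    }
}
```

The same reasoning applies to divListItemClear.findElement(By.xpath(...)) in Selenium: the element you call findElement on is the context node, and only a locator starting with "." is resolved relative to it.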
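As an aside, I think these locators are somewhat brittle, because the class attribute makes no guarantee about the order of its class values. In other words, as far as the browser is concerned, <div class="listcontent clear"> is semantically equivalent to <div class="clear listcontent">. If the browser renders the element as the latter instead of the former, a CSS selector that names both classes, such as .listcontent.clear, will find both renderings, whereas the XPath you are using will not.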
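To make the ordering issue concrete, here is a small sketch that again uses the JDK's XPath engine instead of a browser. Everything in it (the markup, the class ClassOrderDemo, the constants) is invented for illustration. Because the JDK has no CSS selector engine, the order-independent match is written as a token-based XPath, a commonly used equivalent of matching on both class names; it is not the CSS selector itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClassOrderDemo {

    // Hypothetical markup: the same two classes, written in both orders.
    public static final String MARKUP =
        "<body>"
        + "<div class='listcontent clear'>A</div>"
        + "<div class='clear listcontent'>B</div>"
        + "</body>";

    // Substring test as in the question: depends on the order of the class values.
    public static final String SUBSTRING =
        "//*[contains(@class, 'listcontent clear')]";

    // Token test: each class name is checked separately, so order does not matter.
    public static final String TOKENS =
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' listcontent ')"
        + " and contains(concat(' ', normalize-space(@class), ' '), ' clear ')]";

    // Returns how many nodes in MARKUP match the expression.
    public static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(MARKUP)));
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return matches.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SUBSTRING)); // 1: only the first ordering matches
        System.out.println(count(TOKENS));    // 2: both orderings match
    }
}
```

In Selenium itself, a locator such as By.cssSelector(".listcontent.clear") should behave like the token test, and since a CSS selector passed to WebElement.findElement is matched against that element's descendants, it also sidesteps the "//" problem.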