I am using:

Text {
    id: someId
    //% "%1 my translatable suffix!"
    text: qsTrId("my_translatable_id_with_argument")
    Component.onCompleted: {
        text = text.arg("This is")
    }
}
I am scraping a web page that contains a lot of internal links. I have the following Selenium code, which works for about an hour and then fails with an exception:

import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;

import java.util.ArrayList;
import java.util.List;

public class WebScrapper {
    // list to save visited links
    static List<String> linkAlreadyVisited = new ArrayList<String>();
    static String userName = "user";
    static String password = "passw";
    static String mainPage = "https://web/";
    WebDriver driver;
    // public WebDriver driver = new FirefoxDriver();
    String loginPage = "https://web/Login";

    public WebScrapper(WebDriver driver) {
        this.driver = driver;
    }

    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.gecko.driver", "E:\\geckodriver.exe");
        WebDriver driver = new FirefoxDriver();
        WebScrapper webSrcapper = new WebScrapper(driver);
        webSrcapper.openTestSite();
        webSrcapper.login(userName, password);
        driver.navigate().to(mainPage);
        driver.get(mainPage);
        // start recursive linkTest
        new WebScrapper(driver).linkTest();
    }

    public static boolean isElementStale(WebElement e) {
        try {
            e.isDisplayed();
            return false;
        } catch (StaleElementReferenceException ex) {
            return true;
        }
    }

    public void linkTest() {
        // loop over all the a elements in the page
        for (WebElement link : driver.findElements(By.tagName("a"))) {
            // check if the link is displayed and not previously visited
            if (!isElementStale(link)
                    && !linkAlreadyVisited.contains(link.getText())) {
                // add link to the list of links already visited
                linkAlreadyVisited.add(link.getText());
                System.out.println(link.getText());
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                try {
                    link.click();
                } catch (Exception ex) {
                    // String id = link.getAttribute("id");
                    ((JavascriptExecutor) driver).executeScript("$('#id').click();");
                }
                // call linkTest recursively on the new page
                new WebScrapper(driver).linkTest();
            } else {
                continue;
            }
        }
        driver.navigate().back();
    }

    /**
     * Open the test website.
     */
    public void openTestSite() {
        driver.navigate().to(loginPage);
    }

    public void login(String username, String Password) {
        WebElement userName_editbox = driver.findElement(By.id("IDToken1"));
        WebElement password_editbox = driver.findElement(By.id("IDToken2"));
        WebElement submit_button = driver.findElement(By.name("Login.Submit"));
        userName_editbox.sendKeys(username);
        password_editbox.sendKeys(Password);
        submit_button.click();
    }
}
The exception I get is a StaleElementReferenceException. Since there are a lot of links on the page, I can simply ignore the link that caused it; it is not important to me to follow every single link. So I tried to use a continue statement in the else clause to escape this exception, but it does not work. My question is: why? I just want to move on to the next link.
In the meantime, because the code takes so long to run, I do not know exactly what happens in it. I have also noticed that the code sometimes runs differently (the order of the links is not always the same), so I cannot easily debug it or pin the failure to a specific link.
I have tried different solutions I found on this site, for example continue; and a Thread.sleep before and after clicking the link.
Does anyone have a solution to this problem?
Answer 0 (score: 0)
This should work. Note that I wrote it in the browser rather than an actual IDE, so please excuse any typos I may have made.
What I changed: we use only a single instance of the WebScrapper class, which makes navigating away much easier, and instead of crawling depth-first we crawl breadth-first using a Queue<String>. Every time we find a new link, its URL is added to Queue<String> links; afterwards we poll links from the queue and crawl them one by one.
You may need to make some adjustments, because we no longer call link.click(), so you may have to fix up a link before adding it (for example, if the https://stackoverflow.com part is missing).
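One way to do that fix-up, not part of the answer above but sketched here as an illustration, is to resolve each href against the page it was found on using java.net.URI (the URLs below are hypothetical):

```java
import java.net.URI;

public class LinkFixSketch {
    // Resolve an href against the URL of the page it was found on,
    // so relative links become absolute before they are queued.
    static String absolutize(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        System.out.println(absolutize("https://web/section/page", "/Login"));
        // prints https://web/Login
        System.out.println(absolutize("https://web/section/page", "other"));
        // prints https://web/section/other
        System.out.println(absolutize("https://web/section/page", "https://web/x"));
        // absolute hrefs pass through unchanged: https://web/x
    }
}
```

Note that URI.create throws IllegalArgumentException for malformed hrefs (for example, ones containing spaces), so a real crawler would want to catch that and skip the link.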
import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;

import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class WebScrapper {
    public static final long WEBSITE_LOAD_TIME = 1000;
    public static final String loginPage = "https://web/Login";

    private final WebDriver driver;
    private final String userName;
    private final String password;

    // queue of links still to be visited (breadth-first)
    public Queue<String> links = new LinkedList<>();
    // linked, so we can later determine which links were visited in which order
    public Set<String> visitedLinks = new LinkedHashSet<>();

    public WebScrapper(WebDriver driver, String userName, String password) {
        this.driver = driver;
        this.userName = userName;
        this.password = password;
    }

    public static void main(String[] args) {
        System.setProperty("webdriver.gecko.driver", "E:\\geckodriver.exe");
        WebDriver driver = new FirefoxDriver();
        WebScrapper webScrapper = new WebScrapper(driver, "user", "passw");
        webScrapper.login();
        webScrapper.start("https://web/");
    }

    public void start(String page) {
        links.add(page);
        while (!links.isEmpty()) {
            crawlPage(links.poll());
        }
    }

    public void crawlPage(String address) {
        System.out.println("visiting page \"" + address + "\"");
        driver.navigate().to(address);
        try {
            Thread.sleep(WEBSITE_LOAD_TIME);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        visitedLinks.add(address);
        for (WebElement link : driver.findElements(By.tagName("a"))) {
            try {
                String linkAddress = link.getAttribute("href");
                if (linkAddress != null
                        && !visitedLinks.contains(linkAddress)
                        && !links.contains(linkAddress)) {
                    System.out.println("found link \"" + linkAddress + "\"");
                    links.add(linkAddress);
                }
            } catch (StaleElementReferenceException e) {
                System.out.println("link became stale and is therefore ignored.");
            }
        }
    }

    public void login() {
        driver.navigate().to(loginPage);
        WebElement userNameEditbox = driver.findElement(By.id("IDToken1"));
        WebElement passwordEditbox = driver.findElement(By.id("IDToken2"));
        WebElement submitButton = driver.findElement(By.name("Login.Submit"));
        userNameEditbox.sendKeys(userName);
        passwordEditbox.sendKeys(password);
        submitButton.click();
    }
}
If I made any typos or logic errors, feel free to edit my answer. I currently have no way to test my solution.
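The queue-plus-visited-set traversal this answer relies on can be exercised without a browser. Below is a minimal sketch over an in-memory link graph; the class name and URLs are made up for illustration:

```java
import java.util.*;

public class BfsSketch {
    // Breadth-first crawl over an in-memory "site": each page maps to the
    // links it contains. Returns the pages in the order they were visited.
    static List<String> crawl(Map<String, List<String>> site, String start) {
        Queue<String> links = new LinkedList<>();
        Set<String> visited = new LinkedHashSet<>(); // linked: remembers visit order
        links.add(start);
        while (!links.isEmpty()) {
            String page = links.poll();
            if (!visited.add(page)) {
                continue; // already crawled this page
            }
            for (String href : site.getOrDefault(page, List.of())) {
                if (!visited.contains(href)) {
                    links.add(href);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
                "https://web/", List.of("https://web/a", "https://web/b"),
                "https://web/a", List.of("https://web/b", "https://web/c"),
                "https://web/b", List.of("https://web/"),
                "https://web/c", List.of());
        System.out.println(crawl(site, "https://web/"));
        // prints [https://web/, https://web/a, https://web/b, https://web/c]
    }
}
```

Because the loop only ever holds String URLs, nothing here can go stale; that is the core reason the queue-based rewrite sidesteps StaleElementReferenceException during navigation.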
Answer 1 (score: 0)
I can offer you a better way to get a fresh WebElement most of the time: I created a wrapper function around driver.findElement().
public WebElement findFreshElement(By locator) { // to handle StaleElementReferenceException
    WebElement webElement = null;
    int attempts = 0;
    while (attempts < 10) {
        try {
            wait.hardWait(2);
            webElement = driver.findElement(locator);
            webElement.isDisplayed();
            break;
        } catch (StaleElementReferenceException e) {
            logMessage("⚠ Stale Element Reference Exception ... Refinding element after 2 seconds.. ");
            attempts += 1;
        } catch (NoSuchElementException e) {
            logMessage("❌ [ELEMENT NOT FOUND] : You might have to update the locator:-" + locator);
            attempts += 1;
        }
    }
    return webElement;
}
wait.hardWait(2) is a wrapper method around Thread.sleep. Because it uses a hard wait it is not the best solution, but it is better than merely checking the element's state, since this method actually returns a fresh element reference.
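Stripped of Selenium, the retry loop inside findFreshElement boils down to the pattern below. This is only a sketch: the class name, the supplier, and the failure counts are hypothetical, and a plain RuntimeException stands in for StaleElementReferenceException:

```java
import java.util.function.Supplier;

public class RetrySketch {
    // Retry a lookup up to maxAttempts times, swallowing transient failures;
    // if every attempt fails, rethrow the last exception seen.
    static <T> T retry(Supplier<T> lookup, int maxAttempts) {
        RuntimeException last = null;
        for (int attempts = 0; attempts < maxAttempts; attempts++) {
            try {
                return lookup.get();
            } catch (RuntimeException e) { // stands in for StaleElementReferenceException
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // A lookup that fails twice and then succeeds, mimicking an element
        // that is stale for the first attempts and fresh afterwards.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("stale");
            }
            return "fresh element";
        }, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints fresh element after 3 attempts
    }
}
```

The key property, as in the answer above, is that each attempt performs the lookup again from scratch rather than reusing the old reference.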