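I developed a page crawler in Java using the Selenium library. The crawler goes through a website that launches 3 applications via JavaScript, each of which is displayed as HTML in a popup window.
Launching 2 of the applications works without problems, but on the 3rd one the crawler freezes forever.
The code I am using is similar to this: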
public void applicationSelect() {
    ...
    // obtain the URL by parsing the tag's href attribute
    ...
    this.driver = new HtmlUnitDriver(BrowserVersion.INTERNET_EXPLORER_8);
    this.driver.setJavascriptEnabled(true);
    this.driver.get(url); // execution never gets past this point for the 3rd app
    ...
}
I also tried clicking the web element with the following code:
public void applicationSelect() {
    ...
    WebElement element = this.driver.findElementByLinkText("linkText");
    element.click(); // execution never gets past this point for the 3rd app
    ...
}
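Clicking it produces exactly the same result. For the code above, I made sure that I am getting the right element.
Can anyone tell me what problem I might be running into?
As for the applications, I cannot disclose any information about the HTML code. I know this makes the problem harder to solve, and I apologize in advance.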
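While the cause is unknown, one way to keep a blocking call such as driver.get(url) from hanging the whole crawler is to run it on a worker thread and wait for it with a timeout. The following is only a minimal sketch using plain java.util.concurrent; the class name TimedCall and the method runWithTimeout are hypothetical, and the Runnable stands in for the real page load.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCall {

    /**
     * Runs the task on a daemon worker thread.
     * Returns true if it finished within the timeout, false otherwise.
     */
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "page-load");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that swallows
            // the interrupt keeps running in the background
            future.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for this.driver.get(url) on the 3rd application
        Runnable stuck = () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("finished in time: " + runWithTimeout(stuck, 200));
    }
}
```

This does not fix the freeze; it only lets the caller notice it and move on. A worker that ignores interruption continues to run after the timeout, which is why the thread is marked as a daemon.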
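=== UPDATE 2013-04-10 ===
So, I attached the sources to my crawler and looked at where inside this.driver.get(url) execution gets stuck.
Basically, the driver gets lost in an infinite refresh loop. In the WebClient object instantiated by HtmlUnitDriver, an HtmlPage is loaded that keeps refreshing, seemingly without end.
Here is the code from WaitingRefreshHandler, which is part of the com.gargoylesoftware.htmlunit package: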
public void handleRefresh(final Page page, final URL url, final int requestedWait) throws IOException {
    int seconds = requestedWait;
    if (seconds > maxwait_ && maxwait_ > 0) {
        seconds = maxwait_;
    }
    try {
        Thread.sleep(seconds * 1000);
    }
    catch (final InterruptedException e) {
        /* This can happen when the refresh is happening from a navigation that started
         * from a setTimeout or setInterval. The navigation will cause all threads to get
         * interrupted, including the current thread in this case. It should be safe to
         * ignore it since this is the thread now doing the navigation. Eventually we should
         * refactor to force all navigation to happen back on the main thread.
         */
        if (LOG.isDebugEnabled()) {
            LOG.debug("Waiting thread was interrupted. Ignoring interruption to continue navigation.");
        }
    }
    final WebWindow window = page.getEnclosingWindow();
    if (window == null) {
        return;
    }
    final WebClient client = window.getWebClient();
    client.getPage(window, new WebRequest(url));
}
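The statement "client.getPage(window, new WebRequest(url))" calls the WebClient again to reload the page, which in turn calls this very same refresh method once more. This seems to go on indefinitely, and the only reason it does not fill up memory quickly is "Thread.sleep(seconds * 1000)", which forces a 3m wait before each new attempt.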
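The self-reinvoking pattern described above can be modelled without HtmlUnit. The sketch below uses hypothetical stand-ins (Handler for the refresh handler interface, Client for WebClient) and assumes a page that asks to be refreshed every time it is loaded. A load cap, which the real loop does not have, is added only so that the demo terminates.

```java
public class RefreshLoopSketch {

    /** Hypothetical stand-in for HtmlUnit's refresh handler interface. */
    interface Handler {
        void handleRefresh(String url);
    }

    /** Hypothetical stand-in for WebClient: every loaded page requests a refresh. */
    static class Client {
        Handler handler;
        int loads;
        final int cap; // demo-only safety cap; the real loop has no such limit

        Client(int cap) {
            this.cap = cap;
        }

        void getPage(String url) {
            loads++;
            if (loads >= cap) {
                return; // stop the demo instead of recursing forever
            }
            handler.handleRefresh(url); // the page always asks to be refreshed
        }
    }

    /** Loads one page and reports how many loads that single call caused. */
    static int countLoads(boolean reloadOnRefresh) {
        Client client = new Client(5);
        Handler reloading = url -> client.getPage(url); // mirrors a handler that reloads the page
        Handler noOp = url -> { };                      // ignores the refresh request
        client.handler = reloadOnRefresh ? reloading : noOp;
        client.getPage("http://example.invalid/app3");
        return client.loads;
    }

    public static void main(String[] args) {
        System.out.println("reloading handler: " + countLoads(true) + " loads");
        System.out.println("no-op handler: " + countLoads(false) + " loads");
    }
}
```

With the reloading handler, a single getPage call keeps triggering further loads until the cap of 5 is reached; with the no-op handler the page is loaded exactly once. In this model the loop is driven by the handler's reload call rather than by the page load itself.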
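Does anyone have a suggestion for working around this? One idea I have is to create 2 new classes that extend the original HtmlUnitDriver and WebClient, and then override the relevant methods to avoid the problem.
Thanks again.
Answer 0 (score: 3)
I solved my eternal refresh problem by creating a RefreshHandler class that does nothing: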
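import java.net.URL;
import com.gargoylesoftware.htmlunit.Page;

public class RefreshHandler implements com.gargoylesoftware.htmlunit.RefreshHandler {
    public RefreshHandler() { }
    // Intentionally empty: ignoring the refresh request breaks the reload loop.
    public void handleRefresh(final Page page, final URL url, final int seconds) { }
}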
In addition, I extended the HtmlUnitDriver class and set the new RefreshHandler by overriding the modifyWebClient method:
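import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HtmlUnitDriverExt extends HtmlUnitDriver {
    public HtmlUnitDriverExt(BrowserVersion version) {
        super(version);
    }

    @Override
    protected WebClient modifyWebClient(WebClient client) {
        // Replace the default handler with the no-op RefreshHandler defined above.
        client.setRefreshHandler(new RefreshHandler());
        return client;
    }
}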
The modifyWebClient method is a no-op method that exists in HtmlUnitDriver for exactly this purpose.
Cheers.