Question

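I have developed a page scraper in Java using the Selenium library. The scraper walks through a website that launches 3 applications via JavaScript; each application is shown as HTML in a popup window.

Launching the first 2 applications works without any problem, but on the 3rd application the scraper freezes forever.

The code I am using looks something like this: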

public void applicationSelect() {
  ...
  // obtain the url by parsing the tag's href attribute
  ...

  this.driver = new HtmlUnitDriver(BrowserVersion.INTERNET_EXPLORER_8);
  this.driver.setJavascriptEnabled(true);
  this.driver.get(url); //the code does not execute after this point for the 3rd app
  ...
}

I have also tried clicking the web element with the following code:

public void applicationSelect() {
  ...
  WebElement element = this.driver.findElementByLinkText("linkText");
  element.click(); //the code does not execute after this point for the 3rd app
  ...
}

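Clicking it produces exactly the same result. For the code above, I made sure that I am getting the correct element.

Can anyone tell me what problem I am running into?

As for the application, I cannot disclose any information about its HTML code. I know this makes the problem harder to solve, and I apologize in advance.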

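=== Update 2013-04-10 ===

So, I attached the sources to my scraper and saw where inside this.driver.get(url) execution gets stuck.

Basically, the driver gets lost in an infinite refresh loop. Inside the WebClient object instantiated by HtmlUnitDriver, an HtmlPage is loaded that keeps refreshing itself, seemingly without end.

Here is the code from WaitingRefreshHandler, which lives in the com.gargoylesoftware.htmlunit package: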

public void handleRefresh(final Page page, final URL url, final int requestedWait) throws IOException {
  int seconds = requestedWait;
  if (seconds > maxwait_ && maxwait_ > 0) {
    seconds = maxwait_;
  }
  try {
    Thread.sleep(seconds * 1000);
  }
  catch (final InterruptedException e) {
    /* This can happen when the refresh is happening from a navigation that started
     * from a setTimeout or setInterval. The navigation will cause all threads to get
     * interrupted, including the current thread in this case. It should be safe to
     * ignore it since this is the thread now doing the navigation. Eventually we should
     * refactor to force all navigation to happen back on the main thread.
     */
    if (LOG.isDebugEnabled()) {
      LOG.debug("Waiting thread was interrupted. Ignoring interruption to continue navigation.");
    }
  }
  final WebWindow window = page.getEnclosingWindow();
  if (window == null) {
    return;
  }
  final WebClient client = window.getWebClient();
  client.getPage(window, new WebRequest(url));
}

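The statement "client.getPage(window, new WebRequest(url))" calls the WebClient again to reload the page, which only ends up calling this very same refresh method once more. This appears to go on forever. The only reason it does not fill up memory quickly is "Thread.sleep(seconds * 1000)", which forces a 3m wait before each new attempt.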
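While the root cause is being investigated, one way to keep the crawler itself from hanging is to run the blocking call on a separate thread with a deadline. The following is a minimal sketch under assumptions, not part of Selenium or HtmlUnit: `BoundedCall`, its `run` method, and the thread name are hypothetical, and only `java.util.concurrent` is used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a blocking call on a daemon thread and gives up after a deadline. */
final class BoundedCall {
    private BoundedCall() { }

    static <T> T run(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "bounded-call");
            worker.setDaemon(true); // a stuck worker must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            // Blocks the caller for at most the given timeout.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: this only interrupts the worker
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Assuming the driver from the code above, a call might look like `BoundedCall.run(() -> { driver.get(url); return null; }, 60, TimeUnit.SECONDS)`, where the 60-second limit is an arbitrary choice. Note that cancellation only interrupts the worker, and the `handleRefresh` code shown above deliberately ignores `InterruptedException`, so a timed-out worker may keep refreshing in the background. It is probably safest to discard that driver instance afterwards.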

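Does anyone have a suggestion on how to work around this? One idea I have is to create 2 new classes that extend the original HtmlUnitDriver and WebClient, and then override the relevant methods to avoid the problem.

Thanks again.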

Answer 1

I solved my eternal refresh problem by creating a RefreshHandler class that does nothing:

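import java.net.URL;
import com.gargoylesoftware.htmlunit.Page;

// Ignores every refresh request, so a page can no longer trigger an endless reload cycle.
public class RefreshHandler implements com.gargoylesoftware.htmlunit.RefreshHandler {
  public RefreshHandler() { }
  public void handleRefresh(final Page page, final URL url, final int seconds) { }
}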

In addition, I extended the HtmlUnitDriver class and installed the new RefreshHandler by overriding the modifyWebClient method:

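import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HtmlUnitDriverExt extends HtmlUnitDriver {
  public HtmlUnitDriverExt(BrowserVersion version) {
    super(version);
  }
  @Override
  protected WebClient modifyWebClient(WebClient client) {
    // Replace the default handler with the do-nothing RefreshHandler defined above.
    client.setRefreshHandler(new RefreshHandler());
    return client;
  }
}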

The modifyWebClient method is a no-op method that exists in HtmlUnitDriver for exactly this purpose.
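A handler that ignores every refresh also drops legitimate ones, such as a single redirect done through a refresh. A possible middle ground, sketched here under assumptions, is to allow a limited number of refreshes per URL. `RefreshBudget` and `tryAcquire` are hypothetical names, not HtmlUnit API; the class uses only the JDK so the counting logic can be tried on its own.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of HtmlUnit): counts refresh requests per URL
 * and reports whether another one should still be followed.
 */
final class RefreshBudget {
    private final int maxRefreshesPerUrl;
    private final Map<String, Integer> counts = new HashMap<>();

    RefreshBudget(int maxRefreshesPerUrl) {
        if (maxRefreshesPerUrl < 0) {
            throw new IllegalArgumentException("maxRefreshesPerUrl must not be negative");
        }
        this.maxRefreshesPerUrl = maxRefreshesPerUrl;
    }

    /**
     * Records one refresh request for the given URL and returns true while the
     * number of requests seen for it is still within the limit.
     * Synchronized because refreshes may be requested from more than one thread.
     */
    synchronized boolean tryAcquire(String url) {
        int seen = counts.merge(url, 1, Integer::sum);
        return seen <= maxRefreshesPerUrl;
    }
}
```

Inside a custom `handleRefresh`, one could call `budget.tryAcquire(url.toExternalForm())` and perform the reload (as `WaitingRefreshHandler` does with `client.getPage(window, new WebRequest(url))`) only when it returns `true`, returning silently otherwise.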

Cheers.
