Question

I am trying to scrape a website with HtmlUnit. Whenever I run it, it only outputs the following error:

Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "push" from undefined (https://www.kinoheld.de/dist/prod/0.4.7/widget.js#1)

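I don't know much about JS, but I've read that push is some kind of array operation. That seems standard to me, and I don't know why HtmlUnit wouldn't support it.

Here is the code I am using so far: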

public static void main(String[] args) throws IOException {
    WebClient web = new WebClient(BrowserVersion.FIREFOX_45);
    web.getOptions().setUseInsecureSSL(true);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";
    web.getOptions().setThrowExceptionOnFailingStatusCode(false);
    web.waitForBackgroundJavaScript(9000);
    HtmlPage response = web.getPage(url);

    System.out.println(response.getTitleText());
}

What am I missing? Is there a way to work around this or fix it? Thanks in advance!

Answer 1

Try adding

web.getOptions().setThrowExceptionOnScriptError(false);

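before you try to fetch the page. This forces HtmlUnit to ignore the error. However, it may not work 100% if, for example, the JavaScript that throws the error is essential for obtaining the data you are scraping (hopefully it isn't). If this doesn't work, try using Selenium with ChromeDriver or GhostDriver.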


Answer 2

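I have run into similar problems before. The underlying issue is that HtmlUnit was designed as a testing framework, not as a web-scraping framework. Are you running the latest version of HtmlUnit?

I was able to run the code by adding the setThrowExceptionOnScriptError(false) line (as mentioned in Coffee Converter's answer) and by adding java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF); at the top of the method to disable the log dump. This produced the output: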

Royal Filmpalast München München | kinoheld.de

The full code is as follows:

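public static void main(String[] args) throws IOException {

    // Turn off HtmlUnit's log output before creating the client.
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";

    // Accept invalid or self-signed SSL certificates.
    webClient.getOptions().setUseInsecureSSL(true);
    // Keep loading the page even if one of its scripts throws an error.
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // Do not throw on failing HTTP status codes.
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    // Note: no page is loaded yet, so there are no background scripts to wait for here;
    // calling this after getPage(url) is probably what was intended.
    webClient.waitForBackgroundJavaScript(9000);
    HtmlPage response = webClient.getPage(url);

    System.out.println(response.getTitleText());
}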

This was run from the command line on RedHat with HtmlUnit 2.2.1. Hope this helps.
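
As a side note on the logging call used above, here is a minimal sketch of why setting the level on the "com.gargoylesoftware" logger is enough to quiet everything beneath it. It uses only java.util.logging and assumes, as the answer's call does, that HtmlUnit's output ends up in java.util.logging. The class name QuietHtmlUnitLogs, the helper silence, and the child logger name are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogs {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // logger that is configured but not referenced anywhere may be
    // garbage-collected and silently lose its level.
    static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    /** Turns off output for the given logger and every logger that inherits from it. */
    static void silence(Logger parent) {
        parent.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence(HTMLUNIT_ROOT);

        // Hypothetical child logger name, used only to show level inheritance.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.example");

        // The child has no level of its own, so it inherits OFF from its parent.
        System.out.println(child.isLoggable(Level.SEVERE)); // prints "false"
    }
}
```

Holding the logger in a static field is a deliberate choice here: the one-liner from the answer usually works, but the configured logger is only weakly referenced by the logging framework until some other code asks for it.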

htmlunit: Cannot read property "push" from undefined
