When using HtmlUnit, how can I configure the underlying NekoHtml parser?

Date: 2012-06-21 13:08:16

Tags: java htmlunit cyberneko

I am using HtmlUnit to try to scrape a web page because it supports JavaScript. (I would rather use Jsoup, but it has no JS support.)

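The problem concerns a feature of the underlying NekoHtml parser: http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe

See: http://nekohtml.sourceforge.net/settings.html

This can apparently be enabled in Neko itself, but I am using HtmlUnit. Is there a way to configure the underlying Neko parser that HtmlUnit uses so that this feature is enabled?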

When I try to run this code:

final WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url.toString());

I get this error:

Caused by: com.gargoylesoftware.htmlunit.ObjectInstantiationException: unable to create HTML parser
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:418)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:342)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:203)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at 
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.
    at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:411)
    ... 41 more

2 answers:

Answer 0: (score: 1)

Try initializing the web client with Firefox behavior:

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);

and enable JavaScript:

webClient.setJavaScriptEnabled(true);

That should work.

Answer 1: (score: 1)

...Solved:

    BrowserVersionFeatures[] bvf = new BrowserVersionFeatures[1];
    bvf[0] = BrowserVersionFeatures.HTMLIFRAME_IGNORE_SELFCLOSING;
    BrowserVersion bv = new BrowserVersion(
            BrowserVersion.NETSCAPE, "5.0 (Windows; en-US)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8",
            (float) 3.6, bvf);

    WebClient webClient = new WebClient(bv);
    webClient.setJavaScriptEnabled(true);
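
Separately from the workaround above, a SAXNotRecognizedException for a NekoHtml feature can mean that the NekoHtml jar actually loaded at runtime is not the one HtmlUnit expects. That is an assumption about this case, not something the post confirms. A minimal sketch for checking which jar each parser class comes from is shown here. The class name ParserJarLocator and the method locate are hypothetical helpers; the two class names passed to it are taken from the NekoHtml package and from the stack trace in the question.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class would be loaded from.
final class ParserJarLocator {

    /** Returns the jar/directory a class is loaded from, or a short note if unknown. */
    static String locate(String className) {
        try {
            // Load without initializing, so static initializers cannot fail here.
            Class<?> cls = Class.forName(className, false,
                    ParserJarLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes shipped with the JDK itself typically have no code source.
            return source == null
                    ? "(no code source; JDK/bootstrap class)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // NekoHtml's configuration class and the Xerces class from the stack trace.
        System.out.println("NekoHtml: " + locate("org.cyberneko.html.HTMLConfiguration"));
        System.out.println("Xerces:   " + locate("org.apache.xerces.parsers.AbstractSAXParser"));
    }
}
```

If the printed NekoHtml location is not the jar bundled with your HtmlUnit distribution, a conflicting dependency on the classpath is worth ruling out before changing any browser settings.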