Running HtmlUnit under Tomcat 7

Date: 2013-03-16 14:48:14

Tags: java htmlunit

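I am trying to use HtmlUnit to generate crawlable HTML snapshots of our AJAX pages (as recommended by https://developers.google.com/webmasters/ajax-crawling/). The idea is to build a feature that lets the business create snapshots either through a regularly scheduled service or on demand.

I wrote a quick POC main class to test the theory, and it works as expected (when we view the source, we can see all the data the Google crawler needs, which we could not see before). I am now integrating it into our application running on Tomcat 7, and I have a problem downloading jquery.js from Google, with the following log message: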

2013-03-15 18:10:38,071 ERROR [author->taskExecutor-1] com.gargoylesoftware.htmlunit.html.HtmlPage       : Error loading JavaScript from [https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.js].
javax.net.ssl.SSLException: hostname in certificate didn't match: <ajax.googleapis.com/173.194.67.95> != <*.googleapis.com> OR <*.googleapis.com> OR <googleapis.com>
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:228)
at org.apache.http.conn.ssl.BrowserCompatHostnameVerifier.verify(BrowserCompatHostnameVerifier.java:54)
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:149)
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:130)
at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:397)
at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:495)
at org.apache.http.conn.scheme.SchemeSocketFactoryAdaptor.connectSocket(SchemeSocketFactoryAdaptor.java:62)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)

...

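As a result, the AJAX is not executed and the snapshot does not contain the data we want to see when viewing the source. Does anyone know why this happens in the Tomcat version of my code but not in my standalone main class? Both versions run on my local machine, one under Tomcat (v7) and one as a plain Java application. Both versions have the same Maven includes (see below).

Note: I tried specifying the BrowserVersion when setting up the client, WebClient client = new WebClient(BrowserVersion.FIREFOX_17);, because I had read that this produces better results (sorry, I can't remember the link). Again, this works fine in the POC, but when I run it in Tomcat I see the log "Instantiating Web Client..." and it never reaches "Client instantiated" or throws any exception, no matter how long I wait. I don't know whether this is related to the failure to download jquery.js, because the POC still works when no BrowserVersion is specified.

Here is my POC Java main method: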

        OutputStreamWriter writer = null;

        try {
            final WebClient webClient = new WebClient();
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setPrintContentOnFailingStatusCode(false);
            final HtmlPage page = (HtmlPage)webClient.getPage("http://myurl.com");

            webClient.waitForBackgroundJavaScript(1500);

            File file = new File("C:\\test.html");
            FileUtils.touch(file);

            writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
            writer.write(page.asXml());
            writer.flush();

        } catch (MalformedURLException mue) {
            System.out.println("MalformedURL exception");
        } catch (IOException ioe) {
            System.out.println("IOException occurred " +  ioe.getMessage());
        } finally {
            IOUtils.closeQuietly(writer);
        }

Here is my integrated version:

    /* Entry point for the generation */
    public void generate() {

        log.info("Beginning snapshot generation...");

        try {

            // Get the URLS
            log.info("Retrieving list of page urls");
            List<String> pageUrls = getUrlList();
            log.info("Found {} urls to generate", pageUrls.size());

            // For every url we have generate a snapshot
            for (String pageUrl: pageUrls) {
                takeSnapshot(pageUrl);
            }
            log.info("Finished generating snapshots!");
        } catch (Exception e) {
            log.error("Exception caught while generating snapshot", e);
        }
    }

    /**
     * Take the HTML snapshot of the url and output to the snapshot directory
     */
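    private void takeSnapshot(String pagePath) {
        // Declared outside the try block so the finally block can close it
        OutputStreamWriter writer = null;
        try {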
            String fullOutputFilePath = config.getHtmlSnapshotDirectory() + File.separator
                                                        + pagePath + File.separator + HTML_SNAPSHOT_FILE_NAME;
            String pageUrl = "http://myurl.com" + pagePath;

            log.debug("Instantiating Web Client...");
            final WebClient webClient = new WebClient();
            log.debug("Client instantiated");
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setPrintContentOnFailingStatusCode(false);
            final HtmlPage page = (HtmlPage)webClient.getPage(pageUrl);

            webClient.waitForBackgroundJavaScript(1500);

            File snapshotFile = new File(fullOutputFilePath);
            FileUtils.touch(snapshotFile);

            writer = new OutputStreamWriter(new FileOutputStream(snapshotFile), "UTF-8");
            writer.write(page.asXml());
            writer.flush();
        } catch (MalformedURLException mue) {
            System.out.println("MalformedURL exception");
        } catch (IOException ioe) {
            System.out.println("IOException occurred " +  ioe.getMessage());
        } finally {
            IOUtils.closeQuietly(writer);
        }
    }

Maven dependencies:

        <dependency>
            <groupId>net.sourceforge.htmlunit</groupId>
            <artifactId>htmlunit</artifactId>
            <version>2.12</version>
        </dependency>

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.2.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.3-alpha1</version>
        </dependency>

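Thanks, everyone!

1 answer:

Answer 0: (score: 1)

So, adding webClient.getOptions().setUseInsecureSSL(true); was the key to fixing this. However, I had to use the deprecated version, webClient.setUseInsecureSSL(true);

I don't know why the newer version doesn't work when running in Tomcat, but the deprecated call resolved the issue. If anyone can offer insight into why, that would be great. I am also still at a loss as to why setting the BrowserVersion causes the application to hang when running in Tomcat. I have asked the HtmlUnit mailing list for answers to these questions.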
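One detail in the logged exception may help narrow this down. The hostname being verified is printed as ajax.googleapis.com/173.194.67.95, which is the "hostname/IP" form that java.net.InetAddress.toString() produces, while the certificate names (*.googleapis.com) look correct. The sketch below only illustrates that observation with the JDK. The class HostnameLogDemo and its matches method are hypothetical, and the wildcard check is a deliberately simplified stand-in, not HttpClient's actual verification algorithm. Whether this is what happens inside Tomcat is an assumption that would need to be confirmed by comparing the httpclient and httpcore jars on the two classpaths.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

public class HostnameLogDemo {

    /** Simplified wildcard check for illustration only; NOT HttpClient's real algorithm. */
    static boolean matches(String host, String pattern) {
        String h = host.toLowerCase(Locale.ROOT);
        String p = pattern.toLowerCase(Locale.ROOT);
        if (!p.startsWith("*.")) {
            return h.equals(p);
        }
        String suffix = p.substring(1); // e.g. ".googleapis.com"
        if (!h.endsWith(suffix)) {
            return false;
        }
        // The wildcard may stand for exactly one non-empty label
        String label = h.substring(0, h.length() - suffix.length());
        return !label.isEmpty() && label.indexOf('.') < 0;
    }

    /** Builds the address from the log without a DNS lookup and returns its toString() form. */
    static String addressAsString() {
        byte[] ip = {(byte) 173, (byte) 194, 67, 95};
        try {
            InetAddress address = InetAddress.getByAddress("ajax.googleapis.com", ip);
            return address.toString();
        } catch (UnknownHostException e) {
            // Only thrown for an illegal address length, which cannot happen here
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String asLogged = addressAsString();
        System.out.println(asLogged); // ajax.googleapis.com/173.194.67.95
        System.out.println(matches("ajax.googleapis.com", "*.googleapis.com")); // true
        System.out.println(matches(asLogged, "*.googleapis.com")); // false
    }
}
```

If the plain hostname matches and only the "hostname/IP" string fails, the certificate itself is probably fine, and disabling verification with setUseInsecureSSL(true) is a workaround rather than a fix for the underlying cause.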