Question

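I created a basic GWT (Google Web Toolkit) Ajax application, and now I am trying to create snapshots so that crawlers can read the page.

I am using HtmlUnit in a servlet that responds to the crawlers.

When I open the application in a browser, it works fine. In HtmlUnit, however, it raises a lot of errors about the special characters in my HTML. These characters are content, and since the page already works I would rather not replace them with special codes just because of HtmlUnit. (At the very least, I should first check whether I am using HtmlUnit correctly.)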

My page with the error

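I think HtmlUnit should read the page's charset information and render the page the way a browser does, since I believe that is the goal of the project.

I have not found any good information about this problem. Is this a limitation of HtmlUnit? Do I need to change all the content of my site in order to take snapshots with this Java library?

Here is my code: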

    if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
        // OK, the request comes from the crawler.
        // Rewrite the URL back to the original #! version.
        // Remember to unescape any %XX characters.
        url = URLDecoder.decode(url, "UTF-8");

        String ajaxURL = url.replace("?_escaped_fragment_=", "#!");

        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);

        HtmlPage page = webClient.getPage(ajaxURL);

        // Important! Give the headless browser enough time to execute JavaScript.
        // The exact time to wait may depend on your application.
        webClient.waitForBackgroundJavaScript(3000);

        // Return the snapshot.
        response.getWriter().write(page.asXml());
    }
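
The URL-rewriting step in the code above can be tried in isolation, without a servlet container or HtmlUnit. The following is only a minimal sketch: the class name `EscapedFragmentUrl`, the method `toAjaxUrl`, and the `example.com` URLs are hypothetical and not part of the original code. It also handles the case where `_escaped_fragment_` follows other query parameters (`&` instead of `?`), which the original snippet does not cover and which is an assumption about the URLs a crawler may send.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class; not part of HtmlUnit, GWT, or the original servlet.
class EscapedFragmentUrl {

    // Converts a crawler URL containing _escaped_fragment_ back to its #! form.
    static String toAjaxUrl(String crawlerUrl) {
        String decoded;
        try {
            // Unescape any %XX sequences, as in the original snippet.
            decoded = URLDecoder.decode(crawlerUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
        // The parameter may be the first query parameter (?) or a later one (&).
        return decoded.replace("?_escaped_fragment_=", "#!")
                      .replace("&_escaped_fragment_=", "#!");
    }

    public static void main(String[] args) {
        System.out.println(toAjaxUrl("http://example.com/app?_escaped_fragment_=page%3Dhome"));
    }
}
```

Keeping the rewrite in a small pure function like this makes it easy to check the resulting URL separately from the JavaScript rendering step.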

Answer 1

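The problem was that XML was being confused with HTML. @ColinAlworth's comment helped me.

I followed Google's example, but it did not work.

To make it work, you need to remove the XML declaration and respond with plain HTML. Change the lines: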

 // return the snapshot
 response.getWriter().write(page.asXml());

to

 response.getWriter().write(page.asXml().replaceFirst("<\\?.*>",""));

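Now the page renders.

However, although it renders, the CSS is not applied and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit raised a lot of errors about the CSS, even though I am using Twitter Bootstrap without any modifications. In my experience, HtmlUnit has a lot of bugs: it is fine for small tests, but it could not parse complex (or even simple) HTML.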
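
The pattern `<\\?.*>` is greedy: on the line where it first matches, it removes everything up to the last `>`. That is harmless as long as the XML declaration sits on a line of its own, which appears to be how `asXml()` formats its output, but it would also delete markup that shared that line. A slightly more defensive variant is sketched below. The class name `XmlDeclarationStripper` and the method `stripXmlDeclaration` are hypothetical, and the sample input is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not part of HtmlUnit.
class XmlDeclarationStripper {

    // Matches only a leading "<?xml ... ?>" declaration (plus surrounding whitespace).
    // "\\A" anchors at the start of the input and ".*?" stops at the first "?>".
    private static final Pattern XML_DECLARATION =
            Pattern.compile("\\A\\s*<\\?xml.*?\\?>\\s*", Pattern.DOTALL);

    // Returns the input without its leading XML declaration, if there is one.
    static String stripXmlDeclaration(String xml) {
        return XML_DECLARATION.matcher(xml).replaceFirst("");
    }

    public static void main(String[] args) {
        String snapshot = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html><head><title>Home</title></head></html>";
        System.out.println(stripXmlDeclaration(snapshot));
    }
}
```

In the servlet, the call would then be something like `response.getWriter().write(XmlDeclarationStripper.stripXmlDeclaration(page.asXml()))`, assuming the helper is on the classpath.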

HtmlUnit to get snapshots of an Ajax application
