Question

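I am trying to scrape a web page that is built with GWT and makes its AJAX calls through the GWT RPC mechanism. The page is not mine, so I cannot edit the server side. I am fairly new to GWT, and from my first few days with it I think you cannot deserialize the data unless you have the class interfaces.

Am I right, or is there a way to scrape the data intelligently?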

Answer 1

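I scrape for a living, and GWT is the one framework that almost always stumps me. Because it passes serialized, non-human-readable parameters, I cannot plug in the logic that would walk through the site.

On some simple GWT sites I have gotten by with a workaround: parsing the JavaScript and running parts of it as is. I have not been able to get all of it working, though.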
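The serialized payload is not entirely opaque, which may help a little. Here is a minimal sketch, assuming a successful response body starts with `//OK` and carries its strings as double-quoted literals (the commonly described shape of a GWT-RPC response). The class and method names are made up for illustration, and this only recovers the strings, not the object structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GwtRpcStrings {
    // A double-quoted literal, allowing backslash escapes inside it.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"");

    /**
     * Returns every quoted string found in the response body, in order.
     * Assumption: a successful response begins with "//OK".
     * Escape sequences are returned as written, not unescaped.
     */
    public static List<String> extractStrings(String responseBody) {
        if (responseBody == null || !responseBody.startsWith("//OK")) {
            throw new IllegalArgumentException("not a successful GWT-RPC response");
        }
        List<String> strings = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(responseBody);
        while (m.find()) {
            strings.add(m.group(1));
        }
        return strings;
    }
}
```

For a made-up body such as `//OK[2,1,["java.lang.String/2004016611","hello"],0,7]`, this returns the two quoted strings in the order they appear. Mapping them back onto fields still requires knowing, or guessing, the classes involved.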

Answer 2

You can do this with HtmlUnit and its WebClient:

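// Real code mixed with pseudo-code: extractLinks(...) is a helper you have to write yourself.
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

// FIREFOX_3 only exists in old HtmlUnit releases; use a constant your version provides.
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3);
Map<String, String> urls = new HashMap<>();           // URL -> rendered markup
LinkedList<String> urlsToVisit = new LinkedList<>();  // URLs still to crawl
urlsToVisit.add("http://some_gwt_app.com/#!home");
while (!urlsToVisit.isEmpty()) {
    String url = urlsToVisit.remove();
    if (urls.containsKey(url)) {
        continue;  // already crawled
    }
    HtmlPage page = webClient.getPage(url);  // loads the page and runs its JavaScript
    String rendered = page.asXml();          // DOM as it stands after the scripts ran
    urls.put(url, rendered);
    urlsToVisit.addAll(extractLinks(page));  // pseudo-code: links found on this page
}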
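One way the link-extraction helper could look, as a rough sketch that works on the rendered markup rather than on an HtmlUnit page object. The class name, the signature, and the regex are my own assumptions; a regex is not a real HTML parser, so treat this as a starting point only.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Rough match for href="..." or href='...'.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Collects the distinct http(s) links in the markup, resolved against baseUrl.
     * Fragment-only links such as "#!about" are kept, because GWT apps often
     * use the fragment to select a view.
     */
    public static Set<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();  // keeps order, drops duplicates
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it rather than abort the crawl.
            }
        }
        return links;
    }
}
```

With this shape, the loop above would call something like `extractLinks(url, rendered)` instead of passing the page object. Links such as `javascript:void(0)` are dropped by the scheme check.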

You may need to experiment with the WebClient options. These seemed to work well for me:

// In newer HtmlUnit releases the first three setters live on webClient.getOptions().
webClient.setThrowExceptionOnScriptError(false);
webClient.setRedirectEnabled(true);
webClient.setJavaScriptEnabled(true);
// Important! Give the headless browser enough time to execute JavaScript.
// Call this after getPage(...), because it only waits for scripts that have
// already been started. The exact time to wait may depend on your application.
webClient.waitForBackgroundJavaScript(20000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
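Since the right wait time depends on the application, an alternative to one fixed delay is to poll until the content you expect has shown up. A minimal sketch of such a helper follows; the class and method names are hypothetical and it is not part of HtmlUnit.

```java
import java.util.function.BooleanSupplier;

class PollingWait {
    /**
     * Checks the condition every pollMillis until it holds or timeoutMillis
     * has passed. Returns true if the condition held, false on timeout or
     * if the thread was interrupted while sleeping.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

With a loaded page, the condition could be something like `() -> page.asXml().contains("some-marker")`, where the marker is whatever text or element id tells you the view has finished rendering.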

Web crawling/scraping a GWT-based web page
