Question

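I am trying to build some kind of web service on Google App Engine.

The problem is that I need to fetch data from a website (HTML scraping).

The request looks like this: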

URL url = new URL(p_url);
con = (HttpURLConnection) url.openConnection();
InputStreamReader in = new InputStreamReader(con.getInputStream());
BufferedReader reader = new BufferedReader(in);

String result = "";
String line = "";
while ((line = reader.readLine()) != null)
{
    System.out.println(line);
}
return result;

App Engine throws the following exception on the third line of the snippet (the call to con.getInputStream()):

com.google.appengine.api.urlfetch.ResponseTooLargeException

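This is because the maximum size of a fetched response is 1 MB, and the HTML of the page totals about 1.5 MB.

Now my question: I only need the first 20 lines of the HTML for scraping. Is there a way to fetch only part of the HTML so that ResponseTooLargeException is not thrown?

Thanks in advance!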

Answer 1

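Solved this by using the low-level URLFetch API.

Set the allowTruncate option, so that a response that is too large is truncated instead of causing a ResponseTooLargeException. See the FetchOptions Javadoc: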

http://code.google.com/intl/nl-NL/appengine/docs/java/javadoc/com/google/appengine/api/urlfetch/FetchOptions.html

Basically it works like this:

HTTPRequest request = new HTTPRequest(_url, HTTPMethod.POST, Builder.allowTruncate());
URLFetchService service = URLFetchServiceFactory.getURLFetchService();
HTTPResponse response = service.fetch(request);
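
Once the response is truncated rather than rejected, the first 20 lines still have to be pulled out of the returned bytes. The following is a minimal sketch of that step, not part of the original answer. It uses only the standard library so it can run outside App Engine; the class name FirstLinesSketch and the method firstLines are hypothetical, and UTF-8 is an assumed encoding. On App Engine the byte array would come from response.getContent().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts the leading lines of a (possibly truncated) body.
final class FirstLinesSketch {

    // Returns at most maxLines lines decoded from body using the given charset.
    static List<String> firstLines(byte[] body, int maxLines, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(body), charset))) {
            String line;
            // Check the count first so no line beyond maxLines is read.
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for response.getContent(); the charset is an assumption.
        byte[] body = "<html>\n<head>\n<title>Demo</title>\n</head>\n"
                .getBytes(StandardCharsets.UTF_8);
        for (String line : firstLines(body, 2, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

Because truncation can cut the body in the middle of a line or of a multi-byte character, the last line of a truncated response should be treated as possibly incomplete. For a page of about 1.5 MB, the first 20 lines are very likely to fall well inside the part that is returned.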

Google App Engine (Java): URL Fetch response too large problem
