Apache HttpClient returns the page before it has loaded?

Asked: 2010-10-25 17:20:28

Tags: java apache amazon httpclient

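I have noticed some strange behavior when using the Apache HttpClient library, and I would like to know why it happens. I put together some sample code to demonstrate it. Consider the following code: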

 // Imports needed by this snippet (commons-httpclient 3.x)
 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import org.apache.commons.httpclient.*;
 import org.apache.commons.httpclient.cookie.CookiePolicy;
 import org.apache.commons.httpclient.methods.GetMethod;
 import org.apache.commons.httpclient.params.HttpMethodParams;

 // Example URL
 String url = "http://www.amazon.com/gp/offer-listing/05961580/ref=dp_olp_used?ie=UTF8";
 GetMethod get = new GetMethod(url);
 // Retry at most once; do not retry a request that was already sent
 HttpMethodRetryHandler httpHandler = new DefaultHttpMethodRetryHandler(1, false);
 get.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, httpHandler);
 get.getParams().setCookiePolicy(CookiePolicy.IGNORE_COOKIES);
 HttpConnectionManager connectionManager = new SimpleHttpConnectionManager();
 HttpClient client = new HttpClient(connectionManager);
 // FIREFOX is a user-agent string constant defined elsewhere (not shown)
 client.getParams().setParameter("http.useragent", FIREFOX);
 String line;
 StringBuilder stringBuilder = new StringBuilder();
 String toStreamBody = null;
 String toStringBody = null;
 try {
  int statusCode = client.executeMethod(get);
  if (statusCode != HttpStatus.SC_OK) {
   System.err.println("Internet Status: " + HttpStatus.getStatusText(statusCode));
   System.err.println("While getting page: " + url);
  }
  // toString: read the body as a String
  toStringBody = get.getResponseBodyAsString();
  // toStream: read the body line by line from the stream
  InputStreamReader isr = new InputStreamReader(get.getResponseBodyAsStream());
  BufferedReader rd = new BufferedReader(isr);
  while ((line = rd.readLine()) != null) {
   stringBuilder.append(line);
  }
 } catch (java.io.IOException ex) {
  System.out.println("Failed to get page: " + url);
 } finally {
  get.releaseConnection();
 }
 toStreamBody = stringBuilder.toString();

This code prints nothing:

 System.out.println(toStringBody); // ""

This code prints the web page:

 System.out.println(toStreamBody); // "Whole Page"
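One detail of the stream-based loop above: `BufferedReader.readLine()` strips the line terminator, so appending each line directly joins the whole page into a single line. A minimal sketch, using only the JDK, of keeping the line breaks; the class `LineJoin` and method `readAllLines` are hypothetical names chosen for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LineJoin {
    // Reads every line and re-inserts '\n' after each one,
    // because readLine() returns the line without its terminator.
    public static String readAllLines(BufferedReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the HTTP response stream here.
        BufferedReader rd = new BufferedReader(new StringReader("<html>\r\n<body/>\r\n</html>"));
        System.out.print(readAllLines(rd));
    }
}
```

In the original snippet, the same change would amount to appending a newline after `stringBuilder.append(line)`.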

But it gets even stranger. Replace:

get.getResponseBodyAsString();

with:

 get.getResponseBodyAsString(150000);

Now we get the error: Failed to get page: http://www.amazon.com/gp/offer-listing/0596158068/ref=dp_olp_used?ie=UTF8
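The catch block in the code above prints only a fixed message, so this output does not show which exception was thrown or why. A minimal sketch, using only the JDK, of a message that keeps the exception type and text; the class `FailureReport` and method `describeFailure` are hypothetical names for illustration and are not part of HttpClient:

```java
import java.io.IOException;

public class FailureReport {
    // Hypothetical helper: builds a message that includes the exception's
    // simple class name and its message, instead of discarding both.
    public static String describeFailure(String url, IOException ex) {
        return "Failed to get page: " + url
                + " (" + ex.getClass().getSimpleName() + ": " + ex.getMessage() + ")";
    }

    public static void main(String[] args) {
        // The exception here is constructed by hand purely for demonstration.
        IOException ex = new IOException("example cause");
        System.err.println(describeFailure("http://www.example.com/", ex));
    }
}
```

Calling such a helper from the `catch (java.io.IOException ex)` block would make the reason for the failure visible in the output.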

I have not been able to reproduce this behavior with any site other than Amazon, but I assume there are others.

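I know that the documentation at http://hc.apache.org/httpclient-3.x/performance.html discourages the use of getResponseBodyAsString(), but it does not say that the page will fail to load, only that you risk an out-of-memory exception. Is it possible that getResponseBodyAsString() returns before the page has loaded? And why does this happen only with Amazon?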
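For reference, the memory concern can be addressed by reading the stream in chunks with an explicit upper bound. The following is a minimal sketch using only the JDK; `BoundedBodyReader` and `readBody` are hypothetical names, and this is not a reproduction of how HttpClient implements getResponseBodyAsString(int). The stream returned by `get.getResponseBodyAsStream()` could be passed in as the `in` argument.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BoundedBodyReader {
    // Reads the stream to its end in 4 KB chunks and decodes it with the given charset.
    // Throws IOException as soon as more than maxBytes have been read.
    public static String readBody(InputStream in, Charset charset, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("Body exceeds " + maxBytes + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // A ByteArrayInputStream stands in for the HTTP response stream here.
        InputStream in = new ByteArrayInputStream("<html>ok</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readBody(in, StandardCharsets.UTF_8, 150000));
    }
}
```

The charset is passed explicitly because the sketch has no access to the response headers; in real use it would be taken from the Content-Type of the response.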

1 Answer:

Answer 0: (score: 0)

Have you tested with any other URLs?

The URL in the code you provided gets a 302 redirect to http://www.amazon.com/dp/05961580/?tag=stackoverfl08-20, which then returns 404 (Not Found).

HttpClient does not handle redirects: http://hc.apache.org/httpclient-3.x/redirects.html
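To check for this case by hand, the status code and the Location header can be inspected before reading the body. The following is a minimal sketch using only the JDK; `RedirectCheck`, `isRedirect`, and `redirectTarget` are hypothetical names, and the choice of status codes is an assumption rather than a statement about what HttpClient does. With HttpClient 3.x, the inputs could come from `get.getStatusCode()` and `get.getResponseHeader("Location")`.

```java
import java.net.URI;

public class RedirectCheck {
    // True for the status codes commonly used for redirects that carry a Location header.
    public static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307;
    }

    // Returns the redirect target, or null if the response is not a redirect
    // or has no Location header. A relative Location is resolved against the request URI.
    public static URI redirectTarget(URI requestUri, int status, String location) {
        if (!isRedirect(status) || location == null) {
            return null;
        }
        return requestUri.resolve(location);
    }

    public static void main(String[] args) {
        // The status and Location values below are the ones reported in this answer.
        URI request = URI.create("http://www.amazon.com/gp/offer-listing/05961580/ref=dp_olp_used?ie=UTF8");
        URI target = redirectTarget(request, 302, "http://www.amazon.com/dp/05961580/?tag=stackoverfl08-20");
        System.out.println(target);
    }
}
```

If `redirectTarget` returns a non-null URI, a second request to that URI would be needed, and its status code (404 in this case) checked in turn.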