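What is the best way to read the contents of a web page into a String in Java?

Time: 2009-07-25 14:32:57

Tags: java string optimization inputstream micro-optimization

I have the following Java code to fetch the entire contents of the HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.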

public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }

    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }

    return page.toString();
}

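I can't help feeling that reading line by line is less than optimal. I know that I may be masking a MalformedURLException raised on the openConnection line, and I am okay with that.

My function also has the side effect of giving the HTML string the correct line terminators for the current system. That is not a requirement.

I realize that network I/O will probably dwarf the time it takes to read in the HTML, but I would still like to know whether this is optimal.

On a side note: it would be nice if StringBuilder had a constructor that took an open InputStream and simply read all of the InputStream's contents into the StringBuilder.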

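3 answers:

Answer 0: (score: 10)

As the other answers show, there are many different edge cases (HTTP peculiarities, encoding, chunking, etc.) that any robust solution has to account for. I therefore suggest that anything other than a toy program use the de facto standard Java HTTP library: Apache HTTP Components HTTP Client.

They provide many samples; "just" getting the response contents for a request looks like this: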

// Requires Apache HttpClient 4.x (org.apache.http.* packages)
HttpClient httpclient = new DefaultHttpClient();
try {
    HttpGet httpget = new HttpGet("http://www.google.com/");
    ResponseHandler<String> responseHandler = new BasicResponseHandler();
    String responseBody = httpclient.execute(httpget, responseHandler);
    // responseBody now contains the contents of the page
    System.out.println(responseBody);
} finally {
    // release connection resources even if the request fails
    httpclient.getConnectionManager().shutdown();
}
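
As a small illustration of the "encoding" edge case mentioned in this answer, here is a minimal sketch of picking a charset from a Content-Type header value. The class and method names (ContentTypeCharset, fromContentType) are made up for this example, and the fallback charset is left to the caller as an assumption. With plain HttpURLConnection, the header value could be obtained from conn.getContentType(). This is not a full header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {

    // Returns the charset named in a header such as "text/html; charset=UTF-8",
    // or the caller-supplied fallback if none is present or it is unsupported.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs.name());
    }
}
```

The resulting Charset can then be passed to an InputStreamReader or to a String constructor instead of relying on the platform default.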

Answer 1: (score: 2)

OK, edited once more. Be sure to put try-finally blocks around it, or catch IOException.

 ...
 final static int BUFZ = 4096;
 StringBuilder page = new StringBuilder();
 HttpURLConnection conn =
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[BUFZ];
 int nRead = 0;

 while ((nRead = is.read(buf, 0, BUFZ)) > 0) {
    // convert only the nRead bytes actually read, not the whole buffer;
    // uses local default char encoding for now
    page.append(new String(buf, 0, nRead /* , Charset charset */));
 }

Try this one here:

 ...
 final static int MAX_SIZE = 10000000;
 HttpURLConnection conn =
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[MAX_SIZE];
 int nRead = 0;
 int total = 0;
 // you could also use ArrayList so that you could dynamically
 // resize or there are other ways to resize an array also
 while (total < MAX_SIZE && (nRead = is.read(buf, total, MAX_SIZE - total)) > 0) {
      // read at offset total so earlier bytes are not overwritten
      total += nRead;
 }
 ...
 // do something with buf array of length total

OK, the code below won't work for you, because with HTTP/1.1 "chunking" the Content-Length header line is not sent up front:

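 ...
 HttpURLConnection conn =
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // getContentLength() returns -1 when the length is unknown (e.g. chunked),
 // in which case the array allocation below fails
 int cLen = conn.getContentLength();
 byte[] buf = new byte[cLen];
 int nRead = 0;

 while (nRead < cLen) {
      int n = is.read(buf, nRead, cLen - nRead);
      if (n < 0) break; // stream ended before cLen bytes arrived
      nRead += n;
 }
 ...
 // do something with buf array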
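
One way to sidestep the missing Content-Length is to read until end of stream into a growable buffer and decode only once at the end. The following is a minimal sketch of that idea, not a drop-in replacement for the snippets above: the class and method names (StreamBytes, readAll, readAllAsString) are made up for this example, it assumes Java 7 or later for StandardCharsets, and the caller is assumed to know the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class StreamBytes {
    private static final int BUFZ = 4096;

    // Reads every byte until end of stream, so no Content-Length is needed.
    static byte[] readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[BUFZ];
        int nRead;
        while ((nRead = is.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, nRead); // copy only the bytes actually read
        }
        return out.toByteArray();
    }

    // Decodes once, after all bytes have arrived, so a multi-byte character
    // split across two reads is still decoded correctly.
    static String readAllAsString(InputStream is, Charset charset) throws IOException {
        return new String(readAll(is), charset);
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for conn.getInputStream()
        byte[] data = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        InputStream is = new ByteArrayInputStream(data);
        System.out.println(readAllAsString(is, StandardCharsets.UTF_8));
    }
}
```

The caller still owns the stream and should close it in a finally block, as noted at the top of this answer.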

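Answer 2: (score: 1)

You could do your own buffering on top of an InputStreamReader by reading larger chunks into a character array and appending the array contents to the StringBuilder.

But that would make your code somewhat harder to understand, and I doubt it is worth it.

Note that Sean A.O. Harney's proposal reads raw bytes, so you would need to do the text conversion on top of that.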
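
The char-array buffering described in this answer could look roughly like the sketch below. The class name (CharChunkReader), the 8192-character chunk size, and the explicit Charset parameter are assumptions made for illustration. The InputStreamReader handles the byte-to-text conversion, so characters are never split across chunks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class CharChunkReader {

    // Reads the whole stream as text in fixed-size character chunks.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        char[] chunk = new char[8192];
        Reader reader = new InputStreamReader(in, charset);
        try {
            int n;
            while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
                page.append(chunk, 0, n); // append only the chars actually read
            }
        } finally {
            reader.close(); // also closes the underlying stream
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for conn.getInputStream()
        byte[] data = "line one\r\nline two\n".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
    }
}
```

Unlike the readLine loop in the question, this keeps the page's original line terminators instead of replacing them with the system's.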