Question

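I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done more efficiently? Any improvements are welcome.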

public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }

    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }

    return page.toString();
}

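I can't help but feel that the line-by-line reading is less than optimal. I know that I may be masking a MalformedURLException caused by the openConnection call, and I'm okay with that.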

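My function also has the side effect of giving the HTML String the correct line terminators for the current system. This isn't a requirement.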

I realize that network IO will probably dwarf the time it takes to read in the HTML, but I'd still like to know whether this is optimal.

As a side note: it would be awesome if StringBuilder had a constructor for an open InputStream that would simply take all the contents of the InputStream and read them into the StringBuilder.

Answer 1

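As the other answers show, there are many different edge cases (HTTP peculiarities, encoding, chunking, etc.) that any robust solution should account for. I therefore suggest that for anything other than a toy program you use the de facto standard Java HTTP library: Apache HTTP Components HTTP Client.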

They provide many samples; "just" getting the response contents for a request looks like this:

HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://www.google.com/"); 
ResponseHandler<String> responseHandler = new BasicResponseHandler();    
String responseBody = httpclient.execute(httpget, responseHandler);
// responseBody now contains the contents of the page
System.out.println(responseBody);
httpclient.getConnectionManager().shutdown();
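
Encoding is one of the edge cases mentioned above. For readers who stay with plain HttpURLConnection, the following is a minimal sketch of picking the charset out of a Content-Type header value such as the one returned by conn.getContentType(). The class name ContentTypeCharset, the helper charsetOf, and the choice to fall back to a caller-supplied default are all assumptions made for illustration; a real library handles many more cases.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type value
    // like "text/html; charset=ISO-8859-1", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" parameter
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

The resulting Charset can then be passed to an InputStreamReader or to a String constructor instead of relying on the platform default.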

Answer 2

OK, edited once again. Be sure to put a try-finally block around it, or catch the IOException.

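 ...
 final static int BUFZ = 4096;
 StringBuilder page = new StringBuilder();
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[BUFZ];
 int nRead = 0;

 // the extra parentheses make the assignment happen before the comparison
 while ((nRead = is.read(buf, 0, BUFZ)) > 0) {
    // append only the nRead bytes actually read in this pass
    page.append(new String(buf, 0, nRead /* , Charset charset */));
    // uses local default char encoding for now; a multi-byte character
    // split across two reads will not decode correctly this way
 }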

Try this one here:

 ...
 final static int MAX_SIZE = 10000000;
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[MAX_SIZE];
 int nRead = 0;
 int total = 0;
 // you could also use ArrayList so that you could dynamically
 // resize or there are other ways to resize an array also
 // read at offset total so earlier data is not overwritten
 while (total < MAX_SIZE
        && (nRead = is.read(buf, total, MAX_SIZE - total)) > 0) {
      total += nRead;
 }
 ...
 // do something with buf array of length total

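OK, the code below was not working for you because the Content-Length header line is not sent at the start of the response, due to HTTP/1.1 "chunking":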

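 ...
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // getContentLength() returns -1 when no Content-Length header was sent,
 // and new byte[-1] then throws NegativeArraySizeException
 int cLen = conn.getContentLength();
 byte[] buf = new byte[cLen];
 int nRead = 0;

 while (nRead < cLen) {
      int n = is.read(buf, nRead, cLen - nRead);
      if (n < 0) break; // stream ended before cLen bytes arrived
      nRead += n;
 }
 ...
 // do something with buf array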
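
One way around the missing Content-Length is to treat it only as a sizing hint and always read until end of stream. The sketch below assumes that approach; the class name BodyReader and the helper readBody are made up for illustration, and the InputStream would come from conn.getInputStream() in real use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class BodyReader {

    // Hypothetical helper: reads the whole stream. contentLength is used only
    // to pre-size the buffer and may be -1 (unknown), as with chunked responses.
    static byte[] readBody(InputStream in, int contentLength) throws IOException {
        int initialSize = contentLength > 0 ? contentLength : 8192;
        ByteArrayOutputStream out = new ByteArrayOutputStream(initialSize);
        byte[] chunk = new byte[4096];
        int n;
        // read() returns -1 at end of stream, whatever the header claimed
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10000];
        // Simulates a chunked response: the length is reported as unknown
        byte[] body = readBody(new ByteArrayInputStream(data), -1);
        System.out.println(body.length);
    }
}
```

ByteArrayOutputStream grows as needed, so no fixed MAX_SIZE has to be chosen up front.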

Answer 3

You could do your own buffering on top of InputStreamReader by reading bigger chunks into a character array and appending the array contents to the StringBuilder.

But that would make your code slightly harder to understand, and I doubt it would be worth it.
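
For reference, the character-array approach might look roughly like the sketch below. The class name CharChunkReader, the helper readAll, and the 8192-character chunk size are assumptions for illustration; in real use the Reader would wrap conn.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class CharChunkReader {

    // Hypothetical helper: drains a Reader in fixed-size character chunks.
    static String readAll(Reader reader) throws IOException {
        StringBuilder page = new StringBuilder();
        char[] chunk = new char[8192];
        int n;
        // read() returns the number of chars read, or -1 at end of stream
        while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
            page.append(chunk, 0, n);
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html>\r\n<body>hi</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        Reader reader = new InputStreamReader(new ByteArrayInputStream(html), StandardCharsets.UTF_8);
        try {
            System.out.println(readAll(reader).length());
        } finally {
            reader.close();
        }
    }
}
```

Unlike readLine, this keeps the page's original line terminators instead of replacing them with the system's.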

Note that the proposal by Sean A.O. Harney reads raw bytes, so you would need to do the conversion to text on top of that.
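
A small sketch of why that conversion step needs care: decoding each byte chunk on its own can break a multi-byte character that straddles two chunks, whereas collecting all bytes first and decoding once does not. The class DecodeOnce, both helper names, and the deliberately tiny chunk size are illustrative assumptions, and chunkSize is assumed to be positive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeOnce {

    // Decodes every chunk separately; a character split across two chunks
    // is decoded incorrectly. Shown only for comparison.
    static String decodePerChunk(InputStream in, Charset cs, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] chunk = new byte[chunkSize];
        int n;
        while ((n = in.read(chunk)) != -1) {
            sb.append(new String(chunk, 0, n, cs));
        }
        return sb.toString();
    }

    // Collects the raw bytes first and converts to text in a single step.
    static String decodeOnce(InputStream in, Charset cs, int chunkSize) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] chunk = new byte[chunkSize];
        int n;
        while ((n = in.read(chunk)) != -1) {
            bytes.write(chunk, 0, n);
        }
        return new String(bytes.toByteArray(), cs);
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // the last character takes two bytes in UTF-8
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 1 forces the two-byte character across a boundary
        System.out.println(text.equals(decodePerChunk(new ByteArrayInputStream(data), StandardCharsets.UTF_8, 1)));
        System.out.println(text.equals(decodeOnce(new ByteArrayInputStream(data), StandardCharsets.UTF_8, 1)));
    }
}
```

Wrapping the stream in an InputStreamReader with an explicit charset is another way to get the same effect, since the reader keeps decoder state between reads.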

What is the best way to read the contents of a web page into a String in Java?
