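I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.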
public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }
    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }
    return page.toString();
}
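I can't help but feel that reading line by line is less than optimal. I know that I'm possibly masking a MalformedURLException caused by the openConnection call, and I'm okay with that.
My function also has the side effect of giving the HTML String the correct line terminators for the current system. This isn't a requirement.
I realize that network IO will probably dwarf the time it takes to read in the HTML, but I'd still like to know whether this is optimal.
On a side note: it would be awesome if StringBuilder had a constructor for an open InputStream that would simply take all the contents of the InputStream and read them into the StringBuilder.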
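StringBuilder has no such constructor, but the side note's wish can be approximated with a small helper. The following is only a sketch: the class name StreamText and the method readAll are hypothetical, it assumes Java 9 or later (for InputStream.readAllBytes()), and it assumes the caller knows the page's charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamText {
    // Hypothetical helper: drains the stream, closes it, and decodes
    // all bytes in one step. Requires Java 9+ for readAllBytes().
    static String readAll(InputStream in, Charset charset) throws IOException {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), charset);
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for conn.getInputStream().
        InputStream in = new ByteArrayInputStream(
                "<html></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```

If a StringBuilder is really wanted, the result can be wrapped with new StringBuilder(text). Because the bytes are decoded once at the end, line terminators are preserved exactly as the server sent them.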
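Answer 0 (score: 10)
As the other answers show, there are many different edge cases (HTTP peculiarities, encoding, chunking, etc.) that any robust solution has to account for. I therefore propose that in anything other than a toy program you use the de facto standard Java HTTP library: Apache HTTP Components HTTP Client.
They provide many samples; "just" getting the response contents for a request looks like this: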
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://www.google.com/");
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = httpclient.execute(httpget, responseHandler);
// responseBody now contains the contents of the page
System.out.println(responseBody);
httpclient.getConnectionManager().shutdown();
Answer 1 (score: 2)
OK, edited once again. Be sure to put a try-finally block around it, or catch IOException.
...
final static int BUFZ = 4096;
StringBuilder page = new StringBuilder();
HttpURLConnection conn =
        (HttpURLConnection) new URL(url).openConnection();
InputStream is = conn.getInputStream();
// perhaps allocate this one time and reuse if you
// call this method a lot.
byte[] buf = new byte[BUFZ];
int nRead = 0;
while ((nRead = is.read(buf, 0, BUFZ)) > 0) {
    // decode only the nRead bytes actually read in this pass
    page.append(new String(buf, 0, nRead /* , Charset charset */));
    // uses local default char encoding for now; note that a multi-byte
    // character split across two reads will be decoded incorrectly
}
Here, try this one:
...
final static int MAX_SIZE = 10000000;
HttpURLConnection conn =
        (HttpURLConnection) new URL(url).openConnection();
InputStream is = conn.getInputStream();
// perhaps allocate this one time and reuse if you
// call this method a lot.
byte[] buf = new byte[MAX_SIZE];
int nRead = 0;
int total = 0;
// you could also use ArrayList so that you could dynamically
// resize or there are other ways to resize an array also
// read at offset total so earlier data is not overwritten
while (total < MAX_SIZE && (nRead = is.read(buf, total, MAX_SIZE - total)) > 0) {
    total += nRead;
}
...
// do something with buf array of length total
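The comment above about resizing dynamically can be sketched with a ByteArrayOutputStream, which grows as needed, so there is no need to allocate the full maximum up front. This is a minimal sketch, not the answerer's code: the class BoundedReader, the method readUpTo, and the 4096-byte chunk size are hypothetical choices, and the charset is assumed to be known by the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BoundedReader {
    // Reads at most maxBytes from the stream into a growable buffer,
    // then decodes the collected bytes once at the end.
    static String readUpTo(InputStream in, int maxBytes, Charset charset)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        // If the cap cuts the data short, the last character may be
        // truncated in a multi-byte encoding.
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for conn.getInputStream().
        InputStream in = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(readUpTo(in, 5, StandardCharsets.UTF_8));
    }
}
```

Decoding once after all bytes are collected also avoids the problem of a multi-byte character being split across two reads.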
OK, the code below will not work for you, because with HTTP/1.1 "chunking" the Content-Length header line is not sent at the start:
...
HttpURLConnection conn =
        (HttpURLConnection) new URL(url).openConnection();
InputStream is = conn.getInputStream();
// getContentLength() returns -1 when no Content-Length header was sent
int cLen = conn.getContentLength();
byte[] buf = new byte[cLen];
int nRead = 0;
while (nRead < cLen) {
    int n = is.read(buf, nRead, cLen - nRead);
    if (n < 0) break; // stream ended early
    nRead += n;
}
...
// do something with buf array
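Answer 2 (score: 1)
You could do your own buffering on top of the InputStreamReader by reading bigger chunks into a character array and appending the array contents to the StringBuilder.
But that would make your code slightly harder to understand, and I doubt it would be worth it.
Note that the proposal by Sean A.O. Harney reads raw bytes, so you would need to do the conversion to text on top of that.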
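The buffering described in this answer might look roughly like the following. It is a sketch under assumptions: the class ChunkedTextReader, the method readText, and the 8192-character chunk size are hypothetical, and the charset is passed in explicitly rather than taken from the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ChunkedTextReader {
    // Reads decoded characters in large chunks and appends each chunk
    // to the StringBuilder, instead of going line by line.
    static String readText(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        char[] chunk = new char[8192];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
                page.append(chunk, 0, n); // append only the chars just read
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for conn.getInputStream().
        InputStream in = new ByteArrayInputStream(
                "<p>one</p>\r\n<p>two</p>\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readText(in, StandardCharsets.UTF_8));
    }
}
```

Because the Reader does the decoding, characters that span two underlying byte reads are handled correctly, and the original line terminators are kept rather than replaced with the system's.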