Question

Is there a better way to read an entire HTML file into a single String variable than this:

    String content = "";
    try {
        BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
        String str;
        while ((str = in.readLine()) != null) {
            content +=str;
        }
        in.close();
    } catch (IOException e) {
    }

Answer 1

There is the IOUtils.toString(..) utility from Apache Commons.

If you use Guava, there are also Files.readLines(..) and Files.toString(..).

Answer 2

You should use a StringBuilder:

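StringBuilder contentBuilder = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new FileReader("mypage.html"))) {
    String str;
    while ((str = in.readLine()) != null) {
        // readLine() drops the line terminator; re-add it to keep the lines separate
        contentBuilder.append(str).append('\n');
    }
} catch (IOException e) {
    e.printStackTrace(); // do not swallow the error silently
}
String content = contentBuilder.toString();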

Answer 3

You can use JSoup. It is a very powerful HTML parser for Java.

Answer 4

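For string manipulation, use the StringBuilder or StringBuffer class to accumulate chunks of string data. Do not use the += operator on String objects. The String class is immutable, so you will create a large number of String objects at runtime, which hurts performance.

Use the .append() method of a StringBuilder / StringBuffer instance.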
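
A minimal sketch of the advice above. The class and method names (ChunkJoiner, join) are hypothetical and only for illustration; they do not come from any library. The helper accumulates pieces in one StringBuilder instead of creating a new String on every +=.

```java
import java.util.List;

public class ChunkJoiner {
    // Hypothetical helper: concatenates chunks using a single StringBuilder.
    static String join(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk); // appends into one growable buffer
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = join(List.of("<html>", "<body>", "</body>", "</html>"));
        System.out.println(html); // prints <html><body></body></html>
    }
}
```

StringBuffer offers the same append() API but is synchronized, so StringBuilder is usually the better fit for single-threaded code like this.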

Answer 5

I prefer to use Guava:


import com.google.common.base.Charsets;
import com.google.common.io.Files;
import java.io.File;

String content = Files.toString(new File("/path/to/file"), Charsets.UTF_8);

Answer 6

As Jean said, using a StringBuilder instead of += is better. But if you are looking for something simpler, Guava, IOUtils, and Jsoup are all good options.

Guava example:

String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();

IOUtils example:

// A bare path has no protocol, so new URL(...) would throw; open the file directly.
InputStream in = new FileInputStream("/path/to/mypage.html");
String content;

try {
    content = IOUtils.toString(in, StandardCharsets.UTF_8);
} finally {
    IOUtils.closeQuietly(in);
}

Jsoup example:

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();

or

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();

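Notes:

Files.readLines() and Files.toString()

These have been deprecated since Guava version 22.0 (May 22, 2017). As in the example above, Files.asCharSource() should be used instead. (version 22.0 release diffs)

IOUtils.toString(InputStream) and Charsets.UTF_8

Deprecated since Apache Commons-IO version 2.5 (May 6, 2016). As in the example above, IOUtils.toString should now be passed both the InputStream and a Charset. Likewise, Java 7's StandardCharsets should be used instead of Charsets. (deprecated Charsets.UTF_8)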
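
For comparison, here is a sketch that needs no third-party library. It uses java.nio.file.Files from the standard library together with StandardCharsets, which is not one of the options listed above, so treat it as an alternative rather than a replacement. The class and method names (ReadHtmlFile, readUtf8) are hypothetical, and the file is assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadHtmlFile {
    // Hypothetical helper: reads the whole file into memory and decodes it as UTF-8.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "mypage.html" is the file name used in the question; adjust as needed.
        String content = readUtf8(Paths.get("mypage.html"));
        System.out.println(content.length());
    }
}
```

On Java 11 and later, Files.readString(path) does the same thing in one call and decodes as UTF-8 by default.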

Answer 7

Here is a solution that retrieves a web page's HTML using only the standard Java libraries:

import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

String urlToRead = "https://google.com";
StringBuilder result = new StringBuilder(); // Accumulates all the HTML
try {
    URL url = new URL(urlToRead); // The URL to read
    HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // The actual connection to the web page
    conn.setRequestMethod("GET");
    // Assumes the page is UTF-8; try-with-resources closes the reader
    try (BufferedReader rd = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line; // An individual line of the web page HTML
        while ((line = rd.readLine()) != null) {
            result.append(line).append('\n'); // readLine() drops the line terminator
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}

System.out.println(result);


Answer 8

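import org.apache.commons.io.IOUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

try {
    // Decode explicitly as UTF-8 rather than relying on the platform default
    var content = new String(
            IOUtils.toByteArray(this.getClass().getResource("/index.html")),
            StandardCharsets.UTF_8);
} catch (IOException e) {
    e.printStackTrace();
}

// The code above is Java 10 (it uses var) and assumes index.html is available in the resources folder.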

Read an entire HTML file into a String?
