Question

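Some web pages contain special HTML characters in their content, but these are displayed as squares (unknown characters).

What can I do?

Can I convert the String containing the characters to another format (UTF-8)? The problem occurs in the conversion from InputStream to String. I really don't know what causes it.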

public HttpURLConnection openConnection(String url) {
    try {
        URL urlDownload = new URL(url);
        HttpURLConnection con = (HttpURLConnection) urlDownload.openConnection();
        con.setInstanceFollowRedirects(true);
        con.connect();
        return con;
    } catch (Exception e) {
        return null;
    }
}

private String getContent(HttpURLConnection con) {
    try {
        return IOUtils.toString(con.getInputStream());
    } catch (Exception e) {
        System.out.println("Erro baixando página: " + e);
        return null;
    }
}

page.setContent(getContent(openConnection(con)));

Answer 1

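You need to read the InputStream using an InputStreamReader with the charset specified in the Content-Type header of the downloaded HTML page. Otherwise the platform default charset is used, which is apparently not the same as the charset of the HTML in your case.

Reader reader = new InputStreamReader(input, "UTF-8"); // ...

You can of course also use an HTML reader/parser like Jsoup, which takes this into account automatically.

String html = Jsoup.connect("http://stackoverflow.com").get().html();

Update: as per your updated question, you seem to be using URLConnection to request the HTML page and IOUtils to convert the InputStream to a String. You need to use it as follows: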

String contentType = connection.getHeaderField("Content-Type");
String charset = "UTF-8"; // Default to UTF-8 when the header names no charset
if (contentType != null) { // The server may not send a Content-Type header at all
    for (String param : contentType.replace(" ", "").split(";")) {
        if (param.startsWith("charset=")) {
            charset = param.split("=", 2)[1];
            break;
        }
    }
}

String html = IOUtils.toString(input, charset);
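
The header parsing above can also be wrapped in a small helper. What follows is only a sketch and not part of the original answer: the class name ContentTypeCharset and the method charsetOf are hypothetical, and it assumes that the header may be missing, that the parameter name may be written in any case, and that the value may be quoted or name a charset the JVM does not support. In all of those fallback cases it returns UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or UTF-8 when the header is missing, carries no charset parameter,
    // or names a charset this JVM does not know.
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String param : contentType.split(";")) {
            String trimmed = param.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = trimmed.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: fall back to the default.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html; Charset=\"utf-8\""));
        System.out.println(charsetOf(null));
    }
}
```

The result could then be passed on as IOUtils.toString(input, charsetOf(contentType).name()), which uses the same String-based overload as the snippet above.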

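If you still can't get the characters right, then it can only mean that the console/viewer where you print those characters doesn't support the charset. For example, when you run the following in Eclipse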

System.out.println(html);

then you need to make sure that the Eclipse console uses UTF-8. You can set it via Window > Preferences > General > Workspace > Text file encoding.

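Or, if you are writing it to a file with FileWriter, then you should rather have used InputStream/OutputStream from the beginning, without converting it to a String first. If converting to a String really is an important step, then you need to write it with new OutputStreamWriter(output, "UTF-8").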
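
The two file-writing options from the last paragraph could look roughly like this. It is a minimal sketch, not the answerer's code: the class name PageSaver and its methods copy and write are hypothetical, and in-memory streams stand in for the connection's InputStream and the target file's OutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class PageSaver {

    // Option 1: copy the raw bytes. Nothing is decoded, so no charset is involved
    // and the saved file keeps whatever encoding the server sent.
    static void copy(InputStream input, OutputStream output) throws IOException {
        byte[] buffer = new byte[8192];
        int length;
        while ((length = input.read(buffer)) != -1) {
            output.write(buffer, 0, length);
        }
    }

    // Option 2: the page is already a String, so encode it explicitly as UTF-8
    // instead of relying on the platform default that FileWriter would use.
    static void write(String html, OutputStream output) throws IOException {
        Writer writer = new OutputStreamWriter(output, StandardCharsets.UTF_8);
        writer.write(html);
        writer.flush(); // Flush only; closing the stream is left to the caller.
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" contains one non-ASCII character (e with acute accent).
        byte[] page = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream copied = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(page), copied);
        System.out.println("bytes copied: " + copied.size());

        ByteArrayOutputStream written = new ByteArrayOutputStream();
        write("<p>caf\u00e9</p>", written);
        System.out.println("bytes written: " + written.size());
    }
}
```

With a real download, the first argument to copy would be con.getInputStream() and the second a FileOutputStream for the target file.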
