Question

Hello, I have an HTML page that I am scraping data from. The page uses the UTF-8 charset and contains German and other European characters:

<meta charset="utf-8">

But when I try to decode it in Java, neither ISO-8859-1 nor UTF-8 works. I can't get the European characters and instead get values like these:

Bayern MÃ¼nchen
Bor. MÃ¶nchengladbach
JÃ©rÃ´me Boateng

Here is my code:

    URL myUrl = new URL("http://www.weltfussball.de/spielplan/bundesliga-"
            + season + "-spieltag/" + gameDay + "/");

    in = new BufferedReader(new InputStreamReader(myUrl.openStream(), "ISO-8859-1"));

    while ((line = in.readLine()) != null) {
        all += line;
    }

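One thing I noticed is that when I print String line; it prints all the Latin characters correctly on the Java console, but as soon as I concatenate it onto String all; the characters get messed up... Can someone suggest a solution?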

Answer 1

First, check whether the page really is UTF-8, as it claims to be:

final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

try (
    final InputStream in = url.openStream();
    final Reader reader = new InputStreamReader(in, decoder);
) {
    /* read the contents */
}

If this program throws a MalformedInputException, then you know the page is lying.
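
The same check can be tried without touching the network. This is only a minimal sketch: the class name Utf8Check and the method isValidUtf8 are made up for illustration, and the sample bytes are built in memory rather than fetched from the page.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the answer above.
class Utf8Check {

    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on the first bad byte sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "M\u00FCnchen" is "München", written with an escape so the source file encoding does not matter.
        byte[] utf8 = "M\u00FCnchen".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "M\u00FCnchen".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: the lone byte 0xFC is not valid UTF-8
    }
}
```

Note that a successful decode only shows the bytes are well-formed UTF-8; it does not prove the text was meant to be UTF-8, although for text with accented letters a false positive is unlikely.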

Given your output, I suspect the problem is that your display is not reading UTF-8 correctly.
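
Whichever layer is at fault, the garbled strings in the question match the pattern you get when UTF-8 bytes are interpreted as ISO-8859-1. Here is a small sketch that reproduces it; MojibakeDemo, garble and repair are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the question or the answers.
class MojibakeDemo {

    // Encode as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Bayern M\u00FCnchen";           // "Bayern München"
        String garbled = garble(original);
        // U+00C3 U+00BC are the two characters "Ã¼" seen in the question's output.
        System.out.println(garbled.equals("Bayern M\u00C3\u00BCnchen")); // true
        System.out.println(repair(garbled).equals(original));            // true
    }
}
```

The repair step is mainly useful as a diagnostic. If it restores the expected text, the bytes were UTF-8 all along, and the cleaner fix is to decode them as UTF-8 in the first place.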

Answer 2

This always works.

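// Read every byte, then decode with an explicit charset.
// available() is only an estimate, so readAllBytes() (Java 9+) is used instead.
InputStream is = getClass().getResourceAsStream(myUrl);
byte[] b = is.readAllBytes();
String body = new String(b, StandardCharsets.UTF_8); // or whichever charset the data really uses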
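
For older Java versions, or when the data comes from a URL stream as in the question, a reader loop with an explicit charset does the same job. This is only a sketch: StreamText and readAll are hypothetical names, and the charset passed in is assumed to be the one the bytes were really written in.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the answer above.
class StreamText {

    // Reads the stream to the end and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n); // keeps line breaks, unlike a readLine() loop
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In-memory stand-in for the page bytes; "J\u00E9r\u00F4me" is "Jérôme".
        byte[] page = "J\u00E9r\u00F4me Boateng\n".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(text.equals("J\u00E9r\u00F4me Boateng\n")); // true
    }
}
```

With the question's code this would be called as readAll(myUrl.openStream(), StandardCharsets.UTF_8), inside whatever IOException handling the surrounding method already has. Using a StringBuilder also avoids the repeated string copies that all += line causes.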

Answer 3

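Make sure "ISO-8859-1" is the only charset being used to read the page; otherwise it will not work. I had the same problem today. I spent 30 minutes reading this article, http://www.joelonsoftware.com/articles/Unicode.html, and then I solved my problem. Now I know what encodings are, why people use them, what they are good for, and what their limitations are.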

To solve my problem, I just replaced this tag in the header template file:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

with this one:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

I reloaded the browser, and now the European names with the odd characters print correctly :)
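
The point is that the declared charset and the real encoding of the bytes have to agree. On the Java side, a client can read the declared charset out of a content-type value like the ones in the tags above instead of hard-coding it. A rough sketch, assuming the value has the usual "type; charset=name" shape; ContentTypeCharset and fromContentType are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the answer above.
class ContentTypeCharset {

    // Returns the charset named in a content-type value, or the fallback if none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: fall back rather than fail.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset declared = fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(declared.equals(StandardCharsets.ISO_8859_1)); // true
    }
}
```

This only tells you what the page claims. If the claim is wrong, as the first answer suspects it might be, the decoded text will still be garbled.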

Sorry for my bad English!

UTF-8 & ISO-8859-1 not working for decoding European character sets in Java
