Question

I am parsing some image links from Wikipedia, and I came across this one on http://en.wikipedia.org/wiki/Special:Export/Diego_Forl%C3%A1n

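When I use the deprecated single-argument URLEncoder.encode, the accented character is encoded correctly, but when I pass the "UTF-8" argument it fails. The text on Wikipedia is UTF-8, AFAIK.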

Diego+Forl%C3%A1n+vs+the+Netherlands.jpg is correct, whereas Diego+Forl%E2%88%9A%C2%B0n+vs+the+Netherlands.jpg is incorrect.

scala> first
res24: String = Diego Forlán vs the Netherlands.jpg

scala> java.net.URLEncoder.encode(first, "UTF-8")
res25: java.lang.String = Diego+Forl%E2%88%9A%C2%B0n+vs+the+Netherlands.jpg

scala> java.net.URLEncoder.encode(first)
<console>:33: warning: method encode in object URLEncoder is deprecated: see corresponding Javadoc for more information.
              java.net.URLEncoder.encode(first)
                                  ^
res26: java.lang.String = Diego+Forl%C3%A1n+vs+the+Netherlands.jpg

Answer 1

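My guess is that first is already corrupted, and that it only renders correctly because your console configuration masks the transcoding error.

You can confirm this by printing the UTF-16 code units in the string: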

for(c<-first.toCharArray()){print("\\u%04x".format(c.toInt))}

There is probably a more elegant way to write that.
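One possibly tidier variant, as a minimal sketch: the helper name codeUnits is made up for this example, and the string literals use \u escapes so the result does not depend on how the source file or console is encoded.

```scala
// Hypothetical helper: render every UTF-16 code unit of a string as a \uXXXX escape.
def codeUnits(s: String): String =
  s.map(c => "\\u%04x".format(c.toInt)).mkString

// An intact "á" is a single code unit.
val intact = codeUnits("\u00e1")        // \u00e1

// The corruption suspected here shows up as two code units instead.
val broken = codeUnits("\u221a\u00b0")  // \u221a\u00b0
```

Calling codeUnits(first) in the same REPL session should show which of the two cases applies to the string in the question.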

If the code point were encoded correctly, it would be:

U+00e1      á       \u00e1

I suspect that somewhere UTF-8 encoded data is being decoded with a MacRoman decoder.

codepoint   glyph   escaped    x-MacRoman     info
=======================================================================
U+221a      √       \u221a     c3,            MATHEMATICAL_OPERATORS, MATH_SYMBOL
U+00b0      °       \u00b0     a1,            LATIN_1_SUPPLEMENT, OTHER_SYMBOL
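The suspected round trip can be reproduced with a short sketch. It assumes the JVM provides the extended x-MacRoman charset (most full JDK builds do), and the value names are invented for this example. It also suggests why the deprecated overload appeared to work: that overload uses the platform default encoding, so on a machine whose default is MacRoman it would turn the two wrong characters back into the original UTF-8 bytes.

```scala
import java.net.URLEncoder
import java.nio.charset.{Charset, StandardCharsets}

// Assumption: "x-MacRoman" is available in this JVM (it lives in the extended charsets).
val macRoman = Charset.forName("x-MacRoman")

// The intended text, written with an escape to avoid source-encoding issues.
val intended = "Forl\u00e1n"

// Simulate the bug: UTF-8 bytes (c3 a1 for the accented letter) decoded as MacRoman.
val corrupted = new String(intended.getBytes(StandardCharsets.UTF_8), macRoman)

// Encoding the corrupted string as UTF-8 reproduces the bad URL from the question.
val viaUtf8 = URLEncoder.encode(corrupted, "UTF-8")          // Forl%E2%88%9A%C2%B0n

// Encoding it as MacRoman undoes the damage, which hides the problem.
val viaMacRoman = URLEncoder.encode(corrupted, "x-MacRoman") // Forl%C3%A1n

// Encoding the intact string as UTF-8 gives the URL Wikipedia expects.
val clean = URLEncoder.encode(intended, "UTF-8")             // Forl%C3%A1n
```

If this matches what you see, the fix belongs at the point where the Wikipedia data is read (decode it as UTF-8 explicitly), not in the URLEncoder call.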

Why does java.net.URLEncoder.encode work in its deprecated default form but not when I specify a character set?
