I found that other people have run into the same problem, and they solved it by specifying UTF-8 in the InputStreamReader constructor:
https://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/
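That doesn't work for me, and I can't figure out why. Whatever I try, I keep getting escaped Unicode values (a backslash, a "u", then four hex digits) instead of the actual characters. What am I doing wrong here? Thanks in advance!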
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// The InputStream passed in is a FileInputStream:
public void load(InputStream is) throws Exception {
    // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me:
    try (BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);
        }
    }
}
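Please note: this is not a font problem. I know this because if I load the same file with ResourceBundle, the Chinese characters print correctly in my IDE console. But whenever I read the file manually through a FileInputStream, something keeps turning the characters into the backslash-u notation, even though I tell it to use UTF-8. I have also tried changing the project's encoding and the JVM encoding arguments, but still no joy. Thanks again for any suggestions.
Also, using ResourceBundle as the final solution is not an option for me. There are legitimate reasons why it is not a good fit for this particular project and why I need to do this explicitly myself.
EDIT: I tried pulling the bytes out of the InputStream manually, completely bypassing InputStreamReader and its constructor, which seemed to be ignoring my encoding argument. The result is exactly the same: backslash-u notation instead of the proper characters. I'm struggling to understand why this doesn't work for me the way it apparently does for everyone else. Could there be a system or OS setting somewhere that overrides Java's ability to handle Unicode correctly? I'm using Java 1.8.0_65 (64-bit) on Windows 7, version 6.1 (also 64-bit).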
public void load(InputStream is) throws Exception {
    String line = null;
    try {
        while ((line = readLine(is)) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);
        }
    } finally {
        is.close();
    }
}
// Needs: java.io.ByteArrayOutputStream, java.io.IOException,
//        java.io.InputStream, java.nio.charset.StandardCharsets

// Reads bytes up to the next '\n' (or end of stream) and decodes them as UTF-8.
// Returns null only when the stream is exhausted, so blank lines do not end the loop.
private String readLine(InputStream is) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int b;
    // Keep the value as an int: casting to byte first would make 0xFF look like end-of-stream.
    while ((b = is.read()) != -1) {
        if (b == '\n') {
            return bytesToString(buffer);
        }
        buffer.write(b);
    }
    return buffer.size() == 0 ? null : bytesToString(buffer);
}

private String bytesToString(ByteArrayOutputStream buffer) {
    // Decode explicitly as UTF-8 instead of relying on the platform default charset.
    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
}
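Answer 0 (score: 0)
In case anyone else runs into the same trouble, I was able to find a solution. Since ResourceBundle always did the right thing for me, I dug into why, and found that java.util.Properties does all the magic in its loadConvert() function. In other words, the file stores the characters as literal backslash-u escape text, so the charset passed to InputStreamReader never comes into play. After BufferedReader hands me a line of text from the file, I need to explicitly decode the Unicode escapes in that String, like this: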
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public void load(InputStream is) throws Exception {
    // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me:
    try (BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);
            line = decodeUni(line);
            // The following prints "decoded line: chinese = 你好" exactly as it should!
            System.out.println("decoded line: " + line);
        }
    }
}
// Converts encoded "\\uxxxx" sequences (and \t, \r, \n, \f) to the characters they stand for.
private String decodeUni(String string) {
    char[] charsIn = string.toCharArray();
    int end = charsIn.length;
    char[] charsOut = new char[end];
    int outLen = 0;
    int off = 0;
    while (off < end) {
        char ch = charsIn[off++];
        if (ch != '\\' || off == end) {
            // Not an escape (or a lone backslash at the very end), leave as-is.
            charsOut[outLen++] = ch;
            continue;
        }
        ch = charsIn[off++];
        if (ch == 'u') {
            // Starts with "\\u": convert the four hex digits to the correct character.
            if (end - off < 4) {
                throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
            }
            int value = 0;
            for (int i = 0; i < 4; i++) {
                ch = charsIn[off++];
                switch (ch) {
                    case '0': case '1': case '2': case '3': case '4':
                    case '5': case '6': case '7': case '8': case '9':
                        value = (value << 4) + ch - '0';
                        break;
                    case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
                        value = (value << 4) + 10 + ch - 'a';
                        break;
                    case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                        value = (value << 4) + 10 + ch - 'A';
                        break;
                    default:
                        throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
                }
            }
            charsOut[outLen++] = (char) value;
        } else {
            // Starts with a backslash but not "\\u", handle the other possible escaped characters.
            switch (ch) {
                case 't': ch = '\t'; break;
                case 'r': ch = '\r'; break;
                case 'n': ch = '\n'; break;
                case 'f': ch = '\f'; break;
                default: break; // Any other escaped character stands for itself.
            }
            charsOut[outLen++] = ch;
        }
    }
    // Note: trim() also strips leading and trailing whitespace from the decoded line.
    return new String(charsOut, 0, outLen).trim();
}
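If the file really is in key = value form, as the sample line "chinese = \u4f60\u597d" suggests, it may be possible to skip the hand-written decoder and let java.util.Properties do the unescaping through its public load(Reader) method, without involving ResourceBundle at all. This is only a sketch under that assumption; the class name PropertiesDecodeSketch and the helper parse are made up for illustration, and it will not fit if the file has a layout that Properties cannot parse.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

class PropertiesDecodeSketch {
    // Loads key/value pairs from the reader; Properties decodes the
    // backslash-u escapes in keys and values while loading.
    static Properties parse(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file content: the escapes are literal text here,
        // exactly as they would appear inside the file.
        String content = "chinese = \\u4f60\\u597d\n";
        Properties props = parse(new StringReader(content));
        System.out.println("chinese = " + props.getProperty("chinese"));
    }
}
```

For a real file, the StringReader would be replaced by the same InputStreamReader over the FileInputStream used in load() above. The trade-off is that Properties keeps only key/value pairs, so line order, comments, and duplicate keys are lost, which the line-by-line decodeUni approach preserves.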