How do I correctly read Unicode from an InputStream?

Time: 2017-08-24 15:45:24

Tags: utf-8 inputstream

I found other people who ran into the same problem; theirs was solved by specifying UTF-8 in the InputStreamReader constructor:

Reading InputStream as UTF-8

https://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/

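This doesn't work for me, and I don't know why. No matter what I try, I keep getting the escaped Unicode values (backslash-u followed by four hex digits) instead of the actual characters. What am I doing wrong here? Thanks in advance!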

// InputStream is is a FileInputStream:
public void load(InputStream is) throws Exception {

    BufferedReader br = null;

    try {
        // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me:
        br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line = null;         
        while ((line = br.readLine()) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);
        }
    } finally {
        if (br != null) {
            br.close();
        }
    }       
}

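Please note: this is not a font problem. I know this because if I use a ResourceBundle on the same file, the Chinese characters print correctly in the IDE console. But whenever I read the file manually with a FileInputStream, something keeps turning the characters into the backslash-u form, even when I tell it to use UTF-8. I have also tried changing the project's encoding JVM arguments, still with no luck. Thanks again for any suggestions.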

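Also, using ResourceBundle as the final solution is not an option for me. There are legitimate reasons why it does not fit this particular project and why I need to do this explicitly myself.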

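Edit: I tried pulling the bytes out of the InputStream manually, bypassing InputStreamReader and its constructor entirely, since it seemed to be ignoring my encoding argument. This produces exactly the same behavior: the backslash-u form instead of the proper characters. I'm struggling to understand why this doesn't work for me the way it seems to work for everyone else. Could there be a system/OS setting somewhere that overrides Java's ability to handle Unicode properly? I'm using Java 1.8.0_65 (64-bit) on Windows 7 version 6.1 (also 64-bit).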

public void load(InputStream is) throws Exception {     
    String line = null;     
    try {
        while ((line = readLine(is)) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);                
        }           
    } finally {
        is.close();
    }       
}

// Requires java.io.ByteArrayOutputStream and java.nio.charset.StandardCharsets.
private String readLine(InputStream is) throws Exception {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    while (true) {
        // read() returns an int: -1 at end of stream, otherwise 0..255.
        // Compare before narrowing to byte so the data byte 0xFF is not mistaken for EOF.
        int b = is.read();
        if (b == -1) {
            // End of stream: return the last partial line, or null if nothing is left.
            return buffer.size() == 0 ? null : bytesToString(buffer.toByteArray());
        }
        if (b == '\n') {
            // A blank line yields "" rather than null, so it does not end the caller's loop early.
            return bytesToString(buffer.toByteArray());
        }
        buffer.write(b);
    }
}

private String bytesToString(byte[] bytes) {
    // Decode explicitly as UTF-8 instead of relying on the platform default charset.
    return new String(bytes, StandardCharsets.UTF_8);
}
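
One way to check what is actually stored in the file is to compare raw bytes. The sketch below is only an illustration (the class name EscapeCheck and the helper toHex are made up, and it uses in-memory strings rather than the real file): it shows that a real character encoded as UTF-8 and a six-character ASCII escape sequence produce entirely different bytes, and no charset setting will turn one into the other.

```java
import java.nio.charset.StandardCharsets;

public class EscapeCheck {
    // Renders each byte as two lowercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A real character stored as UTF-8 takes three bytes.
        byte[] real = "\u4f60".getBytes(StandardCharsets.UTF_8);
        // The escaped form is six ASCII characters: backslash, 'u', '4', 'f', '6', '0'.
        byte[] escaped = "\\u4f60".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(real));    // e4 bd a0
        System.out.println(toHex(escaped)); // 5c 75 34 66 36 30
    }
}
```

If a hex dump of the file shows the second pattern (5c 75 ...), the reader is decoding correctly and the escapes are part of the file's text.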

1 Answer:

Answer 0 (score: 0)

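In case anyone else runs into the same trouble, I was able to find a solution. Since ResourceBundle always did the right thing for me, I dug into why, and found that java.util.Properties does all the magic in its loadConvert() function. In other words, the file itself contains the escape sequences as plain text, so the UTF-8 decoding was never the problem. After BufferedReader gives me a line of text from the file, I need to explicitly decode the Unicode escape sequences in that String, like so: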

public void load(InputStream is) throws Exception {

    BufferedReader br = null;

    try {
        // Decode the bytes as UTF-8. The escape sequences in the file are plain ASCII text,
        // so they pass through this step unchanged and are decoded below.
        br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line = null;         
        while ((line = br.readLine()) != null) {
            // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
            System.out.println("got line: " + line);
            line = decodeUni(line);
            // The following prints "decoded line: chinese = 你好" exactly as it should!
            System.out.println("decoded line: " + line);
        }
    } finally {
        if (br != null) {
            br.close();
        }
    }       
}

// Converts encoded "\\uxxxx" to unicode chars
private String decodeUni(String string) {

    char[] charsIn = string.toCharArray();
    int len = charsIn.length;
    char[] charsOut = new char[len];
    char ch;
    int outLen = 0;
    int off = 0;
    int end = off + len;

    while (off < end) {
        ch = charsIn[off++];
        // Does aChar start with "\\u" ?
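        if (ch == '\\') {
            if (off >= end) {
                // A trailing backslash has nothing to escape; keep it as-is.
                charsOut[outLen++] = ch;
                break;
            }
            ch = charsIn[off++];
            if(ch == 'u') {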
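                // Yep! Convert the hex part to the correct character.
                if (end - off < 4) {
                    // Fewer than four characters remain, so the escape is truncated.
                    throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
                }
                int value = 0;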
                for (int i = 0; i < 4; i++) {
                    ch = charsIn[off++];  
                    switch (ch) {
                        case '0': case '1': case '2': case '3': case '4':
                        case '5': case '6': case '7': case '8': case '9': {
                            value = (value << 4) + ch - '0';
                            break;
                        }
                        case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': {
                            value = (value << 4) + 10 + ch - 'a';
                            break;
                        }
                        case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': {
                            value = (value << 4) + 10 + ch - 'A';
                            break;
                        }
                        default: throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
                    }
                }
                charsOut[outLen++] = (char)value;
            } else {
                // Starts with a slash but not "\\u", handle the other possible escaped characters.
                switch (ch) {
                    case 't':
                        ch = '\t';
                        break;
                    case 'r':
                        ch = '\r';
                        break;
                    case 'n':
                        ch = '\n'; 
                        break;
                    case 'f':
                        ch = '\f';
                        break;
                    default:
                        break;
                }
                charsOut[outLen++] = ch;
            }
        } else {
            // Doesn't start with a slash, leave as-is.
            charsOut[outLen++] = ch;
        }
    }
    return new String(charsOut, 0, outLen).trim();
}
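
If the file follows the key = value layout shown in the output above, it may be possible to skip the hand-written decoder and let java.util.Properties do the work directly, without going through ResourceBundle. The following is a minimal sketch under that assumption; the class name PropertiesEscapeDemo, the helper lookup, and the sample content are hypothetical. Properties.load(Reader) decodes the escape sequences while parsing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesEscapeDemo {
    // Parses key=value text and returns the decoded value stored under key.
    static String lookup(String content, String key) {
        Properties props = new Properties();
        try {
            // load(Reader) converts backslash-u escapes into real characters.
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Hypothetical file content: the escapes are literal ASCII text.
        String content = "chinese = \\u4f60\\u597d\n";
        String value = lookup(content, "chinese");
        // Compare against the real characters instead of printing them,
        // so the result does not depend on the console encoding.
        System.out.println(value.equals("\u4f60\u597d")); // true
    }
}
```

For a real file, the StringReader would be replaced by an InputStreamReader over the FileInputStream. Whether Properties is acceptable depends on the same project constraints that ruled out ResourceBundle, so the manual decodeUni approach above remains the fallback.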