Java - Reading UTF-8 bytes from a File into a String in a system-independent way

Time: 2015-10-08 12:34:28

Tags: java utf-8

How do I accurately read a UTF-8 encoded file into a String in Java?

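When I change the encoding of this .java file to UTF-8 (Eclipse > right-click App.java > Properties > Resource > Text file encoding), it works from within Eclipse but not from the command line. It seems Eclipse sets the file.encoding parameter when it runs App.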

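Why should the encoding of the source file have any effect on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files in different encodings. Once a file's encoding is known, I must be able to read it into a String regardless of the value of file.encoding.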

The contents of the UTF-8 file are below:

English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.

- End of file -

The code is below. My observations are in the comments inside it.

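import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {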
public static void main(String[] args) {
    String slash = System.getProperty("file.separator");
    File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
    File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
    File outputUtfByteWrittenFile = new File(
            "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
    outputUtfFile.delete();
    outputUtfByteWrittenFile.delete();

    try {

        /*
         * read a utf8 text file with internationalized strings into bytes.
         * there should be no information loss here, when read into raw bytes.
         * We are sure that this file is UTF-8 encoded. 
         * Input file created using Notepad++. Text copied from Google translate.
         */
        byte[] fileBytes = readBytes(inputUtfFile);

        /*
         * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
         */
        String str = new String(fileBytes, StandardCharsets.UTF_8);

        /*
         * The console is incapable of displaying this string.
         * So we write into another file. Open in notepad++ to check.
         */
        ArrayList<String> list = new ArrayList<>();
        list.add(str);
        writeLines(list, outputUtfFile);

        /*
         * Works fine when I read bytes and write bytes. 
         * Open the other output file in notepad++ and check. 
         */
        writeBytes(fileBytes, outputUtfByteWrittenFile);

        /*
         * I am using JDK 8u60.
         * I tried running this on command line instead of eclipse. Does not work.
         * I tried using apache commons io library. Does not work. 
         *  
         * This means that new String(bytes, charset); does not work correctly. 
         * There is no real effect of specifying charset to string.
         */
    } catch (IOException e) {
        e.printStackTrace();
    }

}

public static void writeLines(List<String> lines, File file) throws IOException {
    BufferedWriter writer = null;
    OutputStreamWriter osw = null;
    OutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        osw = new OutputStreamWriter(fos);
        writer = new BufferedWriter(osw);
        String lineSeparator = System.getProperty("line.separator");
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            writer.write(line);
            if (i < lines.size() - 1) {
                writer.write(lineSeparator);
            }
        }
    } catch (IOException e) {
        throw e;
    } finally {
        close(writer);
        close(osw);
        close(fos);
    }
}

public static byte[] readBytes(File file) {
    FileInputStream fis = null;
    byte[] b = null;
    try {
        fis = new FileInputStream(file);
        b = readBytesFromStream(fis);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fis);
    }
    return b;
}

public static void writeBytes(byte[] inBytes, File file) {
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        writeBytesToStream(inBytes, fos);
        fos.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fos);
    }
}

// Closes the stream if it was opened; a null check avoids a
// NullPointerException when the constructor failed before assignment.
public static void close(InputStream inStream) {
    if (inStream != null) {
        try {
            inStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void close(OutputStream outStream) {
    if (outStream != null) {
        try {
            outStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void close(Writer writer) {
    if (writer != null) {
        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
    int bytesread = -1;
    byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
    long count = 0;
    bytesread = readStream.read(b);
    while (bytesread != -1) {
        writeStream.write(b, 0, bytesread);
        count += bytesread;
        bytesread = readStream.read(b);
    }
    return count;
}
public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
    ByteArrayOutputStream writeStream = null;
    byte[] byteArr = null;
    writeStream = new ByteArrayOutputStream();
    try {
        copy(readStream, writeStream);
        writeStream.flush();
        byteArr = writeStream.toByteArray();
    } finally {
        close(writeStream);
    }
    return byteArr;
}
public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
    ByteArrayInputStream bis = null;
    bis = new ByteArrayInputStream(inBytes);
    try {
        copy(bis, writeStream);
    } finally {
        close(bis);
    }
}
}

Edit: for @JB Nizet, and everyone :)

//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work. 
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

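I need to specify the byte encoding when reading bytes into a String. I need to specify the byte encoding when writing bytes from a String to a file.

Once I have a String in the JVM, I no longer need to remember the source byte encoding, right?

When I write to a file, it should convert the String to my machine's default Charset (be it UTF-8, ASCII, or cp1252). That fails. UTF-16 BE fails too. Why do some Charsets fail?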

1 answer:

Answer 0 (score: 5)

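The encoding of the Java source file is indeed irrelevant. And the reading part of your code is correct (although inefficient). What is incorrect is the writing part: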

osw = new OutputStreamWriter(fos);

should be replaced by

osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);

Otherwise, you use the default encoding (which doesn't seem to be UTF-8 on your system) instead of UTF-8.
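As a minimal sketch of the same idea, assuming Java 7+ and the java.nio.file API (which the question's code does not use), both directions can name the charset explicitly so that file.encoding never comes into play. The class and method names here (Utf8RoundTrip, readUtf8, writeUtf8) are hypothetical, and the sample text uses Unicode escapes so that the source file's own encoding cannot affect the literals.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {

    /** Reads the whole file and decodes its bytes as UTF-8, regardless of file.encoding. */
    public static String readUtf8(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes the text as UTF-8 and writes the bytes, regardless of file.encoding. */
    public static void writeUtf8(Path path, String text) throws IOException {
        Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for utf8text_out.txt from the question.
        Path tmp = Files.createTempFile("utf8text", ".txt");
        try {
            // Korean "annyeong" and Russian "Privet", written as escapes.
            String original = "\uc548\ub155 \u041f\u0440\u0438\u0432\u0435\u0442";
            writeUtf8(tmp, original);
            String roundTripped = readUtf8(tmp);
            System.out.println(original.equals(roundTripped)); // prints true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the charset is stated on both the read and the write, the result should be the same in Eclipse and on the command line.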

Note that Java accepts forward slashes in file paths, even on Windows. You could simply write

File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");
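If you would rather not embed any separator at all, one possible alternative (an assumption on my part, not something the question's code uses) is java.nio.file.Paths.get, which joins path elements with the platform separator. The PathDemo class and the relative "sources/TestUtfRead" location are hypothetical stand-ins for the C: path above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathDemo {

    /** Builds sources/TestUtfRead/<fileName>, relative to the working directory. */
    public static Path inTestDir(String fileName) {
        // Paths.get inserts the platform's separator between the elements.
        return Paths.get("sources", "TestUtfRead", fileName);
    }

    public static void main(String[] args) {
        Path input = inTestDir("utf8text.txt");
        System.out.println(input.getFileName());  // prints utf8text.txt
        System.out.println(input.getNameCount()); // prints 3
    }
}
```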

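Edit:

"Once I have a String in the JVM, I no longer need to remember the source byte encoding, right?"

Yes, you're right.

"When I write to a file, it should convert the String to my machine's default Charset (be it UTF-8, ASCII, or cp1252). That fails."

If you don't specify any encoding, Java will indeed use the platform default encoding to convert characters to bytes. If you specify an encoding (as suggested at the beginning of this answer), it uses the encoding you tell it to use.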
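To see which default your JVM actually picks, a small sketch along these lines may help. The DefaultCharsetDemo class and its encode helper are hypothetical names, and the byte values printed for the default charset will differ from machine to machine, so no output is shown for them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {

    /** Encodes text through an OutputStreamWriter; a null charset means "platform default". */
    public static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = (charset == null)
                ? new OutputStreamWriter(out)            // what the question's code does
                : new OutputStreamWriter(out, charset)) { // what this answer suggests
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442"; // Russian "Privet"
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("default bytes: " + Arrays.toString(encode(text, null)));
        System.out.println("UTF-8 bytes:   " + Arrays.toString(encode(text, StandardCharsets.UTF_8)));
    }
}
```

If the two byte arrays differ, the default charset is not UTF-8, which is the situation the question describes.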

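But not every encoding can represent all Unicode characters the way UTF-8 can. ASCII, for example, supports only 128 different characters. Cp1252, AFAIK, supports only 256 characters. So the encoding succeeds, but each unencodable character is replaced by a special character (usually a question mark, '?'), which means: I can't encode this Thai or Russian character because it is not part of the character set I support.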
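The replacement behaviour can be observed directly. In this sketch, the UnmappableDemo class and the fitsInAscii helper are hypothetical names; String.getBytes(Charset) and CharsetEncoder.canEncode are standard JDK calls. US-ASCII is used because it is guaranteed to be available on every Java platform.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {

    /** Returns true if every character of text can be represented in US-ASCII. */
    public static boolean fitsInAscii(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442"; // Russian "Privet", 6 letters
        // getBytes does not throw: each unmappable character becomes the
        // charset's replacement byte, which for US-ASCII is '?'.
        byte[] ascii = russian.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints ??????
        System.out.println(fitsInAscii(russian));        // prints false
        System.out.println(fitsInAscii("Hello World.")); // prints true
    }
}
```

Checking with canEncode before writing is one way to detect this silent loss instead of discovering it in the output file.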

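The UTF-16 encoding should be fine. But make sure your text editor is also configured to use UTF-16 when reading and displaying the file's contents. If it is configured to use another encoding, the displayed content won't be correct.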
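One detail that may explain why an editor guesses differently for different UTF-16 variants: in Java, StandardCharsets.UTF_16 writes a byte-order mark when encoding, while UTF_16BE and UTF_16LE do not, so the editor has to detect or be told the byte order. The sketch below just prints the bytes; the Utf16Demo class and its hex helper are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {

    /** Formats bytes as space-separated upper-case hex, e.g. "FE FF 00 41". */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A";
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16)));   // prints FE FF 00 41
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // prints 00 41
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // prints 41 00

        // Decoding with the wrong byte order yields a different character.
        byte[] littleEndian = text.getBytes(StandardCharsets.UTF_16LE);
        String wrong = new String(littleEndian, StandardCharsets.UTF_16BE);
        System.out.println(text.equals(wrong)); // prints false
    }
}
```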