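How do I accurately read a UTF-8 encoded file into a String in Java?
When I change the encoding of this .java file to UTF-8 (Eclipse > right-click App.java > Properties > Resource > Text file encoding), the program works when run from Eclipse but not from the command line. It seems that Eclipse sets the file.encoding parameter when it runs App.
Why should the encoding of the source file have any effect on creating a String from bytes? What is the foolproof way to create a String from bytes when the encoding is known? I may have files in different encodings. Once a file's encoding is known, I must be able to read it into a String regardless of the value of file.encoding.
The contents of the UTF-8 file are below: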
English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.
- End of file -
The code is below. My observations are in the comments inside it.
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {
    public static void main(String[] args) {
        String slash = System.getProperty("file.separator");
        File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
        File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
        File outputUtfByteWrittenFile = new File(
                "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
        outputUtfFile.delete();
        outputUtfByteWrittenFile.delete();
        try {
            /*
             * read a utf8 text file with internationalized strings into bytes.
             * there should be no information loss here, when read into raw bytes.
             * We are sure that this file is UTF-8 encoded.
             * Input file created using Notepad++. Text copied from Google translate.
             */
            byte[] fileBytes = readBytes(inputUtfFile);
            /*
             * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
             */
            String str = new String(fileBytes, StandardCharsets.UTF_8);
            /*
             * The console is incapable of displaying this string.
             * So we write into another file. Open in notepad++ to check.
             */
            ArrayList<String> list = new ArrayList<>();
            list.add(str);
            writeLines(list, outputUtfFile);
            /*
             * Works fine when I read bytes and write bytes.
             * Open the other output file in notepad++ and check.
             */
            writeBytes(fileBytes, outputUtfByteWrittenFile);
            /*
             * I am using JDK 8u60.
             * I tried running this on command line instead of eclipse. Does not work.
             * I tried using apache commons io library. Does not work.
             *
             * This means that new String(bytes, charset); does not work correctly.
             * There is no real effect of specifying charset to string.
             */
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeLines(List<String> lines, File file) throws IOException {
        BufferedWriter writer = null;
        OutputStreamWriter osw = null;
        OutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            osw = new OutputStreamWriter(fos);
            writer = new BufferedWriter(osw);
            String lineSeparator = System.getProperty("line.separator");
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                writer.write(line);
                if (i < lines.size() - 1) {
                    writer.write(lineSeparator);
                }
            }
        } catch (IOException e) {
            throw e;
        } finally {
            close(writer);
            close(osw);
            close(fos);
        }
    }

    public static byte[] readBytes(File file) {
        FileInputStream fis = null;
        byte[] b = null;
        try {
            fis = new FileInputStream(file);
            b = readBytesFromStream(fis);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fis);
        }
        return b;
    }

    public static void writeBytes(byte[] inBytes, File file) {
        FileOutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            writeBytesToStream(inBytes, fos);
            fos.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fos);
        }
    }

    // Null check added: the stream is still null if its constructor threw.
    public static void close(InputStream inStream) {
        if (inStream != null) {
            try {
                inStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Null check added: the stream is still null if its constructor threw.
    public static void close(OutputStream outStream) {
        if (outStream != null) {
            try {
                outStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void close(Writer writer) {
        if (writer != null) {
            try {
                writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
        int bytesread = -1;
        byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
        long count = 0;
        bytesread = readStream.read(b);
        while (bytesread != -1) {
            writeStream.write(b, 0, bytesread);
            count += bytesread;
            bytesread = readStream.read(b);
        }
        return count;
    }

    public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
        ByteArrayOutputStream writeStream = null;
        byte[] byteArr = null;
        writeStream = new ByteArrayOutputStream();
        try {
            copy(readStream, writeStream);
            writeStream.flush();
            byteArr = writeStream.toByteArray();
        } finally {
            close(writeStream);
        }
        return byteArr;
    }

    public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
        ByteArrayInputStream bis = null;
        bis = new ByteArrayInputStream(inBytes);
        try {
            copy(bis, writeStream);
        } finally {
            close(bis);
        }
    }
}
EDIT: For @JB Nizet, and everyone :)
//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work.
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works
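I need to specify the byte encoding when reading bytes into a String. I need to specify the byte encoding when writing the bytes of a String to a file.
Once I have a String in the JVM, I do not need to remember the source byte encoding, am I right?
When I write to a file, it should convert the String into my machine's default Charset (be it UTF-8, ASCII, or cp1252). That is failing. UTF-16BE fails too. Why do some Charsets fail?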
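Answer 0 (score: 5)
The encoding of the Java source file is indeed irrelevant. The reading part of your code is correct (although inefficient). What is incorrect is the writing part: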
osw = new OutputStreamWriter(fos);
should be replaced by
osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
Otherwise, you are writing with the platform default encoding (which does not appear to be UTF-8 on your system) instead of UTF-8.
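As a rough sketch of the same idea with less plumbing, the read and write steps can each name the charset explicitly. The class name Utf8RoundTrip, its two helper methods, and the temporary file are made up for illustration; only the java.nio.file and String calls are standard JDK API (available in JDK 8).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original question's code.
class Utf8RoundTrip {

    // Decode the file's bytes as UTF-8, independent of file.encoding.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Encode the text as UTF-8 explicitly instead of using the platform default.
    static void writeUtf8(Path path, String text) throws IOException {
        Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // The Russian greeting from the test file, written with Unicode escapes
        // so that this example does not depend on the source file's encoding.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440.";
        Path tmp = Files.createTempFile("utf8text", ".txt"); // scratch file for the demo
        try {
            writeUtf8(tmp, text);
            System.out.println(text.equals(readUtf8(tmp))); // prints "true"
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because both directions name UTF-8, the result should be the same whether the program is started from Eclipse or from the command line.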
Note that Java accepts forward slashes in file paths, even on Windows. You could simply write
File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");
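EDIT:
Once I have a String in the JVM, I do not need to remember the source byte encoding, am I right?
Yes, you are right.
When I write to a file, it should convert the String into my machine's default Charset (be it UTF-8, ASCII, or cp1252). That is failing.
If you do not specify an encoding, Java does indeed use the platform default encoding to transform characters into bytes. If you specify an encoding (as suggested at the beginning of this answer), it uses the encoding you tell it to use.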
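To see which encoding a writer ends up using, a small sketch like the following may help. WriterCharsetDemo and its encode method are hypothetical names; the writer targets an in-memory stream so that no file is needed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class WriterCharsetDemo {

    // Push the text through an OutputStreamWriter that uses the given charset
    // and return the bytes it produced.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Six Cyrillic letters, written as Unicode escapes.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // What new OutputStreamWriter(fos) would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("bytes with default charset: "
                + encode(text, Charset.defaultCharset()).length);

        // Each of these letters takes two bytes in UTF-8.
        System.out.println("bytes with UTF-8: "
                + encode(text, StandardCharsets.UTF_8).length); // prints 12
    }
}
```

The first two lines of output depend on the machine, which is exactly why passing the charset explicitly is the safer habit.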
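But not every encoding can represent all Unicode characters the way UTF-8 can. ASCII, for example, supports only 128 different characters. Cp1252, AFAIK, supports only 256. So the encoding succeeds, but each unencodable character is replaced by a special character (typically a question mark), which means: I cannot encode this Thai or Russian character because it is not part of the character set I support.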
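The silent replacement can be reproduced in a few lines. This is only a sketch: UnmappableDemo and roundTrip are invented names, and US-ASCII stands in for any charset that lacks the characters in question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class UnmappableDemo {

    // Encode the text with the given charset, then decode it again.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "Hola" followed by three Cyrillic letters (as Unicode escapes).
        String text = "Hola \u043c\u0438\u0440";

        // US-ASCII cannot represent the Cyrillic letters, so each becomes '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // prints "Hola ???"

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8))); // prints "true"

        // An encoder can report the problem up front instead of hiding it.
        System.out.println(StandardCharsets.US_ASCII.newEncoder().canEncode(text)); // prints "false"
    }
}
```

Checking canEncode before writing is one way to detect data loss early when the target charset is not under your control.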
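The UTF-16 encoding should be fine. But make sure your text editor is also configured to use UTF-16 when reading and displaying the file. If it is configured to use another encoding, the displayed content will not be correct.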
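One detail that may matter for the editor is the byte order mark (BOM). In Java, the UTF_16 charset writes a big-endian BOM when encoding, while UTF_16BE and UTF_16LE write none, so an editor has to guess the byte order for the latter two. Whether that explains the behaviour seen in Notepad++ is only a guess; the sketch below just shows the bytes Java produces. Utf16BomDemo and hex are invented names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Utf16BomDemo {

    // Render bytes as space-separated, two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF_16 prepends the big-endian byte order mark FE FF.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // prints "FE FF 00 41"
        // The explicit-endian variants write no byte order mark.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // prints "00 41"
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // prints "41 00"
    }
}
```

If the editor cannot be told the encoding, writing with StandardCharsets.UTF_16 gives it a BOM to detect.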