Question

Why does the following code alter "öäüß"? (I use it to split a large file into several smaller files...)

InputStream is = new BufferedInputStream(new FileInputStream(file));
File newFile;
BufferedWriter bw;
newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
files.add(newFile);
bw = new BufferedWriter(new FileWriter(newFile));
try {
    byte[] c = new byte[1024];
    int lineCount = 0;
    int readChars = 0;
    while ( ( readChars = is.read(c) ) != -1 )
        for ( int i=0; i<readChars; i++ ) {
            bw.write(c[i]);
            if ( c[i] == '\n' )
                if ( ++lineCount % linesPerFile == 0 ) {
                    bw.close();
                    newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
                    files.add(newFile);
                    bw = new BufferedWriter(new FileWriter(newFile));
                }
        }
} finally {
    bw.close();
    is.close();
}

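My understanding of character encodings is that as long as I keep every byte the same, everything should stay unchanged. Why does this code change the bytes?

Thanks a lot in advance!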

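====================Solution====================

The mistake was that FileWriter interprets what it is given as characters, so it should not be used to output plain bytes. Thanks to @meriton and @Jonathan Rosenne. Simply changing everything to BufferedOutputStream did not do the job, because BufferedOutputStream was too slow! I ended up improving my file splitting and copying code to use a larger read array and to call write() only when necessary...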

File newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
files.add(newFile);
InputStream iS = new BufferedInputStream(new FileInputStream(file));
OutputStream oS = new FileOutputStream(newFile); // BufferedOutputStream wrapper toooo slow!
try {
    byte[] c;
    if ( linesPerFile > 65536 )
        c = new byte[65536];
    else
        c = new byte[1024];
    int lineCount = 0;
    int readChars = 0;
    while ( ( readChars = iS.read(c) ) != -1 ) {
        int from = 0;
        for ( int idx=0; idx<readChars; idx++ )
            if ( c[idx] == '\n' && ++lineCount % linesPerFile == 0 ) {
                oS.write(c, from, idx+1 - from);
                oS.close();
                from = idx+1;
                newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
                files.add(newFile);
                oS = new FileOutputStream(newFile);
            }
        oS.write(c, from, readChars - from);
    }
} finally {
    iS.close();
    oS.close();
}
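
To check the "only copy bytes" idea without touching the disk, here is a minimal in-memory sketch of the same splitting loop. It is not the code above: the class name ByteSplitSketch, the split method, and the 8192-byte buffer are assumptions made for illustration, and unlike the file version it drops a trailing empty chunk. Comparing against the byte '\n' is safe for UTF-8 input because every byte of a multi-byte UTF-8 sequence is 0x80 or higher.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ByteSplitSketch {
    // Hypothetical helper: splits the stream into chunks of linesPerChunk lines,
    // copying raw bytes only (no Reader/Writer involved).
    static List<byte[]> split(InputStream in, int linesPerChunk) {
        List<byte[]> chunks = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        byte[] buf = new byte[8192]; // assumed buffer size
        int lineCount = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                int from = 0;
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n' && ++lineCount % linesPerChunk == 0) {
                        // Flush everything up to and including the newline.
                        current.write(buf, from, i + 1 - from);
                        chunks.add(current.toByteArray());
                        current = new ByteArrayOutputStream();
                        from = i + 1;
                    }
                }
                // Keep the remainder of this read for the current chunk.
                current.write(buf, from, n - from);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.size() > 0) {
            chunks.add(current.toByteArray());
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] data = "\u00f6\n\u00e4\n\u00fc\n\u00df\n".getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = split(new ByteArrayInputStream(data), 2);
        System.out.println(chunks.size() + " chunks");
    }
}
```

Concatenating the chunks should give back the input byte for byte, which is the property the question expected from the original code.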

Answer 1

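An InputStream reads bytes, and an OutputStream writes them. A Reader reads characters, and a Writer writes characters.

You read with an InputStream and write with a FileWriter. That is, you read bytes but write characters. Specifically,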

bw.write(c[i]);

calls the method

public void write(int c) throws IOException

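whose Javadoc says:

Writes a single character. The character to be written is contained in the 16 low-order bits of the given integer value; the 16 high-order bits are ignored.

That is, the byte is implicitly converted to an int, which is then reinterpreted as a Unicode code point, which is then written to the file using the platform default encoding (because you did not specify which encoding the FileWriter should use).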
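
The effect can be reproduced in memory. The following is a small sketch, not the asker's code: the class ByteVersusCharDemo and its methods are made up for illustration, and it assumes UTF-8 as the Writer's encoding instead of the platform default that FileWriter would pick. It pushes the two UTF-8 bytes of "ö" (c3 b6) once through Writer.write(int) and once through OutputStream.write(int).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ByteVersusCharDemo {
    // Mirrors bw.write(c[i]): each byte is widened to int and written as a char.
    static byte[] viaWriter(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) { // assumed encoding
            for (byte b : input) {
                w.write(b); // Writer.write(int): low 16 bits become a char
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Byte-oriented path: OutputStream.write(int) stores the low 8 bits unchanged.
    static byte[] viaStream(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        for (byte b : input) {
            sink.write(b);
        }
        return sink.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] original = {(byte) 0xC3, (byte) 0xB6}; // "ö" in UTF-8
        System.out.println(hex(viaStream(original))); // c3 b6
        System.out.println(hex(viaWriter(original))); // ef bf 83 ef be b6
    }
}
```

Because Java bytes are signed, 0xC3 is sign-extended to an int whose low 16 bits are 0xFFC3, so the Writer encodes the char U+FFC3 (and U+FFB6 for the second byte) rather than the original bytes. With a different Writer encoding the damaged output would look different, but it would still not match the input.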

Answer 2

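You are reading bytes and writing characters. The line bw.write(c[i]); assumes that each byte is one character, but that is not necessarily true of the input file; it depends on the encoding used. Encodings such as UTF-8 may use 2 or more bytes per character, and you are converting each byte separately. For example, in UTF-8, ö is encoded as 2 bytes, hex c3 b6. When you process them individually, you may see the first one as the character Ã.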
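
A short sketch of the c3 b6 example, with made-up names (Utf8PerByteDemo, decodeTogether, decodePerByte): decoding the two bytes together as UTF-8 gives one character, while turning each byte into its own character gives two. The per-byte path masks with 0xFF to treat the byte as unsigned, which is an assumption about how the bytes get displayed; without the mask, Java's signed bytes would produce different characters again.

```java
import java.nio.charset.StandardCharsets;

class Utf8PerByteDemo {
    // Decodes the whole byte sequence as UTF-8.
    static String decodeTogether(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Treats every byte as its own character (unsigned value = code point).
    static String decodePerByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] oUmlaut = {(byte) 0xC3, (byte) 0xB6};
        // Print code points in hex to stay independent of the console encoding.
        decodeTogether(oUmlaut).chars()
                .forEach(cp -> System.out.printf("together: U+%04X%n", cp));
        decodePerByte(oUmlaut).chars()
                .forEach(cp -> System.out.printf("per byte: U+%04X%n", cp));
    }
}
```

Decoded together, the bytes yield U+00F6 (ö); decoded one at a time, they yield U+00C3 (Ã) followed by U+00B6 (¶).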

Answer 3

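Try debugging the while condition ( readChars = is.read(c) ) != -1, because it goes into an infinite loop, so bw.close(); is never called and the file stays open in read mode. If you try to perform some operation on it in the meantime, the file will be corrupted and you will get undesired results.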
