Question

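I am reading data from a file that, unfortunately, uses two different character encodings.

There is a header and a body. The header is always ASCII and defines the character set the body is encoded in.

The header is not fixed length and must be run through a parser to determine its content and length.

The file may also be very large, so I need to avoid loading the entire content into memory.

So I started with a single InputStream. I first wrap it in an InputStreamReader with ASCII, decode the header, and extract the character set for the body. All good.

Then I create a new InputStreamReader with the correct character set, put it over the same InputStream, and start reading the body.

Unfortunately, and the javadoc confirms this, InputStreamReader may choose to read ahead for efficiency. So reading the header consumes some or all of the body.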
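To make the problem concrete, here is a minimal sketch that reads a single character through an InputStreamReader and then checks how many bytes remain in the underlying stream. The class and method names are hypothetical, and how much is read ahead is an implementation detail, so the exact number left over may differ between JDKs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReadAheadDemo {
    // Returns how many bytes are still available in the underlying stream
    // after exactly one character has been read through an InputStreamReader.
    static int bytesLeftAfterOneChar(int totalBytes) throws IOException {
        byte[] data = new byte[totalBytes];
        Arrays.fill(data, (byte) 'a');
        InputStream in = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        reader.read(); // ask for a single character only
        return in.available();
    }

    public static void main(String[] args) throws IOException {
        // Without read-ahead 99 bytes would remain; the reader typically takes more.
        System.out.println("bytes left: " + bytesLeftAfterOneChar(100));
    }
}
```

Any bytes the reader pulled in beyond the first one are lost to a second reader created on the same stream.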

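Does anyone have suggestions for working around this? Would creating a CharsetDecoder manually and feeding it one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?

Thanks in advance.

EDIT: My final solution was to write an InputStreamReader with no buffering, so that I can parse the header without consuming part of the body. Although this is not very efficient, I wrap the raw InputStream in a BufferedInputStream, so it is not a problem.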

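import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// An InputStreamReader that only consumes as many bytes as is necessary
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    // Collects the bytes of the character currently being decoded
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 8 );
    // Large enough for a surrogate pair
    private final CharBuffer charBuffer = CharBuffer.allocate( 2 );
    // Second half of a surrogate pair waiting to be returned, or -1
    private int pendingChar = -1;

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
    }

    @Override
    public int read() throws IOException
    {
        if ( pendingChar != -1 )
        {
            int ch = pendingChar;
            pendingChar = -1;
            return ch;
        }

        while ( true )
        {
            int b = inputStream.read();

            if ( b == -1 )
            {
                // leftover bytes mean the stream ended in the middle of a character
                if ( byteBuffer.position() > 0 )
                    throw new IOException( "Unexpected end of stream, character truncated" );

                return -1;
            }

            byteBuffer.put( (byte)b );
            byteBuffer.flip();
            charBuffer.clear();

            // endOfInput is false, so an incomplete multi-byte sequence is reported as
            // underflow instead of being rejected as malformed input
            CoderResult result = charsetDecoder.decode( byteBuffer, charBuffer, false );
            byteBuffer.compact();

            if ( result.isError() )
                result.throwException();

            charBuffer.flip();

            if ( charBuffer.hasRemaining() )
            {
                int ch = charBuffer.get();

                if ( charBuffer.hasRemaining() )
                    pendingChar = charBuffer.get();

                return ch;
            }
        }
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();

            if ( ch == -1 )
                return i == 0 ? -1 : i;

            // honour the caller's offset
            cbuf[ off + i ] = (char)ch;
        }

        return len;
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}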

Answer 1

Why not use two InputStreams? One for reading the header and another for reading the body.

The second InputStream should skip the header bytes.
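A minimal sketch of the second stream, assuming the header parser has already reported the header length in bytes and the body charset. The names TwoStreams, skipFully and openBody are hypothetical. Note that InputStream.skip may skip fewer bytes than requested, so it is called in a loop.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class TwoStreams {
    // Skips exactly n bytes, falling back to read() when skip() makes no progress.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) {
                n -= skipped;
            } else if (in.read() != -1) {
                n--;
            } else {
                throw new EOFException("Stream ended inside the header");
            }
        }
    }

    // Opens a fresh stream on the file, skips the header, and returns a Reader
    // positioned at the first body byte. The caller closes the Reader.
    static Reader openBody(Path file, long headerLength, Charset bodyCharset) throws IOException {
        InputStream in = new BufferedInputStream(Files.newInputStream(file));
        try {
            skipFully(in, headerLength);
        } catch (IOException e) {
            in.close();
            throw e;
        }
        return new InputStreamReader(in, bodyCharset);
    }
}
```

Because the body gets its own stream, it does not matter how far the header reader read ahead on the first one.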

Answer 2

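Here is the pseudocode.

Use an InputStream, but do not wrap a Reader around it.
Read the bytes that contain the header and store them in a ByteArrayOutputStream.
Create a ByteArrayInputStream from the ByteArrayOutputStream and decode the header, this time wrapping the ByteArrayInputStream in a Reader with the ASCII charset.
Compute the length of the non-ASCII input, and read that number of bytes into another ByteArrayOutputStream.
Create another ByteArrayInputStream from the second ByteArrayOutputStream and wrap it in a Reader with the charset from the header.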
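The steps above could be sketched as follows. The header format is an assumption made only for this example (a single ASCII line of the form "charset=<name>" ended by '\n'), the class and method names are hypothetical, and the body is simply read to the end of the stream. Note that this holds the whole body in memory, as the steps describe.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HeaderThenBody {
    static String readBody(InputStream in) throws IOException {
        // Collect the raw header bytes straight from the stream, no Reader involved.
        ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            headerBytes.write(b);
        }

        // Decode the collected header bytes as ASCII.
        BufferedReader headerReader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(headerBytes.toByteArray()), StandardCharsets.US_ASCII));
        String header = headerReader.readLine();
        if (header == null || !header.startsWith("charset=")) {
            throw new IOException("Missing charset header");
        }
        Charset bodyCharset = Charset.forName(header.substring("charset=".length()).trim());

        // Read the body bytes into a second buffer.
        ByteArrayOutputStream bodyBytes = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            bodyBytes.write(chunk, 0, n);
        }

        // Decode the body with the charset named in the header.
        Reader body = new InputStreamReader(
                new ByteArrayInputStream(bodyBytes.toByteArray()), bodyCharset);
        StringBuilder text = new StringBuilder();
        int c;
        while ((c = body.read()) != -1) {
            text.append((char) c);
        }
        return text.toString();
    }
}
```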

Answer 3

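My first thought is to close the stream and reopen it, using InputStream#skip to skip past the header before handing the stream to the new InputStreamReader.

If you really, really don't want to reopen the file, you could use file descriptors to get more than one stream on the file, although you may have to use channels to have multiple positions within the file (you can't assume you can reset the position with reset, because it may not be supported).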
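One way the channel idea might look, as a sketch only: the file is opened once as a FileChannel, the header is read with an ordinary read-ahead reader, and the channel is then repositioned to the first body byte. The header format (one ASCII line "charset=<name>" ended by '\n') and the names ChannelRewind and copyBody are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelRewind {
    // Copies the decoded body to 'out' without holding the whole file in memory.
    static void copyBody(Path file, Appendable out) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // This reader may read far past the header; that is fine here.
            BufferedReader headerReader = new BufferedReader(new InputStreamReader(
                    Channels.newInputStream(channel), StandardCharsets.US_ASCII));
            String header = headerReader.readLine();
            if (header == null || !header.startsWith("charset=")) {
                throw new IOException("Missing charset header");
            }
            Charset bodyCharset = Charset.forName(header.substring("charset=".length()).trim());

            // ASCII is one byte per character, plus one byte for the '\n' terminator.
            channel.position(header.length() + 1L);

            Reader body = new InputStreamReader(Channels.newInputStream(channel), bodyCharset);
            int c;
            while ((c = body.read()) != -1) {
                out.append((char) c);
            }
        }
    }
}
```

Repositioning the channel undoes whatever read-ahead the header reader performed, so no custom unbuffered reader is needed when the input is a real file.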

Answer 4

I suggest rereading the stream from the start with a new InputStreamReader. Perhaps assume that InputStream.mark is supported.
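A rough sketch of the mark/reset variant, using a BufferedInputStream because it supports mark. The header format (one ASCII line "charset=<name>" ended by '\n'), the MARK_LIMIT value and the names are assumptions for illustration. The mark only stays valid if the header plus whatever the header reader reads ahead fits within the limit; otherwise reset() throws an IOException.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MarkResetBody {
    // Assumed upper bound on header size plus reader read-ahead, in bytes.
    static final int MARK_LIMIT = 1 << 16;

    static void copyBody(InputStream raw, Appendable out) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(MARK_LIMIT); // remember the start of the stream

        // This reader may consume part of the body; reset() undoes that below.
        BufferedReader headerReader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.US_ASCII));
        String header = headerReader.readLine();
        if (header == null || !header.startsWith("charset=")) {
            throw new IOException("Missing charset header");
        }
        Charset bodyCharset = Charset.forName(header.substring("charset=".length()).trim());

        in.reset(); // back to the start of the stream

        // Step over the header again: one byte per ASCII character plus the '\n'.
        for (int i = 0; i <= header.length(); i++) {
            if (in.read() == -1) {
                break;
            }
        }

        Reader body = new InputStreamReader(in, bodyCharset);
        int c;
        while ((c = body.read()) != -1) {
            out.append((char) c);
        }
    }
}
```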

Answer 5

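It's even easier:

As you said, your header is always ASCII. So read the header directly from the InputStream, and when you are done, create the Reader with the correct encoding and read from it.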

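import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

private Reader reader;
private InputStream stream;
private String encoding;          // filled in while parsing the header
private boolean headerFullyRead;  // set once the header parser is done

public void read() throws IOException {
    int c = 0;
    // Read the ASCII header byte by byte; no Reader is involved yet, so nothing is read ahead
    while ((c = stream.read()) != -1) {
        // Read encoding
        if ( headerFullyRead ) {
            reader = new InputStreamReader( stream, encoding );
            break;
        }
    }
    if ( reader == null ) {
        throw new IOException( "Stream ended before the header was complete" );
    }
    while ((c = reader.read()) != -1) {
        // Handle rest of file
    }
}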

Answer 6

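If you wrap the InputStream and limit every read to just 1 byte at a time, it seems to disable the buffering inside InputStreamReader.

This way we don't have to rewrite the InputStreamReader logic.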

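import java.io.IOException;
import java.io.InputStream;

public class OneByteReadInputStream extends InputStream
{
    private final InputStream inputStream;

    public OneByteReadInputStream(InputStream inputStream)
    {
        this.inputStream = inputStream;
    }

    @Override
    public int read() throws IOException
    {
        return inputStream.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException
    {
        // Hand over at most one byte per call; a zero-length request stays zero-length
        return super.read(b, off, Math.min(len, 1));
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}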

To construct it:

new InputStreamReader(new OneByteReadInputStream(inputStream));

InputStreamReader buffering issue
