
Question

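I have an application that detects the charset encoding of a file. When I test it with a Shift_JIS file as input, it reports the encoding as EUC_JP.

I try the candidates {"EUC_JP", "Shift_JIS", "UTF-8"}, passing each one in turn as a Charset instance to the method below.

Here is my decoding code: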

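import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// BUFFER_SIZE is a constant defined elsewhere in the class.
private Charset detectCharset(File file, Charset charset)
{
    // try-with-resources closes the stream even if reading fails
    try (InputStream input = new BufferedInputStream(new FileInputStream(file)))
    {
        CharsetDecoder decoder = charset.newDecoder();
        decoder.reset();

        byte[] buffer = new byte[BUFFER_SIZE];
        boolean identified = false;
        int bytesRead;

        // Stops at the first chunk that decodes without an error
        while (((bytesRead = input.read(buffer)) != -1) && (!identified))
        {
            identified = identify(buffer, bytesRead, decoder);
        }

        return identified ? charset : null;
    }
    catch (Exception e)
    {
        return null;
    }
}

private boolean identify(byte[] bytes, int length, CharsetDecoder decoder)
{
    try
    {
        // Decode only the bytes actually read, not the whole buffer
        decoder.decode(ByteBuffer.wrap(bytes, 0, length));
        return true;
    }
    catch (CharacterCodingException e)
    {
        return false;
    }
}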

Answer 1

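I think the approach you use to identify the character encoding is flawed. The decode method throws a CharacterCodingException only when the buffer contents cannot be decoded at all. When the bytes can be decoded, it throws nothing, even if the result is gibberish. It cannot tell a meaningful character sequence from a meaningless one.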
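A minimal sketch of that behaviour, using only the JDK. The class and method names (StrictDecodeDemo, strictDecode) are made up for illustration, and the sketch assumes the JDK's extended charsets (Shift_JIS, EUC-JP) are available. The four bytes are half-width katakana in Shift_JIS, yet the EUC-JP decoder accepts them as two double-byte characters without raising an error:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

class StrictDecodeDemo {
    /** Returns the decoded text, or null if the bytes are not valid in the given charset. */
    static String strictDecode(byte[] data, Charset charset) {
        try {
            // A fresh decoder reports malformed or unmappable input by throwing.
            return charset.newDecoder().decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Four half-width katakana characters encoded as Shift_JIS, one byte each.
        byte[] sjis = {(byte) 0xB1, (byte) 0xB2, (byte) 0xB3, (byte) 0xB4};

        String asSjis = strictDecode(sjis, Charset.forName("Shift_JIS"));
        String asEuc = strictDecode(sjis, Charset.forName("EUC-JP"));

        // Both calls succeed, but they produce different text.
        System.out.println("Shift_JIS: " + asSjis);
        System.out.println("EUC-JP:    " + asEuc);
        System.out.println("same text? " + String.valueOf(asSjis).equals(asEuc));
    }
}
```

Because EUC_JP comes first in the candidate list, a file whose bytes happen to be valid in both encodings would be reported as EUC_JP.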

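Here are some related references that offer better approaches:

Java : How to determine the correct charset encoding of a stream - the answers discuss the nature of the problem from several angles and suggest relevant libraries.
http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/CharsetDetector.html - another library...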

But keep in mind that any algorithm for detecting character encodings will sometimes give the wrong answer.
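One way to make that uncertainty visible is to report every candidate that decodes cleanly instead of stopping at the first. The following is only a sketch under assumptions: CandidateCharsets and plausible are hypothetical names, the whole input is held in memory (which also avoids splitting a multi-byte character across buffer boundaries), and it is not a substitute for a statistical detector such as the libraries referenced above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CandidateCharsets {
    /** Returns every candidate, in order, under which the bytes decode without error. */
    static List<Charset> plausible(byte[] data, List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (Charset cs : candidates) {
            try {
                // Strict decode: throws on malformed or unmappable input.
                cs.newDecoder().decode(ByteBuffer.wrap(data));
                result.add(cs);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next one.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Charset> candidates = Arrays.asList(
                Charset.forName("EUC-JP"),
                Charset.forName("Shift_JIS"),
                StandardCharsets.UTF_8);

        // Half-width katakana in Shift_JIS: also a valid EUC-JP byte sequence.
        byte[] ambiguous = {(byte) 0xB1, (byte) 0xB2, (byte) 0xB3, (byte) 0xB4};
        // Two hiragana characters in Shift_JIS: lead byte 0x82 is not valid EUC-JP.
        byte[] hiragana = {(byte) 0x82, (byte) 0xB1, (byte) 0x82, (byte) 0xF1};

        System.out.println(plausible(ambiguous, candidates));
        System.out.println(plausible(hiragana, candidates));
    }
}
```

When more than one charset survives, the caller knows the result is ambiguous and can fall back to a heuristic or ask the user, rather than silently returning whichever candidate was tried first.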

Why is a Shift_JIS-encoded file reported as EUC_JP?
