Question

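I need to check the encoding of a file. This code works, but it is a bit long. How can I refactor this logic? Or is there perhaps a different approach I could use for this?

Code: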

class CharsetDetector implements Checker {

    Charset detectCharset(File currentFile, String[] charsets) {
        Charset charset = null;

        for (String charsetName : charsets) {
            charset = detectCharset(currentFile, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }

        return charset;
    }

    private Charset detectCharset(File currentFile, Charset charset) {
        try {
            BufferedInputStream input = new BufferedInputStream(
                    new FileInputStream(currentFile));

            CharsetDecoder decoder = charset.newDecoder();
            decoder.reset();

            byte[] buffer = new byte[512];
            boolean identified = false;
            while ((input.read(buffer) != -1) && (!identified)) {
                identified = identify(buffer, decoder);
            }

            input.close();

            if (identified) {
                return charset;
            } else {
                return null;
            }

        } catch (Exception e) {
            return null;
        }
    }

    private boolean identify(byte[] bytes, CharsetDecoder decoder) {
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false;
        }
        return true;
    }

    @Override
    public boolean check(File fileChack) {
        if (charsetDetector(fileChack)) {
            return true;
        }
        return false;
    }

    private boolean charsetDetector(File currentFile) {
        String[] charsetsToBeTested = { "UTF-8", "windows-1253", "ISO-8859-7" };

        CharsetDetector charsetDetector = new CharsetDetector();
        Charset charset = charsetDetector.detectCharset(currentFile,
                charsetsToBeTested);

        if (charset != null) {
            try {
                InputStreamReader reader = new InputStreamReader(
                        new FileInputStream(currentFile), charset);

                @SuppressWarnings("unused")
                int valueReaders = 0;
                while ((valueReaders = reader.read()) != -1) {
                    return true;
                }

                reader.close();
            } catch (FileNotFoundException exc) {
                System.out.println("File not found!");
                exc.printStackTrace();
            } catch (IOException exc) {
                exc.printStackTrace();
            }
        } else {
            System.out.println("Unrecognized charset.");
            return false;
        }

        return true;
    }
}

Questions:

How can this program logic be refactored?
What other ways are there to detect the encoding (UTF-16 sequences, etc.)?

Answer 1

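The best way to refactor this code is to bring in a third-party library that does the character detection for you. Such a library will probably do a better job, and it will make your code smaller. For other options, see this question.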

Answer 2

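As has already been pointed out, you cannot "know" or "detect" the encoding of a file. Complete accuracy requires that you be told, because there is almost always a byte sequence that is ambiguous with respect to several character encodings.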

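You will find more discussion about detecting UTF-8 vs. ISO8859-1 in this SO question. The essential answer is to check every byte sequence in the file to verify that it is compatible with the expected encoding. For the UTF-8 byte encoding rules, see http://en.wikipedia.org/wiki/UTF-8.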
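To make "check every byte sequence against the expected encoding" concrete, here is a minimal sketch that leans on the JDK's own decoder instead of hand-written bit tests. The class and method names (`StrictDecodeCheck`, `decodesCleanly`) are made up for illustration, and it assumes the whole file is already in memory as a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the question's code.
class StrictDecodeCheck {

    /** Returns true if every byte sequence in data is legal in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Decodes the complete input in one operation, so a multi-byte
            // sequence can never be cut in half at a buffer boundary.
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Keep in mind that a `true` result only means "compatible with", not "is": a single-byte charset such as ISO-8859-1 accepts every possible byte, so it can never be ruled out this way.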

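In particular, there is a very interesting paper on detecting character encodings/sets: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html. They claim extremely high accuracy (for guessing!). The price is a very complex detection system, with knowledge of character frequencies in different languages, that won't fit in the 30 lines the OP has hinted is the right code size. Apparently the detection algorithm is built into Mozilla, so you can probably find and extract it.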

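We settled for a much simpler scheme: a) believe what you are told the character set is, if you are told; b) if not, check for a BOM and believe what it says if one is present; otherwise sniff for pure 7-bit ASCII, then UTF-8, then ISO-8859, in that order. You can build an ugly routine that does this in a single pass over the file.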
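The scheme above might look roughly like this in Java. This is only a sketch under a few assumptions: the names (`SimpleCharsetSniffer`, `sniff`, and the helpers) are invented here, the data is already in memory, UTF-32 BOMs are ignored, and for readability it makes several passes over the data instead of the single pass mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "trust, then BOM, then sniff" order.
class SimpleCharsetSniffer {

    /** declared may be null when nobody told us the encoding. */
    static Charset sniff(byte[] data, Charset declared) {
        if (declared != null) return declared;              // a) believe what we were told
        Charset fromBom = charsetFromBom(data);             // b) otherwise believe the BOM
        if (fromBom != null) return fromBom;
        if (isPureAscii(data)) return StandardCharsets.US_ASCII;
        if (decodesAsUtf8(data)) return StandardCharsets.UTF_8;
        return StandardCharsets.ISO_8859_1;                 // last resort: accepts any byte
    }

    /** Recognizes the UTF-8 and UTF-16 byte order marks only (assumption: no UTF-32). */
    static Charset charsetFromBom(byte[] d) {
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB
                && (d[2] & 0xFF) == 0xBF) return StandardCharsets.UTF_8;
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE;
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE;
        return null;
    }

    /** True if no byte has its high bit set. */
    static boolean isPureAscii(byte[] d) {
        for (byte b : d) {
            if (b < 0) return false;
        }
        return true;
    }

    /** True if the JDK's strict UTF-8 decoder accepts the whole input. */
    static boolean decodesAsUtf8(byte[] d) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(d));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Usage would be something like `Charset cs = SimpleCharsetSniffer.sniff(bytes, null);` when no encoding was declared. The final fallback is a guess by construction, which is exactly the limitation described at the top of this answer.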

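(I think this problem is going to get worse over time. Unicode gets a new revision every year, with truly subtle differences in the set of valid code points. To do this right, you need to check every code point for validity. If we're lucky, they are all backwards compatible.)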
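As a rough illustration of a per-code-point check, the sketch below tests an already decoded string with `Character.isDefined`. The class and method names are hypothetical. Note that the answer depends on which Unicode version the running JDK implements, which is the version drift described above.

```java
// Hypothetical helper: checks decoded text, not raw bytes.
class CodePointCheck {

    /** True if every code point is assigned in the Unicode version known to this JDK. */
    static boolean allCodePointsDefined(String text) {
        // codePoints() yields full code points, so supplementary characters
        // are tested as one value rather than as two surrogate halves.
        return text.codePoints().allMatch(Character::isDefined);
    }
}
```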

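[EDIT: The OP seems to be having trouble coding this in Java. Our solution and the sketch on the other page are not written in Java, so I can't just copy and paste an answer. I'm going to draft a Java version here based on his code; it hasn't been compiled or tested. YMMV]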

static int UTF8size(byte[] buffer, int buf_index)
// Java version of the character-sniffing test on the other page
// This only checks for UTF-8 compatible bit-pattern layout
// A tighter test (what we actually did) would check for valid UTF-8 code points
// Returns the length in bytes of the sequence starting at buf_index, or 0 if it is not valid
{   int first_character=buffer[buf_index]; // sign-extended, but the masks below only keep low bits

    // This first character test might be faster as a switch statement
    if ((first_character & 0x80) == 0) return 1; // ASCII subset character, fast path
    else if ((first_character & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (buf_index+3>=buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80)
         && ((buffer[buf_index + 3] & 0xC0) == 0x80))
            return 4;
    }
    else if ((first_character & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (buf_index+2>=buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80))
            return 3;
    }
    else if ((first_character & 0xE0) == 0xC0) { // start of 2-byte sequence
        if (buf_index+1>=buffer.length) return 0;
        if ((buffer[buf_index + 1] & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

public static boolean isUTF8 ( File file ) {
    // Requires: import java.io.File; import java.io.IOException; import java.nio.file.Files;
    if (null == file) {
        throw new IllegalArgumentException ("input file can't be null");
    }
    if (file.isDirectory ()) {
        throw new IllegalArgumentException ("input file refers to a directory");
    }

    // read the whole input file into memory (File has no size() method, and a
    // single FileInputStream.read() call may return fewer bytes than requested)
    byte [] buffer;
    try {
        buffer = Files.readAllBytes ( file.toPath () ) ;
    }
    catch ( IOException e ) {
        throw new IllegalArgumentException ("Can't read input file, error = " + e.getLocalizedMessage () );
    }

    int file_size=buffer.length;
    int buf_index=0;
    int step;

    while (buf_index<file_size) {
       step=UTF8size(buffer,buf_index);
       if (step==0) return false; // definitely not UTF-8 file
       buf_index+=step;
    }

    return true ; // appears to be UTF-8 file
}
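For the "tighter test" mentioned in the comments of UTF8size, here is one possible sketch. It is not the code we actually used, and the names (`StrictUtf8`, `sequenceLength`, `isWellFormed`) are invented for this example. On top of the bit-pattern layout it also rejects overlong encodings, encoded surrogates, and values above U+10FFFF.

```java
// Hypothetical stricter variant of the bit-pattern test.
class StrictUtf8 {

    /** Length of the well-formed UTF-8 sequence starting at i, or 0 if there is none. */
    static int sequenceLength(byte[] b, int i) {
        int b0 = b[i] & 0xFF;
        if (b0 < 0x80) return 1;                        // ASCII fast path
        int len;
        if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;          // C0 and C1 would be overlong
        else if (b0 >= 0xE0 && b0 <= 0xEF) len = 3;
        else if (b0 >= 0xF0 && b0 <= 0xF4) len = 4;     // F5..FF exceed U+10FFFF
        else return 0;                                  // stray continuation or bad lead byte
        if (i + len > b.length) return 0;               // truncated sequence

        int cp = b0 & (0xFF >> (len + 1));              // payload bits of the lead byte
        for (int k = 1; k < len; k++) {
            int c = b[i + k] & 0xFF;
            if ((c & 0xC0) != 0x80) return 0;           // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (len == 3 && cp < 0x800) return 0;           // overlong 3-byte form
        if (len == 4 && cp < 0x10000) return 0;         // overlong 4-byte form
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     // surrogates are not valid in UTF-8
        if (cp > 0x10FFFF) return 0;                    // beyond the Unicode range
        return len;
    }

    /** True if the whole array is a sequence of well-formed UTF-8 sequences. */
    static boolean isWellFormed(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int step = sequenceLength(b, i);
            if (step == 0) return false;
            i += step;
        }
        return true;
    }
}
```

The difference from the simple layout test shows up on input like the two bytes C0 80: the layout is a plausible 2-byte sequence, but it is an overlong encoding of NUL and is rejected here.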
