Question

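I am trying to recognize the UTF-8 BOM when reading a file. Of course, Java likes to deal with 16-bit chars, while the BOM is made up of 8-bit bytes.

My test code looks like this: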

public void testByteOrderMarks() {
    System.out.println("test byte order marks");

    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b',(byte) 'c'};
    String test = new String(bytes,  Charset.availableCharsets().get("UTF-8"));
    System.out.printf("test len: %s  value %s\n", test.length(), test);
    String three = test.substring(0,3);
    System.out.printf("len %d  >%s<\n", three.length(), three);
    for (int i = 0; i < test.length();i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b,  c, (int) c); 
    }
}

The result is:

test byte order marks
test len: 4  value ?abc
len 3  >?ab<
b: ? ef c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63

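I can't figure out why the length of "test" is 4 and not 6. I also can't figure out why I am not picking up each 8-bit byte to do the comparison.

Thanks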

Answer 1

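Don't use characters when trying to figure out the BOM header. The BOM header is two or three bytes long, so you should open a (File)InputStream, read the first two or three bytes, and process them.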
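As a minimal sketch of that byte-level check, the class below reads up to three bytes from a stream and compares them against the UTF-8 and UTF-16 BOM byte sequences. The class and method names (`BomSniffer`, `detectBom`) are made up for illustration, and UTF-32 BOMs are deliberately not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class BomSniffer {

    /** Names the encoding implied by a BOM in the first len bytes of head, or returns null. */
    static String detectBom(byte[] head, int len) {
        if (len >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // UTF-32 BOMs are not distinguished in this sketch
        }
        return null;
    }

    /** Reads up to three bytes from the stream and inspects them; those bytes are consumed. */
    static String detectBom(InputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before three bytes were available
            }
            n += r;
        }
        return detectBom(head, n);
    }

    public static void main(String[] args) throws IOException {
        // Same bytes as in the question: a UTF-8 BOM followed by "abc".
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
        System.out.println(detectBom(new ByteArrayInputStream(bytes))); // prints UTF-8
    }
}
```

Masking each byte with `0xFF` avoids the sign-extension trap: a Java `byte` holding 0xEF is negative, so comparing it directly with the `int` literal `0xEF` would never match.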

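Incidentally, the XML header (<?xml version=... encoding=...?>) is pure ASCII, so it is also safe to load it as a byte stream (unless there is a BOM indicating that the file was saved with 16-bit characters and not as UTF-8).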

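My solution (see DecentXML's XMLInputStreamReader) is to load the first few bytes of the file and analyze them. That gives me enough information to create a correctly decoding Reader from the InputStream.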
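The following is a rough sketch of that idea, not DecentXML's actual code: it peeks at the first bytes through a `PushbackInputStream`, picks a charset from the BOM if there is one, pushes back any bytes that were not part of the BOM, and wraps the stream in an `InputStreamReader`. The names `BomAwareReader`, `open`, and `decode`, and the caller-supplied fallback charset, are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomAwareReader {

    /** Returns a Reader that decodes according to the BOM, or with fallback if no BOM is found. */
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream is shorter than three bytes
            }
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Return everything after the BOM to the stream so the Reader sees it.
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(in, charset);
    }

    /** Convenience helper for the demo: decodes a whole byte array into a String. */
    static String decode(byte[] data, Charset fallback) {
        try (Reader reader = open(new ByteArrayInputStream(data), fallback)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int r;
            while ((r = reader.read(buf)) != -1) {
                sb.append(buf, 0, r);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length()); // prints 3
    }
}
```

Because the BOM bytes are consumed before the decoder ever sees them, the resulting text starts with the real content rather than with U+FEFF.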

Answer 2

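A character is a character. The byte order mark is the Unicode character U+FEFF. In Java, it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file: if it matches '\uFEFF', it is the BOM. If it does not match, the file was written without a BOM.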

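private static final char BOM = '\uFEFF';    // Unicode byte order mark (U+FEFF)

// readFirstLineOfFile is a helper (not shown) that decodes the file and returns its first line.
String firstLine = readFirstLineOfFile("filename.txt");
if (!firstLine.isEmpty() && firstLine.charAt(0) == BOM) {
    // We have a BOM
} else {
    // No BOM present.
}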
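This also accounts for the length of 4 seen in the question: the three bytes EF BB BF are the UTF-8 encoding of the single character U+FEFF, so after decoding the string holds the BOM plus "abc". A small self-contained check follows; the class name `BomCharDemo` and the helper `stripBom` are invented for this example.

```java
import java.nio.charset.StandardCharsets;

class BomCharDemo {

    static final char BOM = '\uFEFF';

    /** Returns s without a leading BOM character, or s unchanged if there is none. */
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};

        // Six bytes decode to four chars: the BOM occupies three bytes but is one character.
        String test = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(test.length());          // prints 4
        System.out.println(test.charAt(0) == BOM);  // prints true
        System.out.println(stripBom(test));         // prints abc
    }
}
```

Note that this only works when the text was decoded with the correct charset in the first place; if the file is read with a different encoding, the BOM bytes will not turn into U+FEFF.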

Answer 3

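If you want to recognize a file with a BOM, a better solution (and one that worked for me) is to use Mozilla's encoding detector library: http://code.google.com/p/juniversalchardet/ That page explains how to use it: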

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws java.io.IOException {
    byte[] buf = new byte[4096];
    String fileName = "testFile.";
    java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

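    // (1) Construct the detector; null means no CharsetListener callback is registered.
    UniversalDetector detector = new UniversalDetector(null);

    // (2) Feed data to the detector until it has decided or the file ends.
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
      detector.handleData(buf, 0, nread);
    }
    fis.close();

    // (3) Tell the detector that no more data is coming.
    detector.dataEnd();

    // (4) Fetch the detected charset name; null means nothing was detected.
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5) Reset the detector before reusing the instance.
    detector.reset();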
  }
}

If you are using Maven, the dependency is:

<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

Java confusion about converting bytes to a String for comparison with "byte order marks"
