Question

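I have this example. It reads a line, "hello", from a file saved as UTF-8. Here is my question:

Strings are stored in UTF-16 format in Java. So when it reads the line hello, it converts it to UTF-16. So the string s is in UTF-16 and has a UTF-16 BOM... Am I right?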

  filereader = new FileReader(file);
  read = new BufferedReader(filereader);
  String s = null;
  while ((s = read.readLine()) != null)
  {
      System.out.println(s);
  }

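So when I do this:

  s = s.replace("\uFEFF", "A");

nothing happens. Should the above find and replace the UTF-16 BOM? Or does the string end up in UTF-8 format? I'm a bit confused about this.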

Thanks

Answer 1

Try using the Apache Commons IO library and its org.apache.commons.io.input.BOMInputStream class to get rid of this kind of problem.

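As for the BOM itself, as @seand said, it is just metadata used when reading, writing, or storing the encoded form of a string. It is present in the encoded bytes, but you cannot replace or modify it unless you work at the binary level or re-encode the string.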

Here is an example using BOMInputStream:

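import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);

try
{
    // With this constructor, BOMInputStream detects a UTF-8 BOM and excludes it from the data that is read
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    // Fall back to the default encoding when the file has no BOM
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();

    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // your code...
}
finally
{
    inputStream.close();
}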

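With UTF-16 (which defaults to big-endian), encoding a six-character string gives 14 bytes, because a BOM is inserted to distinguish BE from LE. If you specify UTF-16BE or UTF-16LE, you get 12 bytes, because no BOM is added.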
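To make the byte counts concrete, here is a minimal sketch that encodes a string with each charset and prints the length. The class name BomByteCounts, the helper encodedLength, and the sample string "abcdef" are made up for illustration; any six-character ASCII string should give the same numbers. On a standard JDK, only plain UTF-16 adds the two BOM bytes when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomByteCounts {
    // Number of bytes produced when encoding s with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "abcdef"; // hypothetical six-character sample string
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));    // 6
        System.out.println(encodedLength(s, StandardCharsets.UTF_16));   // 14 (2-byte BOM + 12)
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE)); // 12 (no BOM)
        System.out.println(encodedLength(s, StandardCharsets.UTF_16LE)); // 12 (no BOM)
    }
}
```

The two-byte difference is the BOM (FE FF) that the UTF-16 encoder writes ahead of the text.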

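You cannot strip a UTF-16 BOM from the string with a simple replace, as you tried, because the string does not contain one. The BOM, if present, is part of the underlying byte stream; in memory, the Java runtime handles the data as a String, and you cannot manipulate the BOM as if it were one of that string's own characters.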
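A related case that may be worth checking, offered as a sketch and not as a diagnosis of the question: as far as I know, Java's UTF-8 decoder does not consume a leading BOM in the bytes, so it can come out as an ordinary U+FEFF character at the start of the decoded text, while the UTF-16 decoder does consume it. The class name BomDecodeDemo and the helper stripLeadingBom are made up for illustration, and the byte arrays stand in for file contents.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Drops a single leading U+FEFF if the decoder left one in the text.
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by "hello"
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.length());         // 6: U+FEFF plus five letters
        System.out.println(stripLeadingBom(decoded)); // hello

        // UTF-16 BOM (FE FF) followed by "hi" in big-endian order
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};
        String decoded16 = new String(utf16, StandardCharsets.UTF_16);
        System.out.println(decoded16.length());       // 2: the BOM was consumed
    }
}
```

Whether this applies to the question depends on which charset FileReader actually used and on whether the file really starts with a BOM; the original post states neither.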