Question

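In Rails, we read some text files as ISO-8859-1. Sometimes a file arrives as UTF-8 with BOM instead. I am trying to detect whether a file is UTF-8 with BOM so that I can re-read it as bom|UTF-8.

I tried the following, but the comparison does not seem to work correctly: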

# file is saved as UTF-8 with BOM using Sublime Text 2

> string = File.read(file, encoding: 'ISO-8859-1')

# this doesn't work, although it is supposed to
> string.start_with?("\xef\xbb\xbf".force_encoding("UTF-8"))
> false

# it works if I try this
> string.start_with?('ï»¿')
> true

The goal is to read the file as UTF-8 with BOM whenever it has a byte order mark at the beginning, and I would like to avoid the string.start_with?('ï»¿') approach.

Answer 1

string.start_with?("\u00ef\u00bb\u00bf")

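From the Ruby official documentation:

\xnn hexadecimal bit pattern, where nn is 1-2 hexadecimal digits ([0-9a-fA-F])

\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])

In other words, to insert a Unicode character you should use the \uXXXX notation. It is safe, and we can rely on this version.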
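A small sketch of why the two escapes behave differently. It assumes (the question does not show this) that the string ends up transcoded from ISO-8859-1 to UTF-8, which is what happens when Encoding.default_internal is UTF-8, as Rails normally sets it. The variable names are illustrative only.

```ruby
# "\xEF\xBB\xBF" is three raw bytes: the UTF-8 encoding of U+FEFF (the BOM).
bom = "\xEF\xBB\xBF"
p bom.bytes          # => [239, 187, 191]
p bom == "\uFEFF"    # => true

# "\u00ef\u00bb\u00bf" is three separate characters (ï, », ¿),
# each taking two bytes in UTF-8.
mojibake = "\u00ef\u00bb\u00bf"
p mojibake.bytes     # => [195, 175, 194, 187, 194, 191]

# Interpreting the BOM bytes as ISO-8859-1 and transcoding them to UTF-8
# (assumed to mirror what the read in the question does) yields those
# three characters, not the BOM.
transcoded = "\xEF\xBB\xBF".b.force_encoding("ISO-8859-1").encode("UTF-8")
p transcoded == mojibake          # => true
p transcoded.start_with?(bom)     # => false
```

Under that assumption, the \u form matches because it spells out the mis-decoded characters, while the \x form spells out the original bytes, which are no longer in the string.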

Answer 2

That did not work for me; I had to check the bytes instead.

string[0].bytes ==  [239, 187, 191] # true for UTF-8 + BOM

See BOM for other encodings

If you only want to check the file and then reopen it correctly (e.g. File.open(file, "r:bom|utf-8")), you do not need the whole file; just read the first 3 bytes:

is_bom = File.open(file) { |f| f.read(3).bytes ==  [239, 187, 191] }
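Putting the check and the reopen together might look like the sketch below. The helper names utf8_bom? and read_text are hypothetical, and the ISO-8859-1 fallback is taken from the question rather than from this answer.

```ruby
UTF8_BOM = [0xEF, 0xBB, 0xBF].freeze

# Hypothetical helper: true if the file starts with the UTF-8 BOM bytes.
# Only the first 3 bytes are read, in binary mode.
def utf8_bom?(path)
  head = File.open(path, "rb") { |f| f.read(3) }
  # read(3) returns nil for an empty file
  !head.nil? && head.bytes == UTF8_BOM
end

# Hypothetical helper: read as UTF-8 (stripping the BOM) when one is
# present, otherwise fall back to ISO-8859-1 as in the question.
def read_text(path)
  if utf8_bom?(path)
    File.open(path, "r:bom|utf-8", &:read)
  else
    File.read(path, encoding: "ISO-8859-1")
  end
end
```

Because the check works on raw bytes, it does not depend on which encoding the rest of the application uses for strings.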

Ruby: checking for a byte order mark