Question

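I need to remove a byte order mark from a string. I already have the code that finds the BOM, but now I need to remove it from the actual string.

For example: the BOM is feff, which is 2 bytes long, so the first two bytes of the string should not appear in the final string. However, when I strip them using Python string slicing, too much is removed from the string.

Code snippet: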

print len(bom)
print as_hex(bom)
print string
print as_hex(string)
string = string[len(bom):]
print string
print as_hex(string)

Output:

2
feff
Organ
feff4f7267616e
rgan
7267616e

What I expected to get:

2
feff
Organ
feff4f7267616e
Organ
4f7267616e

The as_hex() function just prints the characters as hex ("".join('%02x' % ord(c) for c in bytes)).

Answer 1

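I think you have a unicode string object. (If you are using Python 3, you certainly do, because that is the only kind of string there is.) Your as_hex function is not printing "fe" for the first character and "ff" for the second. It is printing "feff" for the first unicode character in the string. For example (Python 3):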

>>> mystr = "\ufeffHello world."
>>> mystr[0]
'\ufeff'
>>> '%02x' % ord(mystr[0])
'feff'

You need to either remove just one unicode character, or store the string in a bytes object and remove two bytes.
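The two options above can be sketched as follows. This is a minimal Python 3 illustration, not the asker's code: the variable names are made up, and the UTF-16 big-endian encoding is an assumption chosen so that the BOM bytes come out as fe ff.

```python
import codecs

text = "\ufeffOrgan"             # str: the BOM is one code point, U+FEFF
raw = text.encode("utf-16-be")   # bytes: the same BOM is two bytes, fe ff

# Option 1 (text): drop a single leading character.
stripped_text = text[1:] if text.startswith("\ufeff") else text

# Option 2 (bytes): drop len(bom) bytes, which is two here.
bom = codecs.BOM_UTF16_BE
stripped_raw = raw[len(bom):] if raw.startswith(bom) else raw

print(repr(stripped_text))
print(stripped_raw.hex())
```

Slicing by len(bom) is only right when the BOM and the string are the same kind of object, which is the mismatch suspected in the question.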

(This does not explain why len(bom) is 2, and I can't tell without seeing more of the code. My guess is that bom is a list or a bytes object rather than a unicode string.)

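My answer above assumes Python 3, but I have realised from your print statements that you are using Python 2. Based on that, I would guess that bom is an ASCII string and string is a unicode string. If you use print repr(x) instead of print x, it will show you the difference between unicode and ASCII strings.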
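As a rough illustration of the repr() tip, here is a Python 3 sketch with made-up variable names. In Python 3 the distinction is between bytes and str rather than between ASCII and unicode strings, but repr() exposes it in the same way (in Python 2, a unicode string would show a u'...' prefix instead).

```python
bom_bytes = b"\xfe\xff"   # a byte string: two bytes
bom_text = "\ufeff"       # a text string: one code point

# repr() shows the type of each value: bytes get a b'...' prefix.
for value in (bom_bytes, bom_text):
    print(type(value).__name__, len(value), repr(value))
```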

Answer 2

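Use the correct codec and the BOM will be handled for you. Decoding with utf-8-sig and utf16 removes a leading BOM if one is present. Encoding with them adds a BOM. If you do not want a BOM, use utf-8, utf-16le or utf-16be.

In general, you should decode to Unicode when reading text data into your program, and encode to bytes when writing to a file, the console, a socket, and so on.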
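For file I/O, the decode-on-read and encode-on-write advice might look like the sketch below. It is written for Python 3; the temporary file and its name are hypothetical and exist only to make the example self-contained. In Python 2, io.open accepts the same encoding argument.

```python
import os
import tempfile

# Hypothetical input file that starts with a UTF-8 BOM (illustration only).
path = os.path.join(tempfile.mkdtemp(), "bom_example.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfOrgan")

# Decode at the boundary: utf-8-sig drops the BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()

# Encode at the boundary: plain utf-8 writes no BOM.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "rb") as f:
    written = f.read()

print(repr(text), repr(written))
```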

unicode_str = u'test'
utf8_w_bom = unicode_str.encode('utf-8-sig')
utf16_w_bom = unicode_str.encode('utf16')
utf8_wo_bom = unicode_str.encode('utf-8')
utf16_wo_bom = unicode_str.encode('utf-16le')
print repr(utf8_w_bom)
print repr(utf16_w_bom)
print repr(utf8_wo_bom)
print repr(utf16_wo_bom)
print repr(utf8_w_bom.decode('utf-8-sig'))
print repr(utf16_w_bom.decode('utf16'))
print repr(utf8_wo_bom.decode('utf-8-sig'))
print repr(utf16_wo_bom.decode('utf16'))

Output:

'\xef\xbb\xbftest'
'\xff\xfet\x00e\x00s\x00t\x00'
'test'
't\x00e\x00s\x00t\x00'
u'test'
u'test'
u'test'
u'test'

Note that decoding with utf16 assumes native byte order if no BOM is present.
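The native byte order caveat can be checked with a short Python 3 sketch. The payload and variable names are illustrative, and the comparison relies on sys.byteorder reporting the machine's byte order.

```python
import sys

payload = "test".encode("utf-16-le")   # no BOM: b't\x00e\x00s\x00t\x00'

# With no BOM, the generic utf-16 decoder falls back to the machine's order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
same_as_native = payload.decode("utf-16") == payload.decode(native)

# Naming the byte order explicitly gives the same text on any machine.
explicit = payload.decode("utf-16-le")
print(same_as_native, repr(explicit))
```

If data without a BOM may be read on machines with a different byte order, naming utf-16le or utf-16be explicitly avoids the dependence on the host.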

Remove the first two bytes from a string in Python
