Question

I'm trying to remove the hex characters \xef\xbb\xbf from a string, but I get the error below.

I'm not sure how to fix this.

>>> x = u'\xef\xbb\xbfHello'
>>> x
u'\xef\xbb\xbfHello'
>>> type(x)
<type 'unicode'>
>>> print x
ï»¿Hello
>>> print x.replace('\xef\xbb\xbf', '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
>>>

Answer 1

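You need to pass unicode objects to replace. Otherwise Python 2 tries to decode the str argument with the ascii codec so that it can search for it inside x, and that decoding fails on byte 0xef.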

>>> x = u'\xef\xbb\xbfHello'
>>> x
u'\xef\xbb\xbfHello'
>>> print(x.replace(u'\xef\xbb\xbf',u''))
Hello

This only applies to Python 2. In Python 3, both versions work.
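For comparison, here is a minimal Python 3 sketch of the same replacement. It assumes the string holds the same three mis-decoded characters as in the question. In Python 3 every str literal is already Unicode, so no implicit ascii decoding takes place.

```python
# Python 3: str literals are Unicode, so the u'' prefix is optional.
x = '\xef\xbb\xbfHello'

# Both arguments are str (Unicode), so no implicit ascii decoding happens.
cleaned = x.replace('\xef\xbb\xbf', '')
print(cleaned)
```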

Answer 2

Try using the decode or unicode function, like this:

x.decode('utf-8')

or

unicode(string, 'utf-8')

Source: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1
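A hedged sketch of what decoding does when the data is still a byte string, which is an assumption here because the x in the question is already a unicode object. It is written for Python 3, and the variable names are illustrative. Note that plain 'utf-8' decoding keeps the BOM as the single character U+FEFF, so it still has to be stripped afterwards.

```python
# Assumption: the data is available as raw bytes, not yet decoded.
raw = b'\xef\xbb\xbfHello'

# Decoding as UTF-8 turns the three BOM bytes into one character, U+FEFF.
text = raw.decode('utf-8')

# Strip the leading BOM character explicitly.
cleaned = text.lstrip('\ufeff')
print(repr(text), repr(cleaned))
```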

Answer 3

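The real problem is that your Unicode string was decoded incorrectly in the first place. Those characters are the bytes of the UTF-8 byte order mark (BOM), mis-decoded as (probably) latin-1 or cp1252.

Ideally you would fix how the data is decoded, but you can reverse the mistake by re-encoding as latin1 and then decoding correctly: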

>>> x = u'\xef\xbb\xbfHello'
>>> x.encode('latin1').decode('utf8') # decode correctly, U+FEFF is a BOM.
u'\ufeffHello'
>>> x.encode('latin1').decode('utf-8-sig') # decode and handle BOM.
u'Hello'
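The round trip above can be wrapped in a small helper. This is only a sketch for Python 3: repair_bom_mojibake is a hypothetical name, and it assumes the text was mis-decoded as latin-1. If the round trip fails, the input is returned unchanged. The last lines show the preferred fix, which is decoding the original bytes with 'utf-8-sig' at the source.

```python
import codecs


def repair_bom_mojibake(text):
    """Hypothetical helper: undo a latin-1 mis-decoding of UTF-8 data
    and drop a leading BOM. Returns the input unchanged on failure."""
    try:
        return text.encode('latin-1').decode('utf-8-sig')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not latin-1 mojibake of valid UTF-8; leave it alone.
        return text


# Repairing an already mis-decoded string.
fixed = repair_bom_mojibake('\xef\xbb\xbfHello')

# Preferred: decode the original bytes correctly in the first place.
raw = codecs.BOM_UTF8 + b'Hello'
decoded = raw.decode('utf-8-sig')
print(repr(fixed), repr(decoded))
```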

Remove hex characters from a unicode object
