Question

Occasionally I end up with a string that claims to be unicode but in fact is not. It looks like this:

s = u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'

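It is really just a byte string with a 'u' in front of it, and I don't know how to fix it. When I try to convert it to real unicode with unicode(s, 'utf8'), the code fails because s is already a unicode object. Decoding with s.decode('utf8') fails too, with a UnicodeEncodeError.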

Answer 1

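Here are the two methods I have so far:

(1) First get the numeric value of each character with ord(), then turn each value back into a single byte with chr().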

>>> e
u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> map(ord,e)
[232, 175, 184, 232, 145, 155, 228, 186, 174]
>>> map(chr,map(ord,e))
['\xe8', '\xaf', '\xb8', '\xe8', '\x91', '\x9b', '\xe4', '\xba', '\xae']
>>> ''.join(map(chr,map(ord,e)))
'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> print ''.join(map(chr,map(ord,e)))
诸葛亮
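
The transcript above is Python 2. As a rough sketch of the same ord() idea, assuming Python 3 (where chr() returns a text character rather than a byte, so bytes() is used to pack the values instead), it might look like this. The variable names are illustrative only.

```python
# Mis-decoded text: each character holds one byte of the UTF-8 data.
s = '\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'

# ord() yields each code point (all below 256 here);
# bytes() packs those integers back into a raw byte string.
raw = bytes(map(ord, s))

# Now the bytes can be decoded as the UTF-8 they really are.
fixed = raw.decode('utf-8')
print(fixed)
```

This only works while every character in the string has a code point below 256; bytes() raises ValueError otherwise.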

(2) As Ignacio Vazquez-Abrams said, ISO 8859-1 (aka Latin-1) maps the first 256 Unicode code points to their byte values.

>>> e.encode('latin1')
'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> print e.encode('latin1')
诸葛亮
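
For readers on Python 3, the Latin-1 round trip can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the name fix_mojibake is hypothetical, and returning the input unchanged on failure is an assumption about the desired behaviour.

```python
def fix_mojibake(text):
    """Hypothetical helper: recover UTF-8 text that was mis-decoded
    one byte per character. Returns the input unchanged if it does
    not look like that kind of damage."""
    try:
        # Latin-1 maps code points 0-255 straight to byte values,
        # so this recovers the original bytes; then decode as UTF-8.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character is above U+00FF, or the bytes are
        # not valid UTF-8: leave the text alone.
        return text


print(fix_mojibake('\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'))
```

Swallowing the exceptions keeps the helper safe to call on text that is already correct, at the cost of hiding genuinely malformed input.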
