Question

I am trying to convert a standard string that contains characters with ord > 128 to Unicode. For example,

a='en métro'
b=u'en métro'
c = whatToDoWith(a)

so that c is exactly equal to b in both type and value.

In my real program, with txt = 'en métro', I get the following error:

 utxt = txt.decode('utf8')
 File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
 return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 3: invalid continuation byte

To investigate, I also wrote the following test code:

# -*- coding: utf-8 -*-

c='en métro'
print type(c)
print c
d=c.decode('utf8')
print type(d)
print d
a='中文'
print type(a)
print a
b=a.decode('utf8')
print type(b)
print b

This time the result is as expected:

<type 'str'>
en mÃ©tro
<type 'unicode'>
en métro
<type 'str'>
ä¸æ–‡
<type 'unicode'>
中文

I don't know what is different in my real program. It also has the # -*- coding: utf-8 -*- line.

Can anyone point out what the problem might be?

Answer 1

str.decode() should work for your case:

# coding=utf-8

a = "en métro"
b = u"en métro"
c = a.decode("utf-8")

print(type(a))  # <type 'str'>
print(type(b))  # <type 'unicode'>
print(type(c))  # <type 'unicode'>

if b == c:
    print("b equals c!")  # hooray they are equal in value

if type(b) == type(c):
    print("b is the same type as c!")  # hooray they are of equal type
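The same idea can be wrapped in a small helper that plays the role of whatToDoWith() from the question. This is only a sketch: the name to_unicode and the default encoding are assumptions, and the byte string is written with escapes so the result does not depend on how the source file is saved.

```python
def to_unicode(value, encoding='utf-8'):
    """Hypothetical stand-in for whatToDoWith() from the question."""
    if isinstance(value, bytes):   # bytes is an alias of str on Python 2.7
        return value.decode(encoding)
    return value                   # already unicode text, leave it alone

a = b'en m\xc3\xa9tro'   # UTF-8 encoded byte string
b = u'en m\xe9tro'       # unicode string
c = to_unicode(a)        # same type and value as b
```

This only works when the bytes really are in the encoding that is passed in; for bytes in another encoding the decode call raises UnicodeDecodeError.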

Answer 2

Thanks for the answer above. I get the same result, but it does not answer my question: why does my test.py work while my real program does not?

I investigated further and found that a string read from a file differs from the same string written as an inline literal:

# -*- coding: utf-8 -*-
c='en métro'
print "c:"
print type(c)
print len(c)
for x in c:
    print ord(x)
file = open('test.txt','r')

e = file.read()
print "\n\ne:"
print type(e)
print len(e)
for x in e:
    print ord(x)
file.close()

I got this result:

c:
<type 'str'>
9
101
110
32
109
195
169
116
114
111


e:
<type 'str'>
8
101
110
32
109
233
116
114
111

I believe this is what causes the failure in my real program. Can someone explain the cause and the solution?

Answer 3

You are dealing with different text encodings.

Unicode code point 233 (0xe9) is LATIN SMALL LETTER E WITH ACUTE.

In UTF-8, this character is encoded as two bytes:

>>> unichr(233).encode('utf-8')
'\xc3\xa9'
>>> for b in unichr(233).encode('utf-8'):print ord(b),
... 
195 169
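As a rough check of what this means for decoding, the sketch below feeds both byte sequences from the question to the utf-8 codec. The variable names are made up for illustration, and the bytes are written as escapes so the snippet behaves the same regardless of the source file's encoding.

```python
utf8_bytes = b'en m\xc3\xa9tro'   # 9 bytes, like the inline literal c
other_bytes = b'en m\xe9tro'      # 8 bytes, like the string e read from test.txt

decoded = utf8_bytes.decode('utf-8')   # succeeds: \xc3\xa9 is a valid pair

try:
    other_bytes.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    # 0xe9 announces a multi-byte sequence, but the next byte is a plain 't'
    error = exc
```

The second decode fails with the same "invalid continuation byte" complaint as the traceback in the question, which suggests that the real program's input is not UTF-8.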

In cp1252 (the Windows Western European code page), latin-1, and some other European/Latin 8-bit encodings, the character is encoded as a single byte:

>>> unichr(233).encode('cp1252')
'\xe9'
>>> ord(_)
233
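One possible fix, then, is to decode the data with the codec the file was actually saved in instead of utf-8. The sketch below assumes that test.txt is cp1252; that is only a guess based on the single byte 233, and the helper name read_text is hypothetical. io.open accepts an encoding argument on Python 2.7 as well as Python 3.

```python
import io

def read_text(path, encoding='cp1252'):
    """Hypothetical helper: read a file and return unicode text.

    The default encoding is an assumption; pass the file's real encoding.
    """
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()

# Bytes that are already in memory can be decoded the same way.
raw = b'en m\xe9tro'          # the 8 bytes shown for e in the question
text = raw.decode('cp1252')   # one byte per character in cp1252
```

Alternatively, re-saving the file as UTF-8 in the editor would let the original txt.decode('utf8') call work unchanged.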

Python 2.7: how to convert a string with characters of ord > 128 to unicode

3 answers: