I have a legacy code snippet that always calls encode('utf-8') on whatever I pass in. When I pass it a unicode string (coming straight from the database), is there a way to change the unicode string into some other form so that it can be encoded to 'utf-8' again without raising an error? I am not allowed to change the legacy snippet.
I first tried decoding it, but that returns this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
If I leave the unicode string as it is, it returns:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)
If I change the legacy code so that it no longer calls encode('utf-8'), it works, but changing the legacy snippet is not a viable option.
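To make the setup concrete, here is a minimal sketch of the situation in Python 2; legacy_encode and value_from_db are placeholder names I made up, not the real legacy interface:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the legacy snippet that I am not allowed to change:
def legacy_encode(value):
    return value.encode('utf-8')  # always calls encode('utf-8'), whatever comes in

value_from_db = u'贸易'  # placeholder for the unicode string coming from the database
legacy_encode(value_from_db)  # fine for a plain unicode string; passing a non-ASCII
                              # byte string here raises the UnicodeDecodeError shown above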
Edit:
Here is the snippet:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
if __name__ == "__main__":
    # 1 -- a is already a unicode string
    a = u'贸易'
    # 2 -- decoding a unicode value makes Python 2 encode it with ascii first
    a = a.decode('utf-8')
    # 3 -- never reached when # 2 raises
    a.encode('utf-8')
For some reason, if I skip # 2 I do not get the error mentioned above. I double-checked the type of the string, and in both cases it appears to be unicode holding the same characters, yet the code I am working on will not let me encode or decode to utf-8, while the same characters in some other snippets let me do exactly that.
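For reference, a quick sketch of how the value can be inspected in Python 2 before it reaches the legacy code (a is the variable from the snippet above):
# Inspect the value before handing it to the legacy code:
print type(a)   # <type 'unicode'> or <type 'str'>
print repr(a)   # u'...' means code points, '...' means raw bytes
# Calling decode() on a value that is already unicode makes Python 2 encode it
# with the ascii codec first, which is what raises the UnicodeEncodeError at # 2.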
Answer 0 (score: 5)
Consider the following cases:
In none of these cases does it make sense to encode or decode more than once.
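For illustration (not part of the original answer), a short Python 2 session showing why encoding the same text twice fails for non-ASCII characters:
>>> b = u'贸易'.encode('utf-8')  # first encode: unicode -> UTF-8 bytes
>>> b
'\xe8\xb4\xb8\xe6\x98\x93'
>>> b.encode('utf-8')  # second encode: Python 2 implicitly decodes with ascii first
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 0: ordinal not in range(128)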
Answer 1 (score: 3)
In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it is a unicode instance already, you have to decode it first, from whatever encoding it is in, to a unicode string, after which you can pass it into your legacy interface.
At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it into a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the str type for both plain-ASCII strings and byte sequences.
>>> u'é'.encode('utf-8') # unicode string
'\xc3\xa9' # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9' # unicode string
>>> u'\xe9' == u'é'
True
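A short sketch of the "decode first, then pass it on" approach described above, assuming the incoming bytes really are UTF-8 encoded:
>>> raw = '\xe8\xb4\xb8\xe6\x98\x93'  # example UTF-8 bytes for u'贸易'
>>> text = raw.decode('utf-8')        # decode exactly once, using the bytes' actual encoding
>>> text.encode('utf-8')              # the legacy encode('utf-8') call now succeeds
'\xe8\xb4\xb8\xe6\x98\x93'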