Question

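I have a legacy code segment that always calls encode('utf-8') when I pass in a unicode string (straight from the database). Is there a way to convert the unicode string to some other format so that it can be encoded to 'utf-8' again without raising an error? I am not allowed to change the legacy code segment.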

I first tried decoding it, but that raises this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

If I leave the unicode string as it is, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)

If I change the legacy code so that it does not call encode('utf-8'), it works, but that is not a viable option.

Edit:

Here is the code segment:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-



if __name__ == "__main__":
   # 1
   a = u'贸易'
   # 2
   a = a.decode('utf-8')
   # 3
   a.encode('utf-8')

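For some reason, if I skip #2 I do not get the errors mentioned above. I double-checked the type of the string in both places: both appear to be unicode and both contain the same characters. Yet the code I am working on does not let me encode to or decode from utf-8, while the same characters in some other code segments do.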

Answer 1

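Consider the following cases:

If you want a unicode string and you already have a unicode string, you do nothing.
If you want a bytestring and you already have a bytestring, you do nothing.
If you have a unicode string and want a bytestring, you encode it.
If you have a bytestring and want a unicode string, you decode it.

In none of these cases is it appropriate to encode or decode more than once.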
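
The four cases can be sketched as a pair of small helpers. This is only an illustration under stated assumptions: the names to_text and to_bytes are hypothetical, and UTF-8 is assumed as the encoding. The isinstance(value, bytes) check works in Python 2 (where bytes is an alias for str) as well as in Python 3.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return a unicode string (hypothetical helper, UTF-8 assumed)."""
    if isinstance(value, bytes):
        # Have a bytestring, want unicode: decode exactly once.
        return value.decode(encoding)
    # Already unicode: do nothing.
    return value


def to_bytes(value, encoding='utf-8'):
    """Return a bytestring (hypothetical helper, UTF-8 assumed)."""
    if isinstance(value, bytes):
        # Already a bytestring: do nothing.
        return value
    # Have unicode, want a bytestring: encode exactly once.
    return value.encode(encoding)


text = u'\u8d38\u6613'             # the two characters from the question
raw = b'\xe8\xb4\xb8\xe6\x98\x93'  # the same characters as UTF-8 bytes

assert to_text(raw) == text
assert to_text(text) == text
assert to_bytes(text) == raw
assert to_bytes(raw) == raw
```

Calling either helper repeatedly is harmless, because each one only converts when the input is of the other type.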

Answer 2

In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.

At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it to a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the str type for both plain-ASCII strings and byte sequences.

>>> u'é'.encode('utf-8')    # unicode string
'\xc3\xa9'                  # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9'                     # unicode string
>>> u'\xe9' == u'é'
True
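
Both errors from the question can be reproduced by spelling out the conversion that Python 2 performs implicitly. Calling .decode() on a unicode string first encodes it with the default ascii codec, and calling .encode() on a bytestring first decodes it with ascii. The sketch below makes those hidden steps explicit so that it also runs on Python 3. The functions legacy_send and call_legacy are hypothetical stand-ins, not code from the question.

```python
# -*- coding: utf-8 -*-

text = u'\u8d38\u6613'     # the two characters from the question
raw = text.encode('utf-8')  # b'\xe8\xb4\xb8\xe6\x98\x93'

# Hidden step behind unicode.decode('utf-8') in Python 2.
try:
    text.encode('ascii')
    first_error = None
except UnicodeEncodeError as exc:
    first_error = type(exc).__name__

# Hidden step behind str.encode('utf-8') on a bytestring in Python 2.
try:
    raw.decode('ascii')
    second_error = None
except UnicodeDecodeError as exc:
    second_error = type(exc).__name__


def legacy_send(value):
    """Hypothetical stand-in for the legacy code that cannot be changed."""
    return value.encode('utf-8')


def call_legacy(value):
    """Decode only when given bytes, then hand unicode to the legacy code."""
    if isinstance(value, bytes):
        value = value.decode('utf-8')
    return legacy_send(value)


print(first_error, second_error)
print(call_legacy(text) == call_legacy(raw))
```

With this guard in front of the legacy interface, a value that is already unicode is passed through untouched, which matches the observation in the question that skipping step #2 avoids the error.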

How to allow encode('utf-8') twice without getting an error in Python?
