How can I allow encode('utf-8') to be called twice without getting an error in Python?

Time: 2015-07-13 20:56:51

Tags: python unicode encoding utf-8

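I have a legacy code segment that always calls encode('utf-8') on whatever I pass in, and I am passing it a unicode string that comes directly from the database. Is there a way to convert the unicode string to some other form so that it can be encoded to 'utf-8' again without raising an error? I am not allowed to change the legacy code segment.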

I first tried decoding it, but that raises this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

If I leave the unicode string as it is, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)

If I change the legacy code so that it does not call encode('utf-8'), it works, but that is not a viable option.

Edit:

Here is the code snippet:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-



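if __name__ == "__main__":
    # 1: a unicode literal
    a = u'贸易'
    # 2: decoding a unicode string; this line raises the UnicodeEncodeError
    a = a.decode('utf-8')
    # 3: what the legacy code does
    a.encode('utf-8')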

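For some reason, if I skip #2 I do not get the error mentioned above. I double-checked the type of the string: both look like unicode and both contain the same characters, yet the code I am working on does not let me encode or decode to utf-8, while the same characters in another code snippet do.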

2 Answers:

Answer 0 (score: 5)

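Consider the following cases:

  1. If you want a unicode string and you already have a unicode string, you do nothing.
  2. If you want a bytestring and you already have a bytestring, you do nothing.
  3. If you have a unicode string and want a bytestring, you encode it.
  4. If you have a bytestring and want a unicode string, you decode it.

In none of these cases is it appropriate to encode or decode more than once.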
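
The four cases above can be sketched as two small helpers. The names to_text and to_bytes are hypothetical (they do not appear in the question's code), and UTF-8 is assumed as the encoding. Each helper converts only when the value is not already of the wanted type, so calling it a second time changes nothing.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers; written to run on both Python 2.7 and Python 3.

text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_text(value, encoding="utf-8"):
    """Cases 1 and 4: return text, decoding only if given bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Cases 2 and 3: return bytes, encoding only if given text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


word = u"贸易"
encoded = to_bytes(word)                  # text -> bytes, encoded once
print(to_bytes(encoded) == encoded)       # already bytes, left alone
print(to_text(to_text(encoded)) == word)  # decoded once, then left alone
```

Under these assumptions, passing to_text(value) to code that always calls encode('utf-8') should be safe whether the database returned text or UTF-8 bytes.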

Answer 1 (score: 3)

In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.
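
As a rough illustration of "decode first, then hand it to the legacy interface": legacy_interface below is a hypothetical stand-in for the code that cannot be changed, and Latin-1 is only an assumed source encoding. In practice the decode call should name whatever encoding the bytes really are in.

```python
# -*- coding: utf-8 -*-


def legacy_interface(value):
    # Hypothetical stand-in for the legacy code, which always encodes.
    return value.encode("utf-8")


raw = b"\xe9"                    # the byte for 'é' in Latin-1 (assumed source encoding)
text = raw.decode("latin-1")     # decode exactly once, using the real source encoding
result = legacy_interface(text)  # the legacy code now receives a unicode string
print(repr(result))
```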

At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it to a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the str type for both plain-ASCII strings and byte sequences.

>>> u'é'.encode('utf-8')    # unicode string
'\xc3\xa9'                  # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9'                     # unicode string
>>> u'\xe9' == u'é'
True
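
The two errors quoted in the question both mention the 'ascii' codec because Python 2 inserts a hidden conversion with its default ASCII codec when decode() is called on a unicode string or encode() is called on a byte string. The sketch below spells that hidden step out explicitly so that it also runs on Python 3. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
text = u"贸易"

# Python 2 treats text.decode('utf-8') on a unicode string roughly as
# text.encode('ascii').decode('utf-8'); the hidden encode step fails.
try:
    text.encode("ascii").decode("utf-8")
    first_error = None
except UnicodeEncodeError as exc:
    first_error = exc

data = text.encode("utf-8")  # a byte string

# Python 2 treats data.encode('utf-8') on a byte string roughly as
# data.decode('ascii').encode('utf-8'); the hidden decode step fails.
try:
    data.decode("ascii").encode("utf-8")
    second_error = None
except UnicodeDecodeError as exc:
    second_error = exc

print(type(first_error).__name__)
print(type(second_error).__name__)
```

This suggests the string reaching the legacy code in the failing case is already a byte string rather than unicode, which is why decoding it once before the call is the usual fix.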