Question

I ran into a UnicodeEncodeError in a Django project and finally fixed it (after much frustration) by changing the return value of the faulty __unicode__ method from

return unicode("<span><b>{0}</b>{1}<span>".format(val_str, self.text))

to

return u"<span><b>{0}</b>{1}<span>".format(val_str, self.text)

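But I'm confused about why this worked (or rather, why there was a problem in the first place). Don't the u prefix and the unicode function do the same thing? When I try them in a console, they seem to give the same result: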

# with the function
>>> test = unicode("<span><b>{0}</b>{1}<span>".format(2,4))
>>> test
u'<span><b>2</b>4<span>'
>>> type(test)
<type 'unicode'>

# with the prefix
>>> test = u"<span><b>{0}</b>{1}<span>".format(2,4)
>>> test
u'<span><b>2</b>4<span>'
>>> type(test)
<type 'unicode'>

But it seems the encoding is somehow done differently depending on which one is used. What is going on here?

Answer 1

Your problem lies in what you apply unicode() to; your two expressions are not equivalent.

unicode("<span><b>{0}</b>{1}<span>".format(val_str, self.text))

applies unicode() to the result of:

"<span><b>{0}</b>{1}<span>".format(val_str, self.text)

while

u"<span><b>{0}</b>{1}<span>".format(val_str, self.text)

is the equivalent of:

unicode("<span><b>{0}</b>{1}<span>").format(val_str, self.text)

Note the position of the closing parenthesis!

So your first version formats first, and only then converts the formatted result to unicode. This is an important difference!
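The ordering can be sketched as follows. This is only an illustration: convert and format_bytes are hypothetical stand-ins for unicode() and Python 2's str.format(), and the explicit .encode("ascii") is an assumption that mimics the implicit encoding step so the snippet behaves the same on Python 2 and Python 3.

```python
calls = []

def convert(value):
    # Hypothetical stand-in for unicode(); records that it was reached.
    calls.append("convert")
    return value

def format_bytes(template, value):
    # Hypothetical stand-in for Python 2's str.format(): the argument is
    # forced through the ASCII codec before it is substituted.
    calls.append("format")
    return template.format(value.encode("ascii").decode("ascii"))

try:
    # The inner call is evaluated first; if it raises, convert() never runs.
    convert(format_bytes("<b>{0}</b>", u"\xe5"))
except UnicodeEncodeError:
    pass
```

After this runs, calls contains only "format", which shows that the outer conversion is never reached once the inner formatting step has failed.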

When you use str.format() with unicode values, those values are passed to str(), which implicitly encodes them to ASCII. This causes your exception:

>>> 'str format: {}'.format(u'unicode ascii-range value works')
'str format: unicode ascii-range value works'
>>> 'str format: {}'.format(u"unicode latin-range value doesn't work: å")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 40: ordinal not in range(128)

It doesn't matter that you call unicode() on the result; the exception has already been raised.
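To make the hidden step visible, here is a minimal sketch that performs the ASCII encoding explicitly. implicit_str is a hypothetical helper, not part of Python; it assumes that what str() does to a unicode value in Python 2 is equivalent to encoding with the 'ascii' codec.

```python
def implicit_str(value):
    # Assumption: mirrors Python 2's str(unicode_value), which encodes
    # with the default 'ascii' codec.
    return value.encode("ascii")

# ASCII-range text survives the encoding step.
ok = implicit_str(u"plain ascii")

# Anything outside the ASCII range fails at the encoding step itself.
try:
    implicit_str(u"latin-range: \xe5")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    bad_position = exc.start  # index of the offending character
```

The exception carries the position of the first character that could not be encoded, which is the same number reported in the traceback message above.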

Formatting with unicode.format(), on the other hand, has no such problem:

>>> u'str format: {}'.format(u'unicode latin-range value works: å')
u'str format: unicode latin-range value works: \xe5'
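Applied to the method from the question, the fix might look like the sketch below. Item, value and text are hypothetical names chosen for illustration (the question does not show the class); the template string is the one from the question, and no Django is needed to run it.

```python
class Item(object):
    # Hypothetical model-like class; only the __unicode__ body is taken
    # from the question.
    def __init__(self, value, text):
        self.value = value
        self.text = text

    def __unicode__(self):
        # The template is unicode from the start, so formatting never
        # goes through the ASCII codec.
        return u"<span><b>{0}</b>{1}<span>".format(self.value, self.text)

item = Item(2, u"\xe5")
rendered = item.__unicode__()
```

Here rendered keeps the non-ASCII character intact, because the u prefix makes .format() resolve to unicode.format() rather than str.format().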

Python: unicode function vs. u prefix
