Question

While working through a Unicode problem, I found that unicode(self) and self.__unicode__() behave differently:

#-*- coding:utf-8 -*-
import sys
import dis
class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')

    def __str__(self):
        return self.__unicode__()
print dis.dis(test)
a = test()
print a

The code above works fine, but if I change self.__unicode__() to unicode(self), it fails with this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The problematic code is:

#-*- coding:utf-8 -*-
import sys
import dis
class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')

    def __str__(self):
        return unicode(self)
print dis.dis(test)
a = test()
print a

I was curious how Python handles this, so I tried the dis module, but I did not see much of a difference in the disassembly:

Disassembly of __str__:
 12           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                0 (__unicode__)
              6 CALL_FUNCTION            0
              9 RETURN_VALUE

VS

Disassembly of __str__:
 10           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (self)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE

Answer 1

You are returning bytes from your __unicode__ method.

To make it clear:

In [18]: class Test(object):
    def __unicode__(self):
        return u'äö↓'.encode('utf-8')
    def __str__(self):
        return unicode(self)
   ....:     

In [19]: class Test2(object):
    def __unicode__(self):
        return u'äö↓'
    def __str__(self):
        return unicode(self)
   ....:     

In [20]: t = Test()

In [21]: t.__str__()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/home/dav1d/<ipython-input-21-e2650f29e6ea> in <module>()
----> 1 t.__str__()

/home/dav1d/<ipython-input-18-8bc639cbc442> in __str__(self)
      3         return u'äö↓'.encode('utf-8')
      4     def __str__(self):
----> 5         return unicode(self)
      6 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

In [22]: unicode(t)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/home/dav1d/<ipython-input-22-716c041af66e> in <module>()
----> 1 unicode(t)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

In [23]: t2 = Test2()

In [24]: t2.__str__()
Out[24]: u'\xe4\xf6\u2193'

In [25]: str(_) # _ = last result
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/home/dav1d/<ipython-input-25-3a1a0b74e31d> in <module>()
----> 1 str(_) # _ = last result

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

In [26]: unicode(t2)
Out[26]: u'\xe4\xf6\u2193'

In [27]: class Test3(object):
    def __unicode__(self):
        return u'äö↓'
    def __str__(self):
        return unicode(self).encode('utf-8')
   ....:     

In [28]: t3 = Test3()

In [29]: t3.__unicode__()
Out[29]: u'\xe4\xf6\u2193'

In [30]: t3.__str__()
Out[30]: '\xc3\xa4\xc3\xb6\xe2\x86\x93'

In [31]: print t3
äö↓

In [32]: print unicode(t3)
äö↓

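print a (or, in my case, print t) calls t.__str__, which is expected to return bytes. You let it return unicode, so Python tries to encode the result with ascii, which fails.

Easy fix: have __unicode__ return unicode and __str__ return bytes.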

Answer 2

s = u'中文'
return s.encode('utf-8')

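This returns a byte string, not a Unicode string. That is what encode does. utf-8 is not something that magically turns data into Unicode; if anything, it is the opposite: a way of representing Unicode (an abstraction) as bytes (data, more or less).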

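We need some terminology. To encode is to take a Unicode string and produce a byte string that represents it, using some encoding. To decode is the reverse: to take a byte string (which we believe encodes a Unicode string) and interpret it as a Unicode string, using a specified encoding.

When we encode to a byte string and then decode with the same encoding, we get the original Unicode back.

utf-8 is one possible encoding. There are many, many more.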
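The round trip described above can be sketched as follows. This is only an illustration written for Python 3 (where byte strings and text are separate types); the choice of gbk as the second encoding and all variable names are assumptions made for the example.

```python
# -*- coding: utf-8 -*-
text = u'中文'

# Encoding: Unicode text -> byte string. Different encodings give different bytes.
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Decoding with the SAME encoding recovers the original text.
round_trip_utf8 = utf8_bytes.decode('utf-8')
round_trip_gbk = gbk_bytes.decode('gbk')

print(repr(utf8_bytes), len(utf8_bytes))
print(repr(gbk_bytes), len(gbk_bytes))
print(round_trip_utf8 == text, round_trip_gbk == text)
```

The same two characters become six bytes in utf-8 and four bytes in gbk, which is why a byte string alone does not tell you what text it stands for.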

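Sometimes Python reports a UnicodeDecodeError when you call encode. Why? Because you tried to encode a byte string. The proper input for this process is a Unicode string, so Python "helpfully" tries to decode the byte string to Unicode first. But it does not know which codec to use, so it assumes ascii. In an environment where you could receive all kinds of data, that codec is the safest choice: it only reports an error for bytes >= 128, which are handled in many different ways by the various 8-bit encodings. (Remember trying to import a Word file containing letters like é from a Mac to a PC, or vice versa, back in the day? You would get some other strange symbol on the other computer, because the platforms' built-in encodings were different.)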
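To make the hidden step visible, here is a rough model of it written for Python 3, where the implicit conversion no longer exists and has to be spelled out. The helper py2_style_encode is hypothetical, not a real API; it only imitates the "decode as ascii first, then encode" behaviour described above.

```python
# -*- coding: utf-8 -*-

def py2_style_encode(byte_string, encoding):
    """Hypothetical model of calling .encode() on a byte string in Python 2."""
    # The implicit step: guess ascii to get Unicode text first.
    as_text = byte_string.decode('ascii')
    # The step that was actually requested.
    return as_text.encode(encoding)

raw = u'é'.encode('utf-8')  # a byte string containing bytes >= 128

try:
    py2_style_encode(raw, 'utf-8')
    error = None
except UnicodeDecodeError as exc:
    # A *decode* error, even though the caller asked for an *encode*.
    error = exc

# Pure-ascii input passes through the implicit step unharmed.
ascii_only = py2_style_encode(b'abc', 'utf-8')

# The same byte >= 128 means different characters in different 8-bit encodings.
same_byte_two_readings = (b'\xe9'.decode('latin-1'), b'\xe9'.decode('mac_roman'))

print(error)
print(ascii_only)
print(same_byte_two_readings)
```

The last line is the Mac-versus-PC situation in miniature: one byte, two platform encodings, two different letters.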

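To complicate things further, in Python 2 the encode/decode machinery is also used to implement some other neat things that have nothing to do with interpreting Unicode. For example, there is a Base64 codec, and one that processes string escape sequences (i.e., it turns a backslash followed by the letter 't' into a tab). Some of these "encode" or "decode" from a byte string to a byte string, or from Unicode to Unicode.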
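As a small illustration of such non-Unicode codecs, the sketch below goes through the codecs module, which exposes them in Python 3 as well (in Python 2 the same codec names could also be passed straight to the string's own encode and decode methods). The sample inputs are made up for the example.

```python
import codecs

# Byte string -> byte string: Base64 has nothing to do with character sets.
b64 = codecs.encode(b'hello', 'base64')
back = codecs.decode(b64, 'base64')

# Escape processing: a backslash followed by 't' becomes a real tab character.
unescaped = codecs.decode(b'a\\tb', 'unicode_escape')

print(repr(b64))
print(repr(back))
print(repr(unescaped))
```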

(By the way, this all works completely differently, and more clearly, IMHO, in Python 3.)

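Similarly, when __unicode__ returns a byte string (which, as a matter of style, it should not), the Python unicode() built-in automatically decodes it as ascii; when __str__ returns a Unicode string (which it also should not), str() encodes it as ascii. This happens behind the scenes, in code you cannot control. You can, however, fix __unicode__ and __str__ to do what they are supposed to do.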
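This is also the whole difference between unicode(self) and self.__unicode__(): the direct method call hands back whatever the method returned, while the built-in adds a conversion step on top. Below is a simplified model of that step, written for Python 3 so it can be run today. The function py2_style_unicode and the classes Broken and Fixed are hypothetical names for this sketch, not how the interpreter is actually implemented.

```python
# -*- coding: utf-8 -*-

def py2_style_unicode(obj):
    """Simplified model of Python 2's unicode(obj) for objects with __unicode__."""
    result = obj.__unicode__()
    if isinstance(result, bytes):
        # The behind-the-scenes step: a byte string is decoded as ascii.
        result = result.decode('ascii')
    return result

class Broken(object):
    def __unicode__(self):
        return u'中文'.encode('utf-8')  # returns bytes, as in the question

class Fixed(object):
    def __unicode__(self):
        return u'中文'                  # returns Unicode text

# Calling the method directly involves no conversion, so nothing fails.
direct = Broken().__unicode__()

# Going through the built-in triggers the implicit ascii decode.
try:
    py2_style_unicode(Broken())
    broken_error = None
except UnicodeDecodeError as exc:
    broken_error = exc

fixed_text = py2_style_unicode(Fixed())

print(repr(direct))
print(broken_error)
print(fixed_text)
```

The first byte of the utf-8 data is 0xe4, which matches the byte named in the error message from the question.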

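(You can, in fact, override the encoding for unicode by passing a second argument. However, that is the wrong solution here, since you should already be returning a Unicode string from __unicode__. And str does not take an encoding argument, so you are out of luck there.)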
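For completeness, this is roughly what the second argument does when the input is a plain byte string. The sketch uses type(u'') to pick up unicode on Python 2 and str on Python 3; the name text_type is an assumption of this example, and the two-argument form is shown only for byte strings.

```python
# -*- coding: utf-8 -*-
text_type = type(u'')  # unicode on Python 2, str on Python 3

raw = u'中文'.encode('utf-8')

# With an explicit encoding, the byte string is decoded correctly.
decoded = text_type(raw, 'utf-8')

# Naming ascii explicitly reproduces the failure from the question.
try:
    text_type(raw, 'ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(decoded)
print(ascii_failed)
```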

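So, now we can solve the problem.

Problem: we want __unicode__ to return the Unicode string u'中文', and we want __str__ to return the utf-8-encoded version of it.

Solution: return that string directly from __unicode__, and do the encoding explicitly in __str__: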

class test():
    def __unicode__(self):
        return u'中文'

    def __str__(self):
        return unicode(self).encode('utf-8')

Answer 3

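When you call unicode on a Python object, the output is the unicode representation of the argument you passed to unicode.

Since you have not specified which encoding should be used, you get an error saying that the argument can only be represented using ASCII.

When you use __unicode__, you specify that the string should be encoded with utf-8, which is correct and causes no problem.

You can pass the desired encoding as the second argument to unicode, for example: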

unicode( str, "utf-8" )

This should work the same way as your __unicode__ method.

Answer 4

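When you define the __unicode__ special method, you tell it which encoding to use. When you simply call unicode, you do not specify an encoding, so Python uses the default, "ascii".

BTW, __str__ should return a byte string, not unicode, and __unicode__ should return unicode, not a byte string. So this code is backwards. Since __unicode__ does not return unicode, Python is probably trying to convert the result using the default encoding.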

What is the difference between unicode(self) and self.__unicode__() in a Python class?
