Question

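In Python 3.x, a string consists of items of Unicode ordinal. (See the quotation from the language reference below.) What is the internal representation of a Unicode string? Is it UTF-16?

The items of a string object are Unicode code units. A Unicode code unit is represented by a string object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items.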

Answer 1

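The internal representation will change in Python 3.3, which implements PEP 393. The new representation will pick one or several of ascii, latin-1, utf-8, utf-16, utf-32, generally trying to get a compact representation.

Implicit conversions into surrogate pairs will only be done when talking to legacy APIs (those only exist on Windows, where wchar_t is two bytes); the Python string will be preserved. Here are the release notes.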

Answer 2

In Python 3.3 and above, the internal representation of a string depends on the string itself, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393.
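
The rule PEP 393 describes can be sketched in pure Python: the widest code point in the string decides how many bytes each character takes. The helper name `pep393_kind` is hypothetical and only mirrors that rule; CPython makes the choice in C when the string is created.

```python
def pep393_kind(s):
    """Return the bytes per character PEP 393 would pick for s (1, 2 or 4)."""
    # An empty string has no code points; treat it as the narrowest kind.
    widest = max(map(ord, s), default=0)
    if widest < 0x100:      # fits in latin-1 (ASCII is a subset)
        return 1
    if widest < 0x10000:    # fits in UCS-2 (Basic Multilingual Plane)
        return 2
    return 4                # needs UCS-4


for text in ("test", "Привет", "\U0001F600"):
    print(ascii(text), pep393_kind(text))
```

Note that a single wide character is enough to widen the whole string, since every character in one string uses the same width.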

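For earlier Pythons, the internal representation depends on the build flags of Python. Python can be built with the flag values --enable-unicode=ucs2 or --enable-unicode=ucs4. ucs2 builds do in fact use UTF-16 as their internal representation, and ucs4 builds use UCS-4 / UTF-32.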
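
Which kind of build is running can be checked at run time through sys.maxunicode, the value mentioned in the language reference quoted in the question. A minimal sketch; the function name `describe_build` is made up for illustration.

```python
import sys


def describe_build():
    """Classify the running interpreter by its largest supported code point."""
    if sys.maxunicode == 0xFFFF:
        # Narrow ("ucs2") build before 3.3: 16-bit code units, surrogate pairs.
        return "narrow"
    # 0x10FFFF: a wide ("ucs4") build, or any Python 3.3+ (PEP 393).
    return "wide or PEP 393"


print(hex(sys.maxunicode), describe_build())
```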

Answer 3

Looking at the source code for CPython 3.1.5, in Include/unicodeobject.h:

/* --- Unicode Type ------------------------------------------------------- */

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;            /* Raw Unicode buffer */
    long hash;                  /* Hash value; -1 if not set */
    int state;                  /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are *not* counted in ob_refcnt. */
    PyObject *defenc;           /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
} PyUnicodeObject;

The characters are stored as an array of Py_UNICODE. On most platforms, I believe Py_UNICODE is #defined as wchar_t.

Answer 4

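There has been NO CHANGE in the Unicode internal representation between Python 2.X and 3.X.

It's definitely NOT UTF-16. UTF-anything is a byte-oriented EXTERNAL representation.

Each code unit (character, surrogate, etc.) has been assigned a number from range(0, 2 ** 21). This is called its "ordinal".

Really, the documentation you quoted says it all. Most Python binaries use 16-bit ordinals, which restricts you to the Basic Multilingual Plane ("BMP") unless you want to muck about with surrogates (handy if you can't find your hair shirt and your bed of nails is off being de-rusted). For working with the full Unicode repertoire, you'd prefer a "wide build" (32 bits wide).

Briefly, the internal representation in a unicode object is an array of 16-bit unsigned integers, or an array of 32-bit unsigned integers (using only 21 bits).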
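
To make the distinction between ordinals and 16-bit code units concrete, the sketch below splits the UTF-16 encoding of a string into 16-bit units. A character outside the BMP shows up there as a surrogate pair, which is what a 16-bit build would hold as two items. The helper `utf16_units` is hypothetical, not part of any library, and the example assumes Python 3.3 or later.

```python
import struct


def utf16_units(s):
    """Return the 16-bit code units of s in its UTF-16 encoding."""
    data = s.encode("utf-16-le")          # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


ch = "\U0001F600"                          # a code point outside the BMP
print(hex(ord(ch)))                        # its ordinal
print([hex(u) for u in utf16_units(ch)])   # high and low surrogate
print([hex(u) for u in utf16_units("A")])  # BMP characters need one unit
```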

Answer 5

It depends: see here. This is still true for Python 3 as far as the internal representation goes.

Answer 6

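I think it's hard to compare UTF-16, which is just a sequence of 16-bit words, with Python's string object.

And if Python is compiled with the Unicode=UCS4 option, the comparison would be between UTF-32 and a Python string.

So it's better to consider them as belonging to different categories, even though you can convert one into the other.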
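
The "different categories" point can be seen directly: a str is a sequence of code points, while UTF-16 is a byte serialization obtained by encoding it. A small sketch, assuming Python 3.3 or later:

```python
text = "Привет мир!"

encoded = text.encode("utf-16-le")   # bytes: an external representation
decoded = encoded.decode("utf-16-le")

print(type(text).__name__, len(text))        # str, counted in code points
print(type(encoded).__name__, len(encoded))  # bytes, counted in bytes
print(decoded == text)                       # the round trip is lossless
```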

Answer 7

>>> import array; s = 'Привет мир!'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b'\x1f\x04\x00\x00@\x04\x00\x008\x04\x00\x002\x04\x00\x005\x04\x00\x00B\x04\x00\x00 \x00\x00\x00<\x04\x00\x008\x04\x00\x00@\x04\x00\x00!\x00\x00\x00'
True
>>> import array; s = 'test'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'
True
>>>

Answer 8

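The internal representation varies from latin-1 through UCS-2 to UCS-4. UCS means that the representation is 2 or 4 bytes long and that the Unicode code units are numerically equal to the corresponding code points. We can check this by finding where the size of the code units changes.

To show that they range from 1 byte for latin-1 to 4 bytes for UCS-4: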

>>> from sys import getsizeof
>>> getsizeof('')
49
>>> getsizeof('a')  # + 1 byte, as the representation here is latin-1
50
>>> getsizeof('\U0010ffff')
80
>>> getsizeof('\U0010ffff\U0010ffff')  # + 4 bytes, as the representation here is UCS-4
84

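We can check that the representation at the start is indeed latin-1 and not UTF-8, because the change to 2-byte code units happens at the byte boundary ('\U000000ff' - '\U00000100') and not at the '\U0000007f' - '\U00000080' boundary as it would in UTF-8: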

>>> getsizeof('\U0000007f')  
50
>>> getsizeof('\U00000080')  # The size of the string changes at the \x7f - \x80 boundary, but..
74
>>> getsizeof('\U00000080\U00000080') # ..the size of the code-unit is still one. so not UTF-8
75

>>> getsizeof('\U000000ff')  
74
>>> getsizeof('\U000000ff\U000000ff')# (+1 byte)    
75
>>> getsizeof('\U00000100')  
76
>>> getsizeof('\U00000100\U00000100') # Size change at byte boundary(+2 bytes). Rep is UCS-2.             
78

>>> getsizeof('\U0000ffff') 
76
>>> getsizeof('\U0000ffff\U0000ffff') # (+ 2 bytes)
78
>>> getsizeof('\U00010000')            
80
>>> getsizeof('\U00010000\U00010000')  # (+ 4 bytes) The size of the code unit changes to 4 at a byte boundary again.
84
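
The manual probing above can be wrapped in a small helper that measures how much one extra character adds to the size of a string. This is a sketch that relies on how sys.getsizeof behaves on CPython 3.3 and later; the name `bytes_per_char` is made up here, and other implementations may report different sizes.

```python
from sys import getsizeof


def bytes_per_char(ch):
    """Measure the storage cost of one more copy of ch (CPython 3.3+)."""
    # Compare two freshly built strings so the fixed header cancels out.
    return getsizeof(ch * 3) - getsizeof(ch * 2)


for ch in ("\x7f", "\x80", "\xff", "\u0100", "\uffff", "\U00010000"):
    print(ascii(ch), bytes_per_char(ch))
```

Measuring the difference rather than the absolute size keeps the check independent of the object header, whose size varies between Python versions.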

What is the internal representation of strings in Python 3.x?
