Question

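CPython optimizes string += operations: when it allocates the memory for a string, it leaves extra room for expansion, so on += the original string is not copied to a new location. My question is why the id of the string variable changes.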

>>> s = 'ab'
>>> id(s)
991736112104
>>> s += 'cd'
>>> id(s)
991736774080

Why does the id of the string variable change?

Answer 1

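Strings are immutable. Using += on a str is not an in-place operation; it creates a new object with a new memory address, which is what id() returns in CPython's implementation.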

For str in particular, __iadd__ is not defined, so the operation falls back to __add__ or __radd__. See the data model section of the Python documentation for details.

>>> hasattr(s, '__iadd__')
False
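
To make the contrast concrete, here is a small sketch comparing str with list, which does define __iadd__. The variable names are mine and only serve the illustration.

```python
# str has no __iadd__, so += falls back to __add__ and rebinds the name.
# list defines __iadd__, so += mutates the existing object in place.
print(hasattr(str, '__iadd__'))   # False
print(hasattr(list, '__iadd__'))  # True

items = [1, 2]
items_id = id(items)
items += [3]                      # calls list.__iadd__
print(id(items) == items_id)      # True: still the same list object

text = 'ab'
alias = text                      # keep the original string alive
text += 'cd'                      # builds a new str and rebinds `text`
print(text is alias)              # False: `alias` still refers to 'ab'
print(alias)                      # ab
```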

Answer 2

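The optimization you are trying to trigger is an implementation detail of CPython and a quite subtle thing: there are many details (such as the one you are experiencing) that can prevent it.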

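A detailed explanation requires diving into CPython's implementation, so I will first try to give a hand-waving explanation, which should at least convey the gist of what is going on. The gory details come in the second part, which highlights the important parts of the code.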

Let's take a look at this function, which exhibits the desired/optimized behavior:

def add_str(str1, str2, n):
    for i in range(n):
        str1 += str2
        print(id(str1))  # address of the (possibly reused) string object
    return str1

Calling it leads to the following output:

>>> add_str("1","2",100)
2660336425032
... 4 times
2660336425032
2660336418608
... 6 times
2660336418608
2660336361520
... 6 times
2660336361520
2660336281800
 and so on

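That is, a new string is created only on every 8th addition; otherwise the old string (or, as we will see, its memory) is reused. The first id is printed only 6 times because printing starts when the size of the unicode object is 2 modulo 8 (and not 0, as in the later cases).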
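
If you want to check the "2 modulo 8" claim on your own machine, sys.getsizeof reports the size of the unicode object in bytes. This is only a sketch: the absolute numbers depend on the CPython version and platform, so I am not quoting any output here.

```python
import sys

# Sizes are CPython- and platform-specific; only the pattern matters here.
for text in ("1", "12", "123", "1234"):
    size = sys.getsizeof(text)
    print(len(text), size, size % 8)

# For ASCII-only strings each extra character costs one extra byte.
growth = sys.getsizeof("12") - sys.getsizeof("1")
print(growth)
```

Since the object grows by one byte per appended ASCII character, the size crosses a multiple of 8 on every 8th append, which matches the pattern in the ids above.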

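The first question is: if strings are immutable in CPython, how (or better, when) can one be changed? Obviously, we cannot change the string if it is bound to several variables, but we can change it if the current variable is the only reference. Thanks to CPython's reference counting, that is easy to check (and it is the reason why this optimization is not available in other implementations that don't use reference counting).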

Let's change the function above by adding an additional reference:

def add_str2(str1, str2, n):
    for i in range(n):
        ref = str1  # second reference to the string: blocks in-place resizing
        str1 += str2
        print(id(str1))
    return str1

Calling it leads to:

>>> add_str2("1","2",20)
2660336437656
2660337149168
2660337149296
2660337149168
2660337149296
... every time a different string - there is copying!
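
Instead of eyeballing printed ids, the two variants can be compared by counting how often the id changes. This is a sketch with a helper name of my own (count_id_changes); the count without the extra reference depends on the CPython version and allocator state, so treat it as indicative only.

```python
def count_id_changes(n, keep_ref):
    """Count how often id(s) changes over n appends of a single character."""
    s = "".join(["x"] * 50)      # built at run time, so not an interned literal
    ref = None
    changes = 0
    for _ in range(n):
        old_id = id(s)
        if keep_ref:
            ref = s              # second reference, as in add_str2
        s += "y"
        if id(s) != old_id:
            changes += 1
    return changes

print(count_id_changes(100, keep_ref=False))  # typically small on CPython
print(count_id_changes(100, keep_ref=True))   # 100: the old string is still alive
```

With the extra reference the old string is still alive when the new one is created, so the two can never share an address and every append changes the id.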

This actually explains your observation:

import sys
s = 'ab'
print(sys.getrefcount(s))
# 9
print(id(s))
# 2660273077752
s += 'a'
print(id(s))
# 2660337158664  Different

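Your string s is interned (see this SO-answer for more about string interning and the integer pool), so s is not the only one "using" this string, and thus the string cannot be changed.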
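
As a quick illustration of what interning means, here is a sketch using sys.intern, which makes equal strings share one canonical object. The strings and variable names are made up for the example.

```python
import sys

# Two equal strings built at run time are separate objects in CPython.
a = "".join(["interning", "-", "demo"])
b = "".join(["interning", "-", "demo"])
print(a == b)               # True: same value

# sys.intern returns the one canonical object for that value.
canon_a = sys.intern(a)
canon_b = sys.intern(b)
print(canon_a is canon_b)   # True: both names now share one object
```

A short literal like 'ab' is shared in the same way by everything that uses it, which is why it must not be modified in place.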

If we avoid interning, we can see that the string is reused:

import sys
s = 'ab'*21  # will not be interned
print(sys.getrefcount(s))
# 2, that means really not interned
print(id(s))
# 2660336107312
s += 'a'
print(id(s))
# 2660336107312  the same id!

But how does this optimization work?

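CPython uses its own memory management, the pymalloc allocator, which is optimized for small objects with short lifetimes. The memory blocks it uses are multiples of 8 bytes, which means that if the allocator is asked for only 1 byte, 8 bytes are still marked as used (more precisely: because of the 8-byte alignment of the returned pointers, the remaining 7 bytes cannot be used for other objects).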

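However, there is the function PyMem_Realloc: if the allocator is asked to reallocate a 1-byte block as a 2-byte block, there is nothing to do, because there were some reserved bytes anyway.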

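This way, if there is only one reference to the string, CPython can ask the allocator to reallocate the string and request one more byte. In 7 cases out of 8 there is nothing for the allocator to do, and the additional byte becomes available almost for free.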

However, if the size of the string changes by more than 7 bytes, copying becomes mandatory:

>>> add_str("1", "1"*8, 20)  # size change of 8
2660337148912
2660336695312
2660336517728
... every time another id
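
The "7 out of 8" argument is plain arithmetic and can be simulated. The sketch below is a toy model, not CPython code: round_up and count_moves are names I made up, the block size is assumed to be the request rounded up to a multiple of 8, and the 512-byte limit of pymalloc is ignored.

```python
def round_up(nbytes, alignment=8):
    """Smallest multiple of `alignment` that is >= nbytes."""
    return (nbytes + alignment - 1) // alignment * alignment

def count_moves(start_size, step, n, alignment=8):
    """Toy model: how many of n appends of `step` bytes outgrow the block."""
    size = start_size
    capacity = round_up(size, alignment)
    moves = 0
    for _ in range(n):
        size += step
        if size > capacity:          # does not fit: a new block is needed
            moves += 1
            capacity = round_up(size, alignment)
    return moves

print(count_moves(51, 1, 99))   # 12: roughly one move per 8 appends
print(count_moves(51, 8, 20))   # 20: a step of 8 never fits in the slack
```

The start size of 51 is just an example value; any start size gives the same picture, because the slack in a block is at most 7 bytes.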

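Furthermore, for requests larger than 512 bytes pymalloc falls back to PyMem_RawMalloc, which is usually the memory manager of the C runtime, and the above optimization for strings is no longer possible: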

>>> add_str("1"*512, "1", 20)  # str1 is larger than 512 bytes
2660318800256
2660318791040
2660318788736
2660318807744
2660318800256
2660318796224
... every time another id

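Actually, whether the addresses differ after each reallocation depends on the memory allocator of the C runtime and its state. If the memory is not fragmented, chances are high that realloc manages to extend the memory without copying (but that was not the case on my machine when I did these experiments); see also this SO-post.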

For the curious, here is the entire traceback of the str1+=str2 operation, which can easily be followed in a debugger:

Here is what is going on:

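+= is compiled to the INPLACE_ADD opcode (BINARY_OP since CPython 3.11), which, like the BINARY_ADD case shown here, gets a hook/special handling for unicode objects when it is evaluated in ceval.c (see PyUnicode_CheckExact):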

case TARGET(BINARY_ADD): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    ...
    if (PyUnicode_CheckExact(left) &&
             PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    ...
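
If you would rather not read ceval.c, the dis module shows which opcode your interpreter actually emits for +=. This is a sketch; the function name inplace_concat is mine, and the opcode name depends on the CPython version, so no output is quoted.

```python
import dis

def inplace_concat(s, t):
    s += t
    return s

# Collect the opcode names of the compiled function.
opnames = [ins.opname for ins in dis.get_instructions(inplace_concat)]
print(opnames)

# The concatenation shows up as INPLACE_ADD on older versions
# and as BINARY_OP on CPython 3.11 and later.
add_ops = [name for name in opnames if name in ("INPLACE_ADD", "BINARY_OP")]
print(add_ops)
```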

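unicode_concatenate ends up calling PyUnicode_Append, which checks whether the left operand is modifiable (which basically checks that there is only one reference, that the string isn't interned, and some further things) and either resizes it or creates a new unicode object otherwise: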

if (unicode_modifiable(left)
    && ...)
{
    /* append inplace */
    if (unicode_resize(p_left, new_len) != 0)
        goto error;

    /* copy 'right' into the newly allocated area of 'left' */
    _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
}
else {
    ...
    /* Concat the two Unicode strings */
    res = PyUnicode_New(new_len, maxchar);
    if (res == NULL)
        goto error;
    _PyUnicode_FastCopyCharacters(res, 0, left, 0, left_len);
    _PyUnicode_FastCopyCharacters(res, left_len, right, 0, right_len);
    Py_DECREF(left);
    ...
}

unicode_resize ends up calling resize_compact (mostly because in our case we have only ASCII characters), which ends up calling PyObject_REALLOC:

...
new_unicode = (PyObject *)PyObject_REALLOC(unicode, new_size);
...

which basically ends up calling pymalloc_realloc:

static int
pymalloc_realloc(void *ctx, void **newptr_p, void *p, size_t nbytes)
{
    ...
    /* pymalloc is in charge of this block */
    size = INDEX2SIZE(pool->szidx);
    if (nbytes <= size) {
        /* The block is staying the same or shrinking. ... */
        ...
            *newptr_p = p;
            return 1; // 1 means success!
        ...
    }
    ...
}

where INDEX2SIZE just rounds up to the nearest multiple of 8:

#define ALIGNMENT               8               /* must be 2^N */
#define ALIGNMENT_SHIFT         3

/* Return the number of bytes in size class I, as a uint. */
#define INDEX2SIZE(I) (((uint)(I) + 1) << ALIGNMENT_SHIFT)
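
For those who prefer Python to macros, here is a sketch that mirrors INDEX2SIZE. The inverse mapping size_class_index (request size to size-class index) is not part of the excerpt above; it is my recollection of how obmalloc.c computes the index, so treat it as an assumption.

```python
ALIGNMENT_SHIFT = 3  # mirrors the macro quoted above (8-byte alignment)

def size_class_index(nbytes):
    """Assumed index of the size class serving a request of nbytes >= 1."""
    return (nbytes - 1) >> ALIGNMENT_SHIFT

def index2size(i):
    """Python mirror of INDEX2SIZE: bytes in size class i."""
    return (i + 1) << ALIGNMENT_SHIFT

for nbytes in (1, 8, 9, 50, 56, 57):
    print(nbytes, index2size(size_class_index(nbytes)))
```

A 50-byte request lands in the 56-byte class, so growing it up to 56 bytes satisfies nbytes <= size in pymalloc_realloc and the pointer is returned unchanged.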

Q.

CPython: Why does += on a string change the id of the string variable?

2 answers: