Question

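Originally I was comparing the performance of built-in D arrays and plain pointers, but I ended up with a different problem. For some reason, if I run two identical for loops one after another, the second one always finishes faster.

Here is the code: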

import std.stdio : writeln;
import std.datetime : StopWatch;
import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    StopWatch sw;

    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 1;
    sw.stop();
    writeln("the first for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;
    sw.stop();
    writeln("the second for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    free(ptr);
}

After compiling and running it with dmd -release -O -noboundscheck -inline test.d -of=test && ./test, it prints:

the first for loop time: 1253 msecs
the second for loop time: 357 msecs

I wasn't sure whether this was related to D or dmd, so I rewrote the code in C++:

#include <iostream>
#include <chrono>
#include <cstdlib>

int main()
{
    const unsigned int N = 1000000000;

    unsigned int* ptr = (unsigned int*)malloc(sizeof(unsigned int) * N);

    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < N; ++i)
        ptr[i] = 1;
    auto finish = std::chrono::high_resolution_clock::now();
    auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the first for loop time: " << milliseconds.count() << " msecs" << std::endl;

    start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < N; ++i)
        ptr[i] = 2;
    finish = std::chrono::high_resolution_clock::now();
    milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the second for loop time: " << milliseconds.count() << " msecs" << std::endl;

    free(ptr);
}

And g++ -O3 test.cpp -o test && ./test gives similar output:

the first for loop time: 1029 msecs
the second for loop time: 349 msecs

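The result is the same every time I run this code. The allocated data (roughly 4 GB) is too large to fit in cache. There are no branch points, so branch prediction should not be an issue. Both loops access memory in the same sequential order, so I guess it should not be related to memory layout.

So why does the second loop run faster than the first?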

Answer 1

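Because uint* ptr = cast(uint*)malloc(uint.sizeof * N); does not actually allocate the memory until the loop touches the elements. On typical systems the call only reserves address space, and each page is backed by physical memory the first time it is written, so the first loop also pays for those page faults. You can test it: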

import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    foreach (_; 0 .. 100)
        for (uint i = 0; i < N; ++i)
            ptr[N-1] = 1;

    // until this point almost no memory is allocated
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;

    free(ptr);
}
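
One way to check this explanation on a POSIX system is to count minor page faults around each pass. The following is only a sketch under stated assumptions, not the answerer's code: minor_faults and fill_and_count_faults are hypothetical helper names, and the code relies on getrusage from <sys/resource.h>, so it will not build on Windows. The pointer is marked volatile so the compiler does not drop the stores.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sys/resource.h>  // getrusage (POSIX only)

// Minor page faults taken by this process so far.
// (Hypothetical helper; a failed getrusage call would leave the count at 0.)
long minor_faults()
{
    rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt;
}

// Fill the buffer with `value` and return how many minor page faults
// occurred while doing so.
long fill_and_count_faults(volatile unsigned int* ptr, std::size_t n, unsigned int value)
{
    const long before = minor_faults();
    for (std::size_t i = 0; i < n; ++i)
        ptr[i] = value;
    return minor_faults() - before;
}
```

Calling fill_and_count_faults twice on the same freshly malloc'ed buffer would be expected to report many faults for the first call and few for the second on a typical Linux setup, though the exact counts depend on the allocator, the page size, and kernel settings.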

Update: @Eljay has already explained this in the comments.
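
If the goal is to compare the two loops fairly, one option is to write to every page once before starting the timer, so that neither timed loop pays for first-touch page faults. A minimal sketch follows, assuming a 4096-byte page size (common, but not guaranteed; the real value should be queried from the OS). touch_pages is a hypothetical helper name, not part of any library.

```cpp
#include <cstddef>
#include <cstdlib>

// Write one byte in every page so the OS backs the whole buffer with
// physical memory before any timing starts.
// Returns the number of pages touched.
// page_size defaults to 4096, which is an assumption, not a guarantee.
std::size_t touch_pages(void* buffer, std::size_t bytes, std::size_t page_size = 4096)
{
    // volatile keeps the compiler from removing the otherwise unused writes.
    volatile unsigned char* p = static_cast<unsigned char*>(buffer);
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < bytes; offset += page_size) {
        p[offset] = 0;
        ++touched;
    }
    return touched;
}
```

In the C++ program from the question this would be called as touch_pages(ptr, sizeof(unsigned int) * N) right after malloc; the two timed loops should then report much closer times.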

Why does the same for loop run faster the second time?
