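I originally set out to compare the performance of D's built-in arrays with raw pointers, but I ran into a different problem along the way. For some reason, if I run two identical for loops one after the other, the second one always finishes faster.
Here is the code: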
import std.stdio : writeln;
import std.datetime : StopWatch;
import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    StopWatch sw;
    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 1;
    sw.stop();
    writeln("the first for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;
    sw.stop();
    writeln("the second for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    free(ptr);
}
Compiled and run with dmd -release -O -noboundscheck -inline test.d -of=test && ./test
it prints:
the first for loop time: 1253 msecs
the second for loop time: 357 msecs
I wasn't sure whether this was specific to D or dmd, so I rewrote the code in C++:
#include <iostream>
#include <chrono>
#include <cstdlib> // malloc, free

int main()
{
    const unsigned int N = 1000000000;
    unsigned int* ptr = (unsigned int*)malloc(sizeof(unsigned int) * N);

    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < N; ++i)
        ptr[i] = 1;
    auto finish = std::chrono::high_resolution_clock::now();
    auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the first for loop time: " << milliseconds.count() << " msecs" << std::endl;

    start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < N; ++i)
        ptr[i] = 2;
    finish = std::chrono::high_resolution_clock::now();
    milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the second for loop time: " << milliseconds.count() << " msecs" << std::endl;

    free(ptr);
}
and g++ -O3 test.cpp -o test && ./test
gives similar output:
the first for loop time: 1029 msecs
the second for loop time: 349 msecs
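The results are the same every time I run this code. The allocated data is too large to fit in cache. There are no branches in the loop body, so branch prediction should not be an issue. Both loops access memory in the same sequential order, so I assume memory layout is not the cause either.
So why does the second loop run faster than the first?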
Answer 0 (score: 2)
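Because uint* ptr = cast(uint*)malloc(uint.sizeof * N);
does not actually allocate the memory up front. On a typical demand-paged OS the pages are only backed by physical memory when the loop first writes to them, so the first loop also pays for that allocation. You can test this: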
import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    foreach (_; 0 .. 100)
        for (uint i = 0; i < N; ++i)
            ptr[N-1] = 1;
    // until this point almost no memory is allocated

    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;

    free(ptr);
}
Update: @Eljay already explained this in the comments.