Question

I used the Google Benchmark framework to test the following code, which measures memory access latency for different array sizes:

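#include <cstddef>
#include <cstdint>

// Reads every element; the volatile local forces each load to actually be performed.
int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id++) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}
// Reads every 4th element, i.e. one 8-byte load every 32 bytes.
int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id += 4) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}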

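I then got the following results (the results are averaged by Google Benchmark; for large arrays there are about 10 iterations, and for smaller arrays many more iterations are performed):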

A lot of different things are happening in this picture and, unfortunately, I cannot explain all of the changes in the graph.

I tested this code on a single-core CPU with the following cache configuration:

CPU Caches:
  L1 Data 32K (x1), 8 way associative
  L1 Instruction 32K (x1), 8 way associative
  L2 Unified 256K (x1), 8 way associative
  L3 Unified 30720K (x1), 20 way associative
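
As a rough sketch, the number of sets in each cache follows from the listing above if one assumes 64-byte lines (the line size stated later in the question) and a plain modulo set mapping. The names kLineBytes, NumSets and CheckListedCaches are hypothetical, and a real L3 may be sliced or hashed, so its figure is only nominal.

```cpp
#include <cassert>
#include <cstddef>

// Assumed line size: the question states 64-byte cache lines.
constexpr std::size_t kLineBytes = 64;

// Number of sets = capacity / (line size * associativity).
constexpr std::size_t NumSets(std::size_t capacity_bytes, std::size_t ways) {
    return capacity_bytes / (kLineBytes * ways);
}

// Checks the geometry implied by the listing above (hypothetical helper).
inline void CheckListedCaches() {
    assert(NumSets(32 * 1024, 8) == 64);         // L1 data
    assert(NumSets(256 * 1024, 8) == 512);       // L2
    assert(NumSets(30720 * 1024, 20) == 24576);  // L3, ignoring slicing/hashing
}
```

Under these assumptions the L1 data cache has 64 sets of 8 lines each, which is the figure that matters when reasoning about which accesses can conflict.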

In this picture we can see many changes in the behavior of the graph:

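There is a spike after an array size of 64 bytes. This can be explained by the fact that a cache line is 64 bytes long, so once the array exceeds 64 bytes we incur one more L1 cache miss (which can be classified as a compulsory cache miss).
The latency also increases near the cache size boundaries, which is easy to explain as well: at that point we start to experience capacity cache misses.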
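
The two explanations above can be sketched as simple arithmetic. This is only an illustration: it assumes the array starts on a cache line boundary (which new[] does not guarantee) and ignores everything else competing for the cache. CacheLinesTouched, FitsInCache and CheckExamples are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Assumed line size, as stated in the question.
constexpr std::size_t kLineBytes = 64;

// Distinct cache lines covered by a line-aligned array of array_bytes bytes.
// Each of these lines costs one compulsory miss on a cold first pass.
constexpr std::size_t CacheLinesTouched(std::size_t array_bytes) {
    return (array_bytes + kLineBytes - 1) / kLineBytes;
}

// True if the whole array could stay resident in a cache of the given capacity.
// Once this is false, repeated passes start to suffer capacity misses.
constexpr bool FitsInCache(std::size_t array_bytes, std::size_t capacity_bytes) {
    return array_bytes <= capacity_bytes;
}

inline void CheckExamples() {
    assert(CacheLinesTouched(64) == 1);   // one line up to 64 bytes
    assert(CacheLinesTouched(128) == 2);  // a second line beyond that
    assert(FitsInCache(32 * 1024, 32 * 1024));
    assert(!FitsInCache(64 * 1024, 32 * 1024));
}
```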

But there are many things about the results that I cannot explain:

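Why does the latency of MemoryAccessEvery4th decrease once the array exceeds about 1024 bytes?
Why do we see another peak for MemoryAccessAllElements at around 512 bytes? Interestingly, at this point we start to access more than one set of cache lines (8 * 64 bytes in one set). But is the peak really caused by this? If not, how can it be explained?
Why does the latency increase after passing the L2 cache size when benchmarking MemoryAccessEvery4th, while there is no such difference for MemoryAccessAllElements?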
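
To make the numbers in these questions concrete, here is a minimal sketch of the usual textbook set mapping and of how many loads each function issues per cache line. It assumes 64-byte lines, 64 L1 sets (32 KiB / (64 B * 8 ways)), a line-aligned array and a plain modulo mapping; real hardware may differ, and SetIndex, LoadsPerLine and CheckExamples are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineBytes = 64;  // assumed line size
constexpr std::size_t kL1Sets = 64;     // 32 KiB / (64 B * 8 ways), from the listed configuration

// Set index of the line containing byte_offset, under a simple modulo mapping.
constexpr std::size_t SetIndex(std::size_t byte_offset) {
    return (byte_offset / kLineBytes) % kL1Sets;
}

// Loads issued per cache line when stepping stride_elements int64_t values at a time.
constexpr std::size_t LoadsPerLine(std::size_t stride_elements) {
    return kLineBytes / (stride_elements * sizeof(std::int64_t));
}

inline void CheckExamples() {
    assert(SetIndex(0) == 0);
    assert(SetIndex(448) == 7);    // last line of a 512-byte array
    assert(SetIndex(4096) == 0);   // wraps around only after 64 lines
    assert(LoadsPerLine(1) == 8);  // MemoryAccessAllElements
    assert(LoadsPerLine(4) == 2);  // MemoryAccessEvery4th
}
```

Under this model consecutive lines fall into consecutive sets, so a contiguous 512-byte array occupies one way in each of 8 different sets rather than filling a single set. A miss on a new line is also amortized over 8 loads in MemoryAccessAllElements but only over 2 loads in MemoryAccessEvery4th.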

I tried to compare my results with Gallery of Processor Cache Effects and What Every Programmer Should Know About Memory, but I cannot fully explain my results with the reasoning from these articles.

Can someone help me understand the internal processes of CPU caching?

UPD: I use the following code to measure the performance of memory access:

#include <benchmark/benchmark.h>
using namespace benchmark;
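// Note: Random is not defined in this snippet; it is presumably a helper class from the repository mentioned below.
void InitializeWithRandomNumbers(long long *array, size_t length) {
    auto random = Random(0);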
    for (size_t id = 0; id < length; id++) {
        array[id] = static_cast<long long>(random.NextLong(0, 1LL << 60));
    }
}
static void MemoryAccessAllElements_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new long long[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    delete[] array;
}
static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
    for (int size = 2; size <= (1 << 24); size *= 2) {
        benchmark->Arg(size);
    }
}
BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
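
Since the benchmark argument is an element count rather than a byte count, it may help to see which array sizes in bytes the loop in CustomizeBenchmark produces. The following is a small sketch that mirrors that loop without depending on Google Benchmark; ArrayBytesPerArg and CheckArgSizes are hypothetical helpers, and the byte values assume an 8-byte long long.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the loop in CustomizeBenchmark: element counts 2, 4, ..., 2^24,
// converted to the number of bytes allocated for each argument.
std::vector<std::size_t> ArrayBytesPerArg() {
    std::vector<std::size_t> bytes;
    for (int size = 2; size <= (1 << 24); size *= 2) {
        bytes.push_back(static_cast<std::size_t>(size) * sizeof(long long));
    }
    return bytes;
}

inline void CheckArgSizes() {
    const std::vector<std::size_t> bytes = ArrayBytesPerArg();
    assert(bytes.size() == 24);  // one entry per power of two from 2^1 to 2^24
}
```

With an 8-byte long long the arrays range from 16 bytes to 128 MiB, so the largest ones are well beyond the 30720K L3 listed above.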

You can find slightly different examples in the repository, but the basic approach of the benchmark in the question is the same.

Understanding CPU caches
