Question

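I am testing how reading multiple streams of data affects CPU cache performance, using the code below as a benchmark. The benchmark reads integers stored sequentially in memory and writes the partial sums back sequentially. The number of sequential blocks being read from is varied, and integers are read from the blocks in round-robin order.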

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>  // malloc, free
using std::vector;
void test_with_split(int num_arrays) {
    int num_values = 100000000;
    // Fix up the number of values. The effect of this should be insignificant.
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;
    // Initialize data to process
    auto results = vector<int>(num_values);         // num_values zero-initialized slots
    auto arrays = vector<vector<int>>(num_arrays);  // num_arrays empty inner vectors
    for (int i = 0; i < num_arrays; ++i) {
        arrays[i].reserve(num_values_per_array);
    }
    // Deal the values 0..num_values-1 out to the arrays in round-robin order
    for (int i = 0; i < num_values; ++i) {
        arrays[i%num_arrays].emplace_back(i);
    }
    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);
    auto start = std::chrono::high_resolution_clock::now();
    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += arrays[i%num_arrays][i/num_arrays];
        results[i] = sum;
    }
    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
int main() {
    int num_arrays = 1;
    while (true) {
        test_with_split(num_arrays++);
    }
}

Here are the timings for splitting 1 to 80 ways on an Intel Core 2 Quad CPU Q9550 @ 2.83GHz:

Time taken when splitting to different number of streams

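The bump in time soon after 8 streams makes sense to me, as the processor has an 8-way associative L1 cache. The 24-way associative L2 cache in turn explains the bump at 24 streams. These explanations hold especially well if I am getting the same effect as in Why is one loop so much slower than two loops?, where multiple big allocations always end up in the same associativity set. For comparison, I have included at the end the timings when the allocation is done in one big block.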
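To make the associativity argument concrete, here is a minimal sketch of how an address maps to a cache set. The geometry used (32 KiB, 8-way, 64-byte lines) is an assumption chosen for illustration, and the names `CacheGeometry`, `num_sets`, `set_index`, and `kAssumedL1` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of a set-associative cache. The values used
// below are assumptions for illustration, not measurements.
struct CacheGeometry {
    std::size_t size_bytes;  // total capacity
    std::size_t ways;        // associativity
    std::size_t line_bytes;  // cache line size
};

// Number of sets = capacity / (ways * line size).
inline std::size_t num_sets(const CacheGeometry& g) {
    assert(g.ways > 0 && g.line_bytes > 0);
    return g.size_bytes / (g.ways * g.line_bytes);
}

// Set selected by an address: (address / line size) mod number of sets.
inline std::size_t set_index(const CacheGeometry& g, std::uintptr_t addr) {
    return static_cast<std::size_t>(addr / g.line_bytes) % num_sets(g);
}

// Assumed L1 data cache: 32 KiB, 8-way, 64-byte lines.
const CacheGeometry kAssumedL1{32 * 1024, 8, 64};
```

Under these assumed parameters there are 64 sets, so two addresses that are a multiple of 4096 bytes apart select the same set. If more than 8 streams are aligned that way, they compete for the 8 ways of a single set.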

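However, I do not fully understand the bump from one stream to two. My own guess is that it has something to do with prefetching into the L1 cache. According to the Intel 64 and IA-32 Architectures Optimization Reference Manual, the L2 streaming prefetcher can track up to 32 streams of data, but no such figure is given for the L1 streaming prefetcher. Is the L1 prefetcher unable to keep track of multiple streams, or is something else at play here?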

Background

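I am investigating this because I want to understand how organizing the entities in a game engine as components in a structure-of-arrays style affects performance. So far it seems that, on modern CPUs, it does not matter much whether the data a transformation needs lives in two components or in 8-10 components. However, the test above suggests that it may sometimes make sense to avoid splitting some data into multiple components if that lets a "bottleneck" transformation use only one component, even if it means that some other transformation has to read data it is not interested in.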

Allocating in one block

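Here are the timings when, instead of allocating multiple blocks of data, only one block is allocated and accessed in a strided manner. This does not change the bump from one stream to two, but I have included it for the sake of completeness.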

Timings when only one big block is allocated

Here is the modified code:

void test_with_split(int num_arrays) {
    int num_values = 100000000;
    num_values -= (num_values % num_arrays);
    int num_values_per_array = num_values / num_arrays;

    // Initialize data to process
    auto results = vector<int>(num_values);  // num_values zero-initialized slots
    auto array = vector<int>(num_values);
    for (int i = 0; i < num_values; ++i) {
        array[i] = i;
    }

    // Try to clear the cache
    const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
    char *c = (char *)malloc(size);
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < size; j++)
            c[j] = i*j;
    free(c);

    auto start = std::chrono::high_resolution_clock::now();
    // Do the processing
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += array[(i%num_arrays)*num_values_per_array+i/num_arrays];
        results[i] = sum;
    }
    std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}

Edit 1

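I made sure that the difference between 1 and 2 splits is not caused by the compiler unrolling the loop and optimizing the first iteration differently. Using __attribute__ ((noinline)), I made sure the work function is not inlined into the main function, and I verified that by looking at the generated assembly. The timings after these changes were the same.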
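A minimal sketch of what that check might look like, assuming GCC or Clang (the attribute is compiler-specific). The function name `sum_round_robin` is hypothetical and stands in for the work loop of the benchmark.

```cpp
#include <cassert>
#include <vector>

// GCC/Clang-specific attribute: asks the compiler not to inline this
// function, so the loop is compiled once with num_arrays as a runtime
// value rather than being specialized at a particular call site.
__attribute__((noinline))
int sum_round_robin(const std::vector<std::vector<int>>& arrays, int num_values) {
    assert(!arrays.empty());
    const int num_arrays = static_cast<int>(arrays.size());
    int sum = 0;
    for (int i = 0; i < num_values; ++i) {
        // Round-robin read: stream i % num_arrays, element i / num_arrays.
        sum += arrays[i % num_arrays][i / num_arrays];
    }
    return sum;
}
```

Compiling with `-S` and reading the resulting assembly is one way to confirm that the function is still emitted as a separate symbol and called from main.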

Answer 1

To answer the main part of your question: is the L1 prefetcher able to keep track of multiple streams?

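No. That is because the L1 cache does not have a prefetcher at all. The L1 cache is not large enough to risk speculatively fetching data that may never be used. Doing so would cause too many evictions and hurt any software that does not read data in the specific patterns suited to that particular L1 prediction scheme. Instead, L1 caches data that has been explicitly read or written. The L1 cache only helps when you are writing data and re-reading data that was accessed recently.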

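The L1 cache implementation is not the reason for the bump in your profile from 1X to 2X array depth. For streaming reads like the ones you have set up, the L1 cache is largely a non-factor in performance. Most of your reads come directly from the L2 cache. In your first example, which uses nested vectors, some number of reads are probably served from L1 (see below).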

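My guess is that your bump from 1X to 2X has a lot to do with the algorithm and how the compiler optimizes it. If the compiler knows that num_arrays is a constant equal to 1, it will automatically eliminate a lot of per-iteration overhead for you.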
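As a rough illustration of the overhead in question, the sketch below contrasts the general loop with what it reduces to when num_arrays is known to be 1. The function names are hypothetical, and whether a given compiler actually performs this specialization depends on inlining and optimization settings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// General loop: every iteration pays for an integer modulo and a division.
// `flat` holds num_arrays equally sized blocks laid out back to back.
std::vector<int> partial_sums_general(const std::vector<int>& flat, int num_arrays) {
    assert(num_arrays > 0);
    assert(flat.size() % static_cast<std::size_t>(num_arrays) == 0);
    const int n = static_cast<int>(flat.size());
    const int per_array = n / num_arrays;
    std::vector<int> results(flat.size());
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += flat[(i % num_arrays) * per_array + i / num_arrays];
        results[i] = sum;
    }
    return results;
}

// What the loop reduces to when num_arrays is known to be 1:
// i % 1 == 0 and i / 1 == i, so the index is simply i.
std::vector<int> partial_sums_single(const std::vector<int>& flat) {
    const int n = static_cast<int>(flat.size());
    std::vector<int> results(flat.size());
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += flat[i];
        results[i] = sum;
    }
    return results;
}
```

Both functions produce the same output for a single stream. The second simply has no division or modulo left in the loop body.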

Now for the second part, regarding why the second version is faster:

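The second version is faster not because of how the data is arranged in physical memory, but because of the change in under-the-hood logic that a nested std::vector<std::vector<int>> type implies.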

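In the nested (first) case, the compiled code performs the following steps:

1. Access the top-level std::vector class. This class contains a pointer to the start of the data array.
2. That pointer value must be loaded from memory.
3. Add the current loop offset [i%num_arrays] to that pointer.
4. Access the nested std::vector class data. (likely an L1 cache hit)
5. Load the pointer to the start of that vector's data array. (likely an L1 cache hit)
6. Add the loop offset [i/num_arrays].
7. Read the data. Finally!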

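(Note that the chance of getting L1 cache hits in steps #4 and #5 drops considerably, because evictions are likely to occur before the next iteration of the loop.)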

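The second version, by comparison:

1. Access the top-level std::vector class.
2. Load the pointer to the start of the vector's data array.
3. Add the offset [(i%num_arrays)*num_values_per_array+i/num_arrays].
4. Read the data!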

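A whole set of under-the-hood steps is removed. The offset calculation is slightly longer, since it needs an extra multiplication by num_values_per_array, but the steps that are saved more than make up for it.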
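The two access paths can be spelled out with explicit pointers, as in the sketch below. The helper names `read_nested` and `read_flat` are hypothetical, and an optimizing compiler may hoist some of these loads out of the loop, so this shows the logical steps rather than the exact generated code.

```cpp
#include <cassert>
#include <vector>

// Nested layout: two dependent pointer loads before the value is read.
int read_nested(const std::vector<std::vector<int>>& arrays, int num_arrays, int i) {
    assert(num_arrays > 0);
    const std::vector<int>* outer = arrays.data();          // load the outer data pointer
    const std::vector<int>& inner = outer[i % num_arrays];  // locate the inner vector object
    const int* inner_data = inner.data();                   // load the inner data pointer
    return inner_data[i / num_arrays];                      // read the value
}

// Flat layout: one pointer load, then pure index arithmetic.
int read_flat(const std::vector<int>& array, int num_arrays,
              int num_values_per_array, int i) {
    assert(num_arrays > 0);
    const int* data = array.data();                         // load the data pointer
    return data[(i % num_arrays) * num_values_per_array + i / num_arrays];
}
```

In the nested case the address of the value depends on a pointer that must itself be fetched from memory first, whereas the flat case needs only one extra multiplication.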

Why is processing multiple streams of data slower than processing one?
