Question

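I have a matrix A of size (m * l * 4), where m is around 100,000 and l = 100. The size of the list is always equal to n, with n <= m. I want to do matrix addition over a given list of indices. I wrote the function below, and I have to call it many times.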

void MatrixAddition(int l, int n, vector<int>& list, int ***A, int ***C, int cluster)
{
    for (int i = 0; i < l; i++)
    {
        for (int j = 0; j < 4; j++)
            C[cluster][i][j] = 0;
    }

    for (int i = 0; i < l; i++)
    {
        for (int j = 0; j < n; ++j)
        {
            for (int k = 0; k < 4; k++)
                C[cluster][i][k] += A[list[j]][i][k];
        }
    }
}

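I used gprof to measure the time spent in each function of the whole program, and found that MatrixAddition accounts for about 60% of the run time. Is there an alternative way to write this function so that the run time goes down?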

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
52.00      7.85     7.85       20   392.60   405.49  MatrixAddition(int, int, std::vector<int, std::allocator<int> >&, int***, int***, int)

Answer 1

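Swap the loop over i and the loop over j in the second part, so that j is the outer loop. This makes the function more cache-friendly, because each A[list[j]] is then read from start to finish before moving on to the next one.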

for(int j=0;j<n;++j)
{
    for (int i = 0; i < l; i++)
    {
        for(int k=0;k<4;k++)
            C[cluster][i][k]+=A[list[j]][i][k];
    }
}

Also, I hope you didn't forget the -O3 flag.
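
For reference, here is a minimal sketch of what the whole function might look like with the loops interchanged. The name MatrixAdditionSwapped, the const on list, and the local pointer Aj are illustrative choices, not part of the original code; the pointer-to-pointer layout from the question is kept as-is.

```cpp
#include <vector>

// Sketch only: same arguments as the function in the question, with the
// j loop moved outermost. MatrixAdditionSwapped and Aj are hypothetical names.
void MatrixAdditionSwapped(int l, int n, const std::vector<int>& list,
                           int ***A, int ***C, int cluster)
{
    // Zero the l x 4 accumulator for this cluster.
    for (int i = 0; i < l; i++)
        for (int k = 0; k < 4; k++)
            C[cluster][i][k] = 0;

    // j outermost: each selected matrix A[list[j]] is walked row by row
    // exactly once, instead of jumping between n matrices for every row i.
    for (int j = 0; j < n; ++j)
    {
        int **Aj = A[list[j]];  // look up the selected matrix once per j
        for (int i = 0; i < l; i++)
            for (int k = 0; k < 4; k++)
                C[cluster][i][k] += Aj[i][k];
    }
}
```

The result is the same as with the original loop order, since integer addition does not depend on the order of the terms; only the memory access pattern changes.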

Answer 2

(Update: an earlier version of this answer had the indexing wrong. This version auto-vectorizes fairly easily.)

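Use a C multidimensional array (rather than an array of pointers to pointers), or a flat 1D array indexed with i*cols + j, so that the memory is contiguous. This makes a huge difference to how effectively hardware prefetching can make full use of memory bandwidth. Loads whose address comes from another load are really bad for performance. Conversely, predictable addresses that are known well ahead of time help a lot, because the loads can start long before their results are needed (thanks to out-of-order execution).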
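
As a rough illustration of the flat layout, the sketch below stores a stack of matrices in a single std::vector<int> and computes offsets by hand. The FlatMatrices type and its member names are hypothetical and exist only for this example; the point is the offset formula, which extends i*cols + j to three dimensions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: `count` matrices of size rows x cols in one
// contiguous allocation, instead of an int*** with one allocation per row.
struct FlatMatrices {
    std::size_t count, rows, cols;
    std::vector<int> data;

    FlatMatrices(std::size_t count_, std::size_t rows_, std::size_t cols_)
        : count(count_), rows(rows_), cols(cols_),
          data(count_ * rows_ * cols_, 0) {}

    // Element (m, i, k) lives at offset (m*rows + i)*cols + k.
    int& at(std::size_t m, std::size_t i, std::size_t k) {
        return data[(m * rows + i) * cols + k];
    }

    // Pointer to the first element of matrix m; its rows*cols ints are adjacent,
    // so a loop over them touches consecutive addresses.
    const int* matrix(std::size_t m) const {
        return data.data() + m * rows * cols;
    }
};
```

With the sizes from the question (m around 100,000, l = 100, 4 columns) this would be one allocation of roughly 40 million ints, and every address is computed from indices rather than loaded from another pointer.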

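Also, @user31264's answer is correct: you need to interchange the loops so that the loop over j is the outermost one. That's good, but on its own it isn't enough.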

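Contiguous memory will also allow the compiler to auto-vectorize. In practice, I had a surprisingly hard time getting gcc to auto-vectorize this. (But that was probably because I had the indexing wrong, since I had only glanced at the code the first time, so the compiler didn't know we were looping over contiguous memory.)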

I played around with it on the Godbolt compiler explorer.

I finally got good compiler output from this version, which takes A and C as flat 1D arrays and does the indexing itself:

void MatrixAddition_contiguous(int rows, int n, const  vector<int>& list,
                               const int *__restrict__ A, int *__restrict__ C, int cluster)
  // still auto-vectorizes without __restrict__, but checks for overlap and
  // runs a scalar loop in that case
{
  const int cols = 4;  // or global constexpr or something
  int *__restrict__ Ccluster = C + ((long)cluster) * rows * cols;

  for(int i=0;i<rows;i++)
    //#pragma omp simd  
    for(int k=0;k<4;k++)
      Ccluster[cols*i + k]=0;

  for(int j=0;j < n;++j) { // loop over the n matrices of A selected by list in the outer-most loop
    const int *__restrict__ Alistj = A + ((long)list[j]) * rows * cols;
    // #pragma omp simd    // Doesn't work: only auto-vectorizes with -O3
    // probably only -O3 lets gcc see through the k=0..3 loop and treat it like one big loop
    for (int i = 0; i < rows; i++) {
      long row_offset = cols*i;
      //#pragma omp simd  // forces vectorization with 16B vectors, so it hurts AVX2
      for(int k=0;k<4;k++)
        Ccluster[row_offset + k] += Alistj[row_offset + k];
    }
  }
}

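Manually hoisting list[j] definitely helped the compiler realize that stores into C can't affect the indices that will be loaded from list[j]. Manually hoisting the other things probably wasn't necessary.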

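Hoisting A[list[j]], rather than just list[j], is an artifact of a previous approach where I had the indexing wrong. As long as we hoist the load from list[j] as far out as possible, the compiler can do a good job even if it doesn't know that list doesn't overlap C.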
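
To show how a function of this shape might be called, here is a small self-contained sketch. It is a condensed restatement, not the exact code above: MatrixAdditionFlat and SumSelected are hypothetical names, the i and k loops are merged into one loop over rows*cols elements (an assumption that relies on each matrix being contiguous), and std::ptrdiff_t replaces the long cast. __restrict__ is a GCC/Clang extension.

```cpp
#include <cstddef>
#include <vector>

// Condensed sketch of the flat-array version. The outer loop runs over the
// n entries of list; each selected matrix is hoisted into a local pointer.
void MatrixAdditionFlat(int rows, int n, const std::vector<int>& list,
                        const int *__restrict__ A, int *__restrict__ C, int cluster)
{
    const int cols = 4;
    const int elems = rows * cols;  // ints per matrix
    int *__restrict__ Ccluster = C + static_cast<std::ptrdiff_t>(cluster) * elems;

    for (int x = 0; x < elems; x++)
        Ccluster[x] = 0;

    for (int j = 0; j < n; ++j) {
        // Hoist the load of list[j] so the inner loop only sees two plain pointers.
        const int *__restrict__ Alistj = A + static_cast<std::ptrdiff_t>(list[j]) * elems;
        for (int x = 0; x < elems; x++)
            Ccluster[x] += Alistj[x];
    }
}

// Hypothetical driver: A holds consecutive rows x 4 matrices; the result is
// the element-wise sum of the matrices named in list.
std::vector<int> SumSelected(const std::vector<int>& A, int rows,
                             const std::vector<int>& list)
{
    std::vector<int> C(static_cast<std::size_t>(rows) * 4);
    MatrixAdditionFlat(rows, static_cast<int>(list.size()), list,
                       A.data(), C.data(), /*cluster=*/0);
    return C;
}
```

For example, SumSelected(A, 100, {3, 17, 42}) would return the 100 x 4 sum of matrices 3, 17 and 42, assuming A holds at least 43 matrices.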

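In the inner loop that gcc 5.3 generates when targeting x86-64 (with -O3 -Wall -march=haswell -fopenmp -fverbose-asm), it does eight additions at once with AVX2 vpaddd, using unaligned loads and unaligned stores back into C.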

Since this is auto-vectorized, it should also produce good code for ARM NEON, PPC AltiVec, or anything else that can do packed 32-bit additions.

I couldn't get gcc to tell me anything with -ftree-vectorizer-verbose=2, but clang's -Rpass-analysis=loop-vectorize was slightly helpful.
