Question

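Transposing a global 2D square matrix/array of size 1 GB with a tiled (cache-aware) approach shows no performance gain over the normal transpose method in single-threaded execution. I am not asking about speeding up the transpose with AVX, SSE (SIMD), or a cache-oblivious transpose algorithm (http://supertech.csail.mit.edu/papers/FrigoLePr12.pdf).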

#include <stdio.h>
#include <sys/time.h>
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];

void testNormalTranspose() {
    int i, j;
    b[0][9999] = 1.0;  /* marker element, read back in main() as a[9999][0] */
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            a[i][j] = b[j][i];
}

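void testTiledTranspose() {
    int i, j;
    b[0][9999] = 1.0;  /* marker element, read back in main() as a[9999][0] */
    int blocksize = 16;  /* SIZE must be a multiple of blocksize */
    for (i = 0; i < SIZE; i += blocksize) {
        for (j = 0; j < SIZE; j += blocksize) {
            /* Transpose one blocksize x blocksize tile. */
            for (int ii = i; ii < i + blocksize; ++ii) {
                for (int jj = j; jj < j + blocksize; ++jj) {
                    a[ii][jj] = b[jj][ii];
                }
            }
        }
    }
}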

int main()
{
    struct timeval t1, t2;
    /*
      gettimeofday(&t1, NULL);
      testNormalTranspose();
      gettimeofday(&t2, NULL);
      printf("Time for the Normal transpose  is %ld milliseconds\n",
             (t2.tv_sec - t1.tv_sec)*1000 + 
             (t2.tv_usec - t1.tv_usec) / 1000);
    */
      gettimeofday(&t1, NULL);
      testTiledTranspose();
      gettimeofday(&t2, NULL);
      printf("Time for the Tiled transpose  is %ld milliseconds\n",
             (t2.tv_sec - t1.tv_sec)*1000 + 
             (t2.tv_usec - t1.tv_usec) / 1000);
      printf("%f\n", a[9999][0]);
}

Answer 1

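Loop tiling helps when data is reused. If you use an element SIZE times, you had better use it SIZE times in a row and only then move on to the next element.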

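Unfortunately, when transposing a 2D matrix you do not reuse any element of either a or b. What is more, because the loop mixes row and column access (i.e. a[i][j] = b[j][i]), you will never get unit-stride memory access on both a and b at the same time, only on one of them.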

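So in this case tiling is not that efficient, but you might still get some performance improvement even with "random" memory access if:

the element you are accessing now is on the same cache line as an element you accessed previously, AND
that cache line is still available.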

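So, to see any improvement, the memory footprint of this "random" access must fit into your system's cache. Basically, this means you have to choose the blocksize carefully: the 16 you chose in the example might work better on one system and worse on another.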

Here are the results from my computer for power-of-two block sizes and SIZE 4096:

---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
transpose_2d              32052765 ns   32051761 ns         21
tiled_transpose_2d/2      22246701 ns   22245867 ns         31
tiled_transpose_2d/4      16912984 ns   16912487 ns         41
tiled_transpose_2d/8      16284471 ns   16283974 ns         43
tiled_transpose_2d/16     16604652 ns   16604149 ns         42
tiled_transpose_2d/32     23661431 ns   23660226 ns         29
tiled_transpose_2d/64     32260575 ns   32259564 ns         22
tiled_transpose_2d/128    32107778 ns   32106793 ns         22
fixed_tile_transpose_2d   16735583 ns   16729876 ns         41

As you can see, the version with blocksize 8 works best for me and almost doubles the performance.

Here are the results for SIZE 4131 and power-of-three block sizes:

---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
transpose_2d              29875351 ns   29874381 ns         23
tiled_transpose_2d/3      30077471 ns   30076517 ns         23
tiled_transpose_2d/9      20420423 ns   20419499 ns         35
tiled_transpose_2d/27     13470242 ns   13468992 ns         51
tiled_transpose_2d/81     11318953 ns   11318646 ns         61
tiled_transpose_2d/243    10229250 ns   10228884 ns         65
fixed_tile_transpose_2d   10217339 ns   10217066 ns         67

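Regarding the SIZE 16384 issue: I cannot reproduce it, i.e. I still see the same gain for the big matrix. Note that 16384 * 16384 * sizeof(float) is 1 GiB per array (with a 4-byte float), so a and b together need 2 GiB, which might expose some system issues...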

Cannot get a performance gain using loop tiling for transposing a large 2D matrix
