Question

The code without fission looks like this:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[hash(keys[i])];
    }
    return ret;
}

With fission:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    // tmp is assumed to be a scratch buffer with room for n entries, allocated elsewhere
    for(int i = 0; i < n; ++i){
        tmp[i] = map[hash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

Notes:

The bottleneck is map[hash(keys[i])], which accesses memory randomly.
Normally this would be written as if(tmp[i]) res[ret++] = i; to avoid the if, I am using ret += tmp[i].
map[..] is always 0 or 1.
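
The second note can be illustrated with a minimal sketch. The function names compact_branchy and compact_branchless and the flags parameter are hypothetical, introduced here only for illustration; flags stands in for tmp or map[..]. Assuming every flag is exactly 0 or 1, both forms should produce the same count and the same first ret entries of res:

```c
#include <assert.h>

/* Branchy form: store the index only when the flag is set. */
int compact_branchy(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        if (flags[i]) res[ret++] = i;
    }
    return ret;
}

/* Branchless form: always store, then advance by the flag.
 * Assumes each flag is exactly 0 or 1. A rejected index is simply
 * overwritten by the next store. res needs room for n entries,
 * because res[ret] is written even when the flag is 0. */
int compact_branchless(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        assert(flags[i] == 0 || flags[i] == 1);
        res[ret] = i;
        ret += flags[i];
    }
    return ret;
}
```

The branchless form trades a possibly mispredicted branch for an unconditional store whose address depends on ret.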

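The fission version is usually significantly faster, and I am trying to explain why. My best guess is that ret += map[..] still introduces some dependency, and that this prevents speculative execution.

I would like to hear if anyone has a better explanation.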

Answer 1

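From my tests, I get roughly a 2x speed difference between the fused and split loops. This speed difference is very consistent no matter how I tweak the loop.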

Fused: 1.096258 seconds
Split: 0.562272 seconds

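(See the bottom for the full test code.)

Although I'm not 100% sure, I suspect this is due to a combination of two things:

Saturation of the load-store buffer used for memory disambiguation, caused by the cache misses from map[gethash(keys[i])].
An added dependency in the fused loop version.

It is clear that map[gethash(keys[i])] will result in a cache miss nearly every time. In fact, that is probably enough to saturate the entire load-store buffer.

Now let's take a look at the added dependency. The issue is the ret variable: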

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}

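The ret variable is needed to resolve the address of the store res[ret] = i;.

In the fused loop, ret comes from a near-certain cache miss.

In the split loop, ret comes from tmp[i], which is much faster.

This delay in address resolution in the fused loop case likely causes the res[ret] = i store to clog up the load-store buffer along with map[gethash(keys[i])].

Since the load-store buffer has a fixed size, but you have twice as much junk in it,
you can only overlap the cache misses half as much as before. Hence the 2x slowdown.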
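
A rough back-of-the-envelope version of this argument, offered only as a sketch. The symbols are assumptions introduced here and are not measured quantities: B is the number of load-store buffer entries, L is the cache-miss latency, and N is the number of iterations. Suppose each split-loop iteration occupies one entry while its miss is outstanding, and each fused-loop iteration occupies two (the load plus the store waiting for its address). Then roughly:

```latex
T_{\text{split}} \approx \frac{N L}{B}, \qquad
T_{\text{fused}} \approx \frac{N L}{B/2} = \frac{2 N L}{B}, \qquad
\frac{T_{\text{fused}}}{T_{\text{split}}} \approx 2
```

This ignores the second pass of the split version and every other limit in the pipeline, so it is only an illustration that happens to be consistent with the measured 1.096 s versus 0.562 s.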

Suppose we change the fused loop to this:

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[i] = i;    //  Index changed from "ret" to "i"
        ret += map[gethash(keys[i])];
    }
    return ret;
}

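This breaks the address resolution dependency.

(Note that it is no longer equivalent to the original; this is only to demonstrate the performance difference.)

Then we get similar timings: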

Fused: 0.487477 seconds
Split: 0.574585 seconds

Here is the complete test code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>    //  omp_get_wtime()

#define SIZE 67108864

unsigned gethash(int key){
    return key & (SIZE - 1);
}

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}

int check_split(int * res, char * map, int n, int * keys, int *tmp){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[gethash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

int main()
{
    char *map = (char*)calloc(SIZE,sizeof(char));
    int *keys = (int*)calloc(SIZE,sizeof(int));
    int *res  = (int*)calloc(SIZE,sizeof(int));
    int *tmp  = (int*)calloc(SIZE,sizeof(int));
    if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
        printf("Memory allocation failed.\n");
        system("pause");
        return 1;
    }

    //  Generate Random Data
    for (int i = 0; i < SIZE; i++){
        keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
    }

    printf("Start...\n");

    double start = omp_get_wtime();
    int ret;

    //  Time one variant at a time; swap the comment to time the other.
    ret = check_fused(res,map,SIZE,keys);
//    ret = check_split(res,map,SIZE,keys,tmp);

    double end = omp_get_wtime();

    printf("ret = %d",ret);
    printf("\n\nseconds = %f\n",end - start);

    system("pause");
}

Answer 2

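I don't think it is the array indexing; rather, the call to the function hash() may cause a pipeline stall and prevent optimal instruction reordering.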
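
One way to probe this hypothesis is to remove the function call from the loop and time the fused loop again. The sketch below is only an illustration: TABLE_SIZE, hash_inline, and check_fused_inline are hypothetical names, and the mask-style hash is a stand-in, since the question does not show the real hash(). Note that static inline is only a hint, and a compiler may already inline a small hash on its own.

```c
/* Hypothetical power-of-two table size, chosen only for this sketch. */
#define TABLE_SIZE 1024u

/* Stand-in hash. `static inline` allows the compiler to expand it
 * directly in the loop body instead of emitting a call. */
static inline unsigned hash_inline(int key) {
    return (unsigned)key & (TABLE_SIZE - 1u);
}

/* Fused loop, identical in structure to the question's version,
 * but using the inlinable hash. map must hold TABLE_SIZE entries,
 * each 0 or 1; res must have room for n entries. */
int check_fused_inline(int *res, const char *map, int n, const int *keys) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += map[hash_inline(keys[i])];
    }
    return ret;
}
```

If the fused loop's timing does not change once the call is gone, the call itself is unlikely to be the main cost.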

Why does loop fission make sense in this case?
