Question

Consider the following matrix multiplication code:

#define BLOCKING 64
void mat_mult_ijk_blocked(
              const int m,
              const int n,
              const int p,
              real a[restrict m][p],
              const real b[m][n],
              const real c[n][p]) {
  for(int i=0; i<m; i++) {
    for(int k=0; k<p; k++) {
      a[i][k] = 0.0;
    }
  }
#pragma omp parallel for
  for(int block_i=0; block_i<m; block_i += BLOCKING) {
    for(int block_j=0; block_j<n; block_j += BLOCKING) {
      for(int i=block_i; i<min(block_i + BLOCKING, m); i++) {
        for(int j=block_j; j<min(block_j + BLOCKING, n); j++) {
          real w = b[i][j];
          for(int k=0; k<p; k++) {
            a[i][k] += w * c[j][k];
          }
        }
      }
    }
  }
}
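
Before comparing timings across compilers, it can help to confirm that every build computes the same product. The following is only a minimal sketch and not part of the original code: it assumes `real` is `double` (the typedef is not shown in the question), and `mat_mult_reference` and `max_abs_diff` are hypothetical helper names introduced for illustration.

```c
#include <assert.h>

typedef double real; /* assumption: the question does not show how `real` is defined */

/* Naive reference product a = b * c, with no blocking and no OpenMP.
   b is m x n, c is n x p, a is m x p. */
void mat_mult_reference(const int m, const int n, const int p,
                        real a[restrict m][p],
                        const real b[m][n],
                        const real c[n][p]) {
  assert(m > 0 && n > 0 && p > 0);
  for (int i = 0; i < m; i++) {
    for (int k = 0; k < p; k++) {
      real sum = 0.0;
      for (int j = 0; j < n; j++) {
        sum += b[i][j] * c[j][k];
      }
      a[i][k] = sum;
    }
  }
}

/* Largest absolute element-wise difference between two m x p matrices. */
real max_abs_diff(const int m, const int p,
                  real x[m][p], real y[m][p]) {
  real worst = 0.0;
  for (int i = 0; i < m; i++) {
    for (int k = 0; k < p; k++) {
      real d = x[i][k] - y[i][k];
      if (d < 0.0) d = -d;       /* absolute value without <math.h> */
      if (d > worst) worst = d;
    }
  }
  return worst;
}
```

One possible use is to run `mat_mult_ijk_blocked` and `mat_mult_reference` on the same small input and check that `max_abs_diff` stays below a small tolerance. A tolerance is safer than exact equality, since different compilers may vectorize or contract the floating-point operations differently.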

For a 1800×2200×1400 multiplication:

Using clang without -fopenmp, the code takes 3.2 seconds on my machine.
With -fopenmp and
- OMP_NUM_THREADS = 4 1.6s
- OMP_NUM_THREADS = 3 2.1s
- OMP_NUM_THREADS = 2 1.6s
- OMP_NUM_THREADS = 1 2.6s

This machine has two hyperthreaded cores, so this looks reasonable.

Now using gcc-5.4.0:

Without -fopenmp 2.8s
With -fopenmp
- OMP_NUM_THREADS = 4 5.1s
- OMP_NUM_THREADS = 3 4.6s
- OMP_NUM_THREADS = 2 4.9s
- OMP_NUM_THREADS = 1 10s

What am I doing wrong that makes gcc perform so badly!? With OpenMP, even a single thread performs far worse than the original single-threaded code.
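
To make the comparison concrete, the timings can be expressed as speedup relative to the build without -fopenmp. The helpers below are a small illustrative sketch; `speedup` and `efficiency` are hypothetical names, not anything from the original code.

```c
#include <assert.h>

/* Speedup of an OpenMP run relative to the build without -fopenmp.
   Values below 1.0 mean the OpenMP build is slower. */
double speedup(double serial_seconds, double parallel_seconds) {
  assert(parallel_seconds > 0.0);
  return serial_seconds / parallel_seconds;
}

/* Parallel efficiency: speedup divided by the number of threads used. */
double efficiency(double serial_seconds, double parallel_seconds, int threads) {
  assert(threads > 0);
  return speedup(serial_seconds, parallel_seconds) / threads;
}
```

With the numbers above, clang at two threads gives 3.2 s / 1.6 s = 2, while gcc-5.4.0 at one thread gives 2.8 s / 10 s = 0.28, meaning the OpenMP build is slower than the plain one even before any threading is involved.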

Update - using gcc-6.2

Without -fopenmp 3.s
With -fopenmp
- OMP_NUM_THREADS = 4 2.23s
- OMP_NUM_THREADS = 3 1.89s
- OMP_NUM_THREADS = 2 1.61s
- OMP_NUM_THREADS = 1 2.9s

Evidently, the results are now similar to those from LLVM. What is the reason for this?

Unexplained LLVM vs. GCC OpenMP difference
