Question

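I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function is an example of something similar to what I am doing in a more complex piece of code; in it I tried to replicate how writing to a matrix significantly slows down the for loop. I am trying to figure out why it does so, and whether there is any solution.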

#include <vector>
#include <cmath>
#include <cstdint>  // uint64_t
#include <omp.h>

// Count the number of bits
inline int bitCount(uint64_t n){

  int count = 0;

  while(n){

    n &= (n-1);
    count++;

  }

  return count;

}


void test(){

  int nthreads = omp_get_max_threads();
  omp_set_dynamic(0);
  omp_set_num_threads(nthreads);

  // I need a priority queue per thread
  std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
  std::vector<uint64_t> vals(100,1);

  # pragma omp parallel for shared(mat,vals)
  for(int i = 0; i < 100000000; i++){
    std::vector<double> &tid_vec = mat[omp_get_thread_num()];
    int total_count = 0;
    for(unsigned int j = 0; j < vals.size(); j++){
      total_count += bitCount(vals[j]);
      tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
    }
  }

}

This code runs in about 11 seconds. If I comment out the following line:

tid_vec[j] = total_count;

the code runs in about 2 seconds. Is there a reason why writing to a matrix costs so much in performance in my case?

Answer 1

Since you gave no details about your compiler or system, I am assuming you are compiling with GCC and the flags -O2 -fopenmp.

If you comment out the line:

tid_vec[j] = total_count;

the compiler will optimize away all the computations whose results are not used. Therefore:

  total_count += bitCount(vals[j]);

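is optimized away as well. If the main kernel of your application is never executed, it makes sense that the program runs much faster.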
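To time the bit-counting kernel on its own, without the stores, its result has to stay observable; otherwise the optimizer is free to drop it. The following is only a minimal sketch, assuming that folding the running totals into a returned checksum is acceptable for the benchmark. The names countOnly, checksum and iterations are hypothetical and do not come from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same bit-clearing population count as in the question.
inline int bitCount(std::uint64_t n) {
  int count = 0;
  while (n) {
    n &= (n - 1);  // clear the lowest set bit
    count++;
  }
  return count;
}

// Hypothetical benchmark helper: runs the kernel without storing into a
// matrix, but adds every running total to a checksum that is returned,
// so the computation cannot be discarded as unused.
long long countOnly(const std::vector<std::uint64_t>& vals, int iterations) {
  long long checksum = 0;
  #pragma omp parallel for reduction(+:checksum)
  for (int i = 0; i < iterations; i++) {
    int total_count = 0;
    for (std::size_t j = 0; j < vals.size(); j++) {
      total_count += bitCount(vals[j]);
      checksum += total_count;
    }
  }
  return checksum;
}
```

The caller should print or otherwise use the returned value. Even then the compiler may simplify the loop, because every outer iteration does identical work, so the timing is best read as a lower bound rather than an exact cost of the kernel.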

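On another note, I would not implement a bit count function myself but rather rely on functionality that is already provided. For example, the GCC builtin functions include __builtin_popcount, which does what you are trying to do; since __builtin_popcount takes an unsigned int, the variant for a 64-bit argument is __builtin_popcountll.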
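As a rough illustration of that suggestion, assuming GCC or Clang, the hand-written loop could be replaced as follows. The wrapper names bitCountBuiltin and bitCountPortable are hypothetical; the second one uses std::bitset as a standard-library fallback for other compilers.

```cpp
#include <bitset>
#include <cstdint>

// GCC/Clang builtin. The "ll" variant takes an unsigned long long, so all
// 64 bits are counted; plain __builtin_popcount takes an unsigned int.
inline int bitCountBuiltin(std::uint64_t n) {
  return __builtin_popcountll(n);
}

// Portable alternative based on the standard library.
inline int bitCountPortable(std::uint64_t n) {
  return static_cast<int>(std::bitset<64>(n).count());
}
```

Either wrapper can be called in place of bitCount(vals[j]) in the inner loop.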

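As a bonus: it is better to work on private data than to work on a common array where each thread uses different elements. It improves locality (especially important when memory access is not uniform, also known as NUMA) and may reduce access contention.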

# pragma omp parallel shared(mat,vals)
{
std::vector<double> local_vec(1000,-INFINITY);
#pragma omp for
for(int i = 0; i < 100000000; i++) {
  int total_count = 0;
  for(unsigned int j = 0; j < vals.size(); j++){
    total_count += bitCount(vals[j]);
    local_vec[j] = total_count;
  }
}
// Copy local_vec into mat[omp_get_thread_num()]
}
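The copy step left as a comment above could look roughly like this. It is a sketch under a few assumptions: vals holds at most 1000 entries, the parallel region uses no more threads than omp_get_max_threads() reports, and the helpers maxThreads, threadId and runPrivate are hypothetical names added so that the example also builds without -fopenmp.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

inline int bitCount(std::uint64_t n) {
  int count = 0;
  while (n) { n &= (n - 1); count++; }  // clear the lowest set bit each pass
  return count;
}

// Hypothetical wrappers: fall back to a single thread without OpenMP.
static int maxThreads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}

static int threadId() {
#ifdef _OPENMP
  return omp_get_thread_num();
#else
  return 0;
#endif
}

std::vector<std::vector<double> > runPrivate(
    const std::vector<std::uint64_t>& vals, int iterations) {
  std::vector<std::vector<double> > mat(
      maxThreads(), std::vector<double>(1000, -INFINITY));
  #pragma omp parallel
  {
    // Private buffer: allocated and written by the thread that uses it.
    std::vector<double> local_vec(1000, -INFINITY);
    #pragma omp for
    for (int i = 0; i < iterations; i++) {
      int total_count = 0;
      for (std::size_t j = 0; j < vals.size(); j++) {
        total_count += bitCount(vals[j]);
        local_vec[j] = total_count;
      }
    }
    // Each thread writes only its own row, so no synchronization is needed.
    mat[threadId()] = local_vec;
  }
  return mat;
}
```

The shared matrix is touched once per thread, after the loop, instead of on every inner iteration.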

C++ OpenMP: Writing to a matrix inside a for loop slows the loop down significantly
