OpenMP parallelization of a for loop: my code is inefficient

Asked: 2016-11-03 05:52:13

Tags: c++ multithreading openmp

This function is clearly the bottleneck of my entire program, and I think parallelizing it with OpenMP might help.

Here is a working example of my calculation (sorry, the function is a bit long). In my program, some of the work before the 5 nested loops is done elsewhere, and it is not a problem at all as far as efficiency is concerned.

#include <vector>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include "boost/dynamic_bitset.hpp"

using namespace std::chrono;

void compute_mddr(unsigned Ns, unsigned int block_size, unsigned int sector)
{
  std::vector<unsigned int> basis;
  for (std::size_t s = 0; s != std::pow(2,Ns); s++) {
    boost::dynamic_bitset<> s_bin(Ns,s);
    if (s_bin.count() == Ns/2) {
      basis.push_back(s);
    }
  }
  std::vector<double> gs(basis.size());
  for (unsigned int i = 0; i != gs.size(); i++)
    gs[i] = double(std::rand())/RAND_MAX;

  unsigned int ns_A = block_size;
  unsigned int ns_B = Ns-ns_A;
  boost::dynamic_bitset<> mask_A(Ns,(1<<ns_A)-(1<<0));
  boost::dynamic_bitset<> mask_B(Ns,((1<<ns_B)-(1<<0))<<ns_A);

  // Find the basis of the A block
  unsigned int NAsec = sector;
  std::vector<double> basis_NAsec;
  for (unsigned int s = 0; s < std::pow(2,ns_A); s++) {
    boost::dynamic_bitset<> s_bin(ns_A,s);
    if (s_bin.count() == NAsec)
      basis_NAsec.push_back(s);
  }
  unsigned int bs_A = basis_NAsec.size();

  // Find the basis of the B block
  unsigned int NBsec = (Ns/2)-sector;
  std::vector<double> basis_NBsec;
  for (unsigned int s = 0; s < std::pow(2,ns_B); s++) {
    boost::dynamic_bitset<> s_bin(ns_B,s);
    if (s_bin.count() == NBsec)
      basis_NBsec.push_back(s);
  }
  unsigned int bs_B = basis_NBsec.size();

  std::vector<std::vector<double> > mddr(bs_A);
  for (unsigned int i = 0; i != mddr.size(); i++) {
    mddr[i].resize(bs_A);
    for (unsigned int j = 0; j != mddr[i].size(); j++) {
      mddr[i][j] = 0.0;
    }
  }

  // Main calculation part
  for (unsigned int mu_A = 0; mu_A != bs_A; mu_A++) { // loop 1
    boost::dynamic_bitset<> mu_A_bin(ns_A,basis_NAsec[mu_A]);
    for (unsigned int nu_A = mu_A; nu_A != bs_A; nu_A++) { // loop 2
      boost::dynamic_bitset<> nu_A_bin(ns_A,basis_NAsec[nu_A]);

      double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
      for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
        boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);

        for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
          boost::dynamic_bitset<> si_bin(Ns,basis[si]);
          boost::dynamic_bitset<> si_A_bin = si_bin & mask_A;
          si_A_bin.resize(ns_A);
          if (si_A_bin != mu_A_bin)
            continue;
          boost::dynamic_bitset<> si_B_bin = (si_bin & mask_B)>>ns_A;
          si_B_bin.resize(ns_B);
          if (si_B_bin != mu_B_bin)
            continue;

          for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
            boost::dynamic_bitset<> sj_bin(Ns,basis[sj]);
            boost::dynamic_bitset<> sj_A_bin = sj_bin & mask_A;
            sj_A_bin.resize(ns_A);
            if (sj_A_bin != nu_A_bin)
              continue;
            boost::dynamic_bitset<> sj_B_bin = (sj_bin & mask_B)>>ns_A;
            sj_B_bin.resize(ns_B);
            if (sj_B_bin != mu_B_bin)
              continue;
            sum += gs[si]*gs[sj];
          }
        }
      }
      mddr[nu_A][mu_A] = mddr[mu_A][nu_A] = sum;
    }
  }
}


int main()
{
  unsigned int l = 8;
  unsigned int Ns = 2*l;
  unsigned block_size = 6; // must be between 1 and l
  unsigned sector = (block_size%2 == 0) ? block_size/2 : (block_size+1)/2;

  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  compute_mddr(Ns,block_size,sector);
  high_resolution_clock::time_point t2 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  std::cout << "Function took " << time_span.count() << " seconds.";
  std::cout << std::endl;
}

The compute_mddr function essentially fills the matrix mddr completely, and this corresponds to the two outermost loops 1 and 2. I decided to parallelize loop 3, since it is essentially computing a sum. To give orders of magnitude, loop 3 runs over ~50-100 elements of the basis_NBsec vector, while the two innermost loops over si and sj run over the ~10000 elements of the basis vector.

However, when running the code (compiled with -O3 -fopenmp, on gcc 5.4.0, Ubuntu 16.04 and an i5-4440 CPU), I see either no speedup (2 threads) or very limited gains (3 and 4 threads):

time OMP_NUM_THREADS=1 ./a.out
Function took 230.435 seconds.
real    3m50.439s
user    3m50.428s
sys 0m0.000s


time OMP_NUM_THREADS=2 ./a.out 
Function took 227.754 seconds.
real    3m47.758s
user    7m2.140s
sys 0m0.048s


time OMP_NUM_THREADS=3 ./a.out 
Function took 181.492 seconds.
real    3m1.495s
user    7m36.056s
sys 0m0.036s


time OMP_NUM_THREADS=4 ./a.out 
Function took 150.564 seconds.
real    2m30.568s
user    7m56.156s
sys 0m0.096s

If I understand the user figures correctly, the CPU usage is not great for 3 and 4 threads (and indeed, while the code runs, CPU usage hovers around 250% for 3 threads and only about 300% for 4 threads).

This is the first time I am using OpenMP; I have only played with it on simple examples before. Here, as far as I can see, I am not modifying any of the shared vectors basis_NAsec, basis_NBsec and basis in the parallel part, only reading them (an aspect that was pointed out in several related questions I have read).
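
To illustrate what I mean by read-only access, here is a minimal standalone sketch (not my actual code) of the pattern I believe I am using: a shared vector that is only read inside a reduction loop, which should be safe as far as I understand:

#include <cstdio>
#include <vector>

int main()
{
  std::vector<double> data(10000, 0.5);
  double sum = 0.0;
  // data is shared and only read; sum is combined by the reduction clause
#pragma omp parallel for reduction(+:sum) default(none) shared(data)
  for (std::size_t i = 0; i < data.size(); i++)
    sum += data[i]*data[i];
  std::printf("sum = %f\n", sum);
}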

So, what am I doing wrong?

1 Answer:

Answer 0 (score: 3):

A quick look at your program's performance with perf record shows that, regardless of the number of threads, the vast majority of the time is spent in malloc & free. That is generally a bad sign, and it also inhibits parallelization.

Samples: 1M of event 'cycles:pp', Event count (approx.): 743045339605                                                                                                                         
  Children      Self  Command  Shared Object        Symbol                                                                                                                                    
+   17.14%    17.12%  a.out    a.out                [.] _Z12compute_mddrjjj._omp_fn.0                                                                                                         
+   15.45%    15.43%  a.out    libc-2.23.so         [.] __memcmp_sse4_1                                                                                                                       
+   15.21%    15.19%  a.out    libc-2.23.so         [.] __memset_avx2                                                                                                                         
+   13.09%    13.07%  a.out    libc-2.23.so         [.] _int_free                                                                                                                             
+   11.66%    11.65%  a.out    libc-2.23.so         [.] _int_malloc                                                                                                                           
+   10.21%    10.20%  a.out    libc-2.23.so         [.] malloc                                                                                                                                

The malloc & free calls come from the constant creation of boost::dynamic_bitset objects, which are essentially std::vectors. Note: with perf it can be challenging to find the callers of a given function. You can run the program in gdb, break on malloc during the computation phase and continue a few times to figure out the callers.
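
A minimal sketch of that gdb workflow (assuming the binary was built with -g; the lines starting with ... are actions, not commands):

gdb ./a.out
(gdb) run
  ... interrupt with Ctrl+C once the main loops are running ...
(gdb) break malloc
(gdb) continue
(gdb) backtrace
  ... repeat continue/backtrace a few times to see which caller keeps allocating ...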

The straightforward way to improve performance is to keep these objects alive as long as possible, so that they are not reallocated over and over. This goes against the usual good practice of declaring variables as locally as possible. The transformation to reuse the dynamic_bitset objects could look like this:

#pragma omp parallel for reduction(+:sum)
  for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
    boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);

    boost::dynamic_bitset<> si_bin(Ns);
    boost::dynamic_bitset<> si_A_bin(Ns);
    boost::dynamic_bitset<> si_B_bin(Ns);

    boost::dynamic_bitset<> sj_bin(Ns);
    boost::dynamic_bitset<> sj_A_bin(Ns);

    boost::dynamic_bitset<> sj_B_bin(Ns);

    for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
      si_bin = basis[si];
      si_A_bin = si_bin;
      assert(si_bin.size() == Ns);
      assert(si_A_bin.size() == Ns);
      assert(mask_A.size() == Ns);
      si_A_bin &= mask_A;
      si_A_bin.resize(ns_A);
      if (si_A_bin != mu_A_bin)
        continue;
      si_B_bin = si_bin;
      assert(si_bin.size() == Ns);
      assert(si_B_bin.size() == Ns);
      assert(mask_B.size() == Ns);
      // Optimization note: dynamic_bitset::operator&
      // does create a new object, operator&= does not
      // Same for >>
      si_B_bin &= mask_B;
      si_B_bin >>= ns_A;
      si_B_bin.resize(ns_B);
      if (si_B_bin != mu_B_bin)
        continue;

      for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
        sj_bin = basis[sj];
        sj_A_bin = sj_bin;
        assert(sj_bin.size() == Ns);
        assert(sj_A_bin.size() == Ns);
        assert(mask_A.size() == Ns);
        sj_A_bin &= mask_A;
        sj_A_bin.resize(ns_A);
        if (sj_A_bin != nu_A_bin)
          continue;

        sj_B_bin = sj_bin;

        assert(sj_bin.size() == Ns);
        assert(sj_B_bin.size() == Ns);
        assert(mask_B.size() == Ns);

        sj_B_bin &= mask_B;
        sj_B_bin >>= ns_A;
        sj_B_bin.resize(ns_B);
        if (sj_B_bin != mu_B_bin)
          continue;
        sum += gs[si]*gs[sj];
      }
    }
  }

This already reduces the single-threaded runtime on my system from ~289 s to ~39 s. Furthermore, the program scales almost perfectly up to ~10 threads (4.1 s).

For more threads, there are load-balancing issues in the parallel loop. These can be mitigated a bit by adding schedule(dynamic), but I'm not sure how relevant that is for you.
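
On the existing pragma that is just one extra clause, e.g. (a sketch; the loop body stays exactly as above):

#pragma omp parallel for reduction(+:sum) schedule(dynamic)
  for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
    // ... body unchanged ...
  }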

More fundamentally, you should consider using std::bitset. Even leaving aside its extremely expensive constructor, boost::dynamic_bitset is very costly here: most of the time is spent in superfluous dynamic_bitset/vector code and in memmove/memcmp calls that operate on a single word:

+   32.18%    32.15%  ope_gcc_dyn  ope_gcc_dyn          [.] _ZNSt6vectorImSaImEEaSERKS1_                                                                                                      
+   29.13%    29.10%  ope_gcc_dyn  ope_gcc_dyn          [.] _Z12compute_mddrjjj._omp_fn.0                                                                                                     
+   21.65%     0.00%  ope_gcc_dyn  [unknown]            [.] 0000000000000000                                                                                                                  
+   16.24%    16.23%  ope_gcc_dyn  ope_gcc_dyn          [.] _ZN5boost14dynamic_bitsetImSaImEE6resizeEmb.constprop.102                                                                         
+   10.25%    10.23%  ope_gcc_dyn  libc-2.23.so         [.] __memcmp_sse4_1                                                                                                                   
+    9.61%     0.00%  ope_gcc_dyn  libc-2.23.so         [.] 0xffffd47cb9d83b78                                                                                                                
+    7.74%     7.73%  ope_gcc_dyn  libc-2.23.so         [.] __memmove_avx_unaligned  

That essentially goes away if you use a std::bitset of just a few words. Maybe 64 bits will always be enough for you. If the size needs to be dynamic over a large range, you could turn the entire function into a template, instantiate it for a number of different bit sizes, and select the appropriate one dynamically. I suspect you would gain another order of magnitude in performance. This may in turn reduce parallel efficiency, requiring another round of performance analysis.
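
A minimal sketch of that dispatch idea (the function names and the chosen bit widths are placeholders, not from the code above; the real loop nest would live inside the template):

#include <bitset>
#include <cstdio>
#include <stdexcept>

// Hypothetical sketch: NsMax is the compile-time bit width; the actual
// computation would use std::bitset<NsMax> instead of boost::dynamic_bitset<>.
template <std::size_t NsMax>
void compute_mddr_fixed(unsigned Ns, unsigned block_size, unsigned sector)
{
  std::bitset<NsMax> state;          // lives on the stack, no malloc/free
  state.set(Ns % NsMax);
  std::printf("using %zu-bit states, %zu bit(s) set\n", NsMax, state.count());
  (void)block_size; (void)sector;
}

// Runtime dispatch to a suitable instantiation.
void compute_mddr_any(unsigned Ns, unsigned block_size, unsigned sector)
{
  if      (Ns <= 64)  compute_mddr_fixed<64>(Ns, block_size, sector);
  else if (Ns <= 128) compute_mddr_fixed<128>(Ns, block_size, sector);
  else                throw std::runtime_error("Ns too large for this build");
}

int main() { compute_mddr_any(16, 6, 3); }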

It is really important to understand the performance of your code with the help of tools. There are very simple and very good tools for all kinds of cases. In your case, a simple one such as perf is sufficient.
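
For reference, the profiles shown above come from an invocation along these lines (a sketch; flags may differ slightly between perf versions):

perf record -e cycles:pp ./a.out
perf report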