This function is clearly the bottleneck of my whole program, and I figured that parallelizing it with OpenMP might help.
Here is a working example of my computation (sorry, the function is a bit long). In my program, some of the work done before the 5 nested loops happens elsewhere and is not a problem at all for efficiency.
#include <vector>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>   // std::rand, RAND_MAX
#include <chrono>
#include "boost/dynamic_bitset.hpp"

using namespace std::chrono;

void compute_mddr(unsigned Ns, unsigned int block_size, unsigned int sector)
{
  std::vector<unsigned int> basis;
  for (std::size_t s = 0; s != std::pow(2,Ns); s++) {
    boost::dynamic_bitset<> s_bin(Ns,s);
    if (s_bin.count() == Ns/2) {
      basis.push_back(s);
    }
  }

  std::vector<double> gs(basis.size());
  for (unsigned int i = 0; i != gs.size(); i++)
    gs[i] = double(std::rand())/RAND_MAX;

  unsigned int ns_A = block_size;
  unsigned int ns_B = Ns-ns_A;
  boost::dynamic_bitset<> mask_A(Ns,(1<<ns_A)-(1<<0));
  boost::dynamic_bitset<> mask_B(Ns,((1<<ns_B)-(1<<0))<<ns_A);

  // Find the basis of the A block
  unsigned int NAsec = sector;
  std::vector<double> basis_NAsec;
  for (unsigned int s = 0; s < std::pow(2,ns_A); s++) {
    boost::dynamic_bitset<> s_bin(ns_A,s);
    if (s_bin.count() == NAsec)
      basis_NAsec.push_back(s);
  }
  unsigned int bs_A = basis_NAsec.size();

  // Find the basis of the B block
  unsigned int NBsec = (Ns/2)-sector;
  std::vector<double> basis_NBsec;
  for (unsigned int s = 0; s < std::pow(2,ns_B); s++) {
    boost::dynamic_bitset<> s_bin(ns_B,s);
    if (s_bin.count() == NBsec)
      basis_NBsec.push_back(s);
  }
  unsigned int bs_B = basis_NBsec.size();

  std::vector<std::vector<double> > mddr(bs_A);
  for (unsigned int i = 0; i != mddr.size(); i++) {
    mddr[i].resize(bs_A);
    for (unsigned int j = 0; j != mddr[i].size(); j++) {
      mddr[i][j] = 0.0;
    }
  }

  // Main calculation part
  for (unsigned int mu_A = 0; mu_A != bs_A; mu_A++) { // loop 1
    boost::dynamic_bitset<> mu_A_bin(ns_A,basis_NAsec[mu_A]);
    for (unsigned int nu_A = mu_A; nu_A != bs_A; nu_A++) { // loop 2
      boost::dynamic_bitset<> nu_A_bin(ns_A,basis_NAsec[nu_A]);
      double sum = 0.0;
      #pragma omp parallel for reduction(+:sum)
      for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
        boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);
        for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
          boost::dynamic_bitset<> si_bin(Ns,basis[si]);
          boost::dynamic_bitset<> si_A_bin = si_bin & mask_A;
          si_A_bin.resize(ns_A);
          if (si_A_bin != mu_A_bin)
            continue;
          boost::dynamic_bitset<> si_B_bin = (si_bin & mask_B)>>ns_A;
          si_B_bin.resize(ns_B);
          if (si_B_bin != mu_B_bin)
            continue;
          for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
            boost::dynamic_bitset<> sj_bin(Ns,basis[sj]);
            boost::dynamic_bitset<> sj_A_bin = sj_bin & mask_A;
            sj_A_bin.resize(ns_A);
            if (sj_A_bin != nu_A_bin)
              continue;
            boost::dynamic_bitset<> sj_B_bin = (sj_bin & mask_B)>>ns_A;
            sj_B_bin.resize(ns_B);
            if (sj_B_bin != mu_B_bin)
              continue;
            sum += gs[si]*gs[sj];
          }
        }
      }
      mddr[nu_A][mu_A] = mddr[mu_A][nu_A] = sum;
    }
  }
}

int main()
{
  unsigned int l = 8;
  unsigned int Ns = 2*l;
  unsigned block_size = 6; // must be between 1 and l
  unsigned sector = (block_size%2 == 0) ? block_size/2 : (block_size+1)/2;

  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  compute_mddr(Ns,block_size,sector);
  high_resolution_clock::time_point t2 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  std::cout << "Function took " << time_span.count() << " seconds.";
  std::cout << std::endl;
}
The compute_mddr function essentially fills the matrix mddr completely, which corresponds to the two outermost loops 1 and 2.
I decided to parallelize loop 3, since it is essentially computing a sum. To give orders of magnitude, loop 3 runs over the ~50-100 elements of the basis_NBsec vector, while the two innermost loops si and sj run over the ~10000 elements of the basis vector.
However, when running the code (compiled with -O3 -fopenmp, gcc 5.4.0, Ubuntu 16.04, on an i5-4440 CPU), I see either no speedup (2 threads) or a very limited gain (3 and 4 threads):
time OMP_NUM_THREADS=1 ./a.out
Function took 230.435 seconds.
real 3m50.439s
user 3m50.428s
sys 0m0.000s
time OMP_NUM_THREADS=2 ./a.out
Function took 227.754 seconds.
real 3m47.758s
user 7m2.140s
sys 0m0.048s
time OMP_NUM_THREADS=3 ./a.out
Function took 181.492 seconds.
real 3m1.495s
user 7m36.056s
sys 0m0.036s
time OMP_NUM_THREADS=4 ./a.out
Function took 150.564 seconds.
real 2m30.568s
user 7m56.156s
sys 0m0.096s
If I understand the user numbers correctly, CPU usage is not good for 3 and 4 threads (and indeed, while the code is running I see about 250% CPU usage with 3 threads and barely 300% with 4 threads).
This is my first time using OpenMP; I have only played with it on simple examples before. Here, as far as I can tell, I am not modifying any of the shared vectors basis_NAsec, basis_NBsec and basis in the parallel part, only reading them (this was an aspect pointed out in several related questions I read).
So, what am I doing wrong?
Answer 0 (score: 3):
A quick look at the performance of your program with perf record shows that, regardless of the number of threads, most of the time is spent in malloc & free. That is generally a bad sign, and it also inhibits parallelization.
Samples: 1M of event 'cycles:pp', Event count (approx.): 743045339605
Children Self Command Shared Object Symbol
+ 17.14% 17.12% a.out a.out [.] _Z12compute_mddrjjj._omp_fn.0
+ 15.45% 15.43% a.out libc-2.23.so [.] __memcmp_sse4_1
+ 15.21% 15.19% a.out libc-2.23.so [.] __memset_avx2
+ 13.09% 13.07% a.out libc-2.23.so [.] _int_free
+ 11.66% 11.65% a.out libc-2.23.so [.] _int_malloc
+ 10.21% 10.20% a.out libc-2.23.so [.] malloc
The reason for all the malloc & free is the constant creation of boost::dynamic_bitset objects, which are essentially std::vectors. Note: with perf, it can be challenging to find the callers of a given function. You can simply run the program in gdb, interrupt it during the execution phase, break malloc, and continue in order to figure out the callers.
The immediate way to improve performance is to keep these objects alive as long as possible, so they are not reallocated over and over again. This goes against the usual good practice of declaring variables as locally as possible. A transformation that reuses the dynamic_bitset objects could look like this:
#pragma omp parallel for reduction(+:sum)
for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
  boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);
  boost::dynamic_bitset<> si_bin(Ns);
  boost::dynamic_bitset<> si_A_bin(Ns);
  boost::dynamic_bitset<> si_B_bin(Ns);
  boost::dynamic_bitset<> sj_bin(Ns);
  boost::dynamic_bitset<> sj_A_bin(Ns);
  boost::dynamic_bitset<> sj_B_bin(Ns);
  for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
    si_bin = basis[si];
    si_A_bin = si_bin;
    assert(si_bin.size() == Ns);
    assert(si_A_bin.size() == Ns);
    assert(mask_A.size() == Ns);
    si_A_bin &= mask_A;
    si_A_bin.resize(ns_A);
    if (si_A_bin != mu_A_bin)
      continue;
    si_B_bin = si_bin;
    assert(si_bin.size() == Ns);
    assert(si_B_bin.size() == Ns);
    assert(mask_B.size() == Ns);
    // Optimization note: dynamic_bitset::operator&
    // does create a new object, operator&= does not
    // Same for >>
    si_B_bin &= mask_B;
    si_B_bin >>= ns_A;
    si_B_bin.resize(ns_B);
    if (si_B_bin != mu_B_bin)
      continue;
    for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
      sj_bin = basis[sj];
      sj_A_bin = sj_bin;
      assert(sj_bin.size() == Ns);
      assert(sj_A_bin.size() == Ns);
      assert(mask_A.size() == Ns);
      sj_A_bin &= mask_A;
      sj_A_bin.resize(ns_A);
      if (sj_A_bin != nu_A_bin)
        continue;
      sj_B_bin = sj_bin;
      assert(sj_bin.size() == Ns);
      assert(sj_B_bin.size() == Ns);
      assert(mask_B.size() == Ns);
      sj_B_bin &= mask_B;
      sj_B_bin >>= ns_A;
      sj_B_bin.resize(ns_B);
      if (sj_B_bin != mu_B_bin)
        continue;
      sum += gs[si]*gs[sj];
    }
  }
}
This already brings the single-threaded runtime on my system down from ~289 s to ~39 s. The program also scales almost perfectly up to ~10 threads (4.1 s).
For more threads there are load-balancing issues in the parallel loop. These can be mitigated a bit by adding schedule(dynamic), but I'm not sure how relevant that is for you.
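For reference, a minimal sketch of that change, assuming you only add the clause to the existing pragma on loop 3 and leave the body untouched:

// dynamic scheduling hands out iterations in chunks at runtime,
// which helps when iterations have very uneven amounts of work
#pragma omp parallel for reduction(+:sum) schedule(dynamic)
for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
  // ... body unchanged ...
}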
More importantly, you should consider using std::bitset instead. Even leaving aside the extremely expensive constructor, boost::dynamic_bitset is costly: most of the time goes into superfluous dynamic_bitset / vector code and into memmove / memcmp calls that operate on what is effectively a single word.
+ 32.18% 32.15% ope_gcc_dyn ope_gcc_dyn [.] _ZNSt6vectorImSaImEEaSERKS1_
+ 29.13% 29.10% ope_gcc_dyn ope_gcc_dyn [.] _Z12compute_mddrjjj._omp_fn.0
+ 21.65% 0.00% ope_gcc_dyn [unknown] [.] 0000000000000000
+ 16.24% 16.23% ope_gcc_dyn ope_gcc_dyn [.] _ZN5boost14dynamic_bitsetImSaImEE6resizeEmb.constprop.102
+ 10.25% 10.23% ope_gcc_dyn libc-2.23.so [.] __memcmp_sse4_1
+ 9.61% 0.00% ope_gcc_dyn libc-2.23.so [.] 0xffffd47cb9d83b78
+ 7.74% 7.73% ope_gcc_dyn libc-2.23.so [.] __memmove_avx_unaligned
That basically goes away if you use a std::bitset of only a few words. Maybe 64 bits will always be enough for you. If the size is dynamic over a large range, you could turn the whole function into a template and instantiate it for a number of different bit widths, dynamically selecting the appropriate one. I suspect you would gain another order of magnitude in performance. This may in turn reduce the parallel efficiency, requiring another round of performance analysis.
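A rough sketch of that idea, assuming the five loops are rewritten with std::bitset<N> inside a hypothetical compute_mddr_impl helper (the widths 64/128/256 are arbitrary examples; pick whatever your problem sizes need):

#include <bitset>
#include <cstddef>

// Hypothetical helper: the same five loops as above, but with every
// boost::dynamic_bitset<> replaced by a fixed-size std::bitset<N>,
// so no heap allocation happens inside the loops.
template <std::size_t N>
void compute_mddr_impl(unsigned Ns, unsigned int block_size, unsigned int sector)
{
  // std::bitset<N> mask_A, mask_B, si_bin, si_A_bin, ...;
  // ... loops 1-5 as before ...
}

// Runtime dispatch to a suitable compile-time bit width.
void compute_mddr(unsigned Ns, unsigned int block_size, unsigned int sector)
{
  if (Ns <= 64)
    compute_mddr_impl<64>(Ns, block_size, sector);
  else if (Ns <= 128)
    compute_mddr_impl<128>(Ns, block_size, sector);
  else
    compute_mddr_impl<256>(Ns, block_size, sector);
}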
In general, it is essential to use tools to understand the performance of your code. There are very simple and very good tools for all sorts of situations; in your case, something as simple as perf is sufficient.