Question

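I am running the following code to build a distance matrix between each point and all the other points I have in the map dat[]. The code produces correct results, but its run time does not improve with more threads: on an 8-core machine it takes the same time whether I set the number of threads to 1 or to 10. I would be grateful if someone could help me find what is wrong with my code, and for any suggestions that would make it run faster. Here is the code: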

map< int,string >::iterator datIt;
map <int, map< int, double> > dist;
int mycont=0;
datIt=dat.begin();
int size=dat.size();
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp  parallel    //construct the distance matrix
{   
    map< int,string >::iterator datItLocal=datIt;
    int lastIdx = 0;
    #pragma omp for   
    for(int i=0;i<size;i++)
    {
        std::advance(datItLocal, i - lastIdx);
        lastIdx = i;
        map< int,string >::iterator datIt2=datItLocal;
        datIt2++;
        while(datIt2!=dat.end())
        {
            double ecl=0;
            int c=count((*datItLocal).second.begin(),(*datItLocal).second.end(),delm);
            string line1=(*datItLocal).second;
            string line2=(*datIt2).second;
            for (int i=0;i<c;i++)
            {
                double num1=atof(line1.substr(0,line1.find_first_of(delm)).c_str());
                line1=line1.substr(line1.find_first_of(delm)+1).c_str();
                double num2=atof(line2.substr(0,line2.find_first_of(delm)).c_str());
                line2=line2.substr(line2.find_first_of(delm)+1).c_str();
                ecl += (num1-num2)*(num1-num2);
            }
            ecl=sqrt(ecl);
            omp_set_lock(&lock);
            dist[(*datItLocal).first][(*datIt2).first]=ecl;
            dist[(*datIt2).first][(*datItLocal).first]=ecl;
            omp_unset_lock(&lock);
            datIt2++;
        }
    }
}
omp_destroy_lock(&lock);

Answer 1

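My guess is that using a single lock to protect 'dist' serializes your program. Option 1: Consider a fine-grained locking strategy. You typically benefit from this if dist.size() is much larger than the number of threads.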

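map <int, omp_lock_t> locks;   // one lock per key of dat; create and omp_init_lock() every entry before the parallel region
...
int key1 = (*datItLocal).first;
int key2 = (*datIt2).first;
// key1 < key2 always holds here (datIt2 starts after datItLocal), so the locks are taken in a consistent order.
// The rows dist[key1] and dist[key2] must already exist: inserting a new row into the outer map is not protected by these locks.
omp_set_lock(&(locks[key1]));
omp_set_lock(&(locks[key2]));
dist[key1][key2]=ecl;
dist[key2][key1]=ecl;
omp_unset_lock(&(locks[key2]));
omp_unset_lock(&(locks[key1]));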

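Option 2: Your compiler might already perform the optimization mentioned in Option 1, so you can try dropping the lock and using the built-in critical section: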

   #pragma omp critical
   {
      dist[(*datItLocal).first][(*datIt2).first]=ecl;
      dist[(*datIt2).first][(*datItLocal).first]=ecl;
   }
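A different way to reduce contention, not one of the two options above, is to take the critical section once per thread instead of once per pair: each thread collects its results in a private buffer and merges them into 'dist' at the end. Below is a minimal sketch of that idea. The names Entry, buildDist, keys and pairDist are hypothetical and not from the question; pairDist stands in for whatever thread-safe distance computation you use.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical record: the two map keys of a pair and their distance.
struct Entry { int a; int b; double d; };

// keys[i] is the map key of point i; pairDist(i, j) must be safe to call concurrently.
template <class DistFn>
std::map<int, std::map<int, double> > buildDist(const std::vector<int>& keys, DistFn pairDist) {
    std::map<int, std::map<int, double> > dist;
    const long n = static_cast<long>(keys.size());

    #pragma omp parallel
    {
        std::vector<Entry> local;  // private to each thread, so no locking while computing

        #pragma omp for schedule(dynamic) nowait
        for (long i = 0; i < n; ++i) {
            for (long j = i + 1; j < n; ++j) {
                Entry e = { keys[static_cast<std::size_t>(i)],
                            keys[static_cast<std::size_t>(j)],
                            pairDist(i, j) };
                local.push_back(e);
            }
        }

        // One critical section per thread: merge the private results into the shared map.
        #pragma omp critical
        {
            for (std::size_t k = 0; k < local.size(); ++k) {
                dist[local[k].a][local[k].b] = local[k].d;
                dist[local[k].b][local[k].a] = local[k].d;
            }
        }
    }
    return dist;
}
```

The merge is still serial, so this mainly helps when computing a distance costs much more than a map insertion. If the pragmas are ignored (OpenMP disabled), the function still runs correctly on one thread.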

Answer 2

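I'm a bit unsure what exactly you are trying to do with your loops; it looks like a quadratic nested loop over the map. Assuming that is intended, I think the following line will perform poorly when parallelized: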

std::advance(datItLocal, i - lastIdx);

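If OpenMP is disabled, this advances one step each time, which is fine. But with OpenMP, multiple threads execute chunks of that loop in no particular order. So one of them might start at i = 100000, and it then has to advance 100000 steps through the map just to get started. This could happen a lot if there are many threads each being handed relatively small chunks of the loop at a time. You might even end up memory/cache bound, since you keep having to walk through this possibly large map. It looks like this might be (part of) your culprit, since it could get worse as more threads become available.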

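Fundamentally, I guess I'm a bit skeptical of trying to iterate in parallel over a sequential data structure. If you profile it, you will probably learn more about which parts are actually slow.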
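One way to sidestep the sequential data structure is to walk the map once, serially, into a random-access container and run the parallel loop over that. The sketch below is an assumption-laden illustration, not the asker's code: parseLine and distanceMatrix are hypothetical names, the result is a flat row-major matrix indexed by position in map order rather than a nested map, and every delimiter-separated field is parsed (the original loop reads only as many fields as there are delimiters). Parsing each line once up front also removes the repeated substr/atof work from the inner loop.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a line on delim and convert each non-empty field with atof.
std::vector<double> parseLine(const std::string& line, char delim) {
    std::vector<double> out;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, delim)) {
        if (!field.empty()) out.push_back(std::atof(field.c_str()));
    }
    return out;
}

// Returns an n x n row-major matrix; row/column k is the k-th entry of dat in key order.
std::vector<double> distanceMatrix(const std::map<int, std::string>& dat, char delim) {
    // Serial pass: iterate the map once and parse every line once.
    std::vector<std::vector<double> > pts;
    pts.reserve(dat.size());
    for (std::map<int, std::string>::const_iterator it = dat.begin(); it != dat.end(); ++it) {
        pts.push_back(parseLine(it->second, delim));
    }

    const std::size_t n = pts.size();
    std::vector<double> dist(n * n, 0.0);
    const long rows = static_cast<long>(n);

    // Each (r, c) and (c, r) cell is written by exactly one iteration, so no lock is needed.
    // schedule(dynamic) balances the triangular workload (early rows have more pairs).
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < rows; ++i) {
        const std::size_t r = static_cast<std::size_t>(i);
        for (std::size_t c = r + 1; c < n; ++c) {
            const std::size_t dims = std::min(pts[r].size(), pts[c].size());
            double sum = 0.0;
            for (std::size_t k = 0; k < dims; ++k) {
                const double diff = pts[r][k] - pts[c][k];
                sum += diff * diff;
            }
            const double e = std::sqrt(sum);
            dist[r * n + c] = e;
            dist[c * n + r] = e;
        }
    }
    return dist;
}
```

If you need the results keyed by the original map keys, you can keep a separate vector of keys filled in the same serial pass and translate positions back afterwards.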

Performance of OpenMP code and how to make it faster
