Question

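I am using a 16-core node. However, when I run my code in parallel, it runs hundreds of times slower than the serial version. I cannot understand why. The parallel region is as follows: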

int Vector_mult_Matrix(vector<double> & vec, CTMC_matrix & ctmc_um)
{
    vector<double> res_vec(vec.size(), 0);
    omp_set_num_threads(16);
    #pragma omp parallel num_threads(16)
    {
        #pragma omp for schedule(static) nowait
        for (size_t i = 0; i < ctmc_um.trans_num; i++)
        {
            double temp = 0;
            temp = res_vec[ctmc_um.to_index[i]] + vec[ctmc_um.from_index[i]] * ctmc_um.rate[i];

            #pragma omp critical
            res_vec[ctmc_um.to_index[i]] = temp;
        }
    }

    vec.swap(res_vec);
    return 0;
}

Answer 1

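I'm not sure why it is exactly 100 times slower, but it will slow down because the threads read and write the same memory region. Multiple threads have to lock that region, otherwise you will see a race condition. (If you were only reading, no locking would be needed.)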

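You are indexing with res_vec[ctmc_um.to_index[i]], so even though OpenMP has split the loop index i across the threads, the indices you actually access in res_vec may still overlap between threads (they come from ctmc_um.to_index[i], not from i). As a result, every other thread may have to wait for one thread to finish its work, and there are 16 of them.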
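One way to reduce the contention described above is to replace the critical section with an atomic update, so that threads only serialize when they actually hit the same element of res_vec. Note also that in the question's loop the read of res_vec[...] happens outside the critical section, so the update is not fully protected anyway. The following is a minimal sketch, not the asker's real code: the CtmcMatrix struct and the function name vector_mult_matrix_atomic are assumptions made up for illustration, with fields guessed from the names used in the question.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the asker's CTMC_matrix; the field names are
// taken from the question, the types are assumed.
struct CtmcMatrix {
    std::size_t trans_num;                 // number of transitions
    std::vector<std::size_t> from_index;   // source state of each transition
    std::vector<std::size_t> to_index;     // target state of each transition
    std::vector<double> rate;              // rate of each transition
};

// Computes res[to[i]] += vec[from[i]] * rate[i] over all transitions and
// stores the result back into vec, like the function in the question.
void vector_mult_matrix_atomic(std::vector<double>& vec, const CtmcMatrix& m)
{
    std::vector<double> res_vec(vec.size(), 0.0);
    double* out = res_vec.data();  // plain pointer so the atomic target is a simple lvalue

    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < m.trans_num; ++i) {
        // The product only reads shared data, so it needs no protection.
        const double contrib = vec[m.from_index[i]] * m.rate[i];

        // Only the read-modify-write of the shared element is made atomic.
        #pragma omp atomic
        out[m.to_index[i]] += contrib;
    }

    vec.swap(res_vec);
}
```

Compile with OpenMP enabled (for example -fopenmp with GCC or Clang). Without that flag the pragmas are ignored and the loop simply runs serially. If many transitions share the same target index, the atomic updates can still contend. In that case, giving each thread its own private result vector and summing them afterwards may work better, at the cost of extra memory.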
