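OpenMP poor performance / confusion

Time: 2019-02-18 22:52:20

Tags: c multithreading openmp

The following is code from Tim Mattson's series of videos on OpenMP. The only change I made was to raise the number of threads to 24, since I have a 24-core machine. It is not performing nearly as well as it should, and I am baffled as to why (see the results below). Am I missing something here? I should mention that I am a theoretical computer scientist with experience in algorithms, but I am somewhat rusty when it comes to hardware.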

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000;
double step;
int main ()
{
  int i;
  double x, pi, sum = 0.0;
  double start_time, run_time;

  step = 1.0/(double) num_steps;
  for (i=1;i<=24;i++){
    sum = 0.0;
    omp_set_num_threads(i);
    start_time = omp_get_wtime();
#pragma omp parallel  
    {
#pragma omp single
      printf(" num_threads = %d",omp_get_num_threads());

#pragma omp for reduction(+:sum)
      for (i=1;i<= num_steps; i++){
          x = (i-0.5)*step;
          sum = sum + 4.0/(1.0+x*x);
      }
    }

    pi = step * sum;
    run_time = omp_get_wtime() - start_time;
    printf("\n pi is %f in %f seconds and %d threads\n",pi,run_time,i);
  }
}

I expected that with 24 cores it would be 20-24 times faster, but it is barely twice as fast. Why?! The output is as follows:

 num_threads = 1
 pi is 3.141593 in 1.531695 seconds and 1 threads
 num_threads = 2
 pi is 3.141594 in 1.405237 seconds and 2 threads
 num_threads = 3
 pi is 3.141593 in 1.313049 seconds and 3 threads
 num_threads = 4
 pi is 3.141592 in 1.069563 seconds and 4 threads
 num_threads = 5
 pi is 3.141587 in 1.058272 seconds and 5 threads
 num_threads = 6
 pi is 3.141590 in 1.016013 seconds and 6 threads
 num_threads = 7
 pi is 3.141579 in 1.023723 seconds and 7 threads
 num_threads = 8
 pi is 3.141582 in 0.760994 seconds and 8 threads
 num_threads = 9
 pi is 3.141585 in 0.791577 seconds and 9 threads
 num_threads = 10
 pi is 3.141593 in 0.868043 seconds and 10 threads
 num_threads = 11
 pi is 3.141592 in 0.797610 seconds and 11 threads
 num_threads = 12
 pi is 3.141592 in 0.802422 seconds and 12 threads
 num_threads = 13
 pi is 3.141590 in 0.941856 seconds and 13 threads
 num_threads = 14
 pi is 3.141591 in 0.928252 seconds and 14 threads
 num_threads = 15
 pi is 3.141592 in 0.867834 seconds and 15 threads
 num_threads = 16
 pi is 3.141593 in 0.830614 seconds and 16 threads
 num_threads = 17
 pi is 3.141592 in 0.856769 seconds and 17 threads
 num_threads = 18
 pi is 3.141591 in 0.907325 seconds and 18 threads
 num_threads = 19
 pi is 3.141592 in 0.880962 seconds and 19 threads
 num_threads = 20
 pi is 3.141592 in 0.855475 seconds and 20 threads
 num_threads = 21
 pi is 3.141592 in 0.825202 seconds and 21 threads
 num_threads = 22
 pi is 3.141592 in 0.759689 seconds and 22 threads
 num_threads = 23
 pi is 3.141592 in 0.751121 seconds and 23 threads
 num_threads = 24
 pi is 3.141592 in 0.745476 seconds and 24 threads

So, what am I missing?

2 Answers:

Answer 0: (score: 3)

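You have a variable x that is shared between all threads.

While the compiler will likely optimize its use so that you still get the right result (by keeping the computed value of x in a register), the value will be written out to memory on every iteration. This creates stalls while cache lines are flushed and reloaded.

The fix is to declare x in the body of the loop where it is used (double x = (i-0.5)*step;), instead of at the top of main.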
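A minimal sketch of that fix, written as a standalone function so it can be timed for different thread counts. The name compute_pi and its parameters are illustrative and do not appear in the original code. It also uses the num_threads clause in place of omp_set_num_threads, which is a swapped-in choice; adding private(x) to the pragma would be another way to get a per-thread x. Because a shared x written by several threads is formally a data race, it may also account for the drifting last digits of pi in the question's output, though that is an assumption rather than something measured here.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1].
   compute_pi, num_steps and threads are hypothetical names for this sketch. */
double compute_pi(long num_steps, int threads)
{
    assert(num_steps > 0 && threads > 0);

    const double step = 1.0 / (double) num_steps;
    double sum = 0.0;
    long i;

    /* i is the loop variable of the worksharing loop, so OpenMP makes it
       private automatically; the reduction clause gives every thread its
       own partial sum and adds the partial sums together at the end. */
#pragma omp parallel for num_threads(threads) reduction(+:sum)
    for (i = 1; i <= num_steps; i++) {
        /* Declared inside the loop body, so each thread has its own x. */
        double x = ((double) i - 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    return step * sum;
}
```

To time it, wrap a call such as compute_pi(100000000, 24) between two omp_get_wtime() calls, as the question's code does, and compile with OpenMP enabled (for example gcc -fopenmp).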

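Answer 1: (score: 0)

In general, there are two things to consider when using threads to speed something up:

  • The size of the task, and whether its parts are well suited to parallelization
  • The overhead of the parallelization itself (e.g., creating threads, killing threads, etc.)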

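Amdahl's law gives us some context here. Let's be generous and assume that the portion of the code that benefits from the speedup (p) is 0.5, or half of the code. You are asserting that this will make that portion 24 times faster (so s = 24):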

S = 1 / ((1 - p) + p/s) = 1 / (0.5 + 0.5/24) ≈ 1.92

So, in theory, you get a 1.92x performance improvement, which is not the 24x improvement you were hoping for.
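The arithmetic above can be checked with a small helper. This is only a sketch of Amdahl's law, S = 1 / ((1 - p) + p/s), and the function name amdahl_speedup is an illustrative choice, not something from the original answer.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the run time is
   sped up by a factor s and the remaining (1 - p) is unchanged.
   amdahl_speedup is a hypothetical helper name for this sketch. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

Calling amdahl_speedup(0.5, 24.0) reproduces the roughly 1.92x figure, and trying larger values of p shows how strongly the achievable speedup depends on the parallel fraction.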

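One idea is to analyze which parts are better suited to heavy parallelization. Profile the code without threads as well, and see whether it performs better than with your current threading layout.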