Question

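I am working on some code with OpenMP. The code must print to a file all the prime numbers between 2 and 1000000. The serial algorithm takes 150 seconds to complete all the calculations; with two threads (export OMP_NUM_THREADS=2) the code runs in 81 seconds, which means a speedup of 1.85. But beyond 2 threads (export OMP_NUM_THREADS=3 or 4), the speedup does not change. It stays at about 1.8.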

I also changed the scheduling, and that made no difference.

My code is in primes.cpp. You can copy and paste it into your editor and compile it with the following command line:

~$ g++ primes.cpp -o primes -fopenmp

Set the number of threads to 2 (or whatever you like):

~$ export OMP_NUM_THREADS=2

Change the loop scheduling (static, dynamic, guided):

~$ export OMP_SCHEDULE=dynamic,100000

and run it with

~$ ./primes

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <algorithm>
#include <time.h>
#include <omp.h>

#define SIZE 1000000

using namespace std;



int main(){
    // code that writes to a file the list of
    // prime numbers between 0 and SIZE

    // variables
    int cprime;
    int chunk;
    int lap, loop, i;
    int isprime;
    int count;

    FILE * file;
    char * filename;

    time_t t1;
    vector<int>primelist;

    int thread_num;
    //omp_sched_t schedule;

    // initialisation
    t1 = time(NULL);
    chunk = 100000;
    count = 0;

    filename = (char *) malloc(sizeof(char)*100);
    strcpy(filename, "primes.txt");

    file = fopen(filename, "w");

    // ------------- ALGORITHM ---------------
    #pragma omp parallel private(thread_num)
    {
      thread_num = omp_get_thread_num();

      if(thread_num == 0) 
          printf("%d processor are available for work\n", omp_get_num_threads());      

      #pragma omp barrier
      #pragma omp critical
      {
     printf("I'm processor %d ready for work\n", thread_num);
      }

    }

    #pragma omp parallel for private(cprime, loop, isprime) schedule(runtime) shared(primelist) reduction(+:count)
    for(cprime = 2; cprime < SIZE; cprime++){

        loop = 1;
        isprime = 1;

        // looking if it's a prime number
        while((++loop<cprime) && isprime){
            if(cprime % loop == 0) isprime = 0;
        }

        if(isprime) {    
             #pragma omp critical
          {
            primelist.push_back(loop);
          }   

          count++;
        }

        #pragma omp critical 
        {
          if(cprime % chunk == 0) 
            printf("Indicator from thread %d current(size N) : %d\n", omp_get_thread_num(), cprime);
        }

    }

    sort(primelist.begin(), primelist.end());
    lap = primelist.size();

    for(i = 0; i < lap; i++)
      fprintf(file, "%d\n", primelist[i]);

    fclose(file);

    printf("%d primes where discover between 0 and %d, duration of the operation         %d\n", count, SIZE, (int) difftime(time(NULL), t1));

    return 0;

}

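Runtime environment information

My computer has 4 processors.

I verified this in the file /proc/cpuinfo, which lists processor : 0 through processor : 3. All of them are Intel(R) Core(TM) i5 CPU M 600 @ 2.53 GHz.

Thank you for your replies.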

Answer 1

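Check the CPU of the computer you are running it on. If it does not have more than 2 cores, you are unlikely to see much improvement beyond two threads.

Be careful with hyper-threaded CPUs: the operating system reports twice as many cores as physically exist.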

Answer 2

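The first thing I would do is use an OpenMP profiler such as

http://www.vi-hps.org/datapool/page/18/fuerlinger.pdf

to determine whether something is wrong with the parallelism. It may be that you are serializing on the push_back in the middle of the loop, and that takes time. Or maybe the for loop is not being parallelized properly, even though a quick glance did not show me anything inherently wrong.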

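Next, remember to measure your code against the fastest known serial implementation. There is a sieve-based one in Knuth, TAOCP, that is hard to beat with a parallel algorithm.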

Answer 3

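First of all, you should not expect linear speedup from a naive implementation. Only in rare cases will a parallel implementation scale linearly for any number of cores.

But there are also some problems with your code and with the way you measure the runtime. Both can give you a misleading impression of the speedup.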
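One common way to reason about this limit is Amdahl's law, which the answer does not name explicitly. Here p is the fraction of the runtime that can be parallelized and n is the number of threads; the value p = 0.9 in the example is purely an illustrative assumption, not a measurement of the program in the question.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}

% Example with an assumed p = 0.9 and n = 4 threads:
S(4) = \frac{1}{0.1 + 0.225} = \frac{1}{0.325} \approx 3.08
```

Under that assumption, four threads give at most about a 3x speedup, and no number of threads can exceed 10x.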

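As for your code, I have to say that synchronization (in your case via critical sections) will always slow your software down considerably. I have run into this problem myself many times. But unlike in your problem, I knew in advance how many elements my vector would contain. So I could resize the vector first and then put each element in its correct position, without appending to the vector. This sped up the code considerably on many processors. However, I do not have a similar solution for your problem.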

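There are also some minor bugs in your code: without protection, your variable count has no predictable value after several threads have updated it. The increment should either be inside a critical section (or, here, use an atomic operation) or, better, count should be a + reduction on the for loop, which gives each thread its own private copy and sums the copies at the end, like this:

#pragma omp parallel for private(cprime, loop, isprime) reduction(+: count) schedule(runtime)

This produces the correct result for count once the for loop has finished. Note that count must not also be listed in the private clause, because a variable may appear in only one of the two.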

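I am not sure why you use schedule(runtime) on the for directive. Be aware of what it does: with schedule(runtime) the loop takes its schedule from the OMP_SCHEDULE environment variable (or from omp_set_schedule), so whatever you set with export is what is actually used. A schedule clause with an explicit kind, such as schedule(static), would ignore that setting.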

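Now for the problem with timing your application: you are timing the whole application, not just the parallel for loop. You should keep in mind that this also includes the sequential sort, which limits the speedup you can get from the application. Also, for the baseline measurement of the sequential application, you should run with OpenMP enabled and just one thread; this will be slower than the application built without OpenMP, because OpenMP has a small overhead even with a single thread. Measured that way, you might see the expected 2x speedup with two threads.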

How can I get linear speedup with more than 3 threads in my code?
