Prime counter (Sieve of Eratosthenes) using MPI is too slow

Date: 2014-05-11 05:36:16

Tags: performance mpi overhead sieve-of-eratosthenes

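The following code counts all primes up to 50,000,000 and works 100% correctly. The problem is that it takes too long: with 32 processors I get about 42 seconds, while my classmate gets 16 seconds, and I can't find where my code is lagging. Please leave any suggestions :)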

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

const int n=50000000;   //number until which primes are counted (inclusive)

int isComposite (int num);   //declared before main, which calls it

int main (int argc, char *argv[]) {

    long local_count=0, global_count=0,     //variables used to keep count of the primes
         start, finish,      //lowest and highest numbers checked by this processor
         i;

    int rank,           //processor id
        p,              //number of processes
        size, proc0_size;   //amount of numbers to check on any processor and proc 0
    double runtime;     //variable used to keep track of total elapsed time

    //initialize the MPI execution environment
    MPI_Init (NULL, NULL);

    //determine the rank of the calling processor in the communicator
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    //determine the size of communicator
    MPI_Comm_size (MPI_COMM_WORLD, &p);

    //Ensures that every processor enters code at around the same time
    MPI_Barrier(MPI_COMM_WORLD);
    //Start the timer
    runtime = -MPI_Wtime();

    //compute the range and size to be used for this processor
    start = 2 + (long)rank*(n-1)/p;        //long arithmetic avoids int overflow for large p
    finish = 1 + (long)(rank+1)*(n-1)/p;
    size = finish - start + 1;

    //determine the size of processor 0
    proc0_size = (n-1)/p;

    //in the case where there are too many processors for the amount of numbers
    //to check...   
    if ((2 + proc0_size) < (long) sqrt((double) n)) {
        if (rank == 0)
            printf ("Too many processors to calculate the number of primes up to %d\n", n);
        MPI_Finalize();
        exit(1);
    }

    int j;
    //check every number in the range of this processor
    for (j=start; j<=finish; j++) {
        //if the number is not composite, ie prime, increment the local counter
        if (isComposite(j)==0) {
            local_count++;
        }
    }

    //MPI_Reduce used to combine all local counts
    //(MPI_LONG matches the C type long of local_count and global_count)
    MPI_Reduce (&local_count, &global_count, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    //adjust the total elapsed runtime
    runtime += MPI_Wtime();
    printf("Process %d finished\n", rank);

    // print the results from process ranked root
    if (rank == 0) {
        printf ("There are %ld primes less than or equal to %d\n", global_count, n);
        printf ("Total elapsed time:  %f seconds\n", runtime);
    }

    //terminate the execution environment
    MPI_Finalize ();
    return 0;
}

//function to check if a number is composite
//returns 0 if it is not composite (prime) and returns 1 if it is composite
int isComposite (int num) {
    int j;

    if (num < 2)
        return 1;               //0 and 1 are not prime
    if (num % 2 == 0)
        return num != 2;        //2 is the only even prime
    //trial division by odd j up to sqrt(num)
    for (j = 3; j * j <= num; j += 2) {
        if (num % j == 0)
            return 1;
    }
    return 0;
}

1 Answer:

Answer 0 (score: 0)

I compiled your test and ran it on my local machine.

I got

 3001134 primes less than or equal to 50000000 

Total elapsed time: about 35-37 seconds. My PC has a 4-core Q6600 at 2.4 GHz running 64-bit Ubuntu. The test was compiled with mpicc t.c -o t -lm -O3

With only two cores, the time was 66 seconds.

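I noticed that the higher-ranked processes have more work to do (testing a larger number takes more trial divisions, up to its square root), so they finish their computation later than the lower-ranked ones. When you use many processes, the total execution time is therefore determined by the last process to finish, the one with the highest rank.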

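Try redistributing the work (give the higher-ranked processes smaller segments) and optimizing your isComposite function. Profiling with perf record shows that the line num%j == 0 takes most of the time (about 80%).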
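One way to give higher ranks smaller segments is sketched below. It rests on a rough assumption, not a measurement: testing k costs about sqrt(k) divisions, so the work for [2, x] grows like x^(3/2), and equal work per rank puts the boundary between ranks r-1 and r near n * (r/p)^(2/3). The names boundary and segment_bounds are hypothetical. The cube root is found by a binary search so that no extra libm call is needed.

```c
/* Hypothetical helpers (not from the question): split [2, n] into p
 * contiguous ranges of roughly equal estimated trial-division cost.
 * Assumption: testing k costs about sqrt(k) divisions, so the work for
 * [2, x] grows like x^(3/2) and the boundary between ranks r-1 and r
 * sits near n * (r/p)^(2/3). */

/* Largest b in [0, n] with b^3 <= n^3 * (r/p)^2, i.e. b ~ n*(r/p)^(2/3).
 * Binary search in doubles, so no libm function is required. */
static long boundary(long n, int p, int r)
{
    double target = (double)n * n * n * r * r / ((double)p * p);
    long lo = 0, hi = n;

    while (lo < hi) {
        long mid = lo + (hi - lo + 1) / 2;
        if ((double)mid * mid * mid <= target)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Inclusive range [*first, *last] for this rank; higher ranks receive
 * shorter ranges. Consecutive ranks share no numbers and leave no gaps. */
void segment_bounds(long n, int p, int rank, long *first, long *last)
{
    *first = (rank == 0) ? 2 : boundary(n, p, rank) + 1;
    *last = (rank == p - 1) ? n : boundary(n, p, rank + 1);
}
```

In the question's main, the two lines computing start and finish could be replaced by segment_bounds(n, p, rank, &start, &finish). Because the cost model ignores that composites usually exit early, the balance is only approximate and is worth checking with the per-process timings.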

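It would be better to prepare a list of small primes and sieve with them as the first step, before moving on to the more expensive sieving.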
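A minimal sketch of the small-primes idea, assuming the table is built once per rank with a plain sieve up to sqrt(n) (for n = 50,000,000 that is fewer than 1,000 primes). The functions small_primes and is_prime_by_table are hypothetical names; note that is_prime_by_table returns 1 for a prime, the opposite convention of the question's isComposite.

```c
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'ed array holding every prime
 * <= limit (plain sieve of Eratosthenes) and stores its length in *count.
 * The caller frees the array; returns NULL if allocation fails. */
int *small_primes(int limit, int *count)
{
    char *composite = calloc((size_t)limit + 1, 1);
    int *primes = malloc(((size_t)limit + 1) * sizeof *primes);
    int i, c = 0;
    long k;

    if (composite == NULL || primes == NULL) {
        free(composite);
        free(primes);
        return NULL;
    }
    for (i = 2; i <= limit; i++) {
        if (composite[i])
            continue;
        primes[c++] = i;
        /* start at i*i: smaller multiples were already marked */
        for (k = (long)i * i; k <= limit; k += i)
            composite[k] = 1;
    }
    free(composite);
    *count = c;
    return primes;
}

/* Hypothetical replacement for isComposite: trial division by the
 * tabulated primes only. Returns 1 if num is prime, 0 otherwise.
 * Assumes the table contains every prime <= sqrt(num). */
int is_prime_by_table(int num, const int *primes, int count)
{
    int t;

    if (num < 2)
        return 0;
    for (t = 0; t < count && (long)primes[t] * primes[t] <= num; t++) {
        if (num % primes[t] == 0)
            return 0;
    }
    return 1;
}
```

Each rank could call small_primes((int)sqrt((double)n) + 1, &count) before its loop and then use local_count += is_prime_by_table(j, primes, count). This still performs trial division, but it skips divisors such as 9, 15 or 21 whose prime factors have already been tried.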

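Also, sieving means not iterating j over start..finish and testing each j. Instead, create an array covering start..finish, then iterate i over every number up to and including sqrt(finish), and mark the entries for 2*i, 3*i, 4*i, etc. that fall inside the interval as composite (you can step through the multiples with addition instead of multiplication). Then count the unmarked elements of the array to get the number of primes in the interval. (Please check the animation at http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes.) What your code does is trial division, the slowest approach: http://en.wikipedia.org/wiki/Trial_division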
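The segmented sieve described above can be sketched as follows. The function count_primes_in_range is a hypothetical name; it builds its own small sieve up to sqrt(hi), so each rank needs no communication, and it indexes the segment array by m - lo, the offset that the array[2*i] shorthand leaves implicit. Marking starts at max(q*q, first multiple of q at or above lo), so a prime q lying inside the range is never marked.

```c
#include <stdlib.h>

/* Hypothetical helper: counts the primes in [lo, hi] with a segmented
 * sieve of Eratosthenes. Returns -1 if memory cannot be allocated. */
long count_primes_in_range(long lo, long hi)
{
    long root = 0, size, count = 0, q, m, i;
    char *small, *seg;

    if (lo < 2)
        lo = 2;                          /* 0 and 1 are not prime */
    if (hi < lo)
        return 0;

    /* integer square root of hi (avoids needing -lm) */
    while ((root + 1) * (root + 1) <= hi)
        root++;

    size = hi - lo + 1;
    small = calloc((size_t)root + 1, 1); /* small[q] != 0: q is composite */
    seg = calloc((size_t)size, 1);       /* seg[i] != 0: lo + i is composite */
    if (small == NULL || seg == NULL) {
        free(small);
        free(seg);
        return -1;
    }

    for (q = 2; q <= root; q++) {
        if (small[q])
            continue;                    /* q is not prime, skip it */
        /* extend the small sieve: mark multiples of q up to root */
        for (m = q * q; m <= root; m += q)
            small[m] = 1;
        /* first multiple of q inside [lo, hi], never below q*q */
        m = ((lo + q - 1) / q) * q;
        if (m < q * q)
            m = q * q;
        for (; m <= hi; m += q)          /* addition instead of multiplication */
            seg[m - lo] = 1;
    }

    for (i = 0; i < size; i++)
        count += !seg[i];                /* unmarked entries are primes */

    free(small);
    free(seg);
    return count;
}
```

In the question's program, the whole for loop over j could become local_count = count_primes_in_range(start, finish), followed by the existing MPI_Reduce. The segment array uses one byte per number, roughly 1.5 MB per rank for 32 ranks and n = 50,000,000. Because sieving costs nearly the same per number at every rank, the original equal-sized split becomes reasonably balanced as well.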