Question

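I am testing the speed-ups of a parallel program in C using OpenMP. When I compile the code with gcc using the -O3 flag, the execution time is much smaller. However, I consistently get worse speed-ups for different thread counts (2, 4, 8, 16, 24) than with the code compiled without optimization flags. How is this possible?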

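Here is more information about what I have found so far. I am writing code to find prime numbers based on the Sieve of Eratosthenes, and I am trying to optimize it with a parallel version using OpenMP. Here is the code: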

#include <stdio.h>
#include <stdlib.h>
#include <omp.h> 
#include <math.h> 

// ind2num: returns the integer (3<=odd<=numMax)
//      represented by index i at prime_numbers (0<=i<=maxInd)
#define ind2num(i)  (2*(i)+3)
// num2ind: returns the index (0<=i<=maxInd) at prime_numbers
//      which represents the number (3<=odd<=numMax)
#define num2ind(i)  (((i)-3)/2)

// Sieve: find all prime numbers until ind2num(maxInd)
void Sieve(int *prime_numbers, long maxInd) {
    long maxSqrt;
    long baseInd;
    long base;
    long i;

    // square root of the largest integer (largest possible prime factor)
    maxSqrt = (long) sqrt((long) ind2num(maxInd));

    // first base
    baseInd=0;
    base=3;

    do {
        // marks as non-prime all multiples of base starting at base^2
        #pragma omp parallel for schedule (static)
        for (i=num2ind(base*base); i<=maxInd; i+=base) {
            prime_numbers[i]=0;
        }

        // updates base to next prime number
        for (baseInd=baseInd+1; baseInd<=maxInd; baseInd++)
            if (prime_numbers[baseInd]) {
                base = ind2num(baseInd);
                break;
            }
    }
    while (baseInd <= maxInd && base <= maxSqrt);

}

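For example, if I run it to find all primes smaller than 1000000000 (10^9), I end up with the following execution times for different numbers of threads (1, 2, 4, 8, 16, 24):

Threads | 1 | 2 | 4 | 8 | 16 | 24 |
Without -O3 | 56.31s | 28.87s | 21.77s | 11.19s | 6.13s | 4.50s |
With -O3 | 10.10s | 5.23s | 3.74s | 2.81s | 2.62s | 2.52s |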

Here are the corresponding speed-ups:

Threads | 1 | 2 | 4 | 8 | 16 | 24 |
Without -O3 | 1 | 1.95 | 2.59 | 5.03 | 9.19 | 12.51 |
With -O3 | 1 | 1.93 | 2.70 | 3.59 | 3.85 | 4.01 |

Why do I keep getting worse speed-ups with the -O3 flag?

Answer 1

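An execution of the algorithm requires a certain amount of memory bandwidth. The less optimized the code is, the more the run time is dominated by work inside the CPU. The more optimized the code is, the more the run time is dominated by memory speed.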

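Since the unoptimized code is less efficient, more cores can run it before the system memory bandwidth is saturated. Since the optimized code is more efficient, it gets its memory accesses done faster, which puts a heavier load on the system memory bandwidth. This makes it less parallelizable.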
