Question

我有一个非常简单的并行for循环，它只是将零写入整数数组。但事实证明线程越多，循环越慢。我认为这是由于一些缓存抖动所以我玩了调度，块大小，__restrict__，在并行块内嵌入并行，并刷新。然后我注意到读取数组进行缩减也比较慢。

这显然应该非常简单，并且应该几乎线性加速。我在这里缺少什么？

完整代码：

#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>

void tic(), toc();

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads()<< std::endl;

    std::cout << "serial reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    toc();

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();

    std::cout << "serial memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }
    toc();

    std::cout << "parallel memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    toc();

    return 0;
}

static clock_t ClockCounter;

void tic()
{
    ClockCounter = std::clock();
}

void toc()
{
    ClockCounter = std::clock() - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}

运行此产生：

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 1790000
parallel reduction
  elapsed clock ticks: 19690000
serial memset
  elapsed clock ticks: 3860000
parallel memset
  elapsed clock ticks: 20800000

如果我使用-O2运行，g ++能够优化串行减少，我得到零时间，因此-O1。此外，尽管仍然存在一些差异，但omp_set_num_threads(1);使得时间更加相似：

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 1
serial reduction
  elapsed clock ticks: 1770000
parallel reduction
  elapsed clock ticks: 7370000
serial memset
  elapsed clock ticks: 2290000
parallel memset
  elapsed clock ticks: 3550000

这应该是相当明显的，我觉得我没有看到一些非常基本的东西。我的CPU是具有超线程的英特尔（R）Xeon（R）CPU E5-2640 0 @ 2.50GHz，但在具有4个内核且没有超线程的同事的i5中观察到相同的行为。我们都在运行Linux。

修改

似乎有一个错误是在时间方面，运行时：

static double ClockCounter;

void tic()
{
    ClockCounter = omp_get_wtime();//std::clock();
}

void toc()
{
    ClockCounter = omp_get_wtime()/*std::clock()*/ - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}

产生更“合理”的时间：

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 1.80974
parallel reduction
  elapsed clock ticks: 2.07367
serial memset
  elapsed clock ticks: 2.37713
parallel memset
  elapsed clock ticks: 2.23609

但是，仍然没有加速，只是不慢。

EDIT2 ：

根据user8046的建议，代码严重受内存限制。并且正如Z boson所建议的那样，串行代码很容易被优化掉，并且不确定这里测量的是什么。所以我做了一个小的改变，将总和放在循环之外，这样它在c的每次迭代时都不会为零。我还用sum+=F(vec[i])和memset操作替换了vec[i] = F(i)的缩减操作。运行方式：

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 23.9106
parallel reduction
  elapsed clock ticks: 3.35519
serial memset
  elapsed clock ticks: 43.7344
parallel memset
  elapsed clock ticks: 6.50351

计算平方根为线程增加了更多工作，最终有一些合理的加速（大约7x，这是有意义的，因为超线程内核共享内存通道）。

Answer 1

你发现了时间错误。仍然没有加速，因为你的两个测试用例都是严重的内存限制。在典型的消费者硬件上，所有内核共享一个内存总线，因此使用更多线程不会为您带来更多带宽，因为这是瓶颈，加速。如果减小问题大小，这可能会改变，因此它适合缓存，或者确保增加每个数据的计算次数，例如，如果计算减少exp（vec [i]）或1 / vec [一世]。对于memset：你可以用一个线程使内存饱和，你永远不会看到加速。（仅当您可以访问具有更多线程的第二个内存总线时，与某些多插槽体系结构一样）。关于减少的一个评论，这很可能不是用锁实现的，这可能是非常低效的，但是使用一个没有那么糟糕的对数加速的加法树。

Answer 2

除了您在Linux中使用时钟功能时出错，您可以通过阅读这些问题/答案来回答您的其余问题。

in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987

measuring-memory-bandwidth-from-the-dot-product-of-two-arrays

memset-in-parallel-with-threads-bound-to-each-physical-core

因此，您应该看到使用memset的多个线程带来的显着优势，即使在单个套接字系统上也可以进行减少。我已经编写了自己的工具来测量带宽。您可以从我的i5-4250U（Haswell）中找到一些结果，其中2核（GCC 4.8，Linux 3.13，EGLIBC 2.19）超过1 GB。 vsum是你的减少。请注意，即使在这两个核心系统上也有显着的改进。

一个帖子

C standard library
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.80       6.68       0.00        inf %
memcpy:              1.00       1.35       7.93       0.00        inf %

Agner Fog's asmlib
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.71       7.53       0.00        inf %
memcpy:              1.00       0.93      11.51       0.00        inf %

my_memset   
                     0.50       0.71       7.53       0.00        inf %


FMA3 reduction tests
                       GB    time(s)       GB/s     GFLOPS   efficiency
vsum:                0.50       0.53      10.08       2.52        inf %
vmul:                0.50       0.68       7.93       1.98        inf %
vtriad:              0.50       0.70       7.71       3.85        inf %
dot                  1.00       1.08       9.93       2.48        inf %

两个帖子

C standard library
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.64       8.33       0.00        inf %
memcpy:              1.00       1.10       9.76       0.00        inf %

Agner Fog's asmlib
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.36      14.98       0.00        inf %
memcpy:              1.00       0.66      16.30       0.00        inf %

my_memset
                     0.50       0.36      15.03       0.00        inf %


FMA3 tests
standard sum tests with OpenMP: 2 threads
                       GB    time(s)       GB/s     GFLOPS   efficiency
vsum:                0.50       0.41      13.03       3.26        inf %
vmul:                0.50       0.39      13.67       3.42        inf %
vtriad:              0.50       0.44      12.20       6.10        inf %
dot                  1.00       0.97      11.11       2.78        inf %

这是我的自定义memset函数（我还有其他几个测试）。

void my_memset(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i=0; i<n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

编辑：

您应该使用-O3和-ffast-math进行编译。定义外环的外部总和，然后将其打印出来，这样GCC就不会对其进行优化。 GCC不会自动向量化减少因为浮点运算不是关联的，并且对循环进行向量化可能会破坏IEEE浮点规则。使用-ffast-math允许浮点arithemetic是关联的，这允许GCC对减少进行矢量化。应该指出的是，已经在OpenMP中进行了减少，假设浮点运算是关联的，因此它已经破坏了IEEE浮点规则。

double sum = 0;
tic();
for(int c = 0; c < COUNT; ++ c) { 
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += ptr[i];
}
toc();
printf("sum %f\n", sum);

编辑：

我测试了你的代码并进行了一些修改。使用多线程的减少和memset，我得到更快的时间

max threads: 4
serial reduction
dtime 1.86, sum 705032704
parallel reduction
dtime 1.39 s, sum 705032704
serial memset
dtime 2.95 s
parallel memset
dtime 2.44 s
serial my_memset
dtime 2.66 s
parallel my_memset
dtime 1.35 s

这是我使用的代码（g ++ foo.cpp -fopenmp -O3 -ffast-math）

#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>
#include <stdio.h>

#include <xmmintrin.h>

void my_memset(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    for(i=0; i<n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

void my_memset_omp(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i=0; i<n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads()<< std::endl;

    std::cout << "serial reduction" << std::endl;
    double dtime;
    int sum;

    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f, sum %d\n", dtime, sum);

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;


    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s, sum %d\n", dtime, sum);

    std::cout << "serial memset" << std::endl;

    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }   
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "serial my_memset" << std::endl;

    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);

    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel my_memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    return 0;
}

Answer 3

您正在使用std :: clock来报告使用的CPU时间，而不是实时。因此，每个处理器的时间相加，并且总是高于单线程时间（由于开销）。

http://en.cppreference.com/w/cpp/chrono/c/clock

OpenMP令人尴尬地并行循环，没有加速

3 个答案: