Question

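In the example below, the C++11 threads take about 50 seconds to execute, but the OMP threads take only 5 seconds. Any ideas why? (I can assure you it still holds true if you are doing real work instead of doNothing, or if you do things in a different order, etc.) I'm on a 16-core machine, too.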

#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>

using namespace std;

void doNothing() {}

int run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();

    for(int j=1; j<100000; ++j)
    {
        if(algorithmToRun == 1)
        {
            vector<thread> threads;
            for(int i=0; i<16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for(auto& thread : threads) thread.join();
        }
        else if(algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for(unsigned i=0; i<16; i++)
            {
                doNothing();
            }
        }
    }

    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;

    return elapsed_seconds.count();
}

int main()
{
    int cppt = run(1);
    int ompt = run(2);

    cout<<cppt<<endl;
    cout<<ompt<<endl;

    return 0;
}

Answer 1

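OpenMP thread-pools for its pragmas. Spinning up and tearing down threads is expensive. OpenMP avoids this overhead, so all it is doing is the actual work and the minimal shared-memory shuttling of the execution state. In your threads code, you are spinning up and tearing down a new set of 16 threads on every iteration.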

Answer 2

I tried the code from "Choosing the right threading framework" with a loop of 100 iterations, and it took 0.0727 milliseconds with OpenMP, 0.6759 with Intel TBB, and 0.5962 with the C++ thread library.
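Timings this small get lost if the result is truncated to whole seconds, so for reference here is a sketch of a helper that reports fractional milliseconds. elapsedMs is a hypothetical name, and this is not necessarily how the numbers above were measured.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds as a double (no integer truncation).
// steady_clock is used because it never jumps backwards.
template <typename F>
double elapsedMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It can wrap either variant, for example elapsedMs([] { run(1); }), assuming the run function from the question is in scope.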

I also applied what AruisDante suggested:

void nested_loop(int max_i, int band)  
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}

This code appears to take less time than the original C++11 threads section.
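The fragment above elides its surroundings, so here is a self-contained sketch of the same idea extended to 16 threads: each thread is launched once and runs the whole loop itself. runHoisted is a hypothetical name, and the per-thread counters exist only so the result can be checked. Note that this drops the join after every iteration that the question's code had, so it only fits when the iterations do not need to synchronize with each other.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

void doNothing() {}

// Hypothetical sketch: launch numThreads threads once, let each one run
// the full loop, and return the total number of doNothing() calls made.
long runHoisted(int numThreads, int iterations) {
    std::vector<long> perThread(static_cast<std::size_t>(numThreads), 0L);
    std::vector<std::thread> threads;
    threads.reserve(perThread.size());

    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back([&perThread, i, iterations] {
            long local = 0;  // thread-local count, no sharing inside the loop
            for (int j = 0; j < iterations; ++j) {
                doNothing();
                ++local;
            }
            perThread[static_cast<std::size_t>(i)] = local;  // one write per thread
        });
    }
    for (auto& t : threads) t.join();  // threads are joined exactly once

    return std::accumulate(perThread.begin(), perThread.end(), 0L);
}
```

Calling runHoisted(16, 100000) creates 16 threads in total, compared with 16 new threads per iteration in the original C++11 branch.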

OpenMP vs C++11 threads
