Question

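Disclaimer: I am very, very inexperienced with thread-related things in C++. I have probably missed something obvious.

OK, so here is my problem: I am doing some Monte-Carlo integration in C++. The function that performs the integration will be called a few hundred times, so I decided to parallelize it for a potential speed-up. Since the integration itself consists of one big loop, I decided to split it into equal parts and let each of these sub-loops run on a different thread. See this minimal working example: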

#include <iostream>
#include <vector>
#include <array>
#include <cmath>
#include <chrono>
#include <random>
#include <thread>
#include <functional>

#define INTEGRATION_POINTS_3D 4e7

#define NUM_THREADS 4

double Psi(std::array<double,3> x, double sigma)
{
    return std::exp(-(x[0]*x[0]+x[1]*x[1]+x[2]*x[2])/(sigma*sigma) )
        + std::exp(-(x[0]*x[0]+x[1]*x[1]+x[2]*x[2])/(2*sigma*sigma) )
        + std::exp(-(x[0]*x[0]+x[1]*x[1]+x[2]*x[2])/(3*sigma*sigma) );
}

double NormPsiMC(double sigma, std::default_random_engine rng)
{
    double xmin(-4*sigma), xmax(4*sigma);
    static std::uniform_real_distribution<double> space_dist(xmin, xmax);

    static std::vector<std::thread> threads(NUM_THREADS);
    std::vector<double> res(NUM_THREADS, 0);
    for(int tid = 0; tid < NUM_THREADS; ++tid)
    {
        threads[tid] = std::thread(std::bind([&](int id){
            std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

            for(int i = 0; i < INTEGRATION_POINTS_3D/NUM_THREADS; ++i)
            {
                res[id] += std::abs(Psi(std::array<double,3>{{space_dist(rng),space_dist(rng),space_dist(rng)}}, sigma));
            }
            std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
            std::cout << "In thread " << id << "\nI've spent "
                << std::chrono::duration_cast<std::chrono::seconds>(end-start).count() << " s in the loop." << std::endl;
        }, tid) );
    }

    double result(0);
    for(int i = 0; i < NUM_THREADS; ++i)
    {
        threads[i].join();
        result += res[i];
    }
    return std::pow(xmax-xmin,3)*result/INTEGRATION_POINTS_3D;
}

int main()
{
    double sigma(10);
    std::default_random_engine rng;

    for(int i = 0; i < 5; ++i)
    {
        std::cout << NormPsiMC(sigma, rng) << std::endl << std::endl;
    }

    return 0;
}

Now, for this I get the output

In thread 1                                                              
I've spent 2 s in the loop.                                              
In thread 2                                                              
I've spent 2 s in the loop.                                              
In thread 3                                                              
I've spent 2 s in the loop.                                              
In thread 0                                                              
I've spent 2 s in the loop.                                              
50289.1

In thread 2                                                              
I've spent 0 s in the loop.                                              
In thread 1                                                              
I've spent 0 s in the loop.                                              
In thread 3                                                              
I've spent 0 s in the loop.                                              
In thread 0                                                              
I've spent 2 s in the loop.                                              
50289.1

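...and every subsequent run takes 0 s. Well, this is strange, because the first of the 5 calls takes longer even though they are all doing the same thing. Aren't the threads "warmed up"?

I know it sounds silly, but I tried to test this anyway. I added a function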

void idle_func(void)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

and added the following to int main():

std::vector<std::thread> threads(NUM_THREADS);
for(int i = 0; i < NUM_THREADS; ++i)
{
    threads[i] = std::thread(idle_func);
    threads[i].join();
}

Now the output becomes

In thread 1                                                              
I've spent 0 s in the loop.                                              
In thread 2                                                              
I've spent 0 s in the loop.                                              
In thread 3                                                              
I've spent 2 s in the loop.                                              
In thread 0                                                              
I've spent 2 s in the loop.                                              
50289.1

In thread 2                                                              
I've spent 0 s in the loop.                                              
In thread 1                                                              
I've spent 0 s in the loop.                                              
In thread 3                                                              
I've spent 0 s in the loop.                                              
In thread 0                                                              
I've spent 0 s in the loop.                                              
50289.1

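Well, there are definitely fewer instances where the code takes longer. It seems that I am, at least to some extent, on the right track. However, adding more idle_func thread calls does not reduce the effect any further, and honestly, this "fix" is awful. On top of that, I really don't know what is going on.

By the way: I am using -Ofast, but this shows up at all optimization levels. Only without optimization is the effect not visible. However, the overhead seems to depend strongly on the number of integration points; for 8e7, the time of the slow threads increases to 5 s and that of the fast ones to 1 s.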
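A note on the measurement itself: duration_cast to std::chrono::seconds truncates, so "0 s" only means "less than one second" and "2 s" means anything from 2 s up to just under 3 s. A minimal sketch of a finer-grained timer is given here; the helper name elapsed_ms is hypothetical and not part of the code above.

```cpp
#include <chrono>

// Hypothetical helper: runs a callable once and returns the elapsed
// wall-clock time in milliseconds as a floating-point value.
template <class F>
double elapsed_ms(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    // duration<double, std::milli> keeps the fractional part,
    // unlike duration_cast<std::chrono::seconds>.
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Wrapping the loop body of each thread in a lambda and passing it to such a helper would show whether the "fast" threads really differ from the "slow" ones by a factor of several, or only straddle a whole-second boundary.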

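Edit: I have now tested this on two other machines and could not reproduce the symptoms above. So I assume this has something to do with my work machine.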
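One thing that may be worth ruling out: in the example, all threads use the same rng and the same static space_dist without synchronization, and they all accumulate into neighbouring elements of res. Whether that has anything to do with the timing differences is not known. The following is only a sketch of a variant in which each thread owns its engine, distribution and running sum. The names NormPsiMCPerThread, points, num_threads and seed are hypothetical, std::mt19937 is swapped in for std::default_random_engine, and seeding each thread with seed + tid is a simplifying assumption.

```cpp
#include <array>
#include <cmath>
#include <random>
#include <thread>
#include <vector>

// Same integrand as in the question.
double Psi(const std::array<double, 3>& x, double sigma)
{
    const double r2 = x[0]*x[0] + x[1]*x[1] + x[2]*x[2];
    const double s2 = sigma*sigma;
    return std::exp(-r2/s2) + std::exp(-r2/(2*s2)) + std::exp(-r2/(3*s2));
}

// Hypothetical variant of NormPsiMC: no state is shared inside the loop.
double NormPsiMCPerThread(double sigma, long points, unsigned num_threads, unsigned seed)
{
    const double xmin = -4*sigma, xmax = 4*sigma;
    const long per_thread = points / num_threads;
    std::vector<double> res(num_threads, 0.0);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    for (unsigned tid = 0; tid < num_threads; ++tid)
    {
        threads.emplace_back([&res, tid, per_thread, xmin, xmax, sigma, seed] {
            std::mt19937 rng(seed + tid);  // assumption: simple per-thread seeding
            std::uniform_real_distribution<double> dist(xmin, xmax);
            double local = 0.0;            // thread-local running sum
            for (long i = 0; i < per_thread; ++i)
            {
                const double x = dist(rng);
                const double y = dist(rng);
                const double z = dist(rng);
                local += std::abs(Psi(std::array<double, 3>{{x, y, z}}, sigma));
            }
            res[tid] = local;              // one write per thread at the end
        });
    }

    double result = 0.0;
    for (unsigned i = 0; i < num_threads; ++i)
    {
        threads[i].join();
        result += res[i];
    }
    // Normalize by the number of points actually sampled.
    const double sampled = static_cast<double>(per_thread) * num_threads;
    return std::pow(xmax - xmin, 3) * result / sampled;
}
```

A call such as NormPsiMCPerThread(10, 40000000, 4, 1) corresponds to the settings used in the example. If the timing pattern stays the same with this variant, shared state in the loop is presumably not the cause.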

C++: std::thread runtime depends on previous calls

0 answers: