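Disclaimer: I'm not very experienced with thread-related things in C++, so I may have missed something obvious.
OK, so here is my problem: I'm doing some Monte Carlo integration in C++. The function that performs the integration will be called a few hundred times, so I decided to parallelize it for a potential speedup. Since the integration itself consists of one big loop, I split it into equal parts and let each of these sub-loops run on a different thread. See this minimal working example: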
#include <array>       // std::array
#include <chrono>
#include <cmath>
#include <functional>  // std::bind
#include <iostream>
#include <random>
#include <thread>
#include <vector>
#define INTEGRATION_POINTS_3D 4e7
#define NUM_THREADS 4
double Psi(std::array<double,3> x, double sigma)
{
    return std::exp(-(x[0]*x[0]+x[1]*x[1]+x[2]*x[2])/(sigma*sigma) )
         + std::exp(-(x[0]*x[0]+x[1]*x[1]+x[2]*x[2])/(2*sigma*sigma) )
         + std::exp(-(x[0]*x[0]+x[1]*x[1]+x[2]*x[2])/(3*sigma*sigma) );
}
double NormPsiMC(double sigma, std::default_random_engine rng)
{
    double xmin(-4*sigma), xmax(4*sigma);
    static std::uniform_real_distribution<double> space_dist(xmin, xmax);
    static std::vector<std::thread> threads(NUM_THREADS);
    std::vector<double> res(NUM_THREADS, 0);
    for(int tid = 0; tid < NUM_THREADS; ++tid)
    {
        // Each thread handles an equal share of the points;
        // rng, space_dist and res are captured by reference.
        threads[tid] = std::thread(std::bind([&](int id){
            std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
            for(int i = 0; i < INTEGRATION_POINTS_3D/NUM_THREADS; ++i)
            {
                res[id] += std::abs(Psi(std::array<double,3>{{space_dist(rng),space_dist(rng),space_dist(rng)}}, sigma));
            }
            std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
            std::cout << "In thread " << id << "\nI've spent "
                      << std::chrono::duration_cast<std::chrono::seconds>(end-start).count() << " s in the loop." << std::endl;
        }, tid) );
    }
    // Wait for all threads and sum up the partial results.
    double result(0);
    for(int i = 0; i < NUM_THREADS; ++i)
    {
        threads[i].join();
        result += res[i];
    }
    return std::pow(xmax-xmin,3)*result/INTEGRATION_POINTS_3D;
}
int main()
{
    double sigma(10);
    std::default_random_engine rng;
    for(int i = 0; i < 5; ++i)
    {
        std::cout << NormPsiMC(sigma, rng) << std::endl << std::endl;
    }
    return 0;
}
Now, for this I get the output
In thread 1
I've spent 2 s in the loop.
In thread 2
I've spent 2 s in the loop.
In thread 3
I've spent 2 s in the loop.
In thread 0
I've spent 2 s in the loop.
50289.1
In thread 2
I've spent 0 s in the loop.
In thread 1
I've spent 0 s in the loop.
In thread 3
I've spent 0 s in the loop.
In thread 0
I've spent 2 s in the loop.
50289.1
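...and 0 s for every subsequent run. Well, this is strange: the first 5 thread invocations take longer, even though they are all doing the same thing. Do the threads somehow need to "warm up"?
I know this sounds silly, but I tried to test it nevertheless. I added a function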
void idle_func(void)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
and called it in int main():
std::vector<std::thread> threads(NUM_THREADS);
for(int i = 0; i < NUM_THREADS; ++i)
{
    threads[i] = std::thread(idle_func);
    threads[i].join();
}
Now the output becomes
In thread 1
I've spent 0 s in the loop.
In thread 2
I've spent 0 s in the loop.
In thread 3
I've spent 2 s in the loop.
In thread 0
I've spent 2 s in the loop.
50289.1
In thread 2
I've spent 0 s in the loop.
In thread 1
I've spent 0 s in the loop.
In thread 3
I've spent 0 s in the loop.
In thread 0
I've spent 0 s in the loop.
50289.1
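Well, there are definitely fewer instances where the code takes longer, so it seems that I am, at least to some extent, on the right track. However, adding more idle_func thread calls does not reduce the effect any further, and honestly, this "fix" is awful. On the other hand, I really don't know what is going on.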
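By the way: I'm using -Ofast, but this shows up at all optimization levels. Only without optimization is the effect not visible. The overhead does seem to depend strongly on the number of integration points, though; for 8e7, the time of the slow threads increases to 5 s and that of the fast ones to 1 s.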
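Edit: I have now tested this on two other machines and could not reproduce the symptoms described above. Therefore, I think this might be something specific to my work machine.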