I am trying to write a parallel vector fill using the following code:
    #include <iostream>
    #include <thread>
    #include <vector>
    #include <chrono>
    #include <algorithm>

    using namespace std;
    using namespace std::chrono;

    void fill_part(vector<double> & v, int ii, int num_threads)
    {
        fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0);
    }

    int main()
    {
        vector<double> v(200*1000*1000);

        high_resolution_clock::time_point t = high_resolution_clock::now();
        fill(v.begin(), v.end(), 0);
        duration<double> d = high_resolution_clock::now() - t;
        cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
             << " ms in serial.\n";

        unsigned num_threads = thread::hardware_concurrency() ? thread::hardware_concurrency() : 1;
        cout << "Num threads: " << num_threads << '\n';

        vector<thread> threads;
        t = high_resolution_clock::now();
        for(int ii = 0; ii< num_threads; ++ii)
        {
            threads.emplace_back(fill_part, std::ref(v), ii, num_threads);
        }
        for(auto & t : threads)
        {
            if(t.joinable()) t.join();
        }
        d = high_resolution_clock::now() - t;
        cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
             << " ms in parallel.\n";
    }
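I tried this code on four different machines (all with Intel CPUs, though that should not matter).
The first has 4 CPUs and parallelization gave no speedup. The second has 4 and ran 4 times faster, the third has 4 and ran twice as fast, and the last has 2 and gave no speedup.
My hypothesis is that the differences arise because a single CPU can already saturate the RAM bus on some machines, but is that correct? How can I predict which architectures will benefit from this parallelization?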
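One way to probe the bandwidth hypothesis would be to time the fill at several thread counts and compare the resulting throughput. Below is a minimal sketch of that idea; the helpers `fill_slice`, `timed_fill`, and `gigabytes_per_second` are hypothetical names introduced here for illustration, not part of the code above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: zero the i-th of num_threads contiguous slices of v.
void fill_slice(std::vector<double>& v, unsigned i, unsigned num_threads)
{
    const std::size_t n = v.size();
    std::fill(v.data() + i * n / num_threads,
              v.data() + (i + 1) * n / num_threads, 0.0);
}

// Hypothetical helper: fill v using num_threads threads and return the
// elapsed wall-clock time in seconds (thread start-up cost is included).
double timed_fill(std::vector<double>& v, unsigned num_threads)
{
    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> threads;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < num_threads; ++i)
        threads.emplace_back(fill_slice, std::ref(v), i, num_threads);
    for (auto& th : threads) th.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical helper: bytes written per second, in GB/s, for n doubles.
double gigabytes_per_second(std::size_t n, double seconds)
{
    return static_cast<double>(n) * sizeof(double) / seconds / 1e9;
}
```

Calling `timed_fill(v, k)` for k = 1, 2, 4, ... and passing each result to `gigabytes_per_second(v.size(), seconds)` gives one throughput figure per thread count. If the figure barely moves as k grows, that would be consistent with a single core already saturating memory bandwidth on that machine; if it grows roughly with k, it would not.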
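Bonus question: the void fill_part function is clumsy, so I wanted to do this with a lambda instead:
    threads.emplace_back([&]{fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0); });
This compiles but terminates with a bus error; what is wrong with the lambda syntax?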
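A likely cause, assuming the lambda is created inside the same `for` loop over `ii`: `[&]` captures the loop counter `ii` by reference, so each thread reads `ii` while the main thread is still incrementing it (a data race), possibly after the loop has already ended. A thread that sees `ii == num_threads` would compute an end iterator past `v.end()`. The sketch below captures `ii` by value instead; `parallel_fill` is a hypothetical wrapper used only to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical wrapper: fill v with zeros using num_threads threads.
void parallel_fill(std::vector<double>& v, unsigned num_threads)
{
    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> threads;
    for (unsigned ii = 0; ii < num_threads; ++ii) {
        // v and num_threads outlive the joined threads, so capturing them
        // by reference is safe. ii changes on every iteration, so it is
        // captured by value: each thread gets its own copy of the index.
        threads.emplace_back([&v, &num_threads, ii] {
            const std::size_t n = v.size();
            std::fill(v.data() + ii * n / num_threads,
                      v.data() + (ii + 1) * n / num_threads, 0.0);
        });
    }
    for (auto& th : threads) th.join();
}
```

The syntax of the original lambda is valid, which is why it compiles; the problem would lie in the lifetime and sharing of the captured `ii`, not in how the lambda is written.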