Question

I am trying to write a parallel vector fill using the following code:

#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <algorithm>

using namespace std; 
using namespace std::chrono;

void fill_part(vector<double> & v, int ii, int num_threads)
{
  fill(v.begin() + ii*v.size()/num_threads, v.begin() +   (ii+1)*v.size()/num_threads, 0);
}

int main()
{
  vector<double> v(200*1000*1000);

  high_resolution_clock::time_point t = high_resolution_clock::now();
  fill(v.begin(), v.end(), 0);
  duration<double> d = high_resolution_clock::now() - t;
  cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
       << " ms in serial.\n";

  unsigned num_threads = thread::hardware_concurrency() ? thread::hardware_concurrency() :  1;

  cout << "Num threads: " << num_threads << '\n';
  vector<thread> threads;
  t = high_resolution_clock::now();
  for(int ii = 0; ii< num_threads; ++ii)
  {
    threads.emplace_back(fill_part, std::ref(v), ii, num_threads);
  }
  for(auto & t : threads)
  {
    if(t.joinable()) t.join();
  }
  d = high_resolution_clock::now() - t;
  cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
       << " ms in parallel.\n";
}

I tried this code on four different machines (all with Intel CPUs, though that should not matter).

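The first machine I tried has 4 CPUs, and parallelization gave no speedup. The second has 4 CPUs and ran 4 times faster, the third has 4 CPUs and ran 2 times faster, and the last has 2 CPUs and showed no speedup.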

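My hypothesis is that the differences arise because the RAM bus can be saturated by a single CPU on some machines, but is that correct? How can I predict which architectures will benefit from this parallelization?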
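One way to probe the saturation hypothesis is to time the same fill with 1, 2, 3, ... threads and see where the effective write bandwidth stops growing. The following is only a minimal sketch under that assumption; `timed_parallel_fill` and `write_bandwidth_gbps` are hypothetical helper names that do not appear in the code above, and `steady_clock` is swapped in for `high_resolution_clock`.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: zero-fills v using num_threads threads and returns
// the elapsed wall-clock time in seconds.
double timed_parallel_fill(std::vector<double>& v, unsigned num_threads)
{
    if (num_threads == 0) num_threads = 1;  // guard against division by zero
    const std::size_t n = v.size();
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < num_threads; ++i) {
        // Each thread gets a contiguous, non-overlapping slice [lo, hi).
        const std::size_t lo = n * i / num_threads;
        const std::size_t hi = n * (i + 1) / num_threads;
        threads.emplace_back([&v, lo, hi] {
            std::fill(v.begin() + lo, v.begin() + hi, 0.0);
        });
    }
    for (auto& th : threads) th.join();

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical helper: effective write bandwidth in GB/s for a given
// number of bytes written in a given number of seconds.
double write_bandwidth_gbps(std::size_t bytes, double seconds)
{
    return seconds > 0.0 ? static_cast<double>(bytes) / seconds / 1e9 : 0.0;
}
```

Calling `timed_parallel_fill(v, k)` for each `k` from 1 up to `thread::hardware_concurrency()` and passing `v.size() * sizeof(double)` with the returned time to `write_bandwidth_gbps` would show whether throughput plateaus after one thread (consistent with the hypothesis) or keeps scaling.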

Bonus question: the void fill_part function is clumsy, so I wanted to do this with a lambda instead:

 threads.emplace_back([&]{fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0); });

This compiles but terminates with a bus error; what is wrong with the lambda syntax?
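A likely explanation, offered as an assumption rather than a confirmed diagnosis: `[&]` captures the loop counter `ii` by reference, so a thread may read `ii` after the loop has already incremented it (or after it has gone out of scope), which can yield iterators past the end of the vector. A minimal sketch that captures `ii` by value instead; `parallel_fill_zero` is a hypothetical wrapper name used only for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical wrapper around the thread-launching loop from the question.
void parallel_fill_zero(std::vector<double>& v, unsigned num_threads)
{
    if (num_threads == 0) num_threads = 1;  // guard against division by zero
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    for (unsigned ii = 0; ii < num_threads; ++ii) {
        // v is captured by reference (it outlives the threads, which are
        // joined below); ii and num_threads are copied, so each thread keeps
        // the index it was started with.
        threads.emplace_back([&v, ii, num_threads] {
            std::fill(v.begin() + ii * v.size() / num_threads,
                      v.begin() + (ii + 1) * v.size() / num_threads,
                      0.0);
        });
    }
    for (auto& th : threads) th.join();
}
```

Listing the captures explicitly, rather than using a blanket `[&]` or `[=]`, makes it visible which variables each thread shares and which it owns.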

Parallel std::fill has different performance on different architectures; why?

0 answers: