I have a C++ program that essentially performs some matrix calculations. For these I use LAPACK/BLAS and usually link against MKL or ACML, depending on the platform. Many of these matrix operations work on different, independent matrices, so I use std::thread to run them in parallel. However, I noticed that I get no speedup when using more threads. I traced the problem down to the daxpy BLAS routine: it seems that if two threads use this routine in parallel, each thread takes twice as long, even though the two threads operate on different arrays.
The next thing I tried was writing a new, simple method to perform vector addition as a replacement for the daxpy routine. With one thread this new method is as fast as the BLAS routine, but when compiled with gcc it suffers from the same problem as the BLAS routine: doubling the number of threads running in parallel also doubles the time each thread needs, so there is no speedup. With the Intel C++ compiler, however, the problem disappears: the time a single thread needs stays constant as the number of threads increases.
However, I also need to compile on systems that do not have the Intel compiler. So my questions are: why is there no speedup with gcc, and is there any possibility of improving gcc's performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"
void simplesum(double* a, double* b, std::size_t dim);
int main() {
    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };
        std::vector<std::size_t> times(num_threads, 0);
        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];
            // initialize the full arrays (dim elements)
            for (std::size_t i {0}; i < dim; ++i){
                pA[i] = i;
                pB[i] = 2*i;
            }
            boost::posix_time::ptime now1 =
                boost::posix_time::microsec_clock::universal_time();
            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }
            boost::posix_time::ptime now2 =
                boost::posix_time::microsec_clock::universal_time();
            boost::posix_time::time_duration dur = now2 - now1;
            times[tid] += dur.total_milliseconds();
            delete[] pA;
            delete[] pB;
        };
        std::vector<std::thread> mythreads;
        // start threads
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }
        // wait for threads to finish
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << "Thread " << n+1 << " of " << num_threads
                      << " took " << times[n] << "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){
    // post-increment, so element 0 is included and no out-of-bounds access occurs
    for(std::size_t i{0}; i < dim; ++i)
    {
        *(a++) += *(b++);
    }
}
Output with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
Output with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So with icc, the time a single thread needs for the computation is constant (as I expected; my CPU has 4 physical cores), whereas with gcc the time per thread increases. Replacing the simplesum routine with BLAS::daxpy yields the same results for icc and gcc (hardly surprising, since most of the time is spent inside the library), and those results are almost identical to the gcc results quoted above.
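For reference, the vector update that simplesum replaces corresponds to the following CBLAS call (a sketch; the exact header and linking depend on whether MKL or ACML provides the BLAS implementation):

#include <cblas.h>   // MKL and ACML ship their own CBLAS headers; cblas.h is the generic name

// a[i] += b[i] for i in [0, dim), expressed as daxpy: a = 1.0 * b + a
void simplesum_blas(double* a, const double* b, int dim) {
    cblas_daxpy(dim, 1.0, b, 1, a, 1);
}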
Answer 0 (score: 18)
The answer is simple: your threads are fighting for memory bandwidth!
Consider that you perform one floating point addition per two stores (one initialization, one after the addition) and two reads (in the addition). Most modern systems offering multiple CPUs actually have to share the memory controller among several cores.
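As a rough back-of-the-envelope sketch (numbers added here for illustration, not part of the original answer: N = 936, 1000 repetitions, 3 × 8 bytes of traffic per element for two loads and one store, and the ~0.53 s single-thread time reported in the question), a single thread already demands bandwidth on the order of tens of GB/s:

#include <cstdio>

int main() {
    const double N = 936.0, repetitions = 1000.0;
    const double bytes_per_element = 3 * 8;   // load a[i], load b[i], store a[i]
    const double traffic = N * N * repetitions * bytes_per_element;   // bytes moved per thread
    const double t_single = 0.532;            // measured single-thread time in seconds

    std::printf("traffic per thread : %.1f GB\n", traffic / 1e9);              // ~21 GB
    std::printf("implied bandwidth  : %.1f GB/s\n", traffic / 1e9 / t_single); // ~40 GB/s
}

The estimate ignores whatever reuse the caches provide, so it is only an order-of-magnitude figure, but it makes clear that a handful of such threads saturate a shared memory controller long before they saturate the cores.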
The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits your problem perfectly:
Thread 1 of 1 took 657msec
Thread 1 of 2 took 1447msec
Thread 2 of 2 took 1463msec
[...]
Thread 1 of 8 took 5516msec
Thread 2 of 8 took 5587msec
Thread 3 of 8 took 5205msec
Thread 4 of 8 took 5311msec
Thread 5 of 8 took 2731msec
Thread 6 of 8 took 5545msec
Thread 7 of 8 took 5551msec
Thread 8 of 8 took 4903msec
However, simply by increasing the arithmetic density, we can see a remarkable increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(a++) += std::exp(*(b++));. The result shows almost perfect scaling:
Thread 1 of 1 took 7671msec
Thread 1 of 2 took 7759msec
Thread 2 of 2 took 7759msec
[...]
Thread 1 of 8 took 9997msec
Thread 2 of 8 took 8135msec
Thread 3 of 8 took 10625msec
Thread 4 of 8 took 8169msec
Thread 5 of 8 took 10054msec
Thread 6 of 8 took 8242msec
Thread 7 of 8 took 9876msec
Thread 8 of 8 took 8819msec
First of all, ICC inlines simplesum. Proving that inlining happens is simple: using icc, I disabled multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1 took 687msec
Thread 1 of 2 took 688msec
Thread 2 of 2 took 689msec
[...]
Thread 1 of 8 took 690msec
Thread 2 of 8 took 697msec
Thread 3 of 8 took 700msec
Thread 4 of 8 took 874msec
Thread 5 of 8 took 878msec
Thread 6 of 8 took 874msec
Thread 7 of 8 took 742msec
Thread 8 of 8 took 868msec
to
Thread 1 of 1 took 1278msec
Thread 1 of 2 took 2457msec
Thread 2 of 2 took 2445msec
[...]
Thread 1 of 8 took 8868msec
Thread 2 of 8 took 8434msec
Thread 3 of 8 took 7964msec
Thread 4 of 8 took 7951msec
Thread 5 of 8 took 8872msec
Thread 6 of 8 took 8286msec
Thread 7 of 8 took 5714msec
Thread 8 of 8 took 8241msec
This already explains why the library performs badly: ICC cannot inline it, so whatever other mechanism makes ICC outperform g++ here simply does not get a chance to happen.
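For reference, the no-inlining experiment described above can be reproduced with a split roughly like this (the file layout is illustrative, not the answer's exact setup; with icc, multi-file interprocedural optimization additionally has to be disabled, e.g. via its -no-ipo switch, if I recall the flag correctly):

// simplesum.cpp -- a separate translation unit, so the benchmark loop in the
// main file only sees the declaration below and the call cannot be inlined
#include <cstddef>

void simplesum(double* a, double* b, std::size_t dim) {
    for (std::size_t i {0}; i < dim; ++i)
        *(a++) += *(b++);
}

// main.cpp keeps only:  void simplesum(double* a, double* b, std::size_t dim);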
It also gives a hint as to what ICC may be doing here: instead of executing simplesum 1000 times, it interchanges the loops, which increases the arithmetic density without adding any exponentials to the function. How can this be proven? Well, let us simply implement this optimization ourselves and see what happens! For the analysis we will look at the g++ performance. Recall our benchmark results:
Thread 1 of 1 took 640msec
Thread 1 of 2 took 1308msec
Thread 2 of 2 took 1304msec
[...]
Thread 1 of 8 took 5294msec
Thread 2 of 8 took 5370msec
Thread 3 of 8 took 5451msec
Thread 4 of 8 took 5527msec
Thread 5 of 8 took 5174msec
Thread 6 of 8 took 5464msec
Thread 7 of 8 took 4640msec
Thread 8 of 8 took 4055msec
Now, let us interchange this:
for (std::size_t n{0}; n < 1000; ++n){
    simplesum(pA, pB, dim);
}
with a version in which the inner loop is made the outer loop:
double* a = pA; double* b = pB;
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
    double x = *a, y = *b;
    for (std::size_t n{0}; n < 1000; ++n)
    {
        x += y;
    }
    *a = x;
}
The results show that we are on the right track:
Thread 1 of 1 took 693msec
Thread 1 of 2 took 703msec
Thread 2 of 2 took 700msec
[...]
Thread 1 of 8 took 920msec
Thread 2 of 8 took 804msec
Thread 3 of 8 took 750msec
Thread 4 of 8 took 943msec
Thread 5 of 8 took 909msec
Thread 6 of 8 took 744msec
Thread 7 of 8 took 759msec
Thread 8 of 8 took 904msec
This proves that the loop interchange is indeed the main source of the superior performance ICC shows here.
Note that none of the tested compilers (MSVC, ICC, g++ and clang) replaced the loop with a single multiplication, which would improve performance by a factor of 200 in the single-threaded case and by a factor of 15 in the 8-thread case. This is because the numerical instability of repeated addition can produce wildly different results than a single multiplication would. When testing with integer data types instead of floating-point data types, this optimization does happen.
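A minimal illustration of why the compilers refuse this transformation for doubles (my own example, not from the answer): adding 1.0 to 1e16 a thousand times is not the same as adding 1000.0 once, because each individual addition rounds back down to 1e16.

#include <cstdio>

int main() {
    double x = 1e16, y = 1.0;

    double repeated = x;
    for (int n = 0; n < 1000; ++n)
        repeated += y;            // each addition is rounded; the +1.0 is lost every time

    double single = x + 1000.0 * y;   // what the "replace the loop with a multiply" rewrite computes

    std::printf("repeated additions: %.1f\n", repeated);  // 10000000000000000.0
    std::printf("single multiply   : %.1f\n", single);    // 10000000000001000.0
}

With integers both variants are exact and agree, which is why the optimization is legal there.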
Interestingly, the real killer for g++ is not an inability to perform the loop interchange. When invoked with -floop-interchange, g++ can perform this optimization as well, but only when the odds are stacked significantly in its favor:
1. All bounds must be expressed as ints instead of std::size_t. Not long, not unsigned int, but int. I still find it hard to believe, but it seems to be a hard requirement.
2. Instead of incrementing pointers, index them: a[i] += b[i];
3. g++ needs to be told -floop-interchange explicitly. A simple -O3 is not enough.
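A minimal sketch of a variant satisfying all three criteria (my reconstruction under those constraints, not the exact code used for the measurements below):

// build with: g++ -std=c++11 -O3 -floop-interchange -pthread ...
void simplesum(double* a, double* b, int dim) {   // 1. bounds are ints, not std::size_t
    for (int i = 0; i < dim; ++i)
        a[i] += b[i];                             // 2. indexed access instead of pointer increments
}

// the benchmark loop calling it also uses int as the counter type:
// for (int n = 0; n < 1000; ++n) simplesum(pA, pB, static_cast<int>(dim));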
When all three criteria are met, the g++ performance is similar to what ICC delivered:
Thread 1 of 1 took 714msec
Thread 1 of 2 took 724msec
Thread 2 of 2 took 721msec
[...]
Thread 1 of 8 took 782msec
Thread 2 of 8 took 1221msec
Thread 3 of 8 took 1225msec
Thread 4 of 8 took 781msec
Thread 5 of 8 took 788msec
Thread 6 of 8 took 1262msec
Thread 7 of 8 took 1226msec
Thread 8 of 8 took 820msec
Note: the g++ version used in this experiment is 4.9.0 on x64 Arch Linux.
Answer 1 (score: 0)
Well, I came to the conclusion that the main problem is that the processor cores act on different parts of the memory in parallel, so one has to deal with lots of cache misses, which slows the process down further. Putting the actual sum function into a critical section
summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();
solves the problem of the cache misses, but of course it does not yield an optimal speedup. Anyway, because the other threads are blocked now, the simplesum method might as well use all available threads for the sum:
#include <omp.h>   // compile with -fopenmp

void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads){
    omp_set_num_threads(static_cast<int>(numberofthreads));
    #pragma omp parallel
    {
        #pragma omp for
        for(std::size_t i = 0; i < dim; ++i)
        {
            a[i] += b[i];
        }
    }
}
In this case all the threads work on the same block of memory: it should be in the processor cache, and if the processor needs to load some other part of the memory into its cache, the other threads benefit from that as well (depending on whether this is the L1 or L2 cache, but I reckon the details do not really matter for the sake of this discussion).
I do not claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on loop-interchange tricks, which I cannot use in my actual code.
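Putting the two snippets together, a call site in the thread function could look roughly like this (a sketch based on the code above; summutex is assumed to be a std::mutex shared by all worker threads, and std::lock_guard stands in for the explicit lock()/unlock() pair):

#include <cstddef>
#include <mutex>

void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads);

std::mutex summutex;   // shared by all worker threads

// replaces the plain simplesum(pA, pB, dim) call inside threadfunction
void locked_sum(double* pA, double* pB, std::size_t dim, std::size_t num_threads) {
    std::lock_guard<std::mutex> lock(summutex);   // only one matrix sum runs at a time ...
    simplesum(pA, pB, dim, num_threads);          // ... but that sum uses all cores via OpenMP
}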