I want to compute general matrix-matrix products of the form C += alpha * A * B:
template<size_t sizeA2>
void matrixMatrixProduct( double* __restrict__ pC,
                          double const* __restrict__ pA,
                          double const* __restrict__ pB,
                          double alpha,
                          size_t sizeA1,
                          size_t sizeB2 )
{
    size_t outerLoopLimit = sizeA1;
    size_t innerLoopLimit = sizeB2;
    size_t sizeC2 = sizeB2;

    for ( size_t i = 0; i < outerLoopLimit; ++i )
    {
        #pragma vector aligned
        #pragma ivdep
        for ( size_t j = 0; j < innerLoopLimit; ++j )
        {
            #pragma vector aligned
            #pragma ivdep
            for ( size_t k = 0; k < sizeA2; ++k )
            {
                pC[i * sizeC2 + j] += alpha * pA[i * sizeA2 + k] * pB[k * sizeB2 + j];
            } // end of k-loop
        } // end of j-loop
    } // end of i-loop
}
The special case in my context is that I know the number of columns of A (equivalently, the number of rows of B) at compile time. I can therefore hard-code this information, which naturally brings a significant performance gain. So far I had done this in Fortran, but there it forced me to write essentially the same code over and over for every possible scenario. I therefore wanted a C++ version of this function to which the size information is passed at compile time. After some experimenting, I finally arrived at the solution shown above, where the inner dimension enters as the template parameter sizeA2.

Adding the restrict keyword and the two pragmas, I can convince the Intel C++ compiler (icpc) to vectorize the j-loop, reaching a peak of roughly 19 GFlops on a single core of an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, exactly on par with the Fortran counterpart.

What I want to do now is extend the function so that each of the three matrices A, B, and C can be transposed individually. To this end, my plan is to add templated access policies that flip the two indices where necessary. I extended the code as follows:

template<size_t sizeA2,
         typename AccessOperatorA = NoTranspose,
         typename AccessOperatorB = NoTranspose,
         typename AccessOperatorC = NoTranspose>
void matrixMatrixProduct( double* __restrict__ pC,
                          double const* __restrict__ pA,
                          double const* __restrict__ pB,
                          double alpha,
                          size_t sizeA1,
                          size_t sizeB2 )
{
    size_t outerLoopLimit = sizeA1;
    size_t innerLoopLimit = sizeB2;
    size_t sizeC2 = sizeB2;

    for ( size_t i = 0; i < outerLoopLimit; ++i )
    {
        #pragma vector aligned
        #pragma ivdep
        for ( size_t j = 0; j < innerLoopLimit; ++j )
        {
            #pragma vector aligned
            #pragma ivdep
            for ( size_t k = 0; k < sizeA2; ++k )
            {
                AccessOperatorC::get( pC, sizeC2, i, j ) +=
                    alpha * AccessOperatorA::get( pA, sizeA2, i, k )
                          * AccessOperatorB::get( pB, sizeB2, k, j );
            } // end of k-loop
        } // end of j-loop
    } // end of i-loop
}
where the NoTranspose access operator is defined as:

struct NoTranspose
{
    template<typename DataType>
    static inline DataType& get( DataType* __restrict__ pointer,
                                 size_t size2,
                                 size_t i,
                                 size_t j )
    {
        return pointer[i * size2 + j];
    }
};
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <vector>
#include <chrono>
#include <cmath>

struct NoTranspose
{
    template<typename DataType>
    static inline DataType& get( DataType* __restrict__ pointer,
                                 size_t size2,
                                 size_t i,
                                 size_t j )
    {
        return pointer[i * size2 + j];
    }
};

template<size_t sizeA2,
         typename AccessOperatorA = NoTranspose,
         typename AccessOperatorB = NoTranspose,
         typename AccessOperatorC = NoTranspose>
void matrixMatrixProduct( double* __restrict__ pC,
                          double const* __restrict__ pA,
                          double const* __restrict__ pB,
                          double alpha,
                          size_t sizeA1,
                          size_t sizeB2 )
{
    size_t outerLoopLimit = sizeA1;
    size_t innerLoopLimit = sizeB2;
    size_t sizeC2 = sizeB2;

    for ( size_t i = 0; i < outerLoopLimit; ++i )
    {
        #pragma vector aligned
        #pragma ivdep
        for ( size_t j = 0; j < innerLoopLimit; ++j )
        {
            #pragma vector aligned
            #pragma ivdep
            for ( size_t k = 0; k < sizeA2; ++k )
            {
                AccessOperatorC::get( pC, sizeC2, i, j ) +=
                    alpha * AccessOperatorA::get( pA, sizeA2, i, k )
                          * AccessOperatorB::get( pB, sizeB2, k, j );
            } // end of k-loop
        } // end of j-loop
    } // end of i-loop
}

int main( void )
{
    std::vector<int> sizesA1 = { 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20,
                                 30, 40, 50, 60, 70, 80, 90, 100, 125, 150,
                                 175, 200, 250, 300, 350, 400, 500, 600,
                                 700, 800, 900, 1000 };

    printf( "%15s \t %15s \t %15s \t %15s\n", "sizeA1", "sizeA2", "Time [s]", "GFlops" );

    for ( const auto& sizeA1 : sizesA1 )
    {
        int numberOfIterations = 1e6;
        const int sizeA2 = 2;

        size_t sizeB2 = sizeA1;
        size_t sizeC1 = sizeA1;
        size_t sizeC2 = sizeB2;
        size_t lengthA = sizeA1 * sizeA2;
        size_t lengthB = sizeA2 * sizeB2;
        int lengthC = sizeC1 * sizeC2;

        std::vector<double> A1( lengthA, 1.234 );
        std::vector<double> B1( lengthB, 1.234 );
        std::vector<double> C1( lengthC, 1.234 );
        std::vector<double> A2( lengthA, 1.234 );
        std::vector<double> B2( lengthB, 1.234 );
        std::vector<double> C2( lengthC, 1.234 );

        double alpha = 1.234;

        auto start = std::chrono::high_resolution_clock::now( );
        for ( int i = 0; i < numberOfIterations; ++i )
        {
            double* pA;
            double* pB;
            double* pC;

            // Force cache reload
            if ( i % 2 )
            {
                pA = A1.data( );
                pB = B1.data( );
                pC = C1.data( );
            }
            else
            {
                pA = A2.data( );
                pB = B2.data( );
                pC = C2.data( );
            }

            matrixMatrixProduct<sizeA2>( pC, pA, pB, alpha, sizeA1, sizeB2 );
        }
        auto end = std::chrono::high_resolution_clock::now( );
        std::chrono::duration<double> elapsed = end - start;

        double numberOfFlops = 1.0 * numberOfIterations * lengthC * ( 3 + 2 ); // two adds and three mults!
        double flops = (double) numberOfFlops / elapsed.count( );

        printf( "%15d \t %15d \t %15g \t %15e\n", sizeA1, sizeA2, elapsed.count( ), flops / 1.0e9 );
    }
    return 0;
}
In principle, this code compiles and gives the correct answer, but it runs at only 60% of the original speed!

From what I can tell, the problem seems to be that the alignment information is not carried over into the access templates' get functions, even though get is inlined according to the perf report. icpc then apparently decides to operate only on xmm registers, which causes the performance loss.
My question is therefore: how do I get the compiler to vectorize the extended code properly?

Any help is appreciated.

For completeness, the full MWE is included above.