C++: How to transfer alignment and restrict attributes to templated access functions

Time: 2016-08-17 16:06:23

Tags: c++ performance memory-alignment avx avx2

I want to compute a general matrix-matrix product

void matrixMatrixProduct( double* __restrict__ pC,
                          double const* __restrict__ pA,
                          double const* __restrict__ pB,
                          double alpha,
                          size_t sizeA1,
                          size_t sizeA2,
                          size_t sizeB2 )
{

  size_t outerLoopLimit = sizeA1;
  size_t innerLoopLimit = sizeB2;

  size_t sizeC2 = sizeB2;

  for ( size_t i = 0; i < outerLoopLimit; ++i )
  {
#pragma vector aligned
#pragma ivdep
    for ( size_t j = 0; j < innerLoopLimit; ++j )
    {
#pragma vector aligned
#pragma ivdep
      for ( size_t k = 0; k < sizeA2; ++k )
      {
        pC[i * sizeC2 + j] += alpha * pA[i * sizeA2 + k] * pB[k * sizeB2 + j];
      } // end of k-loop
    } // end of j-loop
  } // end of i-loop

}

Now, the special thing about my context is that I know the number of columns of A (and, correspondingly, the number of rows of B) at compile time. I can therefore hard-code this information, which naturally yields a significant performance gain.

So far I have done this in Fortran, but that required me to write the same code over and over for every possible scenario. I would therefore like to write a C++ version of this function in which I pass the size information at compile time.

After some experimenting, I finally arrived at the following solution

template<size_t sizeA2>
void matrixMatrixProduct( double* __restrict__ pC,
                          double const* __restrict__ pA,
                          double const* __restrict__ pB,
                          double alpha,
                          size_t sizeA1,
                          size_t sizeB2 )
{

  size_t outerLoopLimit = sizeA1;
  size_t innerLoopLimit = sizeB2;

  size_t sizeC2 = sizeB2;

  for ( size_t i = 0; i < outerLoopLimit; ++i )
  {
#pragma vector aligned
#pragma ivdep
    for ( size_t j = 0; j < innerLoopLimit; ++j )
    {
#pragma vector aligned
#pragma ivdep
      for ( size_t k = 0; k < sizeA2; ++k )
      {
        pC[i * sizeC2 + j] += alpha * pA[i * sizeA2 + k] * pB[k * sizeB2 + j];
      } // end of k-loop
    } // end of j-loop
  } // end of i-loop

}

Adding the restrict keyword and the two pragmas, I can convince the Intel C++ compiler to vectorize the j-loop, reaching a peak of roughly 19 GFlops on a single core of an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, exactly on par with the Fortran counterpart.

What I want to do now is to extend the function so that I can transpose each of the three matrices A, B and C individually. To that end, my plan is to add templated access policies that flip the two indices where necessary. I extended the code accordingly, with the NoTranspose access operator defined as

struct NoTranspose
{
  template<typename DataType>
  static inline DataType& get( DataType* __restrict__ pointer,
                               size_t size2,
                               size_t i,
                               size_t j )
  {
    return pointer[i * size2 + j];
  }
};
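A matching Transpose policy, sketched here (it is not part of the original post), would simply swap the two logical indices; size2 is assumed to still be the row length (stride) of the matrix as it is physically stored:

```cpp
#include <stddef.h>

struct Transpose
{
  template<typename DataType>
  static inline DataType& get( DataType* __restrict__ pointer,
                               size_t size2,
                               size_t i,
                               size_t j )
  {
    // Element (i, j) of the transposed view lives at (j, i) of the stored
    // matrix, whose rows have length size2.
    return pointer[j * size2 + i];
  }
};
```

With this, an instantiation like matrixMatrixProduct<2, Transpose> would read A as its transpose without any data movement; whether the compiler still vectorizes the resulting strided access is a separate question.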

For completeness, the extended kernel together with the benchmark driver gives the following full MWE

#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <vector>
#include <chrono>
#include <cmath>

struct NoTranspose
{
  template<typename DataType>
  static inline DataType& get( DataType* __restrict__ pointer,
                               size_t size2,
                               size_t i,
                               size_t j )
  {
    return pointer[i * size2 + j];
  }
};

template<size_t sizeA2,
         typename AccessOperatorA = NoTranspose,
         typename AccessOperatorB = NoTranspose,
         typename AccessOperatorC = NoTranspose >
void matrixMatrixProduct( double* __restrict__ pC,
                          double const* __restrict__ pA,
                          double const* __restrict__ pB,
                          double alpha,
                          size_t sizeA1,
                          size_t sizeB2 )
{

  size_t outerLoopLimit = sizeA1;
  size_t innerLoopLimit = sizeB2;

  size_t sizeC2 = sizeB2;

  for ( size_t i = 0; i < outerLoopLimit; ++i )
  {
#pragma vector aligned
#pragma ivdep
    for ( size_t j = 0; j < innerLoopLimit; ++j )
    {
#pragma vector aligned
#pragma ivdep
      for ( size_t k = 0; k < sizeA2; ++k )
      {
        AccessOperatorC::get( pC, sizeC2,  i, j ) += alpha * AccessOperatorA::get( pA, sizeA2, i, k ) * AccessOperatorB::get( pB, sizeB2, k, j );
      } // end of k-loop
    } // end of j-loop
  } // end of i-loop

}

int main( void )
{

  std::vector<int> sizesA1 = { 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, 300, 350, 400, 500, 600, 700, 800, 900, 1000 };

  printf( "%15s \t %15s \t%15s \t %15s\n", "sizeA1", "sizeA2", "Time [s]", "GFlops" );

  for ( const auto & sizeA1 : sizesA1 )
  {
    int numberOfIterations = 1e6;

    const int sizeA2 = 2;

    size_t sizeB2 = sizeA1;

    size_t sizeC1 = sizeA1;
    size_t sizeC2 = sizeB2;

    size_t lengthA = sizeA1 * sizeA2;
    size_t lengthB = sizeA2 * sizeB2;
    int lengthC = sizeC1 * sizeC2;

    std::vector<double> A1( lengthA, 1.234 );
    std::vector<double> B1( lengthB, 1.234 );
    std::vector<double> C1( lengthC, 1.234 );
    std::vector<double> A2( lengthA, 1.234 );
    std::vector<double> B2( lengthB, 1.234 );
    std::vector<double> C2( lengthC, 1.234 );

    double alpha = 1.234;

    auto start = std::chrono::high_resolution_clock::now( );
    for ( int i = 0; i < numberOfIterations; ++i )
    {
      double* pA;
      double* pB;
      double* pC;

      //Force cache reload
      if ( i % 2 )
      {
        pA = A1.data( );
        pB = B1.data( );
        pC = C1.data( );
      }
      else
      {
        pA = A2.data( );
        pB = B2.data( );
        pC = C2.data( );
      }

      matrixMatrixProduct<sizeA2>( pC, pA, pB, alpha, sizeA1, sizeB2 );

    }
    auto end = std::chrono::high_resolution_clock::now( );
    std::chrono::duration<double> elapsed = end - start;

    double numberOfFlops = 1.0 * numberOfIterations * lengthC * ( 3 + 2 ); //two adds and three mult!
    double flops = (double) numberOfFlops / ( elapsed.count( ) );

    printf( "%15d \t %15d \t %15g \t %15e\n", sizeA1, sizeA2, elapsed.count( ), flops / ( 1.0e9 ) );

  }

  return 0;
}

In principle, this code compiles and gives the correct answer, but it runs at only 60% of the speed!

As far as I understand, the problem seems to be that the alignment information is not transferred into the access template's get function, even though get is inlined according to the perf report. As a result, icpc apparently decides to work only on xmm registers, which causes the performance loss.
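One direction worth trying (a sketch under the assumption that the buffers really are 32-byte aligned; I have not verified the effect on icpc's code generation) is to re-assert the alignment inside the access policy itself via __builtin_assume_aligned, which GCC, Clang and the Intel compiler all support:

```cpp
#include <stddef.h>

struct NoTransposeAligned
{
  template<typename DataType>
  static inline DataType& get( DataType* __restrict__ pointer,
                               size_t size2,
                               size_t i,
                               size_t j )
  {
    // Re-attach the alignment information that is lost at the function
    // boundary: promise the compiler that 'pointer' is 32-byte (AVX) aligned.
    DataType* p = static_cast<DataType*>(
        __builtin_assume_aligned( pointer, 32 ) );
    return p[i * size2 + j];
  }
};
```

Note that passing an under-aligned pointer then becomes undefined behavior, so the buffers would have to be allocated with e.g. alignas(32) or an aligned allocator rather than a plain std::vector.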

My question therefore is: how can I get the compiler to vectorize the extended code properly?

Any help is appreciated.


0 Answers