Question

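This is a follow-up to my previous question, Faster way to do multi dimensional matrix addition? After following @Peter Cordes's advice I vectorized my code, and it is now 50 times faster. I then ran gprof again and found that this function takes most of the time.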

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 69.97      1.53     1.53                             cal_score(int, std::string, int const*, int, double)

double cal_score(int l, string seq, const int *__restrict__ pw,int cluster,double alpha)
{
  const int cols =4;
  const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
  double score = 0;
  char s;
  string alphabet="ACGT";   
  int count=0;
  for(int k=0;k<cols;k++)           
    count=count+pwcluster[k];

  for (int i = 0; i < l; i++){
    long row_offset = cols*i;
    s=seq[i];
    //#pragma omp simd 
    for(int k=0;k<cols;k++) {
            if (s==alphabet[k])
                score=score+log(    ( pwcluster[row_offset+k]+alpha )/(count+4*alpha)       );
    }
  }
  return score;
}

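This is my first time optimizing code, so I don't know how to proceed. Is there a better way to write this function so that it runs faster? The input seq is a sequence of the characters 'ACGT' of length l. pw is a one-dimensional array of size 2 * l * 4, i.e. [p][q][r] flattened, and cluster is p.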

Answer 1

Here's another way to rewrite it. This translates the string with a lookup table instead of a search, and calls log fewer times by a factor of 10.

This also changes seq to a const unsigned char* (a pointer to the caller's data) instead of a std::string passed by value. (Passing by value would copy the whole string.)

#include <algorithm>  // std::max
#include <cmath>      // log
#include <cstring>    // memset

unsigned char transTable[128];

void InitTransTable(){
  memset(transTable, 0, sizeof(transTable));
  transTable['A'] = 0;
  transTable['C'] = 1;
  transTable['G'] = 2;
  transTable['T'] = 3;
}

static int tslen = 0;               // static instead of global lets the compiler keep tseq in a register inside the loop
static unsigned char* tseq = NULL;  // reusable buffer for translations.  Not thread-safe

double cal_score(
    int l
  , const unsigned char* seq    // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pw
  , int cluster
  , double alpha
  )
{
  int i, j, k;
  // make sure tseq is big enough
  if (tseq == NULL){
    tslen = std::max(4096, l+1024);
    tseq = new unsigned char[tslen];
    memset(tseq, 0, tslen);
  } else if (l > tslen-1){
    delete[] tseq;   // array form, because tseq was allocated with new[]
    tslen = l + 4096;
    tseq = new unsigned char[tslen];
    memset(tseq, 0, tslen);
  }
  // translate seq into tseq
  // (decrementing i so the beginning of tseq will be hot in cache when we're done)
  // note: transTable has 128 entries, so this assumes 7-bit ASCII input
  for (i = l; --i >= 0;) tseq[i] = transTable[seq[i]];

  const int cols = 4;
  const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
  double score = 0;
  // count up pwcluster
  int count=0;
  for(k = 0; k < cols; k++) count += pwcluster[k];
  double count4alpha = (count + 4*alpha);

  long row_offset = 0;
  for (i = 0; i < l;){
    double product = 1;
    for (j = 0; j < 10 && i < l; j++, i++, row_offset += cols){
      k = tseq[i];
      product *= (pwcluster[row_offset + k] + alpha) / count4alpha;
    }
    score += log(product);
  }
  return score;
}

This compiles to fairly good code, but without -ffast-math the division can't be replaced by a multiplication.

It doesn't auto-vectorize, because we only load one of every four elements of pwcluster.
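To make the "fewer log calls" idea concrete, here is a minimal sketch, separate from the answer's code, that compares one log() per term against one log() per batch of terms. The function names and the default batch size are illustrative assumptions, and the sketch assumes every term is positive and that a product of one batch stays within double range.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference: one log() call per element.
double sum_of_logs(const std::vector<double>& terms) {
    double score = 0.0;
    for (double t : terms) score += std::log(t);
    return score;
}

// Batched: multiply up to `batch` terms together, then take a single log().
// Relies on log(a * b) == log(a) + log(b); rounding may differ slightly.
double sum_of_logs_batched(const std::vector<double>& terms, std::size_t batch = 10) {
    assert(batch >= 1);  // a batch of 0 would never advance
    double score = 0.0;
    for (std::size_t i = 0; i < terms.size();) {
        double product = 1.0;
        for (std::size_t j = 0; j < batch && i < terms.size(); ++j, ++i)
            product *= terms[i];
        score += std::log(product);
    }
    return score;
}
```

The two functions should agree up to floating-point rounding; the batched one simply trades most of the log() calls for multiplies.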

Answer 2

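I made some improvements on Mike's good idea and code.

I also made a vectorized version (requires SSE4.1). It's more likely to have bugs, but is worth trying because you should get a significant speedup from doing packed multiplies. Porting it to AVX should give another big speedup.

See all the code on godbolt, including a vectorized conversion from ASCII to 0..3 bases (using a pshufb LUT).

My changes: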

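Don't translate ahead of time. The translation should overlap well with the FP loop's work, instead of forcing the FP work to wait for a tiny translation loop to finish before it can start.
Simplify the counter variables (gcc makes better code: it was actually keeping j in a register rather than optimizing it away, or else it was fully unrolling the inner loop into a giant loop).
Pull the scaling by (count + 4*alpha) out of the loop entirely: instead of dividing by it (or multiplying by the inverse), subtract the logarithm. Since log() grows extremely slowly, we can probably defer this indefinitely without losing too much precision in the final score.

An alternative would be to only subtract every N iterations, but then the cleanup loop would have to figure out whether it terminated early. At the very least, we could multiply by 1.0 / (count + 4*alpha) instead of dividing. Without -ffast-math, the compiler can't do this for you.
Have the caller calculate pwcluster for us: it probably calculates it for its own use anyway, and we can remove one of the function args (cluster).
row_offset was making slightly worse code compared to just writing i*cols. If you like pointer increments as an alternative to array indexing, gcc makes even better code in the inner loop when incrementing pwcluster directly.
Rename l to len: single-letter variable names are bad style except in very small scopes (like a loop, or a very small function that only does one thing), and even then only if there isn't a good short but meaningful name. E.g. p is no more meaningful than ptr, but len tells you what it means, not just what it is.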
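The "subtract the logarithm" change rests on the identity log(a / d) = log(a) - log(d). A minimal sketch of the two formulations, with hypothetical function names and a plain vector standing in for the matrix entries selected by the sequence:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scale every term inside the loop: one division per element.
double score_scaled_in_loop(const std::vector<int>& counts, double alpha, double denom) {
    assert(denom > 0.0);
    double score = 0.0;
    for (int c : counts) score += std::log((c + alpha) / denom);
    return score;
}

// Defer the scaling: accumulate log(c + alpha), then subtract n * log(denom) once.
double score_scaling_deferred(const std::vector<int>& counts, double alpha, double denom) {
    assert(denom > 0.0);
    double score = 0.0;
    for (int c : counts) score += std::log(c + alpha);
    return score - std::log(denom) * static_cast<double>(counts.size());
}
```

Here denom plays the role of count + 4*alpha. The two results should match up to rounding, and the deferred form has no division in the loop at all.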

Further observations:

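Storing your sequences in translated format throughout your program would be better for this and for any other code that wants to use the DNA base as an array index or counter.

You can also vectorize the translation between nucleotide numbers (0..3) and ASCII using SSSE3 pshufb. (See my code on godbolt.)
Storing your matrix in float instead of int might be better. Since your code now spends most of its time in this function, it would run faster if it didn't have to keep converting from int to floating point. On Haswell, cvtss2sd (single->double) apparently has better throughput than cvtsi2sd (int->double), but not on Skylake. (ss2sd is slower on SKL than on HSW.)

Storing your matrix in double format might be faster, but the doubled cache footprint might be a killer. Doing this calculation with float instead of double would avoid the conversion cost, too. But with double you can defer log() for more iterations.
Using multiple product variables (p1, p2, etc.) in a manually unrolled inner loop would expose more parallelism. Multiply them together at the end of the loop. (I ended up making a vectorized version with two vector accumulators.)
For Skylake or maybe Broadwell, you could vectorize with VPGATHERDD. The vectorized translation from ASCII to 0..3 will be helpful here.
Even without using a gather instruction, loading two integers into a vector and using a packed conversion instruction would be good. The packed conversion instructions are faster than the scalar ones. We have a lot of multiplies to do, and can certainly take advantage of doing two or four at once with SIMD vectors. See below.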
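As a rough scalar illustration of the multiple-accumulator point (a sketch with hypothetical names, not the godbolt code), two independent products let consecutive multiplies overlap instead of forming a single dependency chain:

```cpp
#include <cassert>
#include <cstddef>

// Single accumulator: every multiply waits for the previous one.
double product_single(const double* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    double p = 1.0;
    for (std::size_t i = 0; i < n; ++i) p *= v[i];
    return p;
}

// Two accumulators: even and odd elements form independent chains,
// combined once at the end. An odd leftover element goes into p1.
double product_two_accumulators(const double* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    double p1 = 1.0, p2 = 1.0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        p1 *= v[i];
        p2 *= v[i + 1];
    }
    if (i < n) p1 *= v[i];
    return p1 * p2;
}
```

Reordering floating-point multiplies can change the last bits of the result, which is why a compiler will not do this transformation for you without something like -ffast-math.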

Simple version with my improvements:

See the full code on godbolt, linked at the top of this answer.

double cal_score_simple(
    int len                            // one-letter variable names are only good in the smallest scopes, like a loop
  , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
  , double alpha )
{
  // note that __restrict__ isn't needed because we don't write into any pointers
  const int cols = 4;
  const int logdelay_factor = 4;  // accumulate products for this many iterations before doing a log()

  int count=0;    // count the first row of pwcluster
  for(int k = 0; k < cols; k++)
    count += pwcluster[k];

  const double log_c4a = log(count + 4*alpha);

  double score = 0;
  for (int i = 0; i < len;){
    double product = 1;
    int inner_bound = std::min(len, i+logdelay_factor);

    while (i < inner_bound){
      unsigned int k = transTable[seq[i]];        // translate on the fly
      product *= (pwcluster[i*cols + k] + alpha); // * count4alpha_inverse; // scaling deferred
      // TODO: unroll this with two or four product accumulators to allow parallelism
      i++;
    }

    score += log(product);  // - log_c4a * j;
  }

  score -= log_c4a * len;   // might be ok to defer this subtraction indefinitely, since log() accumulates very slowly
  return score;
}

This compiles to quite good asm, with a fairly compact inner loop:

.L6:
    movzx   esi, BYTE PTR [rcx]   # D.74129, MEM[base: _127, offset: 0B]
    vxorpd  xmm1, xmm1, xmm1    # D.74130
    add     rcx, 1    # ivtmp.44,
    movzx   esi, BYTE PTR transTable[rsi] # k, transTable
    add     esi, eax  # D.74133, ivtmp.45
    add     eax, 4    # ivtmp.45,
    vcvtsi2sd       xmm1, xmm1, DWORD PTR [r12+rsi*4]     # D.74130, D.74130, *_38
    vaddsd  xmm1, xmm1, xmm2    # D.74130, D.74130, alpha
    vmulsd  xmm0, xmm0, xmm1    # product, product, D.74130
    cmp     eax, r8d  # ivtmp.45, D.74132
    jne     .L6       #,

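Using a pointer increment instead of indexing with i*cols would remove one add from the loop, getting it down to 10 fused-domain uops (vs. 11 in this loop). So it won't matter for front-end throughput from the loop buffer, but it will be fewer uops for the execution ports. Resource stalls can make that matter, even when total uop throughput isn't the immediate bottleneck.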
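A minimal sketch of what the pointer-increment form of the inner loop might look like. The table name baseIndex, its 256-entry size, and the function names are assumptions made for this illustration; the answer's own code uses transTable and i*cols.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical 256-entry table so that any byte value is a valid index.
static unsigned char baseIndex[256];

void init_base_index() {
    std::memset(baseIndex, 0, sizeof(baseIndex));
    baseIndex['A'] = 0;
    baseIndex['C'] = 1;
    baseIndex['G'] = 2;
    baseIndex['T'] = 3;
}

// Product of (row[k] + alpha) over seq[0..len), advancing `row` by one
// 4-int row per base instead of recomputing i*cols on each iteration.
double block_product_ptr(int len, const unsigned char* seq,
                         const int* pwcluster, double alpha) {
    assert(len >= 0 && seq != nullptr && pwcluster != nullptr);
    const int cols = 4;
    const int* row = pwcluster;
    const unsigned char* end = seq + len;
    double product = 1.0;
    for (const unsigned char* p = seq; p != end; ++p, row += cols) {
        product *= row[baseIndex[*p]] + alpha;
    }
    return product;
}
```

Whether this actually saves an instruction depends on the compiler and flags, so it is worth checking the generated asm rather than assuming.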

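Manually vectorized SSE version:

Not tested, and not that carefully written. I could easily have made some mistakes here. If you're running this on computers with AVX, you should definitely make an AVX version of it. Use vextractf128 as the first step of a horizontal product or sum, then do the same as what I have here.

With a vectorized log() function that computes two (or four with AVX) log() results in parallel in a vector, you could just do a horizontal sum at the end, instead of the more frequent horizontal products before each scalar log(). I'm sure someone has written one, but I'm not going to take the time to search for it right now.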

// TODO: AVX version
double cal_score_SSE(
    int len                            // one-letter variable names are only good in the smallest scopes, like a loop
  , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
  , double alpha
  )
{
  const int cols = 4;
  const int logdelay_factor = 16;  // accumulate products for this many iterations before doing a log()

  int count=0;    // count the first row of pwcluster
  for(int k = 0; k < cols; k++) count += pwcluster[k];

  //const double count4alpha_inverse = 1.0 / (count + 4*alpha);
  const double log_c4a = log(count + 4*alpha);

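#define COUNTER_TYPE int
#ifndef likely
#define likely(x) __builtin_expect(!!(x), 1)   // GNU C branch hint; likely() is used below but not defined in this snippet
#endif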

  //// HELPER FUNCTION: make a vector of two (pwcluster[i*cols + k] + alpha)
  auto lookup_two_doublevec = [&pwcluster, &seq, &alpha](COUNTER_TYPE pos) {
        unsigned int k0 = transTable[seq[pos]];
        unsigned int k1 = transTable[seq[pos+1]];
        __m128i pwvec = _mm_cvtsi32_si128( pwcluster[cols*pos + k0] );
           pwvec = _mm_insert_epi32(pwvec, pwcluster[cols*(pos+1) + k1], 1);
        // for AVX: repeat the previous lines, and _mm_unpack_epi32 into one __m128i,
        // then use _mm256_cvtepi32_pd (__m128i src)

        __m128d alphavec = _mm_set1_pd(alpha);
        return _mm_cvtepi32_pd(pwvec) + alphavec;
        //p1d = _mm_add_pd(p1d, _mm_set1_pd(alpha));
  };

  double score = 0;
  for (COUNTER_TYPE i = 0; i < len;){
    double product = 1;
    COUNTER_TYPE inner_bound = i+logdelay_factor;
    if (inner_bound >= len) inner_bound = len;
    // possibly do a whole vector of transTable translations; probably doesn't matter

    if (likely(inner_bound < len)) {
      // We can do 8 or 16 elements without checking the loop counter
      __m128d p1d = lookup_two_doublevec(i+0);
      __m128d p2d = lookup_two_doublevec(i+2);

      i+=4;  // start with four element loaded into two vectors, not multiplied by anything
      static_assert(logdelay_factor % 4 == 0, "logdelay_factor must be a multiple of 4 for vectorization");

      while (i < inner_bound) {
        // The *= syntax requires GNU C vector extensions, which is how __m128d is defined in gcc
        p1d *= lookup_two_doublevec(i+0);
        p2d *= lookup_two_doublevec(i+2);
        i+=4;
      }
      // we have two vector accumulators, holding two products each
      p1d *= p2d;            // combine to one vector

      //p2d = _mm_permute_pd(p1d, 1);  // if you have AVX.  It's no better than movhlps, though.
      // movhlps  p2d, p1d   // extract the high double, using p2d as a temporary
      p2d = _mm_castps_pd( _mm_movehl_ps(_mm_castpd_ps(p2d), _mm_castpd_ps(p1d) ) );

      p1d = _mm_mul_sd(p1d, p2d);   // multiply the last two elements, now that we have them extracted to separate vectors
      product = _mm_cvtsd_f64(p1d);
      // TODO: find a vectorized log() function for use here, and do a horizontal add down to a scalar outside the outer loop.
    } else {
      // Scalar for the last unknown number of iterations
      while (i < inner_bound){
        unsigned int k = transTable[seq[i]];
        product *= (pwcluster[i*cols + k] + alpha); // * count4alpha_inverse; // scaling deferred
        i++;
      }
    }

    score += log(product);  // - log_c4a * j;  // deferred
  }

  score -= log_c4a * len;   // May be ok to defer this subtraction indefinitely, since log() accumulates very slowly
  // if not, subtract log_c4a * logdelay_factor in the vector part,
  // and (len&15)*log_c4a out here at the end.  (i.e. len %16)
  return score;
}

Vectorizing ASCII -> integer DNA base

Ideally, do the translation once when you read in the sequences, and store them internally as 0/1/2/3 arrays instead of A/C/G/T ASCII strings.
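A minimal sketch of translating once at read time, assuming upper-case A/C/G/T input and a hypothetical helper name; how invalid characters should be handled is a choice for your program, and throwing is just one option.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Convert an ASCII sequence to base codes 0..3 (A, C, G, T).
// Throws std::invalid_argument on any other character.
std::vector<std::uint8_t> encode_bases(const std::string& ascii) {
    std::vector<std::uint8_t> out;
    out.reserve(ascii.size());
    for (char c : ascii) {
        switch (c) {
            case 'A': out.push_back(0); break;
            case 'C': out.push_back(1); break;
            case 'G': out.push_back(2); break;
            case 'T': out.push_back(3); break;
            default:  throw std::invalid_argument("unexpected base character");
        }
    }
    assert(out.size() == ascii.size());
    return out;
}
```

The scoring function can then index the matrix directly with these codes, and the per-call table lookup disappears from the hot loop.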

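It can be manually vectorized with pshufb, if we don't have to check for errors (invalid characters). In Mike's code, where we translate the whole input before the FP loop, this can give a big speedup to that part of the code.

For translating on the fly, we could use a vector to:

translate a block of 16 input characters in the outer loop,
store it to a 16-byte buffer,
then do scalar loads from that buffer in the inner loop.

Since gcc seems to fully unroll the vector loop, this would replace 16 movzx instructions with 6 vector instructions (including the load and store).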

#include <immintrin.h>
__m128i nucleotide_ASCII_to_number(__m128i input) {

  // map A->0, C->1, G->2, T->3.
    // ASCII code               shifted right by one: low 4 bits are unique
  /* 'A' = 65 = 0b100 0001    >>1 : 0b10 0000
   * 'C' = 67 = 0b100 0011    >>1 : 0b10 0001
   * 'G' = 71 = 0b100 0111    >>1 : 0b10 0011
   * 'T' = 84 = 0b101 0100    >>1 : 0b10 1010   // same low 4 bits for lower-case
   * 
   * We right-shift by one, mask, and use that as indices into a LUT
   * We can use pshufb as a 4bit LUT, to map all 16 chars in parallel
   */

  // 'T' is ASCII 84, so after the shift and mask it selects LUT index 10
  __m128i LUT = _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff,    3, 0xff, 0xff,
                             0xff, 0xff, 0xff, 0xff,   2, 0xff,    1,    0);
  // Not all "bogus" characters map to 0xFF, but 0xFF in the output only happens on invalid input

  __m128i shifted = _mm_srli_epi32(input, 1);   // And then mask, to emulate srli_epi8
  __m128i masked  = _mm_and_si128(shifted, _mm_set1_epi8(0x0F));
  __m128i nucleotide_codes = _mm_shuffle_epi8(LUT,  masked);
  return nucleotide_codes;
}

   // compiles to:
    vmovdqa xmm1, XMMWORD PTR .LC2[rip]       # the lookup table
    vpsrld  xmm0, xmm0, 1       # tmp96, input,
    vpand   xmm0, xmm0, XMMWORD PTR .LC1[rip]     # D.74111, tmp96,
    vpshufb xmm0, xmm1, xmm0  # tmp100, tmp101, D.74111
    ret
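For checking the mapping without SIMD, here is a scalar model of the same shift, mask, and 16-entry lookup. It is a sketch with hypothetical function names, not part of the godbolt code. Note that 'T' is ASCII 84 (0b101 0100), so after the shift and mask it selects index 10.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of the 4-bit LUT: index = (c >> 1) & 0x0F.
// A -> index 0, C -> index 1, G -> index 3, T -> index 10; other slots hold 0xFF.
std::uint8_t nucleotide_code_scalar(unsigned char c) {
    static const std::uint8_t lut[16] = {
        0,    1,    0xFF, 2,    0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 3,    0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
    return lut[(c >> 1) & 0x0F];
}

// Translate n characters, e.g. a 16-byte block for the inner loop to read from.
void translate_block(const unsigned char* in, std::uint8_t* out, std::size_t n) {
    assert((in != nullptr && out != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) out[i] = nucleotide_code_scalar(in[i]);
}
```

As with the vector version, this is not a validator: some invalid characters alias onto valid slots (for example 'B' lands on the same index as 'C'), so 0xFF in the output only tells you that the input was definitely bad.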

Faster way to calculate the likelihood of a sequence?
