Question

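While reading a post on StackOverflow (http://stackoverflow.com/questions/1502081/im-trying-to-optimize-this-c-code-using-4-way-loop-unrolling), now marked as closed, I came across an answer (actually a comment) that said: "The two inner loops could probably gain speed by using UInt64 and bit shifting."

Here is the code he posted: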

    char rotate8_descr[] = "rotate8: rotate with 8x8 blocking";

    void rotate8(int dim, pixel *src, pixel *dst)
    {
        int i, j, ii, jj;

        for (ii = 0; ii < dim; ii += 8)
            for (jj = 0; jj < dim; jj += 8)
                for (i = ii; i < ii + 8; i++)
                    for (j = jj; j < jj + 8; j++)
                        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
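The snippet relies on `pixel` and `RIDX`, which the post does not define. As a minimal sketch, assuming a one-byte `pixel` and a row-major `RIDX(i, j, n)` that expands to `i * n + j` (both are assumptions for illustration, not taken from the post), the index mapping is a 90-degree counterclockwise rotation. The function name `rotate_naive` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one byte per pixel. The original post does not define pixel. */
typedef uint8_t pixel;

/* Assumption: row-major indexing into a flat n x n image. */
#define RIDX(i, j, n) ((i) * (n) + (j))

/* Unblocked baseline: element (i, j) of src moves to (dim-1-j, i) of dst. */
void rotate_naive(int dim, const pixel *src, pixel *dst)
{
    assert(dim > 0 && src != dst); /* in-place rotation is not supported */

    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}
```

Under these assumptions, the 2x2 image with rows `1 2` and `3 4` becomes rows `2 4` and `1 3`. The reads walk `src` row by row while the writes jump a full row of `dst` on every step, which is the access pattern the 8x8 blocking is meant to tame.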

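Can someone explain how this would be applied? I'm interested in knowing how to apply bit shifting to this or similar code, and why it would help performance. Also, how can this code be optimized for cache usage? Any suggestions?

Assume this code is double tiled/blocked (outer tile = 32, with inner tiles of 16) and that loop-invariant code motion has also been applied. Would it still benefit from bit shifting and UInt64?

If not, what other suggestions would work?

Thanks!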

Answer 1

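If the pixels are small, you can use eight Uint64 registers (they are big, and there are plenty of them) to accumulate the result of the rotated matrix.

An example for sizeof(pixel) == 1 and a little-endian machine: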

Uint64 dst0 = 0, dst1 = 0; // ... and likewise dst2..dst7
for (int y = 0; y < 8; ++y) {
  // For every line, we load the 8 pixels of row y into src0.
  // The pixels shifted in first end up in the 8th (highest) byte
  // after 8 iterations, i.e. in the last column of the result.
  Uint64 src0 = *(Uint64*)(src + dim * y);
  dst0 = (dst0 << 8) | ( src0 & 0xff);       // this was pixel src[y][0]
  dst1 = (dst1 << 8) | ((src0 >> 8) & 0xff); // and pixel src[y][1]
  // etc...
}
// Now the 8 registers dst0..dst7 contain rows 0..7 of the result.
// Store them:
*(Uint64*)(dst) = dst0;
*(Uint64*)(dst + dim) = dst1;
// etc...

The good part is that this is easier to unroll and reorder, and it makes fewer memory accesses.
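The idea in this answer can be sketched as a complete function. This is only an illustration under stated assumptions: one-byte pixels, a little-endian machine, `dim` a multiple of 8, and a row-major `RIDX`; the names `rotate_ref`, `rotate_block8_u64` and `rotate8_u64` are hypothetical. As far as I can tell, the answer's snippet as written rotates clockwise, whereas the loop in the question rotates counterclockwise, so the sketch walks the rows from 7 down to 0 to match the question's mapping. It also uses `memcpy` instead of pointer casts for the 64-bit loads and stores, which avoids alignment and aliasing problems and is typically compiled to a single move.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t pixel;                   /* assumption: sizeof(pixel) == 1 */
#define RIDX(i, j, n) ((i) * (n) + (j))  /* assumption: row-major layout   */

/* Scalar reference with the same mapping as the question's loop. */
void rotate_ref(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Rotate the 8x8 block whose top-left corner is (ii, jj).
 * Little-endian only: byte x of a loaded word is the pixel in column jj + x. */
static void rotate_block8_u64(int dim, const pixel *src, pixel *dst,
                              int ii, int jj)
{
    uint64_t acc[8] = {0};  /* acc[x] collects source column jj + x */

    /* Go from the last row to the first, so that after eight shifts the
     * pixel of row ii + k sits in byte k of each accumulator. */
    for (int y = 7; y >= 0; --y) {
        uint64_t row;
        memcpy(&row, src + RIDX(ii + y, jj, dim), sizeof row); /* one 8-pixel load */
        for (int x = 0; x < 8; ++x)
            acc[x] = (acc[x] << 8) | ((row >> (8 * x)) & 0xff);
    }

    /* Source column jj + x becomes destination row dim-1-(jj+x),
     * columns ii..ii+7, written with one 8-byte store. */
    for (int x = 0; x < 8; ++x)
        memcpy(dst + RIDX(dim - 1 - (jj + x), ii, dim), &acc[x], sizeof acc[x]);
}

void rotate8_u64(int dim, const pixel *src, pixel *dst)
{
    assert(dim > 0 && dim % 8 == 0);

    for (int ii = 0; ii < dim; ii += 8)
        for (int jj = 0; jj < dim; jj += 8)
            rotate_block8_u64(dim, src, dst, ii, jj);
}
```

Per 8x8 block this issues 8 loads and 8 stores of 8 bytes each instead of 64 byte loads and 64 byte stores; the shifting and masking in between works on registers once the compiler unrolls the fixed-count loops. Whether that is faster in practice depends on the compiler and the machine, so it is worth timing `rotate8_u64` against `rotate_ref` on real image sizes. For pixels wider than one byte, fewer of them fit into a 64-bit word and the benefit shrinks accordingly.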
