Question

Edit: I now realize I didn't explain my algorithm very well. Let me try again.

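What I am doing is very similar to a dot product of two vectors, with one difference. I have two vectors: a bit vector and a float vector of the same length. So I need to compute the sum float[0]*bit[0] + float[1]*bit[1] + ... + float[N-1]*bit[N-1], but unlike a classic dot product, I need to skip a fixed number of elements after each set bit.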

Example:

vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits   = {1, 0, 1, 0, 1 }
nSkip = 2

In this case, the sum is computed as follows:

sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
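
For reference, the rule in the worked example can be written as a plain, unoptimized loop. This is only a sketch: the function name `skip_sum_reference` and the `n_skip` parameter are made up for illustration, and it assumes the bit vector is packed 64 bits per word with bit i stored in bit (i % 64) of word (i / 64). It follows the prose example, where the `n_skip` elements after a set bit are not examined at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Unoptimized reference for the skipping sum (hypothetical helper).
   Assumes bit i of the vector lives in bit (i % 64) of bits[i / 64]. */
float skip_sum_reference(const float *floats, const uint64_t *bits,
                         size_t n, size_t n_skip)
{
    float sum = 0.0f;
    size_t i = 0;
    while (i < n) {
        if ((bits[i / 64] >> (i % 64)) & 1u) {
            sum += floats[i];     /* bit is set: take the element */
            i += n_skip + 1;      /* and jump over the next n_skip elements */
        } else {
            i += 1;               /* bit is clear: no skipping */
        }
    }
    return sum;
}
```

A slow but obviously correct version like this is mainly useful for checking any faster variant against.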

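The following code is called many times with different data, so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some degree but is still slow, and I don't know how to improve it further. The bit array is implemented as an array of 64-bit unsigned integers, which lets me use _BitScanForward64 to find the next set bit.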

Code:

unsigned int i = 0;
float fSum = 0;
do
{
  unsigned int nAddr = i / 64;
  unsigned int nShift = i & 63;
  unsigned __int64 v = bitarray[nAddr] >> nShift;
  unsigned long idx;
  if (!_BitScanForward64(&idx, v))
  {
    i+=64-nShift; 
    continue;
  }
  i+= idx;
  fSum  += floatarray[i];
  i+= nSkip;
}   while(i<nEnd);

The profiler shows the 3 slowest hotspots:

1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v) 
3. fSum += floatarray[i]; (memory access)

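But there may be a different way to do this. I am considering clearing the nSkip bits after each set bit in the bit vector and then computing a classic dot product. I haven't tried it, but honestly I don't believe it would be faster, since it needs more memory accesses.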
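
The two-pass idea mentioned above could look roughly like the sketch below. The function names are hypothetical, the bit layout is assumed to be 64 bits per word with bit i in bit (i % 64) of word (i / 64), and the skipped range is taken to be the `n_skip` elements after each kept set bit, as in the worked example. Nothing here is tuned; it only shows what "clear, then sum" means.

```c
#include <stddef.h>
#include <stdint.h>

/* Pass 1: clear, in place, the n_skip bits that follow every kept set bit. */
void clear_skipped_bits(uint64_t *bits, size_t n, size_t n_skip)
{
    size_t i = 0;
    while (i < n) {
        if ((bits[i / 64] >> (i % 64)) & 1u) {
            size_t stop = i + 1 + n_skip;
            if (stop > n)
                stop = n;
            for (size_t j = i + 1; j < stop; ++j)
                bits[j / 64] &= ~(UINT64_C(1) << (j % 64));
            i = stop;             /* resume after the cleared range */
        } else {
            ++i;
        }
    }
}

/* Pass 2: classic bit-vector "dot product": add floats[i] where bit i is set. */
float masked_sum(const float *floats, const uint64_t *bits, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        if ((bits[i / 64] >> (i % 64)) & 1u)
            sum += floats[i];
    return sum;
}
```

The first pass still has to walk the bits in order, so this mostly pays off only if the same masked bit vector is reused against several float arrays.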

Answer 1

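You have too many operations inside the loop. You also have only one loop, so many operations that only need to happen once per flag word (64-bit unsigned integer) can happen up to 63 extra times.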

Treat division as an expensive operation and try to avoid it when optimizing performance-critical code.

Memory access is also expensive in terms of how long it takes, so it should be limited to the accesses that are actually required.

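Tests that let you exit early are often useful (although sometimes the test itself is expensive relative to the operation you are trying to avoid, that is probably not the case here).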

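Using nested loops should simplify this. The outer loop should work at the 64-bit word level, and the inner loop should work at the bit level.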
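
One way to read the nested-loop suggestion while keeping the skip rule is sketched below. It is not the answerer's code: `skip_sum_nested` is a hypothetical name, `__builtin_ctzll` is the GCC/Clang builtin used here as a stand-in for `_BitScanForward64`, and the skip is assumed to cover the `n_skip` elements after each set bit, as in the question's worked example. Each word is loaded once, and skipped bits are masked off in a register instead of being re-read.

```c
#include <stddef.h>
#include <stdint.h>

float skip_sum_nested(const float *floats, const uint64_t *bits,
                      size_t n, size_t n_skip)
{
    float sum = 0.0f;
    size_t n_words = (n + 63) / 64;
    size_t next = 0;                 /* lowest index that may still be examined */

    for (size_t w = 0; w < n_words; ++w) {          /* word level */
        size_t base = w * 64;
        if (next >= base + 64)
            continue;                               /* whole word is skipped */
        uint64_t v = bits[w];                       /* one load per word */
        if (next > base)
            v &= ~UINT64_C(0) << (next - base);     /* drop bits skipped from earlier */
        while (v) {                                 /* bit level */
            size_t i = base + (size_t)__builtin_ctzll(v);
            if (i >= n)
                break;                              /* padding bits past the end */
            sum += floats[i];
            next = i + 1 + n_skip;
            if (next >= base + 64)
                break;                              /* rest of this word is skipped */
            v &= ~UINT64_C(0) << (next - base);     /* clear the taken and skipped bits */
        }
    }
    return sum;
}
```

Note that the question's loop advances by nSkip from the index of the set bit, so the two agree only if nSkip there already counts the set element itself.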

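I noticed a mistake in my earlier advice. Since the division here is by 64, which is a power of 2, it is not actually an expensive operation, but we still need to move as many operations as possible as far out of the loops as we can.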
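
To make the power-of-2 point concrete: for unsigned values, dividing by 64 and taking the remainder modulo 64 reduce to a shift and a mask, which compilers normally emit on their own. The helper names below are hypothetical and only illustrate the equivalence.

```c
#include <stdint.h>

/* For unsigned i, i / 64 is a right shift by 6 and i % 64 is a mask with 63. */
static inline uint32_t word_index(uint32_t i) { return i >> 6; }    /* same as i / 64 */
static inline uint32_t bit_offset(uint32_t i) { return i & 63u; }   /* same as i % 64 */
```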

/* This is completely untested, but incorporates the optimizations
   that I outlined as well as a few others.
   I process the arrays backwards, which allows for elimination of
   comparisons of variables against other variables, which is much
   slower than comparisons of variables against 0, which is essentially
   free on many processors when you have just operated on or loaded the
   value to a register.
   Going backwards at the bit level also allows for the possibility that
   the compiler will take advantage of the comparison of the top bit
   being the same as a test for negative, which is cheap and mostly free
   for all but the first time through the inner loop (for each time
   through the outer loop).
   NOTE: as written, this adds floatarray[i] for every set bit and does
   not apply nSkip; the skip rule depends on forward order.
 */
double acc = 0.0;

unsigned i_bit_end;   /* bit position within the current word */
int i_word_end;       /* index of the current word; signed so it can reach -1 */

if (nEnd == 0)
{
     return acc;
}
i_bit_end = (nEnd - 1) % 64;
i_word_end = (int)((nEnd - 1) / 64);

do
{
    /* Shift so that the highest bit still to be examined sits at bit 63. */
    unsigned __int64 v = bitarray[i_word_end] << (63 - i_bit_end);
    unsigned i_upper = (unsigned)i_word_end << 6;   /* i_word_end * 64 */
    while (v)
    {
         if (v & 0x8000000000000000ULL)
         {
              // The following code is semantically the same as
              // unsigned i = i_bit_end + (i_word_end * 64);
              unsigned i = i_bit_end | i_upper;
              acc += floatarray[i];
         }
         v <<= 1;
         i_bit_end--;
     }
     i_bit_end = 63;
     i_word_end--;
} while (i_word_end >= 0);

Answer 2

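I think you should check "How to Ask" first. You won't get many upvotes for this, because you are asking us to do the work for you rather than presenting a specific problem.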

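I don't understand why you increment the same variable (i) in two places instead of one. I also think you should declare the variables only once rather than on every iteration.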

How to optimize C code: finding the next set bit and summing the corresponding array elements
