Question

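In an app I'm profiling, I found that in some scenarios sqrt can take over 10% of total execution time.

Over the years I've seen discussions of faster sqrt implementations that use sneaky floating-point trickery, but I don't know whether such tricks are outdated on modern CPUs.

For reference, the compiler is MSVC++ 2008, though I'd assume its sqrt does not add much overhead.

See also here for a similar discussion of the modf function.

EDIT: for reference, this is one widely used method, but is it actually much quicker? How many cycles does SQRT take these days anyway?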

Answer 1

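Yes, it is possible even without trickery:

1) Sacrifice accuracy for speed: the sqrt algorithm is iterative, so re-implement it with fewer iterations.

2) Lookup tables: either just for the starting point of the iteration, or combined with interpolation to get you all the way there.

3) Caching: are you always taking the square root of the same limited set of values? If so, caching can work well. I've found this useful in graphics applications, where the same thing is calculated for lots of shapes of the same size, so the results can be usefully cached.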
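As a rough illustration of option 1, the sketch below runs a fixed number of Newton (Heron) steps from a crude power-of-two starting guess. The name `newton_sqrt` and its `iterations` parameter are made up for this example, and it assumes a finite input. Each step costs a division, so whether this beats the hardware sqrt has to be measured on the target machine.

```cpp
#include <cmath>

// Hypothetical helper: approximate sqrt(x) with a fixed number of Newton steps.
// Fewer iterations means less work and less accuracy.
inline float newton_sqrt(float x, int iterations = 2) {
    if (x <= 0.0f) return 0.0f;              // same convention as "return 0 on 0 or negative"
    int e = std::ilogb(x);                   // x = m * 2^e with m in [1, 2)
    float guess = std::ldexp(1.0f, e / 2);   // crude starting point: 2^(e/2)
    for (int i = 0; i < iterations; ++i)
        guess = 0.5f * (guess + x / guess);  // Newton step for g*g = x
    return guess;
}
```

With this starting guess, two steps land within about 2.5% of the true value, and each extra step roughly squares the remaining relative error.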

Answer 2

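There is a great comparison table here: http://assemblyrequired.crashworks.org/timing-square-root/

Long story short, the SSE sqrtss instruction is about 2x faster than the FPU fsqrt, and an approximation plus one iteration is about 4x faster than that (8x overall).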
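For readers who want to try the two SSE variants mentioned above without writing assembly, here is a minimal sketch using compiler intrinsics. The function names and the `HAVE_SSE_SKETCH` macro are invented for this example, and the code falls back to `std::sqrt` when SSE is not available. The approximate version assumes a finite x > 0 (x = 0 would produce NaN).

```cpp
#include <cmath>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1
#endif

// Full-precision single-precision square root via the sqrtss instruction.
inline float sse_sqrt(float x) {
#ifdef HAVE_SSE_SKETCH
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
#else
    return std::sqrt(x);
#endif
}

// Approximation + one iteration: rsqrtss estimate refined by one Newton-Raphson step.
inline float sse_approx_sqrt(float x) {
#ifdef HAVE_SSE_SKETCH
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // rough estimate of 1/sqrt(x)
    r = r * (1.5f - 0.5f * x * r * r);                      // refine the estimate once
    return x * r;                                           // sqrt(x) = x * (1/sqrt(x))
#else
    return std::sqrt(x);
#endif
}
```

The speed ratios quoted in the answer come from the linked article; they are not reproduced by this sketch and will differ between CPUs.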

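Also, if you're trying to take a single-precision sqrt, make sure that is actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call the double-precision sqrt, and then convert the result back to float.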
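One way to check this at compile time in C++ is sketched below. `sqrt_single` is a hypothetical wrapper name. In C, calling `sqrt` from `<math.h>` on a float promotes it to double, so `sqrtf` is the function to call there. Even with the right overload, it is worth inspecting the generated assembly, since the check only covers the type, not the instructions emitted.

```cpp
#include <cmath>
#include <type_traits>

// std::sqrt is overloaded in C++, so a float argument should select the float version.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) is expected to return float");

// Hypothetical wrapper that keeps the whole computation in single precision.
inline float sqrt_single(float x) {
    return std::sqrt(x);  // float overload: no declared round trip through double
}
```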

Answer 3

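You're likely to gain more speed by changing your algorithms than by changing their implementations: try to call sqrt() less often instead of making the calls faster. (And if you think this isn't possible: the improvements to sqrt() you mention are just that, improvements to the algorithm used to calculate a square root.)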
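A common way to call sqrt() less often is to compare squared quantities, for example in a distance test. The sketch below is only an illustration: the `Point` type and function names are invented here, and it assumes a non-negative radius.

```cpp
struct Point { float x, y; };

// Squared Euclidean distance: no square root needed.
inline float dist_sq(Point a, Point b) {
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Equivalent to sqrt(dist_sq(a, b)) <= radius when radius >= 0,
// because squaring preserves ordering for non-negative values.
inline bool within_radius(Point a, Point b, float radius) {
    return dist_sq(a, b) <= radius * radius;
}
```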

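Since it is used so often, your standard library's implementation of sqrt() is probably close to optimal for the general case. Unless you have a restricted domain where the algorithm can take shortcuts (for example, if you need less precision), it is very unlikely that someone will come up with a faster implementation.

Note that, since the function uses 10% of your execution time, even if you manage to write an implementation that takes only 75% of the time of std::sqrt(), this will bring your total execution time down by only 2.5% (10% × 25%). For most applications, users wouldn't even notice this unless they measured it with a stopwatch.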

Answer 4

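How accurate does your sqrt need to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).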

Answer 5

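I don't know if you've already fixed this, but I've read about it before, and it seems the fastest thing to do is to replace the sqrt function with an inline assembly version.

You can see a description of a number of alternatives here.

The best is this snippet of magic: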

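// MSVC only, 32-bit x86: MSVC does not support inline assembly in x64 builds.
double inline __declspec (naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]   // load the double argument from the stack onto the x87 stack
    _asm fsqrt                   // replace ST(0) with its square root
    _asm ret 8                   // return and pop the 8-byte argument
}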

It's about 4.7x faster than the standard sqrt call, with the same precision.

Answer 6

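Here is a fast method that uses a lookup table of only 8 KB. The error is about 0.5% of the result. You can easily enlarge the table to reduce the error. It runs about 5 times faster than the regular sqrt().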

#include <math.h>     // sqrtf
#include <windows.h>  // UINT32, TRUE, FALSE

// LUT for fast sqrt of floats. The table consists of 2 parts: half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11;                       // Use only the 11 most significant bits of the float's 23 mantissa bits. 15 bits could be used instead: less error, but more memory.
const int nUnusedBits   = 23 - nBitsForSQRTprecision;       // Amount of bits we will disregard
const int tableSize     = (1 << (nBitsForSQRTprecision+1)); // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize]; 
static unsigned char is_sqrttab_initialized = FALSE;        // Once initialized will be true

// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
    unsigned short i;
    float f;
    UINT32 *fi = (UINT32*)&f;

    if (is_sqrttab_initialized)
        return;

    const int halfTableSize = (tableSize>>1);
    for (i=0; i < halfTableSize; i++){
         *fi = 0;
         *fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127

         // Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
         f = sqrtf(f);
         sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);

         // Repeat the process, this time with an exponent of 1, stored as 128
         *fi = 0;
         *fi = (i << nUnusedBits) | (128 << 23);
         f = sqrtf(f);
         sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = TRUE;
}

// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
    if (n <= 0.f) 
        return 0.f;                           // On 0 or negative return 0.
    UINT32 *num = (UINT32*)&n;
    short e;                                  // Exponent
    e = (*num >> 23) - 127;                   // In 'float' the exponent is stored with 127 added.
    *num &= 0x7fffff;                         // leave only the mantissa 

    // If the exponent is odd, we have to look it up in the second half of the lookup table, so we set the high bit.
    const int halfTableSize = (tableSize>>1);
    const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01) 
        *num |= secondHalphTableIdBit;  
    e >>= 1;                                  // Divide the exponent by two (right-shifting a negative signed value is implementation-defined in C; MSVC and most compilers preserve the sign)

    // Do the table lookup, indexed by the top mantissa bits plus the odd-exponent bit, then reconstruct the result back into a float
    *num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
    return n;
}
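The code above reads the float's bits through a `UINT32*` cast, which is the pointer misuse its comment mentions. A standards-friendly alternative is to copy the bytes with `memcpy`, which compilers typically turn into a plain register move. The helper names below are invented for this sketch, and it assumes IEEE 754 single-precision floats, as the original code does.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Copy the bit pattern of a float into an integer.
inline std::uint32_t float_to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Copy an integer bit pattern back into a float.
inline float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// The exponent is stored with 127 added, in bits 23..30.
inline int unbiased_exponent(float f) {
    return static_cast<int>((float_to_bits(f) >> 23) & 0xffu) - 127;
}

// The low 23 bits hold the mantissa (without the implicit leading 1).
inline std::uint32_t mantissa_bits(float f) {
    return float_to_bits(f) & 0x7fffffu;
}
```

These helpers could stand in for the `*fi` and `*num` accesses in the table code without changing the algorithm.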

Is it possible to roll a significantly faster version of sqrt?