Deriving an error bound for a square root approximation

Asked: 2016-03-06 03:29:22

Tags: floating-point floating-accuracy

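In this answer to a question about quaternion normalization, the author provides code for computing the reciprocal square root that uses 2.0 / (1.0 + qmagsq) as an approximation of 1.0 / std::sqrt(qmagsq) when qmagsq is very close to 1: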

double qmagsq = quat.square_magnitude();
if (std::abs(1.0 - qmagsq) < 2.107342e-08) {
    quat.scale (2.0 / (1.0 + qmagsq));
} else {
    quat.scale (1.0 / std::sqrt(qmagsq));
}

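The author then gives the following explanation:

For values of qmagsq between 0 and 2, the error in this approximation is less than (1-qmagsq)^2 / 8. The magic number 2.107342e-08 represents where this error exceeds half an ULP for IEEE doubles.

Presumably this is because sqrt(8 * 2^-(1+52) / 2) is approximately 2.10734243e-8, where 2^-(1+52) / 2 is half the precision of a double.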

How can (1-qmagsq)^2 / 8 be derived as an upper bound on the approximation error for values of qmagsq between 0 and 2?

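Edit

It has been pointed out that the error bound given by the author does not actually hold for values of qmagsq between 0 and 1. As a result, the question becomes more open-ended:

How can one derive an error bound for this approximation that can be used to determine the range over which the approximation's error is less than an ULP of an IEEE double?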

1 Answer:

Answer 0: (score: 2)

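Computers have become fast enough that specific assertions about single-argument functions can be tested exhaustively over the entire input domain for single precision, and over reasonably small intervals for double precision. The practical limit is usually around 2^48 test vectors. I am assuming an IEEE-754 compliant platform with the default rounding mode of round-to-nearest-or-even, and that all code is built with the strictest IEEE-754 adherence the compiler can muster (for my Intel compiler, for example, /fp:strict).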

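The claim in the question is that the quick replacement achieves an error of 0.5 ulp or less near unity. In other words, the result is correctly rounded according to the IEEE-754 rounding mode round-to-nearest-or-even. There are two ways to test this assertion: either use a correctly rounded implementation of rsqrt() as a reference and step through the arguments in 1-ulp increments until a mismatch is found, or use a multi-precision library as a reference and stop when the ulp error of the quick replacement exceeds 0.5 ulp. In the latter case, we need reference results with slightly more than twice the precision of double precision to avoid double-rounding effects. For the reciprocal square root, a reference with 2n+3 bits suffices:

Cristina Iordache and David W. Matula: "On Infinitely Precise Rounding for Division, Square Root, Reciprocal and Square Root Reciprocal". Proceedings of the 14th IEEE Symposium on Computer Arithmetic, Adelaide, Australia, April 14-16, 1999, pp. 233-240

The ISO-C99 code below uses the first approach. It starts the search at unity, first stepping toward two and then toward zero, and stops at the first mismatch in each direction. The output is as follows: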

arg = (1.0 + 2.2204460492503131e-016)  quick_rsqrt =  0x1.0000000000000p+0 (1.0000000000000000e+000)  rsqrt_rn =  0x1.fffffffffffffp-1 (9.9999999999999989e-001)   
arg = (1.0 - 1.2166747276332046e-008)  quick_rsqrt =  0x1.0000001a20bd7p+0 (1.0000000060833736e+000)  rsqrt_rn =  0x1.0000001a20bd8p+0 (1.0000000060833738e+000)

I also tried the second approach and got matching results. The error of the quick replacement at (1.0 + 2.2204460492503131e-16) is 0.9995 ulps, and at (1.0 - 1.2166747276332046e-8) it is 0.5002 ulps.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* function under test */
double quick_rsqrt (double a)
{
    return 2.0 / (1.0 + a);
}

/* starting approximation for reciprocal square root */
double simple_rsqrt (double a)
{
    return 1.0 / sqrt (a);
}

/* most significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t hi_uint32_of_double (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof t);
    return (uint32_t)(t >> 32);
}

/* least significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t lo_uint32_of_double (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof t);
    return (uint32_t)t;
}

/* construct IEEE-754 binary64 from upper and lower half of its bit representation */
double mk_double_from_hilo_uint32 (uint32_t hi, uint32_t lo)
{
    double r;
    uint64_t t = ((uint64_t)hi << 32)  + ((uint64_t)lo);
    memcpy (&r, &t, sizeof r);
    return r;
}

/* reciprocal square root, rounded to-nearest-or-even */
double rsqrt_rn (double a)
{
    double y, h, l, e;
    uint32_t alo, ahi, temp;
    int32_t diff;
    
    ahi = hi_uint32_of_double (a);
    alo = lo_uint32_of_double (a);
    if ((ahi - 0x00100000u) < 0x7fe00000u) { // positive normals
        /* scale argument towards unity */
        temp = (ahi & 0x3fffffff) | 0x3fe00000;
        diff = temp - ahi; // exponent difference
        a = mk_double_from_hilo_uint32 (temp, alo); 
        /* initial rsqrt approximation */
        y = simple_rsqrt (a);
        /* refine reciprocal square root approximation */
        h = y * y;
        l = fma (y, y, -h);
        e = fma (l, -a, fma (h, -a, 1.0));
        /* round according to Peter Markstein, "IA-64 and Elementary Functions" */
        y = fma (fma (0.375, e, 0.5), e * y, y);
        /* scale result near unity to correct range */
        diff = diff >> 1; // adjust exponent; ensure arithmetic right shift which is not guaranteed by ISO-C99
        a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + diff, lo_uint32_of_double (y));
    } else if (a == 0.0) { // zeros
        a = mk_double_from_hilo_uint32 ((ahi & 0x80000000) | 0x7ff00000, 0x00000000);
    } else if (a < 0.0) { // negatives
        a = mk_double_from_hilo_uint32 (0xfff80000, 0x00000000);
    } else if (isinf (a)) { // infinities
        a = mk_double_from_hilo_uint32 (ahi & 0x80000000, 0x00000000);
    } else if (isnan (a)) { // NaNs
        a = a + a;
    } else { // positive subnormals
        /* scale argument towards unity */
        a = a * mk_double_from_hilo_uint32 (0x7fd00000, 0);
        /* initial rsqrt approximation */
        y = simple_rsqrt (a);
        /* refine reciprocal square root approximation */
        h = y * y;
        l = fma (y, y, -h);
        e = fma (l, -a, fma (h, -a, 1.0));
        /* round according to Peter Markstein, "IA-64 and Elementary Functions" */
        y = fma (fma (0.375, e, 0.5), e * y, y);
        /* scale result near unity to correct range */
        a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + 0x1ff00000, lo_uint32_of_double (y));
    }
    return a;
}

int main (void)
{
    double x, ref, res;

    /* Try arguments greater than unity */
    x = 1.0;
    do {
        res = quick_rsqrt (x);
        ref = rsqrt_rn (x);
        if (res != ref) {
            printf ("arg = (1.0 + %23.16e)  quick_rsqrt = %21.13a (%23.16e)  rsqrt_rn = %21.13a (%23.16e)\n", 
                    x - 1.0, res, res, ref, ref);
            break;
        }
        x = nextafter (x, 2.0);
    } while (x < 2.0);

    /* Try arguments less than unity */
    x = 1.0;
    do {
        res = quick_rsqrt (x);
        ref = rsqrt_rn (x);
        if (res != ref) {
            printf ("arg = (1.0 - %23.16e)  quick_rsqrt = %21.13a (%23.16e)  rsqrt_rn = %21.13a (%23.16e)\n", 
                    1.0 - x, res, res, ref, ref);
            break;
        }
        x = nextafter (x, 0.0);
    } while (x > 0.0);

    return EXIT_SUCCESS;
}