Dividing a uint64_t by numeric_limits<uint64_t>::max() to get a floating-point representation

Time: 2016-06-14 13:01:14

Tags: c++ precision floating-accuracy

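Given a uint64_t value, is it possible to divide it by std::numeric_limits<uint64_t>::max() so that the result is a floating-point representation of the value (0.0 to 1.0, representing 0 to max)?

Numbers greater than max may result in undefined behavior, as long as every number equal to or smaller than max is divided correctly into its floating-point "counterpart" (or the closest number the floating-point type can represent, if the real value is not representable).

I'm not sure that casting one (or both) sides to long double will produce correct values for all valid inputs, because the standard does not guarantee that long double has a 64-bit mantissa. Is this even possible?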

3 answers:

Answer 0 (score: 3)

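Multiprecision arithmetic is not needed. In floating-point arithmetic with a significand (a.k.a. mantissa) of fewer than 64 bits, the division of n by n_max = std::numeric_limits<uint64_t>::max() can be computed in a correctly rounded way (that is, the computed result is identical to the closest approximation of the exact ratio in the target floating-point format), as follows: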

  

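$$\frac{n}{n_{\max}} = \frac{n}{2^{64}-1} = \frac{n}{2^{64}} \cdot \frac{1}{1-2^{-64}} = \frac{n}{2^{64}}\left(1 + 2^{-64} + 2^{-128} + \cdots\right) = \frac{n}{2^{64}} + \text{(whatever does not fit in the significand)}$$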

Hence, in the target floating-point format, the result is

$$\frac{n}{n_{\max}} = \frac{n}{2^{64}}$$

The following C++ test program implements both the naive and the accurate way of computing the ratio n / n_max:

#include <climits>
#include <cmath>
#include <iostream>
#include <limits>
#include <type_traits>


template<typename F, typename U>
F map_to_unit_range_naive(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    return F(n)/F(std::numeric_limits<U>::max());
}

template<typename F, typename U>
F map_to_unit_range_accurate(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    const int UBITS = sizeof(U) * CHAR_BIT;
    return std::ldexp(F(n), -UBITS);
}

template<class F, class U>
double error_mapping_to_unit_range(U n)
{
    const F r1 = map_to_unit_range_accurate<F>(n);
    const F r2 = map_to_unit_range_naive<F>(n);
    return (1-r2/r1);
}

#define CHECK_MAPPING_TO_UNIT_RANGE(n, result_type)                     \
    std::cout << "map_to_unit_range<" #result_type ">(" #n "): err="    \
              << error_mapping_to_unit_range<result_type>(n)*100 << "%" \
              << std::endl;

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,         float);
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u,  float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, float);
    std::cout << "\n";
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, double);
    return 0;
}

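The program shows that the naive approach produces the same results as the carefully designed code (on a platform where unsigned long is 64 bits wide):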

map_to_unit_range<float>(123u): err=0%
map_to_unit_range<float>(123ul): err=0%
map_to_unit_range<float>(1234567890u): err=0%
map_to_unit_range<float>(1234567890ul): err=0%

map_to_unit_range<double>(123ul): err=0%
map_to_unit_range<double>(1234567890ul): err=0%

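This may look surprising at first, but it has a simple explanation: if the floating-point type cannot represent the integer value 2^N - 1 exactly, the value is rounded to 2^N, which effectively turns the subsequent division into the accurate one (per the formula above).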
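The rounding step described above can be checked directly. The following is a minimal sketch, not part of the original answer: the helper name `max_rounds_to_power_of_two` is hypothetical, and it assumes the default round-to-nearest mode and a binary floating-point format.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (illustration only): returns true if converting
// numeric_limits<uint64_t>::max() == 2^64 - 1 to the floating-point type F
// yields exactly 2^64, i.e. the value was rounded up during conversion.
template <typename F>
bool max_rounds_to_power_of_two()
{
    const F as_float = static_cast<F>(std::numeric_limits<std::uint64_t>::max());
    // std::ldexp(F(1), 64) builds the exact value 2^64 in type F.
    return as_float == std::ldexp(F(1), 64);
}
```

With IEEE 754 float (24-bit significand) and double (53-bit significand) the function returns true, so dividing by F(max) is the same as dividing by 2^64. For long double the answer depends on how many significand bits the implementation provides.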

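Note that when the precision of the floating-point type exceeds the width of the integer type (so that 2^N - 1 can be represented exactly), the premise of the formula is not met, and the "accurate" method is no longer accurate: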

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u, double);
    return 0;
}

Output:

map_to_unit_range<double>(123u): err=-2.32831e-08%
map_to_unit_range<double>(1234567890u): err=-2.32831e-08%

Here the "error" comes from the "accurate" method.
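To see where that discrepancy comes from, here is a small sketch for the 32-bit case. It is not taken from the answer; `naive32` and `scaled32` are hypothetical names, and an IEEE 754 double is assumed.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: divide by the exactly representable 2^32 - 1.
inline double naive32(std::uint32_t n)
{
    return static_cast<double>(n)
         / static_cast<double>(std::numeric_limits<std::uint32_t>::max());
}

// Hypothetical helper: scale by 2^-32 instead, as the "accurate" method does.
inline double scaled32(std::uint32_t n)
{
    return std::ldexp(static_cast<double>(n), -32);
}
```

Because double can hold 2^32 - 1 exactly, `naive32` maps the maximum input to exactly 1.0, while `scaled32` maps it to 1 - 2^-32. The relative difference of 2^-32, about 2.32831e-10, matches the 2.32831e-08% shown in the output above.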


Credits:

Many thanks to @interjay and @Jonathan Mee for their thorough peer review of earlier versions of this answer.

Answer 1 (score: 1)

I think the simplest, most rigorous portable way is boost::multiprecision::cpp_bin_float_quad:

#include <boost/multiprecision/cpp_bin_float.hpp>

#include <limits>
#include <cstdint>
#include <iostream>
#include <iomanip>


int main()
{
    using Float = boost::multiprecision::cpp_bin_float_quad;

    for (std::uint64_t i = 0 ; i < 64 ; ++i)
    {
        auto v = std::uint64_t(1) << i;
        auto x = Float(v);

        x /= std::numeric_limits<std::uint64_t>::max();

        // demonstrate lossless round-trip
        auto y = x * std::numeric_limits<std::uint64_t>::max();

        std::cout << std::setprecision(std::numeric_limits<Float>::digits10)
        << (x * 100) << "% : "
        << std::hex << y.convert_to<std::uint64_t>()
        << std::endl;
    }
}

Expected results:

5.42101086242752217033113759205528e-18% : 1
1.08420217248550443406622751841106e-17% : 2
2.16840434497100886813245503682211e-17% : 4
4.33680868994201773626491007364422e-17% : 8
8.67361737988403547252982014728845e-17% : 10
1.73472347597680709450596402945769e-16% : 20
3.46944695195361418901192805891538e-16% : 40
6.93889390390722837802385611783076e-16% : 80
1.38777878078144567560477122356615e-15% : 100
2.7755575615628913512095424471323e-15% : 200
5.55111512312578270241908489426461e-15% : 400
1.11022302462515654048381697885292e-14% : 800
2.22044604925031308096763395770584e-14% : 1000
4.44089209850062616193526791541169e-14% : 2000
8.88178419700125232387053583082337e-14% : 4000
1.77635683940025046477410716616467e-13% : 8000
3.55271367880050092954821433232935e-13% : 10000
7.1054273576010018590964286646587e-13% : 20000
1.42108547152020037181928573293174e-12% : 40000
2.84217094304040074363857146586348e-12% : 80000
5.68434188608080148727714293172696e-12% : 100000
1.13686837721616029745542858634539e-11% : 200000
2.27373675443232059491085717269078e-11% : 400000
4.54747350886464118982171434538157e-11% : 800000
9.09494701772928237964342869076313e-11% : 1000000
1.81898940354585647592868573815263e-10% : 2000000
3.63797880709171295185737147630525e-10% : 4000000
7.27595761418342590371474295261051e-10% : 8000000
1.4551915228366851807429485905221e-09% : 10000000
2.9103830456733703614858971810442e-09% : 20000000
5.8207660913467407229717943620884e-09% : 40000000
1.16415321826934814459435887241768e-08% : 80000000
2.32830643653869628918871774483536e-08% : 100000000
4.65661287307739257837743548967072e-08% : 200000000
9.31322574615478515675487097934145e-08% : 400000000
1.86264514923095703135097419586829e-07% : 800000000
3.72529029846191406270194839173658e-07% : 1000000000
7.45058059692382812540389678347316e-07% : 2000000000
1.49011611938476562508077935669463e-06% : 4000000000
2.98023223876953125016155871338926e-06% : 8000000000
5.96046447753906250032311742677853e-06% : 10000000000
1.19209289550781250006462348535571e-05% : 20000000000
2.38418579101562500012924697071141e-05% : 40000000000
4.76837158203125000025849394142282e-05% : 80000000000
9.53674316406250000051698788284564e-05% : 100000000000
0.000190734863281250000010339757656913% : 200000000000
0.000381469726562500000020679515313826% : 400000000000
0.000762939453125000000041359030627651% : 800000000000
0.0015258789062500000000827180612553% : 1000000000000
0.00305175781250000000016543612251061% : 2000000000000
0.00610351562500000000033087224502121% : 4000000000000
0.0122070312500000000006617444900424% : 8000000000000
0.0244140625000000000013234889800848% : 10000000000000
0.0488281250000000000026469779601697% : 20000000000000
0.0976562500000000000052939559203394% : 40000000000000
0.195312500000000000010587911840679% : 80000000000000
0.390625000000000000021175823681358% : 100000000000000
0.781250000000000000042351647362715% : 200000000000000
1.56250000000000000008470329472543% : 400000000000000
3.12500000000000000016940658945086% : 800000000000000
6.25000000000000000033881317890172% : 1000000000000000
12.5000000000000000006776263578034% : 2000000000000000
25.0000000000000000013552527156069% : 4000000000000000
50.0000000000000000027105054312138% : 8000000000000000

Better performance can be had with boost::multiprecision::float128, but it only works with gcc (specify -std=gnu++NN) or the Intel compiler.

Answer 2 (score: 1)

I'd infer from your question:

  

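I'm not sure that casting one (or both) sides to long double will produce correct values for all valid inputs, because the standard does not guarantee that long double has a 64-bit mantissa. Is this even possible?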

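that what you mean to ask is:

Can any value that a uint64_t can represent survive being converted to a long double's mantissa and back to a uint64_t?

The answer is implementation-dependent. The key is how many bits long double uses for its mantissa. Fortunately, C++ gives you a way to find out: numeric_limits<long double>::digits. For example: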

const auto ui64max = numeric_limits<uint64_t>::max();
const auto foo = ui64max - 1;
const auto bar = static_cast<long double>(foo) / ui64max;

cout << "Max Digits For Roundtrip Guarantee: " << numeric_limits<long double>::digits << "\nMax Digits In uint64_t: " << numeric_limits<uint64_t>::digits << "\nConverting: " << foo << "\nTo long double Mantissa: " << bar << "\nRoundtrip Back To uint64_t: " <<  static_cast<uint64_t>(bar * ui64max) << endl;

Live Example

You can verify this at compile time with something like:

static_assert(numeric_limits<long double>::digits >= numeric_limits<uint64_t>::digits, "long double has insufficient mantissa precision in this implementation");
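Where a hard compile-time failure is not wanted, the same property can be tested per value at run time. This is only a sketch under stated assumptions: `long_double_holds_uint64` and `survives_round_trip` are hypothetical helpers, not part of the answer, and the default round-to-nearest mode is assumed.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: true if long double has at least as many significand
// bits as uint64_t has value bits on this implementation.
constexpr bool long_double_holds_uint64()
{
    return std::numeric_limits<long double>::digits
        >= std::numeric_limits<std::uint64_t>::digits;
}

// Hypothetical helper: does v survive uint64_t -> long double -> uint64_t?
inline bool survives_round_trip(std::uint64_t v)
{
    const long double as_ld = static_cast<long double>(v);
    // If the conversion rounded up to 2^64, converting back to uint64_t
    // would be undefined behaviour, so report failure instead of casting.
    if (as_ld >= std::ldexp(1.0L, 64)) {
        return false;
    }
    return static_cast<std::uint64_t>(as_ld) == v;
}
```

Small values and powers of two survive on any implementation. Whether the largest uint64_t value survives depends on `long_double_holds_uint64()`, which is the implementation-dependent part the answer refers to.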

For more on the math behind the round-trip question, see: Float Fractional Precision