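Intuition tells me that, since 32 bits can represent a fixed number of distinct values, a float can represent only a fixed number of values in any given range. Is that true? Is any of that number of representable values lost because of the way conversion works?
Say I pick a number in the range [10^30, 10^35]. Obviously the precision I can get in this range is limited, but is there any difference in the number of values that can be represented in this range compared with a more reasonable range such as [0.0, 1000.0]?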
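Answer 0 (score: 2)
This answer assumes that float maps to the binary32 type specified by the IEEE-754 (2008) standard. For normalized binary32 operands, that is, those in [2^-126, 2^128), there are always 2^23 encodings per binade, because the stored significand has 23 bits. In general, determining the exact number of binary32 encodings in an interval is a bit tricky, for example because of rounding effects: not all powers of 10 are exactly representable. This affects where the start and end points fall within a binade, and we also need to account for the subnormals in [0, 2^-126).
To first order, however, we can estimate that [10^30, 10^35] contains roughly the same number of encodings as [10^-2, 10^3], so the interval [0, 10^3] contains more binary32 numbers than the interval [10^30, 10^35].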
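A lazy way to establish exact counts is to count the encodings in a given interval by brute force. The C and C++ standard math libraries provide the function nextafterf, which increments or decrements a given binary32 operand to its nearest neighbor in a given direction. So we can simply count how many times we can do this within the specified interval. An ISO C99 program using this approach is shown below. On modern hardware it takes only a few seconds to give us the desired answers: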
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h>
/* count the binary32 numbers in the closed interval [start, stop] */
void countem (float start, float stop)
{
    float x;
    uint32_t count;
    count = 0;
    x = start;
    /* step from one representable value to the next until we pass stop */
    while (x <= stop) {
        count++;
        x = nextafterf (x, INFINITY);
    }
    /* PRIu32 is the portable format specifier for uint32_t */
    printf ("there are %" PRIu32 " binary32 numbers in [%15.8e, %15.8e]\n", count, start, stop);
}
int main (void)
{
    countem (0.0f, 1000.0f);
    countem (1e-2f, 1e3f);
    countem (1e30f, 1e35f);
    return EXIT_SUCCESS;
}
The program reports:
there are 1148846081 binary32 numbers in [0.00000000e+000, 1.00000000e+003]
there are 139864311 binary32 numbers in [9.99999978e-003, 1.00000000e+003]
there are 139468867 binary32 numbers in [1.00000002e+030, 1.00000004e+035]
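Answer 1 (score: 1)
How many values can a float represent in a given range?
... since 32 bits can represent a fixed number of distinct values, a float can represent only a fixed number of values in any given range. Is that true?
Yes. Across the entire range of a typical float, about 2^32 different values can be represented.
Is any of that number of representable values lost because of the way conversion works?
That is a non sequitur. float does not define how other numeric representations are converted to float. printf(), scanf(), atof(), strtof(), (float) some_integer, (some_integer_type) some_float, and the compiler itself all perform conversions. C is lax about how well those conversions must be performed; quality libraries and compilers are expected to give the best possible result. For source code or "string" numbers such as "1.2345", there are infinitely many possible values, and they map to about 2^32 different float values. So yes, loss occurs.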
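... in the range [1030, 1035]. ... is there any difference in the number of values representable in this range compared with a more reasonable range such as [0.0, 1000.0]?
Yes. float values are distributed logarithmically, not linearly. There are about as many different float values in [1030, 1035] as in [1.030, 1.035] or [1.030e-3, 1.035e-3]. About 25% of all float values lie in [0.0 ... 1.0], so [0.0, 1000.0] contains many times more values than [1030, 1035].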
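Answer 2 (score: 0)
This is provided for informational purposes. It could be developed into something more readily usable, such as code that produces counts, samples for various ranges, or a discussion, but I am out of time and want to preserve the information.
For the IEEE-754 basic 32-bit binary floating-point format, the number N(x) of non-negative representable values less than or equal to a non-negative x is:
Thus the number of representable values x with a < x ≤ b is N(b) - N(a).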
Explanation:
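Answer 3 (score: 0)
This is code that counts the number of values representable in float in any finite range. It expects IEEE-754 arithmetic. I adapted it from my previous C++ answer.
It contains two implementations of converting a floating-point number to its encoding (one copies the bits, the other works the encoding out mathematically). After that, the distance calculation is quite simple (negative values must be adjusted, and then the distance is just a subtraction).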
#include <float.h>
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tgmath.h>

/*  Define a value with only the high bit of a uint32_t set.  This is also the
    encoding of floating-point -0.
*/
static const uint32_t HighBit = UINT32_MAX ^ UINT32_MAX>>1;

//  Return the encoding of a floating-point number by copying its bits.
static uint32_t EncodingBits(float x)
{
    uint32_t result;
    memcpy(&result, &x, sizeof result);
    return result;
}

//  Return the encoding of a floating-point number by using math.
static uint32_t EncodingMath(float x)
{
    static const int SignificandBits = FLT_MANT_DIG;
    static const int MinimumExponent = FLT_MIN_EXP;

    //  Encode the high bit.
    uint32_t result = signbit(x) ? HighBit : 0;

    //  If the value is zero, the remaining bits are zero, so we are done.
    if (x == 0) return result;

    /*  The C library provides a little-known routine to split a floating-point
        number into a significand and an exponent.  Note that this produces a
        normalized significand, not the actual significand encoding.  Notably,
        it brings significands of subnormals up to at least 1/2.  We will
        adjust for that below.  Also, this routine normalizes to [1/2, 1),
        whereas IEEE 754 is usually expressed with [1, 2), but that does not
        bother us.
    */
    int xe;
    float xf = frexp(fabs(x), &xe);

    //  Test whether the number is subnormal.
    if (xe < MinimumExponent)
    {
        /*  For a subnormal value, the exponent encoding is zero, so we only
            have to insert the significand bits.  This scales the significand
            so that its low bit is scaled to the 1 position and then inserts it
            into the encoding.
        */
        result |= (uint32_t) ldexp(xf, xe - MinimumExponent + SignificandBits);
    }
    else
    {
        /*  For a normal value, the significand is encoded without its leading
            bit.  So we subtract .5 to remove that bit and then scale the
            significand so its low bit is scaled to the 1 position.
        */
        result |= (uint32_t) ldexp(xf - .5, SignificandBits);

        /*  The exponent is encoded with a bias of (in C++'s terminology)
            MinimumExponent - 1.  So we subtract that to get the exponent
            encoding and then shift it to the position of the exponent field.
            Then we insert it into the encoding.
        */
        result |= ((uint32_t) xe - MinimumExponent + 1) << (SignificandBits-1);
    }

    return result;
}
/*  Return the encoding of a floating-point number.  For illustration, we
    get the encoding with two different methods and compare the results.
*/
static uint32_t Encoding(float x)
{
    uint32_t xb = EncodingBits(x);
    uint32_t xm = EncodingMath(x);

    if (xb != xm)
    {
        fprintf(stderr, "Internal error encoding %.99g.\n", x);
        fprintf(stderr, "\tEncodingBits says %#" PRIx32 ".\n", xb);
        fprintf(stderr, "\tEncodingMath says %#" PRIx32 ".\n", xm);
        exit(EXIT_FAILURE);
    }

    return xb;
}

/*  Return the distance from a to b as the number of values representable in
    float from one to the other.  b must be greater than or equal to a.  0 is
    counted only once.
*/
static uint32_t Distance(float a, float b)
{
    uint32_t ae = Encoding(a);
    uint32_t be = Encoding(b);

    /*  For represented values from +0 to infinity, the IEEE 754 binary
        floating-points are in ascending order and are consecutive.  So we can
        simply subtract two encodings to get the number of representable values
        between them (including one endpoint but not the other).

        Unfortunately, the negative numbers are not adjacent and run the other
        direction.  To deal with this, if the number is negative, we transform
        its encoding by subtracting from the encoding of -0.  This gives us a
        consecutive sequence of encodings from the greatest magnitude finite
        negative number to the greatest finite number, in ascending order
        except for wrapping at the maximum uint32_t value.

        Note that this also maps the encoding of -0 to 0 (the encoding of +0),
        so the two zeroes become one point, so they are counted only once.
    */
    if (HighBit & ae) ae = HighBit - ae;
    if (HighBit & be) be = HighBit - be;

    //  Return the distance between the two transformed encodings.
    return be - ae;
}

static void Try(float a, float b)
{
    printf("[%.99g, %.99g] contains %" PRIu32 " representable values.\n",
        a, b, Distance(a, b) + 1);
}
int main(void)
{
    if (sizeof(float) != sizeof(uint32_t))
    {
        fprintf(stderr, "Error, uint32_t must be the same size as float.\n");
        exit(EXIT_FAILURE);
    }

    /*  Prepare some test values:  smallest positive (subnormal) value, largest
        subnormal value, smallest normal value.
    */
    float S1 = FLT_TRUE_MIN;
    float N1 = FLT_MIN;
    float S2 = N1 - S1;

    //  Test 0 <= a <= b.
    Try( 0, 0);
    Try( 0, S1);
    Try( 0, S2);
    Try( 0, N1);
    Try( 0, 1./3);
    Try(S1, S1);
    Try(S1, S2);
    Try(S1, N1);
    Try(S1, 1./3);
    Try(S2, S2);
    Try(S2, N1);
    Try(S2, 1./3);
    Try(N1, N1);
    Try(N1, 1./3);

    //  Test a <= b <= 0.
    Try(-0., -0.);
    Try(-S1, -0.);
    Try(-S2, -0.);
    Try(-N1, -0.);
    Try(-1./3, -0.);
    Try(-S1, -S1);
    Try(-S2, -S1);
    Try(-N1, -S1);
    Try(-1./3, -S1);
    Try(-S2, -S2);
    Try(-N1, -S2);
    Try(-1./3, -S2);
    Try(-N1, -N1);
    Try(-1./3, -N1);

    //  Test a <= 0 <= b.
    Try(-0., +0.);
    Try(-0., S1);
    Try(-0., S2);
    Try(-0., N1);
    Try(-0., 1./3);
    Try(-S1, +0.);
    Try(-S1, S1);
    Try(-S1, S2);
    Try(-S1, N1);
    Try(-S1, 1./3);
    Try(-S2, +0.);
    Try(-S2, S1);
    Try(-S2, S2);
    Try(-S2, N1);
    Try(-S2, 1./3);
    Try(-N1, +0.);
    Try(-N1, S1);
    Try(-N1, S2);
    Try(-N1, N1);
    Try(-N1, 1./3);
    Try(-1./3, +0.);
    Try(-1./3, S1);
    Try(-1./3, S2);
    Try(-1./3, N1);
    Try(-1./3, 1./3);

    return 0;
}
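Answer 4 (score: 0)
Maybe I am overlooking something here, but if you look at the bit pattern of IEEE-754 binary32 (see the layout diagram on Wikipedia: https://en.wikipedia.org/wiki/Single-precision_floating-point_format), you know that it is decoded as:
(-1)^b_31 × (1 + Sum(b_(23-i) × 2^(-i); i = 1 ... 23)) × 2^(e-127)
You can then see that the lowest value of the exponent field e is 0 and the highest is 255. If you multiply by 2^127, you see that the order of two floating-point numbers with the same fraction is defined by the order of their exponents e, which are integers. So, if you want to sort non-negative IEEE-754 binary32 numbers from low to high, the order of the floating-point numbers is effectively the same as the order of the integers formed by the same bit patterns (for negative numbers the integer order is reversed). Therefore, if you want to know the distance between two such floating-point numbers, you only need to subtract the corresponding integers from each other (this assumes +0 and -0 are treated as equal):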
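#include <cstdint>
#include <cstring>
/* count the binary32 numbers in the half-open interval [start, stop); assumes +0 <= start <= stop */
std::int32_t distance (float start, float stop)
{
    std::int32_t a, b;
    std::memcpy(&a, &start, sizeof a);  // copy the bits; reinterpret_cast<int *> would violate strict aliasing
    std::memcpy(&b, &stop, sizeof b);
    return b - a;
}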