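What happens if you cast a big int to float

Time: 2014-09-06 14:36:17

Tags: c bit-manipulation

This is a general question about what happens when I cast a very big/small SIGNED integer to a floating point number using gcc 4.4.

I see some weird behaviour when doing the cast. Here are some examples:

The MUST BE value is obtained with this method: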

float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));

./btest -f float_i2f -1 0x80800001
input:          10000000100000000000000000000001
absolute value: 01111111011111111111111111111111

exponent:       10011101
mantissa:       00000000011111101111111111111111  (right shifted absolute value)

EXPECT:         11001110111111101111111111111111  (sign|exponent|mantissa)
MUST BE:        11001110111111110000000000000000  (sign ok, exponent ok,
                                                     mantissa???)

./btest -f float_i2f -1 0x3f7fffe0

EXPECT:    01001110011111011111111111111111
MUST BE:   01001110011111100000000000000000

./btest -f float_i2f -1 0x80004999                                                                  


EXPECT:    11001110111111111111111101101100
MUST BE:   11001110111111111111111101101101    (<- 1 added at the end)

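What bothers me is that in both examples the mantissa differs from what I get if I just shift my integer value to the right. The zeros at the end, for instance. Where do they come from?

I only see this behaviour on big/small values. Values in the range -2^24 to 2^24 work fine.

I wonder if someone can tell me what happens here. What steps are taken for very big/small values?

This is an add-on question to: function to convert float to int (huge integers), which is not as general as this one.

EDIT Code: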

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* calculate mantissa */
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  int res = sign << 31;
  res |= (e << 23);
  res |= m;

  return res;
}

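EDIT 2:

After the remarks of Adams and the reference to the book Write Great Code, I updated my routine with rounding. I still get some rounding errors (fortunately now only 1 bit off).

Now if I do a test run, I get mostly good results, but with a couple of rounding errors: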

input:  0xfefffff5
result: 11001011100000000000000000000101
GOAL:   11001011100000000000000000000110  (1 too low)

input:  0x7fffff
result: 01001010111111111111111111111111
GOAL:   01001010111111111111111111111110  (1 too high)

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* mask to check which bits get shifted out when rounding */
  static unsigned masks[24] = {
    0, 1, 3, 7, 
    0xf, 0x1f, 
    0x3f, 0x7f, 
    0xff, 0x1ff, 
    0x3ff, 0x7ff, 
    0xfff, 0x1fff, 
    0x3fff, 0x7fff, 
    0xffff, 0x1ffff, 
    0x3ffff, 0x7ffff, 
    0xfffff, 0x1fffff, 
    0x3fffff, 0x7fffff
  };

  /* mask to check whether to round up or down */
  static unsigned HOmasks[24] = {
    0,
    1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,
    0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
  };

  int S = a & masks[8];
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  if (S > HOmasks[8]) {
    /* round up */
    m += 1;
  } else if (S == HOmasks[8]) {
    /* round down */
    m = m + (m & 1);
  }

  /* special case where last bit of exponent is also set in mantissa
   * and mantissa itself is 0 */
  if (m & (0x1 << 23)) {
    e += 1;
    m = 0;
  }

  int res = sign << 31;
  res |= (e << 23);
  res |= m;
  return res;
}

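Does anyone know where the problem lies?

2 answers:

Answer 0: (score: 3)

A 32-bit float uses some of its bits for the exponent, so it cannot represent all 32-bit integer values exactly.

A 64-bit double can store any 32-bit integer value exactly.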
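
The two statements above can be checked with a round trip through each type. This is a minimal sketch, assuming float is IEEE 754 binary32 and double is binary64; the helper names survives_float and survives_double are made up for illustration and are not part of the original answer.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if x is unchanged after int -> float -> integer.
 * volatile forces the value to be stored at float precision. */
static int survives_float(int32_t x)
{
    volatile float f = (float)x;
    return (int64_t)f == (int64_t)x;   /* int64_t avoids overflow when f == 2^31 */
}

/* Hypothetical helper: the same round trip through double. */
static int survives_double(int32_t x)
{
    volatile double d = (double)x;
    return (int64_t)d == (int64_t)x;
}
```

Under those assumptions, survives_double returns 1 for every int32_t value, while survives_float starts returning 0 at 2^24 + 1.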

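Wikipedia has a short entry on IEEE 754 floating point, and many details on floating point internals under IEEE 754-1985; the current standard is IEEE 754:2008. It notes that a 32-bit float uses one bit for the sign and 8 bits for the exponent, leaving 23 explicit bits and 1 implicit bit for the mantissa, which is why absolute values up to 2^24 can be represented exactly.

"I thought it was clear that a 32-bit integer can't be stored exactly in a 32-bit float. My question is: what happens if I store an integer bigger than 2^24 or smaller than -2^24? And how can I replicate it?"

Once the absolute value is greater than 2^24, not every integer value can be represented exactly in the 24 effective digits of the mantissa of a 32-bit float, so only the leading 24 binary digits are reliably available. Floating point rounding also kicks in.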

You can demonstrate this with code similar to the following:

#include <inttypes.h>
#include <stdio.h>

typedef union Ufloat
{
    uint32_t    i;
    float       f;
} Ufloat;

static void dump_value(uint32_t i, uint32_t v)
{
    Ufloat u = { .i = v };
    printf("0x%.8" PRIX32 ": 0x%.8" PRIX32 " = %15.7e = %15.6A\n", i, v, u.f, u.f);
}

int main(void)
{
    uint32_t lo = 1 << 23;
    uint32_t hi = 1 << 28;
    Ufloat u;

    for (uint32_t v = lo; v < hi; v <<= 1)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    lo = (1 << 24) - 16;
    hi = lo + 64;

    for (uint32_t v = lo; v < hi; v++)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    return 0;
}

Sample output:

0x00800000: 0x4B000000 =   8.3886080e+06 =  0X1.000000P+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x02000000: 0x4C000000 =   3.3554432e+07 =  0X1.000000P+25
0x04000000: 0x4C800000 =   6.7108864e+07 =  0X1.000000P+26
0x08000000: 0x4D000000 =   1.3421773e+08 =  0X1.000000P+27
0x00FFFFF0: 0x4B7FFFF0 =   1.6777200e+07 =  0X1.FFFFE0P+23
0x00FFFFF1: 0x4B7FFFF1 =   1.6777201e+07 =  0X1.FFFFE2P+23
0x00FFFFF2: 0x4B7FFFF2 =   1.6777202e+07 =  0X1.FFFFE4P+23
0x00FFFFF3: 0x4B7FFFF3 =   1.6777203e+07 =  0X1.FFFFE6P+23
0x00FFFFF4: 0x4B7FFFF4 =   1.6777204e+07 =  0X1.FFFFE8P+23
0x00FFFFF5: 0x4B7FFFF5 =   1.6777205e+07 =  0X1.FFFFEAP+23
0x00FFFFF6: 0x4B7FFFF6 =   1.6777206e+07 =  0X1.FFFFECP+23
0x00FFFFF7: 0x4B7FFFF7 =   1.6777207e+07 =  0X1.FFFFEEP+23
0x00FFFFF8: 0x4B7FFFF8 =   1.6777208e+07 =  0X1.FFFFF0P+23
0x00FFFFF9: 0x4B7FFFF9 =   1.6777209e+07 =  0X1.FFFFF2P+23
0x00FFFFFA: 0x4B7FFFFA =   1.6777210e+07 =  0X1.FFFFF4P+23
0x00FFFFFB: 0x4B7FFFFB =   1.6777211e+07 =  0X1.FFFFF6P+23
0x00FFFFFC: 0x4B7FFFFC =   1.6777212e+07 =  0X1.FFFFF8P+23
0x00FFFFFD: 0x4B7FFFFD =   1.6777213e+07 =  0X1.FFFFFAP+23
0x00FFFFFE: 0x4B7FFFFE =   1.6777214e+07 =  0X1.FFFFFCP+23
0x00FFFFFF: 0x4B7FFFFF =   1.6777215e+07 =  0X1.FFFFFEP+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000001: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000002: 0x4B800001 =   1.6777218e+07 =  0X1.000002P+24
0x01000003: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000004: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000005: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000006: 0x4B800003 =   1.6777222e+07 =  0X1.000006P+24
0x01000007: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000008: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000009: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x0100000A: 0x4B800005 =   1.6777226e+07 =  0X1.00000AP+24
0x0100000B: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000C: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000D: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000E: 0x4B800007 =   1.6777230e+07 =  0X1.00000EP+24
0x0100000F: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000010: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000011: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000012: 0x4B800009 =   1.6777234e+07 =  0X1.000012P+24
0x01000013: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000014: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000015: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000016: 0x4B80000B =   1.6777238e+07 =  0X1.000016P+24
0x01000017: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000018: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000019: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x0100001A: 0x4B80000D =   1.6777242e+07 =  0X1.00001AP+24
0x0100001B: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001C: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001D: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001E: 0x4B80000F =   1.6777246e+07 =  0X1.00001EP+24
0x0100001F: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000020: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000021: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000022: 0x4B800011 =   1.6777250e+07 =  0X1.000022P+24
0x01000023: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000024: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000025: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000026: 0x4B800013 =   1.6777254e+07 =  0X1.000026P+24
0x01000027: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000028: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000029: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x0100002A: 0x4B800015 =   1.6777258e+07 =  0X1.00002AP+24
0x0100002B: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002C: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002D: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002E: 0x4B800017 =   1.6777262e+07 =  0X1.00002EP+24
0x0100002F: 0x4B800018 =   1.6777264e+07 =  0X1.000030P+24

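The first part of the output demonstrates that some integer values can still be stored exactly; specifically, powers of 2 can be stored exactly. In fact, more precisely (but less concisely), any integer whose absolute value has a binary representation with no more than 24 significant digits (any trailing digits being zeros) can be represented exactly. The values can't necessarily be printed exactly, but that is a separate concern from storing them exactly.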
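
The "no more than 24 significant digits" criterion can be written as a small predicate. This is a sketch only, assuming a binary32 float; fits_in_float is a hypothetical name, not something from the answer's demo program.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if x should be exactly representable as a
 * binary32 float, i.e. the span from its highest to its lowest set bit
 * is at most 24 bits wide. */
static int fits_in_float(int32_t x)
{
    /* magnitude computed in unsigned arithmetic so INT32_MIN is safe */
    uint32_t a = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;
    if (a == 0)
        return 1;
    while ((a & 1u) == 0)       /* strip trailing zero bits */
        a >>= 1;
    return a < (1u << 24);      /* at most 24 significant bits remain */
}
```

For instance, 2^24 + 2 passes the check (24 significant bits plus one trailing zero), whereas 2^24 + 1 needs 25 significant bits and fails it.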

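The second (larger) part of the output demonstrates that up to 2^24-1, the integer values can be represented exactly. The value of 2^24 itself is also exactly representable, but 2^24+1 is not, so it appears the same as 2^24. By contrast, 2^24+2 can be represented with just 24 binary digits followed by a single zero, and hence can be represented exactly. Repeat ad nauseam for increments larger than 2. It looks as though 'round even' mode is in effect; that is why the results show 1 value then 3 values.

(I note in passing that there is no way to stipulate that the double passed to printf(), converted from float by the rules of default argument promotions (ISO/IEC 9899:2011 §6.5.2.2 Function calls, ¶6), be printed as a float. Logically a length modifier for float would be used, but none is defined.)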

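Answer 1: (score: 1)

C/C++ floating point numbers tend to be compatible with the IEEE 754 floating point standard (for example in gcc). The zeros come from the rounding rules.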

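Shifting a number to the right makes some bits on the right-hand side go away. Let's call them guard bits. Let's also call HO the highest-order bit and LO the lowest-order bit of our number. Now suppose that the guard bits are still a part of our number. For example, if we have 3 guard bits, the value of our LO bit is 8 (if it is set). Now if:

  1. value of guard bits > 0.5 * value of LO
    • round the number up to the smallest greater value possible, ignoring the sign
  2. value of guard bits == 0.5 * value of LO
    • use the current number value if LO == 0
    • number += 1 otherwise
  3. value of guard bits < 0.5 * value of LO
    • use the current number value

"Why do 3 guard bits mean the LO value is 8?"

Suppose we have a binary 8-bit number:

    weights:      128 64 32 16 8 4 2 1
    binary num:     0  0  0  0 1 1 1 1

Let's shift it right by 3 bits:

    weights:      x x x 128 64 32 16 8 | 4 2 1
    binary num:   0 0 0   0  0  0  0 1 | 1 1 1

As you can see, with 3 guard bits the LO bit ends up at the 4th position and has a weight of 8. This holds only for the purpose of rounding. The weights have to be 'normalized' afterwards, so that the weight of the LO bit becomes 1 again.

"And how can I check with bit operations whether guard bits > 0.5 * value?"

The fastest way is to use lookup tables. Suppose we are working on an 8-bit number:

    unsigned number;            // our number
    unsigned bitsToShift;       // number of bits to shift
    assert(bitsToShift < 8);    // 8 bits
    unsigned guardMasks[8] = {0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f};
    unsigned LOvalues[8] = {0, 1, 2, 4, 0x8, 0x10, 0x20, 0x40};  // divided by 2 for faster comparison
    unsigned guardBits = number & guardMasks[bitsToShift];       // value of the guard bits
    number = number >> bitsToShift;
    if (guardBits > LOvalues[bitsToShift]) {
        ...
    } else if (guardBits == LOvalues[bitsToShift]) {
        ...
    } else { // guardBits < LOvalues[bitsToShift]
        ...
    }

Reference: Write Great Code, Volume 1, by Randall Hyde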
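
To tie the rounding rule back to the question, here is a sketch of an int-to-float bit conversion that applies it. It assumes a 32-bit int and IEEE 754 binary32 with round-to-nearest-even; the names i2f_bits and cast_bits are made up for illustration. One detail appears to matter for the remaining 1-bit errors in the question's second routine: the guard bits are taken from the normalized value (t in the question), whereas that routine computes S from a, the un-normalized absolute value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: bit pattern of (float)x, computed with integer ops only. */
static uint32_t i2f_bits(int32_t x)
{
    if (x == 0)
        return 0;
    uint32_t sign = x < 0 ? 0x80000000u : 0u;
    /* magnitude in unsigned arithmetic, so INT32_MIN does not overflow */
    uint32_t a = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;
    uint32_t e = 158;                    /* bias 127 + 31 */
    while (!(a & 0x80000000u)) {         /* normalize: leading 1 moves to bit 31 */
        a <<= 1;
        e--;
    }
    uint32_t m = (a >> 8) & 0x7fffffu;   /* 23 explicit mantissa bits */
    uint32_t guard = a & 0xffu;          /* the 8 bits shifted out of the NORMALIZED value */
    /* above half: round up; exactly half: round up only if m is odd */
    if (guard > 0x80u || (guard == 0x80u && (m & 1u)))
        m++;
    /* adding (not OR-ing) lets a mantissa overflow carry into the exponent */
    return sign | ((e << 23) + m);
}

/* Reference value: what the compiler's cast produces, as in the question. */
static uint32_t cast_bits(int32_t x)
{
    float f = (float)x;
    uint32_t r;
    memcpy(&r, &f, sizeof r);
    return r;
}
```

Working through the question's first input by hand: 0x80800001 has magnitude 0x7F7FFFFF, which normalizes to 0xFEFFFFFE. The kept mantissa bits are 0x7EFFFF and the guard bits are 0xFE, which is more than half (0x80), so the mantissa rounds up to 0x7F0000. That carry is where the trailing zeros in the MUST BE value come from.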