s |指数|尾数

Question

我正在硬件上设计一个单精度浮点乘法器，并且使用python来验证我的设计。这是单精度浮点格式。

s |指数|尾数

相乘的结果是

（S1 ^ S2）| E1 + E2-127 |尾数1 *尾数2

在执行两个尾数的乘法运算后，在四舍五入数字时遇到问题。您知道尾数是24位（23和1个隐藏位），将两个24位相乘将得到48位。尾数字段只能包含23位，因此我尝试将其截断，如下所示： example of mantissa truncating。但是，与python float32乘法相比，结果似乎不正确。
所以我在这里问python如何处理尾数乘法。谢谢。

Answer 1

Python（和许多其他语言）使用IEEE754默认舍入模式，称为“舍入到最接近的关系到偶数”。您可以在这里了解更多信息：https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules

Answer 2

在这个答案中，我将说明如何针对基本的32位二进制浮点使用IEEE-754舍入到最近的关系到偶数规则舍入。这可能是您想要的。尽管许多Python实现使用IEEE-754，但是Python文档并不需要这样做，即使在确实使用IEEE-754的实现中，也可能与IEEE-754标准有所不同，因此您不应该将Python作为权威参考。

给出一个48位有效数¹，让我们对从最高有效位的0到最低有效位的47位进行编号。（这与通常的方向相反，但是对于本次讨论很方便。）

如果乘积在浮点格式的正常范围内，则有效位数将被四舍五入：

请考虑第24位及其后的尾随位。如果位24为0，则向下舍入（仅使用位0至23）。如果位24为1并且任何尾随位为1，则将其向上舍入（取0到23的位并加1）。如果位24为1并且所有尾随位均为0，则如果位23为偶数（零）则向下舍入，如果位23为奇数（一）则向上舍入。
四舍五入时，这可能会使位0到23溢出（如果它们是1111…111，则加法从位0进行）。在这种情况下，将它们右移一位，然后将其加到指数上即可。

如上所述，以上是针对正常情况的。处理所有情况：

测试零：如果有效位数为零，则返回零（带有计算的符号）。
标准化有效位数：如果有效位数字段的位0为0，则将有效位数左移一位，然后从指数中减去1。重复此步骤，直到有效位数的位0不为0为止。
测试溢出：：如果真实指数超过127（有偏指数超过254），则产生无穷大（带有计算的符号）。
限制指数：如果真实指数（我称为 e ）小于-126（有偏指数小于1），则将有效位数右移通过-126- e 位，在左侧插入0。将指数设置为-126。如上所述对有效位数进行四舍五入，位数从0到47 +（-126- e ）。
测试溢出：：如果舍入使指数增加到127，则产生无穷大（带有计算的符号）。
产生正常结果：：如果有效位的位0为1，则将有效位字段用于位1到23，将指数字段使用 e +127。
产生次正态结果：否则，有效位的位0为0。在这种情况下，有效位字段仍使用位1至23，但指数字段使用零。此指数字段表示次标准数字。

请注意，“限制指数”步骤似乎会使“有效数标准化”步骤撤消，从而使前一步骤变得不必要，但是此组合可处理所有产生正常和非正规结果的正常和非正规操作数。

脚注

¹“有效位数”是IEEE-754标准中用于浮点数小数部分的术语。 “ Mantissa”是对数的分数部分的旧术语。（有效数字是线性的。尾数是对数的。）

Answer 3

过去，我已经将此精确的事情作为求职申请的练习：

Python脚本可生成随机浮点并创建包含IEEE754值和乘积的测试向量：

import bitstring, random 

spread = 10000000
n = 100000

def ieee754(flt):
    b = bitstring.BitArray(float=flt, length=32)
    return b

if __name__ == "__main__":
    with open("TestVector", "w") as f:
        for i in range(n):
            a = ieee754(random.uniform(-spread, spread))
            b = ieee754(random.uniform(-spread, spread))

            # calculate desired product based on 32-bit ieee 754 
            # representation of the floating point of each operand
            ab = ieee754(a.float * b.float)

            f.write(a.bin + "_" + b.bin + "_" + ab.bin + "\n")

Verilog代码使用最近的偶数舍入进行32位浮点乘法：

/*   Input and outputs are in single-precision IEEE 754 FP format:
     Sign     -   1 bit, position 32
     Exponent -  8 bits, position 31-24
     Mantissa - 23 bits, position 22-1

Not implemented: special cases like inf, NaN
*/

// indices of components of IEEE 754 FP
`define SIGN    31
`define EXP     30:23
`define M       22:0

`define P       24    // number of bits for mantissa (including 
`define G       23    // guard bit index
`define R       22    // round bit index
`define S       21:0  // sticky bits range

`define BIAS    127


module fp_mul(input  wire clk,
              input  wire[31:0] a,
              input  wire[31:0] b,
              output wire[31:0] y);

    reg [`P-2:0] m;
    reg [7:0] e;
    reg s;

    reg [`P*2-1:0] product;
    reg G;
    reg R;
    reg S;
    reg normalized;

    reg state;
    reg next_state = 0;

    parameter STEP_1 = 1'b0, STEP_2 = 1'b1;


    always @(posedge clk) begin
        state <= next_state;
    end


    always @(state) begin

        case(state)
            STEP_1: begin
                // mantissa is product of a and b's mantissas, 
                // with a 1 added as the MSB to each
                product = {1'b1, a[`M]} * {1'b1, b[`M]};

                // get sticky bits by ORing together all bits right of R
                S = |product[`S]; 

                // if the MSB of the resulting product is 0
                // normalize by shifting right    
                normalized = product[47];
                if(!normalized) product = product << 1; 

                next_state = STEP_2;            
                end 

            STEP_2: begin
                // if either mantissa is 0, result is 0 
                if(!a[`M] | !b[`M]) begin 
                    s = 0; e = 0; m = 0;
                end else begin
                    // sign is xor of signs
                    s = a[`SIGN] ^ b[`SIGN];

                    // mantissa is upper 22-bits of product w/ nearest-even rounding
                    m = product[46:24] + (product[`G] & (product[`R] | S));

                    // exponent is sum of a and b's exponents, minus the bias 
                    // if the mantissa was shifted, increment the exponent to balance it
                    e = a[`EXP] + b[`EXP] - `BIAS + normalized;
                end 

                next_state = STEP_1;
                end
        endcase
    end

    // output is concatenation of sign, exponent, and mantissa  
    assign y = {s, e, m};
endmodule

希望有帮助！

python如何在乘法后舍入float32数字？

s |指数|尾数

（S1 ^ S2）| E1 + E2-127 |尾数1 *尾数2

3 个答案:

脚注