Question

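I am trying to multiply two 16-bit numbers using the shift-and-add method in assembly language, storing the high part of the result in the dx register and the low part in the ax register. The multiplicand and multiplier are passed on the stack. For some of my tests I get the right answer, but for others the high part held in dx is wrong. For example, if I do 0001 times 0001 I get dx = 0002 ax = 0002, when the answer should be dx = 0000 ax = 0002.

Here is my code. I can't figure out where it goes wrong. I even worked this example by hand and don't see how the dx = 0002 part gets there.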

;---------------------------------------
; Multiply data
;---------------------------------------

h         dw        0                   ; this holds the high order bits

mltplier  dw        0                   ; this holds the mulitplier

     .code
;---------------------------------------
; Multiply code
;---------------------------------------
_multiply:                             ;
     push      bp                  ; save bp
     mov       bp,sp               ; anchor bp into the stack
     mov       bx,[bp+4]           ; load multiplicand from the stack
     mov       cx,[bp+6]           ; load multiplier   from the stack
     mov       [mltplier],cx       ;
     mov       cx,0Fh              ; make counter of 16
     mov       ax,0                ;
     mov       dx,0                ;

;  calculate multiplicand * multiplier
;  return result in dx:ax
_loop:
     shr       [mltplier],1        ; shift right by 1
     jnc       shift               ; if the number shifted out was not a 1           
                                   ;then we don't need to add anything
     clc                           ;clear carry flag
     add       ax,bx               ; add bx to ax, the low bits
     add       dx,[h]              ; add var to dx, the high bits
shift:                                 ;
     shl       [h],1               ; shift the high order bits left
     shl       bx,1                ; shift the low order bits left
     adc       [h],0               ; add to the high bits
     clc                           ;clear carry flag
    loop       _loop               ; loop the process
     pop       bp                  ; restore bp
     ret                           ; return with result in dx:ax
                                   ;
     end                           ; end source code
;---------------------------------------

Answer 1

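WeatherVane's comment probably solves the wrong-answer problem.

Some notes on efficiency:

Zero a register by XORing it with itself. It takes fewer instruction bytes than mov r, 0 and is better in every way. (Prefer xor over sub same,same or other options, because more CPUs recognize xor same,same as independent of the old value.)
You don't need clc after jnc: that clc is reachable only when carry is already clear. The clc before the loop instruction is also useless, because other instructions set or clear CF before the next adc runs.
Keeping variables in memory is slow. Keep mltplier in si or di instead of using shr [mltplier],1. (If you can use registers instead of memory locations inside the loop, it is worth a push/pop to save and restore a register once per function call.) Likewise, keep [h] in a register too.

If you do need to spill to memory, generally prefer the stack over global variables, so that your function is re-entrant and thread-safe. Especially for mltplier, you could use the value the caller already put on the stack instead of copying it.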

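loop is slow on modern x86 CPUs: on Haswell, for example, it has about 7 times the loop overhead of dec cx / jne. You can save a register and speed up the loop by looping while the multiplier != 0, instead of always making 16 trips through the loop. Then you can keep the multiplier in cx and loop with test cx, cx / jne.
Using di for h: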

 shl       [h],1               ; shift the high order bits left
 shl       bx,1                ; shift the low order bits left
 adc       [h],0               ; add to the high bits

could be:

 shl       bx, 1               ; shift the low order bits left
 adc       di, di              ; shift the high order bits left and add the carry

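If you are targeting 386 or later CPUs, the shld double-register shift works too, and the two instructions can run in parallel instead of one depending on the other: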

 shld      di, bx, 1
 shl       bx, 1

On Intel Sandybridge-family CPUs, shld r,r,i is cheaper than adc (1 uop vs. 2).
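
To make the equivalence of the two sequences concrete, here is a small C model of both. It is only a sketch: the struct and function names are hypothetical, and the 16-bit registers di and bx are modeled as uint16_t fields.

```c
#include <stdint.h>

/* Hypothetical model: a 32-bit value split across di (high) and bx (low). */
typedef struct { uint16_t di, bx; } pair16;

/* Models: shl bx,1 / adc di,di */
pair16 shift_with_adc(pair16 r) {
    unsigned carry = (r.bx >> 15) & 1u;             /* CF after shl bx,1 */
    r.bx = (uint16_t)(r.bx << 1);
    r.di = (uint16_t)(r.di + r.di + carry);         /* adc di,di */
    return r;
}

/* Models: shld di,bx,1 / shl bx,1 */
pair16 shift_with_shld(pair16 r) {
    r.di = (uint16_t)((r.di << 1) | (r.bx >> 15));  /* shld reads the old bx */
    r.bx = (uint16_t)(r.bx << 1);
    return r;
}
```

Both functions shift the combined di:bx value left by one bit; the only difference is whether the top bit of bx travels through the carry flag or is read directly by shld.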

See Agner Fog's instruction tables and guides, and the other links in the x86 tag wiki.
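
The suggestions above (state kept in registers, looping only while the multiplier is nonzero, and the shl/adc pair) can be modeled in C as a rough sketch. The function name is hypothetical, and the variables are named after the registers they stand in for. Note one assumption: the sketch carries the overflow of the low-word add into the high word (adc dx,di), which is what a correct 32-bit accumulate needs but is not spelled out in the notes above.

```c
#include <stdint.h>

/* Hypothetical C model of an LSB-first shift-and-add multiply.
   dx:ax = running product, di:bx = shifted multiplicand, cx = multiplier. */
uint32_t mul16_shift_add(uint16_t multiplicand, uint16_t multiplier) {
    uint16_t ax = 0, dx = 0;                 /* xor ax,ax / xor dx,dx */
    uint16_t bx = multiplicand, di = 0;
    uint16_t cx = multiplier;

    while (cx != 0) {                        /* test cx,cx / jne */
        unsigned bit = cx & 1u;              /* shr cx,1 moves this bit into CF */
        cx = (uint16_t)(cx >> 1);
        if (bit) {
            uint32_t lo = (uint32_t)ax + bx; /* add ax,bx */
            ax = (uint16_t)lo;
            /* adc dx,di (assumed): include the carry out of the low add */
            dx = (uint16_t)(dx + di + (lo >> 16));
        }
        unsigned carry = (bx >> 15) & 1u;    /* CF after shl bx,1 */
        bx = (uint16_t)(bx << 1);
        di = (uint16_t)(di + di + carry);    /* adc di,di */
    }
    return ((uint32_t)dx << 16) | ax;        /* result as dx:ax */
}
```

Because the loop exits as soon as cx reaches zero, small multipliers take only a few iterations, and no state survives between calls.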

Answer 2

This shows how to multiply two 16-bit values to get a 32-bit value (held in two 16-bit registers).

#include <stdio.h>

unsigned multiply16x16(unsigned short m, unsigned short n) {
    __asm {
        xor     ax,ax       ; clear the product
        xor     dx,dx
        mov     ecx,16      ; set up loop counter (LOOP counts in ECX in 32-bit code)
    nextbit:
        shl     ax,1        ; shift 32-bit product left
        adc     dx,dx
        shl     [m],1       ; get m.s. bit of multiplier
        jnc     noadd       ; ignore if not set
        add     ax,[n]      ; add multiplicand to product
        adc     dx,0        ; with carry
    noadd:
        loop    nextbit     ; repeat until the counter reaches 0
        mov     [m],ax      ; store in 16-bit operands
        mov     [n],dx
    }
    return ((unsigned)n << 16) + m;   // align and return as 32-bit unsigned
}

int main(void){
    unsigned short m, n;

    m=3; n=5;
    printf("%u\n", multiply16x16 (m,n));

    m=65535; n=2;
    printf("%u\n", multiply16x16 (m,n));

    m=987; n=654;
    printf("%u\n", multiply16x16 (m,n));

    m=123; n=456;
    printf("%u\n", multiply16x16 (m,n));

    m=65535; n=65535;
    printf("%u\n", multiply16x16 (m,n));

    return 0;
}

Program output:

15
131070
645498
56088
4294836225
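
The __asm block in the listing is specific to compilers that accept that inline syntax. As a rough portable sketch of the same MSB-first loop, the following C function mirrors each assembly step; the name multiply16x16_portable is hypothetical, and a single 32-bit variable stands in for the dx:ax pair.

```c
#include <stdint.h>

/* Hypothetical portable model of the MSB-first shift-and-add loop. */
uint32_t multiply16x16_portable(uint16_t m, uint16_t n) {
    uint32_t product = 0;            /* plays the role of dx:ax */
    for (int i = 0; i < 16; i++) {
        product <<= 1;               /* shl ax,1 / adc dx,dx */
        if (m & 0x8000u)             /* shl [m],1 shifts this bit into CF */
            product += n;            /* add ax,[n] / adc dx,0 */
        m = (uint16_t)(m << 1);      /* expose the next multiplier bit */
    }
    return product;
}
```

Calling it with the same five operand pairs used in main is a quick way to cross-check the assembly version.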

Answer 3

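A more efficient and more interesting approach is to subvert the exercise by synthesizing the forbidden unsigned multiply (mul) from the signed multiply (imul).

Flipping the MSB of an unsigned integer is equivalent to subtracting 8000h modulo 2^16, and maps the value into the range of a signed integer without underflow. Since (a-8000h)*(b-8000h) = a*b - a*8000h - b*8000h + 40000000h, we can compute (a-8000h)*(b-8000h) with imul and then add back a*8000h + b*8000h - 40000000h to obtain a*b.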

multiply:
    push bp
    mov bp,sp
    mov ax,[bp+4]
    xor ax,8000h
    mov dx,[bp+6]
    xor dx,8000h
    imul dx
    sub dx,4000h
    mov cx,[bp+4]
    add cx,[bp+6]
    rcr cx,1    ;Recover the carry lost while dividing
    jnc @f      ;by two and adding back the a+b term
    add ax,8000h
@@: adc dx,cx
    pop bp
    ret
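
As a sanity check of the identity, here is a minimal C sketch of the same steps. The function name is hypothetical, and 32-bit arithmetic replaces the dx:ax register pair, so the carry handling done by rcr/adc in the routine above collapses into a single shift and add.

```c
#include <stdint.h>

/* Hypothetical C model of the imul-based unsigned multiply. */
uint32_t mul16_via_signed(uint16_t a, uint16_t b) {
    /* xor with 8000h, reinterpreted as signed, equals value - 8000h */
    int32_t sa = (int32_t)a - 0x8000;
    int32_t sb = (int32_t)b - 0x8000;
    int32_t p = sa * sb;                     /* imul: magnitude at most 2^30 */
    uint32_t r = (uint32_t)p - 0x40000000u;  /* sub dx,4000h */
    r += ((uint32_t)a + b) << 15;            /* add back (a+b)*8000h */
    return r;                                /* unsigned math wraps mod 2^32 */
}
```

The intermediate value r may wrap below zero before the final add; because unsigned arithmetic is modulo 2^32, the result still comes out as a*b.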

(For the record, this is more of a comment posted in answer form because of its length.)

Multiply two 16-bit numbers and store the 32-bit answer in dx:ax without using the mul instruction in 8086 assembly
