Question

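In my 80x86 assembly program, I am trying to calculate (((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5) ... (2^n), where each even exponent is preceded by a multiplication and each odd exponent by an addition. I have code, but my result is off from the expected one. With n = 5 the result should be 352, i.e. ((((1 + 2) * 4) + 8) * 16) + 32, but I get 330.

Any and all advice would be greatly appreciated.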

.586
.model flat

include io.h

.stack 4096

.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?

.code
_MainProc proc
    input prompt, string, 40
    atod string
    push eax


    call power



    add esp, 4

    dtoa result, eax
    output lbl_msg, result

    mov eax, 0
    ret

_MainProc endp

power proc
    push ebp
    mov ebp, esp

    push ecx

    mov bool, 1     ;initial boolean value
    mov eax, 1
    mov runtot, 2   ;to keep a running total
    mov ecx, [ebp + 8]

    jecxz done

loop1:
    add eax, eax        ;power of 2
    test bool, ecx      ;test case for whether exp is odd/even
    jnz oddexp          ;if boolean is 1
    add runtot, eax     ;if boolean is 0
    loop loop1

oddexp:
    mov ebx, eax        ;move eax to separate register for multiplication
    mov eax, runtot     ;move existing total for multiplication
    mul ebx             ;multiplication of old eax to new eax/running total
    loop loop1

done:
    mov eax, runtot     ;move final runtotal for print
    pop ecx
    pop ebp
    ret




power endp



end

Answer 1

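You're over-complicating your code with static variables and branching.

These are powers of 2, so you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.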

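add eax,eax is the best way to multiply by 2 (i.e. left-shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that value left-shifted by 1 after an even iteration.

If you were trying to maintain a 2^i variable (a strength-reduction optimization that shifts by 1 each iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.

As Jester points out, if the first loop loop1 falls through, execution falls into oddexp:. When you do loop tail duplication, make sure you consider where fall-through goes from each tail if the loop ends there.

There's also no point in having a static variable called bool that holds a 1 and that you only use as an operand for test. It suggests to human readers that the mask sometimes needs to change; test ecx,1 is much clearer as a way to check whether the low bit is zero or non-zero.

You also don't need static storage for runtot; just use a register (such as EAX, where you want the result eventually anyway). 32-bit x86 has 7 registers (not counting the stack pointer).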

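This is how I'd do it. It's untested, but I simplified a lot by unrolling by 2. The test for odd/even then goes away, because the alternating pattern is hard-coded into the loop structure.

We increment and compare/branch twice in the loop, so unrolling doesn't eliminate the loop overhead; it just turns one of the loop branches into an if() break that can leave the loop from the middle.

This is not the most efficient way to write it; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n and leaving the loop when fewer than 2 steps remain. (Then sort out the leftovers in the epilogue.)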

;; UNTESTED
power proc   ; fastcall calling convention: arg: ECX = unsigned int n
    ; clobbers: ECX, EDX
    ; returns: EAX

    push  ebx           ; save a call-preserved register for scratch space

    mov   eax, 1        ; EAX = 2^0   running total / return value
    test  ecx, ecx
    jz    done

    mov   edx, ecx      ; EDX = n
    mov   ecx, 1        ; ECX = i=1..n loop counter and shift count

loop1:                  ; do{   // unrolled by 2
    ; add 2^odd power
    mov   ebx, 1
    shl   ebx, cl       ; 2^i         ; xor ebx, ebx; bts ebx, ecx
    add   eax, ebx      ; total += 2^i

    inc   ecx
    cmp   ecx, edx
    ja    done          ; if (++i > n) break;

    ; multiply by 2^even power
    shl   eax, cl       ; total <<= i;  // same as total *= (1<<i)

    inc   ecx           ; ++i
    cmp   ecx, edx
    jbe   loop1         ; }while(i<=n);

done:
    pop   ebx
    ret
power endp

I didn't check whether the add-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (set bit i). That is effectively an OR instead of an ADD, but the two are equivalent as long as the bit was previously clear.
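As a cross-check on the control flow of the unrolled loop above, here is a minimal Python sketch of the same structure. It assumes 32-bit wraparound like EAX; the names `shiftadd_power_unrolled` and `MASK32` are illustrative and not part of the original code.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit register such as EAX (assumption)

def shiftadd_power_unrolled(n):
    """Model of the unrolled-by-2 loop: odd i adds 2^i, even i shifts left by i."""
    total = 1                                  # 2^0
    if n == 0:
        return total
    i = 1
    while True:
        total = (total + (1 << i)) & MASK32    # add 2^odd power
        i += 1
        if i > n:                              # early exit from the middle of the loop
            break
        total = (total << i) & MASK32          # multiply by 2^even power
        i += 1
        if i > n:                              # normal loop exit at the bottom
            break
    return total

print([shiftadd_power_unrolled(n) for n in range(6)])  # [1, 3, 12, 20, 320, 352]
```

Both exits stop only once i has passed n, so the step for i == n is always performed.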

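To make the asm look more like the source and to avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of the xor ebx,ebx / bts ebx, ecx sequence, which is more efficient on Intel. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: the flags must be left unmodified if count = 0.) But that is worse on AMD Ryzen, where bts reg,reg is 2 uops while shl reg,cl is 1.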

Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit in that case. But optimizations are possible with more branching.
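The carry question can also be checked mechanically. The sketch below (the function name `odd_steps_with_carry` is hypothetical) walks the sequence with Python's unbounded integers and records every odd i at which bit i is already set before 2^i is added, which is exactly when ADD and OR/BTS would give different results.

```python
def odd_steps_with_carry(max_n=31):
    """Return the odd i at which bit i is already set before adding 2^i."""
    total = 1
    carries = []
    for i in range(1, max_n + 1):
        if i % 2:                       # odd step: add 2^i
            if total & (1 << i):        # bit already set: ADD carries, OR would differ
                carries.append(i)
            total += 1 << i
        else:                           # even step: multiply by 2^i
            total <<= i
    return carries

print(odd_steps_with_carry())  # [3]
```

For n up to 31 only i = 3 is reported, which matches the update: BTS is safe for every odd step after the third.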

Using calc:

; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
shiftadd_power(n) defined
; base2(2)

; shiftadd_power(0)
        1 /* 1 */
...

The first few outputs are:

n   shiftadd(n) (base2)

0   1
1   11
2   1100
3   10100             ; 1100 + 1000 carries
4   101000000
5   101100000         ; 101000000 + 100000 set a bit that was previously 0
6   101100000000000
7   101100010000000   ; increasing amounts of trailing zero around the bit being flipped by ADD

Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.

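Instead of just peeling them, we can hard-code the starting point for i=3 for larger n, and optimize the code that works out the return value for the n<3 case. I came up with a branchless formula for that, based on right-shifting the 0b1100 bit pattern by 3, 2, or 0.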
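A small sketch of that low-n formula, written in Python purely for illustration (the name `power_small` is an assumption). The shift count is built as 2 * (n <= 1) + (n < 1), which is one way to express "double a 0/1 flag and add a second flag" without branching.

```python
def power_small(n):
    """Branchless-style lookup for n in 0..2: shift 0b1100 right by 3, 2, or 0."""
    if not 0 <= n <= 2:
        raise ValueError("only valid for n = 0, 1, 2")
    shift = 2 * (n <= 1) + (n < 1)   # n = 0, 1, 2  ->  shift = 3, 2, 0
    return 0b1100 >> shift

print([power_small(n) for n in range(3)])  # [1, 3, 12]
```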

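Also note that for n>=18, the last shift count is strictly greater than half the register width, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.

For n=14..17, at most 2 bits are set. Starting with result = 0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is large enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)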
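These two claims can be compared against a straightforward 32-bit model. The sketch below is illustrative only; `shiftadd_power32` and `closed_form_large` are hypothetical names, and the mask assumes a 32-bit result register.

```python
MASK32 = 0xFFFFFFFF  # assume a 32-bit result register

def shiftadd_power32(n):
    """Reference model: odd i adds 2^i, even i shifts left by i, modulo 2^32."""
    total = 1
    for i in range(1, n + 1):
        if i % 2:
            total = (total + (1 << i)) & MASK32
        else:
            total = (total << i) & MASK32
    return total

def closed_form_large(n):
    """Shortcut for 18 <= n <= 31: 1<<n for odd n, 0 for even n."""
    return ((n & 1) << n) & MASK32

# The closed form agrees with the model for every n in 18..31.
print(all(shiftadd_power32(n) == closed_form_large(n) for n in range(18, 32)))  # True

# For n = 14..17 the model leaves at most 2 bits set.
print([bin(shiftadd_power32(n)).count("1") for n in range(14, 18)])  # [1, 2, 1, 2]
```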

;; UNTESTED
;; special cases for n<3, and for n>=18
;; enabling an optimization in the main loop (BTS instead of add)
;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
power_optimized proc
    ; fastcall calling convention: arg: ECX = unsigned int n <= 31
    ; clobbers: ECX, EDX
    ; returns: EAX

    mov   eax, 14h      ; 0b10100 = power(3)
    cmp   ecx, 3
    ja    n_gt_3        ; goto main loop or fall through to hard-coded low n
    je    early_ret
    ;; n=0, 1, or 2  =>  1, 3, 12  (0b1, 0b11, 0b1100)

    mov   eax, 0ch      ; 0b1100  to be right-shifted by 3, 2, or 0
    cmp   ecx, 1        ; count=0,1,2 => CF,ZF,neither flag set
    setbe cl            ; count=0,1,2 => cl=1,1,0
    adc   cl, cl        ; 3,2,0  (cl = cl+cl + (count<1) )
    shr   eax, cl
early_ret:
    ret

large_n:                ; odd n: result = 1<<n.  even n: result = 0
    mov   eax, ecx
    and   eax, 1        ; n&1
    shl   eax, cl       ; n>31 will wrap the shift count so this "fails"
    ret                 ; if you need to return 0 for all n>31, add another check

n_gt_3:
    ;; eax = running total for i=3 already
    cmp   ecx, 18
    jae   large_n

    mov   edx, ecx      ; EDX = n
    mov   ecx, 4        ; ECX = i=4..n loop counter and shift count

loop1:                  ; do{   // unrolled by 2
    ; multiply by 2^even power
    shl   eax, cl       ; total <<= i;  // same as total *= (1<<i)

    inc   ecx
    cmp   ecx, edx
    ja    done          ; if (++i > n) break;

    ; add 2^odd power.  i>3 so it won't already be set (thus no carry)
    bts   eax, ecx      ; total |= 1<<i;

    inc   ecx           ; ++i
    cmp   ecx, edx
    jbe   loop1         ; }while(i<=n);

done:
    ret
power_optimized endp

Using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. That's a minor bonus saving.

Notice that this time the main loop is entered with i=4 (even) instead of i=1, so I swapped the add and the shift.

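I still didn't get around to pulling the cmp / jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set up the loop-exit condition, but it would need a check to skip the loop entirely for i = 4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, which doesn't create a worse bottleneck than the loop-carried dependency chains (through eax and ecx). But branch prediction will behave differently, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.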
