Question

I have written an assembly program, in AT&T syntax, to display the factorial of a number. But it's not working. Here is my code:

.text 

.globl _start

_start:
movq $5,%rcx
movq $5,%rax


Repeat:                     #function to calculate factorial
   decq %rcx
   cmp $0,%rcx
   je print
   imul %rcx,%rax
   cmp $1,%rcx
   jne Repeat
# Now result of factorial stored in rax
print:
     xorq %rsi, %rsi

  # function to print integer result digit by digit by pushing in 
       #stack
  loop:
    movq $0, %rdx
    movq $10, %rbx
    divq %rbx
    addq $48, %rdx
    pushq %rdx
    incq %rsi
    cmpq $0, %rax
    jz   next
    jmp loop

  next:
    cmpq $0, %rsi
    jz   bye
    popq %rcx
    decq %rsi
    movq $4, %rax
    movq $1, %rbx
    movq $1, %rdx
    int  $0x80
    addq $4, %rsp
    jmp  next
bye:
movq $1,%rax
movq $0, %rbx
int  $0x80


.data
   num : .byte 5

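This program prints nothing. I also stepped through it in gdb: it works fine up to the loop routine, but once it reaches next, random values start showing up in various registers. Please help me debug it so that it prints the factorial.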

Answer 1

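A few things:

0) I guess this is a 64-bit Linux environment, but you should have said so (if it is not, some of my points will be invalid).

1) int 0x80 is the 32-bit call, but you are using 64-bit registers, so you should use syscall (with different arguments).

2) int 0x80 with eax=4 requires ecx to contain the address of the memory where the content is stored, while you are giving it the ASCII character itself in ecx, which is an illegal memory access (the first call should return an error, i.e. a negative value in eax). Alternatively, running strace <your binary> should reveal the wrong arguments and the error returned.

3) Why addq $4, %rsp? That makes no sense to me: you are damaging rsp, so the next pop rcx will pop the wrong value, and in the end you will run way "up" into the stack.

... maybe some more. I didn't debug it; this list comes just from reading the source (so I may be wrong about something, although that would be rare).

BTW, your code is working. It just doesn't do what you expected. It works precisely as the CPU is designed and precisely as you wrote it. Whether that achieves what you wanted, or makes sense, is a different topic, but don't blame the hardware or the assembler.

... Here is a quick guess at how the routine may be fixed (just a partial hack-fix; it still needs a rewrite to use syscall under 64-bit Linux):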

  next:
    cmpq $0, %rsi
    jz   bye
    movq %rsp,%rcx    # make ecx point to stack memory (holding the stored char)
      # this will work if you are lucky enough that rsp fits into 32b
      # should rsp lie beyond the 4GiB logical address, you are out of luck (syscall needed)
    decq %rsi
    movq $4, %rax
    movq $1, %rbx
    movq $1, %rdx
    int  $0x80
    addq $8, %rsp     # now rsp += 8 is needed, because there's no POP
    jmp  next

Again, I didn't try this myself, I'm just writing it from memory, so let me know how it changes the situation.
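Point 2 above is the crux: write() takes a pointer to the bytes, not the byte value itself. As a rough C analogue of the patched next loop (the name print_digits_slowly and its fd parameter are made up for illustration; this is a sketch assuming a POSIX system, not the asker's code), each digit is stored in memory and its address is passed:

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: prints value in decimal on fd, one write() per digit.
   Returns the number of digits written, or -1 if a write fails. */
int print_digits_slowly(int fd, uint64_t value)
{
    char stack[20];              /* a 64-bit value has at most 20 decimal digits */
    int count = 0;

    do {                         /* "push" digits, least significant first */
        stack[count++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    for (int i = count - 1; i >= 0; i--) {   /* "pop" in printing order */
        /* Pass the ADDRESS of the character, not the character itself. */
        if (write(fd, &stack[i], 1) != 1)
            return -1;
    }
    return count;
}
```

Calling print_digits_slowly(1, 120) should print 120 on standard output. It still makes one system call per digit, which is as slow as the hack-fix above.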

Answer 2

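As @ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.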

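Here's how to print an integer in 64-bit Linux, the simple and somewhat-efficient way. See "Why does GCC use multiplication by a strange number in implementing integer division?" to avoid div r64 for division by 10, because that's slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there would still be room for optimization...)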
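To make the multiplicative-inverse idea concrete, here is a minimal C sketch of unsigned division by 10 done with a multiply and a shift. The helper names div10 and mod10 are hypothetical, and unsigned __int128 is a GCC/Clang extension assumed to be available. As far as I know, the constant and shift are the ones compilers typically pick for a 64-bit unsigned divide by 10:

```c
#include <stdint.h>

/* Quotient of x / 10 without a div instruction: multiply by
   ceil(2^67 / 10) = 0xCCCCCCCCCCCCCCCD and keep the bits above 2^67.
   unsigned __int128 is a GCC/Clang extension (assumed available here). */
static inline uint64_t div10(uint64_t x)
{
    unsigned __int128 product = (unsigned __int128)x * 0xCCCCCCCCCCCCCCCDULL;
    return (uint64_t)(product >> 67);   /* high 64 bits, shifted right by 3 more */
}

/* Remainder recovered from the quotient: x - 10 * (x / 10). */
static inline unsigned mod10(uint64_t x)
{
    return (unsigned)(x - div10(x) * 10);
}
```

In asm terms this corresponds to one mul (whose high half lands in RDX) followed by shr $3, %rdx, plus a multiply and subtract to get the digit.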

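System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers, so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.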

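But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128 bytes) without "reserving" it by modifying RSP. This is called the red zone.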
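The buffer strategy can be sketched in C before reading the asm. The function names here are hypothetical, and an ordinary local array stands in for the red zone, which C code cannot address directly. Digits are stored backwards from the end of a 21-byte buffer, and the whole string goes out in one write():

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: writes the decimal digits of value so that they END
   just before p_end, and returns a pointer to the first digit. */
char *format_uint64_backwards(uint64_t value, char *p_end)
{
    do {
        *--p_end = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);        /* do/while: 0 still yields one digit */
    return p_end;
}

/* Prints value and a newline on fd with ONE write(); returns write()'s result. */
ssize_t print_uint64_line(int fd, uint64_t value)
{
    char buf[21];                /* 20 digits at most, plus '\n' */
    char *end = buf + sizeof buf;

    *--end = '\n';               /* trailing newline goes in first */
    char *start = format_uint64_backwards(value, end);
    return write(fd, start, (size_t)(buf + sizeof buf - start));
}
```

Because the digits come out least-significant first, filling the buffer from the end puts them in printing order without a separate reversal pass.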

Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 System V ABI passes system-call arguments in similar registers to the function-calling convention. (So the registers are totally different from those of the 32-bit int 0x80 ABI.)

#include <asm/unistd_64.h>    // This is a standard Linux header file
      // It contains no C code, only #define constants, so we can include it from asm without syntax errors.

.p2align 4
.globl print_uint64             #void print_uint64(uint64_t value)
print_uint64:
    lea   -1(%rsp), %rsi        # We use the 128B red-zone as a buffer to hold the string
                                # a 64-bit integer is at most 20 digits long in base 10, so it fits.

    movb  $'\n', (%rsi)         # store the trailing newline byte.  (Right below the return address).
    # If you need a null-terminated string, leave an extra byte of room and store '\n\0'.  Or  push $'\n'

    mov    $10, %ecx            # same as  mov $10, %rcx  but 2 bytes shorter
    # note that newline (\n) has ASCII code 10, so we could actually have used  movb %cl to save code size.

    mov    %rdi, %rax           # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit:                # do{
    xor    %edx, %edx
    div    %rcx                 #  rax = rdx:rax / 10.  rdx = remainder

                                # store digits in MSD-first printing order, working backwards from the end of the string
    add    $'0', %edx           # integer to ASCII.  %dl would work, too, since we know this is 0-9
    dec    %rsi
    mov    %dl, (%rsi)          # *--p = (value%10) + '0';

    test   %rax, %rax
    jnz  .Ltoascii_digit        # } while(value != 0)
    # If we used a loop-counter to print a fixed number of digits, we would get leading zeros
    # The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0

    # Then print the whole string with one system call
    mov   $__NR_write, %eax     # SYS_write, from unistd_64.h
    mov   $1, %edi              # fd=1
    # %rsi = start of the buffer
    mov   %rsp, %rdx
    sub   %rsi, %rdx            # length = one_past_end - start
    syscall                     # sys_write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
    # rax = return value (or -errno)
    # rcx and r11 = garbage (destroyed by syscall/sysret)
    # all other registers = unmodified (saved/restored by the kernel)

    # we don't need to restore any registers, and we didn't modify RSP.
    ret

To test this function, I put this in the same file to call it and exit:

.p2align 4
.globl _start
_start:
    mov    $10120123425329922, %rdi
#    mov    $0, %edi    # Yes, it does work with input = 0
    call   print_uint64

    xor    %edi, %edi
    mov    $__NR_exit, %eax
    syscall                             # sys_exit(0)

I built this into a static binary (with no libc):

$ gcc -Wall -nostdlib print-integer.S && ./a.out 
10120123425329922
$ strace ./a.out  > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18)     = 18
exit(0)                                 = ?
+++ exited with 0 +++
$ file ./a.out 
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped

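Related: a Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code size (even at the expense of speed), but well commented.

It uses div like you do, because that's smaller than using a fast multiplicative inverse. It uses loop for the outer loop (over multiple integers for extended precision), again for code size at the cost of speed.

It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current one.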

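Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler Explorer makes it easy to try different options and different compiler versions.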

See the gcc7.2 -O3 asm output, which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):

void itoa_end(unsigned long val, char *p_end) {
  const unsigned base = 10;
  do {
    *--p_end = (val % base) + '0';
    val /= base;
  } while(val);

  // write(1, p_end, orig-current);
}

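I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although it requires many more instructions).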

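It might be worth unrolling by 2: divide by 100 and split the remainder of that into 2 digits. That would give much better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.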
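A minimal C sketch of that unroll-by-2 idea is below. The function name format_uint64_by_100 is hypothetical, and I have not benchmarked it, so treat the benefit as a hypothesis. Each iteration does one division by 100 on the long dependency chain and splits the 0-99 remainder into two digits off to the side:

```c
#include <stdint.h>

/* Hypothetical variant: stores the decimal digits of value so that they end
   just before p_end, two digits per division by 100. Returns the start. */
char *format_uint64_by_100(uint64_t value, char *p_end)
{
    while (value >= 100) {
        unsigned pair = (unsigned)(value % 100);   /* 0..99 */
        value /= 100;                              /* the only loop-carried chain */
        *--p_end = (char)('0' + pair % 10);        /* low digit of the pair */
        *--p_end = (char)('0' + pair / 10);        /* high digit of the pair */
    }
    /* One or two leading digits remain (value is 0..99 here). */
    *--p_end = (char)('0' + value % 10);
    if (value >= 10)
        *--p_end = (char)('0' + value / 10);
    return p_end;
}
```

A compiler should turn both the division by 100 and the small division by 10 into multiply/shift sequences, and the work on pair does not feed back into the next value.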

Printing an integer as a string with AT&T syntax, with Linux system calls instead of printf

2 answers: