Question

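I have some assembly code that uses callq to call another piece of assembly code. When retq executes, the program crashes with a segmentation fault.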

    .globl  main
main:                   # def main():
    pushq   %rbp        #
    movq    %rsp, %rbp  #

    callq   input       # get input
    movq    %rax, %r8

    callq   r8_digits_to_stack
    # program is not getting here before the segmentation fault
    jmp     exit_0

# put the binary digits of r8 on the stack, last digit first (lowest)
# uses: rcx, rbx
r8_digits_to_stack:
    movq    %r8, %rax       # copy for popping digits off

    loop_digits_to_stack:
        cmpq    $0, %rax    # if our copy is zero, we're done!
        jle     return

        movq    %rax, %rcx  # make another copy to extract digit with
        andq    $1, %rcx    # get last digit
        pushq   %rcx        # push last digit to stack
        sarq    %rax        # knock off last digit for next loop
        jmp     loop_digits_to_stack

# return from wherever we were last called
return:
    retq

# exit with code 0
exit_0:
    movq    $0, %rax    # return 0
    popq    %rbp
    retq

where input is a C function that returns keyboard input in %rax.

I suspect this is related to the fact that I'm manipulating the stack. Is that the case?

Answer 1

I think one of your return paths doesn't pop rbp. Just leave out the

    pushq   %rbp
    movq    %rsp, %rbp

    pop     %rbp

altogether. gcc's default is -fomit-frame-pointer.

Or fix your non-return-zero path to also pop rbp.

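Actually, you're screwed, because your function appears to be designed to put stuff on the stack and never take it off. As written, retq pops the most recently pushed digit instead of the return address and jumps to it, which is where the segfault comes from. If you want to invent your own ABI where space below the stack pointer can be used to return arrays, that's interesting, but you'll have to keep track of how big they are so you can adjust rsp back to pointing at the return address before ret.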
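A minimal sketch of that bookkeeping, assuming the loop is inlined into main and a hard-coded value (11, chosen only for illustration) stands in for the call to input. Because %rbp still holds the value %rsp had right after the prologue, the number of pushed digits can be recovered by subtraction, and %rsp can be restored before popq / retq:

```asm
    .globl  main
main:
    pushq   %rbp
    movq    %rsp, %rbp          # rbp remembers rsp before any digit is pushed

    movq    $11, %rax           # illustrative input (binary 1011), not from the original code

.Lloop:
    testq   %rax, %rax
    jz      .Ldone              # no bits left
    movq    %rax, %rcx
    andq    $1, %rcx            # lowest binary digit
    pushq   %rcx
    shrq    $1, %rax            # logical shift: zeros come in at the top
    jmp     .Lloop

.Ldone:
    movq    %rbp, %rax
    subq    %rsp, %rax          # bytes pushed since the prologue
    shrq    $3, %rax            # 8 bytes per digit: rax = digit count

    movq    %rbp, %rsp          # discard the digits; rsp points at the saved rbp again
    popq    %rbp
    retq                        # the digit count becomes the exit status
```

Assembled and linked with gcc, this should exit with status 4 (the four binary digits of 11) instead of crashing, because retq now finds the real return address on top of the stack.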

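I recommend against loading the return address into a register and replacing a later ret with jmp *%rdx or something. That would throw off the call/return address prediction logic in modern CPUs and cause a stall, the same as a branch mispredict. (See http://agner.org/optimize/.) CPUs hate mismatched call/ret. I can't find a specific page to link about that right now.

See https://stackoverflow.com/tags/x86/info for other useful resources, including ABI documentation on how functions normally take args.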

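You could copy the return address to below the array you just pushed, and then run ret, to return with %rsp modified. But unless you need to call a long function from multiple call sites, it's probably better to just inline it at the one or two call sites.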
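One way that copy could look, as a sketch rather than a recommendation: pop the return address into a scratch register on entry, push the digits, then push the return address again so that ret finds it on top. The function name, the use of %rdx and %rsi as scratch registers, returning the digit count in %rax, and the hard-coded input are all assumptions made for this example.

```asm
    .globl  main
main:
    pushq   %rbp
    movq    %rsp, %rbp
    movq    $11, %r8            # illustrative input instead of calling input
    callq   digits_to_callers_stack
    # here: rax = digit count, (%rsp) = most significant digit
    leaq    (%rsp,%rax,8), %rsp # the caller drops the returned array
    popq    %rbp
    retq                        # exit status = digit count

# Non-standard convention: returns with rax digits left on the caller's stack.
digits_to_callers_stack:
    popq    %rdx                # take the return address off the stack
    movq    %r8, %rax
    xorl    %esi, %esi          # digit count
.Lnext:
    testq   %rax, %rax
    jz      .Lfinish
    movq    %rax, %rcx
    andq    $1, %rcx
    pushq   %rcx
    incq    %rsi
    shrq    $1, %rax
    jmp     .Lnext
.Lfinish:
    movq    %rsi, %rax
    pushq   %rdx                # return address goes back on top, below the array
    retq                        # call and ret still pair up
```

The caller has to know the convention: after the call, %rsp has moved down by 8 bytes per digit, and it is the caller's job to add that back before its own epilogue.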

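If it's too big to inline at too many call sites, your best bet, instead of using call and copying the return address down to the new location, would be to emulate call and ret. The caller does

        # put args in some registers
        lea     .ret_location(%rip), %rbx
        jmp     my_weird_helper_function
    .ret_location:  # in NASM/YASM, labels starting with . are local labels, and don't show up in the object file.
                    # GNU assembler might only treat symbols starting with .L that way.
        ...

    my_weird_helper_function:
        # use args, potentially modifying the stack
        jmp     *%rbx   # return

You need a really good reason to use something like this. And you'll have to justify / explain it with a lot of comments, because it's not what readers will be expecting. First of all, what are you going to do with this array you pushed onto the stack? Are you going to find its length by subtracting rsp from rbp or something?

Interestingly, even though push has to modify rsp as well as do a store, it has one-per-clock throughput on all recent CPUs. Intel CPUs have a stack engine so that stack ops don't have to wait for rsp to be computed in the out-of-order engine when it has only been changed by push/pop/call/ret. (Mixing push/pop with mov 4(%rsp), %rax or similar results in extra uops being inserted to sync the OOO engine's rsp with the stack engine's offset.) Intel/AMD CPUs can only do one store per clock anyway, but Intel SnB and later can pop twice per clock.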

So push/pop is actually not a terrible way to implement a stack data structure, especially on Intel.

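Also, your code is structured weirdly. main() is split across r8_digits_to_stack. That's fine, but you never take advantage of falling through from one block into the next, so it just costs you an extra jmp in main for no benefit and a huge readability downside.

Let's pretend your loop is part of main, since I already talked about how weird it is to have a function return with %rsp modified.

Your loop could be simpler, too. Structure things with a jcc back to the top, when possible.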

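There's a small benefit to avoiding the upper 8 registers (r8 to r15): 32-bit insns with the classic registers don't need a REX prefix byte. So let's pretend we just have our starting value in %rax.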

    digits_to_stack:
    # put each bit of %rax into its own 8 byte element on the stack for maximum space-inefficiency

        movq    %rax, %rdx  # save a copy

        xor     %ecx, %ecx  # setcc is only available for byte operands, so zero %rcx

        # need a test at the top after transforming while() into do{}while
        test    %rax, %rax  # fewer insn bytes to test for zero this way
        jz      .Lend

        # Another option can be to jmp to the test at the end of the loop, to begin the first iteration there.

    .align 16
    .Lpush_loop:
        shr     $1, %rax    # shift the low bit into CF, set ZF based on the result
        setc    %cl         # set %cl to 0 or 1, based on the carry flag
        # movzbl  %cl, %ecx # zero-extend
        pushq   %rcx
    #.Lfirst_iter_entry
        # test    %rax, %rax    # not needed, flags still set from shr
        jnz     .Lpush_loop
    .Lend:

This version still kind of sucks, because on Intel P6 / SnB CPU families, using a wider register after writing a smaller part of it leads to a slowdown (a stall on pre-SnB, or an extra uop on SnB and later). Others, including AMD and Silvermont, don't track partial registers separately, so writing to %cl has a dependency on the previous value of %rcx. (Writing to a 32-bit register zeros the upper 32 bits, which avoids the partial-register dependency problem.) A movzx to zero-extend from byte to long would do what Sandybridge does implicitly, and give a speedup on older CPUs.

This won't quite run at one cycle per iteration on Intel, but might on AMD. mov / and $1 isn't bad, but and affects the flags, so it's hard to loop based only on the flags set by shr.

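Note that sarq %rax in your old version shifts in copies of the sign bit, not necessarily zeros. With a loop that only tests for zero, a negative input would never reach zero: it would be an infinite loop, and would segfault once you ran out of stack space (push would try to write to an unmapped page). Your cmpq $0 / jle test avoids that only by exiting immediately on negative input, without pushing any digits.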
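The difference between the two shifts can be checked directly. This sketch (a standalone main, written only for illustration) shifts -1 right by one bit with each instruction: sarq leaves the value at -1, while shrq clears the top bit.

```asm
    .globl  main
main:
    movq    $-1, %rax
    sarq    $1, %rax            # arithmetic: the sign bit is copied in, rax stays -1
    movq    $-1, %rdx
    shrq    $1, %rdx            # logical: a zero is shifted in, rdx = 0x7fffffffffffffff

    xorl    %ecx, %ecx
    cmpq    $-1, %rax
    setne   %cl                 # cl = 1 only if sarq changed the value
    testq   %rdx, %rdx
    js      .Lfail              # the shrq result must have a clear sign bit
    movl    %ecx, %eax          # exit status 0 when both checks pass
    retq
.Lfail:
    movl    $1, %eax
    retq
```

An exit status of 0 means both observations held: repeated sarq on a negative value never reaches zero, while repeated shrq does after at most 64 steps.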

Assembly segmentation fault during retq
