Question

I have studied the x86-64 ABI a lot, written assembly, and looked into how the stack and the heap work.

Given the following code:

#include <linux/seccomp.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    // execute the seccomp syscall (could be any syscall)
    seccomp(...);

    return 0;
}

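In assembly for x86-64, this does the following:

1. Align the stack pointer (by default it is off by 8 bytes).
2. Set up the registers and the stack with the arguments for the call to seccomp.
3. Execute the instruction call seccomp.
4. When seccomp returns, as far as I know, C will probably call exit(0).

I'd like to talk about what happens between steps three and four above.

At that point I have the stack of the currently running process, with its own data in registers and on the stack. How does the userspace process turn execution over to the kernel? Does the kernel just pick up when the call is made, and then push to and pop from the same stack?

I believe I heard somewhere that system calls don't happen immediately, but on certain CPU ticks or interrupts. Is this true? How does this happen, for example, on Linux?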

Answer 1

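"system calls don't happen immediately, but on certain CPU ticks or interrupts"

Totally wrong. The CPU doesn't just sit there doing nothing until a timer interrupt arrives. On most architectures, including x86-64, switching to kernel mode takes tens to hundreds of cycles, but not because the CPU is waiting for anything. It's just a slow operation.

Note that glibc provides function wrappers around nearly every system call, so if you look at the disassembly you'll just see a normal-looking function call.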

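What really happens (using x86-64 as the example):

See the AMD64 SysV ABI documents, linked from the x86 tag wiki. The ABI specifies which registers the args go in, and that system calls are made with the syscall instruction. Intel's instruction reference manual (also linked from the tag wiki) documents in full detail every change that syscall makes to the architectural state of the CPU. If you're interested in the history of the design, I dug up some interesting posts from the amd64 mailing list, exchanged between AMD architects and kernel developers. AMD updated the behaviour before releasing the first AMD64 hardware, so that the instruction was actually usable for Linux (and other kernels).

32-bit x86 uses the int 0x80 instruction for system calls, or sysenter. syscall isn't available in 32-bit mode on Intel CPUs, and sysenter isn't available in 64-bit mode on AMD CPUs, so neither can be relied on there. You can run int 0x80 in 64-bit code, but you still get the 32-bit API, which treats pointers as 32-bit (i.e. don't do it). By the way, perhaps int 0x80 is why you thought system calls have to wait for an interrupt? Running that instruction triggers the interrupt on the spot and jumps straight to the interrupt handler. 0x80 is also not an interrupt that hardware can trigger, so that handler only ever runs after a software-triggered system call.

AMD64 system call example: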

#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h>    // for __NR_write

const char msg[]="hello world!\n";

ssize_t amd64_write(int fd, const char*msg, size_t len) {
  ssize_t ret;
  asm volatile("syscall"  // volatile because we still need the side-effect of making the syscall even if the result is unused
               : "=a"(ret)                   // outputs
               : [callnum]"a"(__NR_write),   // inputs: syscall number in rax,
                "D" (fd), "S"(msg), "d"(len)    // and args, in same regs as the function calling convention
               : "rcx", "r11",               // clobbers: syscall always destroys rcx/r11, but Linux preserves all other regs
                 "memory"                    // "memory" to make sure any stores into buffers happen in program order relative to the syscall 
              );
  return ret;   // result, or a negated errno value on failure
}

int main(int argc, char *argv[]) {
    amd64_write(1, msg, sizeof(msg)-1);
    return 0;
}

int glibcwrite(int argc, char**argv) {
    write(1, msg, sizeof(msg)-1);  // don't write the trailing zero byte
    return 0;
}

compiles to this asm output, with the godbolt Compiler Explorer:

gcc's -masm=intel output is somewhat MASM-like, in that it uses the OFFSET keyword to get the address of a label.

.rodata
msg:
        .string "hello world!\n"

.text
main:   # using an in-line syscall
        mov     eax, 1    # __NR_write
        mov     edx, 13   # string length
        mov     esi, OFFSET FLAT:msg      # string pointer
        mov     edi, eax  # file descriptor = 1 happens to be the same as __NR_write
        syscall
        xor     eax, eax  # zero the return value
        ret

glibcwrite:  # using the normal way that you get from compiler output
        sub     rsp, 8       # keep the stack 16B-aligned for the function call
        mov     edx, 13      # put args in registers
        mov     esi, OFFSET FLAT:msg
        mov     edi, 1
        call    write
        xor     eax, eax
        add     rsp, 8
        ret

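glibc's write wrapper function just puts 1 in eax and runs syscall, then checks the return value and sets errno on failure. It does not retry on EINTR: with SA_RESTART the kernel restarts the call itself, and otherwise the caller sees the error. The branches that leave the fast path lead to the error handling and to other slow-path code.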

// objdump -R -Mintel -d /lib/x86_64-linux-gnu/libc.so.6
...
00000000000f7480 <__write>:
   f7480:       83 3d f9 27 2d 00 00    cmp    DWORD PTR [rip+0x2d27f9],0x0        # 3c9c80 <argp_program_version_hook+0x1f8>
   f7487:       75 10                   jne    f7499 <__write+0x19>
   f7489:       b8 01 00 00 00          mov    eax,0x1
   f748e:       0f 05                   syscall
   f7490:       48 3d 01 f0 ff ff       cmp    rax,0xfffffffffffff001   // -4095: unsigned >= means rax is in [-4095, -1], i.e. a negated errno
   f7496:       73 31                   jae    f74c9 <__write+0x49>
   f7498:       c3                      ret
   ... more code to handle cases where one of those branches was taken

Answer 2

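"system calls don't happen immediately, but on certain CPU ticks or interrupts"

Of course, the effects of your system call may depend on many factors, including ticks. Scheduler granularity or timer resolution may be limited to the tick period, for example. But the call itself should happen "immediately" (it executes inline).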

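"How does the userspace process turn execution over to the kernel? Does the kernel just pick up when the call is made, and then push to and pop from the same stack?"

It may differ slightly between architectures, but in general libc marshals the system call arguments and then triggers a processor exception (or a dedicated instruction such as syscall on x86-64) to switch context.

For additional details, see: "How system calls work on x86 linux"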
