Question

Answer 1

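There's a massive difference between MSVC inline asm and GNU C inline asm. GCC syntax is designed for optimal output without wasted instructions, for wrapping a single instruction or something similar. MSVC syntax is designed to be fairly simple, but AFAICT it's impossible to use without the latency and extra instructions of a round trip through memory for your inputs and outputs.

If you're using inline asm for performance reasons, this makes MSVC inline asm viable only if you write a whole loop entirely in asm, not for wrapping short sequences in an inline function. The example below (wrapping idiv with a function) is the kind of thing MSVC is bad at: ~8 extra store/load instructions.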

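MSVC inline asm (used by MSVC and probably icc, maybe also available in some commercial compilers):

Looks at your asm to figure out which registers your code steps on.
Can only transfer data via memory. Data that was live in registers is stored by the compiler to prepare for your mov ecx, shift_count, for example. So using a single asm instruction that the compiler won't generate for you involves a round trip through memory on the way in and on the way out.
More beginner-friendly, but often impossible to avoid overhead getting data in and out. Even apart from the syntax limitations, the optimizer in current versions of MSVC isn't good at optimizing around inline asm blocks either.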

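GNU C inline asm is not a good way to learn asm. You have to understand asm very well so you can tell the compiler about your code, and you have to understand what compilers need to know. That answer also has links to other inline-asm guides and Q&As. The x86 tag wiki has lots of good material on asm in general, but only links to that answer for GNU inline asm. (The material in that answer also applies to GNU inline asm on non-x86 platforms.)

GNU C inline asm syntax is used by gcc, clang, icc, and maybe some commercial compilers that implement GNU C:

You have to tell the compiler what you clobber. Failing to do so will break the surrounding code in non-obvious, hard-to-debug ways.
Powerful, but the syntax for telling the compiler how to supply inputs and where to find outputs is hard to read, learn, and use. For example, "c" (shift_count) gets the compiler to put the shift_count variable into ecx before your inline asm runs.
Extra clunky for large blocks of code, because the asm has to be inside a string constant. So you typically need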
```
"insn   %[inputvar], %%reg\n\t"       // comment
"insn2  %%reg, %[outputvar]\n\t"
```
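Very unforgiving and harder to use, but allows lower overhead, especially for wrapping single instructions. (Wrapping single instructions was the original design intent, which is why you have to specially tell the compiler about early clobbers to stop it from using the same register for an input and an output, if that's a problem.)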
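As a minimal sketch of the "c" (shift_count) constraint described above, the following wraps a single shift instruction. The helper name shl32 is hypothetical, the asm assumes GCC's default AT&T syntax, and the non-x86 branch is only a portable stand-in so the snippet builds elsewhere. A plain `value << shift_count` already compiles to the same instruction, so this is purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shift `value` left by a run-time count with one x86
 * instruction. The precondition keeps the asm path and the C fallback in
 * agreement (hardware masks the count, C leaves large counts undefined). */
static inline uint32_t shl32(uint32_t value, uint8_t shift_count) {
    assert(shift_count < 32);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("shll %%cl, %0"       /* AT&T order: count in cl, destination last */
             : "+r" (value)        /* read-write operand in any general register */
             : "c" (shift_count)   /* the compiler places shift_count in ecx/cl */
             : "cc");              /* shl modifies the flags */
    return value;
#else
    return value << shift_count;   /* portable fallback for other targets */
#endif
}
```

Because the operands are described by constraints, the compiler is free to keep value in whatever register it already occupies, and only shift_count is forced into ecx.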

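Example: full-width integer division (`div`)

On a 32-bit CPU, dividing a 64-bit integer by a 32-bit integer, or doing a full multiply (32x32->64), can benefit from inline asm. gcc and clang don't take advantage of idiv for (int64_t)a / (int32_t)b, probably because the instruction faults if the result doesn't fit in a 32-bit register. So unlike this Q&A about getting quotient and remainder from one div, this is a use case for inline asm. (Unless there's a way to inform the compiler that the result will fit, so idiv won't fault.)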
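To make the faulting condition concrete, here is a small sketch in plain C of the check a caller would need before relying on a 64/32 -> 32 idiv. All three helper names are hypothetical, and the range test reflects my understanding that idiv raises a divide error when the divisor is zero or the quotient falls outside the signed 32-bit range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the 64-bit dividend that idiv would see in edx:eax. */
static inline int64_t make_dividend(int32_t hi, uint32_t lo) {
    uint64_t bits = ((uint64_t)(uint32_t)hi << 32) | lo;
    return (int64_t)bits;
}

/* True if dividend / divisor is defined and the quotient fits in int32_t. */
static inline bool quotient_fits_int32(int64_t dividend, int32_t divisor) {
    if (divisor == 0)
        return false;
    if (divisor == -1)   /* handled separately to avoid INT64_MIN / -1 */
        return dividend >= -(int64_t)INT32_MAX && dividend <= -(int64_t)INT32_MIN;
    int64_t q = dividend / divisor;
    return q >= INT32_MIN && q <= INT32_MAX;
}

/* Portable reference for the narrowing division, with the precondition checked. */
static inline int32_t div64_by_32_checked(int64_t dividend, int32_t divisor) {
    assert(quotient_fits_int32(dividend, divisor));
    return (int32_t)(dividend / divisor);
}
```

A compiler cannot prove this precondition for arbitrary inputs, which is one plausible reason it falls back to a full 64-bit division routine instead of a single idiv.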

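We'll use calling conventions that put some args in registers (with hi even in the right register), to show a situation closer to what you'd see when inlining a tiny function like this.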

MSVC

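Be careful with register-arg calling conventions when using inline asm. Apparently the inline-asm support is so badly designed/implemented that the compiler might not save/restore arg registers around the inline asm, if those args aren't used in the inline asm. Thanks to @RossRidge for pointing this out.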

// MSVC.  Be careful with _vectorcall & inline-asm: see above
// we could return a struct, but that would complicate things
int _vectorcall div64(int hi, int lo, int divisor, int *premainder) {
    int quotient, tmp;
    __asm {
        mov   edx, hi;
        mov   eax, lo;
        idiv   divisor
        mov   quotient, eax
        mov   tmp, edx;
        // mov ecx, premainder   // Or this I guess?
        // mov   [ecx], edx
    }
    *premainder = tmp;
    return quotient;     // or omit the return with a value in eax
}

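Update: apparently leaving a value in eax or edx:eax and then falling off the end of a non-void function (without a return) is supported, even when inlining. I assume this works only if there's no code after the asm statement. This avoids the store/reload for the output (at least for quotient), but we can't do anything about the inputs. In a non-inline function with stack args, they will already be in memory, but in this use case we're writing a tiny function that could usefully inline.

Compiled with MSVC 19.00.23026 /O2 on rextester (with a main() that finds the directory of the exe and dumps the compiler's asm output to stdout).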

## My added comments use. ##
; ... define some symbolic constants for stack offsets of parameters
; 48   : int ABI div64(int hi, int lo, int divisor, int *premainder) {
    sub esp, 16                 ; 00000010H
    mov DWORD PTR _lo$[esp+16], edx      ## these symbolic constants match up with the names of the stack args and locals
    mov DWORD PTR _hi$[esp+16], ecx

    ## start of __asm {
    mov edx, DWORD PTR _hi$[esp+16]
    mov eax, DWORD PTR _lo$[esp+16]
    idiv    DWORD PTR _divisor$[esp+12]
    mov DWORD PTR _quotient$[esp+16], eax  ## store to a local temporary, not *premainder
    mov DWORD PTR _tmp$[esp+16], edx
    ## end of __asm block

    mov ecx, DWORD PTR _premainder$[esp+12]
    mov eax, DWORD PTR _tmp$[esp+16]
    mov DWORD PTR [ecx], eax               ## I guess we should have done this inside the inline asm so this would suck slightly less
    mov eax, DWORD PTR _quotient$[esp+16]  ## but this one is unavoidable
    add esp, 16                 ; 00000010H
    ret 8

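There are a ton of extra mov instructions, and the compiler doesn't even come close to optimizing any of them away. I thought maybe it would see and understand the mov tmp, edx inside the inline asm and turn it into a store to premainder. But I guess that would require loading premainder from the stack into a register before the inline asm block.

This function is actually worse with _vectorcall than with the normal everything-on-the-stack ABI. With two inputs in registers, it stores them to memory so the inline asm can load them from named variables. If this were inlined, even more of the parameters could be in registers, and it would have to store them all so the asm would have memory operands! So unlike gcc, we don't gain much from inlining this.

Doing *premainder = tmp inside the asm block means writing more of the code in asm, but it does avoid the totally braindead store/load/store path for the remainder. This reduces the instruction count by 2, down to 11 (not including the ret).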
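A sketch of that variant, based on the two commented-out lines in the MSVC source above, might look like the following. The function name div64_store_in_asm is hypothetical, the default stack-args calling convention is used to sidestep the register-arg hazard, and the MSVC branch has not been checked against real compiler output. The other branch is only a portable stand-in so the snippet builds with non-MSVC compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant: store the remainder from inside the asm block.
 * Precondition: the quotient must fit in 32 bits, as idiv requires. */
static inline int div64_store_in_asm(int hi, int lo, int divisor, int *premainder) {
    assert(premainder != NULL && divisor != 0);
#if defined(_MSC_VER) && defined(_M_IX86)
    int quotient;
    __asm {
        mov   edx, hi
        mov   eax, lo
        idiv  divisor
        mov   quotient, eax
        mov   ecx, premainder    ; load the pointer inside the block
        mov   [ecx], edx         ; store the remainder straight from edx
    }
    return quotient;
#else
    /* Portable stand-in with the same results for in-range inputs. */
    int64_t dividend = (int64_t)(((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo);
    *premainder = (int)(dividend % divisor);
    return (int)(dividend / divisor);
#endif
}
```

The inputs still reach the block through named variables in memory, so this only trims the output side of the overhead.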

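I'm trying to get the best possible code out of MSVC, not "use it wrong" and create a straw-man argument. But AFAICT it's horrible for wrapping very short sequences. Presumably there's an intrinsic function for 64/32 -> 32 division that lets the compiler generate good code for this particular case, so the whole premise of using inline asm for this on MSVC could itself be a straw-man argument. But it does show you that intrinsics are much better than inline asm for MSVC.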

GNU C（gcc / clang / icc）

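Gcc does even better than the output shown here when inlining div64, because it can typically arrange for the preceding code to generate the 64-bit integer in edx:eax in the first place.

I couldn't get gcc to compile for the 32-bit vectorcall ABI. Clang can, but it is bad at inline asm with "rm" constraints (try it on the godbolt link: it bounces the function arg through memory instead of using the register option in the constraint). The 64-bit MS calling convention is close to the 32-bit vectorcall, with the first two params in ecx and edx. The difference is that two more params go in registers before the stack is used (and that the callee doesn't pop the args off the stack, which is what the ret 8 was about in the MSVC output).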

// GNU C
// change everything to int64_t to do 128b/64b -> 64b division
// MSVC doesn't do x86-64 inline asm, so we'll use 32bit to be comparable
int div64(int lo, int hi, int *premainder, int divisor) {
    int quotient, rem;
    asm ("idivl  %[divsrc]"
          : "=a" (quotient), "=d" (rem)    // a means eax,  d means edx
          : "d" (hi), "a" (lo),
            [divsrc] "rm" (divisor)        // Could have just used %4 instead of naming divsrc
            // note the "rm" to allow the src to be in a register or not, whatever gcc chooses.
            // "rmi" would also allow an immediate, but unlike adc, idiv doesn't have an immediate form
          : // no clobbers
        );
    *premainder = rem;
    return quotient;
}

Compiled with gcc -m64 -O3 -mabi=ms -fverbose-asm. With -m32 you just get 3 loads, idiv, and a store, as you can see by changing options in that godbolt link.

mov     eax, ecx  # lo, lo
idivl  r9d      # divisor
mov     DWORD PTR [r8], edx       # *premainder_7(D), rem
ret

For 32-bit vectorcall, gcc would do something like:

## Not real compiler output, but probably similar to what you'd get
mov     eax, ecx               # lo, lo
mov     ecx, [esp+12]          # premainder
idivl   [esp+16]               # divisor
mov     DWORD PTR [ecx], edx   # *premainder_7(D), rem
ret   8

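MSVC uses 13 instructions (not including the ret), compared to gcc's 4. With inlining, as I said, gcc's version potentially compiles to just one, while MSVC would probably still use 9. (It won't need to reserve stack space or load premainder; I'm assuming it still has to store about 2 of the 3 inputs. Then it reloads them inside the asm, runs idiv, stores two outputs, and reloads them outside the asm. So that's 4 loads/stores for input, and another 4 for output.)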
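The comment in the MSVC version mentions returning a struct instead of writing through a pointer. A hedged GNU C sketch of that variant follows; the type div64_result and the function div64_pair are hypothetical names, the asm uses the same constraints as the GNU C version above, and the non-x86 branch is a portable stand-in rather than something the original answer shows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: both outputs of one idiv. */
typedef struct {
    int quot;
    int rem;
} div64_result;

/* Divide the 64-bit value hi:lo by divisor.
 * Precondition: the quotient must fit in 32 bits, or idiv faults. */
static inline div64_result div64_pair(int hi, int lo, int divisor) {
    div64_result r;
    assert(divisor != 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("idivl %[divsrc]"
             : "=a" (r.quot), "=d" (r.rem)   /* quotient in eax, remainder in edx */
             : "d" (hi), "a" (lo),           /* dividend in edx:eax */
               [divsrc] "rm" (divisor)       /* register or memory, compiler's choice */
             : "cc");                        /* idiv leaves the flags undefined */
#else
    /* Portable stand-in: C division truncates toward zero, like idiv. */
    int64_t dividend = (int64_t)(((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo);
    r.quot = (int)(dividend / divisor);
    r.rem  = (int)(dividend % divisor);
#endif
    return r;
}
```

With no pointer output, an inlined call leaves both results in registers, so the remaining store depends only on what the caller does with them.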

Answer 2

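With the gcc compiler, there isn't much difference. asm, __asm, and __asm__ are the same; the alternate spellings exist only to avoid namespace conflicts (for example, a user-defined function named asm).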

Answer 3

asm vs __asm__ in GCC

asm does not work with -std=c99. You have two alternatives:

use __asm__
use -std=gnu99

More details: error: ‘asm’ undeclared (first use in this function)
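As a small illustration of the first alternative, the sketch below uses the __asm__ spelling, so it should build under both -std=c99 and -std=gnu99 with GCC-compatible compilers. The empty asm statement with a "memory" clobber is a common compiler-barrier idiom; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Emits no instructions, but stops the compiler from reordering or caching
 * memory accesses across this point. Spelled with __asm__ so it is accepted
 * in strict ISO modes where plain `asm` is not a keyword. */
static inline void compiler_barrier(void) {
    __asm__ __volatile__ ("" : : : "memory");
}

/* Hypothetical usage: the store is completed before the reload. */
static inline int publish_then_read(int *slot, int value) {
    assert(slot != NULL);
    *slot = value;
    compiler_barrier();
    return *slot;
}
```

Replacing __asm__ with asm in this snippet is what triggers the error quoted above when compiling with -std=c99.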

__asm vs __asm__ in GCC

I could not find where __asm is documented (notably, it is not mentioned at https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/Alternate-Keywords.html#Alternate-Keywords), but from the GCC 8.1 source they are exactly the same:

  { "__asm",        RID_ASM,    0 },
  { "__asm__",      RID_ASM,    0 },

So I would just use __asm__, which is documented.
