Question

Why do GCC and Clang generate such different asm for this code (x86_64, -O3 -std=c++17)?

#include <atomic>

int global_var = 0;

int foo_seq_cst(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_seq_cst);
    return ia.load(std::memory_order_seq_cst);
}

int foo_relaxed(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_relaxed);
    return ia.load(std::memory_order_relaxed);
}

GCC 9.1:

foo_seq_cst(int):
        add     edi, DWORD PTR global_var[rip]
        mov     DWORD PTR [rsp-4], edi
        mfence
        mov     eax, DWORD PTR [rsp-4]
        ret
foo_relaxed(int):
        add     edi, DWORD PTR global_var[rip]
        mov     DWORD PTR [rsp-4], edi
        mov     eax, DWORD PTR [rsp-4]
        ret

Clang 8.0:

foo_seq_cst(int):                       # @foo_seq_cst(int)
        mov     eax, edi
        add     eax, dword ptr [rip + global_var]
        ret
foo_relaxed(int):                       # @foo_relaxed(int)
        mov     eax, edi
        add     eax, dword ptr [rip + global_var]
        ret

I suspect that the mfence here is overkill. Am I right? Or could the code Clang generates lead to bugs in some cases?

Answer 1

A more realistic example:

#include <atomic>

std::atomic<int> a;

void foo_seq_cst(int b) {
    a = b;
}

void foo_relaxed(int b) {
    a.store(b, std::memory_order_relaxed);
}

gcc-9.1:

foo_seq_cst(int):
        mov     DWORD PTR a[rip], edi
        mfence
        ret
foo_relaxed(int):
        mov     DWORD PTR a[rip], edi
        ret

clang-8.0:

foo_seq_cst(int):                       # @foo_seq_cst(int)
        xchg    dword ptr [rip + a], edi
        ret
foo_relaxed(int):                       # @foo_relaxed(int)
        mov     dword ptr [rip + a], edi
        ret

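For std::memory_order_seq_cst, gcc uses mfence whereas clang uses xchg.

xchg with a memory operand implies the lock prefix. Both lock and mfence satisfy the requirements of std::memory_order_seq_cst, namely no reordering and a total order.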

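From the Intel 64 and IA-32 Architectures Software Developer's Manual:

MFENCE - Memory Fence

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

8.2.3.8 Locked Instructions Have a Total Order

The memory-ordering model ensures that all processors agree on a single execution order of all locked instructions, including those that are larger than 8 bytes or are not naturally aligned.

8.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions

The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later.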
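To make the "no reordering" requirement concrete, here is a minimal sketch of the classic store-buffer litmus test. It is not from the manual; the names run_once and count_both_zero are made up for illustration. Each thread stores to one variable and then loads the other. With std::memory_order_seq_cst on every operation, the outcome r1 == 0 && r2 == 0 is forbidden, which is exactly the store-then-load ordering that a plain mov store alone would not give on x86 and that mfence or a locked instruction provides.

```cpp
#include <atomic>
#include <thread>

// Outcome of one run of the litmus test (illustrative type).
struct Result {
    int r1;
    int r2;
};

// Runs the two-thread store-buffer test once, with seq_cst everywhere.
Result run_once() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);   // store, then ...
        r1 = y.load(std::memory_order_seq_cst);  // ... load the other variable
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });

    t1.join();  // join() makes the writes to r1 and r2 visible here
    t2.join();
    return {r1, r2};
}

// Counts how many runs produced the outcome that seq_cst forbids.
int count_both_zero(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        Result r = run_once();
        if (r.r1 == 0 && r.r2 == 0) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

On a conforming implementation count_both_zero should always return 0. If the four operations are weakened to release stores and acquire loads, the both-zero outcome becomes permitted by the C++ memory model, and x86 store buffering can actually produce it.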

lock was benchmarked to be 2-3x faster than mfence, and Linux switched from mfence to lock where possible.
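If that cost difference matters, one way to ask for the locked form at the source level is to write the store as an exchange and discard the result. This is only a sketch under assumptions: shared_value and the three function names are hypothetical, and while an atomic exchange on x86 is generally implemented with xchg, the generated asm for a particular compiler version should be checked rather than assumed.

```cpp
#include <atomic>

// Hypothetical shared variable, standing in for `a` in the example above.
std::atomic<int> shared_value{0};

// Plain seq_cst store: the compiler picks the instruction sequence
// (mov + mfence with gcc-9.1, xchg with clang-8.0 in the listings above).
void store_seq_cst(int v) {
    shared_value.store(v, std::memory_order_seq_cst);
}

// Same visible effect written as a read-modify-write whose old value is
// discarded. A seq_cst exchange is at least as strong as a seq_cst store.
void store_via_exchange(int v) {
    (void)shared_value.exchange(v, std::memory_order_seq_cst);
}

// Helper used only to observe the stored value.
int read_value() {
    return shared_value.load(std::memory_order_seq_cst);
}
```

Both functions leave the same value in shared_value; the difference, if any, is only in which instructions the compiler emits.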

Why does GCC insert mfence where Clang does not?