Question

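Here are four approaches to implementing sequential consistency on x86 / x86_64:

1. LOAD (without fence) and STORE + MFENCE
2. LOAD (without fence) and LOCK XCHG
3. MFENCE + LOAD and STORE (without fence)
4. LOCK XADD (0) and STORE (without fence)

As written here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html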

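C/C++11 operation     x86 implementation
Load Seq_Cst:         MOV (from memory)
Store Seq Cst:        (LOCK) XCHG // alternative: MOV (into memory), MFENCE

Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq Cst store locks/fences the Seq Cst load:

Load Seq_Cst:         LOCK XADD(0) // alternative: MFENCE, MOV (from memory)
Store Seq Cst:        MOV (into memory)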
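
For illustration, the two Seq Cst store mappings can be approximated in portable C++ as below. This is only a sketch: the helper names store_mov_then_fence and store_via_exchange are made up for this example, and the instructions named in the comments are what x86 compilers typically emit, not something the C++ standard guarantees. At the language level a relaxed store followed by a seq_cst fence is also not strictly identical to a seq_cst store; the point is only the shape of the generated code.

```cpp
#include <atomic>

// Hypothetical helper: a plain store followed by a full fence.
// On x86 this typically becomes MOV followed by MFENCE
// (or an equivalent locked instruction, depending on the compiler).
void store_mov_then_fence(std::atomic<int>& a, int value) {
    a.store(value, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Hypothetical helper: an atomic exchange whose old value is discarded.
// On x86 this typically becomes XCHG with a memory operand,
// which is implicitly locked and acts as a full barrier.
void store_via_exchange(std::atomic<int>& a, int value) {
    (void)a.exchange(value, std::memory_order_seq_cst);
}
```

Comparing the compiler output for both helpers (for example with -S) shows which mapping a given compiler version prefers.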

GCC 4.8.2 (viewed in GDB on x86_64) uses the first (1) approach for C++11 std::memory_order_seq_cst, i.e. LOAD (without fence) and STORE + MFENCE:

std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8  <+0x0058>         mov    0x38(%rsp),%eax
0x4613ec  <+0x005c>         mov    %eax,0x20(%rsp)
0x4613f0  <+0x0060>         mfence

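As we know, MFENCE = LFENCE + SFENCE. So we can rewrite this code as: LOAD (without fence) and STORE + LFENCE + SFENCE

Questions:

Why do we not need LFENCE before the LOAD here, and why do we need LFENCE after the STORE (since LFENCE only makes sense before a LOAD!)?
Why does GCC not use the approach LOAD (without fence) and STORE + SFENCE for std::memory_order_seq_cst?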

Answer 1

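The only reordering x86 does (for normal memory accesses) is that it can potentially reorder a load that follows a store.

SFENCE guarantees that all stores before the fence complete before all stores after the fence. LFENCE guarantees that all loads before the fence complete before all loads after the fence. For normal memory accesses, the ordering guarantees of an individual SFENCE or LFENCE are already provided by default. Basically, LFENCE and SFENCE by themselves are only useful for the weaker memory access modes of x86.

Neither LFENCE, SFENCE, nor LFENCE + SFENCE prevents a store followed by a load from being reordered. MFENCE does.

The relevant reference is the Intel x86 architecture manual.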
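
The store-then-load reordering described above is usually shown with the "store buffer" litmus test. The following is a minimal sketch, assuming a hosted C++11 compiler with thread support; the function name count_both_zero is invented for this example. With seq_cst stores the outcome r1 == 0 && r2 == 0 is forbidden by the C++ memory model. With weaker stores (which on x86 compile to a plain MOV with no fence) it is allowed, although it may be rare in practice, especially since this simple version starts fresh threads on every iteration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the store-buffer litmus test `iterations`
// times and returns how often both threads read 0.
template <std::memory_order StoreOrder>
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, StoreOrder);                  // store ...
            r1 = y.load(std::memory_order_seq_cst);  // ... then load
        });
        std::thread t2([&] {
            y.store(1, StoreOrder);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        // r1 and r2 are safe to read here because join() synchronizes.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling count_both_zero<std::memory_order_seq_cst>(n) must return 0, while count_both_zero<std::memory_order_release>(n) is permitted to return a nonzero count.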

Answer 2

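std::atomic<int>::store is mapped to the compiler intrinsic __atomic_store_n. (This and other atomic-operation intrinsics are documented here: Built-in functions for memory model aware atomic operations.) The _n suffix makes it type-generic; the back end actually implements variants for specific sizes in bytes. int on x86 is AFAIK always 32 bits long, which means we are looking for the definition of __atomic_store_4. The internals manual for this version of GCC says that the __atomic_store operations correspond to machine description patterns named atomic_storemode; the mode corresponding to a 4-byte integer is "SI" (that's documented here), so we are looking for something called "atomic_storesi" in the x86 machine description. That brings us to config/i386/sync.md, specifically this bit: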

(define_expand "atomic_store<mode>"
  [(set (match_operand:ATOMIC 0 "memory_operand")
        (unspec:ATOMIC [(match_operand:ATOMIC 1 "register_operand")
                        (match_operand:SI 2 "const_int_operand")]
                       UNSPEC_MOVA))]
  ""
{
  enum memmodel model = (enum memmodel) (INTVAL (operands[2]) & MEMMODEL_MASK);

  if (<MODE>mode == DImode && !TARGET_64BIT)
    {
      /* For DImode on 32-bit, we can use the FPU to perform the store.  */
      /* Note that while we could perform a cmpxchg8b loop, that turns
         out to be significantly larger than this plus a barrier.  */
      emit_insn (gen_atomic_storedi_fpu
                 (operands[0], operands[1],
                  assign_386_stack_local (DImode, SLOT_TEMP)));
    }
  else
    {
      /* For seq-cst stores, when we lack MFENCE, use XCHG.  */
      if (model == MEMMODEL_SEQ_CST && !(TARGET_64BIT || TARGET_SSE2))
        {
          emit_insn (gen_atomic_exchange<mode> (gen_reg_rtx (<MODE>mode),
                                                operands[0], operands[1],
                                                operands[2]));
          DONE;
        }

      /* Otherwise use a store.  */
      emit_insn (gen_atomic_store<mode>_1 (operands[0], operands[1],
                                           operands[2]));
    }
  /* ... followed by an MFENCE, if required.  */
  if (model == MEMMODEL_SEQ_CST)
    emit_insn (gen_mem_thread_fence (operands[2]));
  DONE;
})

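Without going into much detail, the bulk of this is a C function body that will be called to generate the low-level "RTL" intermediate representation of the atomic store operation. When it is invoked by your example code, <MODE>mode != DImode, model == MEMMODEL_SEQ_CST, and TARGET_SSE2 is true, so it will call gen_atomic_store<mode>_1 and then gen_mem_thread_fence. The latter function always generates mfence. (There is code in this file to produce sfence, but I believe it is used only for explicitly coded _mm_sfence (from <xmmintrin.h>).)

The comments suggest that someone thought MFENCE was required in this case. I conclude that either you are mistaken to think a load fence is not required, or this is a missed-optimization bug in GCC. It is not, for instance, an error in how you are using the compiler.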
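
To make the mapping from std::atomic<int>::store to __atomic_store_n more concrete, here is a small sketch that calls the builtin directly. It assumes GCC or Clang, since the __atomic builtins are compiler extensions; the wrapper names builtin_store, builtin_load and library_store are hypothetical.

```cpp
#include <atomic>

// Hypothetical wrapper: direct use of the GCC/Clang builtin on a plain int.
void builtin_store(int* p, int value) {
    __atomic_store_n(p, value, __ATOMIC_SEQ_CST);
}

// Hypothetical wrapper: the matching load builtin.
int builtin_load(int* p) {
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

// The library-level call that ends up going through the same builtin.
void library_store(std::atomic<int>& a, int value) {
    a.store(value, std::memory_order_seq_cst);
}
```

Compiling builtin_store and library_store with the same compiler and flags should give the same instruction sequence for the store, which is one way to check the expansion described above on a particular GCC version.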

Answer 3

Consider the following code:

#include <atomic>
#include <cstring>

std::atomic<int> a;
char b[64];

void seq() {
  /*
    movl    $0, a(%rip)
    mfence
  */
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

void rel() {
  /*
    movl    $0, a(%rip)
   */
  int temp = 0;
  a.store(temp, std::memory_order_relaxed);
}

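With respect to the atomic variable "a", seq() and rel() are both ordered and atomic on the x86 architecture because:

mov is an atomic instruction
mov is a legacy instruction, and Intel promises ordered memory semantics for legacy instructions, to stay compatible with old processors that always used ordered memory semantics.

No fence is required to store a constant value into an atomic variable. The fence is there because std::memory_order_seq_cst implies that all memory is synchronized, not only the memory that holds the atomic variable.

The effect can be demonstrated with the following set and get functions: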

void set(const char *s) {
  strcpy(b, s);
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

const char *get() {
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
  return b;
}

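strcpy is a library function that may use newer SSE instructions if they are available at run time. Since SSE instructions did not exist on old processors, there is no backward-compatibility requirement for them, and their memory order is undefined. So the result of a strcpy in one thread might not be directly visible in other threads.

The set and get functions above use an atomic value to enforce memory synchronization so that the result of strcpy becomes visible in other threads. Now the fences matter, but their order inside the call to atomic::store is not significant, since atomic::store does not need the fences internally.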
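
For comparison, the publish pattern is more commonly written with the reader loading the flag instead of storing to it. The sketch below is not this answer's code; it is a hedged variant with hypothetical names (buffer, ready, publish, try_consume) that pairs a release store with an acquire load, which is what lets the reader rely on seeing the copied string.

```cpp
#include <atomic>
#include <cstring>

char buffer[64];            // hypothetical shared payload
std::atomic<int> ready{0};  // hypothetical flag guarding the payload

// Writer: fill the buffer, then publish it with a release store.
void publish(const char* s) {
    std::strncpy(buffer, s, sizeof(buffer) - 1);
    buffer[sizeof(buffer) - 1] = '\0';
    ready.store(1, std::memory_order_release);
}

// Reader: only touch the buffer after an acquire load has seen the flag.
const char* try_consume() {
    if (ready.load(std::memory_order_acquire) == 1) {
        return buffer;
    }
    return nullptr;  // nothing published yet
}
```

The acquire load that observes the value 1 synchronizes with the release store, so everything written to buffer before publish() returned is visible to the reader.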

Answer 4

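SFENCE + LFENCE is not a StoreLoad barrier (MFENCE), so the premise of the question is incorrect. (See also my answer to another version of this same question from the same user: Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)

SFENCE can pass (appear before) earlier loads. (It is just a StoreStore barrier.)
LFENCE can pass earlier stores. (Loads can't cross it in either direction: LoadLoad barrier.)
Loads can pass SFENCE (but stores can't pass LFENCE, so it is a LoadStore barrier as well as a LoadLoad barrier).

LFENCE + SFENCE doesn't include anything that prevents a store from being buffered until after a later load. MFENCE does prevent this.

Preshing's blog post explains in more detail, and with diagrams, how StoreLoad barriers are special, and has a practical working code example that demonstrates reordering without MFENCE. Anyone who is confused about memory ordering should start with that blog.

x86 has a strong memory model, where every normal store has release semantics and every normal load has acquire semantics. This post has the details.

LFENCE and SFENCE only exist for use with movnt loads/stores, which are weakly ordered and bypass the cache.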
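
As a hedged illustration of the movnt case, the sketch below uses a non-temporal store followed by SFENCE before publishing a flag. It assumes an x86 compiler that provides the SSE2 intrinsics in <emmintrin.h>; on other targets it falls back to a plain store. The function name publish_streaming is invented for this example.

```cpp
#include <atomic>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32, _mm_sfence
#endif

// Hypothetical helper: write `value` to `slot`, then set `ready`.
void publish_streaming(int* slot, int value, std::atomic<bool>& ready) {
#if defined(__SSE2__)
    _mm_stream_si32(slot, value);  // weakly ordered non-temporal store (MOVNTI)
    _mm_sfence();                  // order the NT store before later stores
#else
    *slot = value;                 // fallback: ordinary store, no SFENCE needed
#endif
    // An ordinary release store; on x86 this is a plain MOV, which by itself
    // would not order the earlier non-temporal store.
    ready.store(true, std::memory_order_release);
}
```

Without the _mm_sfence() call, a reader that sees ready == true could in principle still miss the non-temporal write, which is the situation SFENCE exists for.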

In case those links ever die, there is even more information in my answer on another similar question.

Why does GCC not use LOAD (without fence) and STORE + SFENCE for sequential consistency?
