Understanding memory barriers

Date: 2016-06-13 19:51:26

Tags: java x86 volatile memory-barriers

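I'm trying to understand memory barriers at a level that is useful to Java lock-free programmers. I feel this level sits somewhere between learning only about volatiles and learning how store/load buffers work from an x86 manual.

I spent some time reading a bunch of blogs/cookbooks and came up with the summary below. Could someone more knowledgeable look it over and tell me whether I have missed anything or listed anything incorrectly?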

LFENCE

Name             : LFENCE/Load Barrier/Acquire Fence
Barriers         : LoadLoad + LoadStore
Details          : Given sequence {Load1, LFENCE, Load2, Store1}, the
                   barrier ensures that Load1 can't be moved south and
                   Load2 and Store1 can't be moved north of the
                   barrier. 
                   Note that Load2 and Store1 can still be reordered.

Buffer Effect    : Causes the contents of the LoadBuffer 
                   (pending loads) to be processed for that CPU. This
                   makes program state exposed from other CPUs visible
                   to this CPU before Load2 and Store1 are executed.

Cost on x86      : Either very cheap or a no-op.
Java instructions: Reading a volatile variable, Unsafe.loadFence()

SFENCE

Name             : SFENCE/Store Barrier/Release Fence
Barriers         : StoreStore + LoadStore
Details          : Given sequence {Load1, Store1, SFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 can't be moved north of the 
                   barrier.
                   Note that Load1 and Store1 can still be reordered AND 
                   Load2 can be moved north of the barrier.
Buffer Effect    : Causes the contents of the StoreBuffer to be flushed
                   to cache for the CPU on which it is issued.
                   This will make program state visible to other CPUs
                   before Store2 and Load1 are executed.
Cost on x86      : Either very cheap or a no-op.
Java instructions: lazySet(), Unsafe.storeFence(), Unsafe.putOrdered*()

MFENCE

Name             : MFENCE/Full Barrier/Fence
Barriers         : StoreLoad
Details          : Obtains the effects of the other three barriers.
                   Given sequence {Load1, Store1, MFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 and Load2 can't be moved north
                   of the barrier.
                   Note that Load1 and Store1 can still be reordered AND
                   Store2 and Load2 can still be reordered.
Buffer Effect    : Causes the contents of the LoadBuffer (pending loads)
                   to be processed for that CPU.
                   AND
                   Causes the contents of the StoreBuffer to be flushed to
                   cache for the CPU on which it is issued.
Cost on x86      : The most expensive kind.
Java instructions: Writing to a volatile, Unsafe.fullFence(), Locks

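Finally, if both SFENCE and MFENCE drain the storeBuffer (invalidate the cacheline and wait for acks from other CPUs), why is one a no-op and the other a very expensive op?

Thanks

(Cross-posted from the Google Mechanical Sympathy forum)

1 answer:

Answer 0: (score: 6)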

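You're using Java, so all that really matters is the Java memory model. Compile-time (including JIT) optimizations will re-order your memory accesses within the limits of the Java memory model, not the stronger x86 memory model that the JVM happens to be JIT-compiling for. (See my answer to How does memory reordering help processors and compilers?)

Still, learning about x86 can give your understanding a concrete foundation, but don't fall into the trap of thinking that Java on x86 works like assembly on x86. (Or that the whole world is x86. Many other architectures are weakly ordered, like the Java memory model.)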

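x86 LFENCE and SFENCE are no-ops as far as memory ordering is concerned, unless you used movnt weakly-ordered cache-bypassing stores. Normal loads are implicitly acquire-loads, and normal stores are implicitly release-stores.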
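At the Java level, the release-store / acquire-load pairing described above can be sketched with lazySet() (listed in the question's table) and a volatile read. This is a minimal illustration, not a benchmark: the class name ReleasePublish and the value 42 are made up for the example, and the sketch relies on the documented behavior that lazySet() has release semantics and AtomicInteger.get() has at least acquire semantics.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class: publishes a plain field through a release store.
final class ReleasePublish {
    static int payload;                                   // plain (non-volatile) data
    static final AtomicInteger ready = new AtomicInteger(0);

    static int publishAndConsume() {
        payload = 0;
        ready.set(0);
        final int[] seen = new int[1];

        Thread consumer = new Thread(() -> {
            // Acquire side: a volatile read; the payload read cannot move above it.
            while (ready.get() == 0) {
                Thread.yield();
            }
            seen[0] = payload;
        });
        Thread producer = new Thread(() -> {
            payload = 42;        // plain store
            ready.lazySet(1);    // release store: the payload store cannot sink below it
        });

        consumer.start();
        producer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndConsume());
    }
}
```

On x86 the lazySet() would be expected to compile to an ordinary store with no fence instruction, which is the point the answer makes: the ordering is needed at the Java level even though the hardware provides it for free.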

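You have an error in your table: according to Intel's instruction set reference manual, SFENCE is "not ordered with respect to load instructions". It is only a StoreStore barrier, not a LoadStore barrier.

(That link is an HTML conversion of Intel's PDF. See the tag wiki for links to the official version.)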

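lfence is a LoadLoad and LoadStore barrier, so your table is correct on that point.

But CPUs don't really "buffer" loads ahead of time. They perform them and start using the results for out-of-order execution as soon as the results are available. (Usually the instructions that use a load result have already been decoded and issued before the result is ready, even on an L1 cache hit.) This is the fundamental difference between loads and stores.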

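SFENCE is cheap because it doesn't actually have to drain the store buffer. Draining is one way to implement it, one that keeps the hardware simple at the cost of performance.

MFENCE is expensive because it is the only barrier that prevents StoreLoad reordering. See Jeff Preshing's Memory Reordering Caught in the Act for an explanation, along with a test program that actually demonstrates StoreLoad reordering on real hardware.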
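The StoreLoad case can be sketched in Java as a two-thread litmus test. This is a rough illustration in the spirit of Preshing's test, not a port of it, and the class and method names are invented for the example. With x and y declared volatile, the Java memory model forbids the outcome r1 == 0 && r2 == 0, because each thread's store must become visible before its following load.

```java
// Hypothetical litmus test: each thread stores to one variable, then loads the other.
final class StoreLoadLitmus {
    // volatile forces a StoreLoad barrier between each thread's store and load.
    static volatile int x, y;
    // Plain fields are fine here: they are only read after join().
    static int r1, r2;

    static int countBothZero(int iterations) {
        int bothZero = 0;
        for (int i = 0; i < iterations; i++) {
            x = 0;
            y = 0;
            Thread a = new Thread(() -> { x = 1; r1 = y; });
            Thread b = new Thread(() -> { y = 1; r2 = x; });
            a.start();
            b.start();
            try {
                a.join();
                b.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            // Only possible if a load was reordered ahead of the earlier store.
            if (r1 == 0 && r2 == 0) {
                bothZero++;
            }
        }
        return bothZero;
    }

    public static void main(String[] args) {
        System.out.println("both zero: " + countBothZero(2000));
    }
}
```

Removing volatile from x and y makes the both-zero outcome legal, but this harness is unlikely to show it in practice: thread start-up cost dwarfs the reordering window, so a serious test would reuse long-running threads and synchronize them tightly, as Preshing's program does.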

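Jeff Preshing's blog posts are gold for understanding lock-free programming and memory-ordering semantics. I usually link to his blog in my SO answers to memory-ordering questions. If you're interested in reading more of what I've written (mostly C++ / asm, not Java), you can use search to find those answers.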

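Fun fact: any atomic read-modify-write operation on x86 is also a full memory barrier. The lock prefix, which is implicit on xchg [mem], reg, is also a full barrier. Before mfence was available, lock add [esp], 0 was a common idiom for a memory barrier that is otherwise a no-op. (Stack memory is almost always hot in L1, and not shared.)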

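So on x86, incrementing an atomic counter has the same performance regardless of the memory-ordering semantics you request (e.g. C++11 memory_order_relaxed vs. memory_order_seq_cst (sequential consistency)). Still, use whatever memory-order semantics are appropriate, because other architectures can perform atomic operations without a full memory barrier. Forcing the compiler / JVM to use a memory barrier when you don't need one is a waste.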
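A Java analogue of choosing the memory order on an atomic increment is sketched below. It assumes Java 9 or later, where VarHandle offers getAndAdd (volatile semantics) and getAndAddRelease; the class OrderedCounter and the helper hammer are hypothetical names introduced for illustration. On x86 both variants would be expected to compile to the same lock-prefixed instruction, while a weakly ordered architecture may be able to use a cheaper sequence for the release variant.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical counter exposing two memory-ordering strengths for the same increment.
final class OrderedCounter {
    private static final VarHandle COUNT;
    static {
        try {
            COUNT = MethodHandles.lookup()
                    .findVarHandle(OrderedCounter.class, "count", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private int count; // accessed only through COUNT

    // Atomic increment with volatile (sequentially consistent) semantics.
    int incrementSeqCst() {
        return (int) COUNT.getAndAdd(this, 1);
    }

    // Atomic increment with release semantics only.
    int incrementRelease() {
        return (int) COUNT.getAndAddRelease(this, 1);
    }

    int get() {
        return (int) COUNT.getVolatile(this);
    }

    // Runs several threads against one counter and returns the final value.
    static int hammer(boolean release, int threads, int perThread) {
        OrderedCounter c = new OrderedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    if (release) {
                        c.incrementRelease();
                    } else {
                        c.incrementSeqCst();
                    }
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return c.get();
    }

    public static void main(String[] args) {
        System.out.println(hammer(false, 4, 10_000));
        System.out.println(hammer(true, 4, 10_000));
    }
}
```

Both variants are atomic, so neither loses increments; they differ only in how much ordering they impose on the surrounding loads and stores.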