Java lock-free performance with JMH

Date: 2015-10-11 13:04:44

Tags: java multithreading nonblocking memory-barriers jmh

I have a multithreaded JMH test:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.LockSupport;

    import org.openjdk.jmh.annotations.*;
    import sun.misc.Unsafe;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Fork(value = 1, jvmArgsAppend = { "-Xmx512m", "-server", "-XX:+AggressiveOpts", "-XX:+UnlockDiagnosticVMOptions",
            "-XX:+UnlockExperimentalVMOptions", "-XX:+PrintAssembly", "-XX:PrintAssemblyOptions=intel",
            "-XX:+PrintSignatureHandlers"})
    @Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
    @Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
    public class LinkedQueueBenchmark {
        // UnsafeProvider is not shown in the question; presumably a helper
        // that returns the sun.misc.Unsafe instance.
        private static final Unsafe unsafe = UnsafeProvider.getUnsafe();
        private static final long offsetObject;
        private static final long offsetNext;

        private static final int THREADS = 5;

        private static class Node {
            private volatile Node next;
            public Node() {}
        }

        static {
            try {
                offsetObject = unsafe.objectFieldOffset(LinkedQueueBenchmark.class.getDeclaredField("object"));
                offsetNext = unsafe.objectFieldOffset(Node.class.getDeclaredField("next"));
            } catch (Exception ex) { throw new Error(ex); }
        }

        // Padding fields.
        protected long t0, t1, t2, t3, t4, t5, t6, t7;
        // Node declares only a no-arg constructor, so new Node(null) would not compile.
        private volatile Node object = new Node();

        @Threads(THREADS)
        @Benchmark
        public Node doTestCasSmart() {
            Node current, o = new Node();
            for (;;) {
                current = this.object;
                // Try to swing the shared reference from the observed node to the new one.
                if (unsafe.compareAndSwapObject(this, offsetObject, current, o)) {
                    //current.next = o; //Special line:
                    break;
                } else {
                    // Lost the race: back off briefly, then retry.
                    LockSupport.parkNanos(1);
                }
            }
            return current;
        }
    }
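  1. In the current variant I get ~55 ops/us.
  2. But if I uncomment the "Special line", or replace it with unsafe.putOrderedObject (in either direction: current.next = o or o.next = current), performance drops to ~2 ops/us.
  3. As I understand it, something is happening with the CPU caches, and maybe the store buffer is being flushed. If I replace it with a lock-based approach, without CAS, performance is 11-20 ops/us.

I tried LinuxPerfAsmProfiler and PrintAssembly; in the second case I see: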

    ....[Hottest Regions]...............................................................................
     25.92%   17.93%  [0x7f1d5105fe60:0x7f1d5105fe69] in SpinPause (libjvm.so)
     17.53%   20.62%  [0x7f1d5119dd88:0x7f1d5119de57] in ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) const (libjvm.so)
     10.81%    6.30%  [0x7f1d5129cff5:0x7f1d5129d0ed] in ParallelTaskTerminator::offer_termination(TerminatorTerminator*) (libjvm.so)
      7.99%    9.86%  [0x7f1d3c51d280:0x7f1d3c51d3a2] in com.jad.generated.LinkedQueueBenchmark_doTestCasSmart::doTestCasSmart_thrpt_jmhStub 
    

Can someone explain to me what is really happening? Why is it so slow? Where is the store-load barrier here? Why does putOrdered not help? And how do I fix it?

1 answer:

Answer 0 (score: 9)

Rule: instead of looking for "advanced" answers, you should first look for stupid mistakes.

SpinPause, ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*), and ParallelTaskTerminator::offer_termination(TerminatorTerminator*) come from GC threads. That may well mean that most of the work the benchmark does is GC. Indeed, running with the "special line" uncommented and -prof gc yields:

    # Run complete. Total time: 00:00:43

    Benchmark                        Mode  Cnt      Score   Error   Units
    LQB.doTestCasSmart              thrpt    5      5.930 ± 3.867  ops/us
    LQB.doTestCasSmart:·gc.time     thrpt    5  29970.000              ms

So out of 43 seconds for the run, you spent 30 seconds doing GC. Even plain -verbose:gc shows it:

    Iteration 3: [Full GC (Ergonomics) 408188K->1542K(454656K), 0.0043022 secs]
    [GC (Allocation Failure) 60422K->60174K(454656K), 0.2061024 secs]
    [GC (Allocation Failure) 119054K->118830K(454656K), 0.2314572 secs]
    [GC (Allocation Failure) 177710K->177430K(454656K), 0.2268396 secs]
    [GC (Allocation Failure) 236310K->236054K(454656K), 0.1718049 secs]
    [GC (Allocation Failure) 294934K->294566K(454656K), 0.2265855 secs]
    [Full GC (Ergonomics) 294566K->147408K(466432K), 0.7139546 secs]
    [GC (Allocation Failure) 206288K->205880K(466432K), 0.2065388 secs]
    [GC (Allocation Failure) 264760K->264312K(466432K), 0.2314117 secs]
    [GC (Allocation Failure) 323192K->323016K(466432K), 0.2183271 secs]
    [Full GC (Ergonomics) 323016K->322663K(466432K), 2.8058725 secs]

A 2.8 s Full GC: that is awful. Around 5 s spent in GC within an iteration that is bounded by 5 s of run time: that is awful too.

Why is that? Well, you are building a linked list there. Granted, the head of the queue is unreachable, and everything from the head up to your object should be collected. But collection is not instantaneous. The longer the queue gets, the more memory it consumes, and the more work the GC has to do to traverse it. That is a positive feedback loop that cripples execution. Since the queue elements are collectable anyway, this feedback loop never reaches an OOME. Storing the initial object in a new head field would make the test eventually hit an OOME.

Therefore, your question has nothing to do with putOrdered, memory barriers, or queue performance. I think you need to reconsider what you are actually testing. Designing the test so that the transient memory footprint stays the same on each @Benchmark call is an art in itself.
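
To see the GC effect outside of JMH, here is a minimal sketch that compares accumulated GC time with and without the "special line". It rests on several assumptions: it is single-threaded, it swaps in java.util.concurrent.atomic.AtomicReference for the Unsafe CAS, and the class name GcFootprintSketch, the helper names, and the operation count are made up for illustration. It reads GC time from the standard GarbageCollectorMXBean API, which is a rough stand-in for -prof gc rather than a replacement for it.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicReference;

public class GcFootprintSketch {
    static final class Node {
        volatile Node next;
    }

    // Sum of the accumulated collection time (ms) over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Replaces the shared node `ops` times; optionally links old -> new.
    static void swap(AtomicReference<Node> tail, int ops, boolean link) {
        for (int i = 0; i < ops; i++) {
            Node o = new Node();
            Node current;
            do {
                current = tail.get();
            } while (!tail.compareAndSet(current, o));
            if (link) {
                current.next = o; // the "special line" from the question
            }
        }
    }

    // Counts the nodes reachable from `from` by following next.
    static int chainLength(Node from) {
        int n = 0;
        for (Node c = from; c != null; c = c.next) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int ops = 20_000_000; // hypothetical size, adjust to your heap
        for (boolean link : new boolean[] {false, true}) {
            long gcBefore = totalGcMillis();
            long start = System.nanoTime();
            // The first node is not retained, so the chain stays collectable.
            swap(new AtomicReference<>(new Node()), ops, link);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("link=%b elapsed=%d ms, gc=%d ms%n",
                    link, elapsedMs, totalGcMillis() - gcBefore);
        }
    }
}
```

The numbers depend heavily on the JVM version, collector, and heap size, so treat the output as a hint about where the time goes, not as a benchmark result. Holding on to the first node (as chainLength requires) keeps the whole chain reachable, which is the variant that would eventually run out of memory.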