Question

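I'm running an experiment to determine the performance overhead of wrapping methods. I've read that the JIT compiler and/or the JVM optimizes small methods, but I consistently seem to incur a 3-5% performance penalty.

The code is as follows: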

import java.util.* ;

public class WrappingTest1{
    private WrappingTest1(){
        // Empty.
    }

    private static void findPrimes(
        final Long maxValue ,
        final List< Long > foundPrimes
    ){
        if(
            maxValue > 2L
        ){
            Boolean isPrime ;
            foundPrimes.clear() ;

            for(
                Long i = 2L ;
                i <= maxValue ;
                i += 1L
            ){
                isPrime = true ;
                for(
                    Long j = 2L ;
                    j < i ;
                    j += 1L
                ){
                    if(
                        ( i % j ) == 0
                    ){
                        isPrime = false ;
                    }
                }
                if(
                    isPrime
                ){
                    foundPrimes.add(
                        i
                    ) ;
                }
            }
        }
    }

    private static void wrapper(
        final Long input ,
        final List< Long > output
    ){
        findPrimes(
            input ,
            output
        ) ;
    }

    public static void main(
        final String[] args
    ){
        ArrayList< Long > primes ;
        Long startTime ;
        Long endTime ;
        Double duration ;
        Double meanDuration ;
        Long primeRange ;
        Long warmupIterations ;
        Long benchmarkIterations ;

        primes = new ArrayList<>() ;
        meanDuration = 0.0 ;
        primeRange = 100L ;
        warmupIterations = 20000L ;
        benchmarkIterations = 100000L ;

        System.out.println(
            "Experiment started."
        ) ;

        // Unwrapped warmup.
        for(
            Long i = 0L ;
            i < warmupIterations ;
            i += 1L
        ){
            findPrimes(
                primeRange ,
                primes
            ) ;
        }

        // Unwrapped benchmark.
        startTime = System.nanoTime() ;
        for(
            Long i = 0L ;
            i < benchmarkIterations ;
            i += 1L
        ){
            findPrimes(
                primeRange ,
                primes
            ) ;
        }
        endTime = System.nanoTime() ;
        duration = ( endTime.doubleValue() - startTime.doubleValue() ) / 1E9 ;
        System.out.println(
            "Unwrapped runtime: " + duration + " seconds."
        ) ;

        // Wrapped warmup.
        for(
            Long i = 0L ;
            i < warmupIterations ;
            i += 1L
        ){
            wrapper(
                primeRange ,
                primes
            ) ;
        }

        // Wrapped benchmark.
        startTime = System.nanoTime() ;
        for(
            Long i = 0L ;
            i < benchmarkIterations ;
            i += 1L
        ){
            wrapper(
                primeRange ,
                primes
            ) ;
        }
        endTime = System.nanoTime() ;
        duration = ( endTime.doubleValue() - startTime.doubleValue() ) / 1E9 ;
        System.out.println(
            "Wrapped runtime: " + duration + " seconds."
        ) ;

        System.out.println(
            "Experiment completed."
        ) ;
    }
}

The results are as follows:

Experiment started.
Unwrapped runtime: 4.851473465 seconds.
Wrapped runtime: 5.078349508 seconds.
Experiment completed.

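Why does this happen? How can I get the JVM to inline the wrapper method, or otherwise optimize it so that the wrapper becomes negligible?

Thanks.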

Answer 1

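The results you are seeing are simply the result of flawed benchmarking.

I took your code and added a simple for loop to repeat everything between the start and end messages 10 times. These are the results I got: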

$ java WrappingTest1 
Experiment started.
Unwrapped runtime: 5.953860592 seconds.
Wrapped runtime: 6.71209059 seconds.
Unwrapped runtime: 6.644746614 seconds.
Wrapped runtime: 7.233488553 seconds.
Unwrapped runtime: 7.241345634 seconds.
Wrapped runtime: 7.001538713 seconds.
Unwrapped runtime: 7.168569403 seconds.
Wrapped runtime: 7.067152677 seconds.
Unwrapped runtime: 7.103292898 seconds.
Wrapped runtime: 7.106762664 seconds.
Unwrapped runtime: 7.128529706 seconds.
Wrapped runtime: 7.0960061 seconds.
Unwrapped runtime: 7.197759685 seconds.
Wrapped runtime: 7.185561511 seconds.
Unwrapped runtime: 7.14927243 seconds.
Wrapped runtime: 7.163805608 seconds.
Unwrapped runtime: 7.151459682 seconds.
Wrapped runtime: 7.156398072 seconds.
Unwrapped runtime: 7.112928442 seconds.
Wrapped runtime: 7.245795652 seconds.
Experiment completed.

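As you can see, after 3 or 4 iterations the timings stabilize to the point where the wrapped and unwrapped versions take essentially the same time.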

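(At least, that's how it looks to me. To be sure, you would need to repeat the run many more times and do a statistical analysis of the timings. Timing jitter in the 0.1 second range could be due to GC timing, or to unrelated things such as handling network packets, mouse movement, browser activity, and so on.)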
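For what it's worth, here is a minimal sketch of what "repeat and analyze" could look like. The class name RepeatedTiming, the helper methods, and the placeholder workload are all made up for illustration; you would substitute the wrapped or unwrapped benchmark loop for the Runnable. It drops the first few rounds (the unstable ones) and reports the mean and sample standard deviation of the rest.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
public class RepeatedTiming {
    /** Arithmetic mean of the samples. */
    public static double mean(List<Double> samples) {
        double sum = 0.0;
        for (double s : samples) sum += s;
        return sum / samples.size();
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    public static double stdDev(List<Double> samples) {
        double m = mean(samples);
        double squares = 0.0;
        for (double s : samples) squares += (s - m) * (s - m);
        return Math.sqrt(squares / (samples.size() - 1));
    }

    /** Runs the task for the given number of rounds; returns the durations
     *  in seconds of every round except the first `discard` ones. */
    public static List<Double> measure(Runnable task, int rounds, int discard) {
        List<Double> kept = new ArrayList<>();
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            task.run();
            double seconds = (System.nanoTime() - start) / 1e9;
            if (r >= discard) kept.add(seconds);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the benchmark loop under test.
        Runnable task = () -> {
            long acc = 0;
            for (int i = 0; i < 5_000_000; i++) acc += i % 7;
            if (acc == 42) System.out.println(acc); // gives the result a use
        };
        List<Double> samples = measure(task, 10, 4);
        System.out.printf("mean %.4f s, stddev %.4f s over %d rounds%n",
                mean(samples), stdDev(samples), samples.size());
    }
}
```

If the difference between the wrapped and unwrapped means is smaller than the standard deviations, there is not much basis for saying one is slower than the other.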

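Looking at the code and the pattern of the timings, I think the likely cause of the early instability is heap warmup. It takes a few iterations before the heap size and GC behavior reach a roughly steady state.

Therefore, your conclusion that the wrapper is not being inlined is not supported.

But if you really want to be sure, you can tell the JIT compiler to dump the native code and examine the instructions for evidence of inlining.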
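As a rough sketch of how you might go about that on HotSpot: the class below (InlineProbe, a made-up name with a trivial made-up workload) is just something small to point the diagnostics at. The flags in the comment are HotSpot diagnostic options; PrintAssembly additionally needs the hsdis disassembler plugin to be installed, and the exact output format varies between JVM versions, so treat this as a starting point rather than a recipe.

```java
// Hypothetical probe class for inspecting JIT inlining decisions.
//
// Run with, for example:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineProbe
// and look for an entry for InlineProbe::work beneath InlineProbe::wrapper.
// To see the generated native code instead (requires the hsdis plugin):
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly InlineProbe
public class InlineProbe {
    /** The "real" work; deliberately tiny. */
    public static long work(long x) {
        return x * 31 + 7;
    }

    /** Trivial wrapper, analogous to wrapper() in the question. */
    public static long wrapper(long x) {
        return work(x);
    }

    /** Calls the wrapper often enough for the JIT to take an interest. */
    public static long run(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc = wrapper(acc + i);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Printing the result keeps the loop from being optimized away.
        System.out.println(run(50_000_000));
    }
}
```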

Answer 2

It's optimistic to choose final Long maxValue for the iteration and then use 100 as the maxValue.

If you replace the Longs in the loops with ints, you may get a 10x speedup.
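To make the boxing point concrete, here is a small sketch that runs the same nested loop once with boxed Long counters (as in the question) and once with primitive ints. The class name BoxingCost and the divisor-counting workload are invented for the comparison; the speedup you actually see will depend on your JVM and machine, so I'm not quoting numbers for it.

```java
// Hypothetical comparison of boxed vs. primitive loop counters.
public class BoxingCost {
    /** Counts (i, j) pairs with 2 <= j < i <= max and j dividing i,
     *  using boxed Long counters like the question's loops. */
    public static long boxed(long max) {
        Long hits = 0L;
        for (Long i = 2L; i <= max; i += 1L) {
            for (Long j = 2L; j < i; j += 1L) {
                if ((i % j) == 0) {
                    hits += 1L; // unbox, add, box again
                }
            }
        }
        return hits;
    }

    /** Same count with primitive counters: no boxing anywhere. */
    public static long primitive(int max) {
        long hits = 0;
        for (int i = 2; i <= max; ++i) {
            for (int j = 2; j < i; ++j) {
                if ((i % j) == 0) {
                    hits++;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int reps = 100_000;
        long sink = 0; // accumulate results so the calls are not dead code

        long t0 = System.nanoTime();
        for (int r = 0; r < reps; r++) sink += boxed(100L);
        long t1 = System.nanoTime();
        for (int r = 0; r < reps; r++) sink += primitive(100);
        long t2 = System.nanoTime();

        System.out.printf("boxed:     %.3f s%n", (t1 - t0) / 1e9);
        System.out.printf("primitive: %.3f s%n", (t2 - t1) / 1e9);
        System.out.println("checksum: " + sink);
    }
}
```

Note that this has the same ordering and warmup caveats as any other hand-rolled micro-benchmark, so only a large and repeatable gap means much.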

Second and third, for a drastic improvement: loop only up to Math.sqrt(i), and only while the number has not already been shown to be non-prime:

 for (int j = 2; j <= Math.sqrt(i) && isPrime; ++j)

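I know you're not after an efficient prime-finding algorithm and that this is all about micro-benchmarking the wrapper, but the basic lesson to learn is that the thing you suspect is, most of the time, not the bottleneck of the application.

Beyond that, you should try the opposite order: first wrapped, then unwrapped. To make such changes easy, you should factor out the loop and the timing.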

import java.util.* ;

public class WrappingTest1
{
    PrimeFinder[] pfs = new PrimeFinder[2];
    int primeRange = 1000;

    private WrappingTest1 ()
    {
        // pfs[1] = new UnwrappedFinder ();
        // pfs[0] = new WrappedFinder (pfs[1]);
        pfs[0] = new UnwrappedFinder ();
        pfs[1] = new WrappedFinder (pfs[0]);
    }

    void test ()
    {
        for (PrimeFinder pf: pfs)
            runblock (pf);
    }

    void loopy (int iterations, PrimeFinder pf, ArrayList <Integer> primes)
    {
        for (int i = 0; i < iterations; ++i)
            pf.findPrimes (primeRange, primes);
    }

    void runblock (PrimeFinder pf)
    {
        int warmupIterations = 20000;
        int benchmarkIterations = 100000;
        ArrayList <Integer> primes = new ArrayList<Integer> (50000) ;

        // warmup.
        loopy (warmupIterations, pf, primes);
        // benchmark.
        Long startTime = System.nanoTime();
        loopy (benchmarkIterations, pf, primes);
        Long endTime = System.nanoTime() ;

        Double duration = (endTime.doubleValue () - startTime.doubleValue ()) / 1E9 ;
        System.out.printf ("%s runtime: %4.2f seconds.\n", pf.name(), duration);
        // had to make sure, that we're really producing valid primes:
        // and that they survive the code changes.
        for (int p: primes) {
            System.out.printf ("%d ", p);
        }
        System.out.println ("bye");
    }

    abstract class PrimeFinder {
        abstract void findPrimes (final int maxValue, final List <Integer> foundPrimes);
        abstract String name ();
    }

    class UnwrappedFinder extends PrimeFinder {
        String name () {return "Unwrapped";}
        void findPrimes (final int maxValue, final List <Integer> foundPrimes)
        {
            if (maxValue > 2)
            {
                foundPrimes.clear () ;
                for (int i = 2; i <= maxValue; ++i)
                {
                    Boolean isPrime = true;
                    for (int j = 2; j <= Math.sqrt(i) && isPrime; ++j)
                        if ((i % j) == 0)
                            isPrime = false;
                    if (isPrime)
                        foundPrimes.add (i);
                }
            }
        }
    }

    class WrappedFinder extends PrimeFinder {
        String name () {return "  Wrapped";}
        private PrimeFinder pf;
        public WrappedFinder (PrimeFinder ppf)
        {
            pf = ppf;
        }
        void findPrimes (final int input, final List <Integer> output) {
            pf.findPrimes (input, output);
        }
    }

    public static void main (final String[] args)
    {
        System.out.println ("Experiment started.");
        WrappingTest1 wt1 = new WrappingTest1 ();
        wt1.test ();
        System.out.println ("Experiment completed.") ;
    }
}

Running my code with primeRange = 100, but 1M iterations, I get:

Unwrapped runtime: 2,34 seconds.
Unwrapped runtime: 2,53 seconds.
Unwrapped runtime: 2,50 seconds.
Unwrapped runtime: 2,50 seconds.
Unwrapped runtime: 2,49 seconds.
Unwrapped runtime: 2,52 seconds.
Unwrapped runtime: 2,59 seconds.
Unwrapped runtime: 2,60 seconds.
Unwrapped runtime: 2,58 seconds.
Unwrapped runtime: 2,52 seconds.

  Wrapped runtime: 2,36 seconds.
  Wrapped runtime: 2,36 seconds.
  Wrapped runtime: 2,36 seconds.
  Wrapped runtime: 2,36 seconds.
  Wrapped runtime: 2,37 seconds.
  Wrapped runtime: 2,37 seconds.
  Wrapped runtime: 2,36 seconds.
  Wrapped runtime: 2,41 seconds.
  Wrapped runtime: 2,37 seconds.
  Wrapped runtime: 2,37 seconds.

So, surprisingly, the wrapped version is faster. Hmm. After changing the order in PrimeFinder[] pfs, they are closer together: wrapped 2.46, unwrapped 2.52.

Answer 3

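It looks like the reason for the slowdown is not the optimization (or lack thereof) of the wrapped/unwrapped method calls, but rather the lack of optimization of the main method itself. Profiling with the JVM's compilation logging (-XX:+PrintCompilation; the default is -XX:-PrintCompilation, i.e. off) shows that, with default settings, the wrapped/unwrapped method calls are optimized after 10K warmup iterations. However, the main method is only reported as compiled after about 70K warmup iterations. So if the number of warmup iterations is below 70K, the unwrapped runtime is noticeably lower than the wrapped runtime; with 70K or more warmup iterations, the two runtimes are similar. Of course, this only applies to this particular benchmark; individual programs may give different results.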
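If you want to try this yourself, something like the following sketch lets you vary the warmup count from the command line. The class name WarmupSweep and the command-line handling are my own additions, and I've used primitive longs to keep it short; the loops are deliberately left inline in main, as in the question, because main's own compilation is what is being observed. The 70K figure is what I saw for this benchmark, so expect a different threshold on another JVM or machine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant of the question's benchmark with a configurable warmup.
// Try, for example:
//   java -XX:+PrintCompilation WarmupSweep 20000
//   java -XX:+PrintCompilation WarmupSweep 70000
// and watch for the point at which WarmupSweep::main shows up in the log.
public class WarmupSweep {
    /** Naive trial-division prime search, as in the question. */
    public static void findPrimes(long maxValue, List<Long> found) {
        found.clear();
        for (long i = 2; i <= maxValue; i++) {
            boolean isPrime = true;
            for (long j = 2; j < i; j++) {
                if (i % j == 0) isPrime = false;
            }
            if (isPrime) found.add(i);
        }
    }

    /** Trivial wrapper around findPrimes. */
    public static void wrapper(long input, List<Long> output) {
        findPrimes(input, output);
    }

    public static void main(String[] args) {
        long warmup = args.length > 0 ? Long.parseLong(args[0]) : 20_000L;
        long bench = 100_000L;
        List<Long> primes = new ArrayList<>();

        // Unwrapped: warmup, then timed run.
        for (long i = 0; i < warmup; i++) findPrimes(100L, primes);
        long t0 = System.nanoTime();
        for (long i = 0; i < bench; i++) findPrimes(100L, primes);
        long t1 = System.nanoTime();

        // Wrapped: warmup, then timed run.
        for (long i = 0; i < warmup; i++) wrapper(100L, primes);
        long t2 = System.nanoTime();
        for (long i = 0; i < bench; i++) wrapper(100L, primes);
        long t3 = System.nanoTime();

        System.out.printf("warmup=%d unwrapped=%.3f s wrapped=%.3f s%n",
                warmup, (t1 - t0) / 1e9, (t3 - t2) / 1e9);
    }
}
```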

Why doesn't the small wrapper method get optimized?
