Question

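Recently I ran into something very strange: one method was extremely slow under the profiler for no obvious reason. It contains only a few operations on long, but it is called very often. It accounted for about 30-40% of the program's total time, while other parts look much "heavier".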

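I usually run programs that are not memory-intensive on an x32 JVM, but suspecting a problem with the 64-bit type, I ran the same program on an x64 JVM. Overall performance in the "live scenario" became 2-3 times better. After that I created JMH benchmarks for the operations from that specific method and was shocked by the difference between the x32 and x64 JVMs: up to 50 times.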

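I would "accept" an x32 JVM that is about 2 times slower (smaller word size), but I have no idea where 30-50 times could come from. Can you explain this drastic difference?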

In reply to the comments:

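I rewrote the test code to "return something" and avoid "dead code elimination". It changed nothing for "x32", but some methods on "x64" became significantly slower.
Both tests were run under "client". Running under "-server" had no noticeable effect.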
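To double-check which VM a given run actually used, the VM can report this itself. The following is a minimal sketch; the class name `VmInfo` is made up for illustration, and `sun.arch.data.model` is a HotSpot-specific property that may be absent on other VMs. On HotSpot, `java.vm.name` typically contains "Client VM" or "Server VM". As far as I know, 64-bit HotSpot builds of Java 8 ignore the client option and always use the server VM, so it is worth verifying rather than assuming.

```java
final class VmInfo {

    /** Returns a one-line description of the VM this code is running on. */
    static String describe() {
        // Standard system properties, available on every Java SE runtime.
        String vmName = System.getProperty("java.vm.name");
        String version = System.getProperty("java.version");
        String arch = System.getProperty("os.arch");
        // HotSpot-specific property; "unknown" is our own fallback value.
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        return vmName + " " + version
                + ", os.arch=" + arch
                + ", data model=" + dataModel;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this once with each JVM (and once with each VM option) shows whether the flag had any effect on the VM that was selected.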

So it seems the answer to my question is:

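The "test code" was wrong: because there was "no return value", the JVM was free to perform "dead code elimination" or other optimizations, and the "x32 JVM" appears to do fewer of these optimizations than the "x64 JVM". That produced the large "false" difference between x32 and x64.
With the "correct test code" the performance difference is at most 2-5 times, which seems reasonable.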

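Here are the results (note: "? 10??" stands for special characters that did not print on Windows; these are values below 0.001 s/op written in scientific notation as 10e-??).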

x32 1.8.0_152

Benchmark                Mode  Score Units    Score (with 'return')
IntVsLong.cycleInt       avgt  0.035  s/op    0.034   (?x slower vs. x64)
IntVsLong.cycleLong      avgt  0.106  s/op    0.099   (3x slower vs. x64) 
IntVsLong.divDoubleInt   avgt  0.462  s/op    0.459
IntVsLong.divDoubleLong  avgt  1.658  s/op    1.724   (2x slower vs. x64)
IntVsLong.divInt         avgt  0.335  s/op    0.373
IntVsLong.divLong        avgt  1.380  s/op    1.399
IntVsLong.l2i            avgt  0.101  s/op    0.197   (3x slower vs. x64)  
IntVsLong.mulInt         avgt  0.067  s/op    0.068
IntVsLong.mulLong        avgt  0.278  s/op    0.337   (5x slower vs. x64)
IntVsLong.subInt         avgt  0.067  s/op    0.067   (?x slower vs. x64)
IntVsLong.subLong        avgt  0.243  s/op    0.300   (4x slower vs. x64)

x64 1.8.0_152

Benchmark                Mode  Score Units    Score (with 'return')
IntVsLong.cycleInt       avgt ? 10??  s/op   ? 10??
IntVsLong.cycleLong      avgt  0.035  s/op    0.034
IntVsLong.divDoubleInt   avgt  0.045  s/op    0.788 (was dead)
IntVsLong.divDoubleLong  avgt  0.033  s/op    0.787 (was dead)
IntVsLong.divInt         avgt ? 10??  s/op    0.302 (was dead)
IntVsLong.divLong        avgt  0.046  s/op    1.098 (was dead)
IntVsLong.l2i            avgt  0.037  s/op    0.067
IntVsLong.mulInt         avgt ? 10??  s/op    0.052 (was dead)
IntVsLong.mulLong        avgt  0.040  s/op    0.067
IntVsLong.subInt         avgt ? 10??  s/op   ? 10??
IntVsLong.subLong        avgt  0.075  s/op    0.082

Here is the (fixed) benchmark code:

import org.openjdk.jmh.annotations.Benchmark;

public class IntVsLong {

    public static int N_REPEAT_I  = 100_000_000;
    public static long N_REPEAT_L = 100_000_000;

    public static int CONST_I = 3;
    public static long CONST_L = 3;
    public static double CONST_D = 3;

    @Benchmark
    public void cycleInt() throws InterruptedException {
        for( int i = 0; i < N_REPEAT_I; i++ ) {
        }
    }

    @Benchmark
    public void cycleLong() throws InterruptedException {
        for( long i = 0; i < N_REPEAT_L; i++ ) {
        }
    }

    @Benchmark
    public int divInt() throws InterruptedException {
        int r = 0;
        for( int i = 0; i < N_REPEAT_I; i++ ) {
            r += i / CONST_I;
        }
        return r;
    }

    @Benchmark
    public long divLong() throws InterruptedException {
        long r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += i / CONST_L;
        }
        return r;
    }

    @Benchmark
    public double divDoubleInt() throws InterruptedException {
        double r = 0;
        for( int i = 1; i < N_REPEAT_L; i++ ) {
            r += CONST_D / i;
        }
        return r;
    }

    @Benchmark
    public double divDoubleLong() throws InterruptedException {
        double r = 0;
        for( long i = 1; i < N_REPEAT_L; i++ ) {
            r += CONST_D / i;
        }
        return r;
    }

    @Benchmark
    public int mulInt() throws InterruptedException {
        int r = 0;
        for( int i = 0; i < N_REPEAT_I; i++ ) {
            r += i * CONST_I;
        }
        return r;
    }

    @Benchmark
    public long mulLong() throws InterruptedException {
        long r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += i * CONST_L;
        }
        return r;
    }

    @Benchmark
    public int subInt() throws InterruptedException {
        int r = 0;
        for( int i = 0; i < N_REPEAT_I; i++ ) {
            r += i - r;
        }
        return r;
    }

    @Benchmark
    public long subLong() throws InterruptedException {
        long r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += i - r;
        }
        return r;
    }

    @Benchmark
    public long l2i() throws InterruptedException {
        int r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += (int)i;
        }
        return r;
    }

}
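One detail of the listing is worth checking by hand. In subInt and subLong the body `r += i - r` is algebraically `r = i`, so the whole loop reduces to "the last value of i". That may be why x64 still reports subInt at 10e-?? even with a return value, although this is a guess and not something verified against the generated code. The sketch below only demonstrates the arithmetic identity; `SubLoopCheck` and its methods are hypothetical names, and the loop bound is passed as a parameter instead of being read from N_REPEAT_I.

```java
final class SubLoopCheck {

    // Same loop body as subInt in the listing, with the bound as a parameter.
    static int subIntLoop(int n) {
        int r = 0;
        for (int i = 0; i < n; i++) {
            r += i - r; // r + (i - r) == i, also under int wrap-around
        }
        return r;
    }

    // Closed form of the loop above, valid for n >= 1: the last value of i.
    static int subIntClosedForm(int n) {
        return n - 1;
    }

    public static void main(String[] args) {
        int n = 1_000;
        System.out.println(subIntLoop(n) == subIntClosedForm(n));
    }
}
```

A loop body that folds each i into the result in a way that cannot be collapsed so easily (as mulInt and divInt do) is likely a better basis for comparing int and long subtraction.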

Answer 1

There are a lot of variables to check.

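Looking only at the processor: a 64-bit CPU can do more work on its registers in a single step, because each register holds eight bytes instead of four. This improves the performance of operations and of memory allocation. Also, some CPUs enable advanced features only when running in 64-bit mode.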
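To illustrate the register-width point, here is a minimal sketch of what adding two long values looks like when only 32-bit registers are available: the value is split into two halves, the low halves are added, and the carry is propagated into the high halves. `LongOnInt32.addViaHalves` is a hypothetical helper written for this illustration, not something the JVM exposes; a 32-bit JIT would typically emit an add followed by an add-with-carry instruction instead.

```java
final class LongOnInt32 {

    // Adds two 64-bit values using only 32-bit arithmetic on their halves.
    static long addViaHalves(long a, long b) {
        int aLo = (int) a;
        int aHi = (int) (a >>> 32);
        int bLo = (int) b;
        int bHi = (int) (b >>> 32);

        int lo = aLo + bLo;
        // Unsigned overflow of the low halves happened iff the sum is
        // smaller (as an unsigned number) than one of the operands.
        int carry = Integer.compareUnsigned(lo, aLo) < 0 ? 1 : 0;
        int hi = aHi + bHi + carry;

        // Reassemble the two halves into one 64-bit result.
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long a = 0xFFFFFFFFL;
        long b = 1L;
        System.out.println(addViaHalves(a, b) == a + b);
    }
}
```

A single 64-bit add becomes several 32-bit steps here, which is in line with the 2-5 times slowdown for long operations reported in the question. Multiplication and division on split halves need even more steps.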

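If you run both tests on the same CPU, you need to take into account that, to execute 32-bit instructions, the CPU has to run in virtual mode or protected mode, which is slower than a real 32-bit CPU. Also, some instruction set extensions may not be available in 32-bit mode, such as SSE (SIMD) or AVX, which could speed up some operations.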

If you are using a modern OS such as Windows 10, you also need to take into account that the OS runs 32-bit applications through WOW64 (an x86 emulator).

References:

Running 32 bit Applications on Windows 64 Bit
Wikipedia X86-64 (see Operating modes)

Integer performance: 30-50 times difference between x32 and x64 JVM?
