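Why is adding DoubleStructs so much slower than adding doubles, when LongStructs are as fast as longs?

Time: 2019-03-14 10:04:56

Tags: c# performance .net-core x86-64 benchmarking

Premise

For any simple operation, a readonly struct containing a single primitive should be just as fast as the primitive itself.

Tests

All tests below were run on .NET Core 2.2 on Windows 7 x64, with code optimization enabled. I get similar results when testing on .NET 4.7.2.

Test: longs

Testing this premise with the long type, it appears to hold: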

// =============== SETUP ===================

public readonly struct LongStruct
{
    public readonly long Primitive;

    public LongStruct(long value) => Primitive = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static LongStruct Add(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long LongAdd(long lhs, long rhs) => lhs + rhs;

// =============== TESTS ===================

public static long TestLong(long a, long b, out long result)
{
    var sw = Stopwatch.StartNew();

    for (var i = 1000000000; i > 0; --i)
    {
        a = LongAdd(a, b);
    }

    sw.Stop();

    result = a;

    return sw.ElapsedMilliseconds;
}

public static long TestLongStruct(LongStruct a, LongStruct b, out LongStruct result)
{
    var sw = Stopwatch.StartNew();

    for (var i = 1000000000; i > 0; --i)
    {
        a = LongStruct.Add(a, b);
    }

    sw.Stop();

    result = a;

    return sw.ElapsedMilliseconds;
}

// ============= TEST LOOP =================

public static void RunTests()
{
    var longStruct = new LongStruct(1);

    var count = 0;
    var longTime = 0L;
    var longStructTime = 0L;

    while (true)
    {
        count++;
        Console.WriteLine("Test #" + count);

        longTime += TestLong(1, 1, out var longResult);
        var longMean = longTime / count;
        Console.WriteLine($"Long: value={longResult}, Mean Time elapsed: {longMean} ms");

        longStructTime += TestLongStruct(longStruct, longStruct, out var longStructResult);
        var longStructMean = longStructTime / count;
        Console.WriteLine($"LongStruct: value={longStructResult.Primitive}, Mean Time elapsed: {longStructMean} ms");

        Console.WriteLine();
    }
}
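LongAdd is used so that the test loops match: each loop calls a method that does some addition, rather than having the addition written inline for the primitive case.

On my machine, the two times are consistently within 2% of each other, close enough that I am convinced they have been optimized to pretty much identical code.

The difference in the IL is quite small:

  • The test loop code is the same apart from which method is called (LongAdd vs LongStruct.Add).
  • LongStruct.Add has a few extra instructions:
    • A pair of ldfld instructions to load Primitive from the struct
    • A newobj instruction to pack the new long back into a LongStruct

So either the jitter is optimizing these instructions away, or they are basically free.

Test: doubles

If I take the code above and replace every long with a double, I would expect the same result (slower in absolute terms, since the add instructions will be slightly slower, but both by the same margin).

What I actually see is that the DoubleStruct version is about 4.8 times slower (i.e. 480%) than the double version.

The IL is identical to the long case (other than swapping int64 and LongStruct for float64 and DoubleStruct), but somehow the runtime is doing a lot of extra work for the DoubleStruct case that is not present in the LongStruct case or the double case.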
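For reference, the double variant described above might look roughly like the sketch below. It is obtained by mechanically substituting double for long in the long test code; the container class DoubleTests and the iterations parameter are assumptions added here so the snippet can be compiled and run quickly on its own, and they are not part of the original post.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public readonly struct DoubleStruct
{
    public readonly double Primitive;

    public DoubleStruct(double value) => Primitive = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static DoubleStruct Add(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}

// Hypothetical container class; the original post shows the methods without one.
public static class DoubleTests
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double DoubleAdd(double lhs, double rhs) => lhs + rhs;

    // Returns the elapsed milliseconds; the sum is passed out so it stays observable.
    public static long TestDouble(double a, double b, int iterations, out double result)
    {
        var sw = Stopwatch.StartNew();
        for (var i = iterations; i > 0; --i)
        {
            a = DoubleAdd(a, b);
        }
        sw.Stop();
        result = a;
        return sw.ElapsedMilliseconds;
    }

    // Same loop, but every addition goes through the wrapper struct.
    public static long TestDoubleStruct(DoubleStruct a, DoubleStruct b, int iterations, out DoubleStruct result)
    {
        var sw = Stopwatch.StartNew();
        for (var i = iterations; i > 0; --i)
        {
            a = DoubleStruct.Add(a, b);
        }
        sw.Stop();
        result = a;
        return sw.ElapsedMilliseconds;
    }
}
```

Calling both methods with the same inputs (for example 1 and 1 with 1000000000 iterations, as in the long test) should produce the same sum; only the elapsed times are expected to differ.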

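Test: other types

Testing a few other primitive types, I see that float (465%) behaves the same way as double, and that short and int behave the same way as long, so it seems that something about floating point prevents some optimization from being performed.

Question

Why are DoubleStruct and FloatStruct so much slower than double and float, when the long, int and short equivalents suffer no such slowdown?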

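2 answers:

Answer 0: (score: 3)

This is not an answer in its own right, but it is a somewhat more rigorous benchmark on both x86 and x64, so hopefully it gives more information to someone else who can explain this.

I tried to replicate this with BenchmarkDotNet. I also wanted to see what difference removing the in would make. I ran it separately as x86 and x64.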

x86(LegacyJIT)

|                 Method |     Mean |    Error |   StdDev |
|----------------------- |---------:|---------:|---------:|
|               TestLong | 257.9 ms | 2.099 ms | 1.964 ms |
|         TestLongStruct | 529.3 ms | 4.977 ms | 4.412 ms |
|   TestLongStructWithIn | 526.2 ms | 6.722 ms | 6.288 ms |
|             TestDouble | 256.7 ms | 1.466 ms | 1.300 ms |
|       TestDoubleStruct | 342.5 ms | 5.189 ms | 4.600 ms |
| TestDoubleStructWithIn | 338.7 ms | 3.808 ms | 3.376 ms |

x64(RyuJIT)

|                 Method |       Mean |     Error |    StdDev |
|----------------------- |-----------:|----------:|----------:|
|               TestLong |   269.8 ms |  5.359 ms |  9.099 ms |
|         TestLongStruct |   266.2 ms |  6.706 ms |  8.236 ms |
|   TestLongStructWithIn |   270.4 ms |  4.150 ms |  3.465 ms |
|             TestDouble |   270.4 ms |  5.336 ms |  6.748 ms |
|       TestDoubleStruct | 1,250.9 ms | 24.702 ms | 25.367 ms |
| TestDoubleStructWithIn |   577.1 ms | 12.159 ms | 16.644 ms |

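I can replicate this on x64 with RyuJIT, but not on x86 with LegacyJIT. This seems to be an artifact of RyuJIT managing to optimize the long case but not the double case; LegacyJIT does not manage to optimize either.

I have no idea why TestDoubleStruct is such an outlier on RyuJIT.

Code: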

public readonly struct LongStruct
{
    public readonly long Primitive;

    public LongStruct(long value) => Primitive = value;

    public static LongStruct Add(LongStruct lhs, LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
    public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}

public readonly struct DoubleStruct
{
    public readonly double Primitive;

    public DoubleStruct(double value) => Primitive = value;

    public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
    public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}


public class Benchmark
{
    [Benchmark]
    public void TestLong()
    {
        for (var i = 1000000000; i > 0; --i)
        {
            LongAdd(1, 2);
        }
    }

    [Benchmark]
    public void TestLongStruct()
    {
        var a = new LongStruct(1);
        var b = new LongStruct(2);

        for (var i = 1000000000; i > 0; --i)
        {
            LongStruct.Add(a, b);
        }
    }

    [Benchmark]
    public void TestLongStructWithIn()
    {
        var a = new LongStruct(1);
        var b = new LongStruct(2);

        for (var i = 1000000000; i > 0; --i)
        {
            LongStruct.AddWithIn(a, b);
        }
    }

    [Benchmark]
    public void TestDouble()
    {
        for (var i = 1000000000; i > 0; --i)
        {
            DoubleAdd(1, 2);
        }
    }

    [Benchmark]
    public void TestDoubleStruct()
    {
        var a = new DoubleStruct(1);
        var b = new DoubleStruct(2);

        for (var i = 1000000000; i > 0; --i)
        {
            DoubleStruct.Add(a, b);
        }
    }

    [Benchmark]
    public void TestDoubleStructWithIn()
    {
        var a = new DoubleStruct(1);
        var b = new DoubleStruct(2);

        for (var i = 1000000000; i > 0; --i)
        {
            DoubleStruct.AddWithIn(a, b);
        }
    }

    public static long LongAdd(long lhs, long rhs) => lhs + rhs;
    public static double DoubleAdd(double lhs, double rhs) => lhs + rhs;
}

class Program
{
    static void Main(string[] args)
    {
        var summary = BenchmarkRunner.Run<Benchmark>();
        Console.ReadLine();
    }
}

For interest, here is the assembly for the two cases (x86 and x64):

Code

using System;

public class C {
    public long AddLongs(long a, long b) {
        return a + b;
    }

    public LongStruct AddLongStructs(LongStruct a, LongStruct b) {
        return LongStruct.Add(a, b);
    }

    public LongStruct AddLongStructsWithIn(LongStruct a, LongStruct b) {
        return LongStruct.AddWithIn(a, b);
    }

    public double AddDoubles(double a, double b) {
        return a + b;
    }

    public DoubleStruct AddDoubleStructs(DoubleStruct a, DoubleStruct b) {
        return DoubleStruct.Add(a, b);
    }

    public DoubleStruct AddDoubleStructsWithIn(DoubleStruct a, DoubleStruct b) {
        return DoubleStruct.AddWithIn(a, b);
    }
}

public readonly struct LongStruct
{
    public readonly long Primitive;

    public LongStruct(long value) => Primitive = value;

    public static LongStruct Add(LongStruct lhs, LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
    public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}   

public readonly struct DoubleStruct
{
    public readonly double Primitive;

    public DoubleStruct(double value) => Primitive = value;

    public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
    public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}

x86 assembly

C.AddLongs(Int64, Int64)
    L0000: mov eax, [esp+0xc]
    L0004: mov edx, [esp+0x10]
    L0008: add eax, [esp+0x4]
    L000c: adc edx, [esp+0x8]
    L0010: ret 0x10

C.AddLongStructs(LongStruct, LongStruct)
    L0000: push esi
    L0001: mov eax, [esp+0x10]
    L0005: mov esi, [esp+0x14]
    L0009: add eax, [esp+0x8]
    L000d: adc esi, [esp+0xc]
    L0011: mov [edx], eax
    L0013: mov [edx+0x4], esi
    L0016: pop esi
    L0017: ret 0x10

C.AddLongStructsWithIn(LongStruct, LongStruct)
    L0000: push esi
    L0001: mov eax, [esp+0x10]
    L0005: mov esi, [esp+0x14]
    L0009: add eax, [esp+0x8]
    L000d: adc esi, [esp+0xc]
    L0011: mov [edx], eax
    L0013: mov [edx+0x4], esi
    L0016: pop esi
    L0017: ret 0x10

C.AddDoubles(Double, Double)
    L0000: fld qword [esp+0xc]
    L0004: fadd qword [esp+0x4]
    L0008: ret 0x10

C.AddDoubleStructs(DoubleStruct, DoubleStruct)
    L0000: fld qword [esp+0xc]
    L0004: fld qword [esp+0x4]
    L0008: faddp st1, st0
    L000a: fstp qword [edx]
    L000c: ret 0x10

C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
    L0000: fld qword [esp+0xc]
    L0004: fadd qword [esp+0x4]
    L0008: fstp qword [edx]
    L000a: ret 0x10

x64 assembly

C..ctor()
    L0000: ret

C.AddLongs(Int64, Int64)
    L0000: lea rax, [rdx+r8]
    L0004: ret

C.AddLongStructs(LongStruct, LongStruct)
    L0000: lea rax, [rdx+r8]
    L0004: ret

C.AddLongStructsWithIn(LongStruct, LongStruct)
    L0000: lea rax, [rdx+r8]
    L0004: ret

C.AddDoubles(Double, Double)
    L0000: vzeroupper
    L0003: vmovaps xmm0, xmm1
    L0008: vaddsd xmm0, xmm0, xmm2
    L000d: ret

C.AddDoubleStructs(DoubleStruct, DoubleStruct)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: mov [rsp+0x28], rdx
    L000c: mov [rsp+0x30], r8
    L0011: mov rax, [rsp+0x28]
    L0016: mov [rsp+0x10], rax
    L001b: mov rax, [rsp+0x30]
    L0020: mov [rsp+0x8], rax
    L0025: vmovsd xmm0, qword [rsp+0x10]
    L002c: vaddsd xmm0, xmm0, [rsp+0x8]
    L0033: vmovsd [rsp], xmm0
    L0039: mov rax, [rsp]
    L003d: add rsp, 0x18
    L0041: ret

C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
    L0000: push rax
    L0001: vzeroupper
    L0004: mov [rsp+0x18], rdx
    L0009: mov [rsp+0x20], r8
    L000e: vmovsd xmm0, qword [rsp+0x18]
    L0015: vaddsd xmm0, xmm0, [rsp+0x20]
    L001c: vmovsd [rsp], xmm0
    L0022: mov rax, [rsp]
    L0026: add rsp, 0x8
    L002a: ret

SharpLab


If you add in the loops:

Code

public class C {
    public void AddLongs(long a, long b) {
        for (var i = 1000000000; i > 0; --i) {
            long c = a + b;
        }
    }

    public void AddLongStructs(LongStruct a, LongStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = LongStruct.Add(a, b);
        }
    }

    public void AddLongStructsWithIn(LongStruct a, LongStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = LongStruct.AddWithIn(a, b);
        }
    }

    public void AddDoubles(double a, double b) {
        for (var i = 1000000000; i > 0; --i) {
            a = a + b;
        }
    }

    public void AddDoubleStructs(DoubleStruct a, DoubleStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = DoubleStruct.Add(a, b);
        }
    }

    public void AddDoubleStructsWithIn(DoubleStruct a, DoubleStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = DoubleStruct.AddWithIn(a, b);
        }
    }
}

public readonly struct LongStruct
{
    public readonly long Primitive;

    public LongStruct(long value) => Primitive = value;

    public static LongStruct Add(LongStruct lhs, LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
    public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}   

public readonly struct DoubleStruct
{
    public readonly double Primitive;

    public DoubleStruct(double value) => Primitive = value;

    public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
    public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}

x86

C.AddLongs(Int64, Int64)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: mov eax, 0x3b9aca00
    L0008: dec eax
    L0009: test eax, eax
    L000b: jg L0008
    L000d: pop ebp
    L000e: ret 0x10

C.AddLongStructs(LongStruct, LongStruct)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: push esi
    L0004: mov esi, 0x3b9aca00
    L0009: mov eax, [ebp+0x10]
    L000c: mov edx, [ebp+0x14]
    L000f: add eax, [ebp+0x8]
    L0012: adc edx, [ebp+0xc]
    L0015: mov [ebp+0x10], eax
    L0018: mov [ebp+0x14], edx
    L001b: dec esi
    L001c: test esi, esi
    L001e: jg L0009
    L0020: pop esi
    L0021: pop ebp
    L0022: ret 0x10

C.AddLongStructsWithIn(LongStruct, LongStruct)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: push esi
    L0004: mov esi, 0x3b9aca00
    L0009: mov eax, [ebp+0x10]
    L000c: mov edx, [ebp+0x14]
    L000f: add eax, [ebp+0x8]
    L0012: adc edx, [ebp+0xc]
    L0015: mov [ebp+0x10], eax
    L0018: mov [ebp+0x14], edx
    L001b: dec esi
    L001c: test esi, esi
    L001e: jg L0009
    L0020: pop esi
    L0021: pop ebp
    L0022: ret 0x10

C.AddDoubles(Double, Double)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: mov eax, 0x3b9aca00
    L0008: dec eax
    L0009: test eax, eax
    L000b: jg L0008
    L000d: pop ebp
    L000e: ret 0x10

C.AddDoubleStructs(DoubleStruct, DoubleStruct)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: mov eax, 0x3b9aca00
    L0008: fld qword [ebp+0x10]
    L000b: fld qword [ebp+0x8]
    L000e: faddp st1, st0
    L0010: fstp qword [ebp+0x10]
    L0013: dec eax
    L0014: test eax, eax
    L0016: jg L0008
    L0018: pop ebp
    L0019: ret 0x10

C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: mov eax, 0x3b9aca00
    L0008: fld qword [ebp+0x10]
    L000b: fadd qword [ebp+0x8]
    L000e: fstp qword [ebp+0x10]
    L0011: dec eax
    L0012: test eax, eax
    L0014: jg L0008
    L0016: pop ebp
    L0017: ret 0x10

x64

C.AddLongs(Int64, Int64)
    L0000: mov eax, 0x3b9aca00
    L0005: dec eax
    L0007: test eax, eax
    L0009: jg L0005
    L000b: ret

C.AddLongStructs(LongStruct, LongStruct)
    L0000: mov eax, 0x3b9aca00
    L0005: add rdx, r8
    L0008: dec eax
    L000a: test eax, eax
    L000c: jg L0005
    L000e: ret

C.AddLongStructsWithIn(LongStruct, LongStruct)
    L0000: mov eax, 0x3b9aca00
    L0005: add rdx, r8
    L0008: dec eax
    L000a: test eax, eax
    L000c: jg L0005
    L000e: ret

C.AddDoubles(Double, Double)
    L0000: vzeroupper
    L0003: mov eax, 0x3b9aca00
    L0008: vaddsd xmm1, xmm1, xmm2
    L000d: dec eax
    L000f: test eax, eax
    L0011: jg L0008
    L0013: ret

C.AddDoubleStructs(DoubleStruct, DoubleStruct)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: mov [rsp+0x28], rdx
    L000c: mov [rsp+0x30], r8
    L0011: mov eax, 0x3b9aca00
    L0016: mov rdx, [rsp+0x28]
    L001b: mov [rsp+0x10], rdx
    L0020: mov rdx, [rsp+0x30]
    L0025: mov [rsp+0x8], rdx
    L002a: vmovsd xmm0, qword [rsp+0x10]
    L0031: vaddsd xmm0, xmm0, [rsp+0x8]
    L0038: vmovsd [rsp], xmm0
    L003e: mov rdx, [rsp]
    L0042: mov [rsp+0x28], rdx
    L0047: dec eax
    L0049: test eax, eax
    L004b: jg L0016
    L004d: add rsp, 0x18
    L0051: ret

C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
    L0000: push rax
    L0001: vzeroupper
    L0004: mov [rsp+0x18], rdx
    L0009: mov [rsp+0x20], r8
    L000e: mov eax, 0x3b9aca00
    L0013: vmovsd xmm0, qword [rsp+0x20]
    L001a: vmovaps xmm1, xmm0
    L001f: vaddsd xmm1, xmm1, [rsp+0x18]
    L0026: vmovsd [rsp], xmm1
    L002c: mov rdx, [rsp]
    L0030: mov [rsp+0x18], rdx
    L0035: dec eax
    L0037: test eax, eax
    L0039: jg L001a
    L003b: add rsp, 0x8
    L003f: ret

SharpLab

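I am not familiar enough with assembly to explain exactly what it is doing, but it is clear that more work is going on in AddDoubleStructs than in AddLongStructs.

Answer 1: (score: 3)

See @canton7's answer for some timing results and the x86 asm output that my conclusions are based on. (I don't have Windows or a C# compiler.)

Anomaly: the "release" asm for the loops on SharpLab doesn't match @canton7's BenchmarkDotNet performance numbers for any Intel or AMD CPU. The asm shows that TestDouble really does a+=b inside the loop, but the timings show it running as fast as the 1/clock integer loop. (FP add latency is 3 to 5 cycles on all AMD K8 / K10 / Bulldozer-family / Ryzen, and on Intel P6 through Skylake.)

Maybe that is only a first-pass optimization, and after running for longer the JIT will optimize away the FP add entirely (because the value isn't returned). So I think we unfortunately still don't have the asm that is actually running, but we can see the kind of mess the JIT optimizer makes.

I don't understand how TestDoubleStructWithIn can be slower than the integer loops but only by a factor of two (not 3x), unless the long loops aren't running at 1 iteration per clock. With counts that high, startup overhead should be negligible. A loop counter kept in memory could explain it (imposing a ~6 cycle per iteration bottleneck on everything, hiding the latency of anything except the very slow FP versions). But @canton7 says they tested with a Release build. Their i7-8650U may not be able to maintain max turbo = 4.20 GHz for all the loops because of power/thermal limits (all-core minimum sustained frequency = 1.90 GHz), so looking at time in seconds instead of cycles may be throwing us off for the loops with no bottleneck? That still doesn't explain primitive double running at the same speed; those must have been optimized away.

It is reasonable to expect this class to inline and optimize away, the way you are using it. A good compiler would do that. But a JIT has to compile quickly, so it is not always good, and in this case it clearly isn't for double.

For the integer loops, 64-bit integer add on x86-64 has 1 cycle latency, and modern superscalar CPUs have enough throughput to run a loop containing an add at the same speed as an otherwise-empty loop that just counts down a counter. So we can't tell from the timings whether the compiler did a + b * 1000000000 outside the loop (but still ran an empty loop), or what.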
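To illustrate why a compiler would be allowed to replace the integer loop with a single a + b * n, here is a minimal sketch. Integer addition wraps modulo 2^64 and is therefore exact, so the loop and the closed form always agree. The class and method names are hypothetical, and the unchecked blocks assume the default C# behaviour of wrapping on overflow.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class LongLoopClosedForm
{
    // Repeated addition, as in the benchmark loop.
    public static long AddInLoop(long a, long b, int n)
    {
        unchecked
        {
            for (var i = n; i > 0; --i)
            {
                a += b;
            }
        }
        return a;
    }

    // What an optimizer could legally compute instead: one multiply and one add.
    public static long ClosedForm(long a, long b, int n)
    {
        unchecked
        {
            return a + b * n;
        }
    }
}
```

Because the two forms are indistinguishable by their results, only the generated asm (not the timing of a 1/clock loop) can tell which one the JIT actually emitted.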

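@canton7 used SharpLab to look at the JIT x86-64 asm for a standalone version of AddDoubleStructs, and for loops that call it: standalone and loops, x86-64, release mode.

We can see that for primitive long c = a + b it completely optimized away the add (but kept an empty countdown loop)! If we use a = a+b; we do get an actual add instruction, even though a is not returned from the function.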

loops.AddLongs(Int64, Int64)
    L0000: mov eax, 0x3b9aca00    # i = init
                                  # do {
                                  #   long c = a+b   optimized out
    L0005: dec eax                #   --i;
    L0007: test eax, eax
    L0009: jg L0005               # }while(i>0);

    L000b: ret

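But the struct version has an actual add instruction, from a = LongStruct.Add(a, b);. (We get the same asm with a = a+b; on a primitive long.)

loops.AddLongStructs(LongStruct a, LongStruct b)
    L0000: mov eax, 0x3b9aca00
    L0005: add rdx, r8            # a += b;   other insns are identical
    L0008: dec eax
    L000a: test eax, eax
    L000c: jg L0005
    L000e: ret

But if we change it to LongStruct.Add(a, b); (not assigning the result anywhere), we get L0006: add rdx, r8 outside the loop (hoisting a + b), and then L0009: mov rcx, rdx / L000c: mov [rsp], rcx inside the loop. (A register copy followed by a store to dead scratch space, totally insane.) In C# (unlike C/C++), writing a+b; on its own as a statement is an error, so we can't see whether a primitive equivalent would still produce stupid wasted instructions. The error is: Only assignment, call, increment, decrement, await, and new object expressions can be used as a statement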

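I don't think we can blame any of these missed optimizations on structs per se. But even if you benchmarked this with and without the add in the loop, it wouldn't cause an actual slowdown in that loop on modern x86. The empty loop hits the 1/clock loop throughput bottleneck with only 2 uops in the loop (dec and macro-fused test/jg), leaving room for 2 more uops without slowing down, as long as they don't introduce any bottleneck worse than 1/clock (https://agner.org/optimize/). For example, imul edx, r8d with 3 cycle latency would slow the loop down by a factor of 3. The "4 uops" front-end throughput assumes a recent Intel. Bulldozer-family is narrower, Ryzen is 5-wide.

These are non-static member functions of a class (for no reason, but I didn't notice right away so I'm not changing it now). In the asm calling convention, the first arg (RCX) is the this pointer, and args 2 and 3 are the explicit args of the member function (RDX and R8).

The JIT code generator puts an extra test eax,eax after dec eax, which has already set FLAGS (except for CF, which we are not testing) according to i - 1. The starting point is a positive compile-time constant; any C compiler would optimize this to dec eax / jnz. I think dec eax / jg would also work, falling through when dec produces zero because 1 > 1 is false.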
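The claim about the loop exit condition can be checked in source form. The sketch below is an illustration with hypothetical names: it counts the iterations of a countdown that exits on i > 0 (the shape the JIT emitted with test / jg) and of one that exits on i != 0 (the dec / jnz shape). It assumes a positive starting value, as in the benchmark.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class CountdownExit
{
    // do { --i; } while (i > 0);   corresponds to dec / test / jg
    public static long IterationsWithGreaterThanZero(int start)
    {
        long iterations = 0;
        var i = start;
        do
        {
            iterations++;
            --i;
        } while (i > 0);
        return iterations;
    }

    // do { --i; } while (i != 0);  corresponds to dec / jnz
    public static long IterationsWithNotZero(int start)
    {
        long iterations = 0;
        var i = start;
        do
        {
            iterations++;
            --i;
        } while (i != 0);
        return iterations;
    }
}
```

For any start of at least 1, both variants run exactly start iterations, which is why the cheaper exit test is a valid replacement when the initial value is a known positive constant.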


DoubleStruct vs. the calling convention

The calling convention C# uses on x86-64 passes 8-byte structs in integer registers, which sucks for a struct containing a double (because it has to be bounced to XMM registers for vaddsd or other FP operations). So there is an unavoidable disadvantage for your struct in non-inlined function calls.

### stand-alone versions of functions: not inlined into a loop

# with primitive double, args are passed in XMM regs
standalone.AddDoubles(Double, Double)
    L0000: vzeroupper
    L0003: vmovaps xmm0, xmm1               # stupid missed optimization defeating the purpose of AVX 3-operand instructions
    L0008: vaddsd xmm0, xmm0, xmm2          # vaddsd xmm0, xmm1, xmm2 would do retval = a + b
    L000d: ret

# without `in`.  Significantly less bad with `in`, see the link.
standalone.AddDoubleStructs(DoubleStruct a, DoubleStruct b)
    L0000: sub rsp, 0x18                    # reserve 24 bytes of stack space
    L0004: vzeroupper                       # Weird to use this in a function that doesn't have any YMM vectors...
    L0007: mov [rsp+0x28], rdx              # spill args 2 (rdx=double a) and 3 (r8=double b) to the stack.
    L000c: mov [rsp+0x30], r8               # (first arg = rcx = unused this pointer)
    L0011: mov rax, [rsp+0x28]
    L0016: mov [rsp+0x10], rax              # copy a to another place on the stack!
    L001b: mov rax, [rsp+0x30]
    L0020: mov [rsp+0x8], rax               # copy b to another place on the stack!
    L0025: vmovsd xmm0, qword [rsp+0x10]
    L002c: vaddsd xmm0, xmm0, [rsp+0x8]     # add a and b in the SSE/AVX FPU
    L0033: vmovsd [rsp], xmm0               # store the result to yet another stack location
    L0039: mov rax, [rsp]                   # reload it into RAX, the return value
    L003d: add rsp, 0x18
    L0041: ret

This is totally insane. This is release-mode code-gen, but the compiler stores the structs to memory, then reloads and stores them again before actually loading them into the FPU. (I'm guessing the int->int copy might be a constructor, but I have no idea. I normally look at C/C++ compiler output, which usually isn't this dumb in optimized builds.)

Using in on the function args avoids the extra copy of each input to a second stack location, but it still moves them from integer to XMM registers with a store/reload.

That is what gcc does for int->xmm with default tuning, but it is a missed optimization. Agner Fog says (in his microarchitecture guide) that AMD's optimization manual suggests store/reload when tuning for Bulldozer, but he found it isn't faster even on AMD. (There, ALU int->xmm has ~10 cycle latency, vs. 2 to 3 cycles on Intel or Ryzen, with 1/clock throughput, the same as stores.)

A good implementation of this function (if we are stuck with the calling convention) would be vmovq xmm0, rdx / vmovq xmm1, r8, then vaddsd, then vmovq rax, xmm0 / ret.
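As a conceptual model of that register shuffle, the sketch below treats the two operands as raw 64-bit integer patterns (the way an 8-byte struct arrives in RDX and R8), reinterprets them as doubles, adds them, and reinterprets the sum back into an integer pattern for the return value. The class and method names are hypothetical, and this is not a claim about what the JIT emits for BitConverter calls; it only mirrors the int -> FP -> int data flow in source form.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class StructAsIntegerBits
{
    public static long AddDoubleBits(long aBits, long bBits)
    {
        // Integer pattern -> double (the role of vmovq xmm0, rdx / vmovq xmm1, r8).
        double a = BitConverter.Int64BitsToDouble(aBits);
        double b = BitConverter.Int64BitsToDouble(bBits);

        // The actual FP work (the role of vaddsd).
        double sum = a + b;

        // Double -> integer pattern for the return register (the role of vmovq rax, xmm0).
        return BitConverter.DoubleToInt64Bits(sum);
    }
}
```

For example, passing the bit patterns of 1.5 and 2.25 returns the bit pattern of 3.75. The point of the sketch is that the only unavoidable extra work is the reinterpretation at the boundary, not a round trip through memory.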


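After inlining into a loop

Primitive double optimizes similarly to long:

  • primitive: double c = a + b; optimizes away completely
  • a = a + b (as @canton7 used) still does not, even though the result is still unused. This will bottleneck on vaddsd latency (3 to 5 cycles, depending on Bulldozer vs. Ryzen vs. Intel pre-Skylake vs. Skylake), but it does stay in registers.

loops.AddDoubles(Double, Double)
    L0000: vzeroupper
    L0003: mov eax, 0x3b9aca00
                                        # do {
    L0008: vaddsd xmm1, xmm1, xmm2        # a += b
    L000d: dec eax                        # --i
    L000f: test eax, eax
    L0011: jg L0008                     # }while(i>0);

    L0013: ret

Inlining the struct version

All the store/reload overhead should go away after inlining the function into the loop; that is a big part of the point of inlining. Surprisingly, it does not optimize away. The 2x store/reload is on the critical path of the loop-carried data dependency chain (the FP add)! This is a huge missed optimization.

Store/reload latency is about 5 or 6 cycles on modern Intel, slower than an FP add. a takes a load/store on the way into XMM0, and then again on the way back out.

loops.AddDoubleStructs(DoubleStruct, DoubleStruct)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: mov [rsp+0x28], rdx              # spill function args: a
    L000c: mov [rsp+0x30], r8               # and b
    L0011: mov eax, 0x3b9aca00              # i = init
                                            # do {
    L0016: mov rdx, [rsp+0x28]
    L001b: mov [rsp+0x10], rdx              #   tmp_a = copy a to another local
    L0020: mov rdx, [rsp+0x30]
    L0025: mov [rsp+0x8], rdx               #   tmp_b = copy b
    L002a: vmovsd xmm0, qword [rsp+0x10]    #   tmp_a
    L0031: vaddsd xmm0, xmm0, [rsp+0x8]     #   + tmp_b
    L0038: vmovsd [rsp], xmm0               #   tmp_a = sum
    L003e: mov rdx, [rsp]
    L0042: mov [rsp+0x28], rdx              #   a = copy tmp_a
    L0047: dec eax                          #   --i;
    L0049: test eax, eax
    L004b: jg L0016                         # }while(i>0)
    L004d: add rsp, 0x18
    L0051: ret

The primitive double loop optimizes to a simple loop that keeps everything in registers, with no clever optimizations that would violate strict FP, i.e. it is not turned into a multiply, and it does not use multiple accumulators to hide FP add latency. (But we know from the long version that the compiler wouldn't do any better anyway.) It does all the adds as one long dependency chain, so one vaddsd per 3 cycles (Broadwell or earlier, Ryzen) or per 4 cycles (Skylake).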
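The reason turning the loop into a multiply would violate strict FP can be shown with a small sketch: unlike integer addition, each double addition rounds, so repeated addition and a + b * n can give different results. The class and method names below are hypothetical, and the example values (adding 0.1 ten times) are chosen here for illustration.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class StrictFpDemo
{
    // Repeated addition: rounds after every step, like the vaddsd chain in the loop.
    public static double AddInLoop(double a, double b, int n)
    {
        for (var i = n; i > 0; --i)
        {
            a += b;
        }
        return a;
    }

    // The "clever" closed form: one multiply and one add, with different rounding.
    public static double ClosedForm(double a, double b, int n) => a + b * n;
}
```

With a = 0.0, b = 0.1 and n = 10, the closed form yields exactly 1.0 while the loop accumulates rounding error and ends up slightly below 1.0. A compiler that has to preserve the results of the source program therefore cannot substitute one for the other, which leaves the loop bound by the latency of the dependent additions.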