Question

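I've noticed that a struct wrapping a single float is significantly slower than using a float directly, with roughly half the performance.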

using System;
using System.Diagnostics;

struct Vector1 {

    public float X;

    public Vector1(float x) {
        X = x;
    }

    public static Vector1 operator +(Vector1 a, Vector1 b) {
        a.X = a.X + b.X;
        return a;
    }
}

However, after adding an additional "extra" field, some magic seems to happen and performance becomes more reasonable again:

struct Vector1Magic {

    public float X;
    private bool magic;

    public Vector1Magic(float x) {
        X = x;
        magic = true;
    }

    public static Vector1Magic operator +(Vector1Magic a, Vector1Magic b) {
        a.X = a.X + b.X;
        return a;
    }
}

The code I used to benchmark these is as follows:

class Program {
    static void Main(string[] args) {
        int iterationCount = 1000000000;
        var sw = new Stopwatch();
        sw.Start();
        var total = 0.0f;
        for (int i = 0; i < iterationCount; i++) {
            var v = (float) i;
            total = total + v;
        }
        sw.Stop();
        Console.WriteLine("Float time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("total = {0}", total);
        sw.Reset();
        sw.Start();
        var totalV = new Vector1(0.0f);
        for (int i = 0; i < iterationCount; i++) {
            var v = new Vector1(i);
            totalV += v;
        }
        sw.Stop();
        Console.WriteLine("Vector1 time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("totalV = {0}", totalV);
        sw.Reset();
        sw.Start();
        var totalVm = new Vector1Magic(0.0f);
        for (int i = 0; i < iterationCount; i++) {
            var vm = new Vector1Magic(i);
            totalVm += vm;
        }
        sw.Stop();
        Console.WriteLine("Vector1Magic time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("totalVm = {0}", totalVm);
        Console.Read();
    }
}

Benchmark results:

Float time was 00:00:02.2444910 for 1000000000 iterations.
Vector1 time was 00:00:04.4490656 for 1000000000 iterations.
Vector1Magic time was 00:00:02.2262701 for 1000000000 iterations.

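Compiler/environment settings:
OS: Windows 10 64-bit
Toolchain: VS2017
Framework: .NET 4.6.2
Target: Any CPU, Prefer 32-bit

If 64-bit is set as the target, the results are more predictable, but significantly worse than what we see with Vector1Magic on the 32-bit target: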

Float time was 00:00:00.6800014 for 1000000000 iterations.
Vector1 time was 00:00:04.4572642 for 1000000000 iterations.
Vector1Magic time was 00:00:05.7806399 for 1000000000 iterations.

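For the real wizards, I've included a dump of the IL here: https://pastebin.com/sz2QLGEx

Further investigation indicates that this seems to be specific to the Windows runtime, as the Mono compiler produces the same IL.

On the Mono runtime, both struct variants are roughly 2x slower than the raw float. This is quite different from the performance we see on .NET.

What's going on here?

*Note: this question originally included a flawed benchmarking procedure (thanks to Max Payne for pointing this out) and has been updated to reflect the timings more accurately.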

Answer 1

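The JIT has an optimization known as "struct promotion", where it can effectively replace a struct local or argument with multiple locals, one for each of the struct's fields.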
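To make the idea concrete, here is a minimal hand-written sketch of what promotion amounts to. The names Wrapped, SumWithStruct and SumPromotedByHand are made up for illustration, and this is only a source-level analogy: the JIT does this on its internal representation, not by rewriting your C#.

```csharp
using System;

// Hypothetical single-field wrapper, similar in shape to Vector1 in the question.
struct Wrapped
{
    public float X;
    public Wrapped(float x) { X = x; }
}

static class PromotionSketch
{
    // As written in source: the accumulator is a struct local,
    // and every update goes through its field.
    public static float SumWithStruct(int n)
    {
        var total = new Wrapped(0.0f);
        for (int i = 0; i < n; i++)
        {
            total.X = total.X + (float)i;
        }
        return total.X;
    }

    // Roughly what a promoted version behaves like: the field has become
    // an ordinary float local that can live in a register.
    public static float SumPromotedByHand(int n)
    {
        float totalX = 0.0f;
        for (int i = 0; i < n; i++)
        {
            totalX = totalX + (float)i;
        }
        return totalX;
    }

    static void Main()
    {
        // Both compute the same value; only the shape of the local differs.
        Console.WriteLine(SumWithStruct(1000));
        Console.WriteLine(SumPromotedByHand(1000));
    }
}
```

When promotion kicks in, the first method should end up with code close to the second. When it doesn't, the struct stays in a stack slot and each iteration pays for loads and stores.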

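However, struct promotion is disabled for a struct that wraps a single float. The reasons are somewhat obscure, but roughly:

Structs that simply wrap a primitive type are treated as struct-sized integer values when passed to or returned from calls.
During promotion analysis, the JIT cannot tell whether the struct will ever be passed to or returned from a call.
The code sequences needed to reclassify from int to float at calls (and vice versa) were thought to be expensive at runtime.
Hence the struct is not promoted, so accesses to and operations on the float field are somewhat slower.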
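If "treated as a struct-sized integer value" sounds abstract, the sketch below shows what the same 4 bytes look like under each classification. The helper names FloatBitsAsInt and IntBitsAsFloat are invented for illustration, and this is not what the JIT emits; it only demonstrates that a float and an int of the same size can carry identical bits with no numeric conversion involved.

```csharp
using System;

static class BitsSketch
{
    // Reinterpret the 4 bytes of a float as an int (bit copy, not a numeric cast).
    public static int FloatBitsAsInt(float value)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
    }

    // The inverse: reinterpret the 4 bytes of an int as a float.
    public static float IntBitsAsFloat(int bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }

    static void Main()
    {
        int bits = FloatBitsAsInt(1.0f);
        Console.WriteLine("0x{0:X8}", bits);     // prints 0x3F800000
        Console.WriteLine(IntBitsAsFloat(bits)); // prints 1
    }
}
```

At a call boundary the value has to move between the integer view and the float view, and that move is the cost the JIT was trying to avoid.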

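So, roughly speaking, the JIT is prioritizing lower costs at call sites over better code at the places where the field is used. Sometimes (as in your case above, where the cost of the operations dominates) this is not the right call.

As you have seen, if you make the struct larger, the rules for passing and returning the struct change (it is now passed and returned by reference), and this unblocks promotion.

In the CoreCLR sources you can see this logic at play in Compiler::lvaShouldPromoteStructVar.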

Answer 2

This should not happen. It is obviously some kind of misalignment that prevents the JIT from working the way it should.

struct Vector1 //Works fast in 32 Bit 
{
    public double X;
}

struct Vector1 //Works fast in 64 Bit and 32 Bit
{
    public double X;
    public double X2;
}

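You also have to call Console.WriteLine(total); which increases the time to exactly the Vector1Magic time, which makes sense. The question still remains why Vector1 is so slow.

Maybe structs are not optimized for sizeof(foo) < 64 bits in 64-bit mode.

It seems this was answered 7 years ago: Why is 16 byte the recommended size for struct in C#?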
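If you want to check the sizes involved, here is a quick sketch. The struct names are placeholders that mirror the shapes discussed in this thread. Note that Marshal.SizeOf reports the marshaled (unmanaged) size, which I am assuming is a good enough proxy here; in particular a bool field marshals as 4 bytes by default, so the managed layout may differ in detail.

```csharp
using System;
using System.Runtime.InteropServices;

// Placeholder structs mirroring the shapes discussed above.
struct FloatOnly { public float X; }

struct FloatAndBool
{
    public float X;
    private bool magic;
    public FloatAndBool(float x) { X = x; magic = true; }
}

struct DoubleOnly { public double X; }

struct TwoDoubles { public double X; public double X2; }

static class SizeSketch
{
    // Marshaled size in bytes of a struct type.
    public static int SizeOf<T>() where T : struct
    {
        return Marshal.SizeOf<T>();
    }

    static void Main()
    {
        Console.WriteLine("FloatOnly:    {0}", SizeOf<FloatOnly>());    // 4
        Console.WriteLine("FloatAndBool: {0}", SizeOf<FloatAndBool>()); // 8
        Console.WriteLine("DoubleOnly:   {0}", SizeOf<DoubleOnly>());   // 8
        Console.WriteLine("TwoDoubles:   {0}", SizeOf<TwoDoubles>());   // 16
    }
}
```

Size alone does not prove anything about what the JIT does with each struct, but it does show which of these fit in a single register on each target.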

Answer 3

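The CIL code is (practically) identical. But the x86 assembly code is not.

I think this is a peculiarity of the JIT compiler's optimizations.

The compiler generates the following assembly code for Vector1.

C# (with the corresponding x86 assembly in comments):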

var totalV = new Vector1(0.0f);
/*
01300576  fldz  
01300578  fstp        dword ptr [ebp-14h] 
*/
for (int i = 0; i < iterationCount; i++)
{
   var v = new Vector1(i);
   /*
   0130057D  mov         dword ptr [ebp-4Ch],ecx ; ecx - is index "i"
   01300580  fild        dword ptr [ebp-4Ch]
   01300583  fstp        dword ptr [ebp-4Ch]  
   01300586  fld         dword ptr [ebp-4Ch]
   */
   totalV += v;
   /*
   01300589  lea         eax,[ebp-14h]  
   0130058C  mov         eax,dword ptr [eax]  
   0130058E  lea         edx,[ebp-18h]  
   01300591  mov         dword ptr [edx],eax  
   01300593  fadd        dword ptr [ebp-18h]  
   01300596  fstp        dword ptr [ebp-18h]  
   01300599  mov         eax,dword ptr [ebp-18h]  
   0130059C  mov         dword ptr [ebp-14h],eax  
   */
}

The compiler generates the following assembly code for Vector1Magic.

C# (with the corresponding x86 assembly in comments):

var totalVm = new Vector1Magic(0.0f);
/*
01300657  mov         byte ptr [ebp-20h],1  ; here's assignment "magic=true"
0130065B  fldz  
0130065D  fstp        dword ptr [ebp-1Ch]
*/
for (int i = 0; i < iterationCount; i++)
{
    var vm = new Vector1Magic(i);
    /*
    01300662  mov         dword ptr [ebp-4Ch],edx ; edx - is index "i"
    01300665  fild        dword ptr [ebp-4Ch]  
    01300668  fstp        dword ptr [ebp-4Ch]  
    0130066B  fld         dword ptr [ebp-4Ch]  
    */
    totalVm += vm;
    /*
    0130066E  movzx       ecx,byte ptr [ebp-20h] ; here's some work with "unused" magic field
    01300672  fld         dword ptr [ebp-1Ch]  
    01300675  faddp       st(1),st  
    01300677  fstp        dword ptr [ebp-1Ch]  
    0130067A  mov         byte ptr [ebp-20h],cl  ; here's some work with "unused" magic field
    */
}

Evidently, these asm blocks are what affect performance:

;Vector1
01300589  lea         eax,[ebp-14h]  
0130058C  mov         eax,dword ptr [eax]  
0130058E  lea         edx,[ebp-18h]  
01300591  mov         dword ptr [edx],eax  
01300593  fadd        dword ptr [ebp-18h]  
01300596  fstp        dword ptr [ebp-18h]  
01300599  mov         eax,dword ptr [ebp-18h]  
0130059C  mov         dword ptr [ebp-14h],eax  

;Vector1Magic
0130066E  movzx       ecx,byte ptr [ebp-20h] ; here's some work with "unused" magic field
01300672  fld         dword ptr [ebp-1Ch]  
01300675  faddp       st(1),st  
01300677  fstp        dword ptr [ebp-1Ch]  
0130067A  mov         byte ptr [ebp-20h],cl  ; here's some work with "unused" magic field

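The JIT compiler handles operations on structs with one field differently from structs with several fields. It probably expects Vector1Magic operations to touch all fields (including the "unused" one).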

Why does adding an extra field to a struct greatly improve its performance?
