Question

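I already knew that setting a field is much slower than setting a local variable, but it also appears that setting a field with a local variable is much slower than setting a local variable with a field. Why is this? In either case the address of the field is used.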

public class Test
{
    public int A = 0;
    public int B = 4;

    public void Method1() // Set local with field
    {
        int a = A;

        for (int i = 0; i < 100; i++)
        {
            a += B;
        }

        A = a;
    }

    public void Method2() // Set field with local
    {
        int b = B;

        for (int i = 0; i < 100; i++)
        {
            A += b;
        }
    }
}

Benchmark results with 10e+6 iterations are:

Method1: 28.1321 ms
Method2: 162.4528 ms

Answer 1

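Running this on my machine, I get a similar time difference. Looking at the JITted code for 10M iterations, it is clear why this happens:

Method A (Method1 in the question):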

mov     r8,rcx
; "A" is loaded into eax
mov     eax,dword ptr [r8+8]
xor     edx,edx
; "B" is loaded into ecx
mov     ecx,dword ptr [r8+0Ch]
nop     dword ptr [rax]
loop_start:
; Partially unrolled loop, all additions done in registers
add     eax,ecx
add     eax,ecx
add     eax,ecx
add     eax,ecx
add     edx,4
cmp     edx,989680h
jl      loop_start
; Store the sum in eax back to "A"
mov     dword ptr [r8+8],eax
ret

Method B (Method2 in the question):

; "B" is loaded into edx
mov     edx,dword ptr [rcx+0Ch]
xor     r8d,r8d
nop word ptr [rax+rax]
loop_start:
; Partially unrolled loop, but each iteration requires reading "A" from memory
; adding "B" to it, and then writing the new "A" back to memory.
mov     eax,dword ptr [rcx+8]
add     eax,edx
mov     dword ptr [rcx+8],eax
mov     eax,dword ptr [rcx+8]
add     eax,edx
mov     dword ptr [rcx+8],eax
mov     eax,dword ptr [rcx+8]
add     eax,edx
mov     dword ptr [rcx+8],eax
mov     eax,dword ptr [rcx+8]
add     eax,edx
mov     dword ptr [rcx+8],eax
add     r8d,4
cmp     r8d,989680h
jl      loop_start
rep ret

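As you can see from the assembly, Method A is going to be significantly faster, since the values of A and B are both held in registers and all of the additions happen there with no intermediate writes to memory. Method B, on the other hand, incurs a load and a store of "A" in memory for every single addition.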
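To make the difference concrete, here is a minimal sketch of how Method2 could be reshaped so the accumulation stays in a local and the field is stored only once, which is the shape the JIT kept in registers for Method A. The `iterations` parameter, the `Method2Hoisted` name, and the `Stopwatch` harness are assumptions added for illustration; they are not part of the original post, and actual timings will depend on the runtime and hardware.

```csharp
using System;
using System.Diagnostics;

public class Test
{
    public int A = 0;
    public int B = 4;

    // Same shape as Method2 in the question: the field A is the accumulator,
    // so it is read and written on every pass through the loop.
    public void Method2(int iterations)
    {
        int b = B;

        for (int i = 0; i < iterations; i++)
        {
            A += b;
        }
    }

    // Hypothetical variant: accumulate in a local and store to the field once.
    // Other threads would no longer see intermediate values of A.
    public void Method2Hoisted(int iterations)
    {
        int b = B;
        int a = A;

        for (int i = 0; i < iterations; i++)
        {
            a += b;
        }

        A = a;
    }
}

public static class Program
{
    public static void Main()
    {
        const int iterations = 10000000; // assumed count, matching the 10M in the answer

        var direct = new Test();
        var sw = Stopwatch.StartNew();
        direct.Method2(iterations);
        sw.Stop();
        Console.WriteLine("Method2:        " + sw.Elapsed.TotalMilliseconds + " ms");

        var hoisted = new Test();
        sw.Restart();
        hoisted.Method2Hoisted(iterations);
        sw.Stop();
        Console.WriteLine("Method2Hoisted: " + sw.Elapsed.TotalMilliseconds + " ms");

        // Both variants must end with the same value in A.
        Console.WriteLine(direct.A == hoisted.A);
    }
}
```

Both variants compute the same final value of A on a single thread; the only intended difference is where the running sum lives while the loop executes. A single `Stopwatch` run like this is only a rough indication, so a release build and repeated runs would be needed for numbers worth comparing.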

Answer 2

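In case 1, a is clearly stored in a register. Anything else would be a horrible compilation result.

In case 2, the .NET JIT is probably not willing or able to convert the stores to A into register stores.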

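I doubt this is forced by the .NET memory model, because other threads could never tell your two methods apart if they only ever observed A to be 0 or the final sum. They could not disprove the theory that the optimization never happened. That makes the optimization permissible under the semantics of the .NET abstract machine.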

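It is not surprising to see the .NET JIT perform few optimizations. This is well known to followers of the performance tag on Stack Overflow.

As far as I know, the JIT is much more likely to cache memory loads in registers than memory stores. That is why case 1 (apparently) does not access B on each iteration.

Register computations are cheaper than memory accesses. This is true even if the memory in question is in the CPU L1 cache (as is the case here).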

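"I thought only locals are eligible for CPU caching?"

That cannot be the case, because the CPU does not even know what a local is. All addresses look the same to it.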

Answer 3

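Method2: the field is read ~100x and set ~100x = 200x ldarg.0 (this) + 100x ldfld (load field) + 100x stfld (set field) + 100x ldloc (local)

Method1: the field is read 100x but never set inside the loop. That is equivalent to Method2 minus 100x ldarg.0 (this), with a store to a local (stloc) in place of stfld.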

Why is setting a field many times slower than getting a field?
