Why is calling masm from C# so slow?

Asked: 2017-09-07 21:38:55

Tags: c# masm slowdown

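Here is a 64-bit masm routine that multiplies two 64-bit numbers and adds a third 64-bit number, producing a 128-bit result (using the standard 64-bit calling convention):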

; public static extern ulong MulAdd64(ulong U, ulong V, ref ulong k);
; Return  (U*V + k) % ß  and set  k = (U*V + k) / ß.
;  U in rcx, V in rdx, &k in r8
; Note  0  <=  0*0 + 0  <=  (ß-1)*(ß-1) + (ß-1) = ß*(ß-1)  <  ß^2
MulAdd64  proc  public
        mov     rax,rcx
        mul     rdx
        add     rax,qword ptr [r8]      ; low part of product
        adc     rdx,0
        mov     qword ptr [r8],rdx      ; high part of product
        ret
MulAdd64  endp

It is imported into the C# code like this:

    [DllImport(@"C:\path\MulAdd64.dll")]
    public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);

Here is the same function written in C#, together with the test program:

public static void TestCS_Masm_Speed ()
{
    ulong x = 3141592653589793238, y = 2718281828459045, aux = 1234567890123456789, lo = 0;
    // just in case the first invocation is excessively slow
    //lo = MulAdd64(x, y, ref aux);
    lo = CS_MulAdd64(x, y, ref aux);
    Stopwatch sw = new Stopwatch();
    sw.Restart();
    for (int i = 0; i < 10000000; i++) {
        //lo = MulAdd64(x, y, ref aux);
        lo = CS_MulAdd64(x, y, ref aux);
    }
    Console.WriteLine("Ticks = {0}", sw.ElapsedTicks);
    // verify low-order results  hi:lo = x*y + aux;
    if (x*y+aux != lo) Console.WriteLine("Error in low order result");
    else Console.WriteLine("Low order result is OK");
}

[DllImport(@"C:\path\MulAdd64.dll")]
public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);

/*
    Multiplication. We need to multiply two unsigned 64-bit integers, obtaining an
    unsigned 128-bit product. Using Algorithm 4.3.1M of Seminumerical Algorithms,
    with ß = 2^32, the following subroutine computes hi:lo = y*z + aux.  It then
    sets aux to hi and returns lo.
*/
public static ulong CS_MulAdd64 (ulong y, ulong z, ref ulong aux)
{
    ulong[] u = new ulong[2],  v = new ulong[2],  w = new ulong[4];
    // Unpack the multiplier, multiplicand, and aux  to  u, v, and w 
    u[1] = (ulong)y >> 32;      u[0] = (ulong)y & 0xFFFFFFFF;
    v[1] = (ulong)z >> 32;      v[0] = (ulong)z & 0xFFFFFFFF;
    w[1] = (ulong)aux >> 32;    w[0] = (ulong)aux & 0xFFFFFFFF;
    // Multiply
    for (int j = 0; j < 2; j++) {
        ulong k = 0;
        for (int i = 0; i < 2; i++) {
            ulong t = u[i] * v[j] + w[i+j] + k;
            w[i+j] = t & 0xFFFFFFFF; k = t >> 32;
        }
        w[j+2] = k;
    }
    // Pack w into the output aux (high half) and the return value (low half)
    aux = ((ulong)w[3] << 32) + (ulong)w[2];
    return ((ulong)w[1] << 32) + (ulong)w[0];
}   

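The optimized C# code is considerably longer than the masm code (153 instructions versus 6), yet it runs almost twice as fast (941694 ticks vs 1722289 ticks)! How can that be? Everything is passed in registers, and there is no memory to pin! Clearly something happens between the call in C# and the execution in masm, but what? I cannot step into that code.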

1 answer:

Answer 0 (score: 0)

A Google search will give you a better idea of the overhead involved in managed/native transitions. Here are a few articles:

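In a case like yours, the cost of adapting the calling convention and making the other adjustments a native call requires can be higher than the cost of the call itself.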

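Typically, if you do enough of these multiplications, you would batch them together so that the transition overhead is paid once per batch rather than once per multiplication. You would run some tests to find the best batch size. If you are really serious about performance, though, you would also look at things like cache size, parallel processing...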
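To make the batching idea concrete, here is a minimal sketch, assuming .NET 5 or later for the `Math.BigMul(ulong, ulong, out ulong)` overload. The native export `MulAddArray64` is hypothetical: the original DLL only exposes `MulAdd64`, so such a routine would have to be written. The managed method shows what a batched routine would compute (a multi-word number times a single word, with the carry propagated), which is one common way these multiply-add steps get chained.

```csharp
using System;
using System.Runtime.InteropServices;

public static class BatchSketch
{
    // HYPOTHETICAL native export, not present in the original MulAdd64.dll.
    // The whole loop would run in native code, so the managed/native
    // transition is paid once per array instead of once per element.
    [DllImport(@"C:\path\MulAdd64.dll")]
    public static extern ulong MulAddArray64(
        [In] ulong[] u, int n, ulong v, [In, Out] ulong[] w, ulong k);

    // Managed reference for the same computation:
    // w[i] = low 64 bits of u[i]*v + carry; the new carry is the high 64 bits.
    // Returns the final carry.
    public static ulong MulAddArrayManaged(ulong[] u, ulong v, ulong[] w, ulong k)
    {
        for (int i = 0; i < u.Length; i++)
        {
            ulong hi = Math.BigMul(u[i], v, out ulong lo); // 128-bit product hi:lo
            ulong sum = unchecked(lo + k);                 // add carry, wraps mod 2^64
            if (sum < lo) hi++;                            // carry out of the low word
            w[i] = sum;
            k = hi;
        }
        return k;
    }
}
```

Timing the native version against the managed one for several array lengths would show where, if anywhere, the native loop starts to win.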

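I believe the C# compiler (together with the JIT) can optimize your code reasonably well, ordering the instructions at the CPU level so that almost no cycles are wasted. It may be able to use SIMD or other tricks to optimize the code. Failing that, you could write a more efficient multiplication in C#.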
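As one possible illustration of a leaner managed version, the sketch below avoids the three heap-allocated arrays of `CS_MulAdd64`. It assumes .NET 5 or later, where `Math.BigMul` has a `ulong` overload that returns the high half of the product; that API did not exist when the question was asked. The class name `MulAddSketch` is made up for this example, and whether it actually beats the original on a given runtime would need measuring.

```csharp
using System;

public static class MulAddSketch
{
    // Returns the low 64 bits of u*v + k and stores the high 64 bits back in k,
    // matching the contract of the masm routine.
    public static ulong MulAdd64(ulong u, ulong v, ref ulong k)
    {
        ulong hi = Math.BigMul(u, v, out ulong lo); // 128-bit product as hi:lo
        ulong sum = unchecked(lo + k);              // wraps modulo 2^64
        if (sum < lo) hi++;                         // propagate the carry; cannot
                                                    // overflow since u*v + k < 2^128
        k = hi;
        return sum;
    }
}
```

A quick check with the extreme case from the masm comment, (ß-1)*(ß-1) + (ß-1) = ß*(ß-1): calling it with all three inputs equal to `ulong.MaxValue` should return 0 and leave `k` equal to `ulong.MaxValue`.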

By the way, why write your own code when Windows already provides a function for this kind of multiplication: the UnsignedMultiply128 function

Have you compared the performance of BigInteger?
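For that comparison, a `BigInteger` version of the same operation might look like the sketch below. It is only a baseline to drop into the question's timing loop, not a recommendation; the class name `BigIntegerSketch` is invented for the example.

```csharp
using System.Numerics;

public static class BigIntegerSketch
{
    // Same contract as MulAdd64: returns the low 64 bits of u*v + k
    // and stores the high 64 bits back in k.
    public static ulong MulAdd64(ulong u, ulong v, ref ulong k)
    {
        BigInteger t = (BigInteger)u * v + k;  // exact, arbitrary-precision result
        k = (ulong)(t >> 64);                  // high word (always fits: t < 2^128)
        return (ulong)(t & ulong.MaxValue);    // low word
    }
}
```

Since `BigInteger` is a general arbitrary-precision type, it would be reasonable to expect it to be slower than fixed-width arithmetic here, but only a measurement can confirm that.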