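Here is a 64-bit MASM routine that multiplies two 64-bit numbers and adds a third 64-bit number, producing a 128-bit result (using the standard x64 calling convention):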
; public static extern ulong MulAdd64(ulong U, ulong V, ref ulong k);
; Return (U*V + k) % ß and set k = (U*V + k) / ß.
; U in rcx, V in rdx, &k in r8
; Note 0 <= 0*0 + 0 <= (ß-1)*(ß-1) + (ß-1) = ß*(ß-1) < ß^2
MulAdd64 proc public
mov rax,rcx
mul rdx
add rax,qword ptr [r8] ; low part of product
adc rdx,0
mov qword ptr [r8],rdx ; high part of product
ret
MulAdd64 endp
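The bound in the routine's comment is what guarantees that the result fits in 128 bits, so `adc rdx,0` can never overflow rdx. Spelled out, assuming ß = 2^64 (the word size the routine works with; the comment itself leaves ß unstated) and 0 <= U, V, k <= ß-1:

```latex
\[
0 \;\le\; UV + k \;\le\; (\beta-1)^2 + (\beta-1)
  \;=\; (\beta-1)\bigl((\beta-1)+1\bigr)
  \;=\; \beta(\beta-1) \;<\; \beta^2,
\qquad \beta = 2^{64}.
\]
```

Since UV + k < ß^2, the high word after the carry is at most ß-1 and still fits in a single 64-bit register.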
It is imported into the C# code like this:
[DllImport(@"C:\path\MulAdd64.dll")]
public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);
And here is the same function written in C#, along with a test program:
public static void TestCS_Masm_Speed ()
{
ulong x = 3141592653589793238, y = 2718281828459045, aux = 1234567890123456789, lo = 0;
// just in case the first invocation is excessively slow
//lo = MulAdd64(x, y, ref aux);
lo = CS_MulAdd64(x, y, ref aux);
Stopwatch sw = new Stopwatch();
sw.Restart();
for (int i = 0; i < 10000000; i++) {
//lo = MulAdd64(x, y, ref aux);
lo = CS_MulAdd64(x, y, ref aux);
}
Console.WriteLine("Ticks = {0}", sw.ElapsedTicks);
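// verify the low-order result of one more call: hi:lo = x*y + aux
// (each call overwrites aux with hi, so save its value before calling)
ulong auxBefore = aux;
lo = CS_MulAdd64(x, y, ref aux);
if (unchecked(x*y + auxBefore) != lo) Console.WriteLine("Error in low order result");
else Console.WriteLine("Low order result is OK");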
}
[DllImport(@"C:\path\MulAdd64.dll")]
public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);
/*
Multiplication. We need to multiply two unsigned 64-bit integers, obtaining an
unsigned 128-bit product. Using Algorithm 4.3.1M of Seminumerical Algorithms,
with ß = 2^32, the following subroutine computes hi:lo = y*z + aux. It then
sets aux to hi and returns lo.
*/
public static ulong CS_MulAdd64 (ulong y, ulong z, ref ulong aux)
{
ulong[] u = new ulong[2], v = new ulong[2], w = new ulong[4];
// Unpack the multiplier, multiplicand, and aux to u, v, and w
u[1] = (ulong)y >> 32; u[0] = (ulong)y & 0xFFFFFFFF;
v[1] = (ulong)z >> 32; v[0] = (ulong)z & 0xFFFFFFFF;
w[1] = (ulong)aux >> 32; w[0] = (ulong)aux & 0xFFFFFFFF;
// Multiply
for (int j = 0; j < 2; j++) {
ulong k = 0;
for (int i = 0; i < 2; i++) {
ulong t = u[i] * v[j] + w[i+j] + k;
w[i+j] = t & 0xFFFFFFFF; k = t >> 32;
}
w[j+2] = k;
}
// Pack w into the output aux (high half) and the return value (low half)
aux = ((ulong)w[3] << 32) + (ulong)w[2];
return ((ulong)w[1] << 32) + (ulong)w[0];
}
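The optimized C# code is considerably longer than the MASM code (153 instructions versus 6), yet it runs almost twice as fast (941694 ticks vs. 1722289 ticks)! How can that be? Everything is passed in registers and there is no memory to pin! Clearly something happens between the call in C# and the execution in MASM, but what? I can't step into that code.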
Answer 0 (score: 0)
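A quick Google search will give you a better idea of the overhead associated with managed/native transitions. Here are a few articles:
In a case like yours, the overhead of adapting the calling convention and making the other required adjustments is probably higher than the cost of the call itself.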
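Generally, if you have enough of these multiplications to do, you can batch them together to minimize the overhead. You would run some tests to find the optimal batch size. However, if you are really serious about performance, you would also consider things like cache size, parallel processing...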
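To picture what batching means here: instead of one call per multiply-add, a single call processes a whole array of operands. The sketch below shows the shape of such a routine in managed C#. The name `MulAddRow` and its signature are hypothetical, and `Math.BigMul(ulong, ulong, out ulong)` assumes .NET 5 or later. A native export with the same array-based shape would pay the managed/native transition once per array rather than once per multiplication.

```csharp
using System;

public static class BatchedMulAdd
{
    // Hypothetical batched routine: for each limb u[i], computes
    // u[i]*v + carry, stores the low 64 bits in w[i], and carries the
    // high 64 bits into the next step. Returns the final carry.
    public static ulong MulAddRow(ReadOnlySpan<ulong> u, ulong v, Span<ulong> w, ulong carry)
    {
        if (w.Length < u.Length)
            throw new ArgumentException("w is too short", nameof(w));

        for (int i = 0; i < u.Length; i++)
        {
            // Full 64x64 -> 128-bit product: hi is returned, lo via out.
            ulong hi = Math.BigMul(u[i], v, out ulong lo);
            ulong sum = unchecked(lo + carry);
            if (sum < lo) hi++;   // carry out of the low word; cannot overflow hi
            w[i] = sum;
            carry = hi;
        }
        return carry;
    }
}
```

Calling `MulAddRow(u, v, w, 0)` on an n-limb array performs n multiply-adds in one call, which is the unit you would vary when testing for the best batch size.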
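I believe the C# compiler can optimize your code reasonably well, so that at the CPU level the instructions are ordered properly and almost no cycles are wasted. It may be able to use SIMD or other tricks to optimize the code. Otherwise, you could write a more efficient multiplication in C#.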
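As one possible take on a more efficient multiplication in C#, the sketch below replaces the array-based 32-bit limb loop with `Math.BigMul(ulong, ulong, out ulong)`, which returns the high 64 bits of the product and writes the low 64 bits to its `out` argument. This assumes .NET 5 or later, where that overload exists; the class name `BigMulExample` is made up for illustration, and no claim is made here about how it compares in speed.

```csharp
using System;

public static class BigMulExample
{
    // Returns the low 64 bits of u*v + k and stores the high 64 bits in k,
    // matching the contract of the MulAdd64 routine in the question.
    public static ulong MulAdd64(ulong u, ulong v, ref ulong k)
    {
        ulong hi = Math.BigMul(u, v, out ulong lo);
        ulong sum = unchecked(lo + k);
        if (sum < lo) hi++;   // propagate the carry; u*v + k < 2^128, so hi cannot overflow
        k = hi;
        return sum;
    }
}
```

It avoids the three heap allocations per call that the array-based version makes, and stays entirely in managed code, so there is no P/Invoke transition at all.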
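By the way, why write your own code when Windows already provides a function for this kind of multiplication: the UnsignedMultiply128 function?
Have you compared the performance against BigInteger?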
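For the BigInteger comparison, a minimal sketch of the same operation using `System.Numerics.BigInteger` could look like this. The class name `BigIntegerReference` is hypothetical. It is likely to be most useful as a correctness reference for the other implementations, since it checks both the high and low halves; how it performs would have to be measured.

```csharp
using System.Numerics;

public static class BigIntegerReference
{
    // Computes u*v + k exactly, stores the high 64 bits in k,
    // and returns the low 64 bits.
    public static ulong MulAdd64(ulong u, ulong v, ref ulong k)
    {
        BigInteger r = (BigInteger)u * v + k;
        k = (ulong)(r >> 64);                 // high half, always below 2^64
        return (ulong)(r & ulong.MaxValue);   // low half
    }
}
```

Dropping this into the timing loop from the question in place of `CS_MulAdd64` gives the third data point.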