Question

I have a function similar to the following:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void SetVariable<T>(T newValue) where T : struct {
    // I know by this point that T is blittable (i.e. only unmanaged value types)

    // varPtr is a void*, and is where I want to copy newValue to
    *varPtr = newValue; // This won't work, but is basically what I want to do
}

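I've looked at Marshal.StructureToPtr(), but it seems quite slow, and this is performance-sensitive code. If I knew the type T I could just declare varPtr as a T*, but... well, I don't.

Either way, I'm after the fastest possible way to do this. 'Safety' is not a concern: by this point in the code, I know that the size of the struct T will fit exactly into the memory pointed to by varPtr.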

Answer 1

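One answer is to reimplement native memcpy in C#, using the same optimization tricks that native memcpy attempts. You can see Microsoft doing this in their own source. See the Buffer.cs file in the Microsoft Reference Source: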

     // This is tricky to get right AND fast, so lets make it useful for the whole Fx.
     // E.g. System.Runtime.WindowsRuntime!WindowsRuntimeBufferExtensions.MemCopy uses it.
     internal unsafe static void Memcpy(byte* dest, byte* src, int len) {

        // This is portable version of memcpy. It mirrors what the hand optimized assembly versions of memcpy typically do.
        // Ideally, we would just use the cpblk IL instruction here. Unfortunately, cpblk IL instruction is not as efficient as
        // possible yet and so we have this implementation here for now.

        switch (len)
        {
        case 0:
            return;
        case 1:
            *dest = *src;
            return;
        case 2:
            *(short *)dest = *(short *)src;
            return;
        case 3:
            *(short *)dest = *(short *)src;
            *(dest + 2) = *(src + 2);
            return;
        case 4:
            *(int *)dest = *(int *)src;
            return;
        ...

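Interestingly, they implement memcpy themselves for all lengths below 512 bytes; most sizes use pointer aliasing tricks to get the VM to emit instructions that operate on different operand sizes. Only at 512 bytes and above do they finally drop into the native memcpy: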

        // P/Invoke into the native version for large lengths
        if (len >= 512)
        {
            _Memcpy(dest, src, len);
            return;
        }

Presumably, the native memcpy is faster still, since it can be hand-optimized to use SSE/MMX instructions to perform the copy.
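To make the pointer aliasing idea concrete, here is a minimal sketch. It is not the Buffer.cs implementation; the class name ChunkedCopy and its Copy method are hypothetical. It assumes a platform that tolerates unaligned loads and stores (as x86/x64 does) and must be compiled with unsafe code enabled.

```csharp
using System;

public static unsafe class ChunkedCopy
{
    // Hypothetical helper: copies len bytes using 8-byte moves while possible,
    // then 4-, 2- and 1-byte moves for whatever remains.
    // Assumes len >= 0 and that unaligned access is acceptable on the target CPU.
    public static void Copy(byte* dest, byte* src, int len)
    {
        // Reinterpreting byte* as long* makes each assignment move 8 bytes.
        while (len >= 8)
        {
            *(long*)dest = *(long*)src;
            dest += 8;
            src += 8;
            len -= 8;
        }

        // At this point len is in 0..7, so its bits select the tail moves.
        if ((len & 4) != 0)
        {
            *(int*)dest = *(int*)src;
            dest += 4;
            src += 4;
        }
        if ((len & 2) != 0)
        {
            *(short*)dest = *(short*)src;
            dest += 2;
            src += 2;
        }
        if ((len & 1) != 0)
        {
            *dest = *src;
        }
    }
}
```

The real Buffer.cs code goes further (a switch on small lengths, larger unrolled blocks), so treat this only as an illustration of the aliasing trick.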

Answer 2

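Following BenVoigt's suggestion, I tried a few options. For all of these tests I compiled a standard VS2013 Release build targeting Any CPU, and ran the tests outside the IDE. Before each test was measured, the methods DoTestA() and DoTestB() were run multiple times to allow for JIT warm-up.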
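The measurement harness itself is not shown in this answer. As a rough sketch of the warm-up-then-measure approach described above, something like the following could be used. The class MicroBenchmark and its method are hypothetical, and unlike the tables below it only reports an average, not min/max/jitter.

```csharp
using System;
using System.Diagnostics;

public static class MicroBenchmark
{
    // Hypothetical harness: runs the action `warmup` times untimed so the JIT
    // has compiled it, then returns the average nanoseconds per call over
    // `repetitions` timed calls.
    public static double AverageNanoseconds(Action action, int warmup, int repetitions)
    {
        for (int i = 0; i < warmup; i++)
        {
            action();
        }

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++)
        {
            action();
        }
        sw.Stop();

        // One Stopwatch tick lasts 1 / Stopwatch.Frequency seconds.
        double totalNs = sw.ElapsedTicks * (1e9 / Stopwatch.Frequency);
        return totalNs / repetitions;
    }
}
```

A call such as MicroBenchmark.AverageNanoseconds(DoTestA, 1000, 500000) would then give a figure comparable in spirit to the "Avg." column, with the delegate invocation overhead included.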

First, I compared Marshal.StructureToPtr against a byte-by-byte loop for various struct sizes. The code below is shown using SixtyFourByteStruct:

private unsafe static void DoTestA() {
    fixed (SixtyFourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(SixtyFourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private static void DoTestB() {
    Marshal.StructureToPtr(structToCopy, unmanagedTarget, false);
}
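The struct and the two fields the tests refer to are not included in the answer. A minimal sketch of what they might look like follows; the field FirstField and the class name TestFixture are assumptions made for illustration only.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the struct under test: Size = 64 pads the
// sequential layout out to exactly 64 bytes.
[StructLayout(LayoutKind.Sequential, Size = 64)]
public struct SixtyFourByteStruct
{
    public long FirstField;
}

public static class TestFixture
{
    // Source value; a static field, which is why the tests pin it with `fixed`.
    public static SixtyFourByteStruct structToCopy;

    // Unmanaged destination buffer, large enough to hold one struct.
    // In real code it should eventually be released with Marshal.FreeHGlobal.
    public static readonly IntPtr unmanagedTarget =
        Marshal.AllocHGlobal(Marshal.SizeOf(typeof(SixtyFourByteStruct)));
}
```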

Results:

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        82ns         0ns          22,000ns     21,917ns     ! 41.017ms
B        137ns        0ns          38,700ns     38,562ns     ! 68.834ms

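As you can see, the manual loop is faster (as I suspected). The results are similar for 16-byte and 4-byte structs, with the difference becoming more pronounced the smaller the struct.

Next, manual copying versus P/Invoking memcpy: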

private unsafe static void DoTestA() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(FourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private unsafe static void DoTestB() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        memcpy(unmanagedTarget, (IntPtr) fixedStruct, new UIntPtr((uint) sizeof(FourByteStruct)));
    }
}

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        61ns         0ns          28,000ns     27,938ns     ! 30.736ms
B        84ns         0ns          45,900ns     45,815ns     ! 42.216ms

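So in my case the manual copy still seems to be better. As before, the results were very similar for 4/16/64-byte structs (although the gap was less than 10ns at 64 bytes).

It then occurred to me that I was only testing structs that fit in a cache line (I have a standard x86_64 CPU). So I tried a 128-byte struct, and that swung the balance in favour of memcpy: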

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        104ns        0ns          48,300ns     48,195ns     ! 52.150ms
B        84ns         0ns          38,400ns     38,315ns     ! 42.284ms

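Anyway, the conclusion is that on my machine's x86_64 CPU, byte-by-byte copying seems to be the fastest for any struct of size <= 64 bytes. Take it as you will (and perhaps someone will spot an inefficiency in my code).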

Answer 3

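FYI: I'm posting how to make use of the accepted answer for the benefit of others, because there is a twist when accessing the method via reflection: it is overloaded.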

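using System.Linq;
using System.Reflection;

// Wraps the internal System.Buffer.Memcpy(byte*, byte*, int) overload shown in the accepted answer.
public static class Buffer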
{
    public unsafe delegate void MemcpyDelegate(byte* dest, byte* src, int len);

    public static readonly MemcpyDelegate Memcpy;
    static Buffer()
    {
        var methods = typeof (System.Buffer).GetMethods(BindingFlags.Static | BindingFlags.NonPublic).Where(m=>m.Name == "Memcpy");
        var memcpy = methods.First(mi => mi.GetParameters().Select(p => p.ParameterType).SequenceEqual(new[] {typeof (byte*), typeof (byte*), typeof (int)}));
        Memcpy = (MemcpyDelegate) memcpy.CreateDelegate(typeof (MemcpyDelegate));
    }
}

Usage:

public static unsafe void MemcpyExample()
{
     int src = 12345;
     int dst = 0;
     Buffer.Memcpy((byte*) &dst, (byte*) &src, sizeof (int));
     System.Diagnostics.Debug.Assert(dst==12345);
}

Answer 4

   public void SetVariable<T>(T newValue) where T : struct

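You cannot use generics to do this the fast way. The compiler doesn't take your pretty blue eyes as a guarantee that T is actually blittable; the constraint isn't good enough. You should use overloads: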

    public unsafe void SetVariable(int newValue) {
        *(int*)varPtr = newValue;
    }
    public unsafe void SetVariable(double newValue) {
        *(double*)varPtr = newValue;
    }
    public unsafe void SetVariable(Point newValue) {
        *(Point*)varPtr = newValue;
    }
    // etc...

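This may be inconvenient, but it is fast. It compiles to a single MOV instruction, with no method call overhead in a Release build. It is probably as fast as it could be.

And for the fallback case, the profiler will tell you when you need another overload: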

    public unsafe void SetVariable<T>(T newValue) {
        Marshal.StructureToPtr(newValue, (IntPtr)varPtr, false);
    }
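For readers on newer toolchains than this answer assumes: C# 7.3 added the `unmanaged` generic constraint, which does let the compiler accept a T*. The following is only a sketch under that assumption; the class VariableSlot and its constructor are hypothetical, and it requires unsafe code to be enabled.

```csharp
using System;

public unsafe class VariableSlot
{
    // Hypothetical: points at caller-owned memory that is large enough for T.
    private readonly void* varPtr;

    public VariableSlot(void* target)
    {
        varPtr = target;
    }

    // `where T : unmanaged` (C# 7.3 or later) restricts T to types with no
    // managed references, so the cast to T* compiles.
    public void SetVariable<T>(T newValue) where T : unmanaged
    {
        *(T*)varPtr = newValue;
    }
}
```

Whether this matches the hand-written overloads in speed depends on the JIT and should be confirmed with a profiler rather than assumed.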

Fastest way to copy a blittable struct to an unmanaged memory location (IntPtr)
