Question

I'm testing CUDAfy with a small gravity simulation and after running a profiler on the code I see that most of the time is spent on the CopyFromDevice method of the GPU. Here's the code:

    private void WithGPU(float dt)
    {
        this.myGpu.CopyToDevice(this.myBodies, this.myGpuBodies);
        this.myGpu.Launch(1024, 1, "MoveBodies", -1, dt, this.myGpuBodies);
        this.myGpu.CopyFromDevice(this.myGpuBodies, this.myBodies);
    }

Just to clarify, this.myBodies is an array with 10,000 structs like the following:

    [Cudafy(eCudafyType.Struct)]
    [StructLayout(LayoutKind.Sequential)]
    internal struct Body
    {
        public float Mass;

        public Vector Position;

        public Vector Speed;
    }

And Vector is a struct with two floats X and Y.

According to my profiler, the average timings for those three lines are 0.092, 0.192 and 222.873 ms respectively. These timings were taken on Windows 7 with an NVIDIA NVS 310.

Is there a way to improve the time of the CopyFromDevice() method?

Thank you

Answer 1

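CUDA kernel launches are asynchronous. This means that immediately after launching the kernel, the CPU thread is released to run the code that follows the launch, while the kernel is still executing.

If that subsequent code contains any sort of CUDA execution barrier, the CPU thread will stall at the barrier until kernel execution is complete. In CUDA, both cudaMemcpy (the operation underlying the cudafy CopyFromDevice method) and cudaDeviceSynchronize (the operation underlying the cudafy Synchronize method) contain execution barriers.

Therefore, from the point of view of the host code, a barrier placed immediately after a kernel launch will appear to halt CPU thread execution for the duration of the kernel execution.

For this reason, the time attributed to the barrier in this example (the CopyFromDevice call) includes the kernel execution time as well as the time to copy the data. You can use the Synchronize barrier method immediately after the kernel launch to disambiguate the timings reported by profiling the host code.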
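As a rough sketch of that suggestion, the method from the question could be instrumented as below. It assumes the same fields as the question (this.myGpu, this.myBodies, this.myGpuBodies) and a `using System.Diagnostics;` directive for Stopwatch. The method name WithGPUTimed and the console output are illustrative only and are not part of the original code.

```csharp
// Hypothetical timing variant of WithGPU; requires "using System.Diagnostics;".
private void WithGPUTimed(float dt)
{
    var sw = Stopwatch.StartNew();
    this.myGpu.CopyToDevice(this.myBodies, this.myGpuBodies);
    double copyToMs = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    this.myGpu.Launch(1024, 1, "MoveBodies", -1, dt, this.myGpuBodies);
    // Launch returns before the kernel finishes, so block here;
    // the elapsed time then covers the kernel execution itself.
    this.myGpu.Synchronize();
    double kernelMs = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    // The kernel has already completed, so this now measures only the copy.
    this.myGpu.CopyFromDevice(this.myGpuBodies, this.myBodies);
    double copyFromMs = sw.Elapsed.TotalMilliseconds;

    Console.WriteLine("CopyToDevice: {0:F3} ms, kernel: {1:F3} ms, CopyFromDevice: {2:F3} ms",
        copyToMs, kernelMs, copyFromMs);
}
```

With the extra Synchronize call, most of the time previously attributed to CopyFromDevice would be expected to show up under the kernel instead. The total time stays about the same, since the added barrier only changes where the wait is reported.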

CUDAfy CopyFromDevice several orders of magnitude slower than CopyToDevice
