Question

I have code like this:

for(int i =0; i<2; i++)
{
    //initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image); //function that calls kernel
}

Each iteration of the loop above is independent of the others, and I want to run them concurrently. So I tried this:

for(int i =0; i<num_devices; i++)
{
    cudaSetDevice(i);
    //initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image); // its body, expanded in the block below, calls the kernels
    {
        RunBasicFBP_CUDA(parameters); //function that calls kernel 1

        xSegmentMetal(parameters); //CPU function

        RunBasicFP_CUDA(parameters);  //function that uses output of kernel 1 as input for kernel 2

        for (int idx_view = 0; idx_view < param.fbp.num_view; idx_view++)
        {
            for (int idx_bin = 1; idx_bin < param.fbp.num_bin-1; idx_bin++)
            {
                sino_diff[idx_view][idx_bin] = sino_org[idx_view][idx_bin] - sino_mask[idx_view][idx_bin];
            }
        }

        RunBasicFP_CUDA(parameters);
        if(some condition)
        {
            xInterpolateSinoLinear(parameters);  //CPU function
        }
        else
        {
            xInterpolateSinoPoly(parameters);  //CPU function
        }

        RunBasicFBP_CUDA( parameters );
    }
}

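I am using two GTX 680 cards and want to use both devices at the same time. With the code above I get no speedup: the processing time is almost the same as when running on a single GPU.

How can I achieve concurrent execution on the two available devices?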

Answer 1

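In the comments you said:

"RunDll has two kernels, which are launched one after the other. The kernels do have cudaThreadSynchronize()."

Note that cudaThreadSynchronize() is equivalent to cudaDeviceSynchronize() (the former is in fact deprecated). This means you run on one GPU, synchronize, and only then run on the other GPU, so the two devices never work at the same time. Note also that cudaMemcpy() is a blocking routine; you need the cudaMemcpyAsync() version to avoid all blocking (as @JackOLantern pointed out in the comments).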
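The answer's suggestion is to remove the blocking calls. A different, commonly used arrangement (not proposed in the answer) is to give each device its own host thread, so that a blocking call made for one GPU does not hold up the other. Below is a minimal host-side sketch of that structure. `run_per_device` and the `work` callback are hypothetical names introduced here, and the CUDA calls appear only in comments so that the sketch builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Runs work(device) once per device, each call on its own host thread,
// and returns only after every thread has finished.
// In a real CUDA program, work would start with cudaSetDevice(device) and
// then do the per-iteration setup and the RunDll(...) call. Those calls are
// omitted here (assumption: they are safe to run from separate threads).
void run_per_device(int num_devices, const std::function<void(int)>& work)
{
    std::vector<std::thread> workers;
    workers.reserve(num_devices);
    for (int device = 0; device < num_devices; ++device) {
        // Each thread gets its own copy of the callback and its device index.
        workers.emplace_back(work, device);
    }
    for (std::thread& t : workers) {
        t.join();  // wait for all devices before returning
    }
}
```

A call such as `run_per_device(num_devices, process_one_image)` would replace the serial for loop, where `process_one_image` is a placeholder for the loop body. With this layout the CPU-only steps of the two iterations can also overlap, provided the host has a spare core.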

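More generally, you need to post more details of what happens inside RunDLL(), because without them your question does not contain enough information for a definitive answer. Ideally, follow these guidelines.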

Answer 2

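In my answer to your previous post (Concurrently running two for loops with same number of loop cycles involving GPU and CPU tasks on two GPU), I pointed out that you will not reach a speedup of 2 when using 2 GPUs.

To explain why, consider the following code snippet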

Kernel1<<<...,...>>>(...); // assume Kernel1 takes t1 seconds

// assume the two cudaMemcpy calls plus CPUFunction take tCPU seconds in total
cudaMemcpy(...,...,...,cudaMemcpyDeviceToHost); // copy the results of Kernel1 to the CPU
CPUFunction(...); // CPU work on the output of Kernel1
cudaMemcpy(...,...,...,cudaMemcpyHostToDevice); // copy the CPU results back to the GPU for Kernel2

Kernel2<<<...,...>>>(...); // assume it takes t2 seconds

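It makes no difference whether the synchronization comes from cudaDeviceSynchronize() or from cudaMemcpy.

The cost of executing the above snippet in a for loop (two iterations) on a single GPU is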

t1 + tCPU + t2 + t1 + tCPU + t2 = 2t1 + 2tCPU + 2t2

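With 2 GPUs, if you could achieve perfectly concurrent execution of Kernel1 and of Kernel2 on the two different GPUs, the cost of executing the above snippet would be

t1 (Kernel1 executing concurrently on both GPUs) + 2 * tCPU (you need two calls to the CPU function, each on a different instance of Kernel1's output) + t2 (Kernel2 executing concurrently on both GPUs)

Accordingly, the speedup achieved by using two GPUs instead of one is

(2 * (t1 + tCPU + t2)) / (t1 + 2 * tCPU + t2)

When tCPU equals zero, the speedup becomes 2. For any tCPU greater than zero it is strictly less than 2, and it approaches 1 as tCPU comes to dominate t1 + t2.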

This is an expression of Amdahl's law.
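To see how quickly the CPU portion erodes the gain, the cost model above can be written as a few small functions. This is only a sketch of the answer's formula under its own assumptions (perfect kernel overlap, serialized CPU phases); the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Time for two iterations on one GPU: every phase runs back to back.
double time_one_gpu(double t1, double t_cpu, double t2)
{
    return 2.0 * (t1 + t_cpu + t2);
}

// Time for two iterations on two GPUs under the answer's assumptions:
// Kernel1 and Kernel2 overlap perfectly, the two CPU phases do not.
double time_two_gpus(double t1, double t_cpu, double t2)
{
    return t1 + 2.0 * t_cpu + t2;
}

// Ratio of the two costs, i.e. the speedup from adding the second GPU.
double speedup(double t1, double t_cpu, double t2)
{
    return time_one_gpu(t1, t_cpu, t2) / time_two_gpus(t1, t_cpu, t2);
}
```

For example, with t1 = t2 = 1 and tCPU = 0, `speedup` returns 2. With t1 = t2 = 1 and tCPU = 2 it returns 8/6, roughly 1.33, even though the kernels themselves overlap perfectly.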

