Despite asynchronous streams, CUDA seems to block

Date: 2016-12-22 15:52:24

Tags: asynchronous cuda blocking cuda-streams

I am trying to process a video stream in real time with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)

Every frame has to be captured and lightly blurred, and whenever I can, I need to run some heavy computation on the 10 most recent frames. So I need to capture ALL frames at 30 fps, and I expect the heavy work to complete at about 5 fps.

My problem is that I cannot keep the capture running at the right pace: it looks like the heavy computation slows down the capture of frames, either at the CPU level or at the GPU level. I miss some frames...

I tried many solutions. None worked:

  1. I tried to set up the jobs on 2 streams (image below):
    • The host gets a frame
    • First stream (called Stream2): cudaMemcpyAsync copies the frame onto the device. Then a first kernel does the basic blur computation. (On the attached image, the blur appears as short slots at 3.07 s and 3.085 s. Then nothing... until the heavy part has finished.)
    • The host checks whether the second stream is available thanks to a CudaEvent, and launches it if possible. Practically, the stream is available only 1 try out of 2.
    • Second stream (called Stream4): it starts the heavy computation in a kernel (kernelCalcul_W2), copies the result back, and records an event.
    (image: NSight capture)

    In fact, I wrote:

    cudaStream_t  sHigh, sLow;
    cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
    cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
    
    cudaEvent_t event_1;
    cudaEventCreate(&event_1);
    
    if (frame has arrived)
    {
        cudaMemcpyAsync(..., sHigh);        // HtoD, to upload images in the GPU
        blur_Image <<<... , sHigh>>> (...)
        if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sLow);
        else printf("Event 2 not ready\n");
    }
    
    void hard_work( cudaStream_t sLow_)
    {
        kernelCalcul_W2<<<... , sLow_>>> (...);
        cudaMemcpyAsync(... the result..., sLow_); //DtoH
        cudaEventRecord(event_1, sLow_);    
    }
    
    2. I tried to use only ONE stream. It is the same code as above, but changing one parameter when launching hard_work.
      • The host gets a frame
      • Stream: cudaMemcpyAsync copies the frame onto the device. Then the kernel does the basic blur computation. Then, if the CudaEvent Event_1 is OK, I launch the hard work and record a new Event_1 to get the status for the next round. Practically, the stream is ALWAYS available: I never fall into the "else" part.
      This way, while the hard work is running, I expected to "buffer" all the frames to copy and not lose any. But I do lose some: it turns out that every time I get a frame and copy it, Event_1 seems OK, so I launch the hard work, and I only get to the next frame very late.

    3. I tried to put the two streams in two different threads (in C). It was not better (even worse).

So the question is: how can I make sure that the first stream captures ALL frames? I really have the feeling that the different streams block the CPU.

I display the images with OpenGL. Could that interfere?

Any ideas on how to improve this? Thanks a lot!

EDIT: As requested, here is an MCVE.

You can adjust one parameter (#define ADJUST) to see what happens. Basically, the main program sends CUDA requests in async mode, but it seems to block the main thread. As you will see in the image, I have a "memory access" (i.e. an image captured) every 30 ms, except when the hard work is running (then, I just don't get images).

One last detail: I use CUDA 7.5 to run this. I tried to install 8.0, but apparently the compiler is still 7.5.

        #define _USE_MATH_DEFINES 1
        #define _CRT_SECURE_NO_WARNINGS 1
        
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <Windows.h>
        
        #define ADJUST  400
        // adjusting this parameter may make the problem occur.
        // Too high => the watchdog will probably stop the kernel
        // too low => the kernel will probably run smoothly
        
        unsigned short * images_as_Unsigned_in_Host;
        unsigned short * Images_as_Unsigned_in_Device;
        unsigned short * camera;
        float * images_as_Output_in_Host;
        float *  Images_as_Float_in_Device;
        float * imageOutput_in_Device;
        
        unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
        unsigned long imagePixelSize;
        unsigned short lastImageFromCamera;
        
        
        cudaStream_t  s1, s2;
        cudaEvent_t event_2;
        clock_t timeRef;
        
        // Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
        // This kernel runs fast, and that's the point.
        __global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_, 
            unsigned long  imagePixelSize_, short blur_distance)
        {
            // we start from 'blur_distance' from the edge
            // p0 is the point we will calculate. p is a pointer which will move around for average
            unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
            unsigned long p = p0;
            unsigned short * us;
            if (p >= imagePixelSize_) return;
            unsigned long tot = 0;
            short a, b, n, k;
            k = 0;
            // p starts from the top edge and will move to the right-bottom
            p -= blur_distance + blur_distance * imageWidth_;
            us = Images_as_Unsigned_in_Device_ + p;
            for (a = 2 * blur_distance; a >= 0; a--)
            {
                for (b = 2 * blur_distance; b >= 0; b--)
                {
                    n = *us;
                    if (n > 0) { tot += n; k++; }
                    us++;
                }
                us += imageWidth_ - 2 * blur_distance - 1;
            }
            if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
            else Images_as_Float_in_Device_[p0] = 128.f;
        }
        
        
        __global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long  imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
        {
            // point the pixel and crunch it
            unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
            if (p >= imagePixelSize_)   { return; }
            float result = 0.f;  // accumulator must start at zero
            long a, n, n0;
            float input;
        
            // this is not the right algorithm (which is pretty complex). 
            // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
            for (n = 0; n < 10; n++)
            {
                n0 = slot - n;
                if (n0 < 0) n0 += totImages;
                input = inputImage[p + n0 * imagePixelSize_]; 
                for (a = 0; a < ADJUST ; a++)
                        result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
            }
            outputImage[p] = result;
        }
        
        
        void hard_work( cudaStream_t s){
        
            cudaError err;
            // launch the hard work
            printf("Hard work is launched after image %d is captured  ==> ", imageSlot);
            kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
            err = cudaPeekAtLastError();
            if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
            else printf("running ok\n");
        
            // copy the result back to Host
            //printf(" %p  %p  \n", images_as_Output_in_Host, imageOutput_in_Device);
            cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) *  imagePixelSize, cudaMemcpyDeviceToHost, s);
            cudaEventRecord(event_2, s);
        }
        
        
        void createStorageSpace()
        {
            imageWidth = 640;
            imageHeight = 480;
            totNbOfImages = 300;
            imageSlot = 0;
            imagePixelSize = 640 * 480;
            lastImageFromCamera = 0;
        
            camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
            for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
            // storing the images in the Host memory. I know I could optimize with cudaHostAllocate.
            images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
            images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
        
            cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
            cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
        
            cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
        
        
        
            int priority_high, priority_low;
            cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
            cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
            cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
            cudaEventCreate(&event_2);
        
        }
        
        void releaseMapFile()
        {
            cudaFree(Images_as_Unsigned_in_Device);
            cudaFree(Images_as_Float_in_Device);
            cudaFree(imageOutput_in_Device);
            free(images_as_Output_in_Host);
            free(camera);
        
            cudaStreamDestroy(s1);
            cudaStreamDestroy(s2);
            cudaEventDestroy(event_2);
        }
        
        void putImageCUDA(const void * data)
        {       
            // We put the image in a round-robin. The slot to put the image is imageSlot
            printf("\nDealing with image %d\n", imageSlot);
            // Copy the image in the Round Robin
            cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) *  imagePixelSize, cudaMemcpyHostToDevice, s1);
        
            // We will blur the image. Let's prepare the memory to get the results as floats
            cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) *  imagePixelSize, s1);
        
            // blur image
            blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                        Images_as_Float_in_Device + imageSlot * imagePixelSize,
                        imageWidth, imagePixelSize, 3);
        
        
            // launches the hard-work
            if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
            else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
        
            imageSlot++;
            if (imageSlot >= totNbOfImages) {
                imageSlot = 0;
            }
        }
        
        int main()
        {
            createStorageSpace();
            printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
        
            for (int i = 0; i < 10; i++)
            {
                putImageCUDA(camera);  // Puts an image in the GPU, does the bluring, and tries to do the hard-work
                Sleep(30);  // to simulate Camera
            }
            releaseMapFile();
            getchar();
        }
        

1 Answer:

Answer 0 (score: 1)

The main issue here is that cudaMemcpyAsync is a properly non-blocking asynchronous operation only if the host memory involved is pinned, i.e. allocated with cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.
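To illustrate the distinction on its own, here is a minimal sketch (independent of the question's code; the buffer names and sizes are made up for the example):

    // Sketch: pageable vs. pinned host buffers with cudaMemcpyAsync.
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main()
    {
        const size_t N = 640 * 480;

        // Pageable allocation: cudaMemcpyAsync on this buffer may block the
        // host until the copy is staged, defeating the "async" intent.
        unsigned short *pageable = (unsigned short *)malloc(N * sizeof(unsigned short));

        // Page-locked allocation: the copy can be handed to the DMA engine
        // and the call returns immediately.
        unsigned short *pinned;
        cudaHostAlloc(&pinned, N * sizeof(unsigned short), cudaHostAllocDefault);

        unsigned short *dev;
        cudaMalloc(&dev, N * sizeof(unsigned short));

        cudaStream_t s;
        cudaStreamCreate(&s);

        // May block the host despite the "Async" name:
        cudaMemcpyAsync(dev, pageable, N * sizeof(unsigned short), cudaMemcpyHostToDevice, s);

        // Queues the copy and returns at once, so the host can grab the next frame:
        cudaMemcpyAsync(dev, pinned, N * sizeof(unsigned short), cudaMemcpyHostToDevice, s);

        cudaStreamSynchronize(s);

        cudaFree(dev);
        cudaFreeHost(pinned);   // pinned memory is released with cudaFreeHost, not free
        free(pageable);
        cudaStreamDestroy(s);
        return 0;
    }

Note that pinned memory is a limited resource: pinning very large buffers can degrade overall system performance, so pin only the buffers that actually participate in async transfers.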

The following modification of your code (run on Linux, which I prefer) demonstrates the difference in behavior:

$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ADJUST  400
// adjusting this parameter may make the problem occur.
// Too high => the watchdog will probably stop the kernel
// too low => the kernel will probably run smoothly

unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float *  Images_as_Float_in_Device;
float * imageOutput_in_Device;

unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;


cudaStream_t  s1, s2;
cudaEvent_t event_2;
clock_t timeRef;

// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
    unsigned long  imagePixelSize_, short blur_distance)
{
    // we start from 'blur_distance' from the edge
    // p0 is the point we will calculate. p is a pointer which will move around for average
    unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
    unsigned long p = p0;
    unsigned short * us;
    if (p >= imagePixelSize_) return;
    unsigned long tot = 0;
    short a, b, n, k;
    k = 0;
    // p starts from the top edge and will move to the right-bottom
    p -= blur_distance + blur_distance * imageWidth_;
    us = Images_as_Unsigned_in_Device_ + p;
    for (a = 2 * blur_distance; a >= 0; a--)
    {
        for (b = 2 * blur_distance; b >= 0; b--)
        {
            n = *us;
            if (n > 0) { tot += n; k++; }
            us++;
        }
        us += imageWidth_ - 2 * blur_distance - 1;
    }
    if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
    else Images_as_Float_in_Device_[p0] = 128.f;
}


__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long  imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
    // point the pixel and crunch it
    unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
    if (p >= imagePixelSize_)   { return; }
    float result = 0.f;  // accumulator must start at zero
    long a, n, n0;
    float input;

    // this is not the right algorithm (which is pretty complex).
    // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
    for (n = 0; n < 10; n++)
    {
        n0 = slot - n;
        if (n0 < 0) n0 += totImages;
        input = inputImage[p + n0 * imagePixelSize_];
        for (a = 0; a < ADJUST ; a++)
                result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
    }
    outputImage[p] = result;
}


void hard_work( cudaStream_t s){
#ifndef QUICK
    cudaError err;
    // launch the hard work
    printf("Hard work is launched after image %d is captured  ==> ", imageSlot);
    kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
    err = cudaPeekAtLastError();
    if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
    else printf("running ok\n");

    // copy the result back to Host
    //printf(" %p  %p  \n", images_as_Output_in_Host, imageOutput_in_Device);
    cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) *  imagePixelSize/2, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(event_2, s);
#endif
}


void createStorageSpace()
{
    imageWidth = 640;
    imageHeight = 480;
    totNbOfImages = 300;
    imageSlot = 0;
    imagePixelSize = 640 * 480;
    lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
    cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
    cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
    cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
#else
    camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
    images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
    for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
    cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);

    cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));



    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
    cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
    cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
    cudaEventCreate(&event_2);
    cudaEventRecord(event_2, s2);
}

void releaseMapFile()
{
    cudaFree(Images_as_Unsigned_in_Device);
    cudaFree(Images_as_Float_in_Device);
    cudaFree(imageOutput_in_Device);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(event_2);
}

void putImageCUDA(const void * data)
{
    // We put the image in a round-robin. The slot to put the image is imageSlot
    printf("\nDealing with image %d\n", imageSlot);
    // Copy the image in the Round Robin
    cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) *  imagePixelSize, cudaMemcpyHostToDevice, s1);

    // We will blur the image. Let's prepare the memory to get the results as floats
    cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) *  imagePixelSize, s1);

    // blur image
    blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                Images_as_Float_in_Device + imageSlot * imagePixelSize,
                imageWidth, imagePixelSize, 3);


    // launches the hard-work
    if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
    else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);

    imageSlot++;
    if (imageSlot >= totNbOfImages) {
        imageSlot = 0;
    }
}

int main()
{
    createStorageSpace();
    printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");

    for (int i = 0; i < 10; i++)
    {
        putImageCUDA(camera);  // Puts an image in the GPU, does the bluring, and tries to do the hard-work
        usleep(30000);  // to simulate Camera
    }
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
    releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured  ==> running ok

Dealing with image 1
Hard work is launched after image 1 is captured  ==> running ok

Dealing with image 2
Hard work is launched after image 2 is captured  ==> running ok

Dealing with image 3
Hard work is launched after image 3 is captured  ==> running ok

Dealing with image 4
Hard work is launched after image 4 is captured  ==> running ok

Dealing with image 5
Hard work is launched after image 5 is captured  ==> running ok

Dealing with image 6
Hard work is launched after image 6 is captured  ==> running ok

Dealing with image 7
Hard work is launched after image 7 is captured  ==> running ok

Dealing with image 8
Hard work is launched after image 8 is captured  ==> running ok

Dealing with image 9
Hard work is launched after image 9 is captured  ==> running ok

real    0m2.790s
user    0m0.688s
sys     0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured  ==> running ok

Dealing with image 1
Hard_work still running, so unable to process after image 1

Dealing with image 2
Hard_work still running, so unable to process after image 2

Dealing with image 3
Hard_work still running, so unable to process after image 3

Dealing with image 4
Hard_work still running, so unable to process after image 4

Dealing with image 5
Hard_work still running, so unable to process after image 5

Dealing with image 6
Hard_work still running, so unable to process after image 6

Dealing with image 7
Hard work is launched after image 7 is captured  ==> running ok

Dealing with image 8
Hard_work still running, so unable to process after image 8

Dealing with image 9
Hard_work still running, so unable to process after image 9

real    0m1.721s
user    0m0.028s
sys     0m0.629s
$

In the USE_HOST_ALLOC case above, the launch pattern of the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.

In short, if you want the expected behavior out of cudaMemcpyAsync, make sure all participating host allocations are page-locked.
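If changing the allocation is not convenient (for example, when a camera SDK owns the frame buffer), an existing pageable allocation can also be page-locked after the fact with cudaHostRegister. A sketch, with a hypothetical buffer name:

    // Sketch: pinning an already-allocated buffer with cudaHostRegister.
    // 'frame' stands in for a buffer allocated by other code.
    size_t bytes = 640 * 480 * sizeof(unsigned short);
    unsigned short *frame = (unsigned short *)malloc(bytes);

    // Page-lock the existing allocation so cudaMemcpyAsync on it is truly asynchronous.
    cudaHostRegister(frame, bytes, cudaHostRegisterDefault);

    // ... use 'frame' with cudaMemcpyAsync as usual ...

    cudaHostUnregister(frame);  // unregister before freeing
    free(frame);

Registration itself is an expensive operation, so it should be done once at startup, not per frame.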

A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.