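I am new to CUDA and am having some problems running the GPU version of Farneback optical flow. I recently started using the GPU version of Farneback optical flow in OpenCV for one of my action-recognition applications on video. I ran it on one of my sample videos, and computing the optical flow took about 12 seconds on an NVIDIA GeForce GPU with 96 cores.
However, when I ran the same code on a more advanced GPU (Titan X), which has about 3072 cores, it took almost the same time as on the GeForce GPU, and I don't know why. Could there be a flaw in my code, or does the function itself allocate a limited number of cores to the program regardless of the GPU? Is there a way to access the number of threads the function launches on the GPU, so that I can set it manually and speed up the code on the more advanced GPU? The CUDA file for Farneback optical flow has lines such as dim3 block(128) that set a block size for the GPU. If I increase the block size from 128 to 256, will my program run faster? I tried, but it failed with an error like OpenCV Error: Gpu API call (invalid device function) in call, file /home/aditya-vision/opencv-2.4.9/modules/gpu/include/opencv2/gpu/device/detail/transform_detail.hpp. Any suggestions that would help me solve this problem are appreciated.
I am also posting the nvprof output, which shows the bottlenecks in the code. Please base your suggestions on it.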
==10935== NVPROF is profiling process 10935, command: ./farneback_flow resizeVideo.avi
It took 14 second(s).
==10935== Profiling application: ./farneback_flow resizeVideo.avi
==10935== Profiling result:
Time(%) Time Calls Avg Min Max Name
62.38% 543.05ms 9870 55.020us 23.265us 263.63us cv::gpu::device::optflow_farneback::boxFilter5(int, int, cv::gpu::PtrStep<float>, int, float, cv::gpu::PtrStep<float>)
17.11% 148.94ms 9870 15.090us 5.9840us 58.978us cv::gpu::device::optflow_farneback::updateMatrices(int, int, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>,cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>)
5.12% 44.550ms 1974 22.568us 15.328us 41.090us void cv::gpu::device::optflow_farneback::gaussianBlur<cv::gpu::device::BrdReflect101<float>>(int, int, cv::gpu::PtrStep<float>, int, float, cv::gpu::PtrStep)
5.07% 44.137ms 9870 4.4710us 2.7200us 16.225us cv::gpu::device::optflow_farneback::updateFlow(int, int, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>, cv::gpu::PtrStep<float>)
3.72% 32.371ms 658 49.196us 48.898us 53.090us [CUDA memcpy DtoH]
3.05% 26.517ms 1974 13.433us 5.9200us 31.265us void cv::gpu::device::optflow_farneback::polynomialExpansion<int=5>(int, int, cv::gpu::PtrStep<float>, cv::gpu::PtrStep)
1.58% 13.744ms 4277 3.2130us 928ns 17.601us [CUDA memcpy HtoD]
0.96% 8.3349ms 2632 3.1660us 2.0800us 5.7610us void cv::gpu::device::resize_linear<float>(cv::gpu::PtrStepSz<float>, float, float, float)
0.38% 3.3082ms 658 5.0270us 4.4480us 6.4960us [CUDA memcpy DtoD]
0.36% 3.1376ms 1316 2.3840us 1.4400us 3.5840us void cv::gpu::device::transform_detail::transformSmart<float, float, cv::gpu::device::Convertor<float, float, float>, cv::gpu::device::WithOutMask>(cv::gpu::PtrStepSz<float>, cv::gpu::PtrStep<float>, float, float)
0.20% 1.7484ms 658 2.6570us 1.8880us 5.4720us void cv::gpu::device::transform_detail::transformSmart<unsigned char, float, cv::gpu::device::Convertor<unsigned char, float, float>, cv::gpu::device::WithOutMask>(cv::gpu::PtrStepSz<unsigned char>, cv::gpu::PtrStep<float>, float, unsigned char)
0.08% 707.22us 658 1.0740us 864ns 1.8240us [CUDA memset]
==10935== API calls:
Time(%) Time Calls Avg Min Max Name
92.17% 12.7227s 1342 9.4804ms 4.1420us 12.6725s cudaMallocPitch
3.96% 547.18ms 1342 407.74us 3.5120us 1.7912ms cudaFree
1.47% 202.25ms 38164 5.2990us 4.1290us 266.19us cudaLaunch
0.87% 120.66ms 41783 2.8870us 579ns 268.98us cudaStreamSynchronize
0.55% 75.686ms 1974 38.341us 6.9130us 113.53us cudaMemcpy2D
0.28% 38.270ms 215824 177ns 144ns 264.85us cudaSetupArgument
0.27% 36.972ms 1316 28.093us 9.7560us 44.813us cudaDeviceSynchronize
0.20% 27.309ms 3619 7.5460us 5.6390us 35.784us cudaMemcpyToSymbol
0.06% 8.2456ms 38164 216ns 159ns 264.80us cudaGetLastError
0.06% 7.6159ms 38164 199ns 158ns 229.11us cudaConfigureCall
0.05% 6.2155ms 658 9.4460us 5.0870us 36.748us cudaMemset2DAsync
0.04% 4.8575ms 9871 492ns 353ns 260.80us cudaGetDevice
0.01% 2.0688ms 1645 1.2570us 894ns 7.9890us cudaStreamDestroy
0.01% 1.6704ms 1645 1.0150us 493ns 133.13us cudaStreamCreate
0.01% 802.14us 83 9.6640us 590ns 349.30us cuDeviceGetAttribute
0.00% 645.00us 1 645.00us 645.00us 645.00us cudaGetDeviceProperties
0.00% 558.01us 3948 141ns 106ns 303ns cudaSetDoubleForDevice
0.00% 90.543us 1 90.543us 90.543us 90.543us cuDeviceTotalMem
0.00% 63.614us 1 63.614us 63.614us 63.614us cuDeviceGetName
0.00% 3.3740us 2 1.6870us 1.0400us 2.3340us cuDeviceGetCount
0.00% 1.6930us 2 846ns 773ns 920ns cuDeviceGet