Question

I have CUDA/C++ code that returns C++ host-side arrays. I want to manipulate these arrays in MATLAB, so I rewrote my code in MEX format and compiled it with mex.

I got it working by passing preallocated arrays from MATLAB to the MEX script, but this slows things down dramatically (54 s versus 14 s without MEX).

Here is the slow solution, as a simplified version of my code with no inputs and one output:

#include "mex.h"
#include "gpu/mxGPUArray.h"
#include "matrix.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "curand.h"
#include <cuda_runtime.h>
#include "math.h"
#include <curand_kernel.h>
#include <time.h>
#include <algorithm>
#include <iostream>

#define iterations 159744
#define transMatrixSize 2592 // Just for clarity. Do not change. No need to adjust this value for this simulation.
#define reps 1024 // Is equal to blocksize. Do not change without proper source code adjustments.
#define integralStep 13125  // Number of time steps to be averaged at the tail of the Force-Time curves to get Steady State Force

__global__ void kern(float *masterForces, ...)
{

int globalIdx = ((blockIdx.x + (blockIdx.y * gridDim.x)) * (blockDim.x * blockDim.y)) + (threadIdx.x + (threadIdx.y * blockDim.x));
...

  ...
   {
...
      {
          masterForces[i] = buffer[0]/24576.0;
      }

      }
   }
...
}



}


void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{
   ...

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);


//Device input vectors
float *d_F0;

..
// Allocate memory for each vector on GPU
cudaMalloc((void**)&d_F0, iterations * sizeof(float));
...




//////////////////////////////////////////////LAUNCH ////////////////////////////////////////////////////////////////////////////////////

kern<<<1, 1024>>>( d_F0);



//////////////////////////////////////////////RETRIEVE DATA ////////////////////////////////////////////////////////////////////////////////////


cudaMemcpyAsync( h_F0 , d_F0 , iterations * sizeof(float), cudaMemcpyDeviceToHost);



///////////////////Free Memory///////////////////


cudaDeviceReset();
////////////////////////////////////////////////////

}

Why is it so slow?

Edit: mex was compiling for an older architecture (SM_13) instead of SM_35. Now the timings make sense (16 s with MEX, 14 s with C++/CUDA).

Answer 1

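You do not need mxGPUArray if the output of your CUDA code is a plain-old-data (POD) host-side (as opposed to device-side) array, such as your Forces1 array of floats created with new. The MathWorks example you are referencing is probably demonstrating how to use MATLAB's gpuArray and built-in CUDA functionality, rather than how to pass data to and from regular CUDA functions within a MEX function.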

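If you can initialize Forces1 (or h_F0 in your full code) outside of and before the CUDA function (for example, in mexFunction), then the solution is simply to switch from new to one of the mxCreate* functions (mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateNumericMatrix, etc.) and then pass the data pointer to your CUDA function: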

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);
// myCudaWrapper(...,h_F0 ,...) /* i.e. cudaMemcpyAsync(h_F0,d_F0,...) */

So the only changes to your code are:

Replace:

float *h_F0 = new float[(iterations)];

with

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);

Remove:

delete h_F0;

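Note: If instead your CUDA code owns the output host-side array, then you will have to copy the data into an mxArray. This is because unless you allocate your mx outputs with the MEX API, any data buffer you assign (e.g. with mxSetData) will not be handled by the MATLAB memory manager, and you will get a segfault or, at best, a memory leak.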
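The copy described in the note could look roughly like the sketch below. It is only an illustration under stated assumptions: runSimulationOwningResult and copyToOutput are hypothetical names, and the first one stands in for a CUDA wrapper that allocates and owns its own host-side result. The mexFunction part is guarded by MATLAB_MEX_FILE (the macro the mex build normally defines) so that the helpers can also be compiled and checked outside MATLAB.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a CUDA wrapper that allocates and owns its
// host-side result. A real version would launch the kernel and copy the
// device buffer back into this vector.
std::vector<float> runSimulationOwningResult(std::size_t n) {
    std::vector<float> forces(n);
    for (std::size_t i = 0; i < n; ++i) {
        forces[i] = static_cast<float>(i) / 24576.0f;  // placeholder values
    }
    return forces;
}

// Copy the owned buffer into destination storage of at least src.size() floats.
void copyToOutput(const std::vector<float>& src, float* dst) {
    std::memcpy(dst, src.data(), src.size() * sizeof(float));
}

#ifdef MATLAB_MEX_FILE  // assumed to be defined when building with mex
#include "mex.h"

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    const std::size_t n = 159744;  // same length as `iterations` in the question

    // The result is owned by C++ code, not by MATLAB.
    std::vector<float> owned = runSimulationOwningResult(n);

    // Allocate the output through the MEX API so MATLAB manages its memory,
    // then copy the data in rather than handing MATLAB the foreign buffer.
    plhs[0] = mxCreateNumericMatrix(n, 1, mxSINGLE_CLASS, mxREAL);
    copyToOutput(owned, static_cast<float*>(mxGetData(plhs[0])));
}
#endif
```

The extra memcpy costs one pass over the output, which is why allocating with mxCreateNumericMatrix up front and letting the CUDA code write into that buffer directly is preferable whenever you control the allocation.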

Mex CUDA dynamic allocation / slow mex code
