Question

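Recently I used CUDA to implement an algorithm called "orthogonal matching pursuit". With my ugly CUDA code the whole iteration takes 60 seconds, whereas the Eigen library takes only 3 seconds...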

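In my code, the matrix A is [640, 1024] and y is [640, 1]. At each step I select some column vectors from A to form a new matrix called A_temp of size [640, itera], with itera = 1:500. I keep an array MaxDex_Host[] on the CPU that records which columns to select.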

I want to solve A_temp * x_temp = y for x_temp [itera, 1] in the least-squares sense. For this I use the CULA API 'culaDeviceSgels' and the cuBLAS matrix-vector multiplication API.

So culaDeviceSgels is called 500 times, and I expected this to be faster than the Eigen library's QR solver.

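I checked the performance analysis in Nsight and found that cuStreamDestroy takes a long time. I initialize cuBLAS before the iteration and destroy it after I get the result. So I would like to know what cuStreamDestroy is, and how it differs from cublasDestroy.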

The main problems are the memcpy and the function 'gemm_kernel1x1val'. I think this function comes from 'culaDeviceSgels'.

while (itera < 500): I use cublasSgemv and cublasIsamax to obtain MaxDex_Host[itera], and then

    MaxDex_Host[itera] = pos;
    itera++;
    float* A_temp_cpu = new float[M * itera]; // all matrices are column-major
    // Build A_temp [M, itera]; MaxDex_Host[] holds the indices of the columns of A to select.
    for (int j = 0; j < itera; j++)
    {
        // M = 640, A is 640 x 1024, and itera grows by 1 each step
        for (int i = 0; i < M; i++)
        {
            A_temp_cpu[j * M + i] = A[MaxDex_Host[j] * M + i];
        }
    }
    // I must allocate one more array because culaDeviceSgels overwrites its input array with the factorization, and I want to use A_temp after the least-squares solve.
    float* A_temp_gpu;
    float* A_temp2_gpu;
    cudaMalloc((void**)&A_temp_gpu, Size_float * M * itera);
    cudaMalloc((void**)&A_temp2_gpu, Size_float * M * itera);
    cudaMemcpy(A_temp_gpu, A_temp_cpu, Size_float * M * itera, cudaMemcpyHostToDevice);
    cudaMemcpy(A_temp2_gpu, A_temp_gpu, Size_float * M * itera, cudaMemcpyDeviceToDevice);
    // The x_temp I want is returned in y_Gpu_temp, stored in y_Gpu_temp[0] .. y_Gpu_temp[itera-1].
    culaDeviceSgels('N', M, itera, 1, A_temp_gpu, M, y_Gpu_temp, M);
    float* x_temp;
    cudaMalloc((void**)&x_temp, Size_float * itera);
    cudaMemcpy(x_temp, y_Gpu_temp, Size_float * itera, cudaMemcpyDeviceToDevice);

CUDA memory management seems too complicated. Is there any other convenient way to solve the least-squares problem?

Answer 1

I think cuStreamDestroy and gemm_kernel1x1val are called internally by the APIs you are using, so there is not much you can do about them directly.

To improve your code, I would suggest the following.

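You can get rid of A_temp_cpu by keeping a device copy of the matrix A. You can then copy the selected columns of A into A_temp_gpu and A_temp2_gpu with a kernel that does the assignment on the device. This avoids the host-side gather and the host-to-device cudaMemcpy in every iteration.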
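You can preallocate A_temp_gpu and A_temp2_gpu outside the while loop by sizing them for the maximum possible value of itera instead of the current itera. This avoids the first two cudaMalloc calls inside the loop. The same applies to x_temp.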
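As far as I know, culaDeviceSgels solves a linear system of equations (in the least-squares sense when the system is overdetermined, as it is here). I think you could do the same using only cuBLAS APIs. For example, you could first perform an LU factorization with cublasSgetrfBatched() (the single-precision routine, matching your float data) and then call cublasStrsv() twice to solve the two resulting triangular systems. Since LU applies to square matrices, this would mean factoring the normal equations (A_temp^T * A_temp) * x_temp = A_temp^T * y rather than A_temp itself. You may want to check whether this approach leads to a faster algorithm.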
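The first two suggestions come down to one buffer allocated once at its maximum size, into which only the newly selected column is copied at each iteration. Below is a minimal host-side C++ sketch of that indexing, so it can be checked without a GPU. `ColumnBuffer` and `append_column` are hypothetical names, not part of CUDA, cuBLAS or CULA. Appending one column instead of rebuilding the whole matrix is an assumption, based on the earlier entries of MaxDex_Host[] not changing between iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a column-major work matrix allocated once at its
// maximum size (M x max_cols) and filled one column per iteration.
struct ColumnBuffer {
    std::size_t M;            // rows per column
    std::size_t used = 0;     // columns filled so far (plays the role of itera)
    std::vector<float> data;  // M * max_cols floats, allocated exactly once

    ColumnBuffer(std::size_t rows, std::size_t max_cols)
        : M(rows), data(rows * max_cols) {}

    // Append column `col` of the column-major matrix A (which has M rows).
    // Only M floats are copied; previously selected columns stay in place.
    // Returns false when the buffer is already full.
    bool append_column(const float* A, std::size_t col) {
        assert(A != nullptr);
        if ((used + 1) * M > data.size()) return false;
        const float* src = A + col * M;
        std::copy(src, src + M, data.begin() + used * M);
        ++used;
        return true;
    }
};
```

On the device, the same layout would presumably come from a single cudaMalloc of M * 500 floats before the loop, with the per-column copy done by a cudaMemcpy with cudaMemcpyDeviceToDevice or by a small kernel. Because culaDeviceSgels overwrites its input, the buffer handed to it would still need to be refreshed from the untouched copy before each solve.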
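To make the LU suggestion concrete, here is a small CPU reference in C++ of what it computes: the normal equations are formed, factored by LU with partial pivoting, and solved with one forward and one backward triangular substitution. This is a sketch for checking results, not a GPU implementation. `solve_normal_equations` is a hypothetical name, and the singularity threshold is a crude assumption. On the GPU the products could be formed with cublasSgemm (or cublasSsyrk) and cublasSgemv. Note that forming A^T A squares the condition number, so the QR-based route used by sgels is generally the more robust one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Solve min ||A x - y||_2 for a column-major M x n matrix A (M >= n) through
// the normal equations G x = b, with G = A^T A and b = A^T y.
// Returns an empty vector if G is numerically singular.
std::vector<double> solve_normal_equations(const std::vector<float>& A,
                                           const std::vector<float>& y,
                                           std::size_t M, std::size_t n) {
    assert(A.size() == M * n && y.size() == M);
    std::vector<double> G(n * n, 0.0), b(n, 0.0);  // G is stored row-major
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c)
            for (std::size_t i = 0; i < M; ++i)
                G[r * n + c] += double(A[r * M + i]) * double(A[c * M + i]);
        for (std::size_t i = 0; i < M; ++i)
            b[r] += double(A[r * M + i]) * double(y[i]);
    }
    // In-place LU factorization with partial pivoting; b is permuted alongside.
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;
        for (std::size_t r = k + 1; r < n; ++r)
            if (std::fabs(G[r * n + k]) > std::fabs(G[p * n + k])) p = r;
        if (std::fabs(G[p * n + k]) < 1e-12) return {};  // crude singularity check
        if (p != k) {
            for (std::size_t c = 0; c < n; ++c) std::swap(G[k * n + c], G[p * n + c]);
            std::swap(b[k], b[p]);
        }
        for (std::size_t r = k + 1; r < n; ++r) {
            G[r * n + k] /= G[k * n + k];  // multiplier, kept as L(r, k)
            for (std::size_t c = k + 1; c < n; ++c)
                G[r * n + c] -= G[r * n + k] * G[k * n + c];
        }
    }
    // First triangular solve: L z = P b, with unit lower-triangular L.
    for (std::size_t r = 1; r < n; ++r)
        for (std::size_t c = 0; c < r; ++c) b[r] -= G[r * n + c] * b[c];
    // Second triangular solve: U x = z.
    for (std::size_t r = n; r-- > 0;) {
        for (std::size_t c = r + 1; c < n; ++c) b[r] -= G[r * n + c] * b[c];
        b[r] /= G[r * n + r];
    }
    return b;
}
```

With at most 500 selected columns, G is at most 500 x 500, so the factorization works on a much smaller matrix than the 640 x itera input. Whether that outweighs the cost of forming G on the GPU is something to measure.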

Cuda: least-squares solving, poor speed
