How to get the real and imaginary parts of a complex matrix separately in CUDA?

Time: 2013-12-27 19:18:08

Tags: c++ matrix cuda fft

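I am trying to compute the FFT of a 2D array. The input is an NxM real matrix, so the output is also stored as an NxM matrix (the 2xNxM complex output is packed into an NxM matrix by exploiting Hermitian symmetry).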

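So I would like to know whether there is a way in CUDA to extract the real and imaginary matrices separately. In OpenCV the split function does this. I am looking for a similar function in CUDA, but I have not found one yet.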

Here is my complete code:

#define NRANK 2
#define BATCH 10

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h> 

#include <iostream>
#include <vector>

using namespace std;

int main()
    { 

    const size_t NX = 4;
    const size_t NY = 5;

    // Input array - host side
     float b[NX][NY] ={ 
        {0.7943 ,   0.6020 ,   0.7482  ,  0.9133  ,  0.9961},
        {0.3112 ,   0.2630 ,   0.4505  ,  0.1524  ,  0.0782},
        {0.5285 ,   0.6541 ,   0.0838  ,  0.8258  ,  0.4427},
        {0.1656 ,   0.6892 ,   0.2290  ,  0.5383  ,  0.1067}
    };


    // Output array - host side
    float c[NX][NY] = { 0 };

    cufftHandle plan;
    cufftComplex *data; // Holds both the input and the output - device side
    int n[NRANK] = {NX, NY};

    // Allocate memory and copy from host to device
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*(NY/2+1));
    for(int i=0; i<NX; ++i){
        // Copied row by row because my actual array is dynamically allocated;
        // here I've replaced it with a static 2D array to keep it simple.
        cudaMemcpy(reinterpret_cast<float*>(data) + i*NY, b[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
     }

    // Perform the FFT
    cufftPlanMany(&plan, NRANK, n,NULL, 1, 0,NULL, 1, 0,CUFFT_R2C,BATCH);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecR2C(plan, (cufftReal*)data, data);
    cudaThreadSynchronize();
    cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);


    // Here c is an NxM matrix. I want to split it into 2 separate NxM matrices,
    // one holding the real and the other the imaginary component of the output

    // Here c is in 
    cufftDestroy(plan);
    cudaFree(data);

    return 0;
    }

Edit

Following JackOLanter's suggestion, I modified the code as follows. But the problem is still not solved.

 float  real_vec[NX][NY] = {0};       // host vector, real part
 float  imag_vec[NX][NY] = {0};       // host vector, imaginary part
cudaError  cudaStat1 = cudaMemcpy2D (real_vec, sizeof(real_vec[0]), data,  sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);
cudaError  cudaStat2 = cudaMemcpy2D (imag_vec, sizeof(imag_vec[0]),data + 1,  sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);

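The error I get is 'invalid pitch argument', but I cannot understand why. For the destination I use a pitch the size of 'float', while for the source I use the size of 'float2'.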

1 Answer:

Answer 0 (score: 2)

Your question and your code do not make much sense to me.

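  1. You are performing a batched FFT, but it seems you have not provided enough memory space for either the input or the output data;
  2. The output of cufftExecR2C is an NX*(NY/2+1) float2 matrix, which can be interpreted as an NX*2*(NY/2+1) float matrix (that is, NX*(NY+2) floats when NY is even). So you have not allocated enough space in c (only NX*NY floats) for the final cudaMemcpy. You still need one complex memory location for the continuous (zero-frequency) component of the output;
  3. Your question does not seem to be related to the cufftExecR2C call, but is more general: how to split a complex NX*NY matrix into two NX*NY matrices containing the real and imaginary parts, respectively.
  4. If I have interpreted your question correctly, the solution proposed by @njuffa in "Copying data to “cufftComplex” data struct?" may be a good lead for you.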
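The size mismatch described in point 2 can be checked with a few lines of host arithmetic. The following is a minimal sketch in plain C++ (no CUDA calls), assuming the default cuFFT R2C layout in which the last dimension is reduced to NY/2+1 complex values. The helper names `r2c_complex_count` and `r2c_float_count` are made up for illustration and are not part of cuFFT.

```cpp
#include <cstddef>

// Number of complex (float2) elements produced by one 2D real-to-complex
// transform of an nx-by-ny real input: only ny/2+1 values per row are
// stored, the rest follow from Hermitian symmetry.
// (Hypothetical helper, not part of cuFFT.)
constexpr std::size_t r2c_complex_count(std::size_t nx, std::size_t ny) {
    return nx * (ny / 2 + 1);
}

// The same output counted in floats: each complex value holds two floats.
constexpr std::size_t r2c_float_count(std::size_t nx, std::size_t ny) {
    return 2 * r2c_complex_count(nx, ny);
}
```

For the 4x5 input in the question, `r2c_complex_count(4, 5)` is 12 and `r2c_float_count(4, 5)` is 24, while `c` holds only 20 floats. With a batched plan, the device buffer would presumably need room for BATCH such transforms.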

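Edit

Below is a small example showing how to "assemble" and "disassemble" the real and imaginary parts of a complex vector when copying it from host to device or from device to host. Please add your own CUDA error checking.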

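    #include <stdio.h>
    #include <stdlib.h>        // malloc, free
    #include <cuda_runtime.h>  // cudaMalloc, cudaMemcpy, cudaMemcpy2D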
    
    #define N 16
    
    int main() { 
    
        // Declaring, allocating and initializing a complex host vector
        float2* b = (float2*)malloc(N*sizeof(float2));
        printf("ORIGINAL DATA\n");
        for (int i=0; i<N; i++) {
            b[i].x = (float)i;
            b[i].y = 2.f*(float)i;
            printf("%f %f\n",b[i].x,b[i].y);
        }
        printf("\n\n");
    
        // Declaring and allocating a complex device vector
        float2 *data; cudaMalloc((void**)&data, sizeof(float2)*N);
    
        // Copying the complex host vector to device
        cudaMemcpy(data, b, N*sizeof(float2), cudaMemcpyHostToDevice);
    
        // Declaring and allocating space on the host for the real and imaginary parts of the complex vector
        float* cr = (float*)malloc(N*sizeof(float));       
        float* ci = (float*)malloc(N*sizeof(float));       
    
        /*******************************************************************/
        /* DISASSEMBLING THE COMPLEX DATA WHEN COPYING FROM DEVICE TO HOST */
        /*******************************************************************/
        float* tmp_d = (float*)data;
    
        cudaMemcpy2D(cr,        sizeof(float), tmp_d,    2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);
        cudaMemcpy2D(ci,        sizeof(float), tmp_d+1,  2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);
    
        printf("DISASSEMBLED REAL AND IMAGINARY PARTS\n");
        for (int i=0; i<N; i++)
            printf("cr[%i] = %f; ci[%i] = %f\n",i,cr[i],i,ci[i]);
        printf("\n\n");
    
        /******************************************************************************/
        /* REASSEMBLING THE REAL AND IMAGINARY PARTS WHEN COPYING FROM HOST TO DEVICE */
        /******************************************************************************/
        cudaMemcpy2D(tmp_d,     2*sizeof(float), cr, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);
        cudaMemcpy2D(tmp_d + 1, 2*sizeof(float), ci, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);
    
        // Copying the complex device vector back to the host
        cudaMemcpy(b, data, N*sizeof(float2), cudaMemcpyDeviceToHost);
        printf("REASSEMBLED DATA\n");
        for (int i=0; i<N; i++) 
            printf("%f %f\n",b[i].x,b[i].y);
        printf("\n\n");
    
        // Releasing host and device memory
        free(b); free(cr); free(ci);
        cudaFree(data);
    
        getchar();
    
        return 0;
     }
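To see why the call in the question's edit is rejected while the calls above are accepted, it may help to model the pitched copy on the host. The sketch below is plain C++ and only mimics the argument layout of cudaMemcpy2D; `pitched_copy` and `split_complex` are hypothetical helpers, not CUDA functions. It relies on the documented constraint that `width` must not exceed either `dpitch` or `spitch`.

```cpp
#include <cstddef>
#include <cstring>

// Host-only model of a pitched 2D copy: `height` rows of `width` bytes,
// where consecutive rows start `dpitch` / `spitch` bytes apart.
// Returns false when a row would be wider than either pitch, the
// condition under which an invalid pitch is reported.
// (Hypothetical helper; it is not the CUDA implementation.)
bool pitched_copy(void* dst, std::size_t dpitch,
                  const void* src, std::size_t spitch,
                  std::size_t width, std::size_t height) {
    if (width > dpitch || width > spitch) return false;
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(d + row * dpitch, s + row * spitch, width);
    return true;
}

// Split n interleaved (re, im) pairs into two separate arrays by treating
// each float as a one-float-wide "row", as in the answer's example.
bool split_complex(const float* interleaved, std::size_t n,
                   float* re, float* im) {
    return pitched_copy(re, sizeof(float), interleaved,
                        2 * sizeof(float), sizeof(float), n)
        && pitched_copy(im, sizeof(float), interleaved + 1,
                        2 * sizeof(float), sizeof(float), n);
}
```

With the arguments from the question's edit (a destination pitch of NY floats, a source pitch of one float2 and a width of NY float2 values), the width check fails, which is consistent with the reported error. With a width of `sizeof(float)` the check passes. For the 2D R2C output, `n` would be NX*(NY/2+1), assuming the default contiguous layout.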