Question

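I have two questions.

I)

I have a .cpp file that contains main(). To call the kernel (which lives in a .cu file), I use an extern function, launch(), defined in the .cu file. The two files, .cu and .cpp, each compile on their own. Since I am a CUDA beginner, I tried two ways of linking them together:

1) nvcc -Wno-deprecated-gpu-targets -o final file1.cpp file2.cu, which gives no errors and successfully builds the final program.

2)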

nvcc -Wno-deprecated-gpu-targets -c file2.cu
   g++ -c file1.cpp
   g++ -o program file1.o file2.o -lcudart -lcurand -lcutil -lcudpp -lcuda

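In the second case, the -l parameters are not recognized (only -lcuda is). I guess this is because I have not specified their paths, as I do not know where these files are stored. If I skip these -l parameters, the errors are: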

$ g++ -o final backpropalgorithm_CUDA_kernel_copy.o backpropalgorithm_CUDA_main_copy.o -lcuda
backpropalgorithm_CUDA_kernel_copy.o: In function `launch':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x185): undefined reference to `cudaConfigureCall'
backpropalgorithm_CUDA_kernel_copy.o: In function `__cudaUnregisterBinaryUtil()':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x259): undefined reference to `__cudaUnregisterFatBinary'
backpropalgorithm_CUDA_kernel_copy.o: In function `__nv_init_managed_rt_with_module(void**)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x274): undefined reference to `__cudaInitModule'
backpropalgorithm_CUDA_kernel_copy.o: In function `__device_stub__Z21neural_network_kernelPfPiS0_PdS1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_(float*, int*, int*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x2ac): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x2cf): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x2f2): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x315): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x338): undefined reference to `cudaSetupArgument'
backpropalgorithm_CUDA_kernel_copy.o:tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x35b): more undefined references to `cudaSetupArgument' follow
backpropalgorithm_CUDA_kernel_copy.o: In function `__nv_cudaEntityRegisterCallback(void**)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x663): undefined reference to `__cudaRegisterFunction'
backpropalgorithm_CUDA_kernel_copy.o: In function `__sti____cudaRegisterAll_69_tmpxft_0000717b_00000000_7_backpropalgorithm_CUDA_kernel_copy_cpp1_ii_43082cd7()':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x67c): undefined reference to `__cudaRegisterFatBinary'
backpropalgorithm_CUDA_kernel_copy.o: In function `cudaError cudaLaunch<char>(char*)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x6c0): undefined reference to `cudaLaunch'
backpropalgorithm_CUDA_main_copy.o: In function `main':
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x92): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0xf8): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x118): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x12c): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x14c): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x160): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x180): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x194): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1b4): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1c8): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1e8): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1ff): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x21f): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x236): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x256): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x26a): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x28a): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2a1): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2c1): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2d5): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2f5): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x309): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x329): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x33d): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x35d): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x371): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x391): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3a5): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3c5): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3dc): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3fc): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x413): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x433): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x44a): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x46a): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x481): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x4a1): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x5bf): undefined reference to `cudaDeviceSynchronize'
collect2: error: ld returned 1 exit status

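The problem is that in the first case, with the "successful" compile and link, when I run the program it shows only a blinking cursor (on the line after the command) and nothing else in the console. Normally it should compute with CUDA and print the errors of the neural network being trained.

II) I am trying to use printf() in the .cu file, but it does not print anything. I searched and found that perhaps I should use the cuPrintf() function. I tried it, but I ran into header problems: the include files were reported as undefined even when I included them manually. I then found that I should include a cuPrintf.cu file, whose source I found online.

Unfortunately, when I compile the files separately, the error for the .cu file is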

ptxas fatal   : Unresolved extern function '_Z8cuPrintfIjEiPKcT_'

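and the .cpp file gives no errors.

Why do all these errors occur? Where is the mistake? Why does the program not run correctly, and why does printf() not seem to work in the kernel? Why does the program only show a blinking cursor and nothing more? I would be very grateful if someone could shed light on these problems. Thank you very much in advance!

The code of the two files is:

file1.cpp: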

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string>
#include "/home/user/include_files/cuda-8.0/include/cuda.h"
#include "/home/user/include_files/cuda-8.0/include/cuda_runtime.h"
#include "/home/user/include_files/cuda-8.0/include/cuda_runtime_api.h"

#define datanum 4       // number of training samples
#define InputN 16       // number of neurons in the input layer
#define hn 64           // number of neurons in the hidden layer
#define OutN 1          // number of neurons in the output layer
#define threads_per_block 256




   using namespace std;

extern "C"
void launch(float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp);

__global__ void neural_network_kernel (float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp);

int main(int argc, char *argv[]){
    printf("welcome1\n");   
    int times = 100000;

    double sigmoid(double);
    //string result = "";
    char buffer[200];
printf("welcome2\n");
    double x_out[InputN];       // input layer
printf("welcome3\n");
    double hn_out[hn];          // hidden layer
printf("welcome4\n");
    double y_out[OutN];         // output layer
printf("welcome5\n");
    double y[OutN];             // expected output layer
printf("welcome6\n");
    double w[InputN][hn];       // weights from input layer to hidden layer
    double v[hn][OutN];         // weights from hidden layer to output layer

    double deltaw[InputN][hn];
    double deltav[hn][OutN];
printf("welcome7\n");
    double hn_delta[hn];        // delta of hidden layer
    double y_delta[OutN];       // delta of output layer
    //double errlimit = 0.001;
    double alpha = 0.1, beta = 0.1;
    int i, j, m;
    double sumtemp;
    double errtemp;


    /*cudaPrintfInit();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();*/

    printf("Line : main\n");

    // Training


    /*struct{
        double input[InputN];
        double teach[OutN];
    }data[datanum];

    for(m=0; m<datanum; m++){
        for(i=0; i<InputN; i++)
            data[m].input[i] = (double)rand()/32767.0;
        for(i=0;i<OutN;i++)
            data[m].teach[i] = (double)rand()/32767.0;
    }

    // Initialization
    for(i=0; i<InputN; i++){
        for(j=0; j<hn; j++){
            w[i][j] = ((double)rand()/32767.0)*2-1;
            deltaw[i][j] = 0;
        }
    }
    for(i=0; i<hn; i++){
        for(j=0; j<OutN; j++){
            v[i][j] = ((double)rand()/32767.0)*2-1;
            deltav[i][j] = 0;
        }
    }*/


    //curandGenerator_t gen;
    srand (time(NULL));
    float randData[threads_per_block];
printf("welcome8\n");
    for (int i=0; i<threads_per_block; i++)
    {
        randData[i] = rand()%100;   //Else, without %100, it returns some billions for number!
    }
printf("welcome9\n");
    /*curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, randData, threads_per_block);*/
    int loop = 0;
    double error;
    double max, min;
    double *max_p_GPU, *min_p_GPU, *error_p_GPU;
    float *randData_p_GPU;
    int *times_p_GPU, *loop_p_GPU, *InputN_p_GPU, *hn_p_GPU, *OutN_p_GPU;
    double *x_out_p_GPU, *hn_out_p_GPU, *y_out_p_GPU, *y_p_GPU, *w_p_GPU, *v_p_GPU, *deltaw_p_GPU, *deltav_p_GPU, *hn_delta_p_GPU;
    double *y_delta_p_GPU, *alpha_p_GPU, *beta_p_GPU, *sumtemp_p_GPU, *errtemp_p_GPU;
    //int blocks = times/threads_per_block;

printf("welcome10\n");  
    cudaMalloc((void **)&randData_p_GPU, threads_per_block*sizeof(float));
printf("DEBUG1\n");
    cudaMemcpy(randData_p_GPU, randData, threads_per_block*sizeof(float), cudaMemcpyHostToDevice);
printf("welcome11\n");
    cudaMalloc((void **)&times_p_GPU, sizeof(int));
printf("welcome12\n");
    cudaMemcpy(times_p_GPU, &times, sizeof(int), cudaMemcpyHostToDevice);
printf("welcome13\n");
    cudaMalloc((void **)&loop_p_GPU, sizeof(int));
printf("welcome14\n");
    cudaMemcpy(loop_p_GPU, &loop, sizeof(int), cudaMemcpyHostToDevice);
printf("welcome15\n");
    cudaMalloc((void **)&error_p_GPU, sizeof(double));
printf("welcome16\n");
    cudaMemcpy(error_p_GPU, &error, sizeof(double), cudaMemcpyHostToDevice);
printf("welcome17\n");
    cudaMalloc((void **)&max_p_GPU, sizeof(double));
printf("welcome18\n");
    cudaMemcpy(max_p_GPU, &max, sizeof(double), cudaMemcpyHostToDevice);
printf("welcome19\n");
    cudaMalloc((void **)&min_p_GPU, sizeof(double));
printf("welcome20\n");
    cudaMemcpy(min_p_GPU, &min, sizeof(double), cudaMemcpyHostToDevice);
printf("welcome21\n");
    /* cudaMalloc((void **)&InputN_p_GPU, sizeof(int));
    cudaMemcpy(InputN_p_GPU, &InputN, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&hn_p_GPU, sizeof(int));
    cudaMemcpy(hn_p_GPU, &hn, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&OutN_p_GPU, sizeof(int));
    cudaMemcpy(OutN_p_GPU, &OutN, sizeof(int), cudaMemcpyHostToDevice); */

    /*cudaMalloc((void **)&x_out_p_GPU, sizeof(double)*(threads_per_block*InputN));
    cudaMemcpy(x_out_p_GPU, &x_out, sizeof(double)*InputN, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&hn_out_p_GPU, sizeof(double)*(threads_per_block*hn));
    cudaMemcpy(hn_out_p_GPU, &hn_out, sizeof(double)*hn, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&y_out_p_GPU, sizeof(double)*(threads_per_block*OutN));
    cudaMemcpy(y_out_p_GPU, &y_out, sizeof(double)*OutN, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&hn_delta_p_GPU, sizeof(double)*(threads_per_block*hn));
    cudaMemcpy(hn_delta_p_GPU, &hn_delta, sizeof(double)*hn, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&y_delta_p_GPU, sizeof(double)*(threads_per_block*OutN));
    cudaMemcpy(y_delta_p_GPU, &y_delta, sizeof(double)*OutN, cudaMemcpyHostToDevice);*/

    cudaMalloc((void **)&x_out_p_GPU, sizeof(double)*InputN);
printf("welcome22\n");
    cudaMemcpy(x_out_p_GPU, &x_out, sizeof(double)*InputN, cudaMemcpyHostToDevice);
printf("welcome23\n");
    cudaMalloc((void **)&hn_out_p_GPU, sizeof(double)*hn);
printf("welcome24\n");
    cudaMemcpy(hn_out_p_GPU, &hn_out, sizeof(double)*hn, cudaMemcpyHostToDevice);
printf("welcome25\n");
    cudaMalloc((void **)&y_out_p_GPU, sizeof(double)*OutN);
printf("welcome26\n");
    cudaMemcpy(y_out_p_GPU, &y_out, sizeof(double)*OutN, cudaMemcpyHostToDevice);
printf("welcome27\n");
    cudaMalloc((void **)&hn_delta_p_GPU, sizeof(double)*hn);
printf("welcome28\n");
    cudaMemcpy(hn_delta_p_GPU, &hn_delta, sizeof(double)*hn, cudaMemcpyHostToDevice);
printf("welcome29\n");
    cudaMalloc((void **)&y_delta_p_GPU, sizeof(double)*OutN);
printf("welcome30\n");
    cudaMemcpy(y_delta_p_GPU, &y_delta, sizeof(double)*OutN, cudaMemcpyHostToDevice);
printf("welcome31\n");

    cudaMalloc((void **)&alpha_p_GPU, sizeof(double));
    cudaMemcpy(alpha_p_GPU, &alpha, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&beta_p_GPU, sizeof(double));
    cudaMemcpy(beta_p_GPU, &beta, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&sumtemp_p_GPU, sizeof(double));
    cudaMemcpy(sumtemp_p_GPU, &sumtemp, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&errtemp_p_GPU, sizeof(double));
    cudaMemcpy(errtemp_p_GPU, &errtemp, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&w_p_GPU, sizeof(double)*InputN*hn);
    cudaMemcpy(w_p_GPU, &w, sizeof(double)*(InputN*hn), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&v_p_GPU, sizeof(double)*hn*OutN);
    cudaMemcpy(v_p_GPU, &v, sizeof(double)*(hn*OutN), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&deltaw_p_GPU, sizeof(double)*InputN*hn);
    cudaMemcpy(deltaw_p_GPU, &deltaw, sizeof(double)*(InputN*hn), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&deltav_p_GPU, sizeof(double)*hn*OutN);
    cudaMemcpy(deltav_p_GPU, &deltav, sizeof(double)*(hn*OutN), cudaMemcpyHostToDevice);

printf("welcome40\n");

    launch(randData, times_p_GPU, loop_p_GPU, error_p_GPU, max_p_GPU, min_p_GPU, x_out_p_GPU, hn_out_p_GPU, y_out_p_GPU, y_p_GPU, w_p_GPU, v_p_GPU, deltaw_p_GPU, deltav_p_GPU, hn_delta_p_GPU, y_delta_p_GPU, alpha_p_GPU, beta_p_GPU, sumtemp_p_GPU, errtemp_p_GPU);

printf("welcome41\n");

    cudaDeviceSynchronize();
printf("welcome_after_kernel\n");
}

file.cu:

#define w(i,j) w[(i)*(InputN*hn) + (j)]
#define v(i,j) v[(i)*(hn*OutN) + (j)]
#define x_out(i,j) x_out[(i)*(InputN) + (j)]
#define y(i,j) y[(i)*(OutN) + (j)]
#define hn_out(i,j) hn_out[(i)*(hn) + (j)]
#define y_out(i,j) y_out[(i)*(OutN) + (j)]
#define y_delta(i,j) y_delta[(i)*(OutN) + (j)]
#define hn_delta(i,j) hn_delta[(i)*(hn) + (j)]
#define deltav(i,j) deltav[(i)*(hn*OutN) + (j)]
#define deltaw(i,j) deltaw[(i)*(InputN*hn) + (j)]

#define datanum 4       // number of training samples
#define InputN 16       // number of neurons in the input layer
#define hn 64           // number of neurons in the hidden layer
#define OutN 1          // number of neurons in the output layer
#define threads_per_block 256
#define MAX_RAND 100
#define MIN_RAND 10

#include <stdio.h>
#include <math.h>   //for truncf()


// sigmoid serves as activation function
__device__ double sigmoid(double x){
    return(1.0 / (1.0 + exp(-x)));
}


__device__ int rand_kernel(int index, float *randData){
    float myrandf = randData[index];
    myrandf *= (MAX_RAND - MIN_RAND + 0.999999);
    myrandf += MIN_RAND;
    int myrand = (int)truncf(myrandf);
    return myrand;
}


__global__ void neural_network_kernel (float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp)
{
    //int i = blockIdx.x;
    //int idx = threadIdx.x;
    //int idy = threadIdx.y

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // training set
    struct{
        double input_kernel[InputN];
        double teach_kernel[OutN];
    }data_kernel[threads_per_block + datanum];

    if (index==0)
    {
        for(int m=0; m<datanum; m++){
            for(int i=0; i<InputN; i++)
                data_kernel[threads_per_block + m].input_kernel[i] = (double)rand_kernel(index, randData)/32767.0;
            for(int i=0;i<OutN;i++)
                data_kernel[threads_per_block + m].teach_kernel[i] = (double)rand_kernel(index, randData)/32767.0;
        }
    }


    // Initialization
    for(int i=0; i<InputN; i++){
        for(int j=0; j<hn; j++){
            w(i,j) = ((double)rand_kernel(index, randData)/32767.0)*2-1;
            deltaw(i,j) = 0;
        }
    }
    for(int i=0; i<hn; i++){
        for(int j=0; j<OutN; j++){
            v(i,j) = ((double)rand_kernel(index, randData)/32767.0)*2-1;
            deltav(i,j) = 0;
        }
    }


    while(loop[index] < *times){
        loop[index]++;
        error[index] = 0.0;

        for(int m=0; m<datanum ; m++){
            // Feedforward
            max[index] = 0.0;
            min[index] = 0.0;
            for(int i=0; i<InputN; i++){
                x_out(index,i) = data_kernel[threads_per_block + m].input_kernel[i];
                if(max[index] < x_out(index,i))
                    max[index] = x_out(index,i);
                if(min[index] > x_out(index,i))
                    min[index] = x_out(index,i);
            }
            for(int i=0; i<InputN; i++){
                x_out(index,i) = (x_out(index,i) - min[index]) / (max[index] - min[index]);
            }

            for(int i=0; i<OutN ; i++){
                y(index,i) = data_kernel[threads_per_block + m].teach_kernel[i];
            }

            for(int i=0; i<hn; i++){
                sumtemp[index] = 0.0;
                for(int j=0; j<InputN; j++)
                    sumtemp[index] += w(j,i) * x_out(index,j);
                hn_out(index,i) = sigmoid(sumtemp[index]);      // sigmoid serves as the activation function
            }

            for(int i=0; i<OutN; i++){
                sumtemp[index] = 0.0;
                for(int j=0; j<hn; j++)
                    sumtemp[index] += v(j,i) * hn_out(index,j);
                y_out(index,i) = sigmoid(sumtemp[index]);
            }

            // Backpropagation
            for(int i=0; i<OutN; i++){
                errtemp[index] = y(index,i) - y_out(index,i);
                y_delta(index,i) = -errtemp[index] * sigmoid(y_out(index,i)) * (1.0 - sigmoid(y_out(index,i)));
                error[index] += errtemp[index] * errtemp[index];
            }

            for(int i=0; i<hn; i++){
                errtemp[index] = 0.0;
                for(int j=0; j<OutN; j++)
                    errtemp[index] += y_delta(index,j) * v(i,j);
                hn_delta(index,i) = errtemp[index] * (1.0 + hn_out(index,i)) * (1.0 - hn_out(index,i));
            }

            // Stochastic gradient descent
            for(int i=0; i<OutN; i++){
                for(int j=0; j<hn; j++){
                    deltav(j,i) = (*alpha) * deltav(j,i) + (*beta) * y_delta(index,i) * hn_out(index,j);
                    v(j,i) -= deltav(j,i);
                }
            }

            for(int i=0; i<hn; i++){
                for(int j=0; j<InputN; j++){
                    deltaw(j,i) = (*alpha) * deltaw(j,i) + (*beta) * hn_delta(index,i) * x_out(index,j);
                    w(j,i) -= deltaw(j,i);
                }
            }
        }

        // Global error
        error[index] = error[index] / 2;
        /*if(loop%1000==0){
            result = "Global Error = ";
            sprintf(buffer, "%f", error);
            result += buffer;
            result += "\r\n";
        }
        if(error < errlimit)
            break;*/

        printf("The %d th training, error: %0.100f\n", loop[index], error[index]);
    }
}


extern "C"
void launch(float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp)
{
    int blocks = *times/threads_per_block;
    neural_network_kernel<<<blocks, threads_per_block>>>(randData, times, loop, error, max, min, x_out, hn_out, y_out, y, w, v, deltaw, deltav, hn_delta, y_delta, alpha, beta, sumtemp, errtemp);
}

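Update:

I found some errors in the memory allocation with pointers and updated the code above. The main questions now are:

1) Is the linking/compiling correct, and is this how I should do it? I mean the first way.

2) I found that the blinking cursor appears right at the first cudaMalloc(). Up to that point the program runs correctly.

But at the first cudaMalloc() it hangs forever. Why?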

Answer 1

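Before asking for help here, it is good practice to use proper CUDA error checking and to run your code with cuda-memcheck. If you do not, you may overlook useful error information and waste your own time as well as that of the people trying to help you.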
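As a rough illustration of what "proper CUDA error checking" can look like, here is a minimal sketch. The names CUDA_CHECK, increment_kernel and roundtrip_value are hypothetical and exist only for this example; the runtime calls themselves (cudaMalloc, cudaMemcpy, cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString, cudaFree) are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA toolkit): evaluates a
// runtime call, and if it did not return cudaSuccess, prints the file, line
// and error string, then exits.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Trivial kernel used only to have something to launch.
__global__ void increment_kernel(int *value) {
    *value += 1;
}

// Copies `value` to the device, increments it there, and copies it back.
// Every runtime call is wrapped so a failure is reported where it happens.
int roundtrip_value(int value) {
    int *d_value = nullptr;
    int result = 0;

    CUDA_CHECK(cudaMalloc((void **)&d_value, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(d_value, &value, sizeof(int), cudaMemcpyHostToDevice));

    increment_kernel<<<1, 1>>>(d_value);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(&result, d_value, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_value));
    return result;
}
```

Calling roundtrip_value() from main() and building with nvcc should either return the incremented value or stop with a message naming the failing call, which is usually far more informative than a silent hang or missing output.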

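"In the second case, the -l parameters are not recognized (only -lcuda is). I guess this is because I have not specified their paths, as I do not know where these files are stored."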

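You don't want to skip those. nvcc links against some of these libraries automatically and knows where to find them. When using g++, you have to tell it both where to look and which specific libraries you need. For the code you have shown, you do not need all of the libraries you listed, so the following should be enough: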

   g++ -o program file1.o file2.o -L/usr/local/cuda/lib64 -lcudart

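for a standard Linux install of CUDA. If you do not have a standard install, you can run which nvcc to find where nvcc is located, and then use that to work out the probable location of the libraries (change bin in the path to lib64).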

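If you really do need some of the other libraries, things like cutil and cudpp will not be available unless you followed special steps to install them, and you will need to identify their paths in that case.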

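Regarding cuPrintf, you should not need it if you are compiling for and running on a cc2.0 or newer GPU (which is the lowest compute capability supported by CUDA 8 anyway). Ordinary printf should work in device code, and if it does not (because you have device code errors; use proper error checking and cuda-memcheck), then cuPrintf will not work any better. So rather than struggling to make that work, just revert your code to use printf instead (and include stdio.h).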
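To make the point about ordinary device-side printf concrete, here is a minimal sketch. The names hello_kernel and run_hello are hypothetical and used only for this example. Device printf output is buffered and is flushed at synchronization points such as cudaDeviceSynchronize(), so the host has to synchronize before the text appears.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its own block and thread index using plain printf.
__global__ void hello_kernel() {
    printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Launches the kernel and returns the first error encountered, if any.
cudaError_t run_hello() {
    hello_kernel<<<1, 4>>>();
    cudaError_t err = cudaGetLastError();  // launch-time errors
    if (err != cudaSuccess) {
        return err;
    }
    // Waits for the kernel to finish; this also flushes the device-side
    // printf buffer to the host's stdout.
    return cudaDeviceSynchronize();
}
```

If run_hello() returns anything other than cudaSuccess, passing the value to cudaGetErrorString() gives a readable reason, and no output is to be expected from the kernel until that error is fixed.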

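As for your program and why it does not work, I think you probably have a number of bugs. You may want to learn how to use a debugger. In the host code, you attempt to dereference a device pointer (*times in launch()), which is illegal.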
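One way to avoid dereferencing a device pointer on the host is to pass the count to the launch wrapper as an ordinary host value. The sketch below is an illustration under assumptions, not a fix for the whole program: mark_kernel and launch_marked are hypothetical names, and the kernel is a stand-in for the real one.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256

// Stand-in kernel: each thread marks its own slot; threads past n do nothing.
__global__ void mark_kernel(int *flags, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        flags[index] = 1;
    }
}

// Hypothetical wrapper. `times` is a plain host int passed by value, so the
// host may read it freely. `d_flags` is a device pointer: the host only
// forwards it to the kernel and never dereferences it.
extern "C" cudaError_t launch_marked(int *d_flags, int times) {
    // Ceiling division so that blocks * THREADS_PER_BLOCK >= times.
    int blocks = (times + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    mark_kernel<<<blocks, THREADS_PER_BLOCK>>>(d_flags, times);
    return cudaGetLastError();
}
```

The bounds check inside the kernel matters because the rounded-up grid can contain more threads than there are elements.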

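Now that I see you have changed the question several times, turning it into a moving target, I will stop here.

If you want help, stop moving the target.

Use proper CUDA error checking.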

CUDA and C++ linking/compilation, cudaMalloc
