I'm playing with CUDA and trying to compute a realistic neuron model on the GPU. This is my second day working with CUDA, so I may be doing something completely stupid.
My system:
$ nvidia-smi
Wed Aug 1 18:03:53 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.45 Driver Version: 396.45 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K600 Off | 00000000:01:00.0 On | N/A |
| 25% 50C P8 N/A / N/A | 597MiB / 974MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1235 G /usr/lib/xorg/Xorg 232MiB |
| 0 2496 G /usr/bin/krunner 1MiB |
| 0 2498 G /usr/bin/plasmashell 102MiB |
| 0 2914 G ...-token=1063E9B61C5D53298A4DC8A65D896440 215MiB |
| 0 4817 G /usr/bin/kwin_x11 41MiB |
+-----------------------------------------------------------------------------+
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 396.45 Thu Jul 12 20:49:29 PDT 2018
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
According to the spec, I have one SM with 192 cores and a maximum of 1024 threads per block.
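For reference, here is a minimal sketch of how the SM count and block limit could be double-checked from code (standard cudaDeviceProp fields; the 192-cores-per-SM figure is not exposed by the API and comes from the Kepler spec sheet):
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 = the Quadro K600
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    return 0;
}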
What I want to do now is run a simulation of (say) 64 neurons in parallel. Each neuron iteratively integrates 3 differential equations with the Euler method (everything is kept simple at this stage). This is just a test. As a performance test I want to compute 1 minute of model time with a 0.01 ms time step, i.e. 60,000 ms / 0.01 ms = 6,000,000 Euler steps per neuron. Here is the code:
#include <stdio.h>
#include <iostream>
#include <math.h>

#define I   7
#define gna 35.
#define gk  9.
#define gl  0.1
#define ena 55.
#define ek  (-90.)
#define el  (-65.)
#define dt  0.01

__global__
void run(float *v, float *h, float *n)
{
    int i = threadIdx.x;
    printf("DB>> i=%d v=%g\n", i, v[i]);
    float minf, ninf, hinf, ntau, htau, a, b;
    for(unsigned long t = 0; t < 6000000l; ++t){
    //for(unsigned long t = 0; t < 1000000l; ++t){
        a = 0.1*(v[i]+35.)/(1.0-exp(-(v[i]+35.)/10.));
        b = 4.0*exp(-(v[i]+60.)/18.);
        minf = a/(a+b);
        a = 0.01*(v[i]+34.)/(1.0-exp(-(v[i]+34.)/10.));
        b = 0.125*exp(-(v[i]+44.)/80.);
        ninf = a/(a+b);
        ntau = 1./(a+b);
        a = 0.07*exp(-(v[i]+58.)/20.);
        b = 1.0/(1.0+exp(-(v[i]+28.)/10.));
        hinf = a/(a+b);
        htau = 1./(a+b);
        n[i] += dt*(ninf - n[i])/ntau;
        h[i] += dt*(hinf - h[i])/htau;
        v[i] += dt*(-gna*minf*minf*minf*h[i]*(v[i]-ena)-gk*n[i]*n[i]*n[i]*n[i]*(v[i]-ek)-gl*(v[i]-el)+I);
        //printf("%g %g\n",dt*t,v);
    }
    printf("DB>> i=%d v=%g\n", i, v[i]);
}

int main(void)
{
    int N = 64;
    float *v, *h, *n;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&v, N*sizeof(float));
    cudaMallocManaged(&h, N*sizeof(float));
    cudaMallocManaged(&n, N*sizeof(float));
    fprintf(stderr,"STEP 1\n");

    // initialize arrays on the host
    for (int i = 0; i < N; i++) {
        v[i] = -63.f;
        h[i] = n[i] = 0.f;
    }
    fprintf(stderr,"STEP 2\n");

    run<<<1, N>>>(v, h, n);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    fprintf(stderr,"STEP 3\n");

    // Free memory
    cudaFree(v);
    cudaFree(h);
    cudaFree(n);

    return 0;
}
This code (apparently) crashes: the second printf in the run function never appears. However, if I reduce the number of steps to 1000000l (see the commented-out line in run), it works, prints both printf outputs before and after the loop in run, and then gives more or less reasonable results.
Why is that?
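In case it helps with the diagnosis, here is a minimal error-checking sketch I could wrap around the kernel call (my own assumption about how to surface a launch or runtime failure; the CHECK macro is just a helper I made up for this post, but cudaGetLastError / cudaDeviceSynchronize / cudaGetErrorString are the standard runtime calls):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the CUDA error string and abort if a runtime call returns an error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Intended usage around the launch in main():
//   run<<<1, N>>>(v, h, n);
//   CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs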