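I am a beginner trying to learn how to use a GPU for high-speed computation, and I am trying to implement a simple FFT program on the GPU. Below is the program I use to compute the FFT on the CPU.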
from time import time as timer
import numpy as np
import matplotlib.pyplot as plt
winsize=512
shift=16
my_cmap='gray_r'
Fs = 8000
f = 1000
sample =200000
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)
data_len=len(y)
window_func=np.blackman(winsize)
fftdata=np.zeros((0,int(winsize/2)))
startime=timer()
for frame in range(0, data_len, shift):
#==============================================================================
#     if frame>0:
#         break
#==============================================================================
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe=np.fft.fft(windata,n=winsize)
    magframe=np.abs(fftframe)**2
    powerframe=np.log10(magframe)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata=np.append(fftdata,powerframe,axis=0)
endtime=timer()-startime
fftdata=np.asarray(fftdata)
fftfrq=np.fft.fftfreq(winsize,d=1/Fs)[:int(winsize/2)]
print("CPU runtime:",endtime,"sec")
Below is the spectrogram output when plotted with the imshow() function:
[spectrogram image]
The timing output is as follows:
CPU runtime: 65.02100014686584 sec
Next, I rewrote the above program to use my PC's GPU, a Quadro K2200, through the pyculib and numba packages provided by Anaconda.
from time import time as timer
import numpy as np
import pyculib.fft
from numba import cuda
import matplotlib.pyplot as plt
winsize=512
shift=16
Fs = 8000
f = 1000
sample =200000
t = np.arange(sample,dtype=np.float64)
y = np.sin(2 * np.pi * f * t / Fs)
my_cmap='gray_r'
data_len=len(y)
window_func=np.blackman(winsize)
fftdata_gpu=np.zeros((0,int(winsize/2)))
startime=timer()
for frame in range(0, data_len, shift):
# =============================================================================
#     if frame>0:
#         break
# =============================================================================
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe_gpu = np.zeros(winsize, np.complex128)
    d_xf_gpu = cuda.to_device(fftframe_gpu)
    pyculib.fft.fft(windata.astype(np.complex128),d_xf_gpu)
    d_xf_gpu.copy_to_host(fftframe_gpu)
    magframe_gpu=np.abs(fftframe_gpu)**2
    powerframe_gpu=np.log10(magframe_gpu)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata_gpu=np.append(fftdata_gpu,powerframe_gpu,axis=0)
endtime=timer()-startime
fftdata_gpu=np.asarray(fftdata_gpu)
print("GPU runtime:",endtime,"sec")
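The timing output from the program above shows that the GPU implementation actually took roughly 30 seconds longer than the CPU version.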
GPU runtime: 92.87200021743774 sec
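My guess is that this happens because I repeatedly copy the array to the device and fetch it back for every frame. Is there a better way to implement this? I would really appreciate any comments on what I am doing wrong here.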
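To put rough numbers on that guess, here is a small back-of-the-envelope sketch. It only uses the parameters from the programs above (winsize, shift, sample) and assumes one complex128 buffer of winsize elements per transfer; it does not measure anything on the GPU.

```python
import numpy as np

# Parameters copied from the programs above.
winsize = 512
shift = 16
sample = 200000

# Number of full windows the loop processes before it hits the break.
n_frames = (sample - winsize) // shift + 1

# Size of one complex128 frame, i.e. one host<->device copy.
bytes_per_frame = winsize * np.dtype(np.complex128).itemsize

# Total volume moved in one direction over the whole run.
total_mib = n_frames * bytes_per_frame / 2**20

print("frames:", n_frames)                    # frames: 12469
print("bytes per transfer:", bytes_per_frame)  # bytes per transfer: 8192
print("total per direction (MiB):", round(total_mib, 1))
```

So each iteration moves only 8 KiB, and the whole run moves on the order of 100 MiB per direction. If that reasoning holds, the volume of data is small, and any cost would come from doing thousands of tiny transfers and calls rather than from the amount copied.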
Edit: adding profiling results
For the CPU code: top 20 function calls, sorted by cumulative time
569255 function calls (563691 primitive calls) in 64.792 seconds
Ordered by: cumulative time
List reduced from 2594 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
296/1 0.009 0.000 64.793 64.793 {built-in method builtins.exec}
1 11.489 11.489 64.793 64.793 cuda_fft_tr1_cpu.py:6(<module>)
12469 0.037 0.000 52.145 0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
12469 52.093 0.004 52.093 0.004 {built-in method numpy.core.multiarray.concatenate}
12469 0.073 0.000 0.622 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:102(fft)
333/2 0.002 0.000 0.493 0.247 <frozen importlib._bootstrap>:966(_find_and_load)
333/2 0.001 0.000 0.493 0.246 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
323/3 0.002 0.000 0.491 0.164 <frozen importlib._bootstrap>:651(_load_unlocked)
272/3 0.001 0.000 0.491 0.164 <frozen importlib._bootstrap_external>:672(exec_module)
432/3 0.000 0.000 0.490 0.163 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
12469 0.073 0.000 0.415 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:47(_raw_fft)
356/24 0.000 0.000 0.336 0.014 {built-in method builtins.__import__}
1 0.000 0.000 0.232 0.232 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\pyplot.py:17(<module>)
1445/628 0.001 0.000 0.220 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 0.000 0.000 0.141 0.141 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\__init__.py:106(<module>)
12469 0.139 0.000 0.139 0.000 {built-in method numpy.fft.fftpack_lite.cfftf}
328 0.003 0.000 0.122 0.000 <frozen importlib._bootstrap>:870(_find_spec)
310 0.000 0.000 0.117 0.000 <frozen importlib._bootstrap_external>:1149(find_spec)
310 0.002 0.000 0.117 0.000 <frozen importlib._bootstrap_external>:1117(_get_spec)
1 0.000 0.000 0.115 0.115 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\__init__.py:101(<module>)
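One thing that stands out in the CPU profile above: about 52 of the 64.8 seconds are attributed to numpy's concatenate, reached through np.append, while the FFT itself accounts for about 0.6 seconds. np.append returns a new array on every call, so the result array is copied once per frame. A minimal sketch of the same loop with a preallocated output follows, assuming the frame count is computed up front; the function name spectrogram_prealloc is made up for illustration and is not part of the original program.

```python
import numpy as np

def spectrogram_prealloc(y, winsize=512, shift=16):
    """Log-power spectrogram that writes into a preallocated array."""
    half = winsize // 2
    window = np.blackman(winsize)
    # Count the full windows once, instead of growing the result per frame.
    n_frames = max(0, (len(y) - winsize) // shift + 1)
    out = np.empty((n_frames, half))
    for i in range(n_frames):
        start = i * shift
        spec = np.fft.fft(window * y[start:start + winsize], n=winsize)
        # Keep only the first half of the spectrum, as in the original code.
        out[i] = np.log10(np.abs(spec[:half]) ** 2)
    return out

# Same test signal as in the programs above.
Fs, f, sample = 8000, 1000, 200000
y = np.sin(2 * np.pi * f * np.arange(sample) / Fs)
fftdata = spectrogram_prealloc(y)
print(fftdata.shape)  # (12469, 256)
```

The np.append cost shows up almost unchanged in both profiles (about 52 seconds each), so this change would be expected to help the CPU and GPU versions alike, independent of the FFT library.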
For the GPU code: top 20 function calls, sorted by cumulative time
5689881 function calls (5642977 primitive calls) in 94.179 seconds
Ordered by: cumulative time
List reduced from 4373 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
590/1 0.019 0.000 94.207 94.207 {built-in method builtins.exec}
1 12.080 12.080 94.207 94.207 cuda_fft_tr1_gpu.py:6(<module>)
12469 0.046 0.000 51.752 0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
12469 51.679 0.004 51.679 0.004 {built-in method numpy.core.multiarray.concatenate}
62345/49876 0.131 0.000 22.880 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devices.py:209(_require_cuda_context)
12469 0.111 0.000 20.875 0.002 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:190(fft)
49876 17.129 0.000 17.171 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\utils\libutils.py:40(wrapped)
12469 0.176 0.000 15.354 0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:38(__init__)
12469 0.284 0.000 15.046 0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\binding.py:207(many)
37407 0.151 0.000 7.634 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:451(auto_device)
24938 0.108 0.000 4.707 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:422(from_array_like)
12469 0.043 0.000 4.644 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:134(forward)
24938 0.356 0.000 4.599 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:58(__init__)
87290 4.156 0.000 4.398 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:284(safe_cuda_api_call)
12469 0.035 0.000 4.292 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:26(to_device)
12469 0.064 0.000 3.468 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:86(_prepare)
24938 0.027 0.000 3.404 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:275(_auto_device)
24938 0.114 0.000 2.452 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:139(copy_to_device)
24938 0.080 0.000 2.157 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:1573(host_to_device)
24938 0.261 0.000 1.848 0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:667(memalloc)
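The pyculib fft function takes about 20 seconds, whereas the numpy fft takes about 0.6 seconds. Why does the pyculib function take so long? Is there a way to improve the code to reduce this time? Or would it be better to use a different library?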
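In the GPU profile, pyculib's api.py __init__ is called 12469 times, once per frame, and accounts for about 15 of the 20 seconds, which suggests that per-call setup rather than the transform itself dominates. One way to avoid per-frame calls is to gather all windows into a single 2D array and transform every row in one call. The sketch below shows that layout with numpy only; spectrogram_batched is a made-up name, and I have not checked how pyculib would handle such a batch, so treat the GPU side as an assumption.

```python
import numpy as np

def spectrogram_batched(y, winsize=512, shift=16):
    """Log-power spectrogram computed with a single batched FFT call."""
    half = winsize // 2
    n_frames = max(0, (len(y) - winsize) // shift + 1)
    # Row i holds the sample indices of window i: start_i + [0 .. winsize-1].
    starts = shift * np.arange(n_frames)
    idx = starts[:, None] + np.arange(winsize)[None, :]
    # Gather all windows at once and apply the Blackman window to each row.
    frames = y[idx] * np.blackman(winsize)
    # One FFT call over all rows instead of one call per frame.
    spec = np.fft.fft(frames, n=winsize, axis=1)
    return np.log10(np.abs(spec[:, :half]) ** 2)

# Same test signal as in the programs above.
Fs, f, sample = 8000, 1000, 200000
y = np.sin(2 * np.pi * f * np.arange(sample) / Fs)
fftdata = spectrogram_batched(y)
print(fftdata.shape)  # (12469, 256)
```

The trade-off is memory: the gathered array holds every overlapping window explicitly, so it is winsize/shift = 32 times larger than the input signal. With that layout, a GPU library that supports batched transforms could in principle take the whole array in one transfer and one call.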