Speeding up an FFT on the GPU with pyculib

Time: 2017-10-20 05:40:36

Tags: python performance gpu numba cufft

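I am a beginner trying to learn how to use the GPU for high-speed computation, and I am trying to implement a simple FFT program on it. Below is the program I use to compute the FFT on the CPU.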

from time import time as timer
import numpy as np
import matplotlib.pyplot as plt
winsize=512
shift=16
my_cmap='gray_r'
Fs = 8000
f = 1000
sample =200000
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)

data_len=len(y)
window_func=np.blackman(winsize)
fftdata=np.zeros((0,int(winsize/2)))

startime=timer()
for frame in range(0, data_len, shift):
#==============================================================================
#     if frame>0:
#         break
#==============================================================================
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe=np.fft.fft(windata,n=winsize)
    magframe=np.abs(fftframe)**2
    powerframe=np.log10(magframe)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata=np.append(fftdata,powerframe,axis=0)
endtime=timer()-startime
fftdata=np.asarray(fftdata)
fftfrq=np.fft.fftfreq(winsize,d=1/Fs)[:int(winsize/2)]
print("CPU runtime:",endtime,"sec")

The figure below shows the resulting spectrogram when plotted with imshow():

[Image: spectrogram from the CPU implementation]

The timing output is:

 CPU runtime: 65.02100014686584 sec

I then rewrote the program above to run on my PC's GPU, a Quadro K2200, using the pyculib and numba packages provided by Anaconda.

from time import time as timer
import numpy as np
import pyculib.fft
from numba import cuda
import matplotlib.pyplot as plt
winsize=512
shift=16
Fs = 8000
f = 1000
sample =200000
t = np.arange(sample,dtype=np.float64)
y = np.sin(2 * np.pi * f * t / Fs)
my_cmap='gray_r'
data_len=len(y)
window_func=np.blackman(winsize)
fftdata_gpu=np.zeros((0,int(winsize/2)))

startime=timer()
for frame in range(0, data_len, shift):
# =============================================================================
#     if frame>0:
#          break
# =============================================================================
    kiri=y[frame:frame+winsize]
    if len(kiri) != winsize:
        break
    windata = window_func * kiri
    fftframe_gpu = np.zeros(winsize, np.complex128)
    d_xf_gpu = cuda.to_device(fftframe_gpu)
    pyculib.fft.fft(windata.astype(np.complex128),d_xf_gpu)
    d_xf_gpu.copy_to_host(fftframe_gpu)
    magframe_gpu=np.abs(fftframe_gpu)**2
    powerframe_gpu=np.log10(magframe_gpu)[:int(winsize/2)].reshape((1,int(winsize/2)))
    fftdata_gpu=np.append(fftdata_gpu,powerframe_gpu,axis=0)
endtime=timer()-startime
fftdata_gpu=np.asarray(fftdata_gpu)
print("GPU runtime:",endtime,"sec")

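The timing output for this program shows that the GPU implementation actually takes about 28 seconds longer than the CPU version (roughly 93 s versus 65 s).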

GPU runtime: 92.87200021743774 sec

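My guess is that this happens because I copy an array to the device and fetch the result back for every frame. Is there a better way to implement this? I would really appreciate any comments on what I am doing wrong here.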

Below is the output of the GPU implementation.

[Image: spectrogram from the GPU implementation]

Edit: added profiling results

For the CPU code: the top 20 function calls, sorted by cumulative time

   569255 function calls (563691 primitive calls) in 64.792 seconds

   Ordered by: cumulative time
   List reduced from 2594 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    296/1    0.009    0.000   64.793   64.793 {built-in method builtins.exec}
        1   11.489   11.489   64.793   64.793 cuda_fft_tr1_cpu.py:6(<module>)
    12469    0.037    0.000   52.145    0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
    12469   52.093    0.004   52.093    0.004 {built-in method numpy.core.multiarray.concatenate}
    12469    0.073    0.000    0.622    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:102(fft)
    333/2    0.002    0.000    0.493    0.247 <frozen importlib._bootstrap>:966(_find_and_load)
    333/2    0.001    0.000    0.493    0.246 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
    323/3    0.002    0.000    0.491    0.164 <frozen importlib._bootstrap>:651(_load_unlocked)
    272/3    0.001    0.000    0.491    0.164 <frozen importlib._bootstrap_external>:672(exec_module)
    432/3    0.000    0.000    0.490    0.163 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
    12469    0.073    0.000    0.415    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\fft\fftpack.py:47(_raw_fft)
   356/24    0.000    0.000    0.336    0.014 {built-in method builtins.__import__}
        1    0.000    0.000    0.232    0.232 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\pyplot.py:17(<module>)
 1445/628    0.001    0.000    0.220    0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
        1    0.000    0.000    0.141    0.141 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\__init__.py:106(<module>)
    12469    0.139    0.000    0.139    0.000 {built-in method numpy.fft.fftpack_lite.cfftf}
      328    0.003    0.000    0.122    0.000 <frozen importlib._bootstrap>:870(_find_spec)
      310    0.000    0.000    0.117    0.000 <frozen importlib._bootstrap_external>:1149(find_spec)
      310    0.002    0.000    0.117    0.000 <frozen importlib._bootstrap_external>:1117(_get_spec)
        1    0.000    0.000    0.115    0.115 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\matplotlib\__init__.py:101(<module>)
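
Reading the table above, numpy.core.multiarray.concatenate (reached through np.append) accounts for about 52 s of the 64.8 s total, while np.fft.fft itself takes about 0.6 s. np.append copies the whole accumulated array on every call, so the cost grows with the number of frames already stored. A minimal sketch of one way around that, assuming the per-frame math stays exactly as in the loop above and only the storage changes; the function name spectrogram_prealloc is made up for illustration:

```python
import numpy as np

def spectrogram_prealloc(y, winsize=512, shift=16):
    """Log-power spectrogram with the same per-frame math as the loop
    in the question, but rows are written into a preallocated array
    instead of being appended with np.append."""
    half = winsize // 2
    window_func = np.blackman(winsize)
    # Number of complete frames that fit in the signal
    # (the original loop breaks on the first incomplete frame).
    nframes = max(0, (len(y) - winsize) // shift + 1)
    fftdata = np.empty((nframes, half))
    for i in range(nframes):
        start = i * shift
        windata = window_func * y[start:start + winsize]
        fftframe = np.fft.fft(windata, n=winsize)
        # Keep only the first half of the spectrum, as in the question.
        fftdata[i] = np.log10(np.abs(fftframe[:half]) ** 2)
    return fftdata
```

With the parameters from the question (200000 samples, winsize=512, shift=16) the frame count works out to 12469, which matches the ncalls column in the profile. How much time this saves on a given machine would need to be measured.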

For the GPU code: sorted by cumulative time

5689881 function calls (5642977 primitive calls) in 94.179 seconds

   Ordered by: cumulative time
   List reduced from 4373 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    590/1    0.019    0.000   94.207   94.207 {built-in method builtins.exec}
        1   12.080   12.080   94.207   94.207 cuda_fft_tr1_gpu.py:6(<module>)
    12469    0.046    0.000   51.752    0.004 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numpy\lib\function_base.py:5095(append)
    12469   51.679    0.004   51.679    0.004 {built-in method numpy.core.multiarray.concatenate}
62345/49876    0.131    0.000   22.880    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devices.py:209(_require_cuda_context)
    12469    0.111    0.000   20.875    0.002 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:190(fft)
    49876   17.129    0.000   17.171    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\utils\libutils.py:40(wrapped)
    12469    0.176    0.000   15.354    0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:38(__init__)
    12469    0.284    0.000   15.046    0.001 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\binding.py:207(many)
    37407    0.151    0.000    7.634    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:451(auto_device)
    24938    0.108    0.000    4.707    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:422(from_array_like)
    12469    0.043    0.000    4.644    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:134(forward)
    24938    0.356    0.000    4.599    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:58(__init__)
    87290    4.156    0.000    4.398    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:284(safe_cuda_api_call)
    12469    0.035    0.000    4.292    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:26(to_device)
    12469    0.064    0.000    3.468    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\pyculib\fft\api.py:86(_prepare)
    24938    0.027    0.000    3.404    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\api.py:275(_auto_device)
    24938    0.114    0.000    2.452    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\devicearray.py:139(copy_to_device)
    24938    0.080    0.000    2.157    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:1573(host_to_device)
    24938    0.261    0.000    1.848    0.000 C:\Users\na\AppData\Local\Continuum\Anaconda3_2\lib\site-packages\numba\cuda\cudadrv\driver.py:667(memalloc)
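
pyculib's fft function takes about 20 seconds, whereas the numpy fft takes about 0.6 seconds. Why does the pyculib function take so long? Is there a way to improve the code to cut this time, or would a different library be a better choice?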
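
In the GPU profile above, an `__init__` in pyculib\fft\api.py (which calls `many` in binding.py) runs 12469 times, once per frame, for about 15 s cumulative, and to_device is also called once per frame. That suggests most of the 20 s is per-call setup and transfer rather than the 512-point transform itself. One direction worth trying is to stop working frame by frame: build all windowed frames as a single 2-D array and transform them in one call. The sketch below does this on the CPU with NumPy only, assuming the whole frame matrix fits in memory (about 51 MB of float64 for the question's parameters); spectrogram_batched is a hypothetical name.

```python
import numpy as np

def spectrogram_batched(y, winsize=512, shift=16):
    """Log-power spectrogram computed with one batched FFT call
    over a (nframes, winsize) matrix of windowed frames."""
    half = winsize // 2
    window_func = np.blackman(winsize)
    # Number of complete frames, matching the loop in the question.
    nframes = max(0, (len(y) - winsize) // shift + 1)
    # Index matrix: row i holds the sample indices of frame i.
    idx = shift * np.arange(nframes)[:, None] + np.arange(winsize)[None, :]
    frames = y[idx] * window_func            # shape (nframes, winsize)
    spectrum = np.fft.fft(frames, axis=1)    # one call transforms every row
    # Keep only the first half of each spectrum, as in the question.
    return np.log10(np.abs(spectrum[:, :half]) ** 2)
```

The same frame matrix is also the natural unit to move to the GPU in a single transfer if a batched GPU FFT is still wanted. Whether that would beat NumPy for transforms this small is something to measure rather than assume.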

0 answers:

No answers