Creating an array on the GPU in Python using numba and CUDA

Asked: 2019-03-04 17:54:07

Tags: python cuda gpu numba

I want to evaluate a function at every point of a mesh. The problem is that if I create the mesh on the CPU side, transferring it to the GPU takes longer than the actual computation. Can I generate the mesh on the GPU side instead?

The code below shows the creation of the mesh on the CPU side and the evaluation of most of the expression on the GPU side (I wasn't sure how to get atan2 working on the GPU, so I left it on the CPU side). I should apologize in advance and say that I'm still learning this material, so I'm sure there is plenty of room for improvement in the code below!

Thanks!

import math
from numba import vectorize, float64
import numpy as np
from time import time

@vectorize([float64(float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2):
    return  (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)

def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    a = a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1), 
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2))
    return earthdiam_nm * np.arctan2(a,1-a)

# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))

start = time()
LLA_distance_numba_cuda(X,Y,X2,Y2)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))

1 Answer:

Answer 0 (score: 2)

Let's start by establishing a performance baseline. Adding a definition for earthdiam_nm (1.0) and running your code under nvprof, we have:

$ nvprof python t38.py
1000000 total evaluations in 0.581 seconds
(...)
==1973== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.58%  11.418ms         4  2.8544ms  2.6974ms  3.3044ms  [CUDA memcpy HtoD]
                   28.59%  5.8727ms         1  5.8727ms  5.8727ms  5.8727ms  cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
                   15.83%  3.2521ms         1  3.2521ms  3.2521ms  3.2521ms  [CUDA memcpy DtoH]
(...)

So in my particular setup, the "kernel" itself runs in about 5.8 ms on my (small, slow) Quadro K2000 GPU, the 4 host-to-device copies of the data take about 11.4 ms in total, and it takes another 3.2 ms to transfer the result back to the host. The item to focus on is the 4 host-to-device copies.

Let's go after the low-hanging fruit first. This line of code:

X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))

does essentially nothing except pass the values 30 and 101 to each "worker". I'm using "worker" here to refer to the idea of a specific scalar computation within the numba process of "broadcasting" the vectorize function across a large dataset. The numba vectorize/broadcast process does not require that every input be a dataset of the same size; it only requires that the supplied data be "broadcastable". It is therefore possible to create a vectorize ufunc that operates on, say, an array and a scalar, which means each worker performs its computation using its own array element together with the scalar.
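
As a side illustration (a minimal example of my own, not from the original answer), here is a plain CPU-target vectorize ufunc broadcasting an array against a scalar; each worker receives one element of arr together with the same scalar s:

import numpy as np
from numba import vectorize, float64

# Illustrative only: default CPU target, but the same broadcasting rule applies to target='cuda'
@vectorize([float64(float64, float64)])
def scale(x, s):
    return x * s

arr = np.arange(4, dtype=np.float64)
print(scale(arr, 10.0))   # [ 0. 10. 20. 30.] -- the scalar 10.0 is broadcast to every element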

So the low-hanging fruit here is simply to eliminate those two arrays and pass the values (30, 101) to the ufunc a_cuda as scalars. While we're picking low-hanging fruit, let's also move your arctan2 computation (replacing it with math.atan2) and the final scaling by earthdiam_nm into the vectorized code, so that we don't have to do them on the host in python/numpy:

$ cat t39.py
import math
from numba import vectorize, float64
import numpy as np
from time import time
earthdiam_nm = 1.0
@vectorize([float64(float64,float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s

def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    return a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1),
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2), earthdiam_nm)

# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
# X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
start = time()
Z=LLA_distance_numba_cuda(X,Y,30.0,101.0)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Z)
$ nvprof python t39.py
==2387== NVPROF is profiling process 2387, command: python t39.py
1000000 total evaluations in 0.401 seconds
==2387== Profiling application: python t39.py
==2387== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   48.12%  8.4679ms         1  8.4679ms  8.4679ms  8.4679ms  cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
                   33.97%  5.9774ms         5  1.1955ms     864ns  3.2535ms  [CUDA memcpy HtoD]
                   17.91%  3.1511ms         4  787.77us  1.1840us  3.1459ms  [CUDA memcpy DtoH]
(snip)

Now we see that the HtoD copy operations have been reduced from 11.4 ms total to about 6 ms total. The kernel has grown from ~5.8 ms to ~8.5 ms, because we're doing more work in it, but the function execution time reported by python has dropped from ~0.58 s to ~0.4 s.

Can we do better?

We can, but in order to do so (I believe) we'll need to use a different numba CUDA approach. The vectorize method is convenient for scalar element-wise operations, but it has no way of knowing where in the overall dataset a given operation is being performed. We need that information, and we can get it in CUDA code, but we'll have to switch to the @cuda.jit decorator.
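
For reference (my own sketch, not part of the original answer): inside a @cuda.jit kernel, cuda.grid(2) returns the absolute (x, y) position of the current thread in the launch grid, i.e. blockIdx * blockDim + threadIdx in each dimension, which is exactly the "where am I in the dataset" information that vectorize does not expose:

from numba import cuda

@cuda.jit
def fill_indices(out):
    # cuda.grid(2) == (blockIdx.x*blockDim.x + threadIdx.x, blockIdx.y*blockDim.y + threadIdx.y)
    x, y = cuda.grid(2)
    if y < out.shape[0] and x < out.shape[1]:   # guard against threads that fall outside the array
        out[y, x] = x + y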

The following code converts the previous vectorize a_cuda function into a @cuda.jit device function (with essentially no other changes), and then creates a CUDA kernel that generates the mesh from the supplied scalar parameters and computes the result:

$ cat t40.py
import math
from numba import vectorize, float64, cuda
import numpy as np
from time import time

earthdiam_nm = 1.0

@cuda.jit(device=True)
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s

@cuda.jit
def LLA_distance_numba_cuda(lat2, lon2, xb, xe, yb, ye, s, nx, ny, out):
    x,y = cuda.grid(2)
    if x < nx and y < ny:
        lat1 = (((xe-xb) * x)/(nx-1)) + xb # mesh generation
        lon1 = (((ye-yb) * y)/(ny-1)) + yb # mesh generation
        out[y][x] = a_cuda(lat1, lon1, lat2, lon2, s)

nx, ny = 1000,1000
Z = cuda.device_array((nx,ny), dtype=np.float64)
threads = (32,32)
blocks = (32,32)
start = time()
LLA_distance_numba_cuda[blocks,threads](30.0,101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)
Zh = Z.copy_to_host()
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Zh)
$ nvprof python t40.py
==2855== NVPROF is profiling process 2855, command: python t40.py
1000000 total evaluations in 0.294 seconds
==2855== Profiling application: python t40.py
==2855== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   75.60%  10.364ms         1  10.364ms  10.364ms  10.364ms  cudapy::__main__::LLA_distance_numba_cuda$241(double, double, double, double, double, double, double, __int64, __int64, Array<double, int=2, A, mutable, aligned>)
                   24.40%  3.3446ms         1  3.3446ms  3.3446ms  3.3446ms  [CUDA memcpy DtoH]
(...)

Now we see that:

  1. The kernel runs even longer, about 10 ms (because we're now doing the mesh generation inside it as well)
  2. There is no explicit copying of data from host to device
  3. The overall function runtime has dropped from ~0.4 s to ~0.3 s
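
One closing note of my own (not part of the original answer): threads=(32,32) with blocks=(32,32) launches a 1024x1024 grid of threads, which happens to cover the 1000x1000 mesh, and the in-kernel bounds check discards the excess threads. For an arbitrary mesh size you would typically derive the block count from the mesh dimensions, roughly like this:

import math

threads = (32, 32)                              # threads per block in (x, y)
blocks = (math.ceil(nx / threads[0]),           # enough blocks to cover all nx columns
          math.ceil(ny / threads[1]))           # enough blocks to cover all ny rows
LLA_distance_numba_cuda[blocks, threads](30.0, 101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)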