Important information
The first thing to start with is that I'm trying to use guvectorize with the function below. I'm passing in a bunch of numpy arrays and attempting to use them to multiply across two of the arrays. This works if run with any target other than cuda. However, switching to cuda results in an unknown error:
File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", line 82, in jitwrapper
    debug=debug)
TypeError: __init__() got an unexpected keyword argument 'debug'
After chasing down everything I could find about this error, I hit nothing but dead ends. I'm guessing it is a really simple fix that I'm completely missing, but oh well. It should also be said that this error only occurs after running the code once and having it crash due to a memory overload.
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...
All of the arrays are numpy arrays.
@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'],
             '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]
Attempting to run the code with nvprofiler on the command line results in the following error:
Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, the system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
I realized that I am using SLI-enabled graphics cards (both cards are identical, evga gtx 1080ti, and have the same device id), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to limit it to the other card, but I am left with the same results.
I can still run the code with nvprof, but the cuda function is slower than njit(parallel=True) with prange. By using a smaller data size we can run the code, but it is slower than target='parallel' and target='cpu'.
Why is cuda so much slower, and what do these errors mean?
Thanks for the help!
EDIT: Here is a working example of the code:
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
I was able to run the code in cuda using a gtx 1080ti, however, it is much slower than running it with the parallel or cpu targets. I've looked at other posts related to guvectorize, but none of them have helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code "cuda friendly", or is just doing multiplication across arrays too simple to see any benefit?
Answer 0 (score: 2)
First, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and then pass that 3rd array back to the host.
It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it is possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) doing the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is too simple.
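To make that arithmetic-intensity point concrete, here is a rough back-of-the-envelope estimate (my own illustration, using the array sizes from the working example in the question; the transfer sizes agree with the profiler traces shown further down):
# Rough arithmetic-intensity estimate: one multiply per output element versus
# the bytes that must cross the PCIe bus (float64 = 8 bytes per element).
flops = 100 * 4000 * 5                  # 2,000,000 multiplies
bytes_moved = (4000 * 5 * 8             # multBy,   host -> device (~156 KB)
               + 100 * 4000 * 5 * 8     # discount, host -> device (~15.3 MB)
               + 100 * 4000 * 5 * 8)    # cv,       device -> host (~15.3 MB)
print(flops / bytes_moved)              # roughly 0.06 floating-point ops per byte
At well under one floating-point operation per byte moved, the PCIe transfers dominate no matter how fast the kernel itself runs.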
Second, I think it is important to understand what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?", and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signature you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets its parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data but runs your entire ufunc definition on that slice. As some additional reading, I have covered some of this material here as well.
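As a concrete illustration of that principle (my own CPU-target sketch, not code from the original answer), the signature below describes what one worker does to a single row; passing 2-D inputs then makes numba spin up one worker per row:
import numpy as np
from numba import guvectorize

# The core signature '(o),(o)->(o)' is what ONE worker sees: a pair of 1-D rows.
@guvectorize(['void(float64[:], float64[:], float64[:])'], '(o),(o)->(o)',
             target='parallel')          # target='cuda' distributes workers the same way
def row_mult(a, b, out):
    for i in range(a.shape[0]):
        out[i] = a[i] * b[i]

a = np.arange(20.0).reshape(4, 5)        # 4 rows -> numba spins up 4 workers
b = np.ones((4, 5))
print(row_mult(a, b))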
So the problem with your realization is that you have written a signature (and ufunc) that maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statement about arithmetic intensity), but I suspect that 64 is really just a minimal threadblock size, and in fact only 1 thread in that threadblock is doing any useful computational work. That one thread is executing your entire ufunc, including all the for-loops, in a serial fashion.
That is obviously not what anyone would intend for rational use of vectorize or guvectorize.
So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array, store the result in a 3rd array, and repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original one, focusing on CUDA kernel execution. Here is a worked example, where t14.py is your original code run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of the multBy array so that it matches cv and discount:
$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
312.36ms 1.2160us - - - - - 8B 6.2742MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.81ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.52ms 5.8696ms - - - - - 15.259MB 2.5387GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.74ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.93ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
321.40ms 1.22538s (1 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s 7.1816ms - - - - - 15.259MB 2.0749GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
193.92ms 6.2729ms - - - - - 15.259MB 2.3755GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
201.09ms 5.7101ms - - - - - 15.259MB 2.6096GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
364.92ms 842.49us (15625 1 1) (128 1 1) 13 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms 7.1528ms - - - - - 15.259MB 2.0833GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
We see that your application reported a run time of about 1.244 seconds, whereas the vectorize version reports a run time of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the durations of the generated CUDA kernels in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute-kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds in total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be performed, and substantially more than the paltry 64 threads (and possibly, really only 1 thread) in action with your original code.
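Just to spell out the arithmetic behind that grid observation (an illustrative aside, not part of the original answer):
# nvprof reports a (15625,1,1) grid of (128,1,1) blocks for the vectorize kernel,
# which works out to exactly one CUDA thread (worker) per elementwise multiply.
blocks, threads_per_block = 15625, 128
elements = 100 * 4000 * 5
print(blocks * threads_per_block, elements)   # 2000000 2000000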
Given the simplicity of the above vectorize approach, if what you really want to do is this elementwise multiplication, then you might consider just replicating multBy so that it matches the other two arrays dimensionally, and be done with it.
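For instance (my own sketch, not code from the original answer), multBy could be materialized at the full (100, 4000, 5) shape with a broadcasted copy, after which the simple elementwise vectorize kernel from t15.py applies directly:
import numpy as np

multBy   = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# Materialize an explicit (100, 4000, 5) copy of multBy so all three arrays match.
multBy3d = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))

# cv = cVestDiscount(multBy3d, discount)   # using the vectorize version from t15.py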
But the question remains: how do we handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).
In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.
In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give multiple workers the possibility to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, and therefore a maximum of 5 workers. Likewise, if we slice along the first, or countRow dimension, we will have 100 slices, and therefore a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However, for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see the worked example further below for how you might handle the slicing-by-the-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from the guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look something like this:
$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
307.05ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
307.79ms 5.9293ms - - - - - 15.259MB 2.5131GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.34ms 1.3440us - - - - - 8B 5.6766MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.54ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
317.27ms 47.398ms (2 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms 7.3799ms - - - - - 15.259MB 2.0192GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Observations:
The code changes were associated with removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also modified the 3-dimensional arrays in the signature to be 2-dimensional. We are, after all, taking a 2-dimensional "slice" of the 3-dimensional data and letting each worker work on a slice.
The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads to provide the needed 100 workers/threads.
The kernel performance reported by the profiler, 47.4ms, is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (~0.001s). So going from 1 worker to 100 workers has sped things up considerably, but more performance gains are possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are here (~47ms) and the vectorize version (~1ms) is enough to pay for the cost of transferring a slightly larger multBy matrix to the device, in order to gain the vectorize simplicity.
Some other comments regarding the python timings: I believe the exact behavior of how python compiles the necessary kernels for the original, vectorize, and improved guvectorize versions is different. If we modify the t15.py code to do a "warm-up" run, then at least the python timing is consistent, trend-wise, with the overall wall time and with the kernel-only timing:
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839
real 0m2.522s
user 0m1.572s
sys 0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091
real 0m1.050s
user 0m0.473s
sys 0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283
real 0m1.252s
user 0m0.680s
sys 0m0.441s
$
Now, to respond to a question that came up in the comments: "How would I recast the problem to slice along the 4000 (countCol, or the 'middle') dimension?"
We can take some guidance from the approach that sliced along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, and then remove it, similar to the previous treatment of countRow. Here is a worked example:
$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[num] * discount[ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
304.92ms 1.1840us - - - - - 8B 6.4437MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
305.36ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
306.08ms 6.0208ms - - - - - 15.259MB 2.4749GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.44ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.59ms 8.9961ms (63 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms 7.2772ms - - - - - 15.259MB 2.0476GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Predictably, we observe that the execution time drops from ~47ms when sliced into 100 workers to ~9ms when sliced into 4000 workers. Likewise, we observe that numba chose to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing". This is still not as fast as the ~1ms vectorize kernel (which has many more parallel "slices" available for its workers), but it is quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.
As a final observation, let's revisit the statement I made earlier (which is similar to statements made in the comments and in the other answer):
"I doubt it is possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) doing the same thing."
We now have convenient test cases in t16.py and t17.py that we can use to test this. For simplicity, I will choose t16.py. We can "convert it back to CPU code" simply by removing the target designation from the guvectorize ufunc:
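The CPU-only listing itself did not survive in this copy of the answer; a minimal reconstruction from t16.py (identical except that target='cuda' is dropped from the decorator; treat it as a sketch rather than the answer's exact file) would be:
import numpy as np
from numba import guvectorize
from timeit import default_timer as timer

# Same kernel as t16.py, but compiled for the default CPU target.
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'],
             '(m,o),(m,o),(),() -> (m,o)', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))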
So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization and copying of data to/from the GPU. The overall walltime is also our best measurement, at about 0.5s versus about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, is probably not well suited to GPU computation.
Answer 1 (score: 1)
The reason why the gufunc Numba emits and runs is so slow becomes immediately apparent when profiling (numba 0.38.1 with CUDA 8.0):
==24691== Profiling application: python slowvec.py
==24691== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
271.33ms 1.2800us - - - - - 8B 5.9605MB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
271.65ms 14.591us - - - - - 156.25KB 10.213GB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
272.09ms 2.5868ms - - - - - 15.259MB 5.7605GB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
274.98ms 992ns - - - - - 8B 7.6909MB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
275.17ms 640ns - - - - - 8B 11.921MB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
276.33ms 657.28ms (1 1 1) (64 1 1) 40 0B 0B - - GeForce GTX 970 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms 3.5128ms - - - - - 15.259MB 4.2419GB/s GeForce GTX 970 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
The final kernel launch that runs the code uses a single block of 64 threads. On a GPU that can theoretically hold up to 2048 threads per MP, and has 23 MPs, that means about 99.9% of the theoretical processing capacity of your GPU is not being used. This looks like a ridiculous design choice by the numba developers, and I would report it as a bug if you are being hindered by it (and it seems that you are).
The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect and take explicit control of the execution parameters. That way you can at least ensure that the code runs with enough threads to potentially use all of the capacity of your hardware. It is still a very memory-bound operation, so what you can achieve in speed-up will probably be limited to something considerably less than the ratio of GPU-to-CPU memory bandwidth. And that might well not be enough to amortize the cost of the host-to-device memory transfers, so there may be no performance gains possible even in the best case, and this is far from the best case.
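As an illustration of what such a rewrite could look like (my own sketch along those lines, not code from this answer), a @cuda.jit kernel with an explicitly chosen launch configuration might be:
import numpy as np
from numba import cuda

# One CUDA thread per output element, with the grid chosen explicitly.
@cuda.jit
def cVestDiscount_kernel(multBy, discount, cv):
    i, j, k = cuda.grid(3)                  # global thread indices
    if i < cv.shape[0] and j < cv.shape[1] and k < cv.shape[2]:
        cv[i, j, k] = multBy[j, k] * discount[i, j, k]

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))
cv = np.zeros_like(discount)

threads = (4, 16, 4)                        # 256 threads per block
blocks = tuple((cv.shape[d] + threads[d] - 1) // threads[d] for d in range(3))
cVestDiscount_kernel[blocks, threads](multBy, discount, cv)   # cv is copied back filled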
In short, beware the perils of automatic compiler-generated parallelism....
As a postscript, I managed to work out how to get the PTX of the code numba emits, and suffice to say, it is absolutely horrendous (and so long that I can't actually post all of it):
{
.reg .pred %p<9>;
.reg .b32 %r<8>;
.reg .f64 %fd<4>;
.reg .b64 %rd<137>;
ld.param.u64 %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
ld.param.u64 %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
ld.param.u64 %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
ld.param.u64 %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
ld.param.u64 %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
ld.param.u64 %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
ld.param.u64 %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
ld.param.u64 %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
ld.param.u64 %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
ld.param.u64 %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
ld.param.u64 %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
ld.param.u64 %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
ld.param.u64 %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
ld.param.u64 %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
ld.param.u64 %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
ld.param.u64 %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
ld.param.u64 %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
ld.param.u64 %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
ld.param.u64 %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
ld.param.u64 %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
ld.param.u64 %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
ld.param.u64 %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
ld.param.u64 %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
ld.param.u64 %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
ld.param.u64 %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
ld.param.u64 %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
ld.param.u64 %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
mov.u32 %r1, %tid.x;
mov.u32 %r3, %ctaid.x;
mov.u32 %r2, %ntid.x;
mad.lo.s32 %r4, %r3, %r2, %r1;
min.s64 %rd62, %rd32, %rd29;
min.s64 %rd63, %rd39, %rd62;
min.s64 %rd64, %rd48, %rd63;
min.s64 %rd65, %rd51, %rd64;
min.s64 %rd66, %rd54, %rd65;
cvt.s64.s32 %rd1, %r4;
setp.le.s64 %p2, %rd66, %rd1;
@%p2 bra BB0_8;
ld.param.u64 %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
ld.param.u64 %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
ld.param.u64 %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
ld.param.u64 %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
ld.param.u64 %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
ld.param.u64 %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
cvt.u32.u64 %r5, %rd1;
setp.lt.s32 %p1, %r5, 0;
selp.b64 %rd67, %rd29, 0, %p1;
add.s64 %rd68, %rd67, %rd1;
mul.lo.s64 %rd69, %rd68, %rd121;
add.s64 %rd70, %rd69, %rd122;
selp.b64 %rd71, %rd48, 0, %p1;
add.s64 %rd72, %rd71, %rd1;
mul.lo.s64 %rd73, %rd72, %rd123;
add.s64 %rd74, %rd73, %rd124;
ld.u64 %rd2, [%rd74];
selp.b64 %rd75, %rd51, 0, %p1;
add.s64 %rd76, %rd75, %rd1;
mul.lo.s64 %rd77, %rd76, %rd125;
add.s64 %rd78, %rd77, %rd126;
ld.u64 %rd3, [%rd78];
ld.u64 %rd4, [%rd70];
setp.lt.s64 %p3, %rd4, 1;
@%p3 bra BB0_8;
ld.param.u64 %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
ld.param.u64 %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
selp.b64 %rd80, %rd127, 0, %p1;
mov.u64 %rd79, 0;
min.s64 %rd81, %rd128, %rd79;
min.s64 %rd82, %rd34, %rd79;
selp.b64 %rd83, %rd39, 0, %p1;
min.s64 %rd84, %rd40, %rd79;
min.s64 %rd85, %rd41, %rd79;
min.s64 %rd86, %rd42, %rd79;
selp.b64 %rd87, %rd54, 0, %p1;
min.s64 %rd88, %rd55, %rd79;
min.s64 %rd89, %rd56, %rd79;
min.s64 %rd90, %rd57, %rd79;
mul.lo.s64 %rd91, %rd90, %rd61;
add.s64 %rd92, %rd53, %rd91;
mul.lo.s64 %rd93, %rd89, %rd60;
add.s64 %rd94, %rd92, %rd93;
mul.lo.s64 %rd95, %rd88, %rd59;
add.s64 %rd96, %rd94, %rd95;
add.s64 %rd98, %rd87, %rd1;
mul.lo.s64 %rd99, %rd58, %rd98;
add.s64 %rd5, %rd96, %rd99;
mul.lo.s64 %rd100, %rd86, %rd46;
add.s64 %rd101, %rd38, %rd100;
mul.lo.s64 %rd102, %rd85, %rd45;
add.s64 %rd103, %rd101, %rd102;
mul.lo.s64 %rd104, %rd84, %rd44;
add.s64 %rd105, %rd103, %rd104;
add.s64 %rd106, %rd83, %rd1;
mul.lo.s64 %rd107, %rd43, %rd106;
add.s64 %rd6, %rd105, %rd107;
mul.lo.s64 %rd108, %rd82, %rd37;
add.s64 %rd109, %rd31, %rd108;
mul.lo.s64 %rd110, %rd81, %rd36;
add.s64 %rd111, %rd109, %rd110;
add.s64 %rd112, %rd80, %rd1;
mul.lo.s64 %rd113, %rd35, %rd112;
add.s64 %rd7, %rd111, %rd113;
add.s64 %rd8, %rd2, 1;
mov.u64 %rd131, %rd79;
BB0_3:
mul.lo.s64 %rd115, %rd59, %rd131;
add.s64 %rd10, %rd5, %rd115;
mul.lo.s64 %rd116, %rd44, %rd131;
add.s64 %rd11, %rd6, %rd116;
setp.lt.s64 %p4, %rd3, 1;
mov.u64 %rd130, %rd79;
mov.u64 %rd132, %rd3;
@%p4 bra BB0_7;
BB0_4:
mov.u64 %rd13, %rd132;
mov.u64 %rd12, %rd130;
mul.lo.s64 %rd117, %rd60, %rd12;
add.s64 %rd136, %rd10, %rd117;
mul.lo.s64 %rd118, %rd45, %rd12;
add.s64 %rd135, %rd11, %rd118;
mul.lo.s64 %rd119, %rd36, %rd12;
add.s64 %rd134, %rd7, %rd119;
setp.lt.s64 %p5, %rd2, 1;
mov.u64 %rd133, %rd8;
@%p5 bra BB0_6;
BB0_5:
mov.u64 %rd17, %rd133;
ld.f64 %fd1, [%rd135];
ld.f64 %fd2, [%rd134];
mul.f64 %fd3, %fd2, %fd1;
st.f64 [%rd136], %fd3;
add.s64 %rd136, %rd136, %rd61;
add.s64 %rd135, %rd135, %rd46;
add.s64 %rd134, %rd134, %rd37;
add.s64 %rd24, %rd17, -1;
setp.gt.s64 %p6, %rd24, 1;
mov.u64 %rd133, %rd24;
@%p6 bra BB0_5;
BB0_6:
add.s64 %rd25, %rd13, -1;
add.s64 %rd26, %rd12, 1;
setp.gt.s64 %p7, %rd13, 1;
mov.u64 %rd130, %rd26;
mov.u64 %rd132, %rd25;
@%p7 bra BB0_4;
BB0_7:
sub.s64 %rd120, %rd4, %rd131;
add.s64 %rd131, %rd131, 1;
setp.gt.s64 %p8, %rd120, 1;
@%p8 bra BB0_3;
BB0_8:
ret;
}
All of that integer arithmetic to perform exactly one double-precision multiplication!