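My ultimate goal is to speed up the computation of a matrix-vector product in Python by using a CUDA-capable GPU. The matrix A is about 15k x 15k and sparse (density about 0.05), the vector x has 15k elements and is dense, and I am computing Ax. I have to perform this computation many times, so making it as fast as possible would be ideal.
My current non-GPU "optimization" is to represent A as a scipy.sparse.csc_matrix object and then simply compute A.dot(x), but I would like to speed this up on a VM with a couple of NVIDIA GPUs attached, using only Python if possible (i.e. without writing out detailed kernel functions by hand). I have successfully accelerated dense matrix-vector products with the cudamat library, but it does not handle the sparse case. There are a handful of suggestions online for the sparse case, such as using pycuda, scikit-cuda, or anaconda's accelerate package, but there is not much information, so it is hard to know where to start.
I don't need very detailed instructions, but if someone has solved this before and can provide a "big picture" roadmap for the simplest way to do it, or has an idea of the kind of speedup a sparse GPU-based matrix-vector product would give over scipy's sparse algorithms, that would be very helpful.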
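To make the target operation concrete: a sparse matrix-vector product only touches the stored nonzeros. The sketch below is a minimal pure-Python version that works on the three CSR arrays (data, indices, indptr) exposed by scipy's csr_matrix. The function name csr_matvec is hypothetical, and the loop is for illustration only; it is far slower than A.dot(x) and is not meant as a replacement.

```python
def csr_matvec(data, indices, indptr, x):
    """Return y = A @ x for a matrix A stored in CSR form.

    data    -- nonzero values, stored row by row
    indices -- column index of each value in `data`
    indptr  -- indptr[i]:indptr[i + 1] is the slice of `data` for row i
    """
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        total = 0.0
        # Only the stored nonzeros of this row contribute to the sum.
        for k in range(indptr[row], indptr[row + 1]):
            total += data[k] * x[indices[k]]
        y[row] = total
    return y


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
indices = [0, 2, 2, 0, 1]
indptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_matvec(data, indices, indptr, x)
print(y)
```

Each output row is an independent sum, which is what makes the operation a natural fit for a GPU, provided the arrays are already on the device.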
Answer 0 (score: 2)
Answer 1 (score: 0)
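Thanks for the suggestions.
I managed to get pyculib's csrmm (matrix multiplication for compressed sparse row formatted matrices) operation working with the code below (on Google Cloud Platform, with 2 NVIDIA K80 GPUs), but unfortunately it did not give a speedup.
I assume this is because most of the time in the csrmm function is spent transferring data to and from the GPU rather than doing the actual computation. Unfortunately, I could not find any simple pyculib way to put the arrays on the GPU first and keep them there across iterations. The code I used is: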
import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time


def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""

    # args to csrmm call
    trans_a = 'N'  # non-transpose, use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0]  # num rows in a
    n = b.shape[1]  # num cols in b, c
    k = a_sparse.shape[1]  # num cols in a
    nnz = len(a_sparse.data)  # num nonzero in a
    alpha = 1  # no scaling
    descr_a = sp.matdescr(  # matrix descriptor
        indexbase=0,  # 0-based indexing
        matrixtype='G',  # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data  # csr data
    csr_row_ptr_a = a_sparse.indptr  # csr row pointers
    csr_col_ind_a = a_sparse.indices  # csr col idxs
    ldb = b.shape[0]  # leading dimension of b
    beta = 0  # ignore the initial contents of c
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = m  # leading dimension of c (was b.shape[0], which only matches when a is square)

    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()

    return c, toc - tic


# run benchmark
COUNT = 20
N = 5000
P = 0.1

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0  # keep roughly a fraction P of the entries
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(np.float32)
sp = Sparse()

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))

# pyculib sparse
c, t = spmv_cuda(a_sparse, b, sp, COUNT)

print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))
which produces the output:
Constructing objects...
scipy sparse matrix multiplication took 0.05158638954162598 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Testing pyculib sparse matrix multiplication...
pyculib sparse matrix multiplication took 0.12598299980163574 seconds
c = [ 122.29483032 127.83659363 128.75003052 130.6912384 124.98326111]
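As you can see, pyculib is more than twice as slow, even though the matrix multiplication runs on the GPU. Again, this is probably because of the overhead of transferring data to and from the GPU on every iteration.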
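A rough back-of-envelope check of that hypothesis, using the benchmark parameters from the script above. The bandwidth figure is an assumption chosen for illustration, not a measurement of this VM, and the estimate only applies if the CSR arrays really are copied to the device on every csrmm call.

```python
# Benchmark parameters taken from the pyculib script above.
N = 5000          # matrix is N x N
P = 0.1           # fraction of nonzero entries
COUNT = 20        # number of csrmm calls
VALUE_BYTES = 4   # float32 values
INDEX_BYTES = 4   # int32 indices (scipy's choice for a matrix of this size)

# ASSUMPTION: effective host <-> device bandwidth in bytes/s; not measured.
ASSUMED_BANDWIDTH = 6e9

nnz = round(N * N * P)
bytes_per_call = (
    nnz * VALUE_BYTES          # csrValA
    + nnz * INDEX_BYTES        # csrColIndA
    + (N + 1) * INDEX_BYTES    # csrRowPtrA
    + 2 * N * VALUE_BYTES      # B copied in, C copied out
)
transfer_seconds = COUNT * bytes_per_call / ASSUMED_BANDWIDTH

print('about {:.1f} MB moved per call'.format(bytes_per_call / 1e6))
print('about {:.3f} s of transfer over {} calls'.format(transfer_seconds, COUNT))
```

Under these assumptions the copies alone would account for a large share of the measured 0.126 seconds, which is consistent with the explanation above but does not prove it.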
However, another solution I found was to use Andreas Kloeckner's pycuda library, which gave roughly a 50x speedup!
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time


def spmv_cuda(a_sparse, b, count):
    dtype = a_sparse.dtype
    m = a_sparse.shape[0]

    print('moving objects to GPU...')

    # the matrix and vectors are uploaded once and stay on the GPU
    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)

    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)

    print('executing spmv operation...\n')

    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    # NOTE: kernel launches are asynchronous, so without a call such as
    # drv.Context.synchronize() here the timer may stop before the GPU is done
    toc = time()

    return d_c.get(), toc - tic


# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0  # keep roughly a fraction P of the entries
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(DTYPE)

# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()

print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))
which produces this output:
numpy dense matrix multiplication took 0.2290663719177246 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
scipy sparse matrix multiplication took 0.24468040466308594 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
moving objects to GPU...
executing spmv operation...
pycuda sparse matrix multiplication took 0.004545450210571289 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
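One caveat when reading GPU timings like the one above: CUDA kernel launches generally return before the work has finished, so a wall-clock measurement can stop early unless the device is synchronized first. The helper below is a sketch of that idea. The names time_loop and synchronize are hypothetical and not part of pycuda; the example call runs on the CPU only.

```python
from time import perf_counter


def time_loop(step, count, synchronize=None):
    """Time `count` calls of `step`, optionally waiting for pending device work.

    synchronize -- hypothetical hook: a zero-argument callable that blocks
                   until the device is idle, or None for CPU-only code.
    """
    if synchronize is not None:
        synchronize()  # drain earlier work so it is not billed to this loop
    tic = perf_counter()
    for _ in range(count):
        step()
    if synchronize is not None:
        synchronize()  # wait for queued kernels before stopping the clock
    return perf_counter() - tic


# CPU-only usage: no synchronization hook is needed.
calls = []
elapsed = time_loop(lambda: calls.append(1), count=100)
print('{} calls took {:.6f} seconds'.format(len(calls), elapsed))
```

With pycuda, pycuda.driver.Context.synchronize could be passed as the hook. Whether doing so changes the figure reported above was not measured here.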
Note 1: pycuda requires the following dependencies:
pip install pymetis
sudo apt install nvidia-cuda-toolkit
Note 2: for some reason, pip install pycuda does not install the file pkt_build_cython.pyx, so you need to download or copy it yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.
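Answer 2 (score: 0)
Another solution is to use tensorflow's matrix multiplication functions. Once GPU-enabled tensorflow is up and running, these work out of the box.
After installing CUDA and tensorflow-gpu (which involves some lengthy but straightforward tutorials, here and here), you can use tensorflow's SparseTensor class and sparse_tensor_dense_matmul function as follows: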
import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from time import time
Make sure the GPU is detected:
gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
print('GPU DEVICES:\n {}'.format(gpus))
Output:
GPU DEVICES:
['/device:GPU:0']
Benchmark:
from scipy.sparse import csr_matrix
ITERS = 30
N = 20000
P = 0.1 # matrix density
Using scipy:
np.random.seed(0)
a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N)
tic = time()
for ii in range(ITERS):
    c = a_sparse.dot(b)
toc = time()
elapsed = toc - tic
print('Scipy spmv product took {} seconds per iteration.'.format(elapsed/ITERS))
Output:
Scipy spmv product took 0.06693172454833984 seconds per iteration.
Using GPU-enabled tensorflow:
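with tf.device('/device:GPU:0'):
    np.random.seed(0)
    a_dense = np.random.rand(N, N)
    a_dense[a_dense > P] = 0

    # build the sparse tensor from the (row, col) coordinates of the nonzeros
    indices = np.transpose(a_dense.nonzero())
    values = a_dense[indices[:, 0], indices[:, 1]]
    dense_shape = a_dense.shape

    a_sparse = tf.SparseTensor(indices, values, dense_shape)
    b = tf.constant(np.random.rand(N, 1))

    tic = time()
    for ii in range(ITERS):
        # NOTE: without eager execution (the TF 1.x default) this call only adds
        # an op to the graph; nothing is computed until it is run in a tf.Session,
        # so this timing may not measure the GPU product itself
        c = tf.sparse_tensor_dense_matmul(a_sparse, b)
    toc = time()

elapsed = toc - tic
print('GPU spmv product took {} seconds per iteration.'.format(elapsed/ITERS))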
Output:
GPU spmv product took 0.0011811971664428711 seconds per iteration.
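As it turns out, the speedup is quite good.
Answer 3 (score: 0)
Another alternative is to use the CuPy package. It has the same interface as numpy/scipy (which is really nice) and, at least for me, it was much easier to install than pycuda.
The new code would look something like this: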
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu
A = some_sparse_matrix #(scipy.sparse.csr_matrix)
x = some_dense_vector #(numpy.ndarray)
A_gpu = csr_gpu(A) #moving A to the gpu
x_gpu = cp.array(x) #moving x to the gpu
for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu) #back to numpy object for fast indexing